Abstract
The rapid growth of data volume causes problems such as storage limitations and rising data management costs. Distributed file systems (DFSs) are widely used to store and manage massive data, and data deduplication schemes are being extensively studied to reduce storage volume. Deduplication increases available storage capacity by eliminating duplicated data, but the deduplication process incurs performance overhead such as additional disk I/O. In this paper, we propose a content-based chunk placement scheme that increases the deduplication rate on a DFS. To avoid the performance overhead of the deduplication process, we run lessfs on each chunk server, so that deduplication is performed in a decentralized manner on each chunk server. Moreover, we use consistent hashing for chunk allocation and failure recovery. Our experimental results show that the proposed system reduces storage space by 60% compared with the system without consistent hashing.
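The idea of content-based placement with consistent hashing can be illustrated with a minimal sketch. It is not the authors' implementation: the SHA-1 fingerprint, the `HashRing` class, the virtual-node count, and the server names are all assumptions made for illustration. The point it shows is that a chunk's location depends only on its content, so identical chunks always reach the same chunk server, where a local deduplicating file system such as lessfs can eliminate them.

```python
import bisect
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of a chunk (SHA-1 is an assumption here)."""
    return hashlib.sha1(chunk).hexdigest()


class HashRing:
    """Hypothetical consistent-hash ring mapping chunks to chunk servers."""

    def __init__(self, servers, vnodes=64):
        self.vnodes = vnodes   # virtual nodes per server, for balance
        self._points = []      # sorted ring positions
        self._owners = {}      # ring position -> server name
        for server in servers:
            self.add(server)

    @staticmethod
    def _point(key: str) -> int:
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def add(self, server: str) -> None:
        """Place the server's virtual nodes on the ring."""
        for i in range(self.vnodes):
            point = self._point(f"{server}#{i}")
            self._owners[point] = server
            bisect.insort(self._points, point)

    def remove(self, server: str) -> None:
        """Drop a failed server; only its chunks move to other servers."""
        self._points = [p for p in self._points if self._owners[p] != server]
        self._owners = {p: s for p, s in self._owners.items() if s != server}

    def locate(self, chunk: bytes) -> str:
        """Return the server owning the first ring position after the chunk."""
        if not self._points:
            raise ValueError("ring has no servers")
        point = int(fingerprint(chunk), 16)
        idx = bisect.bisect_right(self._points, point) % len(self._points)
        return self._owners[self._points[idx]]


ring = HashRing(["cs1", "cs2", "cs3"])
# Two files sharing a chunk send that chunk to the same server.
print(ring.locate(b"shared block") == ring.locate(b"shared block"))
```

When a server is removed, only the chunks it owned are reassigned; every other chunk keeps its server. This is the property that makes consistent hashing suitable for both chunk allocation and failure recovery.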
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kim, K., Kim, J., Min, C., Eom, Y.I. (2013). Content-Based Chunk Placement Scheme for Decentralized Deduplication on Distributed File Systems. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2013. ICCSA 2013. Lecture Notes in Computer Science, vol 7971. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39637-3_14
DOI: https://doi.org/10.1007/978-3-642-39637-3_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39636-6
Online ISBN: 978-3-642-39637-3
eBook Packages: Computer Science, Computer Science (R0)