Content-Based Chunk Placement Scheme for Decentralized Deduplication on Distributed File Systems

Kim, Keonwoo; Kim, Jeehong; Min, Changwoo; Eom, Young Ik

doi:10.1007/978-3-642-39637-3_14

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7971)

Included in the following conference series:

International Conference on Computational Science and Its Applications

Abstract

The rapid growth of data volumes causes problems such as storage shortages and rising data management costs. Distributed file systems (DFSs) are widely used to store and manage massive data, and data deduplication schemes are being studied extensively as a way to reduce storage volume. Deduplication increases the available storage capacity by eliminating duplicated data, but the deduplication process incurs performance overhead such as additional disk I/O. In this paper, we propose a content-based chunk placement scheme that increases the deduplication rate on a DFS. To avoid the performance overhead caused by the deduplication process, we use lessfs on each chunk server, so that each chunk server performs deduplication locally in a decentralized manner. Moreover, we use consistent hashing for chunk allocation and failure recovery. Our experimental results show that the proposed system reduces storage space usage by 60% compared with a system without consistent hashing.
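The idea of content-based placement with consistent hashing can be illustrated with a minimal sketch. It is not the authors' implementation: the SHA-1 fingerprint, the fixed chunk size, the number of virtual nodes, the server names, and the in-memory sets standing in for each chunk server's local lessfs store are all assumptions made for illustration. The point it shows is that when a chunk's position on the ring is derived from its content, identical chunks always map to the same chunk server, where local deduplication can remove them.

```python
import bisect
import hashlib


def _hash(data: bytes) -> int:
    """Map bytes to a ring position (SHA-1 is an assumption, not from the paper)."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, servers, vnodes=64):
        self._vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        # Each server owns several points so that load spreads more evenly.
        for i in range(self._vnodes):
            point = _hash(f"{server}#{i}".encode())
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove(self, server):
        # Only the removed server's points disappear; all others stay put.
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def locate(self, chunk: bytes) -> str:
        # The owner is the first ring point clockwise from the content hash.
        i = bisect.bisect_right(self._points, _hash(chunk)) % len(self._points)
        return self._owner[self._points[i]]


def place(data: bytes, ring: HashRing, chunk_size: int = 4):
    """Split data into fixed-size chunks and route each one by its content.

    Each server's store is modeled as a set, a stand-in for local
    deduplication: adding a chunk that is already present stores nothing new.
    """
    stores = {}
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        stores.setdefault(ring.locate(chunk), set()).add(chunk)
    return stores
```

Calling `place(b"AAAABBBBAAAACCCCBBBB", HashRing(["cs1", "cs2", "cs3"]))` routes both copies of `AAAA` and both copies of `BBBB` to their respective owners, so only three distinct chunks are stored in total. Removing a server with `remove` reassigns only the chunks that server owned, which is the property that makes consistent hashing attractive for failure recovery.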

Author information

Authors and Affiliations

College of Information and Communication Engineering, Sungkyunkwan University, Suwon, Korea
Keonwoo Kim, Jeehong Kim, Changwoo Min & Young Ik Eom
Samsung Electronics Co., Ltd., Suwon, Korea
Changwoo Min

Editor information

Editors and Affiliations

L-I.S.U.T. - D.A.P.I.t. Facoltà Ingegneria, Università degli Studi della Basilicata, Viale dell’Ateneo Lucano, 10, 85100, Potenza, Italy
Beniamino Murgante
Covenant University, Canaanland, Ota, Nigeria
Sanjay Misra
Dipartimento di Scienze e Tecnologie per l’Agricoltura, le Foreste, la Natura e l’Energia, Università degli Studi della Tuscia, Via S. Camillo de Lellis, snc, 01100, Viterbo, Italy
Maurizio Carlini
Dipartimento di Scienze dell’Ingegneria Civile e dell’Architettura, Politecnico di Bari, Via Orabona, 4, 70125, Bari, Italy
Carmelo M. Torre
International University VNU-HCM, Quarter 6, Linh Trung, Thu Duc, Ho Chi Minh City, Vietnam
Hong-Quang Nguyen
School of Business Systems, Monash University, 3800, Clayton, VIC, Australia
David Taniar
Department of Intelligent Informatics, Kyushu Sangyo University, 2-3-1 Matsukadai, 813-8503, Higashi-ku, Fukuoka, Japan
Bernady O. Apduhan
Department of Mathematics and Computer Science, University of Perugia, Via Vanvitelli, 1, 06123, Perugia, Italy
Osvaldo Gervasi

Cite this paper

Kim, K., Kim, J., Min, C., Eom, Y.I. (2013). Content-Based Chunk Placement Scheme for Decentralized Deduplication on Distributed File Systems. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2013. ICCSA 2013. Lecture Notes in Computer Science, vol 7971. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39637-3_14

DOI: https://doi.org/10.1007/978-3-642-39637-3_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39636-6
Online ISBN: 978-3-642-39637-3
eBook Packages: Computer Science, Computer Science (R0)
