Content-Based Chunk Placement Scheme for Decentralized Deduplication on Distributed File Systems

  • Conference paper
Computational Science and Its Applications – ICCSA 2013 (ICCSA 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7971)

Abstract

The rapid growth of data volume causes several problems, such as storage limitations and rising data management costs. Distributed file systems (DFSs) are widely used to store and manage massive data, and data deduplication schemes are being studied extensively as a way to reduce storage volume. Deduplication increases the available storage capacity by eliminating duplicated data. However, the deduplication process incurs performance overhead, such as additional disk I/O. In this paper, we propose a content-based chunk placement scheme that increases the deduplication rate on a DFS. To avoid the performance overhead caused by the deduplication process, we use lessfs on each chunk server. With this design, the system performs deduplication in a decentralized manner on each chunk server. Moreover, we use consistent hashing for chunk allocation and failure recovery. Our experimental results show that the proposed system reduces storage space by 60% compared with the system without consistent hashing.
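
The idea of content-based placement with consistent hashing can be illustrated with a minimal Python sketch. It is not the paper's implementation: the SHA-1 hash, the fixed 4096-byte chunk size, the `vnodes` parameter, and the names `ChunkServer`, `Ring`, and `store_file` are all assumptions made for illustration, and a plain dictionary stands in for the local deduplication that the abstract attributes to lessfs. The point it shows is that identical chunks always hash to the same server, so each server can remove duplicates on its own.

```python
import bisect
import hashlib


def chunk_key(data: bytes) -> int:
    """Map chunk content to a ring position (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


class ChunkServer:
    """Hypothetical chunk server; a dict stands in for local deduplication."""

    def __init__(self, name: str):
        self.name = name
        self.chunks = {}   # content key -> chunk bytes, stored once
        self.writes = 0    # number of write requests received

    def write(self, key: int, data: bytes) -> None:
        self.writes += 1
        self.chunks.setdefault(key, data)  # a duplicate chunk adds no new data


class Ring:
    """Consistent-hash ring with virtual nodes (`vnodes` is illustrative)."""

    def __init__(self, servers, vnodes: int = 64):
        self.points = []   # sorted ring positions
        self.owner = {}    # ring position -> server
        for server in servers:
            for i in range(vnodes):
                point = chunk_key(f"{server.name}#{i}".encode())
                self.owner[point] = server
                bisect.insort(self.points, point)

    def locate(self, key: int) -> ChunkServer:
        # First ring position clockwise from the key, wrapping at the end.
        i = bisect.bisect_right(self.points, key) % len(self.points)
        return self.owner[self.points[i]]


def store_file(ring: Ring, data: bytes, chunk_size: int = 4096) -> None:
    """Split data into fixed-size chunks and place each one by its content."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        key = chunk_key(chunk)
        ring.locate(key).write(key, chunk)


servers = [ChunkServer(f"cs{i}") for i in range(4)]
ring = Ring(servers)
payload = b"A" * 4096 + b"B" * 4096
store_file(ring, payload)
store_file(ring, payload)  # writing the same content a second time
print(sum(s.writes for s in servers), "writes,",
      sum(len(s.chunks) for s in servers), "chunks stored")
```

Because placement depends only on chunk content, no central index is needed to find duplicates. In a ring of this kind, removing a server reassigns only the keys that server owned, which is the usual reason consistent hashing is chosen for allocation and recovery.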

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kim, K., Kim, J., Min, C., Eom, Y.I. (2013). Content-Based Chunk Placement Scheme for Decentralized Deduplication on Distributed File Systems. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2013. ICCSA 2013. Lecture Notes in Computer Science, vol 7971. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39637-3_14

  • DOI: https://doi.org/10.1007/978-3-642-39637-3_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-39636-6

  • Online ISBN: 978-3-642-39637-3

  • eBook Packages: Computer Science, Computer Science (R0)
