An Overview on Data Deduplication Techniques

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 455)

Abstract

Massive data volumes place ever higher demands on storage capacity, but in practice the growth of capacity lags far behind the growth of data. Among data reduction techniques, deduplication stands out for its high efficiency, low resource consumption, and broad applicability. Data deduplication refers to finding and eliminating redundant data within a storage system. In a local storage system, only one copy of each data object needs to be stored, which saves limited storage space; in a networked system, deduplication saves storage space and also reduces the bandwidth that transfers consume, which speeds up transmission. It is a trade-off: efficient storage is achieved at the cost of computational overhead. This article introduces data deduplication techniques, describes their basic principles and processes, summarizes the main techniques in current research, and offers recommendations for future development.
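
The core idea in the abstract, storing only one copy of each data object and paying for it with extra computation, can be illustrated with a minimal sketch. This is not the authors' method: the class name `DedupStore`, the fixed-size chunking, the 4096-byte default chunk size, and the use of SHA-256 fingerprints are all assumptions chosen for illustration.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical): each unique chunk is kept once."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumed fixed chunk size, in bytes
        self.chunks = {}              # fingerprint -> chunk bytes, stored once
        self.files = {}               # file name -> ordered list of fingerprints

    def put(self, name, data):
        """Split data into fixed-size chunks and store only unseen chunks."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            # Hashing is the computational overhead traded for storage space.
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # no-op if the chunk is a duplicate
            recipe.append(fp)
        self.files[name] = recipe

    def get(self, name):
        """Rebuild a file from its list of chunk fingerprints."""
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self):
        """Physical bytes actually held after deduplication."""
        return sum(len(c) for c in self.chunks.values())


if __name__ == "__main__":
    store = DedupStore(chunk_size=4)
    store.put("a", b"AAAABBBBAAAA")
    store.put("b", b"BBBBCCCC")
    logical = len(store.get("a")) + len(store.get("b"))
    print(logical, store.stored_bytes())
```

In this example the two files hold 20 logical bytes, but only the three distinct chunks (12 bytes) are stored. In a networked setting the same fingerprints could be exchanged first so that only chunks the receiver lacks are sent, which is where the bandwidth saving would come from.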

Author information

Corresponding author

Correspondence to Xuecheng Zhang.

Copyright information

© 2017 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhang, X., Deng, M. (2017). An Overview on Data Deduplication Techniques. In: Balas, V., Jain, L., Zhao, X. (eds) Information Technology and Intelligent Transportation Systems. Advances in Intelligent Systems and Computing, vol 455. Springer, Cham. https://doi.org/10.1007/978-3-319-38771-0_35

  • DOI: https://doi.org/10.1007/978-3-319-38771-0_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-38769-7

  • Online ISBN: 978-3-319-38771-0

  • eBook Packages: Engineering, Engineering (R0)
