Abstract

Although various deduplication techniques have been proposed and used, no single solution handles all types of redundancy best. Each technique balances performance against overhead and is designed around the characteristics of the data sets, the capacity of the system and the time at which deduplication runs. For example, if the data sets contain many duplicate files, deduplication can compare whole files without looking inside them, which shortens running time. However, if the data sets contain similar rather than identical files, deduplication should look inside the files to find which parts of the content match previously saved data, which saves more storage space. System capacity also shapes the design. High-capacity servers can absorb considerable deduplication overhead, but low-capacity clients need lightweight designs to stay fast. Studies have also been conducted on reducing redundancy at routers (or switches) within a network. This approach requires fast processing of data packets at the routers, which is crucial for Internet service providers (ISPs). Meanwhile, if a system removes redundancy directly in the write path and storage space is confined, it is better to eliminate redundant data before it is stored. On the other hand, if a system has idle time or enough space to store data temporarily, deduplication can be performed after the data are placed in temporary storage. In this chapter, we classify existing deduplication techniques by granularity, place of deduplication and deduplication time. We start by explaining how to detect redundancy efficiently using chunk index caches and Bloom filters. We then describe how each deduplication technique works, along with existing approaches, and elaborate on commercial and academic deduplication solutions.
All implementation code was tested and run on Ubuntu 12.04 (Precise Pangolin).
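
The granularity trade-off described above can be illustrated with a minimal sketch. It is not the chapter's implementation: the function names, the three sample files and the 4-byte fixed chunk size are all assumptions chosen for illustration, and SHA-1 is used only as an example fingerprint. The sketch counts how many bytes must be stored when duplicates are detected per whole file versus per fixed-size chunk.

```python
import hashlib


def file_level_dedup(files):
    """Keep one copy per distinct whole-file fingerprint; return bytes stored."""
    store = {}
    for data in files.values():
        # Identical files share a fingerprint, so only the first copy is kept.
        store.setdefault(hashlib.sha1(data).hexdigest(), data)
    return sum(len(d) for d in store.values())


def chunk_level_dedup(files, chunk_size=4):
    """Keep one copy per distinct fixed-size chunk; return bytes stored.

    chunk_size=4 is an illustrative assumption, far smaller than real systems use.
    """
    store = {}
    for data in files.values():
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            store.setdefault(hashlib.sha1(chunk).hexdigest(), chunk)
    return sum(len(c) for c in store.values())


# Hypothetical data set: a.txt and b.txt are identical,
# c.txt is similar (only its last chunk differs).
files = {
    "a.txt": b"AAAABBBBCCCC",
    "b.txt": b"AAAABBBBCCCC",
    "c.txt": b"AAAABBBBDDDD",
}

raw_bytes = sum(len(d) for d in files.values())
file_bytes = file_level_dedup(files)
chunk_bytes = chunk_level_dedup(files)
print(raw_bytes, file_bytes, chunk_bytes)  # prints "36 24 16"
```

In this toy data set, file-level comparison removes only the exact duplicate, while chunk-level comparison also removes the shared chunks of the similar file, at the cost of hashing and indexing every chunk.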

Copyright information

© 2017 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Kim, D., Song, S., Choi, BY. (2017). Existing Deduplication Techniques. In: Data Deduplication for Data Optimization for Storage and Network Systems. Springer, Cham. https://doi.org/10.1007/978-3-319-42280-0_2

  • DOI: https://doi.org/10.1007/978-3-319-42280-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-42278-7

  • Online ISBN: 978-3-319-42280-0

  • eBook Packages: Engineering, Engineering (R0)
