Existing Deduplication Techniques

Kim, Daehee; Song, Sejun; Choi, Baek-Young

doi:10.1007/978-3-319-42280-0_2

Abstract

Although various deduplication techniques have been proposed and deployed, no single solution handles all types of redundancy. Each technique balances performance against overhead, and its design reflects the characteristics of the data sets, the capacity of the system, and the time at which deduplication runs. For example, if the data sets contain many duplicate files, deduplication can compare whole files without inspecting their contents, which shortens running time. However, if the data sets contain similar rather than identical files, deduplication should look inside the files to find which parts match previously saved data, which saves more storage space. System capacity also shapes the design. High-capacity servers can absorb considerable deduplication overhead, whereas low-capacity clients need lightweight designs to stay fast. Studies have also been conducted on reducing redundancy at routers (or switches) within a network. This approach requires fast processing of data packets at the routers, which is crucial for Internet service providers (ISPs). Meanwhile, if a system removes redundancy directly in the write path and storage space is confined, it is better to eliminate redundant data before they are stored. On the other hand, if a system has residual (or idle) time or enough space to store data temporarily, deduplication can be performed after the data are placed in temporary storage. In this chapter, we classify existing deduplication techniques by granularity, place of deduplication, and deduplication time. We start by explaining how to detect redundancy efficiently using chunk index caches and Bloom filters. We then describe how each deduplication technique works, along with existing approaches, and elaborate on commercial and academic deduplication solutions.
All implementation code was tested and run on Ubuntu 12.04 (Precise).
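
The granularity trade-off described in the abstract can be illustrated with a small sketch. It is not the chapter's implementation: the function names, the use of SHA-1 fingerprints, and the 8-byte fixed chunk size are assumptions chosen only to keep the example short. It compares how many bytes must be stored when duplicates are detected per whole file versus per fixed-size chunk.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-1 digest used as the identity of a file or chunk (assumed choice)."""
    return hashlib.sha1(data).hexdigest()


def file_level_stored_bytes(files):
    """Store each distinct file once, comparing whole-file fingerprints only."""
    seen = set()
    stored = 0
    for data in files:
        fp = fingerprint(data)
        if fp not in seen:
            seen.add(fp)
            stored += len(data)
    return stored


def chunk_level_stored_bytes(files, chunk_size=8):
    """Split files into fixed-size chunks and store each distinct chunk once."""
    seen = set()
    stored = 0
    for data in files:
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            fp = fingerprint(chunk)
            if fp not in seen:
                seen.add(fp)
                stored += len(chunk)
    return stored


# Hypothetical data set: one file, an identical copy, and a similar file
# that differs only in its last 8 bytes.
original = b"A" * 8 + b"B" * 8 + b"C" * 8
identical_copy = bytes(original)
similar = b"A" * 8 + b"B" * 8 + b"D" * 8
files = [original, identical_copy, similar]

raw = sum(len(f) for f in files)
file_level = file_level_stored_bytes(files)
chunk_level = chunk_level_stored_bytes(files)

print("raw bytes:        ", raw)          # 72
print("file-level dedup: ", file_level)   # 48
print("chunk-level dedup:", chunk_level)  # 32
```

File-level comparison removes only the identical copy, while chunk-level comparison also removes the two chunks that the similar file shares with the original, at the cost of computing and indexing more fingerprints.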

Author information

Authors and Affiliations

Department of Computing and New Media Technologies, University of Wisconsin-Stevens Point, Stevens Point, Wisconsin, USA
Daehee Kim
Department of Computer Science and Electrical Engineering, University of Missouri-Kansas City, Kansas City, Missouri, USA
Sejun Song & Baek-Young Choi

About this chapter

Cite this chapter

Kim, D., Song, S., Choi, BY. (2017). Existing Deduplication Techniques. In: Data Deduplication for Data Optimization for Storage and Network Systems. Springer, Cham. https://doi.org/10.1007/978-3-319-42280-0_2

DOI: https://doi.org/10.1007/978-3-319-42280-0_2
Published: 09 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42278-7
Online ISBN: 978-3-319-42280-0
eBook Packages: Engineering, Engineering (R0)
