Abstract
Although various deduplication techniques have been proposed and deployed, no single solution handles all types of redundancy. Each technique trades performance against overhead and is designed around the characteristics of the data sets, the system capacity, and the time at which deduplication runs. For example, if the data sets contain many duplicate files, deduplication can compare whole files without looking inside the file content, which shortens running time. However, if the data sets contain similar rather than identical files, deduplication should look inside the files to find which parts match previously saved data, which yields better storage space savings. System capacity also shapes the design. High-capacity servers can absorb considerable deduplication overhead, whereas low-capacity clients need lightweight designs for fast performance. Studies have also been conducted on reducing redundancies at routers (or switches) within a network. This approach requires fast processing of data packets at the routers, which is crucial for Internet service providers (ISPs). Meanwhile, if a system removes redundancies directly in the write path and storage space is confined, it is better to eliminate redundant data before it is stored. On the other hand, if a system has residual (or idle) time or enough space to store data temporarily, deduplication can be performed after the data are placed in temporary storage. In this chapter, we classify existing deduplication techniques by granularity, place of deduplication, and deduplication time. We start by explaining how to detect redundancy efficiently using chunk index caches and Bloom filters. We then describe how each deduplication technique works, along with existing approaches, and elaborate on existing commercial and academic deduplication solutions.
All implementation code was tested and run on Ubuntu 12.04 (Precise).
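The idea of detecting redundancy with a chunk index fronted by a Bloom filter can be illustrated with a minimal sketch. It is not taken from the chapter's own implementation. It assumes fixed-size chunking, SHA-1 fingerprints, and an in-memory dictionary standing in for the on-disk chunk index. The names `BloomFilter`, `ChunkStore`, `might_contain`, and `index_lookups` are hypothetical and chosen only for this example.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # num_bits is assumed to be a multiple of 8 in this sketch.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-1 with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class ChunkStore:
    """Chunk-level deduplication with a Bloom filter in front of the index."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.index = {}          # fingerprint -> chunk (stand-in for a disk index)
        self.bloom = BloomFilter()
        self.index_lookups = 0   # counts the (assumed expensive) index probes

    def write(self, data):
        """Store data and return its recipe, the ordered list of fingerprints."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            fingerprint = hashlib.sha1(chunk).digest()
            recipe.append(fingerprint)
            # Probe the index only when the filter says the chunk may exist.
            if self.bloom.might_contain(fingerprint):
                self.index_lookups += 1
                if fingerprint in self.index:
                    continue     # duplicate chunk: nothing new to store
            self.index[fingerprint] = chunk
            self.bloom.add(fingerprint)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from a recipe."""
        return b"".join(self.index[fp] for fp in recipe)


store = ChunkStore(chunk_size=4)
first = store.write(b"AAAABBBBCCCC")
second = store.write(b"AAAAXXXXCCCC")   # shares two chunks with the first write
unique_bytes = sum(len(chunk) for chunk in store.index.values())
```

In this usage, the two writes total 24 bytes but only four distinct chunks are kept, because the second write shares two chunks with the first. A file-level scheme, which hashes each whole input, would treat the two inputs as entirely different and store both in full. The filter lets new chunks skip the index probe in the common case, at the cost of an occasional false positive that triggers an unnecessary lookup.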
Copyright information
© 2017 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Kim, D., Song, S., Choi, BY. (2017). Existing Deduplication Techniques. In: Data Deduplication for Data Optimization for Storage and Network Systems. Springer, Cham. https://doi.org/10.1007/978-3-319-42280-0_2
DOI: https://doi.org/10.1007/978-3-319-42280-0_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42278-7
Online ISBN: 978-3-319-42280-0
eBook Packages: Engineering, Engineering (R0)