Metadata Feedback and Utilization for Data Deduplication Across WAN

Zhou, Bing; Wen, Jiang-Tao

doi:10.1007/s11390-016-1650-6

Metadata Feedback and Utilization for Data Deduplication Across WAN

Regular Paper
Published: 06 May 2016

Volume 31, pages 604–623, (2016)
Cite this article

Journal of Computer Science and Technology Aims and scope Submit manuscript

Bing Zhou^1,2,3 &
Jiang-Tao Wen^1,2,3

106 Accesses
2 Citations
Explore all metrics

Abstract

Data deduplication for file communication across wide area network (WAN) in the applications such as file synchronization and mirroring of cloud environments usually achieves significant bandwidth saving at the cost of significant time overheads of data deduplication. The time overheads include the time required for data deduplication at two geographi-cally distributed nodes (e.g., disk access bottleneck) and the duplication query/answer operations between the sender and the receiver, since each query or answer introduces at least one round-trip time (RTT) of latency. In this paper, we present a data deduplication system across WAN with metadata feedback and metadata utilization (MFMU), in order to harness the data deduplication related time overheads. In the proposed MFMU system, selective metadata feedbacks from the receiver to the sender are introduced to reduce the number of duplication query/answer operations. In addition, to harness the metadata related disk I/O operations at the receiver, as well as the bandwidth overhead introduced by the metadata feedbacks, a hysteresis hash re-chunking mechanism based metadata utilization component is introduced. Our experimental results demonstrated that MFMU achieved an average of 20%∼40% deduplication acceleration with the bandwidth saving ratio not reduced by the metadata feedbacks, as compared with the “baseline” content defined chunking (CDC) used in LBFS (Low-bandwith Network File system) and exiting state-of-the-art Bimodal chunking algorithms based data deduplication solutions.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Content-Based Chunk Placement Scheme for Decentralized Deduplication on Distributed File Systems

Study of Chunking Algorithm in Data Deduplication

A smart hybrid content-defined chunking algorithm for data deduplication in cloud storage

Article 14 October 2023

Manogar Ellappan & Abirami Murugappan

References

Muthitacharoen A, Chen B, Mazi`eres D. A low-bandwidth network file system. In Proc. the 18th ACM Symposium on Operating Systems Principles (SOSP), October 2001, pp.174-187.
Jain N, Dahlin M, Tewari R. TAPER: Tiered approach for eliminating redundancy in replica synchronization. In Proc. the 10th USENIX Conference on File and Storage Technologies (FAST), December 2005, pp.21-34.
Rabin M O. Fingerprinting by random polynomials. Technical Report, TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
Black J. Compare-by-hash: A reasoned analysis. In Proc. the USENIX Annual Technical Conference (ATC), May 2006, pp.85-90.
Zhu B, Li K, Patterson H. Avoiding the disk bottleneck in the data domain deduplication file system. In Proc. the 6th USENIX Conference on File and Storage Technologies (FAST), February 2008, pp.269-282.
Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezise G, Camble P. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proc. the 7th USENIX Conference on File and Storage Technologies (FAST), February 2009, pp.111-123.
Guo F, Efstathopoulos P. Building a high-performance deduplication system. In Proc. the USENIX Annual Technical Conference (ATC), June 2011, pp.25-38.
Kruus E, Ungureanu C, Dubnicki C. Bimodal content defined chunking for backup streams. In Proc. the 8th USENIX Conference on File and Storage Technologies (FAST), February 2010, pp.18-31.
Bartlomiej R, Lukasz H, Wojciech K, Krzysztof L, Cezary D. Anchor-driven subchunk deduplication. In Proc. the 4th Annual International Conference on Systems and Storage (SYSTOR), May 2011, pp.16:1–16:13.
Bhagwat D, Eshghi K, Long D D E, Lillibridge M. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proc. the IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems (MASCOTS), September 2009.
Zhou B, Wen J. Hysteresis re-chunking based metadata harnessing deduplication of disk images. In Proc. the 42nd IEEE International Conference on Parallel Processing (ICPP), October 2013, pp.389-398.
Ha S, Rhee I, Xu L. CUBIC: A new TCP-friendly highspeed TCP variant. In Proc. the ACM Operating Systems Review, Research and Developments in the Linux Kernel (SIGOPS), July 2008, Volume 42, pp.64-74.
Zhou B, Wen J. Efficient file communication via deduplication with manifest feedback. IEEE Communications Letters, 2014, 18: 94–97.
Article Google Scholar
Shilane P, Huang M, Wallace G, Hsu W. WAN-optimized replication of backup datasets using stream-informed delta compression. Transactions on Storage, 2012, 8(4): 13:1-13:26.
Park K, Ihm S, Bowman M, Pai V S. Supporting practical content-addressable caching with CZIP compression. In Proc. the USENIX Annual Technical Conference (ATC), June 2007, pp.14:1–14:14.
Lin X, Lu G, Douglis F, Shilane P, Wallace G. Migratory compression: Coarse-grained data reordering to improve compressibility. In Proc. the 12th USENIX Conference on File and Storage Technologies (FAST), February 2014, pp.257-271.
Eshghi K, Tang H K. A framework for analyzing and improving contentbased chunking algorithms. Technical Report, HPL2005-30 R.1, Hewlett Packard Laboratories, Palo Alto, 2005.
Min J, Yoon D, Won Y. Efficient deduplication techniques for modern backup operation. IEEE Transactions on Computers, 2011, 60(6): 824–840.
Article MathSciNet Google Scholar
Pagh R, Rodler F F. Cuckoo hashing. Journal of Algorithms, 2004, 51(2): 122–144.
Fabiano C, Botelho N G, Hsu W. Memory efficient sanitization of a deduplicated storage system. In Proc. the 11th USENIX Conference on File and Storage Technologies (FAST), February 2013, pp.81-94.
Bloom B H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 1970, 13(7): 422–426.
Article MATH Google Scholar
Wildani A, Miller E, Rodeh O. HANDS: A heuristically arranged non-backup inline deduplication system. In Proc. the 29th IEEE International Conference on Data Engineering (ICDE), April 2013, pp.446-457.
Srinivasan K, Bisson T, Goodson G, Voruganti K. iDedup: Latency-aware, inline data deduplication for primary storage. In Proc. the 10th USENIX Conference on File and Storage Technologies (FAST), February 2012, pp.299-312.
Debnath B, Sengupta S, Li J. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proc. the USENIX Annual Technical Conference (ATC), February 2010, pp.16-29.
Meister D, Brinkmann A. dedupv1: Improving deduplication throughput using solid state drives (SSD). In Proc. the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), May 2010.
Chen F, Luo T, Zhang X. CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proc. the 9th USENIX Conference on File and Storage Technologies (FAST), February 2011, pp.6-19.
Agrawal N, Prabhakaran V, Wobber T, Davis J D, Manasse M, Panigrahy R. Design tradeoffs for SSD performance. In Proc. the USENIX Annual Technical Conference (ATC), June 2008, pp.57-70.
Meister D, Brinkmann A, S¨u_ T. File recipe compression in data deduplication systems. In Proc. the 11th USENIX Conference on File and Storage Technologies (FAST), February 2013, pp.175-182.
Bobbarjung D R, Jagannathan S, Dubnicki C. Improving duplicate elimination in storage systems. ACM Transactions on Storage, 2006, 2(4): 424–448.
Article Google Scholar
Lu G, Jin Y, Du D. Frequency based chunking for data deduplication. In Proc. the IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems (MASCOTS), August 2010, pp.287-296.
Fu M, Feng D, Hua Y, He X, Chen Z, XiaW, Zhang Y, Tan Y. Design tradeoffs for data deduplication performance in backup workloads. In Proc. the 13th USENIX Conference on File and Storage Technologies (FAST), February 2015, pp.331-344.
Fu M, Feng D, Hua Y, He X, Chen Z, Xia W, Huang F, Liu Q. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proc. the USENIX Annual Technical Conference (ATC), June 2014, pp.181-192.
Tang Y, Yang J. Secure deduplication of general computations. In Proc. the USENIX Annual Technical Conference (ATC), July 2015, pp.319-331.
Zhang W, Yang T, Narayanasamy G, Tang H. Low-cost data deduplication for virtual machine backup in cloud storage. In Proc. the 5th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), June 2013, pp.12-16.
Leung A W, Shao M, Bisson T, Pasupathy S, Miller E L. Spyglass: Fast, scalable metadata search for large-scale storage systems. In Proc. the 7th USENIX Conference on File and Storage Technologies (FAST), February 2009, pp.153-166.
Hua Y, Jiang H, Zhu Y, Feng D, Tian L. Semantic-aware metadata organization paradigm in next-generation file systems. IEEE Transactions on Parallel and Distributed Systems, 2012, 23(2): 337–344.
Article Google Scholar

Download references

Author information

Authors and Affiliations

State Key Laboratory on Intelligent Technology and Systems, Tsinghua University, Beijing, 100084, China
Bing Zhou & Jiang-Tao Wen
Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, 100084, China
Bing Zhou & Jiang-Tao Wen
Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Bing Zhou & Jiang-Tao Wen

Authors

Bing Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Jiang-Tao Wen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Bing Zhou.

Additional information

This work was supported by the National Science Fund for Distinguished Young Scholars of China under Grant No. 61125102 and the State Key Program of National Natural Science Foundation of China under Grant No. 61133008.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhou, B., Wen, JT. Metadata Feedback and Utilization for Data Deduplication Across WAN. J. Comput. Sci. Technol. 31, 604–623 (2016). https://doi.org/10.1007/s11390-016-1650-6

Download citation

Received: 11 May 2015
Revised: 19 January 2016
Published: 06 May 2016
Issue Date: May 2016
DOI: https://doi.org/10.1007/s11390-016-1650-6

Keywords

We’re sorry, something doesn't seem to be working properly.

Please try refreshing the page. If that doesn't work, please contact support so we can address the problem.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Metadata Feedback and Utilization for Data Deduplication Across WAN

Abstract

Access this article

Similar content being viewed by others

Content-Based Chunk Placement Scheme for Decentralized Deduplication on Distributed File Systems

Study of Chunking Algorithm in Data Deduplication

A smart hybrid content-defined chunking algorithm for data deduplication in cloud storage

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Metadata Feedback and Utilization for Data Deduplication Across WAN

Abstract

Access this article

Similar content being viewed by others

Content-Based Chunk Placement Scheme for Decentralized Deduplication on Distributed File Systems

Study of Chunking Algorithm in Data Deduplication

A smart hybrid content-defined chunking algorithm for data deduplication in cloud storage

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation