Simdedup: A New Deduplication Scheme Based on Simhash

  • Wenbin Yao
  • Pengdi Ye
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7901)


Maintaining high deduplication throughput with low system overhead is a challenge for deduplication systems in massive data storage environments. This paper presents Simdedup, a near-exact deduplication scheme that exploits file similarity and chunk locality to address this challenge. Simdedup partitions a file object into several segments and uses the simhash algorithm to find the most similar segments. It keeps the chunk fingerprints of the most similar segment in an in-memory deduplication cache, which speeds up the detection of redundant data. Simdedup needs few disk accesses per file for chunk lookup, which yields reasonable throughput. Experimental results show that Simdedup performs better in both system overhead and deduplication ratio than deduplication schemes that employ traits detection.
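
The flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The fixed-size chunking, the SHA-1 fingerprints, the 64-bit signature width, and the names `simhash`, `hamming`, `chunk_fingerprints`, and `dedup_segment` are all assumptions made for illustration, and the in-memory dictionaries stand in for the on-disk index and fingerprint store.

```python
import hashlib


def simhash(features, bits=64):
    """Charikar-style simhash over an iterable of byte-string features."""
    counts = [0] * bits
    for feat in features:
        # Hash each feature to a `bits`-wide integer (SHA-1 prefix, an assumption).
        h = int.from_bytes(hashlib.sha1(feat).digest()[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the signature is set when the majority of features had it set.
    sig = 0
    for i in range(bits):
        if counts[i] > 0:
            sig |= 1 << i
    return sig


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def chunk_fingerprints(segment, chunk_size=8):
    """Split a segment into fixed-size chunks (assumed) and fingerprint each."""
    return [
        hashlib.sha1(segment[i : i + chunk_size]).hexdigest()
        for i in range(0, len(segment), chunk_size)
    ]


def dedup_segment(segment, index, store):
    """Return the fingerprints of chunks not found in the most similar segment.

    index: segment id -> simhash signature (hypothetical in-memory index)
    store: segment id -> list of chunk fingerprints (stands in for disk)
    """
    fps = chunk_fingerprints(segment)
    sig = simhash(fp.encode() for fp in fps)
    cache = set()
    if index:
        # Pick the stored segment whose signature is closest in Hamming distance
        # and load only its fingerprints into the deduplication cache.
        best = min(index, key=lambda sid: hamming(sig, index[sid]))
        cache = set(store[best])
    new_chunks = [fp for fp in fps if fp not in cache]
    seg_id = len(index)
    index[seg_id] = sig
    store[seg_id] = fps
    return new_chunks


if __name__ == "__main__":
    index, store = {}, {}
    data = bytes(range(64))
    print(len(dedup_segment(data, index, store)))  # all chunks are new
    print(len(dedup_segment(data, index, store)))  # identical segment, nothing new
```

Because only the fingerprints of the single most similar segment are consulted, a duplicate chunk that lives in some other segment goes undetected in this sketch. That trade-off is what makes such a scheme near-exact rather than exact.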


Massive data storage, deduplication, similarity detection





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Wenbin Yao (1, 2, 3)
  • Pengdi Ye (1, 2, 3)
  1. Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, China
  2. Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, China
  3. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
