Optimizing Memory Space by Removing Duplicate Files Using Similarity Digest Technique

  • Vedant Sharma
  • Priyamwada Sharma
  • Santosh Sahu
Conference paper
Part of the Lecture Notes in Networks and Systems book series (LNNS, volume 100)


In this paper, we propose a data cleaning technique for memory space optimization. We use sdhash to detect and remove duplicate files in memory quickly and efficiently. Correct identification of duplicate files is the first critical step in the data cleaning process. The fast growth of data targets demands new automated methods for removing duplicated data quickly, accurately, and reliably. The sdhash tool computes, stores, and compares similarity hashes of data files, referred to as similarity digests, and reports a similarity score for each comparison. In contrast to the brute-force method, which compares whole files, our method compares only the fingerprints of the files and can efficiently identify duplicates. In addition, our evaluation data, which contains hundreds of files, provides insight into the typical levels of content similarity across related files. The proposed method performs well in terms of time and space complexity.
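
The fingerprint-comparison workflow described above can be illustrated with a minimal Python sketch. It does not use sdhash or reproduce its algorithm: the stand-in fingerprint here is a simple bottom-k sketch of hashed byte windows, and the names `toy_digest`, `similarity`, `find_duplicates`, the window size, the sketch size, and the threshold of 90 are all assumptions made for illustration. The point is only the shape of the approach: each file is fingerprinted once, and duplicates are then found by comparing the small fingerprints rather than the whole files.

```python
import hashlib
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

WINDOW = 8   # bytes per sliding window (hypothetical choice)
KEEP = 64    # hashes kept per fingerprint (hypothetical choice)


def toy_digest(data: bytes) -> FrozenSet[int]:
    """Stand-in fingerprint (NOT sdhash): the KEEP smallest window hashes."""
    if len(data) < WINDOW:
        windows = {data}
    else:
        windows = {data[i:i + WINDOW] for i in range(len(data) - WINDOW + 1)}
    hashes = sorted(
        int.from_bytes(hashlib.sha1(w).digest()[:8], "big") for w in windows
    )
    return frozenset(hashes[:KEEP])


def similarity(a: FrozenSet[int], b: FrozenSet[int]) -> int:
    """Estimate content overlap on a 0-100 scale from two fingerprints."""
    smallest = sorted(a | b)[:KEEP]
    shared = sum(1 for h in smallest if h in a and h in b)
    return round(100 * shared / len(smallest))


def find_duplicates(
    files: Dict[str, bytes], threshold: int = 90
) -> List[Tuple[str, str, int]]:
    """Fingerprint each file once, then compare fingerprints pairwise."""
    digests = {name: toy_digest(data) for name, data in files.items()}
    pairs = []
    for n1, n2 in combinations(sorted(digests), 2):
        score = similarity(digests[n1], digests[n2])
        if score >= threshold:  # threshold is an assumed cut-off
            pairs.append((n1, n2, score))
    return pairs


# Small in-memory example; real use would read file contents from disk.
text = b"The quick brown fox jumps over the lazy dog. " * 20
sample = {"a.txt": text, "b.txt": text, "c.bin": bytes(range(256)) * 4}
print(find_duplicates(sample))
```

In this sketch each fingerprint holds at most 64 integers regardless of file size, which is what makes the pairwise comparison cheap compared with reading and comparing whole files.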


Data cleaning, Fingerprinting, Ssdeep, Sdhash



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • Vedant Sharma (1)
  • Priyamwada Sharma (2), corresponding author
  • Santosh Sahu (2)
  1. University Institute of Technology, Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal, India
  2. School of Information Technology, Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal, India
