Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 556)

Abstract

The ability to measure the similarity between two bodies of text, or within a single body of text, has many applications, including plagiarism detection and the identification of reused text (strings) in a database so that duplicates can be removed. Earlier structure-metric approaches have used either suffix trees or variants of longest common subsequence algorithms to recognize duplicate text. In this paper, several string distance metrics are investigated for detecting matching strings: Levenshtein Distance (L. Dist.), Cosine Similarity (C.S.), and Hamming Distance (H. Dist.), together with hashes (ASCII-based hashing) computed on token sequences. Similarity techniques differ in granularity: some operate at the character level, some at the word level, and some at the corpus level. An advantage of the evaluated approaches is that they can handle multiple similarity patterns at a time. The work has been carried out on strings. The simulation shows that ASCII-based hashing outperforms the other techniques in both running time and accuracy. All techniques face the problem that similarity search time grows linearly with database size; hashing handles this issue efficiently and therefore scales well.
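As an illustration of the techniques named in the abstract, the following is a minimal Python sketch, not the authors' implementation. The three metrics follow their textbook definitions. The abstract does not specify the ASCII-based hashing scheme, so `ascii_hash`, its multiplier 31, its modulus, and the `build_index` and `find_matches` helpers are assumptions introduced here purely for illustration.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def cosine_similarity(a: str, b: str) -> float:
    """Word-level cosine similarity over raw term-frequency vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ascii_hash(token: str, modulus: int = 1_000_003) -> int:
    """Hypothetical hash built from character codes (assumed scheme)."""
    h = 0
    for ch in token:
        h = (h * 31 + ord(ch)) % modulus
    return h


def build_index(strings):
    """Map each string's tuple of token hashes to the strings sharing it."""
    index = {}
    for s in strings:
        key = tuple(ascii_hash(t) for t in s.split())
        index.setdefault(key, []).append(s)
    return index


def find_matches(index, query: str):
    """Dictionary lookup; hash collisions could yield false matches."""
    key = tuple(ascii_hash(t) for t in query.split())
    return index.get(key, [])
```

Under these assumptions, a lookup in the hash index is a single dictionary access rather than a pairwise comparison of the query against every stored string, which is one plausible reading of why hashing copes better with a growing database.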

Author information

Corresponding author

Correspondence to Harpreet Kaur.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kaur, H., Maini, R. (2019). Granularity-Based Assessment of Similarity Between Short Text Strings. In: Nath, V., Mandal, J. (eds) Proceedings of the Third International Conference on Microelectronics, Computing and Communication Systems. Lecture Notes in Electrical Engineering, vol 556. Springer, Singapore. https://doi.org/10.1007/978-981-13-7091-5_9

  • DOI: https://doi.org/10.1007/978-981-13-7091-5_9

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-7090-8

  • Online ISBN: 978-981-13-7091-5

  • eBook Packages: Engineering, Engineering (R0)
