Granularity-Based Assessment of Similarity Between Short Text Strings

Kaur, Harpreet; Maini, Raman

doi:10.1007/978-981-13-7091-5_9


Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 556)


Abstract

The ability to detect similarity between two text collections, or within a single collection, has many applications, including plagiarism detection and the identification of reused text (strings) in a database so that duplicates can be removed. Earlier structure-metric approaches have used either suffix trees or variants of longest common subsequence algorithms to recognize duplicate text. This paper investigates several string distance metrics, namely Levenshtein Distance (L. Dist.), Cosine Similarity (C.S.), and Hamming Distance (H. Dist.), together with hashes (ASCII-based hashing) on token sequences, to detect matching strings. Similarity index techniques differ in granularity: some work at the character level, some at the word level, and some at the corpus level. A benefit of the evaluated approaches is that they can handle multiple patterns at a time. The work has been carried out on strings. In simulation, ASCII-based hashing performed better than the other techniques in terms of running time and accuracy. For the other techniques, similarity search time increases linearly with database size, whereas ASCII-based hashing handles this growth efficiently and therefore scales well.
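The four techniques named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace tokenizer used for cosine similarity, and in particular the polynomial form of `ascii_hash` (with its `base` and `mod` parameters) are assumptions made for illustration, since the abstract does not specify the hashing scheme. The sketch only shows why a hash index can answer exact-match queries without scanning every stored string, whereas the pairwise metrics must be evaluated against each candidate.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def cosine_similarity(a: str, b: str) -> float:
    """Word-level cosine similarity over term-frequency vectors.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ascii_hash(text: str, base: int = 257, mod: int = 2**61 - 1) -> int:
    """Hypothetical ASCII-based hash: polynomial over character codes.

    The base and modulus are illustrative choices, not taken from the paper.
    """
    h = 0
    for ch in text:
        h = (h * base + ord(ch)) % mod
    return h


def build_index(strings):
    """Map each hash value to the list of stored strings that produce it."""
    index = {}
    for s in strings:
        index.setdefault(ascii_hash(s), []).append(s)
    return index


def lookup(index, query: str):
    """Return stored strings equal to query, checking only one hash bucket."""
    return [s for s in index.get(ascii_hash(query), []) if s == query]
```

With these definitions, `levenshtein("kitten", "sitting")` evaluates to 3 and `hamming("karolin", "kathrin")` to 3. A call such as `lookup(build_index(database), query)` inspects a single bucket regardless of how many strings are stored, which is one plausible reading of the scalability claim; comparing `query` against every entry with `levenshtein` or `cosine_similarity` grows linearly with the database size.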



Author information

Authors and Affiliations

Department of Computer Engineering, Punjabi University, Patiala, India
Harpreet Kaur & Raman Maini


Corresponding author

Correspondence to Harpreet Kaur.

Editor information

Editors and Affiliations

Department of Electronics and Communication Engineering, Birla Institute of Technology, Mesra, Ranchi, Jharkhand, India
Vijay Nath
Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal, India
Jyotsna Kumar Mandal


About this paper

Cite this paper

Kaur, H., Maini, R. (2019). Granularity-Based Assessment of Similarity Between Short Text Strings. In: Nath, V., Mandal, J. (eds) Proceedings of the Third International Conference on Microelectronics, Computing and Communication Systems. Lecture Notes in Electrical Engineering, vol 556. Springer, Singapore. https://doi.org/10.1007/978-981-13-7091-5_9


DOI: https://doi.org/10.1007/978-981-13-7091-5_9
Published: 24 May 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-7090-8
Online ISBN: 978-981-13-7091-5
eBook Packages: Engineering, Engineering (R0)
