Abstract
The ability to detect similarity between two text collections, or within a single collection, has many applications, including plagiarism detection and the identification of reused text (strings) in a database so that duplicates can be removed. Earlier structure-metric approaches have used either suffix trees or variants of longest common subsequence algorithms to recognize duplicate text. In this paper, several string distance metrics are investigated for detecting matching strings: Levenshtein distance (L. Dist.), cosine similarity (C.S.), and Hamming distance (H. Dist.), together with ASCII-based hashing applied to token sequences. Similarity techniques differ in granularity: some operate at the character level, some at the word level, and some at the corpus level. An advantage of the evaluated approaches is that they can handle multiple patterns at a time. The experiments were carried out on strings. The simulation results show that ASCII-based hashing outperforms the other techniques in both running time and accuracy. For the other techniques, similarity search time increases linearly with database size, whereas ASCII-based hashing handles this growth efficiently and therefore scales well.
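The four measures named in the abstract can be illustrated with a minimal Python sketch. The Levenshtein, Hamming, and cosine functions follow their standard textbook definitions. The abstract does not specify the hashing scheme, so `ascii_token_hash` and `hashed_token_overlap` are hypothetical stand-ins (a polynomial hash over character codes, compared by set overlap) and should not be read as the authors' method.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def cosine_similarity(a: str, b: str) -> float:
    """Word-level cosine similarity over raw term-frequency vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def ascii_token_hash(token: str, base: int = 257, mod: int = 1_000_000_007) -> int:
    """Hypothetical hash: polynomial in the character codes of one token."""
    h = 0
    for ch in token:
        h = (h * base + ord(ch)) % mod
    return h


def hashed_token_overlap(a: str, b: str) -> float:
    """Assumed comparison: Jaccard overlap of the two sets of token hashes."""
    ha = {ascii_token_hash(t) for t in a.split()}
    hb = {ascii_token_hash(t) for t in b.split()}
    if not ha and not hb:
        return 1.0
    return len(ha & hb) / len(ha | hb)
```

The sketch also shows the difference in granularity: `levenshtein` and `hamming` compare characters, while `cosine_similarity` and `hashed_token_overlap` compare whitespace-separated tokens. Hashing each token once and comparing integers is one plausible reason a hash-based approach could scale better than pairwise edit-distance computation, though the abstract itself does not give the mechanism.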
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kaur, H., Maini, R. (2019). Granularity-Based Assessment of Similarity Between Short Text Strings. In: Nath, V., Mandal, J. (eds) Proceedings of the Third International Conference on Microelectronics, Computing and Communication Systems. Lecture Notes in Electrical Engineering, vol 556. Springer, Singapore. https://doi.org/10.1007/978-981-13-7091-5_9
DOI: https://doi.org/10.1007/978-981-13-7091-5_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-7090-8
Online ISBN: 978-981-13-7091-5
eBook Packages: Engineering, Engineering (R0)