Identifying Similar Sentences by Using N-Grams of Characters

Sultana, Saïma; Biskri, Ismaïl

doi:10.1007/978-3-319-92058-0_80

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10868)

Included in the following conference series:

International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems

Abstract

Detecting similar sentences plays a major role in many fundamental applications that read and analyze sentences, such as information retrieval, categorization, paraphrase detection, summarization, and translation. In this work, we present a novel method for detecting similar sentences. The method relies on character n-grams as its basic units. It uses neither an online dictionary nor a search engine, which makes it a simple and efficient way to measure the similarity between two sentences. It also uses no grammar rules or syntactic analysis, so the approach is language-independent. We analyze and compare a range of similarity measures with our methodology. The complexity of our method is O(N^2), which we consider favorable.
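The abstract does not specify how the n-grams are extracted or which similarity measure is used, so the following Python sketch only illustrates the general idea of comparing two sentences by their character n-grams. The function names `char_ngrams` and `dice_similarity`, the choice of trigrams (n = 3), the lowercasing and whitespace normalization, and the Dice coefficient are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams.

    Assumption: the text is lowercased and runs of whitespace are
    collapsed to one space before extraction.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def dice_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over the n-gram multisets of two sentences.

    Assumption: Dice is used here only as one common overlap measure;
    the paper compares several measures that the abstract does not name.
    """
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both inputs are shorter than n characters: nothing to compare.
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((grams_a & grams_b).values())
    return 2.0 * shared / total


if __name__ == "__main__":
    s1 = "The cat sat on the mat."
    s2 = "A cat was sitting on the mat."
    print(f"similarity = {dice_similarity(s1, s2):.3f}")
```

Because the sketch works purely on character sequences, it needs no dictionary, tokenizer, or grammar, which is the property the abstract relies on for language independence.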

Author information

Authors and Affiliations

Université du Québec à Trois-Rivières, Trois-Rivières, QC, G8Z 4M3, Canada
Saïma Sultana & Ismaïl Biskri

Corresponding author

Correspondence to Ismaïl Biskri.

Editor information

Editors and Affiliations

University of Regina, Regina, SK, Canada
Malek Mouhoub
University of Regina, Regina, SK, Canada
Samira Sadaoui
Concordia University, Montreal, QC, Canada
Otmane Ait Mohamed
Texas State University, San Marcos, TX, USA
Moonis Ali

About this paper

Cite this paper

Sultana, S., Biskri, I. (2018). Identifying Similar Sentences by Using N-Grams of Characters. In: Mouhoub, M., Sadaoui, S., Ait Mohamed, O., Ali, M. (eds) Recent Trends and Future Technology in Applied Intelligence. IEA/AIE 2018. Lecture Notes in Computer Science, vol 10868. Springer, Cham. https://doi.org/10.1007/978-3-319-92058-0_80

DOI: https://doi.org/10.1007/978-3-319-92058-0_80
Published: 30 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92057-3
Online ISBN: 978-3-319-92058-0
eBook Packages: Computer Science, Computer Science (R0)