Abstract
This paper proposes an unsupervised approach to measuring text similarity that can compete with supervised approaches. The approach derives the inherent properties of similarity between texts from a corpus in the form of a word n-gram data set, and it is competitive with other text similarity techniques in both performance and practicality. Experimental results on a standard data set show that the proposed unsupervised method outperforms the state-of-the-art supervised method, and that the improvement is statistically significant at the 0.05 level. The approach is language-independent: it can be applied to any language for which n-grams are available.
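The abstract does not spell out the similarity measure, so the following is only a minimal sketch of the general idea of scoring text similarity from word n-gram counts without labelled training data. It is not the method proposed in the paper. The trigram table `TRIGRAMS`, the functions `context_vector`, `word_relatedness`, and `text_similarity`, and the choice of cosine similarity over trigram contexts are all assumptions made for illustration; a real system would read counts from a large n-gram data set instead of a toy dictionary.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy trigram counts standing in for a real word n-gram data set.
TRIGRAMS = {
    ("the", "car", "engine"): 40,
    ("car", "engine", "failed"): 10,
    ("the", "automobile", "engine"): 12,
    ("a", "fast", "car"): 25,
    ("a", "fast", "automobile"): 8,
    ("the", "banana", "peel"): 30,
}


def context_vector(word):
    """Sum the counts of every other word that shares a trigram with `word`."""
    ctx = Counter()
    for gram, count in TRIGRAMS.items():
        if word in gram:
            for other in gram:
                if other != word:
                    ctx[other] += count
    return ctx


def word_relatedness(w1, w2):
    """Cosine similarity of trigram contexts (an illustrative choice, not the paper's)."""
    if w1 == w2:
        return 1.0
    c1, c2 = context_vector(w1), context_vector(w2)
    dot = sum(c1[k] * c2[k] for k in c1 if k in c2)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0  # a word unseen in the n-gram data has no usable context
    return dot / (n1 * n2)


def text_similarity(t1, t2):
    """Average, in both directions, each word's best match in the other text."""
    a, b = t1.lower().split(), t2.lower().split()
    if not a or not b:
        return 0.0

    def directed(src, dst):
        return sum(max(word_relatedness(s, d) for d in dst) for s in src) / len(src)

    return (directed(a, b) + directed(b, a)) / 2
```

With the toy counts above, `text_similarity("a fast car", "a fast automobile")` scores higher than `text_similarity("a fast car", "the banana peel")`, because "car" and "automobile" occur in similar trigram contexts. Nothing in the sketch depends on the language of the text beyond whitespace tokenization, which is the property the abstract points to when it calls the approach language-independent.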
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Islam, A., Milios, E., Kešelj, V. (2012). Text Similarity Using Google Tri-grams. In: Kosseim, L., Inkpen, D. (eds) Advances in Artificial Intelligence. Canadian AI 2012. Lecture Notes in Computer Science, vol 7310. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30353-1_29
DOI: https://doi.org/10.1007/978-3-642-30353-1_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30352-4
Online ISBN: 978-3-642-30353-1
eBook Packages: Computer Science, Computer Science (R0)