A hybrid approach for measuring semantic similarity based on IC-weighted path distance in WordNet
As a valuable tool for text understanding, semantic similarity measurement enables discriminative semantic-based applications in natural language processing, information retrieval, computational linguistics and artificial intelligence. Most existing studies use structured taxonomies such as WordNet to explore lexical semantic relationships; however, improving computation accuracy remains a challenge. To address this problem, we propose CSSM-ICSP, a hybrid WordNet-based approach to measuring concept semantic similarity, which leverages the information content (IC) of concepts to weight the shortest path distance between them. To improve IC computation, we also develop a novel model of the intrinsic IC of concepts that takes into account a variety of semantic properties of the WordNet structure. In addition, we summarize and classify the technical characteristics of previous WordNet-based approaches and evaluate our approach against them on various benchmarks. The experimental results show that the proposed approach correlates more closely with human similarity judgments, as measured by the correlation coefficient, which indicates that our IC model and similarity measurement approach are comparable to or better than the others for semantic similarity measurement.
Keywords: Concept semantic similarity; Intrinsic information content; WordNet; Edge distance
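The general idea of weighting a taxonomy path by information content can be illustrated with a minimal sketch. It is not the CSSM-ICSP formula, which this abstract does not state. The sketch assumes a hypothetical six-node is-a taxonomy in place of WordNet, uses the intrinsic IC of Seco et al. (2004), IC(c) = 1 - log(hypo(c) + 1) / log(N), in place of the authors' new IC model, and assumes that each edge is weighted by the IC difference between child and parent. Under these assumptions the weighted path length through the lowest common subsumer equals the Jiang-Conrath distance IC(c1) + IC(c2) - 2 IC(lcs). All names (`PARENT`, `intrinsic_ic`, `ic_weighted_distance`) are illustrative.

```python
import math
from collections import deque

# Toy is-a taxonomy (child -> parent); hypothetical, not taken from WordNet.
PARENT = {
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}
NODES = set(PARENT) | set(PARENT.values())


def hyponym_count(concept):
    """Number of descendants of `concept` in the toy taxonomy."""
    children = {}
    for child, parent in PARENT.items():
        children.setdefault(parent, []).append(child)
    count, queue = 0, deque(children.get(concept, []))
    while queue:
        node = queue.popleft()
        count += 1
        queue.extend(children.get(node, []))
    return count


def intrinsic_ic(concept):
    """Seco et al. (2004) intrinsic IC: 1 - log(hypo(c) + 1) / log(N)."""
    return 1.0 - math.log(hyponym_count(concept) + 1) / math.log(len(NODES))


def ancestors(concept):
    """Path from `concept` up to the root, inclusive."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def ic_weighted_distance(c1, c2):
    """Sum of |IC(child) - IC(parent)| over the edges joining c1 and c2.

    The edge weighting is an assumption made for this sketch.
    """
    up1, up2 = ancestors(c1), ancestors(c2)
    lcs = next(a for a in up1 if a in up2)  # lowest common subsumer
    total = 0.0
    for path in (up1, up2):
        for child in path[: path.index(lcs)]:
            total += abs(intrinsic_ic(child) - intrinsic_ic(PARENT[child]))
    return total
```

In this toy taxonomy, `ic_weighted_distance("dog", "cat")` is smaller than `ic_weighted_distance("dog", "tree")`, because the first pair meets at the more informative subsumer `animal` while the second pair meets only at the root.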
The authors would like to thank the reviewers for their valuable comments and suggestions. This study is supported by the National Natural Science Foundation of China (No. 61502028), the National Key Technology R&D Program of China (No. 2015BAK36B04), the Training Program Foundation for the Talents of Beijing (No. 2015000020124G029), the Beijing Natural Science Foundation (No. 4172014) and the Research Foundation for Youth Scholars of Beijing Technology and Business University.
Compliance with Ethical Standards
I certify that this manuscript is original, has not been published, and will not be submitted elsewhere for publication while being considered by the Journal of Intelligent Information Systems. The study has not been split into several parts to increase the number of submissions, whether to various journals or to one journal over time. No data have been fabricated or manipulated (including images) to support our conclusions. No data, text, or theories by others are presented as if they were our own. All authors whose names appear on the manuscript have contributed sufficiently to the scientific work and therefore share collective responsibility and accountability for the results. In addition, consent to submit has been received explicitly from all co-authors, as well as from the responsible authorities, tacitly or explicitly, at the institute where the work was carried out, before the work was submitted.
Conflict of interests
The authors declare that they have no conflict of interest.
Research involving Human Participants and/or Animals
There are no human participants or animals involved in this work.