
Journal of Intelligent Information Systems, Volume 51, Issue 1, pp 23–47

A hybrid approach for measuring semantic similarity based on IC-weighted path distance in WordNet

  • Yuanyuan Cai
  • Qingchuan Zhang
  • Wei Lu
  • Xiaoping Che

Abstract

As a valuable tool for text understanding, semantic similarity measurement enables discriminative semantic-based applications in natural language processing, information retrieval, computational linguistics and artificial intelligence. Most existing studies have used structured taxonomies such as WordNet to explore lexical semantic relationships; however, improving computation accuracy remains a challenge for them. To address this problem, we propose CSSM-ICSP, a hybrid WordNet-based approach to measuring concept semantic similarity, which leverages the information content (IC) of concepts to weight the shortest path distance between concepts. To improve the performance of IC computation, we also develop a novel model of the intrinsic IC of concepts, which takes into account a variety of semantic properties of the WordNet structure. In addition, we summarize and classify the technical characteristics of previous WordNet-based approaches, and evaluate our approach against them on various benchmarks. In the experiments, the results of the proposed approach correlate more strongly with human similarity judgments, as measured by the correlation coefficient, which indicates that our IC model and similarity measure are comparable to or better than the other approaches for semantic similarity measurement.
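To make the idea of IC-weighted path distance concrete, here is a minimal Python sketch on a toy taxonomy. It is not the CSSM-ICSP formula or the intrinsic IC model proposed in the paper, neither of which is given in this abstract. As assumptions, it swaps in the hyponym-count intrinsic IC of Seco et al. (2004), weights each is-a edge by the IC difference between child and parent (in the style of Jiang and Conrath), and converts distance to similarity with 1 / (1 + d). The taxonomy and all function names are hypothetical.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent), standing in for WordNet.
PARENT = {
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
}
CONCEPTS = set(PARENT) | set(PARENT.values())


def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def hyponym_count(concept):
    """Count the concepts strictly below `concept` in the taxonomy."""
    return sum(1 for c in CONCEPTS if c != concept and concept in ancestors(c))


def intrinsic_ic(concept):
    """Assumed IC model (Seco et al. style): 1 - log(hypo + 1) / log(N)."""
    return 1.0 - math.log(hyponym_count(concept) + 1) / math.log(len(CONCEPTS))


def weighted_distance(a, b):
    """Shortest path via the lowest common subsumer, each edge weighted by
    IC(child) - IC(parent). This weighting is an illustrative assumption."""
    up_a, up_b = ancestors(a), ancestors(b)
    lcs = next(c for c in up_a if c in up_b)  # lowest common subsumer

    def climb(chain):
        # Sum edge weights from the start of the chain up to (not past) the LCS.
        return sum(
            intrinsic_ic(child) - intrinsic_ic(PARENT[child])
            for child in chain[: chain.index(lcs)]
        )

    return climb(up_a) + climb(up_b)


def similarity(a, b):
    """Map distance to a similarity in (0, 1]; the transform is an assumption."""
    return 1.0 / (1.0 + weighted_distance(a, b))
```

In this sketch, "dog" and "cat" share the specific subsumer "animal", so their weighted distance is smaller than that of "dog" and "car", which only meet at the root. Edges near the root carry less weight than edges near the leaves, which is the effect a plain edge count cannot capture.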

Keywords

Concept semantic similarity; Intrinsic information content; WordNet; Edge distance

Notes

Acknowledgements

The authors would like to thank the reviewers for their valuable comments and suggestions. This study is supported by the National Natural Science Foundation of China (No. 61502028), the National Key Technology R&D Program of China (No. 2015BAK36B04), the Training program foundation for the talents of Beijing (No. 2015000020124G029), the Beijing Natural Science Foundation (No. 4172014) and the Research Foundation for Youth Scholars of Beijing Technology and Business University.

Compliance with Ethical Standards

I certify that this manuscript is original, has not been published, and will not be submitted elsewhere for publication while it is being considered by the Journal of Intelligent Information Systems. The study has not been split into several parts to increase the number of submissions, nor submitted to various journals or to one journal over time. No data have been fabricated or manipulated (including images) to support our conclusions. No data, text, or theories by others are presented as if they were our own. The authors whose names appear on the manuscript have contributed sufficiently to the scientific work and therefore share collective responsibility and accountability for the results. In addition, consent to submit has been received explicitly from all co-authors, as well as from the responsible authorities, tacitly or explicitly, at the institute where the work was carried out, before the work was submitted. Authors are strongly advised to ensure the correct author group, corresponding author, and order of authors at submission.

Conflict of interests

The authors declare that they have no conflict of interest.

Funding

This study is funded by the National Natural Science Foundation of China (No. 61502028), the National Key Technology R&D Program of China (No. 2015BAK36B04), the Training program foundation for the talents of Beijing (No. 2015000020124G029), the Beijing Natural Science Foundation (No. 4172014) and the Research Foundation for Youth Scholars of Beijing Technology and Business University.

Research involving Human Participants and/or Animals

No human participants or animals were involved in this work.

Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Yuanyuan Cai (1)
  • Qingchuan Zhang (1)
  • Wei Lu (2)
  • Xiaoping Che (2)

  1. Beijing Key Laboratory of Big Data Technology for Food Safety, School of Computer and Information Engineering, Beijing Technology and Business University, Beijing, China
  2. School of Software Engineering, Beijing Jiaotong University, Beijing, China
