Ontology-driven web-based semantic similarity

Sánchez, David; Batet, Montserrat; Valls, Aida; Gibert, Karina

doi:10.1007/s10844-009-0103-x

Published: 14 October 2009

Volume 35, pages 383–413, (2010)

Journal of Intelligent Information Systems

Abstract

Estimating the degree of semantic similarity or distance between concepts is a very common problem in research areas such as natural language processing, knowledge acquisition, information retrieval and data mining. Many similarity measures have been proposed in the past, exploiting either explicit knowledge (such as the structure of a taxonomy) or implicit knowledge (such as information distribution). In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter case, the frequencies of term appearances in a corpus are considered. Classical measures based on those premises suffer from some problems: in the first case, an excessive dependency on the taxonomical/ontological structure; in the second case, the lack of semantics of a purely statistical analysis of occurrences and/or the ambiguity of estimating the statistical distribution of concepts from term appearances. Measures based on the Information Content (IC) of taxonomical concepts combine both approaches. However, to compute accurate concept appearance probabilities they depend heavily on a corpus that has been properly pre-tagged and disambiguated according to the ontological entities. This limits the applicability of those measures to other ontologies (such as specific domain ontologies) and to massive corpora (such as the Web). In this paper, several of these issues are analyzed and modifications of classical similarity measures are proposed. They are based on a contextualized and scalable version of IC computation on the Web that exploits taxonomical knowledge. The goal is to remove the measures' dependency on corpus pre-processing while still achieving reliable results and minimizing language ambiguity. Our proposals are able to outperform classical approaches when the Web is used to estimate concept probabilities.
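As an illustration of the classical IC-based measures the abstract refers to, the following minimal sketch computes IC from page counts and derives the Resnik, Lin, and Jiang-Conrath measures over a small taxonomy. The taxonomy, the hit counts, and the total number of indexed pages are all hypothetical values chosen for the example, and the function names are ours. The sketch shows only the classical formulations, not the modified measures proposed in the paper.

```python
import math
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animal": "entity",
    "mammal": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "bird": "animal",
}

# Hypothetical page counts standing in for search-engine hit counts.
HITS: Dict[str, float] = {
    "entity": 9.0e8,
    "animal": 5.0e8,
    "mammal": 2.0e8,
    "dog": 5.0e7,
    "cat": 6.0e7,
    "bird": 1.0e8,
}
TOTAL_PAGES = 1.0e12  # assumed number of indexed pages, not a measured value


def ancestors(concept: str) -> List[str]:
    """Return the path from a concept up to the root, concept included."""
    path = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lcs(a: str, b: str) -> str:
    """Least common subsumer: the deepest concept that subsumes both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):  # walks upward from b, so the first match is the deepest
        if node in seen:
            return node
    raise ValueError("concepts share no subsumer")


def ic(concept: str) -> float:
    """Information content: -log2 of the estimated appearance probability."""
    return -math.log2(HITS[concept] / TOTAL_PAGES)


def resnik(a: str, b: str) -> float:
    """Resnik similarity: IC of the least common subsumer."""
    return ic(lcs(a, b))


def lin(a: str, b: str) -> float:
    """Lin similarity: shared IC relative to the IC of both concepts."""
    return 2.0 * ic(lcs(a, b)) / (ic(a) + ic(b))


def jiang_conrath(a: str, b: str) -> float:
    """Jiang-Conrath distance: IC of both concepts minus twice the shared IC."""
    return ic(a) + ic(b) - 2.0 * ic(lcs(a, b))


if __name__ == "__main__":
    for pair in [("dog", "cat"), ("dog", "bird")]:
        print(pair, lcs(*pair), resnik(*pair), lin(*pair), jiang_conrath(*pair))
```

In this toy setting the counts were chosen so that a subsumer never has fewer hits than its children. Raw term counts from a search engine carry no such guarantee, which is one face of the term-versus-concept ambiguity the abstract mentions.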

Notes

A synset in WordNet groups a set of synonyms and a gloss corresponding to a word sense (i.e. concept).
http://wordnet.princeton.edu/man/wnstats.7WN
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
The word dog has 204 million occurrences, while canis has 2 million, as computed from Bing (Nov. 9th, 2008).
Bing search engine (http://www.bing.com).
http://wordnet.princeton.edu/
http://www.bing.com/

Acknowledgements

This research has been partially supported by the Spanish Government within projects ARES (CONSOLIDER-INGENIO 2010 CSD2007-00004) and E-AEGIS (TSI2007-65406-C03-02). The work is partially supported by the Universitat Rovira i Virgili (2009AIRE-04) and the DAMASK project (Data mining algorithms with semantic knowledge, TIN2009-11005). Montserrat Batet is also supported by a research grant provided by the Universitat Rovira i Virgili.

Author information

Authors and Affiliations

Department of Computer Science and Mathematics, Universitat Rovira i Virgili (URV), Avda. Països Catalans, 26, 43007, Tarragona, Spain
David Sánchez, Montserrat Batet & Aida Valls
Department of Statistics and Operations Research, Universitat Politècnica de Catalunya, Campus Nord, Ed.C5, c/Jordi Girona 1-3, 08034, Barcelona, Spain
Karina Gibert

Corresponding author

Correspondence to David Sánchez.

About this article

Cite this article

Sánchez, D., Batet, M., Valls, A. et al. Ontology-driven web-based semantic similarity. J Intell Inf Syst 35, 383–413 (2010). https://doi.org/10.1007/s10844-009-0103-x

Received: 04 June 2009
Revised: 28 September 2009
Accepted: 29 September 2009
Published: 14 October 2009
Issue Date: December 2010
DOI: https://doi.org/10.1007/s10844-009-0103-x
