Ontology-driven web-based semantic similarity

Journal of Intelligent Information Systems

Abstract

Estimating the degree of semantic similarity (or distance) between concepts is a very common problem in research areas such as natural language processing, knowledge acquisition, information retrieval and data mining. Many similarity measures have been proposed in the past, exploiting either explicit knowledge, such as the structure of a taxonomy, or implicit knowledge, such as information distribution. In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter case, the frequencies of term appearances in a corpus are considered. Classical measures based on those premises suffer from some problems: in the first case, an excessive dependency on the taxonomical/ontological structure; in the second case, the lack of semantics of a purely statistical analysis of occurrences and/or the ambiguity of estimating the statistical distribution of concepts from term appearances. Measures based on the Information Content (IC) of taxonomical concepts combine both approaches. However, to compute accurate concept appearance probabilities, they depend heavily on a corpus that has been properly pre-tagged and disambiguated according to the ontological entities. This limits the applicability of those measures to other ontologies (such as specific domain ontologies) and to massive corpora (such as the Web). In this paper, several of these issues are analyzed and modifications of classical similarity measures are proposed. They are based on a contextualized and scalable version of IC computation in the Web that exploits taxonomical knowledge. The goal is to avoid the measures' dependency on corpus pre-processing in order to achieve reliable results, and to minimize language ambiguity. Our proposals are able to outperform classical approaches when using the Web for estimating concept probabilities.
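
As a rough illustration of the classical IC idea referred to above (not the contextualized Web-based version proposed in the paper, whose exact formulation is not given in this abstract), the IC of a concept is commonly taken as the negative logarithm of its estimated appearance probability, and a Lin-style measure relates the IC of two concepts to that of their least common subsumer. In the sketch below, the function names, the page total and all hit counts are hypothetical values chosen only for the example.

```python
import math


def information_content(hits: int, total: int) -> float:
    """Classical IC: -log2 p(c), with p(c) estimated as hits / total."""
    if not 0 < hits <= total:
        raise ValueError("need 0 < hits <= total")
    return -math.log2(hits / total)


def lin_style_similarity(ic_a: float, ic_b: float, ic_subsumer: float) -> float:
    """Lin-style ratio: 2 * IC(subsumer) / (IC(a) + IC(b))."""
    denominator = ic_a + ic_b
    if denominator == 0:
        raise ValueError("undefined when both concepts have zero IC")
    return 2.0 * ic_subsumer / denominator


# Hypothetical counts, for illustration only (not taken from the paper).
TOTAL_PAGES = 1_000_000
HITS = {"dog": 2_000, "cat": 4_000, "mammal": 8_000}

ic = {term: information_content(h, TOTAL_PAGES) for term, h in HITS.items()}
# "mammal" is assumed here to be the least common subsumer of "dog" and "cat".
similarity = lin_style_similarity(ic["dog"], ic["cat"], ic["mammal"])
print(f"Lin-style similarity(dog, cat) = {similarity:.3f}")
```

Estimating probabilities from raw term counts in this way is exactly where the ambiguity discussed in the abstract arises: a hit count for a word does not distinguish between its senses.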

Notes

  1. A synset in WordNet groups a set of synonyms and a gloss corresponding to a word sense (i.e. concept).

  2. http://wordnet.princeton.edu/man/wnstats.7WN

  3. http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html

  4. The word dog has 204 million occurrences, while canis has 2 million, as computed from Bing (Nov. 9th, 2008).

  5. Bing search engine (http://www.bing.com).

  6. http://wordnet.princeton.edu/

  7. http://www.bing.com/

Acknowledgements

This research has been partially supported by the Spanish Government within projects ARES (CONSOLIDER-INGENIO 2010 CSD2007-00004) and E-AEGIS (TSI2007-65406-C03-02). The work is partially supported by the Universitat Rovira i Virgili (2009AIRE-04) and the DAMASK project (Data mining algorithms with semantic knowledge, TIN2009-11005). Montserrat Batet is also supported by a research grant provided by the Universitat Rovira i Virgili.

Author information

Corresponding author

Correspondence to David Sánchez.

About this article

Cite this article

Sánchez, D., Batet, M., Valls, A. et al. Ontology-driven web-based semantic similarity. J Intell Inf Syst 35, 383–413 (2010). https://doi.org/10.1007/s10844-009-0103-x

  • DOI: https://doi.org/10.1007/s10844-009-0103-x
