Fast, Accurate, Multilingual Semantic Relatedness Measurement Using Wikipedia Links

Part of the book series: Studies in Computational Intelligence (SCI, volume 740)

Abstract

In this chapter we present a fast, accurate, and elegant metric for assessing semantic relatedness among the entities of a hypertextual corpus by building a novel, language-independent Vector Space Model. The technique is based on the Jaccard similarity coefficient, approximated with the MinHash technique to generate a constant-size vector fingerprint for each entity in the corpus. This strategy allows pairwise semantic relatedness to be evaluated in constant time, no matter how many entities the data include or how dense the internal link structure is. Because semantic relatedness is a subtle and somewhat subjective matter, we evaluated our approach by running user tests on a crowdsourcing platform. To strengthen the evaluation, we considered two collaboratively built corpora, the English Wikipedia and the Italian Wikipedia, which differ significantly in size, topology, and user base. The evaluation suggests that the proposed technique generates satisfactory results, outperforming commercial baseline systems regardless of the data employed and the cultural differences among the test users.
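
The chapter's own implementation is not part of this preview, so the following is only a minimal sketch of the general idea the abstract describes: approximating the Jaccard coefficient between two link sets with fixed-length MinHash signatures. It assumes each entity is represented by a non-empty set of link targets (the abstract does not say whether incoming or outgoing links are used), and every name in it (`base_hash`, `make_hash_params`, `minhash_signature`, `estimated_jaccard`, the signature length of 256, and the example page titles) is hypothetical and chosen for illustration.

```python
import hashlib
import random

# Mersenne prime used as the modulus of the linear hash family (an assumption
# of this sketch, not a detail taken from the chapter).
PRIME = (1 << 61) - 1


def base_hash(item: str) -> int:
    """Stable 60-bit hash of a link target (built-in hash() is salted per process)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> 4


def make_hash_params(num_hashes: int, seed: int = 42):
    """Draw (a, b) pairs defining the hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(links, params):
    """Constant-size fingerprint: the minimum hash value under each hash function.

    Assumes `links` is non-empty; min() raises ValueError otherwise.
    """
    hashed = [base_hash(link) for link in links]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions; cost depends only on the signature length."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Reference value |A intersect B| / |A union B|, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Hypothetical link sets for two entities.
links_rome = {"Italy", "Lazio", "Tiber", "Colosseum", "Roman Empire"}
links_milan = {"Italy", "Lombardy", "Duomo di Milano", "Roman Empire"}

params = make_hash_params(num_hashes=256)
sig_rome = minhash_signature(links_rome, params)
sig_milan = minhash_signature(links_milan, params)

print("exact:    ", exact_jaccard(links_rome, links_milan))
print("estimated:", estimated_jaccard(sig_rome, sig_milan))
```

Once the signatures are computed, comparing two entities touches only the two fixed-length vectors, which is what makes the pairwise cost independent of corpus size and link density. The estimate is statistical: a longer signature reduces its variance at the price of more storage per entity.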

Notes

  1. http://wordnet.princeton.edu.

  2. For instance, in Wikipedia an entity is annotated with a hyperlink only the first time it is referenced, and in the literature bibliographies have no duplicate entries.

  3. http://www.alexa.com/.

  4. https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_pages.

  5. https://chitika.com/google-positioning-value.

  6. http://www.crowdflower.com/.

  7. An Intel i7 with eight cores and 32 GB of RAM.

  8. http://challenges.2014.eswc-conferences.org/index.php/RecSys#DATASET.

Author information

Corresponding author

Correspondence to Dante Degl’Innocenti.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Degl’Innocenti, D., De Nart, D., Helmy, M., Tasso, C. (2018). Fast, Accurate, Multilingual Semantic Relatedness Measurement Using Wikipedia Links. In: Shaalan, K., Hassanien, A., Tolba, F. (eds) Intelligent Natural Language Processing: Trends and Applications. Studies in Computational Intelligence, vol 740. Springer, Cham. https://doi.org/10.1007/978-3-319-67056-0_27

  • DOI: https://doi.org/10.1007/978-3-319-67056-0_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67055-3

  • Online ISBN: 978-3-319-67056-0

  • eBook Packages: Engineering, Engineering (R0)
