Context Semantic Analysis: A Knowledge-Based Technique for Computing Inter-document Similarity

  • Fabio Benedetti
  • Domenico Beneventano
  • Sonia Bergamaschi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9939)


We propose a novel knowledge-based technique for inter-document similarity, called Context Semantic Analysis (CSA). Several specialized approaches built on top of a specific knowledge base (e.g. Wikipedia) exist in the literature, but CSA differs from them because it is designed to be portable to any RDF knowledge base. Our technique relies on a generic RDF knowledge base (e.g. DBpedia or Wikidata), from which it extracts a vector that represents the context of a document. We show how this Semantic Context Vector can be effectively exploited to compute inter-document similarity. Experimental results show that our general technique outperforms baselines built on top of traditional methods and achieves performance similar to that of specialized methods.
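As a rough illustration of how vector representations of document context can be compared, the sketch below scores two documents by the cosine of the angle between their vectors. It is not the CSA procedure itself: the abstract does not specify how the Semantic Context Vector is built or which similarity measure is used, so cosine similarity, the `cosine_similarity` helper, the entity identifiers, and the weights are all assumptions made for this example.

```python
import math
from typing import Dict

# Hypothetical representation: a sparse vector mapping knowledge-base
# entity identifiers to weights. How CSA actually derives these weights
# is not described in the abstract; the values below are invented.
Vector = Dict[str, float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors (assumed measure)."""
    # Only entities present in both vectors contribute to the dot product.
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # An empty or all-zero vector shares no context with anything.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors with made-up entity identifiers and weights.
doc_a: Vector = {"dbr:Rome": 0.6, "dbr:Italy": 0.8}
doc_b: Vector = {"dbr:Italy": 1.0}
doc_c: Vector = {"dbr:Jazz": 1.0}

print(cosine_similarity(doc_a, doc_b))  # documents sharing an entity
print(cosine_similarity(doc_a, doc_c))  # documents sharing no entity
```

Under these assumptions, documents that share highly weighted entities score close to 1, and documents with no entities in common score 0.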


Knowledge graph · Knowledge base · Inter-document similarity · Similarity measures



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Fabio Benedetti (1)
  • Domenico Beneventano (1)
  • Sonia Bergamaschi (1)
  1. Dipartimento di Ingegneria Enzo Ferrari, Università di Modena e Reggio Emilia, Modena, Italy
