Evaluation of Path Based Methods for Conceptual Representation of the Text

  • Łukasz Kucharczyk
  • Julian Szymański
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8502)


Typical text clustering methods use the bag of words (BoW) representation to describe the content of documents. However, this method is known to have several limitations. Employing Wikipedia as a lexical knowledge base has been shown to improve text representation for data-mining purposes. Promising extensions of this trend employ the hierarchical organization of the Wikipedia category system. In this paper we propose three path-based measures for calculating document relatedness in such a conceptual space and compare them with the widely used Path Length approach. We evaluate them using the OPTICS clustering algorithm for categorization of keyword-based search results. The results show that our method outperforms the Path Length approach.
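As a rough illustration of the Path Length baseline mentioned in the abstract, the sketch below counts the edges on a shortest path between two categories using breadth-first search. The toy graph, the category names, and the function name `path_length` are hypothetical and chosen only for this example; the abstract does not define the three proposed measures, so they are not shown here, and real Wikipedia category data is far larger and not a clean tree.

```python
from collections import deque

# Hypothetical toy category graph (undirected adjacency lists).
# This is illustrative data, not real Wikipedia categories.
TOY_CATEGORY_GRAPH = {
    "Science": ["Physics", "Computer science"],
    "Physics": ["Science", "Optics"],
    "Computer science": ["Science", "Algorithms"],
    "Optics": ["Physics"],
    "Algorithms": ["Computer science"],
}


def path_length(graph, source, target):
    """Return the number of edges on a shortest path between two
    categories, or None if the target cannot be reached."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour == target:
                return dist + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


if __name__ == "__main__":
    # Optics -> Physics -> Science -> Computer science -> Algorithms
    print(path_length(TOY_CATEGORY_GRAPH, "Optics", "Algorithms"))
```

A shorter path is read as higher relatedness. Pairwise distances of this kind could then be passed to a density-based clustering algorithm such as OPTICS as a precomputed distance matrix, although the exact setup used in the paper is not given in the abstract.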


Keywords: text representation, documents categorization, information retrieval





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Łukasz Kucharczyk (1)
  • Julian Szymański (1)
  1. Department of Computer Systems Architecture, Gdańsk University of Technology, Poland
