Conceptual Hierarchical Clustering of Documents using Wikipedia knowledge

  • Conference paper

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 62)

Abstract

In this paper, we propose a novel method for conceptual hierarchical clustering of documents using knowledge extracted from Wikipedia. A robust and compact document representation is built in real time using the Wikipedia API. The clustering process is hierarchical and creates cluster labels that are descriptive and important for the examined corpus. Experiments show that the proposed technique greatly improves over the baseline approach.
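
The abstract does not spell out the document representation, the similarity measure, or the labeling rule, so the following is only an illustrative sketch of the general idea and not the authors' algorithm. It assumes each document has already been mapped to a set of Wikipedia concept titles (the hand-written `DOCS` dictionary stands in for lookups through the Wikipedia API), and it assumes Jaccard similarity, average-link agglomerative merging, and frequency-based cluster labels. All names and data here are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical concept sets. In a real pipeline these would be obtained
# by querying Wikipedia for each document; here they are hard-coded.
DOCS = {
    "d1": {"Machine learning", "Neural network", "Statistics"},
    "d2": {"Machine learning", "Neural network", "Computer vision"},
    "d3": {"Football", "FIFA World Cup", "Sport"},
    "d4": {"Football", "Sport", "Olympic Games"},
}


def jaccard(a, b):
    """Jaccard similarity between two concept sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_link(x, y, docs):
    """Mean pairwise Jaccard similarity between two clusters of doc ids."""
    pairs = [(a, b) for a in x for b in y]
    return sum(jaccard(docs[a], docs[b]) for a, b in pairs) / len(pairs)


def label(members, docs, top_n=2):
    """Label a cluster with its most frequent concepts.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(c for m in members for c in docs[m])
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [concept for concept, _ in ranked[:top_n]]


def agglomerate(docs):
    """Bottom-up clustering; returns one (members, label) pair per merge."""
    clusters = [frozenset([d]) for d in sorted(docs)]
    merges = []
    while len(clusters) > 1:
        # Pick the most similar pair of current clusters and merge it.
        best = max(combinations(clusters, 2),
                   key=lambda p: average_link(p[0], p[1], docs))
        merged = best[0] | best[1]
        clusters = [c for c in clusters if c not in best] + [merged]
        merges.append((sorted(merged), label(merged, docs)))
    return merges


if __name__ == "__main__":
    for members, concepts in agglomerate(DOCS):
        print(members, "->", concepts)
```

On this toy data the first two merges group the machine learning documents and the football documents, each labeled by the concepts its members share; the final merge forms the root of the hierarchy.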


Copyright information

© 2011 Springer Science+Business Media B.V.

About this paper

Cite this paper

Spanakis, G., Siolas, G., Stafylopatis, A. (2011). Conceptual Hierarchical Clustering of Documents using Wikipedia knowledge. In: Gelenbe, E., Lent, R., Sakellari, G., Sacan, A., Toroslu, H., Yazici, A. (eds) Computer and Information Sciences. Lecture Notes in Electrical Engineering, vol 62. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9794-1_25

  • DOI: https://doi.org/10.1007/978-90-481-9794-1_25

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-9793-4

  • Online ISBN: 978-90-481-9794-1

  • eBook Packages: Engineering, Engineering (R0)
