Conceptual Hierarchical Clustering of Documents using Wikipedia knowledge

Spanakis, Gerasimos; Siolas, Georgios; Stafylopatis, Andreas

doi:10.1007/978-90-481-9794-1_25

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 62)

Abstract

In this paper, we propose a novel method for conceptual hierarchical clustering of documents using knowledge extracted from Wikipedia. A robust and compact document representation is built in real time using the Wikipedia API. The clustering process is hierarchical and creates cluster labels that are descriptive and important for the examined corpus. Experiments show that the proposed technique greatly improves over the baseline approach.

Author information

Authors and Affiliations

Intelligent Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, 15780, Zografou, Athens, Greece
Gerasimos Spanakis, Georgios Siolas & Andreas Stafylopatis

Editor information

Editors and Affiliations

EEE Dept., Imperial College, Exhibition Road, London, SW72BT, United Kingdom
Erol Gelenbe
EEE Dept., Imperial College, Exhibition Rd., London, SW72AZ, United Kingdom
Ricardo Lent
EEE Dept., Imperial College, Exhibition Rd., London, SW72AZ, United Kingdom
Georgia Sakellari
School of Biomedical Eng., Sci. and Heal, Drexel University, Bossone 702, 3120 Market Street, Philadelphia, 19104, Pennsylvania, USA
Ahmet Sacan
Dept. of Computer Engineering, Middle East Technical University, Ankara, 06531, Turkey
Hakki Toroslu
Fac. Engineering, Dept. Computer Engineering, Middle East Technical University - METU, Ankara, 06531, Turkey
Adnan Yazici

About this paper

Cite this paper

Spanakis, G., Siolas, G., Stafylopatis, A. (2011). Conceptual Hierarchical Clustering of Documents using Wikipedia knowledge. In: Gelenbe, E., Lent, R., Sakellari, G., Sacan, A., Toroslu, H., Yazici, A. (eds) Computer and Information Sciences. Lecture Notes in Electrical Engineering, vol 62. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9794-1_25

DOI: https://doi.org/10.1007/978-90-481-9794-1_25
Published: 18 August 2010
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-9793-4
Online ISBN: 978-90-481-9794-1
eBook Packages: Engineering, Engineering (R0)
