W-kmeans: Clustering News Articles Using WordNet

Bouras, Christos; Tsogkas, Vassilis

doi:10.1007/978-3-642-15393-8_43

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6278)

Included in the following conference series:

International Conference on Knowledge-Based and Intelligent Information and Engineering Systems

Abstract

Document clustering is a powerful technique that has been widely used for organizing data into smaller, manageable information kernels. Several approaches have been proposed, but they suffer from problems such as synonymy, ambiguity, and the lack of a descriptive content marking for the generated clusters. We propose enhancing the standard k-means algorithm with external knowledge from WordNet hypernyms in a twofold manner: enriching the "bag of words" used prior to the clustering process, and assisting the label generation procedure that follows it. Our experiments showed a significant improvement over standard k-means on a corpus of news articles derived from major news portals. Moreover, the cluster labeling process generates useful, high-quality cluster tags.
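
The twofold use of hypernyms described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The table TOY_HYPERNYMS is a hypothetical, hand-written stand-in for WordNet lookups, and the functions enrich, cosine, and label are illustrative names. Labeling a cluster by its most frequent hypernym is an assumed heuristic, chosen only to show the idea.

```python
from collections import Counter
from math import sqrt

# Hypothetical hand-written table standing in for WordNet hypernym lookups.
TOY_HYPERNYMS = {
    "striker": ["athlete"],
    "goalkeeper": ["athlete"],
    "senator": ["politician"],
    "minister": ["politician"],
}

def enrich(tokens, hypernyms=TOY_HYPERNYMS):
    """Return a bag of words with each token's hypernyms added."""
    bag = Counter(tokens)
    for tok in tokens:
        for hyper in hypernyms.get(tok, []):
            bag[hyper] += 1
    return bag

def cosine(a, b):
    """Cosine similarity between two bags of words (the k-means distance basis)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def label(cluster_docs, hypernyms=TOY_HYPERNYMS):
    """Assumed heuristic: label a cluster with its most frequent hypernym."""
    counts = Counter(
        h for doc in cluster_docs for tok in doc for h in hypernyms.get(tok, [])
    )
    return counts.most_common(1)[0][0] if counts else None

doc_a = ["striker", "scores"]
doc_b = ["goalkeeper", "saves"]
doc_c = ["senator", "votes"]

# Without enrichment the two sports articles share no terms at all.
plain_sim = cosine(Counter(doc_a), Counter(doc_b))
# With enrichment they share the hypernym "athlete".
enriched_sim = cosine(enrich(doc_a), enrich(doc_b))
# Articles on unrelated topics stay dissimilar after enrichment.
cross_sim = cosine(enrich(doc_a), enrich(doc_c))
cluster_label = label([doc_a, doc_b])
```

In this toy setting the two sports articles have zero similarity as plain bags of words and a positive similarity once the shared hypernym is added, which is the effect a k-means step operating on the enriched vectors would benefit from.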

Author information

Authors and Affiliations

Computer Engineering and Informatics Department, University of Patras, Greece
Christos Bouras & Vassilis Tsogkas
Research Academic Computer Technology Institute, N. Kazantzaki, Panepistimioupoli Patras, 26500, Greece
Christos Bouras

Editor information

Editors and Affiliations

School of Engineering, Cardiff University, The Parade, CF24 3AA, Cardiff, UK
Rossitza Setchi
Dept. of Computer Science and Software Engineering, University of Portsmouth, Buckingham Building, Lion Terrace, PO1 3HE, Portsmouth, UK
Ivan Jordanov
KES International, 145-157 St. John Street, EC1V 4PY, London, UK
Robert J. Howlett
School of Electrical and Information Engineering, University of South Australia, Adelaide, Mawson Lakes Campus, 5095, SA, Australia
Lakhmi C. Jain

About this paper

Cite this paper

Bouras, C., Tsogkas, V. (2010). W-kmeans: Clustering News Articles Using WordNet. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2010. Lecture Notes in Computer Science, vol 6278. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15393-8_43

DOI: https://doi.org/10.1007/978-3-642-15393-8_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15392-1
Online ISBN: 978-3-642-15393-8
eBook Packages: Computer Science, Computer Science (R0)