
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6278)

Abstract

Document clustering is a powerful technique that has been widely used for organizing data into smaller, more manageable information kernels. Several approaches have been proposed, but they suffer from problems such as synonymy, ambiguity, and the lack of descriptive labels for the generated clusters. We propose enhancing the standard k-means algorithm with external knowledge from WordNet hypernyms in a twofold manner: enriching the "bag of words" used prior to the clustering process, and assisting the label generation procedure that follows it. Our experiments revealed a significant improvement over standard k-means on a corpus of news articles drawn from major news portals. Moreover, the cluster labeling process generates useful, high-quality cluster tags.
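
The twofold use of hypernyms described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how hypernyms are weighted or how deep the hierarchy is traversed, so the `depth` and `weight` parameters, the function names, and the small `TOY_HYPERNYMS` table (a hypothetical stand-in for WordNet lookups) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stand-in for WordNet: each word maps to its chain of
# hypernyms, nearest first. A real system would query WordNet instead.
TOY_HYPERNYMS = {
    "striker": ["footballer", "athlete"],
    "goalkeeper": ["footballer", "athlete"],
    "sprinter": ["runner", "athlete"],
    "senator": ["legislator", "politician"],
    "minister": ["executive", "politician"],
}


def enrich(tokens, hypernyms=TOY_HYPERNYMS, depth=2, weight=0.5):
    """Return a bag of words with hypernyms added at a decaying weight.

    The decay scheme (weight ** level) is an assumption, not taken
    from the paper.
    """
    bag = Counter()
    for tok in tokens:
        bag[tok] += 1.0
        for level, hyper in enumerate(hypernyms.get(tok, [])[:depth]):
            bag[hyper] += weight ** (level + 1)
    return bag


def label_cluster(docs, hypernyms=TOY_HYPERNYMS, top_n=1):
    """Suggest labels: hypernyms covering the most documents in a cluster."""
    totals = Counter()
    for tokens in docs:
        for tok in set(tokens):
            for hyper in hypernyms.get(tok, []):
                totals[hyper] += 1
    return [term for term, _ in totals.most_common(top_n)]


# Two documents with no words in common before enrichment.
doc_a = ["striker", "scored"]
doc_b = ["goalkeeper", "saved"]
shared = set(enrich(doc_a)) & set(enrich(doc_b))

# A cluster of sports articles, labeled by its dominant hypernym.
sports_cluster = [doc_a, doc_b, ["sprinter", "won"]]
labels = label_cluster(sports_cluster)
```

In this sketch the two documents share no terms until enrichment adds "footballer" and "athlete" to both, which is how hypernyms can mitigate the synonymy problem when the enriched vectors are passed to k-means. The same hypernym counts then suggest a descriptive tag for the resulting cluster.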




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bouras, C., Tsogkas, V. (2010). W-kmeans: Clustering News Articles Using WordNet. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2010. Lecture Notes in Computer Science, vol 6278. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15393-8_43

  • DOI: https://doi.org/10.1007/978-3-642-15393-8_43

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15392-1

  • Online ISBN: 978-3-642-15393-8

  • eBook Packages: Computer Science, Computer Science (R0)
