Text Clustering with String Kernels in R

  • Conference paper
Advances in Data Analysis

Abstract

We present a package that provides a general framework, including tools and algorithms, for text mining in R using the S4 class system. Using this package together with the kernlab R package, we explore kernel methods for clustering (e.g., kernel k-means and spectral clustering) on a set of text documents, using string kernels. We compare these methods with a more traditional technique, k-means on a bag-of-words representation of the text, and evaluate the viability of kernel-based methods for text clustering.
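As a rough illustration of the idea named in the abstract, the sketch below clusters a few short strings with kernel k-means on a string kernel. It is written in plain Python and is not the code of the package or of kernlab. The spectrum kernel (shared substrings of length n), the seed-based initialisation, the toy documents, and all function names are assumptions chosen for this example.

```python
from collections import Counter
from math import sqrt


def spectrum_kernel(s, t, n=3):
    """Spectrum string kernel: inner product of length-n substring counts."""
    cs = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    ct = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    return sum(c * ct[g] for g, c in cs.items())


def normalized_gram(docs, n=3):
    """Gram matrix scaled so every document has self-similarity 1.

    Assumes each document has at least n characters.
    """
    raw = [[spectrum_kernel(a, b, n) for b in docs] for a in docs]
    size = len(docs)
    return [[raw[i][j] / sqrt(raw[i][i] * raw[j][j]) for j in range(size)]
            for i in range(size)]


def kernel_kmeans(K, seeds, max_iter=20):
    """Lloyd-style kernel k-means on a precomputed Gram matrix K.

    Each point starts in the cluster of its most similar seed document
    (a simple initialisation picked for this sketch).
    """
    k = len(seeds)
    labels = [max(range(k), key=lambda c: K[i][seeds[c]])
              for i in range(len(K))]
    for _ in range(max_iter):
        members = [[j for j, l in enumerate(labels) if l == c]
                   for c in range(k)]
        new_labels = []
        for i in range(len(K)):
            dists = []
            for m in members:
                if not m:  # an empty cluster can never be the closest
                    dists.append(float("inf"))
                    continue
                # Squared feature-space distance to the cluster mean,
                # computed from kernel values only.
                cross = sum(K[i][j] for j in m) / len(m)
                within = sum(K[j][l] for j in m for l in m) / len(m) ** 2
                dists.append(K[i][i] - 2 * cross + within)
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Toy documents, made up for illustration only.
docs = ["stock market", "football team", "stock markets", "football teams"]
K = normalized_gram(docs, n=3)
labels = kernel_kmeans(K, seeds=[0, 1])
print(labels)  # [0, 1, 0, 1]
```

Because the algorithm only ever reads the Gram matrix, a different string kernel can be substituted without changing the clustering routine. The sketch never builds a document-term matrix, which is the main contrast with k-means on a bag-of-words representation.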

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Karatzoglou, A., Feinerer, I. (2007). Text Clustering with String Kernels in R. In: Decker, R., Lenz, H.J. (eds) Advances in Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70981-7_11
