Abstract
We present a package that provides a general framework, including tools and algorithms, for text mining in R using the S4 class system. Using this package together with the kernlab R package, we explore kernel methods for clustering text documents (e.g., kernel k-means and spectral clustering) based on string kernels. We compare these methods with a more traditional technique, k-means on a bag-of-words representation of the text, and evaluate the viability of kernel-based methods for text clustering.
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Karatzoglou, A., Feinerer, I. (2007). Text Clustering with String Kernels in R. In: Decker, R., Lenz, H.J. (eds) Advances in Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70981-7_11
DOI: https://doi.org/10.1007/978-3-540-70981-7_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70980-0
Online ISBN: 978-3-540-70981-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)