Abstract
It is known that Support Vector Machines (SVMs) provide a fast and powerful means of classifying documents. In this paper we examine how different document representations affect the performance of SVMs in different text classification tasks. We discuss the role of importance weights (inverse document frequency and redundancy), and we show that time-consuming lemmatization can be avoided even when classifying a highly inflectional language like German.
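To make the two importance weights named in the abstract concrete, the following is a minimal sketch on a toy corpus. It is an illustration under stated assumptions, not the authors' implementation: the abstract gives no formulas, so inverse document frequency is taken in its textbook form log(N / df), and redundancy is assumed to be the entropy-based quantity log(N) minus the entropy of a term's frequency distribution over the N documents. The function names, the whitespace tokenizer, and the three example sentences are hypothetical. Tokens are used as raw word forms, without lemmatization.

```python
import math
from collections import Counter


def term_counts(docs):
    """Return one Counter of raw word-form frequencies per document.

    Assumption: lowercased whitespace tokenization, no lemmatization.
    """
    return [Counter(doc.lower().split()) for doc in docs]


def idf_weights(counts):
    """Textbook inverse document frequency: log(N / df(term))."""
    n_docs = len(counts)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document counts at most once per term
    return {term: math.log(n_docs / d) for term, d in df.items()}


def redundancy_weights(counts):
    """Assumed definition: log(N) minus the entropy of the term's
    relative frequencies across the N documents.

    A term spread evenly over all documents gets a weight near 0;
    a term concentrated in one document gets log(N).
    """
    n_docs = len(counts)
    total = Counter()
    for c in counts:
        total.update(c)  # corpus-wide frequency of each term
    weights = {}
    for term, f_total in total.items():
        entropy = 0.0
        for c in counts:
            f = c.get(term, 0)
            if f:
                p = f / f_total
                entropy -= p * math.log(p)
        weights[term] = math.log(n_docs) - entropy
    return weights


# Hypothetical toy corpus of German word forms.
docs = ["der Hund bellt", "der Hund schläft", "der Kater schläft tief"]
counts = term_counts(docs)
idf = idf_weights(counts)
red = redundancy_weights(counts)
```

In this toy corpus, a word that occurs once in every document receives a weight of (approximately) zero under both schemes, while a word confined to a single document receives log(3) under both. The two weights differ for terms whose frequency varies between the documents that contain them, since redundancy uses the counts and inverse document frequency only uses presence or absence.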
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kindermann, J., Leopold, E. (2002). Classification of Texts with Support Vector Machines: An Examination of the Efficiency of Kernels and Data-transformations. In: Gaul, W., Ritter, G. (eds) Classification, Automation, and New Media. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55991-4_20
DOI: https://doi.org/10.1007/978-3-642-55991-4_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43233-3
Online ISBN: 978-3-642-55991-4
eBook Packages: Springer Book Archive