Classification of Texts with Support Vector Machines: An Examination of the Efficiency of Kernels and Data-transformations

Kindermann, Jörg; Leopold, Edda

doi:10.1007/978-3-642-55991-4_20

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Support Vector Machines (SVM) are known to provide a fast and powerful means of classifying documents. In this paper we examine how different representations of documents affect the performance of SVM in different text classification tasks. We discuss the role of importance weights (inverse document frequency and redundancy), and we show that time-consuming lemmatization can be avoided even when classifying a highly inflectional language like German.
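
As a rough illustration of the two importance weights named in the abstract, the following Python sketch computes them for a toy corpus. The abstract does not give the chapter's formulas, so both definitions are assumptions: inverse document frequency is taken as log(N / df), and redundancy as log N plus the sum over documents of p log p, where p is the share of a term's occurrences that fall in a given document. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Assumed definition: idf(term) = log(N / number of docs containing term)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def redundancy_weights(docs):
    """Assumed definition: log N + sum_i p_i * log p_i, with
    p_i = (count of term in document i) / (count of term in the corpus).
    The value is 0 for a term spread evenly over all documents and
    log N for a term concentrated in a single document."""
    n_docs = len(docs)
    per_doc = [Counter(tokens) for tokens in docs]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    weights = {}
    for term, total in totals.items():
        neg_entropy = 0.0
        for counts in per_doc:
            freq = counts.get(term, 0)
            if freq:
                p = freq / total
                neg_entropy += p * math.log(p)
        weights[term] = math.log(n_docs) + neg_entropy
    return weights


# Hypothetical toy corpus of unlemmatized German word forms.
docs = [
    "der markt steigt".split(),
    "der markt sinkt".split(),
    "der kurs steigt".split(),
]
idf = idf_weights(docs)
red = redundancy_weights(docs)
```

Under these assumed definitions, a word form that occurs equally often in every document (here "der") receives a weight near zero from both schemes, while a form confined to one document (here "sinkt") receives the maximum weight, log N.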

Author information

Authors and Affiliations

GMD German National Research Center for Information Technology, Institute for Autonomous Intelligent Systems, Schloss Birlinghoven, D-53754 Sankt Augustin
Jörg Kindermann & Edda Leopold

Editor information

Editors and Affiliations

Institute for Decision Theory and Operations Research, University of Karlsruhe, Kaiserstraße 12, 76128, Karlsruhe, Germany
Wolfgang Gaul
Department of Mathematics and Informatics, University of Passau, Innstraße 33, 94030, Passau, Germany
Gunter Ritter

About this paper

Cite this paper

Kindermann, J., Leopold, E. (2002). Classification of Texts with Support Vector Machines: An Examination of the Efficiency of Kernels and Data-transformations. In: Gaul, W., Ritter, G. (eds) Classification, Automation, and New Media. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55991-4_20

DOI: https://doi.org/10.1007/978-3-642-55991-4_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43233-3
Online ISBN: 978-3-642-55991-4
eBook Packages: Springer Book Archive
