Classification of Texts with Support Vector Machines: An Examination of the Efficiency of Kernels and Data-transformations

  • Conference paper
In: Classification, Automation, and New Media

Abstract

Support Vector Machines (SVMs) are known to provide a fast and powerful means of classifying documents. In this paper we examine how different representations of documents affect the performance of SVMs in different text classification tasks. We discuss the role of importance weights (inverse document frequency and redundancy), and we show that time-consuming lemmatization can be avoided even when classifying texts in a highly inflectional language like German.
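
As an illustration of one of the importance weights named in the abstract, the following minimal sketch computes inverse document frequency in its common textbook form, idf(t) = log(N / df(t)), and uses it to build length-normalized term vectors of the kind an SVM could take as input. The function names, the toy German documents, and the L2 normalization are assumptions made for this example; the paper's exact weighting scheme, including its redundancy weight, is not given in this preview and is not reproduced here.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return idf(t) = log(N / df(t)) for every term in the tokenized docs.

    df(t) is the number of documents that contain term t at least once.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


def weighted_vector(tokens, idf):
    """Map one document to an L2-normalized tf * idf vector (term -> weight)."""
    tf = Counter(tokens)
    vec = {t: f * idf.get(t, 0.0) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return vec  # document contains only zero-weight terms
    return {t: w / norm for t, w in vec.items()}


# Hypothetical toy corpus: unlemmatized, whitespace-tokenized word forms.
docs = [
    "der markt steigt".split(),
    "der markt sinkt".split(),
    "der verein gewinnt".split(),
]
idf = idf_weights(docs)
vec = weighted_vector(docs[0], idf)
```

In this toy corpus, "der" occurs in every document and therefore receives weight zero, while rarer word forms receive larger weights. The resulting sparse vectors could then be passed to any SVM implementation.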




Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kindermann, J., Leopold, E. (2002). Classification of Texts with Support Vector Machines: An Examination of the Efficiency of Kernels and Data-transformations. In: Gaul, W., Ritter, G. (eds) Classification, Automation, and New Media. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55991-4_20

  • DOI: https://doi.org/10.1007/978-3-642-55991-4_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43233-3

  • Online ISBN: 978-3-642-55991-4

  • eBook Packages: Springer Book Archive
