Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

Text Categorization

  • Dou Shen
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_414


Text classification


Text classification is the task of automatically assigning textual documents (such as plain-text documents and Web pages) to predefined categories based on their content. Formally, text classification works on an instance space X, where each instance is a document d, and a fixed set of classes C = {C1, C2, … , C|C|}, where |C| is the number of classes. Given a training set Dl of labeled documents 〈d, Ci〉, where 〈d, Ci〉 ∈ X × C, the goal of text classification is to use a learning method (learning algorithm) to learn a classifier, or classification function, γ that maps instances to classes: γ : X → C [7].
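As an illustration of the definition above, the following is a minimal sketch of learning a classification function γ from a labeled training set. It assumes a multinomial naive Bayes learner with add-one smoothing and whitespace tokenization; the entry does not prescribe a particular learning method, and the names `tokenize`, `make_classifier`, `gamma`, and the toy training set are hypothetical, chosen only for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(document):
    # Assumption: lowercase whitespace tokenization is enough for this sketch.
    return document.lower().split()


def make_classifier(training_set):
    """Learn gamma: X -> C from a list of (document, class_label) pairs."""
    class_doc_counts = Counter()        # number of training documents per class
    term_counts = defaultdict(Counter)  # term frequencies per class
    vocabulary = set()
    for document, label in training_set:
        class_doc_counts[label] += 1
        for term in tokenize(document):
            term_counts[label][term] += 1
            vocabulary.add(term)
    total_docs = sum(class_doc_counts.values())

    def gamma(document):
        # Return the class with the highest log posterior score.
        best_label, best_score = None, -math.inf
        for label in sorted(class_doc_counts):
            score = math.log(class_doc_counts[label] / total_docs)
            denominator = sum(term_counts[label].values()) + len(vocabulary)
            for term in tokenize(document):
                if term in vocabulary:  # terms unseen in training are ignored
                    # Add-one (Laplace) smoothing avoids zero probabilities.
                    score += math.log((term_counts[label][term] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return gamma


# Hypothetical toy training set of <d, Ci> pairs with C = {"spam", "ham"}.
training_set = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes attached", "ham"),
]
gamma = make_classifier(training_set)
```

Calling `gamma("buy cheap pills")` on this toy data assigns the document to "spam", and `gamma("monday project meeting")` assigns it to "ham". Any other learning method that produces a mapping from documents to classes would fit the same definition.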

Historical Background

Text classification, the task of classifying documents into predefined categories, provides an effective way to organize documents. Text classification dates back to the early 1960s, but only in the early 1990s did it become a major subfield of the information systems discipline. Recently, with the explosive growth of...

Recommended Reading

  1. Dumais S, Platt J, Heckerman D, Sahami M. Inductive learning algorithms and representations for text categorization. In: Proceedings of the 7th International Conference on Information and Knowledge Management; 1998. p. 148–55.
  2. Glover EJ, Tsioutsiouliklis K, Lawrence S, Pennock DM, Flake GW. Using web structure for classifying and describing web pages. In: Proceedings of the 11th International World Wide Web Conference; 2002. p. 562–9.
  3. Joachims T. Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European Conference on Machine Learning; 1998. p. 137–42.
  4. Kehagias A, Petridis V, Kaburlasos VG, Fragkou P. A comparison of word- and sense-based text categorization using several classification algorithms. J Intell Inf Syst. 2003;21(3):227–47.
  5. Kolcz A, Prabakarmurthi V, Kalita JK. String match and text extraction: summarization as feature selection for text categorization. In: Proceedings of the 10th International Conference on Information and Knowledge Management; 2001. p. 365–70.
  6. Lewis DD. Representation quality in text classification: an introduction and experiment. In: Proceedings of the Workshop on Speech and Natural Language; 1990. p. 288–95.
  7. Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. Cambridge University Press, 2007.
  8. McCallum A, Nigam K. A comparison of event models for naive bayes text classification. In: Proceedings of the AAAI-98 Workshop on Learning for Text Categorization; 1998.
  9. Peng F, Schuurmans D, Wang S. Augmenting naive bayes classifiers with statistical language models. Inf Retr. 2004;7(3–4):317–45.
  10. Rijsbergen CV. Information retrieval. 2nd ed. London: Butterworths; 1979.
  11. Sebastiani F. Machine learning in automated text categorization. ACM Comput Surv. 2002;34(1):1–47.
  12. Shen D, Chen Z, Yang Q, Zeng HJ, Zhang B, Lu Y, Ma WY. Web-page classification through summarization. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2004. p. 242–9.
  13. Shen D, Sun JT, Yang Q, Chen Z. A comparison of implicit and explicit links for web page classification. In: Proceedings of the 15th International World Wide Web Conference; 2006. p. 643–50.
  14. Yang Y. An evaluation of statistical approaches to text categorization. Inf Retr. 1999;1(1–2):69–90.
  15. Yang Y, Pedersen JO. A comparative study on feature selection in text categorization. In: Proceedings of the 14th International Conference on Machine Learning; 1997. p. 412–20.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Microsoft Corporation, Redmond, USA
  2. Baidu, Inc., Beijing, China

Section editors and affiliations

  • Zheng Chen (1)
  1. Microsoft Research Asia, Microsoft Corporation, Beijing, China