Text Classification: Basic Models

Machine Learning for Text

Abstract

In classification, the corpus is partitioned into classes that are typically defined by application-specific criteria. Training examples are therefore provided that associate data points with labels indicating their class membership. For example, the training examples extracted from a news portal on political matters might attach one of three labels, such as “senate,” “congress,” or “legislation,” to each document.


Notes

  1.

    Consider a classifier that memorizes the training examples as follows. For any test instance, it is determined whether a training instance has zero distance to it (which is guaranteed when the test instance is drawn from the training data). If such an instance is found, the label of that training instance is returned. Otherwise a random label is returned. Such a classifier will have 100% accuracy on the training data, but will perform randomly on unseen test instances. The key point is that generalization is about extrapolating predictions from known instances of the data space (i.e., training points) to all regions of the data space. Memorizing only the known instances is the worst possible way to achieve this.
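
    As a concrete illustration, the following Python sketch (a toy implementation we supply here, not code from the chapter) memorizes its training data exactly as described: it answers correctly whenever a test point coincides with a memorized training point, and guesses a random label otherwise.

```python
import numpy as np

class MemorizingClassifier:
    """Toy classifier that memorizes training data verbatim:
    an exact match returns the stored label; anything else
    returns a random label."""

    def fit(self, X, y):
        # Memorize the training instances and their labels.
        self.X_ = np.asarray(X)
        self.y_ = np.asarray(y)
        self.classes_ = np.unique(self.y_)
        return self

    def predict(self, X):
        rng = np.random.default_rng(0)
        preds = []
        for x in np.asarray(X):
            # Zero distance <=> exact coordinate-wise match.
            matches = np.where((self.X_ == x).all(axis=1))[0]
            if len(matches) > 0:
                preds.append(self.y_[matches[0]])         # memorized label
            else:
                preds.append(rng.choice(self.classes_))   # random guess
        return np.array(preds)
```

    Evaluated on its own training set, this classifier attains 100% accuracy; on unseen instances it reduces to random guessing, which is precisely the failure of generalization described above.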

  2.

    Although \(\overline{X_{i}}\) is a binary vector, we treat it like a set when we use set-membership notation such as \(t_{j} \in \overline{X_{i}}\). Any binary vector can equivalently be viewed as the set of terms at whose coordinates it takes the value 1.
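
    As a small illustration (the vocabulary and vector below are hypothetical), the membership test \(t_{j} \in \overline{X_{i}}\) amounts to checking whether the coordinate of \(\overline{X_{i}}\) corresponding to term \(t_{j}\) equals 1:

```python
# Hypothetical vocabulary aligned with the dimensions of the binary vector.
vocab = ["senate", "congress", "legislation", "budget"]
X_i = [1, 0, 1, 0]  # binary document vector over vocab

# View the binary vector as the set of terms whose coordinate is 1.
terms_in_X_i = {t for t, x in zip(vocab, X_i) if x == 1}

print("senate" in terms_in_X_i)   # True:  t_j is in X_i
print("budget" in terms_in_X_i)   # False: coordinate is 0
```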

  3.

    The constant of proportionality can easily be inferred by ensuring that the posterior probabilities sum to 1 across all classes. As we will see later, there are scenarios in which instances are ranked by their propensity to belong to a specific class, and in such cases the constant of proportionality does matter.
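
    Concretely (the notation below is ours, chosen to match the chapter's style), if \(c_{1}, \ldots, c_{k}\) denote the \(k\) classes, Bayes' rule determines the posterior of class \(c_{r}\) only up to a constant, and requiring the posteriors to sum to 1 fixes that constant:

    \[ P(c_{r} \mid \overline{X}) \;=\; \frac{P(c_{r})\, P(\overline{X} \mid c_{r})}{\sum_{s=1}^{k} P(c_{s})\, P(\overline{X} \mid c_{s})}. \]

    Since the denominator is identical for every class, it does not change which class has the largest posterior, but it does matter when instances are ranked by their calibrated probability of belonging to a specific class.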

  4.

    Most of the literature uses the notation of k instead of κ to denote the number of nearest neighbors. We use κ instead of k for notational disambiguation, since the latter variable has been used consistently in this chapter to denote the number of classes. Using k to denote both the number of classes and the number of neighbors would cause confusion.

  5.

    We intentionally use the seemingly unusual notation K(⋅ , ⋅ ) for a similarity function, as we will later connect this principle with the kernel similarity function used by support vector machines.
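
    To make the roles of \(\kappa\) and \(K(\cdot, \cdot)\) concrete, here is a minimal Python sketch of a similarity-weighted nearest-neighbor vote; the choice of cosine similarity and all names are our own assumptions for illustration, not the chapter's code.

```python
import numpy as np

def K(u, v):
    # Cosine similarity between two document vectors (one possible choice).
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def knn_predict(x, X_train, y_train, kappa=5):
    # Similarity of the test document x to every training document.
    sims = np.array([K(x, xi) for xi in X_train])
    # Indices of the kappa most similar training documents.
    top = np.argsort(sims)[::-1][:kappa]
    # Similarity-weighted vote among the kappa neighbors' labels.
    votes = {}
    for i in top:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + sims[i]
    return max(votes, key=votes.get)
```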

  6.

    In Sect. 5.5.6, we show further connections between nearest-neighbor classifiers and randomized variants of decision trees.



Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

Cite this chapter

Aggarwal, C.C. (2018). Text Classification: Basic Models. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73530-6

  • Online ISBN: 978-3-319-73531-3
