Abstract
In classification, the corpus is partitioned into classes that are typically defined by application-specific criteria. To enable this, training examples are provided that associate data points with labels indicating their class membership. For example, training examples extracted from a news portal on political matters might attach one of three labels, such as “senate,” “congress,” and “legislation,” to each document.
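As a concrete illustration of this setup, the following is a minimal sketch using scikit-learn (whose text-classification tutorial is referenced in the bibliography below); the toy documents, labels, and variable names are invented for illustration:

```python
# A minimal sketch of the labeled-training-data setup described in the
# abstract; the toy corpus and label names are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["the senate voted on the bill",
              "congress debated the budget",
              "new legislation was introduced"]
train_labels = ["senate", "congress", "legislation"]

vectorizer = CountVectorizer()              # map documents to term-count vectors
X = vectorizer.fit_transform(train_docs)    # sparse document-term matrix
model = MultinomialNB().fit(X, train_labels)

# Predict the class label of a previously unseen document.
print(model.predict(vectorizer.transform(["the senate passed new rules"])))
```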
Notes
1. Consider a classifier that memorizes the training examples as follows. For any test instance, it is determined whether some training instance has zero distance to it (which is guaranteed when the test instance is drawn from the training data). If such an instance is found, the label of that training instance is returned; otherwise, a random label is returned. Such a classifier will have 100% accuracy on the training data but will perform randomly on unseen test instances. The key point is that generalization is about extrapolating predictions from known instances of the data space (i.e., training points) to all regions of the data space. Memorizing only the known instances is the worst possible way to achieve this. (A minimal sketch of such a classifier appears after these notes.)
2. Although \(\overline{X_{i}}\) is a binary vector, we treat it like a set when we use set-membership notation such as \(t_{j} \in \overline{X_{i}}\). Any binary vector can equivalently be viewed as the set of indices at which it takes the value 1. (A small example appears after these notes.)
3. The constant of proportionality can easily be inferred by ensuring that the posterior probabilities sum to 1 across all classes. As we will see later, in scenarios where instances are ranked by their propensity to belong to a specific class, the constant of proportionality does matter. (A short numeric sketch appears after these notes.)
4. Most of the literature uses k rather than κ to denote the number of nearest neighbors. We use κ for disambiguation, since k has been used consistently in this chapter to denote the number of classes; using the same symbol for both quantities would cause confusion.
5. We intentionally use the seemingly unusual notation K(⋅ , ⋅ ) for a similarity function, as we will later connect this principle with the kernel similarity function used by support vector machines.
6. In Sect. 5.5.6, we show further connections between nearest-neighbor classifiers and randomized variants of decision trees.
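The following is a minimal sketch of the memorizing classifier described in note 1; the class and variable names are ours, invented for illustration:

```python
import random

class MemorizingClassifier:
    """Memorizes training pairs; guesses randomly on anything unseen."""

    def fit(self, X, y):
        # Store each training vector (as a hashable tuple) with its label.
        self.memory = {tuple(x): label for x, label in zip(X, y)}
        self.labels = sorted(set(y))
        return self

    def predict(self, x):
        key = tuple(x)
        if key in self.memory:             # zero distance to a training point
            return self.memory[key]
        return random.choice(self.labels)  # unseen instance: random label

clf = MemorizingClassifier().fit([(0, 0), (0, 1)], ["a", "b"])
print(clf.predict((0, 1)))   # "b": always correct on the training data
print(clf.predict((1, 1)))   # a random label: no generalization
```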
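Note 2's set view of a binary vector amounts to collecting the indices of its nonzero entries, as in this small example (the vector is invented):

```python
x = (0, 1, 1, 0, 1)   # binary document vector over terms t_0, ..., t_4
x_as_set = {j for j, bit in enumerate(x) if bit == 1}
print(x_as_set)       # {1, 2, 4}: the membership test "t_j in x" means x[j] == 1
```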
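For note 3, the normalization amounts to dividing each unnormalized score by the sum of the scores; a short sketch with invented numbers:

```python
# Unnormalized posterior scores for three classes (numbers are invented).
scores = {"senate": 0.03, "congress": 0.01, "legislation": 0.02}
Z = sum(scores.values())    # the constant of proportionality is 1/Z
posterior = {c: s / Z for c, s in scores.items()}
print(posterior)  # {'senate': 0.5, 'congress': 0.166..., 'legislation': 0.333...}; sums to 1
```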
Bibliography
C. Aggarwal. Data classification: Algorithms and applications. CRC Press, 2014.
C. Aggarwal. Data mining: The textbook. Springer, 2015.
C. Aggarwal, S. Gates, and P. Yu. On using partial supervision for text categorization. IEEE Transactions on Knowledge and Data Engineering, 16(2), pp. 245–255, 2004. [Extended version of the ACM KDD 1998 paper “On the merits of building categorization systems by supervised clustering.”]
C. Aggarwal and P. Yu. On effective conceptual indexing and similarity search in text data. ICDM Conference, pp. 3–10, 2001.
C. Aggarwal and C. Zhai. Mining text data. Springer, 2012.
M. Antonie and O. Zaïane. Text document categorization by term association. IEEE ICDM Conference, pp. 19–26, 2002.
C. Apte, F. Damerau, and S. Weiss. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3), pp. 233–251, 1994.
C. Apte, F. Damerau, and S. Weiss. Text mining with decision rules and decision trees. Conference on Automated Learning and Discovery, 1998. Also appears as IBM Research Report RC21219.
L. Baker and A. McCallum. Distributional clustering of words for text classification. ACM SIGIR Conference, pp. 96–103, 1998.
A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. COLT, 1998.
A. Blum and S. Chawla. Combining labeled and unlabeled data with graph mincuts. ICML Conference, 2001.
D. Boley, M. Gini, R. Gross, E.-H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Partitioning-based clustering for Web document categorization. Decision Support Systems, 27, pp. 329–341, 1999.
L. Breiman. Random forests. Machine Learning, 45(1), pp. 5–32, 2001.
L. Breiman. Bagging predictors. Machine Learning, 24(2), pp. 123–140, 1996.
L. Breiman and A. Cutler. Random Forests Manual v4.0, Technical Report, UC Berkeley, 2003. https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf
P. Bühlmann and B. Yu. Analyzing bagging. Annals of Statistics, pp. 927–961, 2002.
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal, 7(3), pp. 163–178, 1998.
S. Chakrabarti, S. Roy, and M. Soundalgekar. Fast and accurate text classification via multiple linear discriminant projections. The VLDB Journal, 12(2), pp. 170–185, 2003.
O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised learning. MIT Press, 2010.
D. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning Bayesian networks with local structure. Uncertainty in Artificial Intelligence, pp. 80–89, 1997.
W. Cohen. Fast effective rule induction. ICML Conference, pp. 115–123, 1995.
W. Cohen. Learning rules that classify e-mail. AAAI Spring Symposium on Machine Learning in Information Access, 1996.
W. Cohen. Learning with set-valued features. In National Conference on Artificial Intelligence, 1996.
W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems, 17(2), pp. 141–173, 1999.
W. Cooper. Some inconsistencies and misnomers in probabilistic information retrieval. ACM Transactions on Information Systems, 13(1), pp. 100–111, 1995.
T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), pp. 21–27, 1967.
P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2–3), pp. 103–130, 1997.
R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley-Interscience, 2000.
S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. ACM CIKM Conference, pp. 148–155, 1998.
M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1), pp. 3133–3181, 2014.
J. Fürnkranz and G. Widmer. Incremental reduced error pruning. ICML Conference, pp. 70–77, 1994.
E.-H. Han, G. Karypis, and V. Kumar. Text categorization using weight-adjusted k-nearest neighbor classification. PAKDD Conference, 2001.
E.-H. Han and G. Karypis. Centroid-based document classification: Analysis and experimental results. PKDD Conference, 2000.
T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6), pp. 607–616, 1996.
T. Joachims. Text categorization with support vector machines: learning with many relevant features. ECML Conference, 1998.
T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. ICML Conference, 1997.
D. Johnson, F. Oles, T. Zhang, and T. Goetz. A decision-tree-based symbolic rule induction system for text categorization. IBM Systems Journal, 41(3), pp. 428–437, 2002.
G. Karypis and E.-H. Han. Fast supervised dimensionality reduction with applications to document categorization and retrieval. ACM CIKM Conference, pp. 12–19, 2000.
M. Kuhn. Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), pp. 1–26, 2008. https://cran.r-project.org/web/packages/caret/index.html
W. Lam and C. Y. Ho. Using a generalized instance set for automatic text categorization. ACM SIGIR Conference, 1998.
D. Lewis. An evaluation of phrasal and clustered representations for the text categorization task. ACM SIGIR Conference, pp. 37–50, 1992.
D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. ECML Conference, pp. 4–15, 1998.
D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. Third Annual Symposium on Document Analysis and Information Retrieval, pp. 81–93, 1994.
H. Li and K. Yamanishi. Document classification using a finite mixture model. ACL Conference, pp. 39–47, 1997.
Y. Li and A. Jain. Classification of text documents. The Computer Journal, 41(8), pp. 537–546, 1998.
B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. ACM KDD Conference, pp. 80–86, 1998.
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. AAAI Workshop on Learning for Text Categorization, 1998.
T. M. Mitchell. The role of unlabeled data in supervised learning. International Colloquium on Cognitive Science, pp. 2–11, 1999.
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification with labeled and unlabeled data using EM. Machine Learning, 39(2), pp. 103–134, 2000.
M. Pazzani and D. Kibler. The utility of knowledge in inductive learning. Machine Learning, 9(1), pp. 57–94, 1992.
J. Quinlan. C4.5: programs for machine learning. Morgan-Kaufmann Publishers, 1993.
J. Quinlan. Induction of decision trees. Machine Learning, 1, pp. 81–106, 1986.
J. Rodríguez, L. Kuncheva, and C. Alonso. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10), pp. 1619–1630, 2006.
J. Rocchio. Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART Retrieval System: Experiments in Automatic Document Processing, pp. 313–323. Prentice Hall, Englewood Cliffs, NJ, 1971.
R. Samworth. Optimal weighted nearest neighbour classifiers. The Annals of Statistics, 40(5), pp. 2733–2763, 2012.
S. Sathe and C. Aggarwal. Similarity forests. ACM KDD Conference, 2017.
F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1), pp. 1–47, 2002.
N. Slonim and N. Tishby. The power of word clusters for text classification. European Colloquium on Information Retrieval Research (ECIR), 2001.
S. Weiss, C. Apte, F. Damerau, D. Johnson, F. Oles, T. Goetz, and T. Hampp. Maximizing text-mining performance. IEEE Intelligent Systems, 14(4), pp. 63–69, 1999.
Y. Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1–2), pp. 69–90, 1999.
Y. Yang. A study on thresholding strategies for text categorization. ACM SIGIR Conference, pp. 137–145, 2001.
Y. Yang and X. Liu. A re-examination of text categorization methods. ACM SIGIR Conference, pp. 42–49, 1999.
Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. ICML Conference, pp. 412–420, 1997.
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
https://cran.r-project.org/web/packages/RTextTools/RTextTools.pdf
https://cran.r-project.org/web/packages/rotationForest/index.html
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Aggarwal, C.C. (2018). Text Classification: Basic Models. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_5
DOI: https://doi.org/10.1007/978-3-319-73531-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73530-6
Online ISBN: 978-3-319-73531-3
eBook Packages: Computer Science, Computer Science (R0)