Abstract
In classification, the corpus is partitioned into classes that are typically defined by application-specific criteria. To enable this, training examples are provided that associate data points with labels indicating their class membership. For example, training examples extracted from a news portal on political matters might attach one of three labels, such as “senate,” “congress,” and “legislation,” to each document.
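As a concrete illustration of this setup, the following is a minimal sketch using scikit-learn (whose text-classification tutorial is referenced in the bibliography below); the toy documents, labels, and variable names are invented for illustration:

```python
# A minimal sketch of the labeled-training-data setup described in the
# abstract; the toy corpus and label names are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["the senate voted on the bill",
              "congress debated the budget",
              "new legislation was introduced"]
train_labels = ["senate", "congress", "legislation"]

vectorizer = CountVectorizer()              # map documents to term-count vectors
X = vectorizer.fit_transform(train_docs)    # sparse document-term matrix
model = MultinomialNB().fit(X, train_labels)

# Predict the class label of a previously unseen document.
print(model.predict(vectorizer.transform(["the senate passed new rules"])))
```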
Notes
1. Consider a classifier that memorizes the training examples as follows. For any test instance, it is determined whether some training instance has zero distance to it (which is guaranteed when the test instance is drawn from the training data). If such an instance is found, the label of that training instance is returned; otherwise, a random label is returned. Such a classifier will have 100% accuracy on the training data but will perform randomly on unseen test instances. The key point is that generalization is about extrapolating predictions from known instances of the data space (i.e., training points) to all regions of the data space. Memorizing only the known instances is the worst possible way to achieve this. (A minimal sketch of such a classifier appears after these notes.)
2. Although \(\overline{X_{i}}\) is a binary vector, we treat it like a set when we use set-membership notation such as \(t_{j} \in \overline{X_{i}}\). Any binary vector can equivalently be viewed as the set of indices at which it takes the value 1. (A small example appears after these notes.)
3. The constant of proportionality can easily be inferred by ensuring that the posterior probabilities sum to 1 across all classes. As we will see later, in scenarios where instances are ranked by their propensity to belong to a specific class, the constant of proportionality does matter. (A short numeric sketch appears after these notes.)
4. Most of the literature uses k rather than κ to denote the number of nearest neighbors. We use κ for disambiguation, since k has been used consistently in this chapter to denote the number of classes; using the same symbol for both quantities would cause confusion.
5. We intentionally use the seemingly unusual notation K(⋅ , ⋅ ) for a similarity function, as we will later connect this principle with the kernel similarity function used by support vector machines.
6. In Sect. 5.5.6, we show further connections between nearest-neighbor classifiers and randomized variants of decision trees.
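The following is a minimal sketch of the memorizing classifier described in note 1; the class and variable names are ours, invented for illustration:

```python
import random

class MemorizingClassifier:
    """Memorizes training pairs; guesses randomly on anything unseen."""

    def fit(self, X, y):
        # Store each training vector (as a hashable tuple) with its label.
        self.memory = {tuple(x): label for x, label in zip(X, y)}
        self.labels = sorted(set(y))
        return self

    def predict(self, x):
        key = tuple(x)
        if key in self.memory:             # zero distance to a training point
            return self.memory[key]
        return random.choice(self.labels)  # unseen instance: random label

clf = MemorizingClassifier().fit([(0, 0), (0, 1)], ["a", "b"])
print(clf.predict((0, 1)))   # "b": always correct on the training data
print(clf.predict((1, 1)))   # a random label: no generalization
```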
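Note 2's set view of a binary vector amounts to collecting the indices of its nonzero entries, as in this small example (the vector is invented):

```python
x = (0, 1, 1, 0, 1)   # binary document vector over terms t_0, ..., t_4
x_as_set = {j for j, bit in enumerate(x) if bit == 1}
print(x_as_set)       # {1, 2, 4}: the membership test "t_j in x" means x[j] == 1
```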
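For note 3, the normalization amounts to dividing each unnormalized score by the sum of the scores; a short sketch with invented numbers:

```python
# Unnormalized posterior scores for three classes (numbers are invented).
scores = {"senate": 0.03, "congress": 0.01, "legislation": 0.02}
Z = sum(scores.values())    # the constant of proportionality is 1/Z
posterior = {c: s / Z for c, s in scores.items()}
print(posterior)  # {'senate': 0.5, 'congress': 0.166..., 'legislation': 0.333...}; sums to 1
```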
Bibliography
C. Aggarwal. Data classification: Algorithms and applications. CRC Press, 2014.
C. Aggarwal. Data mining: The textbook. Springer, 2015.
C. Aggarwal, S. Gates, and P. Yu. On using partial supervision for text categorization. IEEE Transactions on Knowledge and Data Engineering, 16(2), pp. 245–255, 2004. [Extended version of the ACM KDD 1998 paper “On the merits of building categorization systems by supervised clustering.”]
C. Aggarwal and P. Yu. On effective conceptual indexing and similarity search in text data. ICDM Conference, pp. 3–10, 2001.
C. Aggarwal and C. Zhai. Mining text data. Springer, 2012.
M. Antonie and O. Zaïane. Text document categorization by term association. IEEE ICDM Conference, pp. 19–26, 2002.
C. Apte, F. Damerau, and S. Weiss. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3), pp. 233–251, 1994.
C. Apte, F. Damerau, and S. Weiss. Text mining with decision rules and decision trees. Conference on Automated Learning and Discovery, 1998. Also appears as IBM Research Report RC21219.
L. Baker and A. McCallum. Distributional clustering of words for text classification. ACM SIGIR Conference, pp. 96–103, 1998.
A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. COLT, 1998.
A. Blum and S. Chawla. Combining labeled and unlabeled data with graph mincuts. ICML Conference, 2001.
D. Boley, M. Gini, R. Gross, E.-H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Partitioning-based clustering for Web document categorization. Decision Support Systems, 27, pp. 329–341, 1999.
L. Breiman. Random forests. Machine Learning, 45(1), pp. 5–32, 2001.
L. Breiman. Bagging predictors. Machine Learning, 24(2), pp. 123–140, 1996.
L. Breiman and A. Cutler. Random Forests Manual v4.0, Technical Report, UC Berkeley, 2003. https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf
P. Bühlmann and B. Yu. Analyzing bagging. Annals of Statistics, pp. 927–961, 2002.
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal, 7(3), pp. 163–178, 1998.
S. Chakrabarti, S. Roy, and M. Soundalgekar. Fast and accurate text classification via multiple linear discriminant projections. The VLDB Journal, 12(2), pp. 170–185, 2003.
O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised learning. MIT Press, 2010.
D. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning Bayesian networks with local structure. Uncertainty in Artificial Intelligence, pp. 80–89, 1997.
W. Cohen. Fast effective rule induction. ICML Conference, pp. 115–123, 1995.
W. Cohen. Learning rules that classify e-mail. AAAI Spring Symposium on Machine Learning in Information Access, 1996.
W. Cohen. Learning with set-valued features. In National Conference on Artificial Intelligence, 1996.
W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems, 17(2), pp. 141–173, 1999.
W. Cooper. Some inconsistencies and misnomers in probabilistic information retrieval. ACM Transactions on Information Systems, 13(1), pp. 100–111, 1995.
T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), pp. 21–27, 1967.
P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2–3), pp. 103–130, 1997.
R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley-Interscience, 2000.
S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. ACM CIKM Conference, pp. 148–155, 1998.
M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1), pp. 3133–3181, 2014.
J. Fürnkranz and G. Widmer. Incremental reduced error pruning. ICML Conference, pp. 70–77, 1994.
E.-H. Han, G. Karypis, and V. Kumar. Text categorization using weight-adjusted k-nearest neighbor classification. PAKDD Conference, 2001.
E.-H. Han and G. Karypis. Centroid-based document classification: Analysis and experimental results. PKDD Conference, 2000.
T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6), pp. 607–616, 1996.
T. Joachims. Text categorization with support vector machines: learning with many relevant features. ECML Conference, 1998.
T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. ICML Conference, 1997.
D. Johnson, F. Oles, T. Zhang, and T. Goetz. A decision-tree-based symbolic rule induction system for text categorization. IBM Systems Journal, 41(3), pp. 428–437, 2002.
G. Karypis and E.-H. Han. Fast supervised dimensionality reduction with applications to document categorization and retrieval. ACM CIKM Conference, pp. 12–19, 2000.
M. Kuhn. Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), pp. 1–26, 2008. https://cran.r-project.org/web/packages/caret/index.html
W. Lam and C. Y. Ho. Using a generalized instance set for automatic text categorization. ACM SIGIR Conference, 1998.
D. Lewis. An evaluation of phrasal and clustered representations for the text categorization task. ACM SIGIR Conference, pp. 37–50, 1992.
D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. ECML Conference, pp. 4–15, 1998.
D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. Third Annual Symposium on Document Analysis and Information Retrieval, pp. 81–93, 1994.
H. Li and K. Yamanishi. Document classification using a finite mixture model. ACL Conference, pp. 39–47, 1997.
Y. Li and A. Jain. Classification of text documents. The Computer Journal, 41(8), pp. 537–546, 1998.
B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. ACM KDD Conference, pp. 80–86, 1998.
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. AAAI Workshop on Learning for Text Categorization, 1998.
T. M. Mitchell. The role of unlabeled data in supervised learning. International Colloquium on Cognitive Science, pp. 2–11, 1999.
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification with labeled and unlabeled data using EM. Machine Learning, 39(2), pp. 103–134, 2000.
M. Pazzani and D. Kibler. The utility of knowledge in inductive learning. Machine Learning, 9(1), pp. 57–94, 1992.
J. Quinlan. C4.5: programs for machine learning. Morgan-Kaufmann Publishers, 1993.
J. Quinlan. Induction of decision trees. Machine Learning, 1, pp. 81–106, 1986.
J. Rodríguez, L. Kuncheva, and C. Alonso. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10), pp. 1619–1630, 2006.
J. Rocchio. Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART Retrieval System: Experiments in Automatic Document Processing, pp. 313–323. Prentice Hall, Englewood Cliffs, NJ, 1971.
R. Samworth. Optimal weighted nearest neighbour classifiers. The Annals of Statistics, 40(5), pp. 2733–2763, 2012.
S. Sathe and C. Aggarwal. Similarity forests. ACM KDD Conference, 2017.
F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1), pp. 1–47, 2002.
N. Slonim and N. Tishby. The power of word clusters for text classification. European Colloquium on Information Retrieval Research (ECIR), 2001.
S. Weiss, C. Apte, F. Damerau, D. Johnson, F. Oles, T. Goetz, and T. Hampp. Maximizing text-mining performance. IEEE Intelligent Systems, 14(4), pp. 63–69, 1999.
Y. Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1–2), pp. 69–90, 1999.
Y. Yang. A study on thresholding strategies for text categorization. ACM SIGIR Conference, pp. 137–145, 2001.
Y. Yang and X. Liu. A re-examination of text categorization methods. ACM SIGIR Conference, pp. 42–49, 1999.
Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. ICML Conference, pp. 412–420, 1997.
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
https://cran.r-project.org/web/packages/RTextTools/RTextTools.pdf
https://cran.r-project.org/web/packages/rotationForest/index.html
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Aggarwal, C.C. (2018). Text Classification: Basic Models. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_5
DOI: https://doi.org/10.1007/978-3-319-73531-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73530-6
Online ISBN: 978-3-319-73531-3
eBook Packages: Computer Science, Computer Science (R0)