Data Classification: Advanced Concepts

Aggarwal, Charu C.

doi:10.1007/978-3-319-14142-8_11

Abstract

In this chapter, a number of advanced scenarios related to the classification problem will be addressed. These include more difficult special cases of the classification problem and various ways of enhancing classification algorithms with the use of additional inputs or a combination of classifiers. The enhancements discussed in this chapter belong to one of the following two categories:

1. Difficult classification scenarios: Some scenarios of the classification problem are considerably more challenging than the standard setting. These include multiclass scenarios, rare-class scenarios, and cases where the size of the training data is large.
2. Enhancing classification: Classification methods can be enhanced with additional data-centric input, user-centric input, or multiple models.

The difficult classification scenarios that are addressed in this chapter are as follows:

1. Multiclass learning: Although many classifiers, such as decision trees, Bayesian methods, and rule-based classifiers, can be directly used for multiclass learning, some models, such as support-vector machines, are naturally designed for binary classification. Therefore, numerous meta-algorithms have been designed for adapting binary classifiers to multiclass learning.
2. Rare class learning: The positive and negative examples may be imbalanced. In other words, the data set contains only a small number of positive examples. A direct use of traditional learning models may often result in the classifier assigning all examples to the negative class. Such a classification is not very informative in imbalanced scenarios, in which misclassification of the rare class incurs a much higher cost than misclassification of the normal class.
3. Scalable learning: The sizes of typical training data sets have increased significantly in recent years. Therefore, it is important to design models that can perform the learning in a scalable way. In cases where the data is not memory resident, it is important to design algorithms that can minimize disk accesses.
4. Numeric class variables: Most of the discussion in this book assumes that the class variables are categorical. When the class variables are numeric, suitable modifications to classification algorithms are required. This problem is also referred to as regression modeling.
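
As an illustration of the meta-algorithm idea in the multiclass item above, the following is a minimal Python sketch of the one-against-rest strategy, one common way of adapting a binary classifier to several classes. The binary learner here is a plain perceptron chosen only for brevity; it stands in for any binary model, such as a support-vector machine. The function names and the toy data are assumptions made for this example and are not taken from the chapter.

```python
def linear_score(w, b, x):
    """Signed score w.x + b of one example under a linear model."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_perceptron(X, y, max_epochs=1000):
    """Binary perceptron; y holds +1/-1 labels. Returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * linear_score(w, b, xi) <= 0:
                # Misclassified (or on the boundary): move toward the example.
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: stop early
            break
    return w, b


def train_one_vs_rest(X, labels):
    """Train one binary model per class: that class (+1) against the rest (-1)."""
    models = {}
    for c in sorted(set(labels)):
        y = [1 if label == c else -1 for label in labels]
        models[c] = train_perceptron(X, y)
    return models


def predict_one_vs_rest(models, x):
    """Predict the class whose binary model gives the highest score."""
    return max(models, key=lambda c: linear_score(*models[c], x))


# Hypothetical toy data: three well-separated groups in two dimensions.
X = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [10.0, 1.0], [0.0, 10.0], [1.0, 10.0]]
labels = [0, 0, 1, 1, 2, 2]

models = train_one_vs_rest(X, labels)
print([predict_one_vs_rest(models, x) for x in X])  # [0, 0, 1, 1, 2, 2]
```

Each class in this toy data is linearly separable from the others, so every binary perceptron converges and the training labels are reproduced. Raw perceptron scores are not calibrated across models, which is one reason practical one-against-rest schemes usually rely on a learner with comparable confidence values.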

The addition of more training data or the simultaneous use of a larger number of classification models can improve the learning accuracy. A number of techniques have been proposed to enhance classification methods in this way. Examples include the following:

1. Semisupervised learning: In these cases, unlabeled examples are used to improve the effectiveness of classifiers. Although unlabeled data does not contain any information about the label distribution, it does contain a significant amount of information about the manifold and clustering structure of the underlying data. Because the classification problem is a supervised version of the clustering problem, this connection can be leveraged to improve the classification accuracy. The core idea is that in most real data sets, labels vary in a smooth way over dense regions of the data. The determination of dense regions in the data only requires unlabeled information.
2. Active learning: In real life, it is often expensive to acquire labels. In active learning, the user (or an oracle) is actively involved in determining the most informative examples for which the labels need to be acquired. Typically, these are examples that give the user more accurate knowledge about the uncertain regions in the data, where the distribution of the class label is unknown.
3. Ensemble learning: As in the clustering and outlier detection problems, ensemble learning uses the power of multiple models to provide more robust results for the classification process.
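
The selection step in active learning can be sketched with uncertainty sampling, one widely used heuristic in which the examples whose predicted class probability is closest to 0.5 are the ones sent to the oracle for labeling. The sketch below assumes a binary problem and a hypothetical list of predicted positive-class probabilities for an unlabeled pool; the function name and the numbers are illustrative and are not taken from the chapter.

```python
def select_most_uncertain(pool_probabilities, k):
    """Return the indices of the k pool examples nearest the decision boundary.

    pool_probabilities[i] is the current model's estimate of P(positive)
    for the i-th unlabeled example.
    """
    ranked = sorted(
        range(len(pool_probabilities)),
        key=lambda i: abs(pool_probabilities[i] - 0.5),
    )
    return ranked[:k]


# Hypothetical model outputs for five unlabeled examples.
pool_probabilities = [0.95, 0.52, 0.10, 0.40, 0.70]

queries = select_most_uncertain(pool_probabilities, k=2)
print(queries)  # [1, 3]
```

In a full loop, the labels obtained for the queried examples would be added to the training data and the model retrained before the next round of selection.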

This chapter is organized as follows. Multiclass learning is addressed in Sect. 11.2. Rare class learning methods are introduced in Sect. 11.3. Scalable classification methods are introduced in Sect. 11.4. Classification with numeric class variables is discussed in Sect. 11.5. Semisupervised learning methods are introduced in Sect. 11.6. Active learning methods are discussed in Sect. 11.7. Ensemble methods are presented in Sect. 11.8. Finally, a summary of the chapter is given in Sect. 11.9.

Notes

1. Here, we assume that the total number of dimensions is \(d\), including the artificial column.
2. Excluding constant terms, the objective function \(O = (D \overline{W}^T - \overline{y})^T (D \overline{W}^T - \overline{y})\) can be expanded to the two additive terms \(\overline{W} D^T D \overline{W}^T\) and \(-(\overline{W} D^T \overline{y} + \overline{y}^T D \overline{W}^T) = -2 \overline{W} D^T \overline{y}\). The gradients of these terms are \(2 D^T D \overline{W}^T\) and \(-2 D^T \overline{y}\), respectively. In the event that the Tikhonov regularization term \(\lambda ||\overline{W}||^2\) is added to the objective function, an additional term of \(2 \lambda \overline{W}^T\) will appear in the gradient.
3. A slightly different convention of \(y_i \in \{-1, +1\}\) is used in Chap. 10 for notational convenience. In that case, the mean function would need to be adjusted to \(\frac{1 - \exp(-\overline{W} \cdot \overline{X})}{1 + \exp(-\overline{W} \cdot \overline{X})}\).
4. This theoretical concept is discussed in detail in the next section.
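
The gradient stated in Note 2 can be checked numerically. The Python sketch below evaluates the regularized least-squares objective, computes the analytic gradient \(2 D^T D \overline{W}^T - 2 D^T \overline{y} + 2 \lambda \overline{W}^T\), and compares it against a central finite-difference estimate. The data matrix, targets, weights, and value of \(\lambda\) are made-up numbers for illustration, and the last column of the matrix is assumed to be a column of ones standing in for the artificial column mentioned in Note 1.

```python
def residuals(D, y, W):
    """Entries of D W^T - y, one per row of D."""
    return [sum(dij * wj for dij, wj in zip(row, W)) - yi for row, yi in zip(D, y)]


def objective(D, y, W, lam):
    """Squared error plus Tikhonov term: ||D W^T - y||^2 + lam * ||W||^2.

    The constant y^T y is included here; it does not affect the gradient.
    """
    return sum(r * r for r in residuals(D, y, W)) + lam * sum(wj * wj for wj in W)


def analytic_gradient(D, y, W, lam):
    """2 D^T (D W^T - y) + 2 lam W^T, which equals the expression in Note 2."""
    r = residuals(D, y, W)
    return [
        2.0 * sum(ri * row[j] for ri, row in zip(r, D)) + 2.0 * lam * W[j]
        for j in range(len(W))
    ]


def numeric_gradient(D, y, W, lam, h=1e-5):
    """Central finite-difference estimate of the gradient."""
    grad = []
    for j in range(len(W)):
        up, down = list(W), list(W)
        up[j] += h
        down[j] -= h
        grad.append((objective(D, y, up, lam) - objective(D, y, down, lam)) / (2 * h))
    return grad


# Hypothetical data: 4 records, 2 features plus an assumed column of ones.
D = [[1.0, 2.0, 1.0], [3.0, -1.0, 1.0], [0.0, 4.0, 1.0], [2.0, 2.0, 1.0]]
y = [3.0, 1.0, 5.0, 4.0]
W = [0.5, -0.2, 0.1]
lam = 0.1

g_analytic = analytic_gradient(D, y, W, lam)
g_numeric = numeric_gradient(D, y, W, lam)
max_gap = max(abs(a - n) for a, n in zip(g_analytic, g_numeric))
print(g_analytic, max_gap)
```

Because the objective is quadratic in \(\overline{W}\), the central difference agrees with the analytic gradient up to rounding error, so the printed gap should be very small.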

Author information

Authors and Affiliations

IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
Charu C. Aggarwal

Corresponding author

Correspondence to Charu C. Aggarwal.

About this chapter

Cite this chapter

Aggarwal, C. (2015). Data Classification: Advanced Concepts. In: Data Mining. Springer, Cham. https://doi.org/10.1007/978-3-319-14142-8_11

DOI: https://doi.org/10.1007/978-3-319-14142-8_11
Published: 14 April 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14141-1
Online ISBN: 978-3-319-14142-8
eBook Packages: Computer Science, Computer Science (R0)
