Abstract
The classification problem is closely related to the clustering problem discussed in Chaps. 6 and 7. While the clustering problem is that of determining similar groups of data points, the classification problem is that of learning the structure of a data set of examples that is already partitioned into groups, which are referred to as categories or classes. The learning of these categories is typically achieved with a model, which is used to estimate the group identifiers (or class labels) of one or more previously unseen data examples with unknown labels. Therefore, one of the inputs to the classification problem is an example data set that has already been partitioned into different classes. This is referred to as the training data, and the group identifiers of these classes are referred to as class labels. In most cases, the class labels have a clear semantic interpretation in the context of a specific application, such as a group of customers interested in a specific product, or a group of data objects with a desired property of interest. The learned model is referred to as the training model. The previously unseen data points that need to be classified are collectively referred to as the test data set. The algorithm that creates the training model for prediction is also sometimes referred to as the learner.
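The train-then-predict workflow described above can be sketched with a deliberately simple model. The following sketch uses a nearest-centroid classifier on synthetic data; all data values and names here are illustrative assumptions, not taken from the chapter:

```python
import numpy as np

# Toy training data: two classes in 2-D, already partitioned by label.
X_train = np.array([[1.0, 1.0], [1.5, 0.5], [0.5, 1.5],   # class 0
                    [4.0, 4.0], [4.5, 3.5], [3.5, 4.5]])  # class 1
y_train = np.array([0, 0, 0, 1, 1, 1])

# "Training model": one centroid per class label.
centroids = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}

def predict(x):
    """Assign the class label of the nearest centroid."""
    return int(min(centroids, key=lambda c: np.linalg.norm(x - centroids[c])))

# Test data: previously unseen points with unknown labels.
X_test = np.array([[1.2, 0.8], [4.2, 4.1]])
labels = [predict(x) for x in X_test]
print(labels)  # [0, 1]
```

The "learner" here is the centroid-computation step; the resulting model then estimates labels for the test set.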
Notes
- 1.
The unscaled versions of the two scatter matrices are \(n p_0 p_1 S_b\) and \(n S_w\), respectively. The sum of these two matrices is the total scatter matrix, which is \(n\) times the covariance matrix (see Exercise 21).
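This decomposition is easy to verify numerically: with maximum-likelihood (1/n-normalized) class covariances, \(p_0 p_1 S_b + S_w\) equals the covariance matrix of the full data set. A sketch on synthetic data (variable names are mine, not the book's):

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(40, 3))   # class 0 samples
X1 = rng.normal(2.0, 1.5, size=(60, 3))   # class 1 samples
X = np.vstack([X0, X1])
n = len(X)
p0, p1 = len(X0) / n, len(X1) / n

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
d = mu1 - mu0
S_b = np.outer(d, d)                          # between-class scatter (rank 1)
# Within-class scatter: probability-weighted class covariances (1/n_c norm).
S_w = p0 * np.cov(X0.T, bias=True) + p1 * np.cov(X1.T, bias=True)

total = np.cov(X.T, bias=True)                # covariance of the full data set
print(np.allclose(p0 * p1 * S_b + S_w, total))  # True
```

Multiplying both sides by \(n\) gives the unscaled statement in the note.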
- 2.
Maximizing \(FS(\overline {W})= \frac {\overline {W} S_b \overline {W}^T}{\overline {W} S_w \overline {W}^T}\) is the same as maximizing \(\overline {W} S_b \overline {W}^T\) subject to \(\overline {W} S_w \overline {W}^T=1\). Setting the gradient of the Lagrangian relaxation \(\overline {W} S_b \overline {W}^T - \lambda (\overline {W} S_w \overline {W}^T -1)\) to 0 yields the generalized eigenvector condition \(S_b \overline {W}^T= \lambda S_w \overline {W}^T\). Because \(S_b \overline {W}^T= (\overline {\mu _1}^T - \overline {\mu _0}^T) \left [ (\overline {\mu _1} - \overline {\mu _0}) \overline {W}^T \right ]\) always points in the direction of \((\overline {\mu _1}^T - \overline {\mu _0}^T)\), it follows that \(S_w \overline {W}^T \propto \overline {\mu _1}^T - \overline {\mu _0}^T\). Therefore, we have \(\overline {W} \propto (\overline {\mu _1} - \overline {\mu _0}) S_w^{-1}\).
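The closed form can be checked numerically: the direction \(S_w^{-1} (\overline{\mu_1}^T - \overline{\mu_0}^T)\) should achieve a Fisher score at least as large as any other direction. A sketch under assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
cov = [[1.0, 0.5], [0.5, 2.0]]
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=100)  # class 0
X1 = rng.multivariate_normal([3.0, 1.0], cov, size=100)  # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
d = mu1 - mu0
S_b = np.outer(d, d)                                      # between-class scatter
S_w = 0.5 * np.cov(X0.T, bias=True) + 0.5 * np.cov(X1.T, bias=True)

def fisher_score(w):
    """Generalized Rayleigh quotient FS(w) for a 1-D direction vector w."""
    return (w @ S_b @ w) / (w @ S_w @ w)

w_star = np.linalg.solve(S_w, d)   # direction proportional to (mu1 - mu0) S_w^{-1}
best = fisher_score(w_star)
# No random direction should beat the closed-form solution.
assert all(fisher_score(rng.normal(size=2)) <= best + 1e-9 for _ in range(500))
```

The assertion holds because \(w_\star\) maximizes the generalized Rayleigh quotient, exactly as the Lagrangian argument in the note shows.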
- 3.
- 4.
For the case where \(i=0\), the value of \(x_k^i\) is replaced by 1.
- 5.
The additional term in \(L_p\) involving \(\xi _i\) is \((C- \beta _i - \lambda _i) \xi _i\). This term evaluates to 0 because the partial derivative of \(L_p\) with respect to \(\xi _i\) is \((C- \beta _i - \lambda _i)\). This partial derivative must evaluate to 0 for optimality of \(L_p\).
- 6.
The original result [450] uses a more general argument to derive \( S' Q_k \Sigma _k^{-1}\) as the \(m \times k\) matrix of \(k\)-dimensional embedded coordinates of any out-of-sample \(m\times d\) matrix \(D'\). Here, \(S'=D'D^T\) is the \(m \times n\) matrix of kernel similarities between out-of-sample points in \(D'\) and in-sample points in \(D\). However, when \(D'=D\), this expression is (more simply) equivalent to \(Q_k \Sigma _k\) by expanding \(S'=S \approx Q_k \Sigma _k^2 Q_k^T\).
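The in-sample equivalence can be verified directly from an eigendecomposition of \(S\). A numerical sketch with synthetic data (names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.normal(size=(8, 5))
S = D @ D.T                          # in-sample kernel (similarity) matrix

# Eigendecomposition S = Q Sigma^2 Q^T; keep the k largest eigenvalues.
evals, Q = np.linalg.eigh(S)
order = np.argsort(evals)[::-1][:3]  # k = 3
Q_k = Q[:, order]
Sigma_k = np.diag(np.sqrt(evals[order]))

# Out-of-sample formula applied to the in-sample points (S' = S) ...
emb_out = S @ Q_k @ np.linalg.inv(Sigma_k)
# ... reduces to the usual in-sample embedding Q_k Sigma_k.
emb_in = Q_k @ Sigma_k
print(np.allclose(emb_out, emb_in))  # True
```

The reduction follows because \(S Q_k = Q_k \Sigma_k^2\), so \(S Q_k \Sigma_k^{-1} = Q_k \Sigma_k\) holds exactly for the eigenvectors retained.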
- 7.
Refer to Sect. 19.3.4 of Chap. 19. The small eigenvectors of the symmetric Laplacian are the same as the large eigenvectors of \(S= \Lambda ^{-1/2} W \Lambda ^{-1/2}\). Here, \(W\) is often defined by the sparsified heat-kernel similarity between data points, and the factors involving \(\Lambda ^{-1/2}\) provide local normalization of the similarity values to handle clusters of varying density.
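The correspondence between the two eigensystems follows because the symmetric Laplacian is \(I - S\): it shares eigenvectors with \(S\), and its eigenvalues are \(1 - \lambda(S)\), so small eigenvectors of the Laplacian are large eigenvectors of \(S\). A sketch with an assumed synthetic similarity matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
# Symmetric nonnegative similarity matrix W with zero diagonal.
A = rng.random((6, 6))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)

deg = W.sum(axis=1)                            # degree (row-sum) of each node
Lam_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
S = Lam_inv_sqrt @ W @ Lam_inv_sqrt            # locally normalized similarity
L_sym = np.eye(6) - S                          # symmetric normalized Laplacian

mu = np.linalg.eigvalsh(S)                     # eigenvalues of S (ascending)
lam = np.linalg.eigvalsh(L_sym)                # eigenvalues of L_sym (ascending)
# Each eigenvalue of L_sym is 1 minus an eigenvalue of S.
print(np.allclose(lam, np.sort(1.0 - mu)))  # True
```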
- 8.
The derivative of the sign function is replaced by the derivative of its argument alone. This is because the derivative of the sign function is zero everywhere, except at zero, where it is indeterminate.
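This "smoothed" gradient of \((y - \mathrm{sign}(\overline{W} \cdot \overline{X}))^2\) yields the classical perceptron update \(\Delta \overline{W} = \eta (y - \hat{y}) \overline{X}\). A minimal sketch on assumed synthetic separable data (not code from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 2))
s = X[:, 0] + 2 * X[:, 1]                        # true linear separator
X, s = X[np.abs(s) > 0.5], s[np.abs(s) > 0.5]    # enforce a margin
y = np.where(s > 0, 1, -1)                       # labels in {-1, +1}

w = np.zeros(2)
eta = 0.1
for _ in range(200):                             # epochs
    for x_i, y_i in zip(X, y):
        y_hat = 1 if w @ x_i > 0 else -1         # sign(w . x)
        # Treating sign as the identity when differentiating the squared
        # loss gives the update (y - y_hat) * x; it is zero unless the
        # point is misclassified.
        w += eta * (y_i - y_hat) * x_i

# On separable data with a margin, the perceptron converges to a
# separator, so every training point is classified correctly.
print(all((1 if w @ x_i > 0 else -1) == y_i for x_i, y_i in zip(X, y)))
```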
- 9.
This approach is also referred to as leave-one-out cross-validation, and is described in detail in Sect. 10.9 on classifier evaluation.
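Leave-one-out cross-validation can be sketched in a few lines: each point is held out in turn, the model is fit on the remainder, and the held-out prediction is scored. The sketch below uses a 1-nearest-neighbor classifier on illustrative data (the function name is mine):

```python
import numpy as np

def loo_accuracy(X, y):
    """Leave-one-out cross-validation of a 1-nearest-neighbor classifier."""
    correct = 0
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                 # hold out the i-th point
        correct += y[np.argmin(dists)] == y[i]
    return correct / len(X)

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(loo_accuracy(X, y))  # 1.0
```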
- 10.
The unscaled version may be obtained by multiplying \(S_w\) by the number of data points. Whether the scaled or unscaled version is used, the final result is the same up to a constant of proportionality.
© 2015 Springer International Publishing Switzerland
Aggarwal, C. (2015). Data Classification. In: Data Mining. Springer, Cham. https://doi.org/10.1007/978-3-319-14142-8_10
Print ISBN: 978-3-319-14141-1
Online ISBN: 978-3-319-14142-8