Abstract
The classification problem is closely related to the clustering problem discussed in Chaps. 6 and 7. While the clustering problem is that of determining similar groups of data points, the classification problem is that of learning the structure of a data set of examples that is already partitioned into groups, which are referred to as categories or classes. The learning of these categories is typically achieved with a model, which is used to estimate the group identifiers (or class labels) of one or more previously unseen data examples with unknown labels. Therefore, one of the inputs to the classification problem is an example data set that has already been partitioned into different classes. This is referred to as the training data, and the group identifiers of these classes are referred to as class labels. In most cases, the class labels have a clear semantic interpretation in the context of a specific application, such as a group of customers interested in a specific product, or a group of data objects with a desired property of interest. The learned model is referred to as the training model. The previously unseen data points that need to be classified are collectively referred to as the test data set. The algorithm that creates the training model for prediction is also sometimes referred to as the learner.
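The train-then-predict workflow described above can be sketched with a deliberately simple model. The following sketch uses a nearest-centroid classifier on synthetic data; all data values and names here are illustrative assumptions, not taken from the chapter:

```python
import numpy as np

# Toy training data: two classes in 2-D, already partitioned by label.
X_train = np.array([[1.0, 1.0], [1.5, 0.5], [0.5, 1.5],   # class 0
                    [4.0, 4.0], [4.5, 3.5], [3.5, 4.5]])  # class 1
y_train = np.array([0, 0, 0, 1, 1, 1])

# "Training model": one centroid per class label.
centroids = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}

def predict(x):
    """Assign the class label of the nearest centroid."""
    return int(min(centroids, key=lambda c: np.linalg.norm(x - centroids[c])))

# Test data: previously unseen points with unknown labels.
X_test = np.array([[1.2, 0.8], [4.2, 4.1]])
labels = [predict(x) for x in X_test]
print(labels)  # [0, 1]
```

The "learner" here is the centroid-computation step; the resulting model then estimates labels for the test set.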
Notes
- 1.
The unscaled versions of the two scatter matrices are \(n p_0 p_1 S_b\) and \(n S_w\), respectively. The sum of these two matrices is the total scatter matrix, which is \(n\) times the covariance matrix (see Exercise 21).
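This decomposition is easy to verify numerically: with maximum-likelihood (1/n-normalized) class covariances, \(p_0 p_1 S_b + S_w\) equals the covariance matrix of the full data set. A sketch on synthetic data (variable names are mine, not the book's):

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(40, 3))   # class 0 samples
X1 = rng.normal(2.0, 1.5, size=(60, 3))   # class 1 samples
X = np.vstack([X0, X1])
n = len(X)
p0, p1 = len(X0) / n, len(X1) / n

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
d = mu1 - mu0
S_b = np.outer(d, d)                          # between-class scatter (rank 1)
# Within-class scatter: probability-weighted class covariances (1/n_c norm).
S_w = p0 * np.cov(X0.T, bias=True) + p1 * np.cov(X1.T, bias=True)

total = np.cov(X.T, bias=True)                # covariance of the full data set
print(np.allclose(p0 * p1 * S_b + S_w, total))  # True
```

Multiplying both sides by \(n\) gives the unscaled statement in the note.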
- 2.
Maximizing \(FS(\overline {W})= \frac {\overline {W} S_b \overline {W}^T}{\overline {W} S_w \overline {W}^T}\) is the same as maximizing \(\overline {W} S_b \overline {W}^T\) subject to \(\overline {W} S_w \overline {W}^T=1\). Setting the gradient of the Lagrangian relaxation \(\overline {W} S_b \overline {W}^T - \lambda (\overline {W} S_w \overline {W}^T -1)\) to 0 yields the generalized eigenvector condition \(S_b \overline {W}^T= \lambda S_w \overline {W}^T\). Because \(S_b \overline {W}^T= (\overline {\mu _1}^T - \overline {\mu _0}^T) \left [ (\overline {\mu _1} - \overline {\mu _0}) \overline {W}^T \right ]\) always points in the direction of \((\overline {\mu _1}^T - \overline {\mu _0}^T)\), it follows that \(S_w \overline {W}^T \propto \overline {\mu _1}^T - \overline {\mu _0}^T\). Therefore, we have \(\overline {W} \propto (\overline {\mu _1} - \overline {\mu _0}) S_w^{-1}\).
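The closed form can be checked numerically: the direction \(S_w^{-1} (\overline{\mu_1}^T - \overline{\mu_0}^T)\) should achieve a Fisher score at least as large as any other direction. A sketch under assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
cov = [[1.0, 0.5], [0.5, 2.0]]
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=100)  # class 0
X1 = rng.multivariate_normal([3.0, 1.0], cov, size=100)  # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
d = mu1 - mu0
S_b = np.outer(d, d)                                      # between-class scatter
S_w = 0.5 * np.cov(X0.T, bias=True) + 0.5 * np.cov(X1.T, bias=True)

def fisher_score(w):
    """Generalized Rayleigh quotient FS(w) for a 1-D direction vector w."""
    return (w @ S_b @ w) / (w @ S_w @ w)

w_star = np.linalg.solve(S_w, d)   # direction proportional to (mu1 - mu0) S_w^{-1}
best = fisher_score(w_star)
# No random direction should beat the closed-form solution.
assert all(fisher_score(rng.normal(size=2)) <= best + 1e-9 for _ in range(500))
```

The assertion holds because \(w_\star\) maximizes the generalized Rayleigh quotient, exactly as the Lagrangian argument in the note shows.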
- 3.
- 4.
For the case where \(i=0\), the value of \(x_k^i\) is replaced by 1.
- 5.
The additional term in \(L_p\) involving \(\xi _i\) is \((C- \beta _i - \lambda _i) \xi _i\). This term evaluates to 0 because the partial derivative of \(L_p\) with respect to \(\xi _i\) is \((C- \beta _i - \lambda _i)\). This partial derivative must evaluate to 0 for optimality of \(L_p\).
- 6.
The original result [450] uses a more general argument to derive \( S' Q_k \Sigma _k^{-1}\) as the \(m \times k\) matrix of \(k\)-dimensional embedded coordinates of any out-of-sample \(m\times d\) matrix \(D'\). Here, \(S'=D'D^T\) is the \(m \times n\) matrix of kernel similarities between out-of-sample points in \(D'\) and in-sample points in \(D\). However, when \(D'=D\), this expression is (more simply) equivalent to \(Q_k \Sigma _k\) by expanding \(S'=S \approx Q_k \Sigma _k^2 Q_k^T\).
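The in-sample equivalence can be verified directly from an eigendecomposition of \(S\). A numerical sketch with synthetic data (names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.normal(size=(8, 5))
S = D @ D.T                          # in-sample kernel (similarity) matrix

# Eigendecomposition S = Q Sigma^2 Q^T; keep the k largest eigenvalues.
evals, Q = np.linalg.eigh(S)
order = np.argsort(evals)[::-1][:3]  # k = 3
Q_k = Q[:, order]
Sigma_k = np.diag(np.sqrt(evals[order]))

# Out-of-sample formula applied to the in-sample points (S' = S) ...
emb_out = S @ Q_k @ np.linalg.inv(Sigma_k)
# ... reduces to the usual in-sample embedding Q_k Sigma_k.
emb_in = Q_k @ Sigma_k
print(np.allclose(emb_out, emb_in))  # True
```

The reduction follows because \(S Q_k = Q_k \Sigma_k^2\), so \(S Q_k \Sigma_k^{-1} = Q_k \Sigma_k\) holds exactly for the eigenvectors retained.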
- 7.
Refer to Sect. 19.3.4 of Chap. 19. The small eigenvectors of the symmetric Laplacian are the same as the large eigenvectors of \(S= \Lambda ^{-1/2} W \Lambda ^{-1/2}\). Here, \(W\) is often defined by the sparsified heat-kernel similarity between data points, and the factors involving \(\Lambda ^{-1/2}\) provide local normalization of the similarity values to handle clusters of varying density.
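The correspondence between the two eigensystems follows because the symmetric Laplacian is \(I - S\): it shares eigenvectors with \(S\), and its eigenvalues are \(1 - \lambda(S)\), so small eigenvectors of the Laplacian are large eigenvectors of \(S\). A sketch with an assumed synthetic similarity matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
# Symmetric nonnegative similarity matrix W with zero diagonal.
A = rng.random((6, 6))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)

deg = W.sum(axis=1)                            # degree (row-sum) of each node
Lam_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
S = Lam_inv_sqrt @ W @ Lam_inv_sqrt            # locally normalized similarity
L_sym = np.eye(6) - S                          # symmetric normalized Laplacian

mu = np.linalg.eigvalsh(S)                     # eigenvalues of S (ascending)
lam = np.linalg.eigvalsh(L_sym)                # eigenvalues of L_sym (ascending)
# Each eigenvalue of L_sym is 1 minus an eigenvalue of S.
print(np.allclose(lam, np.sort(1.0 - mu)))  # True
```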
- 8.
The derivative of the sign function is replaced by the derivative of its argument alone. This is because the derivative of the sign function is zero everywhere, except at zero, where it is indeterminate.
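This "smoothed" gradient of \((y - \mathrm{sign}(\overline{W} \cdot \overline{X}))^2\) yields the classical perceptron update \(\Delta \overline{W} = \eta (y - \hat{y}) \overline{X}\). A minimal sketch on assumed synthetic separable data (not code from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 2))
s = X[:, 0] + 2 * X[:, 1]                        # true linear separator
X, s = X[np.abs(s) > 0.5], s[np.abs(s) > 0.5]    # enforce a margin
y = np.where(s > 0, 1, -1)                       # labels in {-1, +1}

w = np.zeros(2)
eta = 0.1
for _ in range(200):                             # epochs
    for x_i, y_i in zip(X, y):
        y_hat = 1 if w @ x_i > 0 else -1         # sign(w . x)
        # Treating sign as the identity when differentiating the squared
        # loss gives the update (y - y_hat) * x; it is zero unless the
        # point is misclassified.
        w += eta * (y_i - y_hat) * x_i

# On separable data with a margin, the perceptron converges to a
# separator, so every training point is classified correctly.
print(all((1 if w @ x_i > 0 else -1) == y_i for x_i, y_i in zip(X, y)))
```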
- 9.
This approach is also referred to as leave-one-out cross-validation, and is described in detail in Sect. 10.9 on classifier evaluation.
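Leave-one-out cross-validation can be sketched in a few lines: each point is held out in turn, the model is fit on the remainder, and the held-out prediction is scored. The sketch below uses a 1-nearest-neighbor classifier on illustrative data (the function name is mine):

```python
import numpy as np

def loo_accuracy(X, y):
    """Leave-one-out cross-validation of a 1-nearest-neighbor classifier."""
    correct = 0
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                 # hold out the i-th point
        correct += y[np.argmin(dists)] == y[i]
    return correct / len(X)

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(loo_accuracy(X, y))  # 1.0
```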
- 10.
The unscaled version may be obtained by multiplying \(S_w\) by the number of data points. Whether the scaled or unscaled version is used, the final result is the same up to a constant of proportionality.
© 2015 Springer International Publishing Switzerland
Aggarwal, C. (2015). Data Classification. In: Data Mining. Springer, Cham. https://doi.org/10.1007/978-3-319-14142-8_10
Print ISBN: 978-3-319-14141-1
Online ISBN: 978-3-319-14142-8