
Data Classification

Chapter in Data Mining

Abstract

The classification problem is closely related to the clustering problem discussed in Chaps. 6 and 7. While the clustering problem is that of determining similar groups of data points, the classification problem is that of learning the structure of a data set of examples, already partitioned into groups, which are referred to as categories or classes. The learning of these categories is typically achieved with a model. This model is used to estimate the group identifiers (or class labels) of one or more previously unseen data examples with unknown labels. Therefore, one of the inputs to the classification problem is an example data set that has already been partitioned into different classes. This is referred to as the training data, and the group identifiers of these classes are referred to as class labels. In most cases, the class labels have a clear semantic interpretation in the context of a specific application, such as a group of customers interested in a specific product, or a group of data objects with a desired property of interest. The model learned is referred to as the training model. The previously unseen data points that need to be classified are collectively referred to as the test data set. The algorithm that creates the training model for prediction is also sometimes referred to as the learner.
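
As a concrete, purely illustrative rendering of this workflow, the sketch below fits a simple model to labeled training data and then predicts class labels for unseen test points. The nearest-centroid rule, the array names, and the toy data are assumptions made for the example; they are not methods prescribed by this chapter.

```python
import numpy as np

# Labeled training data: each row of X_train is an example, y_train holds its class label.
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.5], [9.0, 9.2]])
y_train = np.array([0, 0, 1, 1])

# "Training model": here, simply the mean (centroid) of each class.
classes = np.unique(y_train)
centroids = {c: X_train[y_train == c].mean(axis=0) for c in classes}

# Test data: previously unseen points whose labels must be estimated.
X_test = np.array([[1.2, 1.9], [8.7, 9.0]])

# Predict, for each test point, the label of the nearest class centroid.
predictions = [
    int(min(classes, key=lambda c: np.linalg.norm(x - centroids[c])))
    for x in X_test
]
print(predictions)  # [0, 1] for this toy data
```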


Notes

  1.

    The unscaled versions of the two scatter matrices are \(n p_0 p_1 S_b\) and \(n S_w\), respectively. The sum of these two matrices is the total scatter matrix, which is \(n\) times the covariance matrix (see Exercise 21).

  2.

    Maximizing \(FS(\overline{W})= \frac{\overline{W} S_b \overline{W}^T}{\overline{W} S_w \overline{W}^T}\) is the same as maximizing \(\overline{W} S_b \overline{W}^T\) subject to \(\overline{W} S_w \overline{W}^T=1\). Setting the gradient of the Lagrangian relaxation \(\overline{W} S_b \overline{W}^T - \lambda (\overline{W} S_w \overline{W}^T -1)\) to 0 yields the generalized eigenvector condition \(S_b \overline{W}^T= \lambda S_w \overline{W}^T\). Because \(S_b \overline{W}^T= (\overline{\mu_1}^T - \overline{\mu_0}^T) \left[ (\overline{\mu_1} - \overline{\mu_0}) \overline{W}^T \right]\) always points in the direction of \((\overline{\mu_1}^T - \overline{\mu_0}^T)\), it follows that \(S_w \overline{W}^T \propto \overline{\mu_1}^T - \overline{\mu_0}^T\). Therefore, we have \(\overline{W} \propto (\overline{\mu_1} - \overline{\mu_0}) S_w^{-1}\). (A short computational sketch of this direction appears after these notes.)

  3.

    Certain variations of linear models, such as \(L_1\)-regularized SVMs or Lasso (cf. Sect. 11.5.1 of Chap. 11), are particularly effective in this context. Such methods are also referred to as sparse learning methods.

  4.

    For the case where \(i=0\), the value of \(x_k^i\) is replaced by 1.

  5.

    The additional term in \(L_p\) involving \(\xi_i\) is \((C- \beta_i - \lambda_i) \xi_i\). This term evaluates to 0 because its coefficient \((C- \beta_i - \lambda_i)\) is the partial derivative of \(L_p\) with respect to \(\xi_i\), which must be 0 for optimality of \(L_p\). (The full Lagrangian is written out after these notes.)

  6.

    The original result [450] uses a more general argument to derive \(S' Q_k \Sigma_k^{-1}\) as the \(m \times k\) matrix of \(k\)-dimensional embedded coordinates of any out-of-sample \(m\times d\) matrix \(D'\). Here, \(S'=D'D^T\) is the \(m \times n\) matrix of kernel similarities between out-of-sample points in \(D'\) and in-sample points in \(D\). However, when \(D'=D\), this expression simplifies to \(Q_k \Sigma_k\), as can be seen by expanding \(S'=S \approx Q_k \Sigma_k^2 Q_k^T\). (A brief computational sketch follows these notes.)

  7.

    Refer to Sect. 19.3.4 of Chap. 19. The small eigenvectors of the symmetric Laplacian are the same as the large eigenvectors of \(S= \Lambda^{-1/2} W \Lambda^{-1/2}\). Here, \(W\) is often defined by the sparsified heat-kernel similarity between data points, and the factors involving \(\Lambda^{-1/2}\) provide local normalization of the similarity values to handle clusters of varying density. (A small sketch of this normalization appears after these notes.)

  8.

    The derivative of the sign function is replaced by only the derivative of its argument. The derivative of the sign function is zero everywhere, except at zero, where it is indeterminate.

  9.

    This approach is also referred to as leave-one-out cross-validation, and is described in detail in Sect. 10.9 on classifier evaluation. (A minimal procedural sketch appears after these notes.)

  10.

    The unscaled version may be obtained by multiplying \(S_w\) by the number of data points. Whether the scaled or unscaled version is used makes no difference to the final result, up to a constant of proportionality.
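
The following minimal sketch, added for illustration, computes the Fisher direction \(\overline{W} \propto (\overline{\mu_1} - \overline{\mu_0}) S_w^{-1}\) from Note 2. The toy data and the use of an unscaled within-class scatter matrix are assumptions for the example; as Note 10 observes, the scaling of \(S_w\) changes the result only by a constant of proportionality.

```python
import numpy as np

# Two-class toy data: rows are points, y holds binary class labels (assumed layout).
X = np.array([[1.0, 2.1], [1.4, 1.7], [2.0, 2.4], [6.1, 5.8], [6.9, 6.4], [7.4, 7.0]])
y = np.array([0, 0, 0, 1, 1, 1])

mu0 = X[y == 0].mean(axis=0)
mu1 = X[y == 1].mean(axis=0)

# Within-class scatter matrix (unscaled; the scaling affects W only up to a constant).
S_w = sum(
    (X[y == c] - X[y == c].mean(axis=0)).T @ (X[y == c] - X[y == c].mean(axis=0))
    for c in (0, 1)
)

# Fisher direction: W proportional to (mu1 - mu0) S_w^{-1}; solve rather than invert.
W = np.linalg.solve(S_w, mu1 - mu0)  # S_w is symmetric, so this matches (mu1 - mu0) S_w^{-1}
W = W / np.linalg.norm(W)
print(W)
```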
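
For reference, the soft-margin primal Lagrangian alluded to in Note 5 can be written out as below. This is the standard formulation, stated here under the assumption that \(\lambda_i\) multiplies the margin constraint and \(\beta_i\) multiplies the constraint \(\xi_i \geq 0\), consistent with the notation of the note:

\[
L_p = \frac{1}{2}\|\overline{W}\|^2 + C\sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \lambda_i \left[ y_i\,(\overline{W}\cdot\overline{X_i} + b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i .
\]

Collecting the terms that involve \(\xi_i\) gives \((C - \lambda_i - \beta_i)\,\xi_i\), and \(\partial L_p/\partial \xi_i = C - \lambda_i - \beta_i\); setting this to 0 at the optimum makes the term vanish, exactly as stated in Note 5.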
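
The out-of-sample embedding of Note 6 can be sketched as follows. The Gaussian kernel, the random example matrices, and the variable names are assumptions made for illustration; the essential steps are the rank-\(k\) eigendecomposition \(S \approx Q_k \Sigma_k^2 Q_k^T\) of the in-sample kernel matrix and the projection \(S' Q_k \Sigma_k^{-1}\) of new points.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # Pairwise Gaussian kernel similarities between rows of A and rows of B (assumed kernel).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 3))      # in-sample points (n x d)
D_new = rng.normal(size=(5, 3))   # out-of-sample points (m x d)
k = 2                             # embedding dimensionality

# In-sample kernel matrix S, approximated by its top-k eigenpairs: S ~ Q_k Sigma_k^2 Q_k^T.
S = rbf_kernel(D, D)
eigvals, eigvecs = np.linalg.eigh(S)            # eigenvalues in ascending order
Q_k = eigvecs[:, -k:]                           # top-k eigenvectors
Sigma_k = np.sqrt(np.maximum(eigvals[-k:], 0))  # singular values (square roots of eigenvalues)

# In-sample embedding: Q_k Sigma_k; out-of-sample embedding: S' Q_k Sigma_k^{-1}.
in_sample_embedding = Q_k * Sigma_k
S_prime = rbf_kernel(D_new, D)
out_of_sample_embedding = S_prime @ Q_k / Sigma_k
print(out_of_sample_embedding.shape)  # (5, 2)
```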
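
Note 7's locally normalized similarity matrix \(S = \Lambda^{-1/2} W \Lambda^{-1/2}\) can be formed as in the sketch below. The heat-kernel bandwidth, the nearest-neighbor sparsification level, and the data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))   # toy data points (assumed)
sigma, knn = 1.0, 5            # heat-kernel bandwidth and sparsification level (assumed)

# Heat-kernel similarities, sparsified by keeping each point's knn strongest links.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / (2 * sigma ** 2))
np.fill_diagonal(W, 0.0)
keep = np.zeros_like(W, dtype=bool)
keep[np.arange(len(X))[:, None], np.argsort(W, axis=1)[:, -knn:]] = True
W = np.where(keep | keep.T, W, 0.0)   # retain a link if either endpoint selected it

# Local normalization: S = Lambda^{-1/2} W Lambda^{-1/2}, with Lambda the diagonal degree matrix.
deg = W.sum(axis=1)
S = W / np.sqrt(np.outer(deg, deg))

# The largest eigenvectors of S are the smallest eigenvectors of the symmetric Laplacian I - S.
eigvals, eigvecs = np.linalg.eigh(S)
embedding = eigvecs[:, -3:]           # e.g., use the three largest eigenvectors as features
print(eigvals[-3:], embedding.shape)
```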
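
Finally, the leave-one-out procedure mentioned in Note 9 is sketched below, using a 1-nearest-neighbor rule purely as a placeholder classifier; the classifier choice and the data are assumptions for this example. Each point is held out in turn, the model is built on the remaining points, and the held-out label is predicted.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (15, 2)), rng.normal(4, 1, (15, 2))])
y = np.array([0] * 15 + [1] * 15)

correct = 0
for i in range(len(X)):
    # Hold out point i; "train" on all remaining points.
    mask = np.arange(len(X)) != i
    X_train, y_train = X[mask], y[mask]
    # Placeholder 1-nearest-neighbor prediction for the held-out point.
    nearest = np.argmin(np.linalg.norm(X_train - X[i], axis=1))
    correct += (y_train[nearest] == y[i])

print(f"Leave-one-out accuracy: {correct / len(X):.2f}")
```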

Author information

Correspondence to Charu C. Aggarwal.


Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Aggarwal, C. (2015). Data Classification. In: Data Mining. Springer, Cham. https://doi.org/10.1007/978-3-319-14142-8_10

  • DOI: https://doi.org/10.1007/978-3-319-14142-8_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-14141-1

  • Online ISBN: 978-3-319-14142-8

  • eBook Packages: Computer Science, Computer Science (R0)
