Linear Classification and Regression for Text


Abstract

Linear models for classification and regression express the dependent variable (or class variable) as a linear function of the independent variables (or feature variables). Specifically, consider the case in which \(y_{i}\) is the dependent variable of the ith document, and \(\overline{X_{i}} = (x_{i1}\ldots x_{id})\) are the d-dimensional feature variables of this document. In the case of text, these feature variables correspond to the term frequencies of a lexicon with d terms. The value of \(y_{i}\) is a numerical quantity in the case of regression, and it is a binary value drawn from \(\{-1,+1\}\) in the case of classification.
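
As a concrete illustration of this setup (a minimal sketch rather than the chapter's specific method), the following Python fragment builds term-frequency features for a small hypothetical corpus and fits a regularized linear classifier with labels drawn from \(\{-1,+1\}\); the toy corpus, the scikit-learn estimator, and all parameter values are illustrative assumptions.

    # Minimal sketch: term-frequency features with a regularized linear classifier.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import RidgeClassifier

    # Hypothetical toy corpus with binary labels in {-1, +1}.
    docs = ["good service and fast delivery",
            "terrible quality and very slow delivery",
            "excellent quality and good price",
            "slow service and terrible price"]
    labels = [+1, -1, +1, -1]

    # Each document becomes a d-dimensional vector of term frequencies.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    # The class label is predicted from a (thresholded) linear function of the features.
    model = RidgeClassifier(alpha=1.0)
    model.fit(X, labels)
    print(model.predict(vectorizer.transform(["good price and fast delivery"])))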


Notes

  1.

    When using kernel methods, it is customary to add a small constant amount to every entry in the similarity matrix between points to account for the effect of the dummy variable representing the bias term [319] (see Exercise 5).
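
    As a rough illustration of this convention (a sketch assuming a linear kernel on toy data, not the solution to Exercise 5), adding a constant c to every kernel entry is equivalent to appending a constant feature of value \(\sqrt{c}\) to the underlying representation, which plays the role of the dummy bias variable:

        # Sketch: adding a constant c to every kernel entry corresponds to an
        # extra constant feature sqrt(c) in the (implicit) feature map.
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.standard_normal((5, 3))      # 5 toy points with 3 features
        K = X @ X.T                          # linear kernel (similarity) matrix

        c = 1.0
        K_bias = K + c                       # constant added to every entry

        # Equivalent view: append a dummy feature of value sqrt(c) to each point.
        X_aug = np.hstack([X, np.full((5, 1), np.sqrt(c))])
        assert np.allclose(K_bias, X_aug @ X_aug.T)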

  2.

    The notions of scatter and variance differ only in terms of scaling: the scatter of a set of n values is equal to n times their variance. Therefore, up to a constant of proportionality, it does not matter whether the scatter or the variance is used.
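
    In symbols, for values \(x_{1}\ldots x_{n}\) with mean \(\mu \) (assuming the population variance is the intended convention):

        % Scatter equals n times the (population) variance
        \sum_{i=1}^{n} (x_i - \mu)^2
        \;=\; n \cdot \underbrace{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}_{\text{variance}}
        \;=\; n\,\sigma^2 .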

  3.

    This two-class variant of the scatter matrix \(S_{b}\) is not exactly the same as the multi-class version of \(S_{b}\) defined in Sect. 6.2.3.1. Nevertheless, all entries in the two matrices are related by a proportionality factor of \(\frac{n_{1}\cdot n_{0}} {n^{2}}\), which turns out to be inconsequential to the direction of the Fisher discriminant. In other words, using the multi-class formulas of Sect. 6.2.3.1 yields the same result in the binary case.

  4.

    Note that this matrix is different from the one introduced for the two-class case only by a proportionality factor, which does not affect the final solution.

  5.

    One can also show a more general equivalence by allowing for a bias term.

  6.

    The SVM generally uses the hinge loss rather than the quadratic loss. The use of quadratic loss is possible in an SVM but it is less common. This is another key difference between the Fisher discriminant and the (most common implementation of the) SVM.
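
    For reference, with labels \(y_{i} \in \{-1,+1\}\) and prediction \(\overline{W}\cdot \overline{X_{i}}\), one standard way of writing the two per-document losses (the chapter's exact notation may differ) is:

        % Hinge loss (standard SVM) versus quadratic (least-squares) loss
        L_{\text{hinge}} = \max\bigl\{0,\; 1 - y_i\,(\overline{W}\cdot\overline{X_i})\bigr\},
        \qquad
        L_{\text{quad}} = \bigl(y_i - \overline{W}\cdot\overline{X_i}\bigr)^{2}
                        = \bigl(1 - y_i\,(\overline{W}\cdot\overline{X_i})\bigr)^{2}.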

  7.

    http://mathworld.wolfram.com/Point-PlaneDistance.html.

  8.

    On the surface, these steps look different from those in [444]. However, they are mathematically identical; the objective function simply uses a different parametrization and notation. The parameter λ in [444] is equivalent to 1∕(n ⋅ C) in this book.
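
    Concretely, using the commonly quoted parametrizations (an illustrative sketch; the precise constants should be matched against [444] and the chapter's formulation), the two objectives differ only by the positive scaling factor 1∕(n ⋅ C), which does not change the optimal solution:

        % lambda-parametrized (Pegasos-style) objective versus C-parametrized SVM objective
        \frac{\lambda}{2}\,\Vert\overline{W}\Vert^{2}
          + \frac{1}{n}\sum_{i=1}^{n}\max\{0,\,1 - y_i\,(\overline{W}\cdot\overline{X_i})\}
        \;=\;
        \frac{1}{nC}\Bigl[\tfrac{1}{2}\,\Vert\overline{W}\Vert^{2}
          + C\sum_{i=1}^{n}\max\{0,\,1 - y_i\,(\overline{W}\cdot\overline{X_i})\}\Bigr]
        \qquad \text{when } \lambda = \tfrac{1}{nC}.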

  9.

    When a sparse vector \(\overline{a}\) is added to a dense vector \(\overline{b}\), the change in the squared norm of \(\overline{b}\) is \(\vert \vert \overline{a}\vert \vert ^{2} + 2\overline{a} \cdot \overline{b}\). This can be computed in time proportional to the number of nonzero entries in the sparse vector \(\overline{a}\).
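
    A small sketch of this bookkeeping (hypothetical function and variable names; the sparse vector is stored as an index-to-value dictionary):

        # Maintain ||b||^2 incrementally when a sparse vector a is added to a
        # dense vector b; the cost is proportional to the nonzeros of a.
        def add_sparse_and_update_norm(b, b_sq_norm, a_sparse):
            """b: dense list of floats; b_sq_norm: current ||b||^2;
            a_sparse: dict {index: value} holding the nonzero entries of a."""
            delta = 0.0
            for j, a_j in a_sparse.items():
                delta += a_j * a_j + 2.0 * a_j * b[j]   # accumulates ||a||^2 + 2 a.b
                b[j] += a_j                             # apply the update to b itself
            return b_sq_norm + delta

        b = [1.0, 0.0, 2.0, 0.0]
        sq = sum(v * v for v in b)
        sq = add_sparse_and_update_norm(b, sq, {0: 3.0, 3: -1.0})
        assert abs(sq - sum(v * v for v in b)) < 1e-12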

  10.

    We say “roughly” because we are ignoring the data-independent term \(\sum _{i=1}^{n}\alpha _{i}\).

  11.

    It has been shown [383] how one can derive heuristic probability estimates with an SVM.

  12.

    Regularization is equivalent to assuming that the parameters in \(\overline{W}\) are drawn from a Gaussian prior; incorporating this prior assumption adds the penalty term \(\lambda \vert \vert \overline{W}\vert \vert ^{2}/2\) to the negative log-likelihood that is minimized.
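
    Written out for logistic regression with labels \(y_{i} \in \{-1,+1\}\) (one standard form; the chapter's notation may differ), the regularized objective to be minimized becomes:

        % Negative log-likelihood plus the Gaussian-prior (ridge) penalty
        J(\overline{W}) \;=\; \sum_{i=1}^{n}\log\Bigl(1 + \exp\bigl(-y_i\,(\overline{W}\cdot\overline{X_i})\bigr)\Bigr)
        \;+\; \frac{\lambda}{2}\,\Vert\overline{W}\Vert^{2}.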

  13.

    The data will typically be rotated and reflected in particular directions.

  14.

    Strictly speaking, the transformation \(\varPhi (\overline{X})\) would need to be infinite-dimensional to adequately represent the universe of all possible data points for Gaussian kernels. However, the relative positions of n points (and the origin) in any dimensionality can always be projected onto an n-dimensional plane, just as a set of two 3-dimensional points (together with the origin) can always be projected onto a 2-dimensional plane. The eigenvectors of the n × n similarity matrix of these points provide precisely this projection, which is referred to as the data-specific Mercer kernel map. Therefore, even though one often hears of the impossibility of extracting infinite-dimensional points from a Gaussian kernel, this makes the nature of the transformation sound more abstract and intractable than it really is as a practical matter. The reality is that we can always work with the data-specific n-dimensional transformation. As long as the similarity matrix is positive semi-definite, a finite-dimensional transformation always exists for a finite data set, and it is adequate for the learning algorithm. We use the notation \(\varPhi _{s}(\cdot )\) instead of \(\varPhi (\cdot )\) to indicate that this is a data-specific transformation.
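
    A compact sketch of this data-specific map (toy data, an assumed Gaussian-kernel bandwidth, and NumPy only; not the book's implementation): the rows of \(V\sqrt{\varLambda }\) obtained from the eigendecomposition of the n × n kernel matrix reproduce all pairwise similarities as ordinary dot products.

        # Data-specific Mercer kernel map: embed n points into (at most) n
        # dimensions using the eigendecomposition of their kernel matrix.
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.standard_normal((6, 4))               # toy data: n = 6 points
        gamma = 0.5                                   # assumed Gaussian-kernel bandwidth

        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        K = np.exp(-gamma * sq_dists)                 # n x n Gaussian kernel matrix

        lam, V = np.linalg.eigh(K)                    # K = V diag(lam) V^T (symmetric PSD)
        lam = np.clip(lam, 0.0, None)                 # guard against tiny negative roundoff

        Phi_s = V * np.sqrt(lam)                      # rows: n-dimensional embeddings
        assert np.allclose(Phi_s @ Phi_s.T, K, atol=1e-6)   # dot products recover K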

  15.

    When all entries in the kernel matrix are nonnegative, it means that all pairwise angles between points are at most 90°. One can always reflect the points into the nonnegative orthant without loss of generality.

  16.

    Suppose one has \(p_{1}\ldots p_{t}\) different possibilities for t different parameters. One now has to evaluate the algorithm at each of the \(p_{1}\times p_{2}\times \cdots \times p_{t}\) combinations of possibilities over a held-out set.
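
    The combinatorial growth referred to here is simply that of an exhaustive grid; a minimal sketch (hypothetical parameter grid and a stubbed held-out evaluation function):

        # Grid search over t parameters with p_1, ..., p_t candidate values each:
        # every combination is scored on a held-out set.
        from itertools import product

        param_grid = {                          # hypothetical grid: p_1 = 3, p_2 = 2
            "C": [0.1, 1.0, 10.0],
            "kernel_width": [0.5, 1.0],
        }

        def heldout_accuracy(params):           # placeholder: train on the training split
            return -abs(params["C"] - 1.0)      # and score on the held-out split (stubbed)

        names = list(param_grid)
        candidates = (dict(zip(names, combo)) for combo in product(*param_grid.values()))
        best = max(candidates, key=heldout_accuracy)   # p_1 * p_2 * ... * p_t evaluations
        print(best)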

Bibliography

  1. C. M. Bishop. Pattern recognition and machine learning. Springer, 2007.

  2. C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.

  3. C. Burges. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), pp. 121–167, 1998.

  4. S. Chakrabarti, S. Roy, and M. Soundalgekar. Fast and accurate text classification via multiple linear discriminant projections. The VLDB Journal, 12(2), pp. 170–185, 2003.

  5. C. Chang and C. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27, 2011. http://www.csie.ntu.edu.tw/~cjlin/libsvm/

  6. Y. Chang, C. Hsieh, K. Chang, M. Ringgaard, and C. J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11, pp. 1471–1490, 2010.

  7. O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19(5), pp. 1155–1178, 2007.

  8. T. Cooke. Two variations on Fisher’s linear discriminant for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), pp. 268–273, 2002.

  9. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3), pp. 273–297, 1995.

  10. N. Cristianini, and J. Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.

  11. N. Draper and H. Smith. Applied regression analysis. John Wiley & Sons, 2014.

  12. H. Drucker, C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support Vector Regression Machines. NIPS Conference, 1997.

  13. S. Dumais. Latent semantic indexing (LSI) and TREC-2. Text Retrieval Conference (TREC), pp. 105–115, 1993.

  14. S. Dumais. Latent semantic indexing (LSI): TREC-3 Report. Text Retrieval Conference (TREC), pp. 219–230, 1995.

  15. B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2), pp. 407–499, 2004.

  16. R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, pp. 1871–1874, 2008. http://www.csie.ntu.edu.tw/~cjlin/liblinear/

  17. R. Fan, P. Chen, and C. Lin. Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6, pp. 1889–1918, 2005.

  18. R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, pp. 179–188, 1936.

  19. G. Fung and O. Mangasarian. Proximal support vector classifiers. ACM KDD Conference, pp. 77–86, 2001.

  20. F. Girosi and T. Poggio. Networks and the best approximation property. Biological Cybernetics, 63(3), pp. 169–176, 1990.

  21. T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015.

  22. T. Hastie and R. Tibshirani. Generalized additive models. CRC Press, 1990.

  23. G. Hinton. Connectionist learning procedures. Artificial Intelligence, 40(1–3), pp. 185–234, 1989.

  24. T. Joachims. Making Large scale SVMs practical. Advances in Kernel Methods, Support Vector Learning, pp. 169–184, MIT Press, Cambridge, 1998.

  25. T. Joachims. Training Linear SVMs in Linear Time. ACM KDD Conference, pp. 217–226, 2006.

  26. I. T. Jolliffe. Principal component analysis. John Wiley & Sons, 2002.

  27. I. T. Jolliffe. A note on the use of principal components in regression. Applied Statistics, 31(3), pp. 300–303, 1982.

  28. A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis. kernlab – An S4 Package for Kernel Methods in R. Journal of Statistical Software, 11(9), 2004. http://epub.wu.ac.at/1048/1/document.pdf http://CRAN.R-project.org/package=kernlab

  29. M. Kuhn. Building predictive models in R Using the caret Package. Journal of Statistical Software, 28(5), pp. 1–26, 2008. https://cran.r-project.org/web/packages/caret/index.html

  30. H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2, pp. 419–444, 2002.

  31. O. Mangasarian and D. Musicant. Successive overrelaxation for support vector machines. IEEE Transactions on Neural Networks, 10(5), pp. 1032–1037, 1999.

  32. P. McCullagh and J. Nelder. Generalized linear models. CRC Press, 1989.

  33. G. McLachlan. Discriminant analysis and statistical pattern recognition. John Wiley & Sons, 2004.

  34. S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K. Müller. Fisher discriminant analysis with kernels. NIPS Conference, 1999.

  35. K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. IJCAI Workshop on Machine Learning for Information Filtering, pp. 61–67, 1999.

  36. E. Osuna, R. Freund, and F. Girosi. Improved training algorithm for support vector machines. IEEE Workshop on Neural Networks and Signal Processing, 1997.

  37. J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Advances in Kernel Methods: Support Vector Learning, MIT Press, pp. 185–208, 1998.

  38. J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), pp. 61–74, 1999.

  39. R. Rifkin. Everything old is new again: a fresh look at historical approaches in machine learning. Ph.D. Thesis, Massachusetts Institute of Technology, 2002. http://cbcl.mit.edu/projects/cbcl/publications/theses/thesis-rifkin.pdf

  40. S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1), pp. 3–30, 2011.

  41. A. Shashua. On the equivalence between the support vector machine for classification and sparsified Fisher’s linear discriminant. Neural Processing Letters, 9(2), pp. 129–139, 1999.

  42. J. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3), pp. 293–300, 1999.

  43. A. Tikhonov and V. Arsenin. Solution of ill-posed problems. Winston and Sons, 1977.

  44. V. Vapnik. The nature of statistical learning theory. Springer, 2000.

  45. G. Wahba. Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. Advances in Kernel Methods-Support Vector Learning, 6, pp. 69–87, 1999.

  46. J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, May, 1998.

  47. B. Widrow and M. Hoff. Adaptive switching circuits. IRE WESCON Convention Record, 4(1), pp. 96–104, 1960.

  48. Y. Yang. Noise reduction in a statistical approach to text categorization. ACM SIGIR Conference, pp. 256–263, 1995.

  49. Y. Yang and C. Chute. An application of least squares fit mapping to text information retrieval. ACM SIGIR Conference, pp. 281–290, 1993.

  50. Y. Yang and X. Liu. A re-examination of text categorization methods. ACM SIGIR Conference, pp. 42–49, 1999.

  51. H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Stat. Methodology), 67(2), pp. 301–320, 2005.

  52. http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

  53. https://cran.r-project.org/web/packages/tm/

  54. http://www.cs.waikato.ac.nz/ml/weka/

  55. https://cran.r-project.org/web/packages/RTextTools/RTextTools.pdf

  56. http://mallet.cs.umass.edu/

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Aggarwal, C.C. (2018). Linear Classification and Regression for Text. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_6

  • DOI: https://doi.org/10.1007/978-3-319-73531-3_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73530-6

  • Online ISBN: 978-3-319-73531-3
