Linear Classification and Regression for Text



Linear models for classification and regression express the dependent variable (or class variable) as a linear function of the independent variables (or feature variables). Specifically, consider the case in which \(y_i\) is the dependent variable of the ith document, and \(\overline{X_{i}} = (x_{i1}\ldots x_{id})\) is the d-dimensional feature vector of this document. In the case of text, these feature variables correspond to the term frequencies over a lexicon of d terms. The value of \(y_i\) is a numerical quantity in the case of regression, and it is a binary value drawn from \(\{-1, +1\}\) in the case of classification.
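As a minimal sketch of this setup, the following code fits a linear function of term-frequency features by regularized least squares and uses its sign for binary classification. The toy documents, labels, and the regularization weight are illustrative assumptions, not data from the text:

```python
import numpy as np

# Toy term-frequency matrix: 4 documents over a lexicon of d = 5 terms.
X = np.array([
    [2.0, 1.0, 0.0, 0.0, 1.0],
    [1.0, 2.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 3.0, 1.0, 1.0],
    [0.0, 1.0, 2.0, 2.0, 0.0],
])
y = np.array([1.0, 1.0, -1.0, -1.0])  # class labels drawn from {-1, +1}

# Fit the linear model y ~ X.dot(w) by regularized (Tikhonov) least
# squares; the ridge term lam * I keeps X^T X + lam * I invertible.
lam = 0.1
d = X.shape[1]
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Regression would report the raw scores X @ w; classification
# thresholds them at zero via the sign function.
predictions = np.sign(X @ w)
```

With a numerical \(y_i\), the same solve yields a regression model; only the final sign-thresholding step distinguishes the two cases.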



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. IBM T. J. Watson Research Center, Yorktown Heights, USA
