Linear Classification and Regression for Text


Abstract

Linear models for classification and regression express the dependent variable (or class variable) as a linear function of the independent variables (or feature variables). Specifically, consider the case in which \(y_{i}\) is the dependent variable of the ith document, and \(\overline{X_{i}} = (x_{i1}\ldots x_{id})\) are the d-dimensional feature variables of this document. In the case of text, these feature variables correspond to the term frequencies of a lexicon with d terms. The value of \(y_{i}\) is a numerical quantity in the case of regression, and it is a binary value drawn from \(\{-1,+1\}\) in the case of classification.
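
As a concrete illustration of this setup (a minimal sketch rather than the chapter's specific method), the following Python fragment builds term-frequency features for a small hypothetical corpus and fits a regularized linear classifier with labels drawn from \(\{-1,+1\}\); the toy corpus, the scikit-learn estimator, and all parameter values are illustrative assumptions.

    # Minimal sketch: term-frequency features with a regularized linear classifier.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import RidgeClassifier

    # Hypothetical toy corpus with binary labels in {-1, +1}.
    docs = ["good service and fast delivery",
            "terrible quality and very slow delivery",
            "excellent quality and good price",
            "slow service and terrible price"]
    labels = [+1, -1, +1, -1]

    # Each document becomes a d-dimensional vector of term frequencies.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    # The class label is predicted from a (thresholded) linear function of the features.
    model = RidgeClassifier(alpha=1.0)
    model.fit(X, labels)
    print(model.predict(vectorizer.transform(["good price and fast delivery"])))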


Notes

  1.

    When using kernel methods, it is customary to add a small constant amount to every entry in the similarity matrix between points to account for the effect of the dummy variable representing the bias term [319] (see Exercise 5).
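
    As a rough illustration of this convention (a sketch assuming a linear kernel on toy data, not the solution to Exercise 5), adding a constant c to every kernel entry is equivalent to appending a constant feature of value \(\sqrt{c}\) to the underlying representation, which plays the role of the dummy bias variable:

        # Sketch: adding a constant c to every kernel entry corresponds to an
        # extra constant feature sqrt(c) in the (implicit) feature map.
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.standard_normal((5, 3))      # 5 toy points with 3 features
        K = X @ X.T                          # linear kernel (similarity) matrix

        c = 1.0
        K_bias = K + c                       # constant added to every entry

        # Equivalent view: append a dummy feature of value sqrt(c) to each point.
        X_aug = np.hstack([X, np.full((5, 1), np.sqrt(c))])
        assert np.allclose(K_bias, X_aug @ X_aug.T)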

  2.

    The notions of scatter and variance differ only in terms of scaling: the scatter of a set of n values is equal to n times their variance. Therefore, up to a constant of proportionality, it does not matter whether the scatter or the variance is used.
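
    In symbols, for values \(x_{1}\ldots x_{n}\) with mean \(\mu \) (assuming the population variance is the intended convention):

        % Scatter equals n times the (population) variance
        \sum_{i=1}^{n} (x_i - \mu)^2
        \;=\; n \cdot \underbrace{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}_{\text{variance}}
        \;=\; n\,\sigma^2 .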

  3.

    This two-class variant of the scatter matrix \(S_{b}\) is not exactly the same as the multi-class version of \(S_{b}\) defined in Sect. 6.2.3.1. Nevertheless, all entries in the two matrices are related by a proportionality factor of \(\frac{n_{1}\cdot n_{0}} {n^{2}}\), which turns out to be inconsequential to the direction of the Fisher discriminant. In other words, using the multi-class formulas of Sect. 6.2.3.1 yields the same result in the binary case.

  4.

    Note that this matrix is different from the one introduced for the two-class case only by a proportionality factor, which does not affect the final solution.

  5.

    One can also show a more general equivalence by allowing for a bias term.

  6.

    The SVM generally uses the hinge loss rather than the quadratic loss. The use of quadratic loss is possible in an SVM but it is less common. This is another key difference between the Fisher discriminant and the (most common implementation of the) SVM.
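
    For reference, with labels \(y_{i} \in \{-1,+1\}\) and prediction \(\overline{W}\cdot \overline{X_{i}}\), one standard way of writing the two per-document losses (the chapter's exact notation may differ) is:

        % Hinge loss (standard SVM) versus quadratic (least-squares) loss
        L_{\text{hinge}} = \max\bigl\{0,\; 1 - y_i\,(\overline{W}\cdot\overline{X_i})\bigr\},
        \qquad
        L_{\text{quad}} = \bigl(y_i - \overline{W}\cdot\overline{X_i}\bigr)^{2}
                        = \bigl(1 - y_i\,(\overline{W}\cdot\overline{X_i})\bigr)^{2}.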

  7.

    http://mathworld.wolfram.com/Point-PlaneDistance.html.

  8.

    On the surface, these steps look different from those in [444]. However, they are mathematically identical; the objective function simply uses a different parametrization and notation. The parameter λ in [444] is equivalent to 1∕(n ⋅ C) in this book.
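
    Concretely, using the commonly quoted parametrizations (an illustrative sketch; the precise constants should be matched against [444] and the chapter's formulation), the two objectives differ only by the positive scaling factor 1∕(n ⋅ C), which does not change the optimal solution:

        % lambda-parametrized (Pegasos-style) objective versus C-parametrized SVM objective
        \frac{\lambda}{2}\,\Vert\overline{W}\Vert^{2}
          + \frac{1}{n}\sum_{i=1}^{n}\max\{0,\,1 - y_i\,(\overline{W}\cdot\overline{X_i})\}
        \;=\;
        \frac{1}{nC}\Bigl[\tfrac{1}{2}\,\Vert\overline{W}\Vert^{2}
          + C\sum_{i=1}^{n}\max\{0,\,1 - y_i\,(\overline{W}\cdot\overline{X_i})\}\Bigr]
        \qquad \text{when } \lambda = \tfrac{1}{nC}.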

  9.

    When a sparse vector \(\overline{a}\) is added to a dense vector \(\overline{b}\), the change in the squared norm of \(\overline{b}\) is \(\vert \vert \overline{a}\vert \vert ^{2} + 2\overline{a} \cdot \overline{b}\). This can be computed in time proportional to the number of nonzero entries in the sparse vector \(\overline{a}\).
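
    A small sketch of this bookkeeping (hypothetical function and variable names; the sparse vector is stored as an index-to-value dictionary):

        # Maintain ||b||^2 incrementally when a sparse vector a is added to a
        # dense vector b; the cost is proportional to the nonzeros of a.
        def add_sparse_and_update_norm(b, b_sq_norm, a_sparse):
            """b: dense list of floats; b_sq_norm: current ||b||^2;
            a_sparse: dict {index: value} holding the nonzero entries of a."""
            delta = 0.0
            for j, a_j in a_sparse.items():
                delta += a_j * a_j + 2.0 * a_j * b[j]   # accumulates ||a||^2 + 2 a.b
                b[j] += a_j                             # apply the update to b itself
            return b_sq_norm + delta

        b = [1.0, 0.0, 2.0, 0.0]
        sq = sum(v * v for v in b)
        sq = add_sparse_and_update_norm(b, sq, {0: 3.0, 3: -1.0})
        assert abs(sq - sum(v * v for v in b)) < 1e-12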

  10.

    We say “roughly” because we are ignoring the data-independent term \(\sum _{i=1}^{n}\alpha _{i}\).

  11.

    It has been shown [383] how one can derive heuristic probability estimates with an SVM.

  12.

    Regularization is equivalent to assuming that the parameters in \(\overline{W}\) are drawn from a Gaussian prior; incorporating this prior assumption adds the penalty term \(\lambda \vert \vert \overline{W}\vert \vert ^{2}/2\) to the negative log-likelihood that is minimized.
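
    Written out for logistic regression with labels \(y_{i} \in \{-1,+1\}\) (one standard form; the chapter's notation may differ), the regularized objective to be minimized becomes:

        % Negative log-likelihood plus the Gaussian-prior (ridge) penalty
        J(\overline{W}) \;=\; \sum_{i=1}^{n}\log\Bigl(1 + \exp\bigl(-y_i\,(\overline{W}\cdot\overline{X_i})\bigr)\Bigr)
        \;+\; \frac{\lambda}{2}\,\Vert\overline{W}\Vert^{2}.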

  13.

    The data will typically be rotated and reflected in particular directions.

  14.

    Strictly speaking, the transformation \(\varPhi (\overline{X})\) would need to be infinite-dimensional to adequately represent the universe of all possible data points for Gaussian kernels. However, the relative positions of n points (and the origin) in any dimensionality can always be projected onto an n-dimensional plane, just as a set of two 3-dimensional points (together with the origin) can always be projected onto a 2-dimensional plane. The eigenvectors of the n × n similarity matrix of these points provide precisely this projection, which is referred to as the data-specific Mercer kernel map. Therefore, even though one often hears of the impossibility of extracting infinite-dimensional points from a Gaussian kernel, this makes the nature of the transformation sound more abstract and intractable than it really is as a practical matter. The reality is that we can always work with the data-specific n-dimensional transformation. As long as the similarity matrix is positive semi-definite, a finite-dimensional transformation always exists for a finite data set, and it is adequate for the learning algorithm. We use the notation \(\varPhi _{s}(\cdot )\) instead of \(\varPhi (\cdot )\) to indicate that this is a data-specific transformation.
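
    A compact sketch of this data-specific map (toy data, an assumed Gaussian-kernel bandwidth, and NumPy only; not the book's implementation): the rows of \(V\sqrt{\varLambda }\) obtained from the eigendecomposition of the n × n kernel matrix reproduce all pairwise similarities as ordinary dot products.

        # Data-specific Mercer kernel map: embed n points into (at most) n
        # dimensions using the eigendecomposition of their kernel matrix.
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.standard_normal((6, 4))               # toy data: n = 6 points
        gamma = 0.5                                   # assumed Gaussian-kernel bandwidth

        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        K = np.exp(-gamma * sq_dists)                 # n x n Gaussian kernel matrix

        lam, V = np.linalg.eigh(K)                    # K = V diag(lam) V^T (symmetric PSD)
        lam = np.clip(lam, 0.0, None)                 # guard against tiny negative roundoff

        Phi_s = V * np.sqrt(lam)                      # rows: n-dimensional embeddings
        assert np.allclose(Phi_s @ Phi_s.T, K, atol=1e-6)   # dot products recover K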

  15.

    When all entries in the kernel matrix are nonnegative, it means that all pairwise angles between points are at most 90°. One can always reflect the points into the nonnegative orthant without loss of generality.

  16.

    Suppose one has \(p_{1}\ldots p_{t}\) different possibilities for t different parameters. One now has to evaluate the algorithm at each of the \(p_{1}\times p_{2}\times \cdots \times p_{t}\) combinations of possibilities over a held-out set.
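
    The combinatorial growth referred to here is simply that of an exhaustive grid; a minimal sketch (hypothetical parameter grid and a stubbed held-out evaluation function):

        # Grid search over t parameters with p_1, ..., p_t candidate values each:
        # every combination is scored on a held-out set.
        from itertools import product

        param_grid = {                          # hypothetical grid: p_1 = 3, p_2 = 2
            "C": [0.1, 1.0, 10.0],
            "kernel_width": [0.5, 1.0],
        }

        def heldout_accuracy(params):           # placeholder: train on the training split
            return -abs(params["C"] - 1.0)      # and score on the held-out split (stubbed)

        names = list(param_grid)
        candidates = (dict(zip(names, combo)) for combo in product(*param_grid.values()))
        best = max(candidates, key=heldout_accuracy)   # p_1 * p_2 * ... * p_t evaluations
        print(best)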

Bibliography

  1. C. M. Bishop. Pattern recognition and machine learning. Springer, 2007.

  2. C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.

  3. C. Burges. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), pp. 121–167, 1998.

  4. S. Chakrabarti, S. Roy, and M. Soundalgekar. Fast and accurate text classification via multiple linear discriminant projections. The VLDB Journal, 12(2), pp. 170–185, 2003.

  5. C. Chang and C. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27, 2011. http://www.csie.ntu.edu.tw/~cjlin/libsvm/

  6. Y. Chang, C. Hsieh, K. Chang, M. Ringgaard, and C. J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11, pp. 1471–1490, 2010.

  7. O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19(5), pp. 1155–1178, 2007.

  8. T. Cooke. Two variations on Fisher’s linear discriminant for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), pp. 268–273, 2002.

  9. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3), pp. 273–297, 1995.

  10. N. Cristianini, and J. Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.

  11. N. Draper and H. Smith. Applied regression analysis. John Wiley & Sons, 2014.

  12. H. Drucker, C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support Vector Regression Machines. NIPS Conference, 1997.

  13. S. Dumais. Latent semantic indexing (LSI) and TREC-2. Text Retrieval Conference (TREC), pp. 105–115, 1993.

  14. S. Dumais. Latent semantic indexing (LSI): TREC-3 Report. Text Retrieval Conference (TREC), pp. 219–230, 1995.

  15. B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2), pp. 407–499, 2004.

  16. R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, pp. 1871–1874, 2008. http://www.csie.ntu.edu.tw/~cjlin/liblinear/

  17. R. Fan, P. Chen, and C. Lin. Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6, pp. 1889–1918, 2005.

  18. R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, pp. 179–188, 1936.

  19. G. Fung and O. Mangasarian. Proximal support vector classifiers. ACM KDD Conference, pp. 77–86, 2001.

  20. F. Girosi and T. Poggio. Networks and the best approximation property. Biological Cybernetics, 63(3), pp. 169–176, 1990.

  21. T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015.

  22. T. Hastie and R. Tibshirani. Generalized additive models. CRC Press, 1990.

  23. G. Hinton. Connectionist learning procedures. Artificial Intelligence, 40(1–3), pp. 185–234, 1989.

  24. T. Joachims. Making Large scale SVMs practical. Advances in Kernel Methods, Support Vector Learning, pp. 169–184, MIT Press, Cambridge, 1998.

  25. T. Joachims. Training Linear SVMs in Linear Time. ACM KDD Conference, pp. 217–226, 2006.

  26. I. T. Jolliffe. Principal component analysis. John Wiley & Sons, 2002.

  27. I. T. Jolliffe. A note on the use of principal components in regression. Applied Statistics, 31(3), pp. 300–303, 1982.

  28. A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis. kernlab – An S4 Package for Kernel Methods in R. Journal of Statistical Software, 11(9), 2004. http://epub.wu.ac.at/1048/1/document.pdf http://CRAN.R-project.org/package=kernlab

  29. M. Kuhn. Building predictive models in R Using the caret Package. Journal of Statistical Software, 28(5), pp. 1–26, 2008. https://cran.r-project.org/web/packages/caret/index.html

  30. H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2, pp. 419–444, 2002.

  31. O. Mangasarian and D. Musicant. Successive overrelaxation for support vector machines. IEEE Transactions on Neural Networks, 10(5), pp. 1032–1037, 1999.

  32. P. McCullagh and J. Nelder. Generalized linear models. CRC Press, 1989.

  33. G. McLachlan. Discriminant analysis and statistical pattern recognition. John Wiley & Sons, 2004.

  34. S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K. Müller. Fisher discriminant analysis with kernels. NIPS Conference, 1999.

  35. K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. IJCAI Workshop on Machine Learning for Information Filtering, pp. 61–67, 1999.

  36. E. Osuna, R. Freund, and F. Girosi. Improved training algorithm for support vector machines. IEEE Workshop on Neural Networks and Signal Processing, 1997.

  37. J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Advances in Kernel Methods: Support Vector Learning, MIT Press, pp. 185–208, 1998.

  38. J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), pp. 61–74, 1999.

  39. R. Rifkin. Everything old is new again: a fresh look at historical approaches in machine learning. Ph.D. Thesis, Massachusetts Institute of Technology, 2002. http://cbcl.mit.edu/projects/cbcl/publications/theses/thesis-rifkin.pdf

  40. S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1), pp. 3–30, 2011.

  41. A. Shashua. On the equivalence between the support vector machine for classification and sparsified Fisher’s linear discriminant. Neural Processing Letters, 9(2), pp. 129–139, 1999.

  42. J. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3), pp. 293–300, 1999.

  43. A. Tikhonov and V. Arsenin. Solution of ill-posed problems. Winston and Sons, 1977.

  44. V. Vapnik. The nature of statistical learning theory. Springer, 2000.

  45. G. Wahba. Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. Advances in Kernel Methods-Support Vector Learning, 6, pp. 69–87, 1999.

  46. J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, May, 1998.

  47. B. Widrow and M. Hoff. Adaptive switching circuits. IRE WESCON Convention Record, 4(1), pp. 96–104, 1960.

  48. Y. Yang. Noise reduction in a statistical approach to text categorization. ACM SIGIR Conference, pp. 256–263, 1995.

  49. Y. Yang and C. Chute. An application of least squares fit mapping to text information retrieval. ACM SIGIR Conference, pp. 281–290, 1993.

  50. Y. Yang and X. Liu. A re-examination of text categorization methods. ACM SIGIR Conference, pp. 42–49, 1999.

  51. H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Stat. Methodology), 67(2), pp. 301–320, 2005.

  52. http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

  53. https://cran.r-project.org/web/packages/tm/

  54. http://www.cs.waikato.ac.nz/ml/weka/

  55. https://cran.r-project.org/web/packages/RTextTools/RTextTools.pdf

  56. http://mallet.cs.umass.edu/

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Aggarwal, C.C. (2018). Linear Classification and Regression for Text. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_6

  • DOI: https://doi.org/10.1007/978-3-319-73531-3_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73530-6

  • Online ISBN: 978-3-319-73531-3
