Abstract
Linear models for classification and regression express the dependent variable (or class variable) as a linear function of the independent variables (or feature variables). Specifically, consider the case in which \(y_{i}\) is the dependent variable of the ith document, and \(\overline{X_{i}} = (x_{i1}\ldots x_{id})\) are the d-dimensional feature variables of this document. In the case of text, these feature variables correspond to the term frequencies of a lexicon with d terms. The value of \(y_{i}\) is a numerical quantity in the case of regression, and it is a binary value drawn from \(\{-1,+1\}\) in the case of classification.
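As a concrete illustration, the following minimal sketch (with made-up term-frequency vectors, not an example from the chapter) fits such a linear model \(y \approx \overline{W}\cdot \overline{X} + b\) by stochastic gradient descent on the squared loss:

```python
# A minimal sketch of fitting a linear regression model on term-frequency
# vectors by stochastic gradient descent on squared loss. The tiny
# "documents" and targets below are illustrative only.

def fit_linear(X, y, lr=0.05, epochs=5000):
    d = len(X[0])
    w = [0.0] * d   # one coefficient per term in the lexicon
    b = 0.0         # bias term
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = pred - yi
            for j in range(d):
                w[j] -= lr * err * xi[j]
            b -= lr * err
    return w, b

X = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # term frequencies over a d=2 lexicon
y = [4.0, -3.0, 1.0]                      # numeric dependent variable (regression)
w, b = fit_linear(X, y)                   # this system has the exact solution w=(2,-1), b=0
```

For classification, the same model is used with \(y_i \in \{-1,+1\}\) and the sign of the prediction taken as the class label.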
Notes
1. When using kernel methods, it is customary to add a small constant amount to every entry in the similarity matrix between points to account for the effect of the dummy variable representing the bias term [319] (see Exercise 5).
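The effect of this constant can be checked numerically: adding a constant c to every entry of a dot-product similarity matrix is equivalent to appending a dummy feature \(\sqrt{c}\) to every point, which is what absorbs the bias term. The points below are made up for illustration:

```python
import math

# Adding a constant c to every kernel entry equals appending a constant
# dummy feature sqrt(c) to each point (the feature that carries the bias).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

points = [[1.0, 2.0], [0.0, 3.0], [2.0, 1.0]]
c = 0.5
augmented = [p + [math.sqrt(c)] for p in points]

for u, ua in zip(points, augmented):
    for v, va in zip(points, augmented):
        # shifted kernel entry == kernel entry of augmented points
        assert abs((dot(u, v) + c) - dot(ua, va)) < 1e-12
```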
2. The notions of scatter and variance differ only in scaling: the scatter of a set of n values is equal to n times their variance. Therefore, it does not matter, up to a constant of proportionality, whether the scatter or the variance is used.
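A quick numerical check of this relationship, with arbitrary values:

```python
# Scatter is the sum of squared deviations from the mean;
# the (population) variance is the same sum divided by n.
vals = [2.0, 4.0, 4.0, 6.0]
n = len(vals)
mean = sum(vals) / n
scatter = sum((v - mean) ** 2 for v in vals)   # 8.0
variance = scatter / n                         # 2.0, so scatter = n * variance
```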
3. This two-class variant of the scatter matrix \(S_{b}\) is not exactly the same as the multi-class version of \(S_{b}\) defined in Sect. 6.2.3.1. Nevertheless, all entries in the two matrices are related by the proportionality factor \(\frac{n_{1}\cdot n_{0}}{n^{2}}\), which turns out to be inconsequential to the direction of the Fisher discriminant. In other words, using the multi-class formulas of Sect. 6.2.3.1 yields the same result in the binary case.
4. Note that this matrix differs from the one introduced for the two-class case only by a proportionality factor, which does not affect the final solution.
5. One can also show a more general equivalence by allowing for bias.
6. The SVM generally uses the hinge loss rather than the quadratic loss. The use of quadratic loss is possible in an SVM, but it is less common. This is another key difference between the Fisher discriminant and the (most common implementation of the) SVM.
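The two losses can be contrasted directly. The sketch below is illustrative, not the chapter's implementation; note that the quadratic loss penalizes even confidently correct predictions with \(y\,(\overline{W}\cdot\overline{X}) > 1\), while the hinge loss does not:

```python
# Hinge loss (standard SVM) vs. quadratic loss (least-squares / Fisher view),
# both expressed in terms of the margin y * score.

def hinge_loss(y, score):
    # zero once the margin y * score >= 1 is met
    return max(0.0, 1.0 - y * score)

def quadratic_loss(y, score):
    # penalizes any deviation of y * score from 1, in either direction
    return (1.0 - y * score) ** 2

# A confidently correct point (margin 2) incurs no hinge loss,
# but a nonzero quadratic loss.
```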
7.
8.
9. When a sparse vector \(\overline{a}\) is added to a dense vector \(\overline{b}\), the change in the squared norm of \(\overline{b}\) is \(\vert \vert \overline{a}\vert \vert ^{2} + 2\overline{a} \cdot \overline{b}\). This can be computed in time proportional to the number of nonzero entries in the sparse vector \(\overline{a}\).
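This update can be sketched as follows; `updated_sq_norm` is a hypothetical helper name, and the sparse vector is represented as an index-to-value dictionary:

```python
# Updating ||b||^2 after adding a sparse vector a, in time proportional to
# the number of nonzero entries of a rather than the full dimensionality.

def updated_sq_norm(sq_norm_b, b_dense, a_sparse):
    """a_sparse maps index -> value; returns ||a + b||^2 given ||b||^2."""
    delta = sum(v * v + 2 * v * b_dense[i] for i, v in a_sparse.items())
    return sq_norm_b + delta

b = [1.0, 2.0, 0.0, 3.0]
sq_b = sum(x * x for x in b)          # ||b||^2 = 14
a = {1: 1.0, 3: -2.0}                 # sparse vector with two nonzero entries
new_sq = updated_sq_norm(sq_b, b, a)  # ||a + b||^2 = 11
```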
10. We say “roughly” because we are ignoring the data-independent term \(\sum_{i=1}^{n}\alpha_{i}\).
11. It has been shown [383] how one can derive heuristic probability estimates from an SVM.
12. Regularization is equivalent to assuming that the parameters in \(\overline{W}\) are drawn from a Gaussian prior, and it results in the addition of the term \(\lambda \vert \vert \overline{W}\vert \vert ^{2}/2\) to the negative log-likelihood in order to incorporate this prior assumption.
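The correspondence between the Gaussian prior and the regularization term can be sketched as follows, writing \(\mathcal{L}(\overline{W})\) for the log-likelihood of the data:

```latex
P(\overline{W}) \propto \exp\!\left(-\frac{\lambda\,\vert\vert\overline{W}\vert\vert^{2}}{2}\right)
\quad\Longrightarrow\quad
\log P(\overline{W}\mid \text{data}) \;=\; \mathcal{L}(\overline{W})
  \;-\; \frac{\lambda\,\vert\vert\overline{W}\vert\vert^{2}}{2} \;+\; \text{constant}
```

Maximizing this posterior is therefore equivalent to minimizing the negative log-likelihood plus the penalty \(\lambda \vert\vert\overline{W}\vert\vert^{2}/2\).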
13. The data will typically be rotated and reflected in particular directions.
14. Strictly speaking, the transformation \(\varPhi (\overline{X})\) would need to be infinite-dimensional to adequately represent the universe of all possible data points for Gaussian kernels. However, the relative positions of n points (and the origin) in any dimensionality can always be projected on an n-dimensional plane, just as a set of two 3-dimensional points (with the origin) can always be projected on a 2-dimensional plane. The eigenvectors of the n × n similarity matrix of these points provide precisely this projection. This is referred to as the data-specific Mercer kernel map. Therefore, even though one often hears of the impossibility of extracting infinite-dimensional points from a Gaussian kernel, this makes the nature of the transformation sound more abstract and impossible than it really is (as a practical matter). The reality is that we can always work with the data-specific n-dimensional transformation. As long as the similarity matrix is positive semi-definite, a finite-dimensional transformation always exists for a finite data set, which is adequate for the learning algorithm. We use the notation \(\varPhi_{s}(\cdot)\) instead of \(\varPhi(\cdot)\) to represent the fact that this is a data-specific transformation.
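For n = 2 points under a Gaussian kernel, the data-specific map can be written down in closed form, since the 2 × 2 similarity matrix with unit diagonal and off-diagonal entry s has eigenvalues 1 + s and 1 − s. The sketch below (with an arbitrary similarity value s) constructs the 2-dimensional embedding and checks that its dot products reproduce the kernel exactly:

```python
import math

# Data-specific Mercer kernel map for n = 2 points: the similarity matrix
# K = [[1, s], [s, 1]] has eigenvalues 1+s and 1-s with eigenvectors
# (1,1)/sqrt(2) and (1,-1)/sqrt(2), so the rows of Phi = Q * diag(sqrt(lambda))
# give a finite 2-dimensional embedding with Phi Phi^T = K.

s = math.exp(-1.0)          # Gaussian similarity between the two points
lam1, lam2 = 1 + s, 1 - s   # eigenvalues of K (nonnegative since |s| <= 1)
q = 1 / math.sqrt(2)

phi = [[q * math.sqrt(lam1),  q * math.sqrt(lam2)],   # embedded point 1
       [q * math.sqrt(lam1), -q * math.sqrt(lam2)]]   # embedded point 2

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# The finite-dimensional embedding reproduces the kernel matrix:
assert abs(dot(phi[0], phi[0]) - 1.0) < 1e-12
assert abs(dot(phi[0], phi[1]) - s) < 1e-12
```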
15. When all entries in the kernel matrix are nonnegative, all pairwise angles between points are at most 90°. One can always reflect the points into the nonnegative orthant without loss of generality.
16. Suppose one has \(p_{1}\ldots p_{t}\) different possibilities for t different parameters. One then has to evaluate the algorithm at each of the \(p_{1} \times p_{2} \times \ldots \times p_{t}\) combinations of possibilities on a held-out set.
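A sketch of this exhaustive grid search; `evaluate` is a hypothetical held-out scoring function used only for illustration:

```python
from itertools import product

# Grid search over t parameters with p_1, ..., p_t choices each:
# every combination is scored on a (hypothetical) held-out set.

def evaluate(params):
    # stand-in for held-out accuracy; peaks at lam = 0.1, lr = 0.01
    lam, lr = params
    return -((lam - 0.1) ** 2 + (lr - 0.01) ** 2)

grid = {
    "lambda": [0.01, 0.1, 1.0],          # p_1 = 3 choices
    "learning_rate": [0.001, 0.01, 0.1], # p_2 = 3 choices
}
combos = list(product(*grid.values()))   # p_1 * p_2 = 9 combinations
best = max(combos, key=evaluate)         # (0.1, 0.01)
```

The combinatorial cost is why the number of tuned parameters is usually kept small.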
Bibliography
C. M. Bishop. Pattern recognition and machine learning. Springer, 2007.
C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
C. Burges. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), pp. 121–167, 1998.
S. Chakrabarti, S. Roy, and M. Soundalgekar. Fast and accurate text classification via multiple linear discriminant projections. The VLDB Journal, 12(2), pp. 170–185, 2003.
C. Chang and C. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27, 2011. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Y. Chang, C. Hsieh, K. Chang, M. Ringgaard, and C. J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11, pp. 1471–1490, 2010.
O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19(5), pp. 1155–1178, 2007.
T. Cooke. Two variations on Fisher’s linear discriminant for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), pp. 268–273, 2002.
C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3), pp. 273–297, 1995.
N. Cristianini, and J. Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.
N. Draper and H. Smith. Applied regression analysis. John Wiley & Sons, 2014.
H. Drucker, C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support Vector Regression Machines. NIPS Conference, 1997.
S. Dumais. Latent semantic indexing (LSI) and TREC-2. Text Retrieval Conference (TREC), pp. 105–115, 1993.
S. Dumais. Latent semantic indexing (LSI): TREC-3 Report. Text Retrieval Conference (TREC), pp. 219–230, 1995.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2), pp. 407–499, 2004.
R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, pp. 1871–1874, 2008. http://www.csie.ntu.edu.tw/~cjlin/liblinear/
R. Fan, P. Chen, and C. Lin. Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6, pp. 1889–1918, 2005.
R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7: pp. 179–188, 1936.
G. Fung and O. Mangasarian. Proximal support vector classifiers. ACM KDD Conference, pp. 77–86, 2001.
F. Girosi and T. Poggio. Networks and the best approximation property. Biological Cybernetics, 63(3), pp. 169–176, 1990.
T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015.
T. Hastie and R. Tibshirani. Generalized additive models. CRC Press, 1990.
G. Hinton. Connectionist learning procedures. Artificial Intelligence, 40(1–3), pp. 185–234, 1989.
T. Joachims. Making large-scale SVM learning practical. Advances in Kernel Methods: Support Vector Learning, pp. 169–184, MIT Press, Cambridge, 1998.
T. Joachims. Training Linear SVMs in Linear Time. ACM KDD Conference, pp. 217–226, 2006.
I. T. Jolliffe. Principal component analysis. John Wiley & Sons, 2002.
I. T. Jolliffe. A note on the use of principal components in regression. Applied Statistics, 31(3), pp. 300–303, 1982.
A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis. kernlab – An S4 Package for Kernel Methods in R. Journal of Statistical Software, 11(9), 2004. http://epub.wu.ac.at/1048/1/document.pdf http://CRAN.R-project.org/package=kernlab
M. Kuhn. Building predictive models in R Using the caret Package. Journal of Statistical Software, 28(5), pp. 1–26, 2008. https://cran.r-project.org/web/packages/caret/index.html
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2, pp. 419–444, 2002.
O. Mangasarian and D. Musicant. Successive overrelaxation for support vector machines. IEEE Transactions on Neural Networks, 10(5), pp. 1032–1037, 1999.
P. McCullagh and J. Nelder. Generalized linear models. CRC Press, 1989.
G. McLachlan. Discriminant analysis and statistical pattern recognition. John Wiley & Sons, 2004.
S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K. Müller. Fisher discriminant analysis with kernels. NIPS Conference, 1999.
K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. IJCAI Workshop on Machine Learning for Information Filtering, pp. 61–67, 1999.
E. Osuna, R. Freund, and F. Girosi. Improved training algorithm for support vector machines, IEEE Workshop on Neural Networks and Signal Processing, 1997.
J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Advances in Kernel Methods: Support Vector Learning, MIT Press, pp. 185–208, 1998.
J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), pp. 61–74, 1999.
R. Rifkin. Everything old is new again: a fresh look at historical approaches in machine learning. Ph.D. Thesis, Massachusetts Institute of Technology, 2002. http://cbcl.mit.edu/projects/cbcl/publications/theses/thesis-rifkin.pdf
S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1), pp. 3–30, 2011.
A. Shashua. On the equivalence between the support vector machine for classification and sparsified Fisher’s linear discriminant. Neural Processing Letters, 9(2), pp. 129–139, 1999.
J. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 1999.
A. Tikhonov and V. Arsenin. Solution of ill-posed problems. Winston and Sons, 1977.
V. Vapnik. The nature of statistical learning theory. Springer, 2000.
G. Wahba. Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. Advances in Kernel Methods-Support Vector Learning, 6, pp. 69–87, 1999.
J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, May, 1998.
B. Widrow and M. Hoff. Adaptive switching circuits. IRE WESCON Convention Record, 4(1), pp. 96–104, 1960.
Y. Yang. Noise reduction in a statistical approach to text categorization, ACM SIGIR Conference, pp. 256–263, 1995.
Y. Yang and C. Chute. An application of least squares fit mapping to text information retrieval. ACM SIGIR Conference, pp. 281–290, 1993.
Y. Yang and X. Liu. A re-examination of text categorization methods. ACM SIGIR Conference, pp. 42–49, 1999.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Stat. Methodology), 67(2), pp. 301–320, 2005.
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
https://cran.r-project.org/web/packages/RTextTools/RTextTools.pdf
Aggarwal, C. C. (2018). Linear Classification and Regression for Text. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_6