Supervised Learning by Support Vector Machines

  • Gabriele Steidl


During the last two decades, support vector machine learning has become a very active field of research, with a large body of both sophisticated theoretical results and exciting real-world applications. This chapter gives a brief introduction to the basic concepts of supervised support vector learning and touches on some recent developments in this broad field.


Keywords: Support Vector Machine; Loss Function; Support Vector Regression; Dual Problem; Sparse Representation

Copyright information

© Springer Science+Business Media, LLC 2011
