# A Short Review of Statistical Learning Theory

## Abstract

Statistical learning theory has emerged in recent years as a solid and elegant framework for studying the problem of learning from examples. Unlike earlier, “classical” learning techniques, this theory completely characterizes the necessary and sufficient conditions for a learning algorithm to be consistent. The key quantity is the *capacity* of the set of hypotheses employed by the learning algorithm, and the goal is to control this capacity depending on the given examples. Structural risk minimization (SRM) is the main theoretical algorithm implementing this idea. SRM is inspired by, and closely related to, regularization theory. For practical purposes, however, SRM is a very hard problem, impossible to implement when dealing with a large number of examples. Techniques such as support vector machines and the older regularization networks offer a viable way to implement the idea of capacity control. The paper also discusses how these techniques can be formulated as a variational problem in a Hilbert space and shows how SRM can be extended in order to implement both classical regularization networks and support vector machines.
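The variational formulation mentioned in the abstract can be sketched concretely. In the regularization-networks/support-vector framework of the literature this abstract summarizes, both techniques minimize a regularized empirical risk over a reproducing kernel Hilbert space; the notation below is the standard one and is not taken verbatim from this paper:

```latex
% Regularized empirical risk over an RKHS \mathcal{H} with kernel K,
% given \ell examples (x_i, y_i) and regularization parameter \lambda:
\min_{f \in \mathcal{H}} \;
  \frac{1}{\ell} \sum_{i=1}^{\ell} V\bigl(y_i, f(x_i)\bigr)
  + \lambda \, \|f\|_K^2
```

Choosing the squared loss \(V(y, f(x)) = (y - f(x))^2\) gives a classical regularization network, while the hinge loss \(V(y, f(x)) = \max(0,\, 1 - y f(x))\) gives support vector classification; in either case the minimizer takes the form \(f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i)\), and \(\lambda\) trades data fit against capacity.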

## Keywords

Statistical learning theory · Structural risk minimization · Regularization
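As a small, hypothetical illustration of capacity control in a regularization network, the sketch below fits a one-dimensional kernel ridge regressor (squared loss plus a smoothness penalty) in pure Python. All function names and parameter values here are illustrative choices, not taken from the paper; a small regularization parameter nearly interpolates the data, while a large one forces a smoother, higher-bias fit.

```python
# Minimal regularization-network sketch (kernel ridge regression):
# f(x) = sum_i c_i K(x_i, x), with coefficients solving (K + n*lam*I) c = y.
import math

def gaussian_kernel(a, b, sigma=0.5):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_rn(xs, ys, lam):
    """Solve the regularized linear system for the expansion coefficients."""
    n = len(xs)
    K = [[gaussian_kernel(xi, xj) for xj in xs] for xi in xs]
    A = [[K[i][j] + (lam * n if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(A, ys)

def predict(xs, coeffs, x):
    """Evaluate f(x) = sum_i c_i K(x_i, x)."""
    return sum(c * gaussian_kernel(xi, x) for c, xi in zip(coeffs, xs))

# Tiny synthetic data set: samples of sin(x).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.sin(x) for x in xs]

# Small lambda: low regularization, near-interpolation (high capacity).
c_small = fit_rn(xs, ys, 1e-6)
# Large lambda: heavy regularization, predictions shrunk toward zero.
c_large = fit_rn(xs, ys, 10.0)
```

Varying `lam` is the practical knob for the capacity control the abstract describes: it selects, within the nested family of smoother and smoother hypotheses, the one best matched to the given examples.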
