
Pattern Classification and Learning Theory

  • G. Lugosi
Part of the International Centre for Mechanical Sciences book series (CISM, volume 434)

Abstract

Pattern recognition (also called classification or discrimination) is about guessing or predicting the unknown class of an observation. An observation is a collection of numerical measurements, represented by a d-dimensional vector x. The unknown nature of the observation is called the class; it is denoted by y and takes values in the set {0, 1}. (For simplicity, we restrict our attention to binary classification.) In pattern recognition, one constructs a function g: R^d → {0, 1} that represents one’s guess of y given x. The mapping g is called a classifier, and the classifier errs on x if g(x) ≠ y.
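
To make this setup concrete, the following minimal sketch (not taken from the chapter; the data-generating distribution, the classifier, and all names are illustrative) builds a simple classifier g: R^d → {0, 1} and computes its empirical error rate, i.e. the fraction of sample points on which it errs.

```python
# Illustrative sketch only: a toy classifier and its empirical error rate.
# The distribution and the rule g are hypothetical, not from the chapter.
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 1000

# n observations x in R^d with noisy binary labels y in {0, 1}.
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

def g(x):
    """A classifier g: R^d -> {0, 1}: threshold the first coordinate."""
    return int(x[0] > 0)

# g errs on x if g(x) != y; the empirical risk is the fraction of errors.
empirical_risk = np.mean([g(x) != yi for x, yi in zip(X, y)])
print(f"empirical error rate of g: {empirical_risk:.3f}")
```

How far such an empirical error rate can deviate from the true probability of error P{g(X) ≠ Y} is the question addressed by the concentration inequalities and Vapnik-Chervonenkis theory referenced below.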

Keywords

Independent Random Variable · Pattern Classification · Empirical Process · Empirical Risk · Concentration Inequality


References

General

  1. M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999.
  2. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.
  3. L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.
  4. V.N. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.
  5. V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
  6. V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
  7. V.N. Vapnik and A.Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974 (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Verlag, Berlin, 1979.

Concentration for sums of independent random variables

  1. G. Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57: 33–45, 1962.
  2. S.N. Bernstein. The Theory of Probabilities. Gostekhizdat Publishing House, Moscow, 1946.
  3. H. Chernoff. A measure of asymptotic efficiency of tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23: 493–507, 1952.
  4. T. Hagerup and C. Rüb. A guided tour of Chernoff bounds. Information Processing Letters, 33: 305–308, 1990.
  5. C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics 1989, pages 148–188. Cambridge University Press, Cambridge, 1989.
  6. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58: 13–30, 1963.
  7. R.M. Karp. Probabilistic Analysis of Algorithms. Class Notes, University of California, Berkeley, 1988.
  8. M. Okamoto. Some inequalities relating to the partial sum of binomial probabilities. Annals of the Institute of Statistical Mathematics, 10: 29–35, 1958.

Concentration

  1. K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19: 357–367, 1967.
  2. S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications in random combinatorics and learning. Random Structures and Algorithms, 16: 277–292, 2000.
  3. L. Devroye. Exponential inequalities in nonparametric estimation. In G. Roussas, editor, Nonparametric Functional Estimation and Related Topics, pages 31–44. NATO ASI Series, Kluwer Academic Publishers, Dordrecht, 1991.
  4. J. H. Kim. The Ramsey number R(3, t) has order of magnitude t²/log t. Random Structures and Algorithms, 7: 173–207, 1995.
  5. M. Ledoux. On Talagrand’s deviation inequalities for product measures. ESAIM: Probability and Statistics, 1: 63–87, 1996.
  6. K. Marton. A simple proof of the blowing-up lemma. IEEE Transactions on Information Theory, 32: 445–446, 1986.
  7. K. Marton. Bounding d̄-distance by informational divergence: a way to prove measure concentration. Annals of Probability, 24: 857–866, 1996.
  8. K. Marton. A measure concentration inequality for contracting Markov chains. Geometric and Functional Analysis, 6: 556–571, 1996. Erratum: 7: 609–613, 1997.
  9. P. Massart. About the constants in Talagrand’s concentration inequalities for empirical processes. Annals of Probability, 28: 863–884, 2000.
  10. W. Rhee and M. Talagrand. Martingales, inequalities, and NP-complete problems. Mathematics of Operations Research, 12: 177–181, 1987.
  11. J.M. Steele. An Efron-Stein inequality for nonsymmetric statistics. Annals of Statistics, 14: 753–758, 1986.
  12. M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. I.H.E.S. Publications Mathématiques, 81: 73–205, 1996.
  13. M. Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126: 505–563, 1996.
  14. M. Talagrand. A new look at independence. Annals of Probability, 24: 1–34, 1996. (Special invited paper.)

VC theory

  1. K. Alexander. Probability inequalities for empirical processes and a law of the iterated logarithm. Annals of Probability, 12: 1041–1067, 1984.
  2. M. Anthony and J. Shawe-Taylor. A result of Vapnik with applications. Discrete Applied Mathematics, 47: 207–217, 1993.
  3. P. Bartlett and G. Lugosi. An inequality for uniform deviations of sample averages from their means. Statistics and Probability Letters, 44: 55–62, 1999.
  4. L. Breiman. Bagging predictors. Machine Learning, 24: 123–140, 1996.
  5. L. Devroye. Bounds for the uniform deviation of empirical measures. Journal of Multivariate Analysis, 12: 72–79, 1982.
  6. A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82: 247–261, 1989.
  7. Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121: 256–285, 1995.
  8. E. Giné and J. Zinn. Some limit theorems for empirical processes. Annals of Probability, 12: 929–989, 1984.
  9. D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100: 78–150, 1992.
  10. V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
  11. M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer-Verlag, New York, 1991.
  12. G. Lugosi. Improved upper bounds for probabilities of uniform deviations. Statistics and Probability Letters, 25: 71–77, 1995.
  13. D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, New York, 1984.
  14. R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, 26: 1651–1686, 1998.
  15. R.E. Schapire. The strength of weak learnability. Machine Learning, 5: 197–227, 1990.
  16. M. Talagrand. Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22: 28–76, 1994.
  17. S. Van de Geer. Estimating a regression function. Annals of Statistics, 18: 907–924, 1990.
  18. V.N. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.
  19. V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
  20. V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
  21. V.N. Vapnik and A.Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16: 264–280, 1971.
  22. V.N. Vapnik and A.Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974 (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Verlag, Berlin, 1979.
  23. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.

Shatter coefficients, VC dimension

  1. P. Assouad. Sur les classes de Vapnik-Chervonenkis. Comptes Rendus de l’Académie des Sciences de Paris, Série I, 292: 921–924, 1981.
  2. T. M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14: 326–334, 1965.
  3. R. M. Dudley. Central limit theorems for empirical measures. Annals of Probability, 6: 899–929, 1978.
  4. R. M. Dudley. Balls in R^k do not cut all subsets of k + 2 points. Advances in Mathematics, 31 (3): 306–308, 1979.
  5. P. Frankl. On the trace of finite sets. Journal of Combinatorial Theory, Series A, 34: 41–45, 1983.
  6. D. Haussler. Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69: 217–232, 1995.
  7. N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13: 145–147, 1972.
  8. L. Schläfli. Gesammelte Mathematische Abhandlungen. Birkhäuser-Verlag, Basel, 1950.
  9. S. Shelah. A combinatorial problem: stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41: 247–261, 1972.
  10. J. M. Steele. Combinatorial entropy and uniform limit laws. Ph.D. dissertation, Stanford University, Stanford, CA, 1975.
  11. J. M. Steele. Existence of submatrices with all possible columns. Journal of Combinatorial Theory, Series A, 28: 84–88, 1978.
  12. R. S. Wenocur and R. M. Dudley. Some special Vapnik-Chervonenkis classes. Discrete Mathematics, 33: 313–318, 1981.

Lower bounds

  1. A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Machine Learning, 30: 31–56, 1998.
  2. P. Assouad. Deux remarques sur l’estimation. Comptes Rendus de l’Académie des Sciences de Paris, 296: 1021–1024, 1983.
  3. L. Birgé. Approximation dans les espaces métriques et théorie de l’estimation. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 65: 181–237, 1983.
  4. L. Birgé. On estimating a density using Hellinger distance and some other strange facts. Probability Theory and Related Fields, 71: 271–291, 1986.
  5. A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36: 929–965, 1989.
  6. L. Devroye and G. Lugosi. Lower bounds in pattern recognition and learning. Pattern Recognition, 28: 1011–1018, 1995.
  7. A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82: 247–261, 1989.
  8. D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0, 1}-functions on randomly drawn points. Information and Computation, 115: 248–292, 1994.
  9. E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. Annals of Statistics, 27: 1808–1829, 1999.
  10. D. Schuurmans. Characterizing rational versus exponential learning curves. In Computational Learning Theory: Second European Conference, EuroCOLT’95, pages 272–286. Springer-Verlag, 1995.
  11. V.N. Vapnik and A.Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974 (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Verlag, Berlin, 1979.
  12. S. Geman and C.R. Hwang. Nonparametric maximum likelihood estimation by the method of sieves. Annals of Statistics, 10: 401–414, 1982.

Complexity regularization

  1. H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19: 716–723, 1974.
  2. A.R. Barron. Logically smooth density estimation. Technical Report TR 56, Department of Statistics, Stanford University, 1985.
  3. A.R. Barron. Complexity regularization with application to artificial neural networks. In G. Roussas, editor, Nonparametric Functional Estimation and Related Topics, pages 561–576. NATO ASI Series, Kluwer Academic Publishers, Dordrecht, 1991.
  4. A.R. Barron, L. Birgé, and P. Massart. Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113: 301–413, 1999.
  5. A.R. Barron and T.M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37: 1034–1054, 1991.
  6. P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44 (2): 525–536, March 1998.
  7. P. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. In Proceedings of the 13th Annual Conference on Computational Learning Theory, pages 286–297. ACM Press, 2000.
  8. L. Birgé and P. Massart. From model selection to adaptive estimation. In D. Pollard, E. Torgersen, and G. Yang, editors, Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, pages 55–87. Springer, New York, 1997.
  9. L. Birgé and P. Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4: 329–375, 1998.
  10. Y. Freund. Self bounding learning algorithms. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 247–258, 1998.
  11. A.R. Gallant. Nonlinear Statistical Models. John Wiley, New York, 1987.
  12. M. Kearns, Y. Mansour, A.Y. Ng, and D. Ron. An experimental and theoretical comparison of model selection methods. In Proceedings of the Eighth Annual ACM Workshop on Computational Learning Theory, pages 21–30. Association for Computing Machinery, New York, 1995.
  13. A. Krzyzak and T. Linder. Radial basis function networks and complexity regularization in function learning. IEEE Transactions on Neural Networks, 9: 247–256, 1998.
  14. G. Lugosi and A. Nobel. Adaptive model selection using empirical complexities. Annals of Statistics, 27 (6), 1999.
  15. G. Lugosi and K. Zeger. Nonparametric estimation via empirical risk minimization. IEEE Transactions on Information Theory, 41: 677–687, 1995.
  16. G. Lugosi and K. Zeger. Concept learning using complexity regularization. IEEE Transactions on Information Theory, 42: 48–54, 1996.
  17. C.L. Mallows. Some comments on C_p. Technometrics, 15: 661–675, 1973.
  18. P. Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de l’Université de Toulouse, Mathématiques, série 6, IX: 245–303, 2000.
  19. R. Meir. Performance bounds for nonlinear time series prediction. In Proceedings of the Tenth Annual ACM Workshop on Computational Learning Theory, pages 122–129. Association for Computing Machinery, New York, 1997.
  20. D.S. Modha and E. Masry. Minimum complexity regression estimation with weakly dependent observations. IEEE Transactions on Information Theory, 42: 2133–2145, 1996.
  21. J. Rissanen. A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11: 416–431, 1983.
  22. G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6: 461–464, 1978.
  23. J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44 (5): 1926–1940, 1998.
  24. X. Shen and W.H. Wong. Convergence rate of sieve estimates. Annals of Statistics, 22: 580–615, 1994.
  25. Y. Yang and A.R. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, to appear, 1997.
  26. Y. Yang and A.R. Barron. An asymptotic property of model selection criteria. IEEE Transactions on Information Theory, 1998.

Copyright information

© Springer-Verlag Wien 2002

Authors and Affiliations

  • G. Lugosi
    Pompeu Fabra University, Barcelona, Spain
