
Foundations of Statistical Learning and Model Selection

Machine Learning for Audio, Image and Video Analysis

Abstract

What the reader should know to understand this chapter:
  • Basic notions of machine learning.
  • Notions of calculus.
  • Chapter 5.


Notes

  1. \(\textit{erf}(u)=\frac{2}{\sqrt{\pi }}\int _{0}^{u} e^{-t^2}\,dt\).

  2. Numquam ponenda sine necessitate ("never to be posited without necessity", W. Occam).

  3. \(f(\cdot )_+\) stands for the positive part of \(f(\cdot )\).

References

  1. H. Akaike. Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 21:202–217, 1970.

  2. H. Akaike. Information theory and an extension of the maximum likelihood principle. In \(2^{nd}\) International Symposium on Information Theory, pages 267–281, 1973.

  3. M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

  4. C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

  5. S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications. Random Structures and Algorithms, 16(3):277–292, 2000.

  6. V. Cherkassky and F. Mulier. Learning from Data. John Wiley, 1998.

  7. H. Chernoff. A measure of asymptotic efficiency of tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493–507, 1952.

  8. P. Craven and G. Wahba. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31(4):377–403, 1978.

  9. L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.

  10. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley, 2001.

  11. B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.

  12. R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.

  13. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.

  14. S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.

  15. T. Hastie, R. J. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.

  16. F. Mosteller and J. W. Tukey. Data analysis, including statistics. In Handbook of Social Psychology, pages 80–203. Addison-Wesley, 1968.

  17. J. Rissanen. A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11(2):416–431, 1983.

  18. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.

  19. G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.

  20. R. Shibata. An optimal selection of regression variables. Biometrika, 68(1):45–54, 1981.

  21. M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111–147, 1974.

  22. M. Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B, 39:44–47, 1977.

  23. V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.

  24. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.

  25. V. N. Vapnik. Statistical Learning Theory. John Wiley, 1998.

  26. V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.

  27. V. N. Vapnik and A. Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, 1974.


Author information


Correspondence to Francesco Camastra.

Problems


7.1

Prove that, in the case of regression, the average error \(\mathcal {E}[(f(\mathbf {x},\mathcal {D})-F(\mathbf {x}))^2]\) can be decomposed as follows:

$$ \mathcal {E}[(f(\mathbf {x},\mathcal {D})-F(\mathbf {x}))^2]= (\mathcal {E}[f(\mathbf {x},\mathcal {D})-F(\mathbf {x})])^2 + \mathcal {E}[(f(\mathbf {x},\mathcal {D})-\mathcal {E}[f(\mathbf {x},\mathcal {D})])^2] $$
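A sketch of the standard argument: write \(f\) for \(f(\mathbf {x},\mathcal {D})\) and \(F\) for \(F(\mathbf {x})\), add and subtract \(\mathcal {E}[f]\) inside the square, and note that \(F\) does not depend on \(\mathcal {D}\):

$$ \mathcal {E}[(f-F)^2] = \mathcal {E}[((f-\mathcal {E}[f])+(\mathcal {E}[f]-F))^2] = \mathcal {E}[(f-\mathcal {E}[f])^2] + (\mathcal {E}[f]-F)^2 + 2(\mathcal {E}[f]-F)\,\mathcal {E}[f-\mathcal {E}[f]], $$

where the last term vanishes since \(\mathcal {E}[f-\mathcal {E}[f]]=0\).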

7.2

Consider the bias-variance decomposition for classification. Show that if the classification \(f(\mathbf {x}, \mathcal {D})\) does not coincide with the Bayes discriminant \(y_b(\mathbf {x})\), the classification error is given by:

$$ P(f(\mathbf {x}, \mathcal {D}) \ne y)= |2 \gamma (\mathbf {x})-1| + P(y_b(\mathbf {x}) \ne y). $$
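A sketch of the case analysis, assuming \(\gamma (\mathbf {x})\) denotes the posterior probability of the positive class, so that the Bayes discriminant \(y_b(\mathbf {x})\) attains the smaller of the two conditional errors: whenever the decision disagrees with the Bayes discriminant it picks the less probable class, hence for fixed \(\mathbf {x}\)

$$ P(f(\mathbf {x},\mathcal {D}) \ne y \mid \mathbf {x}) = \max (\gamma (\mathbf {x}), 1-\gamma (\mathbf {x})) = |2\gamma (\mathbf {x})-1| + \min (\gamma (\mathbf {x}), 1-\gamma (\mathbf {x})) = |2\gamma (\mathbf {x})-1| + P(y_b(\mathbf {x}) \ne y \mid \mathbf {x}). $$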

7.3

Prove that the class of functions \(\sin (\alpha x)\) (\(\alpha \in \mathbb {R}\)) has infinite VC dimension (Theorem 4). You can compare your proof with the one reported in [25].

7.4

For any \(\epsilon > 0\), prove that

$$\begin{aligned} P(|\mathcal {R}_{\textit{emp}}[f]-\mathcal {R}[f]| \ge \epsilon ) \le \exp (-2 \ell \epsilon ^2) \end{aligned}$$
(7.50)

7.5

Prove that the annealed entropy is an upper bound of the VC entropy. Hint: use Jensen's inequality [25], which states that for a concave function \(\psi \) the inequality

$$ \int \psi (\varPhi (x))dF(x) \le \psi \left( \int \varPhi (x) dF(x) \right) $$

holds.
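A compact sketch, assuming the usual definitions of VC entropy \(H^{\varLambda }(\ell )=\mathcal {E}[\ln N^{\varLambda }(z_1,\ldots ,z_\ell )]\) and annealed entropy \(H_{\textit{ann}}^{\varLambda }(\ell )=\ln \mathcal {E}[N^{\varLambda }(z_1,\ldots ,z_\ell )]\): apply Jensen's inequality with \(\psi =\ln \) and \(\varPhi =N^{\varLambda }\),

$$ H^{\varLambda }(\ell ) = \int \ln N^{\varLambda }(z_1,\ldots ,z_\ell )\, dF(z_1,\ldots ,z_\ell ) \le \ln \int N^{\varLambda }(z_1,\ldots ,z_\ell )\, dF(z_1,\ldots ,z_\ell ) = H_{\textit{ann}}^{\varLambda }(\ell ). $$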

7.6

Prove that if a class of functions \(\mathcal {F}\) can shatter any data set of \(\ell \) samples, then the third milestone of VC theory is not fulfilled, that is, condition (7.32) does not hold.
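A sketch, assuming (7.32) is the usual growth-function condition \(\lim _{\ell \rightarrow \infty } G^{\varLambda }(\ell )/\ell = 0\): if \(\mathcal {F}\) shatters some set of \(\ell \) points for every \(\ell \), the maximum number of dichotomies it realizes on \(\ell \) points is \(2^{\ell }\), hence

$$ G^{\varLambda }(\ell ) = \ln 2^{\ell } = \ell \ln 2 \qquad \Rightarrow \qquad \lim _{\ell \rightarrow \infty } \frac{G^{\varLambda }(\ell )}{\ell } = \ln 2 \ne 0. $$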

7.7

Implement the AIC criterion. Consider the spam data, which can be downloaded from ftp.ics.uci.edu/pub/machine-learning-databases/spam. Randomly divide the spam data into two subsets with the same number of samples. Take the former and the latter as the training and the test set, respectively. Select a learning algorithm for classification (e.g., K-Means or MLP) and train the algorithm with several parameter values. Use the AIC criterion for model selection. Compare the performances of the resulting models by means of model assessment.
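A minimal sketch, in Python with scikit-learn, of one way to set up the experiment. The local file name spambase.data, the choice of an MLP classifier, the grid of hidden-layer sizes, and the parameter count read off the fitted weights are all assumptions made for illustration, not part of the problem statement:

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import log_loss, accuracy_score

    # Spambase: comma-separated rows, last column is the 0/1 spam label
    # (assumes the UCI file has been saved locally as "spambase.data").
    data = np.loadtxt("spambase.data", delimiter=",")
    idx = np.random.default_rng(0).permutation(len(data))
    half = len(data) // 2
    train, test = data[idx[:half]], data[idx[half:]]
    X_tr, y_tr = train[:, :-1], train[:, -1]
    X_te, y_te = test[:, :-1], test[:, -1]

    def aic(model, X, y):
        """AIC = 2k - 2 ln L: k free parameters, ln L the training log-likelihood."""
        k = sum(w.size for w in model.coefs_) + sum(b.size for b in model.intercepts_)
        log_lik = -log_loss(y, model.predict_proba(X), normalize=False)
        return 2 * k - 2 * log_lik

    scores = {}
    for hidden in (2, 5, 10, 20):                      # assumed complexity grid
        mlp = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=500,
                            random_state=0).fit(X_tr, y_tr)
        scores[hidden] = (aic(mlp, X_tr, y_tr), mlp)

    best_h, (best_aic, best_model) = min(scores.items(), key=lambda kv: kv[1][0])
    print(f"AIC selects {best_h} hidden units (AIC = {best_aic:.1f})")
    print(f"test accuracy: {accuracy_score(y_te, best_model.predict(X_te)):.3f}")

Model assessment then amounts to comparing the test-set accuracy of the selected model with that of the discarded candidates.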

7.8

Implement the BIC criterion. Repeat Problem 7.7 and use the BIC criterion for model selection. Compare its performance with AIC.
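Only the scoring function changes with respect to the sketch in Problem 7.7 (same assumed setup and variable names); BIC penalizes complexity by \(k \ln \ell \) instead of \(2k\):

    import numpy as np
    from sklearn.metrics import log_loss

    def bic(model, X, y):
        """BIC = k ln(l) - 2 ln L, with l the number of training samples."""
        k = sum(w.size for w in model.coefs_) + sum(b.size for b in model.intercepts_)
        log_lik = -log_loss(y, model.predict_proba(X), normalize=False)
        return k * np.log(len(X)) - 2 * log_lik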

7.9

Implement the cross-validation criterion. Repeat Problem 7.7 and use 5-fold cross-validation for model selection. Compare its performance with AIC and BIC.
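A sketch of the 5-fold selection step, again under the assumptions of the Problem 7.7 sketch (training arrays X_tr, y_tr and the same grid of hidden-layer sizes):

    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier

    cv_scores = {}
    for hidden in (2, 5, 10, 20):
        mlp = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=500, random_state=0)
        # mean accuracy over 5 folds of the training set
        cv_scores[hidden] = cross_val_score(mlp, X_tr, y_tr, cv=5).mean()

    best_h = max(cv_scores, key=cv_scores.get)
    print(f"5-fold cross-validation selects {best_h} hidden units")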

7.10

Implement the leave-one-out method and test it on the Iris data [12], which can be downloaded from ftp.ics.uci.edu/pub/machine-learning-databases/iris.
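A sketch of the leave-one-out estimate; the Iris data are loaded through scikit-learn for convenience rather than from the UCI file, and the 3-nearest-neighbour classifier is an assumed choice, not prescribed by the problem:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    clf = KNeighborsClassifier(n_neighbors=3)

    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf.fit(X[train_idx], y[train_idx])
        errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])

    # each sample is held out exactly once
    print(f"leave-one-out error rate: {errors / len(X):.3f}")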


Copyright information

Ā© 2015 Springer-Verlag London

About this chapter

Cite this chapter

Camastra, F., Vinciarelli, A. (2015). Foundations of Statistical Learning and Model Selection. In: Machine Learning for Audio, Image and Video Analysis. Advanced Information and Knowledge Processing. Springer, London. https://doi.org/10.1007/978-1-4471-6735-8_7


  • DOI: https://doi.org/10.1007/978-1-4471-6735-8_7

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-6734-1

  • Online ISBN: 978-1-4471-6735-8

