
Foundations of Statistical Learning and Model Selection

Machine Learning for Audio, Image and Video Analysis

Abstract

What the reader should know to understand this chapter:
  • Basic notions of machine learning.
  • Notions of calculus.
  • Chapter 5.


Notes

  1. \(\textit{erf}(u)=\frac{2}{\sqrt{\pi }}\int _{0}^{u} e^{-t^2}\,dt\).

  2. Numquam ponenda sine necessitate ("never to be posited without necessity", W. Occam).

  3. \(f(\cdot )_+\) stands for the positive part of \(f(\cdot )\).

References

  1. H. Akaike. Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 21:202–217, 1970.

  2. H. Akaike. Information theory and an extension of the maximum likelihood principle. In \(2^{nd}\) International Symposium on Information Theory, pages 267–281, 1973.

  3. M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

  4. C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

  5. S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications. Random Structures and Algorithms, 16(3):277–292, 2000.

  6. V. Cherkassky and F. Mulier. Learning from Data. John Wiley, 1998.

  7. H. Chernoff. A measure of asymptotic efficiency of tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493–507, 1952.

  8. P. Craven and G. Wahba. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31(4):377–403, 1978.

  9. L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.

  10. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley, 2001.

  11. B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.

  12. R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.

  13. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.

  14. S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.

  15. T. Hastie, R. J. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.

  16. F. Mosteller and J. W. Tukey. Data analysis, including statistics. In Handbook of Social Psychology, pages 80–203. Addison-Wesley, 1968.

  17. J. Rissanen. A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11(2):416–431, 1983.

  18. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.

  19. G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.

  20. R. Shibata. An optimal selection of regression variables. Biometrika, 68(1):45–54, 1981.

  21. M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111–147, 1974.

  22. M. Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B, 39:44–47, 1977.

  23. V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.

  24. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.

  25. V. N. Vapnik. Statistical Learning Theory. John Wiley, 1998.

  26. V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.

  27. V. N. Vapnik and A. Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, 1974.


Author information


Correspondence to Francesco Camastra.

Problems


7.1

Prove that, in the case of regression, the average error \(\mathcal {E}[(f(\mathbf {x},\mathcal {D})-F(\mathbf {x}))^2]\) can be decomposed as follows:

$$ \mathcal {E}[(f(\mathbf {x},\mathcal {D})-F(\mathbf {x}))^2]= (\mathcal {E}[f(\mathbf {x},\mathcal {D})-F(\mathbf {x})])^2 + \mathcal {E}[(f(\mathbf {x},\mathcal {D})-\mathcal {E}[f(\mathbf {x},\mathcal {D})])^2] $$
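A sketch of the standard argument: write \(f\) for \(f(\mathbf {x},\mathcal {D})\) and \(F\) for \(F(\mathbf {x})\), add and subtract \(\mathcal {E}[f]\) inside the square, and note that \(F\) does not depend on \(\mathcal {D}\):

$$ \mathcal {E}[(f-F)^2] = \mathcal {E}[((f-\mathcal {E}[f])+(\mathcal {E}[f]-F))^2] = \mathcal {E}[(f-\mathcal {E}[f])^2] + (\mathcal {E}[f]-F)^2 + 2(\mathcal {E}[f]-F)\,\mathcal {E}[f-\mathcal {E}[f]], $$

where the last term vanishes since \(\mathcal {E}[f-\mathcal {E}[f]]=0\).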

7.2

Consider the bias-variance decomposition for classification. Show that if the classification \(f(\mathbf {x}, \mathcal {D})\) does not coincide with the Bayes discriminant \(y_b(\mathbf {x})\), the classification error is given by:

$$ P(f(\mathbf {x}, \mathcal {D}) \ne y)= |2 \gamma (\mathbf {x})-1| + P(y_b(\mathbf {x}) \ne y). $$
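A sketch of the case analysis, assuming \(\gamma (\mathbf {x})\) denotes the posterior probability of the positive class, so that the Bayes discriminant \(y_b(\mathbf {x})\) attains the smaller of the two conditional errors: whenever the decision disagrees with the Bayes discriminant it picks the less probable class, hence for fixed \(\mathbf {x}\)

$$ P(f(\mathbf {x},\mathcal {D}) \ne y \mid \mathbf {x}) = \max (\gamma (\mathbf {x}), 1-\gamma (\mathbf {x})) = |2\gamma (\mathbf {x})-1| + \min (\gamma (\mathbf {x}), 1-\gamma (\mathbf {x})) = |2\gamma (\mathbf {x})-1| + P(y_b(\mathbf {x}) \ne y \mid \mathbf {x}). $$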

7.3

Prove that the class of functions \(\sin (\alpha x)\) (\(\alpha \in \mathbb {R}\)) has infinite VC dimension (Theorem 4). You can compare your proof with the one reported in [25].

7.4

For any \(\epsilon > 0\), prove that

$$\begin{aligned} P(|\mathcal {R}_{\textit{emp}}[f]-\mathcal {R}[f]| \ge \epsilon ) \le \exp (-2 \ell \epsilon ^2) \end{aligned}$$
(7.50)

7.5

Prove that the annealed entropy is an upper bound of the VC entropy. Hint: use Jensen's inequality [25], which states that for a concave function \(\psi \) the inequality

$$ \int \psi (\varPhi (x))dF(x) \le \psi \left( \int \varPhi (x) dF(x) \right) $$

holds.
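A compact sketch, assuming the usual definitions of VC entropy \(H^{\varLambda }(\ell )=\mathcal {E}[\ln N^{\varLambda }(z_1,\ldots ,z_\ell )]\) and annealed entropy \(H_{\textit{ann}}^{\varLambda }(\ell )=\ln \mathcal {E}[N^{\varLambda }(z_1,\ldots ,z_\ell )]\): apply Jensen's inequality with \(\psi =\ln \) and \(\varPhi =N^{\varLambda }\),

$$ H^{\varLambda }(\ell ) = \int \ln N^{\varLambda }(z_1,\ldots ,z_\ell )\, dF(z_1,\ldots ,z_\ell ) \le \ln \int N^{\varLambda }(z_1,\ldots ,z_\ell )\, dF(z_1,\ldots ,z_\ell ) = H_{\textit{ann}}^{\varLambda }(\ell ). $$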

7.6

Prove that if a class of functions \(\mathcal {F}\) can shatter any data set of \(\ell \) samples, then the third milestone of VC theory is not fulfilled, that is, condition (7.32) does not hold.
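A sketch, assuming (7.32) is the usual growth-function condition \(\lim _{\ell \rightarrow \infty } G^{\varLambda }(\ell )/\ell = 0\): if \(\mathcal {F}\) shatters some set of \(\ell \) points for every \(\ell \), the maximum number of dichotomies it realizes on \(\ell \) points is \(2^{\ell }\), hence

$$ G^{\varLambda }(\ell ) = \ln 2^{\ell } = \ell \ln 2 \qquad \Rightarrow \qquad \lim _{\ell \rightarrow \infty } \frac{G^{\varLambda }(\ell )}{\ell } = \ln 2 \ne 0. $$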

7.7

Implement the AIC criterion. Consider the spam data, which can be downloaded from ftp.ics.uci.edu/pub/machine-learning-databases/spam. Randomly divide the spam data into two subsets with the same number of samples. Take the former and the latter as the training and the test set, respectively. Select a learning algorithm for classification (e.g., K-Means or MLP) and train the algorithm with several parameter values. Use the AIC criterion for model selection. Compare the performances of the resulting models by means of model assessment.
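A minimal sketch, in Python with scikit-learn, of one way to set up the experiment. The local file name spambase.data, the choice of an MLP classifier, the grid of hidden-layer sizes, and the parameter count read off the fitted weights are all assumptions made for illustration, not part of the problem statement:

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import log_loss, accuracy_score

    # Spambase: comma-separated rows, last column is the 0/1 spam label
    # (assumes the UCI file has been saved locally as "spambase.data").
    data = np.loadtxt("spambase.data", delimiter=",")
    idx = np.random.default_rng(0).permutation(len(data))
    half = len(data) // 2
    train, test = data[idx[:half]], data[idx[half:]]
    X_tr, y_tr = train[:, :-1], train[:, -1]
    X_te, y_te = test[:, :-1], test[:, -1]

    def aic(model, X, y):
        """AIC = 2k - 2 ln L: k free parameters, ln L the training log-likelihood."""
        k = sum(w.size for w in model.coefs_) + sum(b.size for b in model.intercepts_)
        log_lik = -log_loss(y, model.predict_proba(X), normalize=False)
        return 2 * k - 2 * log_lik

    scores = {}
    for hidden in (2, 5, 10, 20):                      # assumed complexity grid
        mlp = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=500,
                            random_state=0).fit(X_tr, y_tr)
        scores[hidden] = (aic(mlp, X_tr, y_tr), mlp)

    best_h, (best_aic, best_model) = min(scores.items(), key=lambda kv: kv[1][0])
    print(f"AIC selects {best_h} hidden units (AIC = {best_aic:.1f})")
    print(f"test accuracy: {accuracy_score(y_te, best_model.predict(X_te)):.3f}")

Model assessment then amounts to comparing the test-set accuracy of the selected model with that of the discarded candidates.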

7.8

Implement the BIC criterion. Repeat Problem 7.7 and use the BIC criterion for model selection. Compare its performance with AIC.
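Only the scoring function changes with respect to the sketch in Problem 7.7 (same assumed setup and variable names); BIC penalizes complexity by \(k \ln \ell \) instead of \(2k\):

    import numpy as np
    from sklearn.metrics import log_loss

    def bic(model, X, y):
        """BIC = k ln(l) - 2 ln L, with l the number of training samples."""
        k = sum(w.size for w in model.coefs_) + sum(b.size for b in model.intercepts_)
        log_lik = -log_loss(y, model.predict_proba(X), normalize=False)
        return k * np.log(len(X)) - 2 * log_lik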

7.9

Implement the cross-validation criterion. Repeat Problem 7.7 and use 5-fold cross-validation for model selection. Compare its performance with AIC and BIC.
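A sketch of the 5-fold selection step, again under the assumptions of the Problem 7.7 sketch (training arrays X_tr, y_tr and the same grid of hidden-layer sizes):

    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier

    cv_scores = {}
    for hidden in (2, 5, 10, 20):
        mlp = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=500, random_state=0)
        # mean accuracy over 5 folds of the training set
        cv_scores[hidden] = cross_val_score(mlp, X_tr, y_tr, cv=5).mean()

    best_h = max(cv_scores, key=cv_scores.get)
    print(f"5-fold cross-validation selects {best_h} hidden units")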

7.10

Implement the leave-one-out method and test it on the Iris data [12], which can be downloaded from ftp.ics.uci.edu/pub/machine-learning-databases/iris.
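A sketch of the leave-one-out estimate; the Iris data are loaded through scikit-learn for convenience rather than from the UCI file, and the 3-nearest-neighbour classifier is an assumed choice, not prescribed by the problem:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    clf = KNeighborsClassifier(n_neighbors=3)

    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf.fit(X[train_idx], y[train_idx])
        errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])

    # each sample is held out exactly once
    print(f"leave-one-out error rate: {errors / len(X):.3f}")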


Copyright information

Ā© 2015 Springer-Verlag London

About this chapter

Cite this chapter

Camastra, F., Vinciarelli, A. (2015). Foundations of Statistical Learning and Model Selection. In: Machine Learning for Audio, Image and Video Analysis. Advanced Information and Knowledge Processing. Springer, London. https://doi.org/10.1007/978-1-4471-6735-8_7


  • DOI: https://doi.org/10.1007/978-1-4471-6735-8_7

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-6734-1

  • Online ISBN: 978-1-4471-6735-8

