Abstract
First, well-known concepts from Statistical Learning Theory are reviewed. With reference to the problem of modelling an unknown input/output (I/O) relationship by fixed-structure parametrized functions, the concepts of expected risk, empirical risk, and generalization error are described. The latter error is then split into approximation and estimation errors. Four quantities of interest are emphasized: the accuracy, the number of arguments of the I/O relationship, the model complexity, and the number of samples generated for the estimation. The possibility of generating such samples by deterministic algorithms such as quasi-Monte Carlo methods, orthogonal arrays, and Latin hypercubes gives rise to so-called Deterministic Learning Theory. Deterministic generation is an intriguing alternative to the random generation of input data typically obtained via Monte Carlo techniques, since it makes it possible to reduce the number of samples (for the same accuracy) and to obtain upper bounds on the errors in deterministic rather than probabilistic terms. Deterministic learning relies on basic quantities such as variation and discrepancy. Special families of deterministic sequences called “low-discrepancy sequences” are useful in the computation of integrals and in dynamic programming, as they mitigate the danger of incurring the curse of dimensionality that derives from the use of regular grids.
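To make the Monte Carlo versus quasi-Monte Carlo contrast concrete, the following minimal sketch (not part of the chapter) estimates a d-dimensional integral over the unit cube with both random points and a Sobol’ low-discrepancy sequence. The integrand g, the dimension, the sample size, and the seed are all illustrative assumptions; it requires NumPy and SciPy ≥ 1.7.

```python
# Minimal sketch: Monte Carlo vs. quasi-Monte Carlo (Sobol') estimation of
# an integral over the unit cube [0, 1]^d.
import numpy as np
from scipy.stats import qmc

d = 5     # number of input variables (illustrative)
L = 1024  # number of samples; a power of 2 suits Sobol' sequences
rng = np.random.default_rng(0)

# Illustrative integrand with known integral: prod_i (2 * x_i) integrates
# to exactly 1 over [0, 1]^d.
def g(x):
    return np.prod(2.0 * x, axis=1)

# Random (Monte Carlo) input samples.
x_mc = rng.random((L, d))

# Deterministic low-discrepancy (Sobol') input samples.
x_qmc = qmc.Sobol(d=d, scramble=False).random(L)

print("exact integral       :", 1.0)
print("Monte Carlo estimate :", g(x_mc).mean())
print("Sobol' estimate      :", g(x_qmc).mean())
```

With unscrambled Sobol’ points and L a power of two, the quasi-Monte Carlo estimate is typically markedly closer to the exact value than the Monte Carlo estimate at the same L, which is the sample-saving effect mentioned in the abstract.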
Notes
1. A more suitable expression is actually “supervised learning from data,” to distinguish it from other forms of learning, such as “unsupervised learning from data.” However, since in the book we shall deal with supervised learning problems, the term “supervised” will typically be omitted.
2. Recall that the support of a function \(f :X \rightarrow \mathbb R\), where X is a normed linear space, is defined as \(\mathrm{supp}\,f \triangleq \mathrm{cl} (\{x \in X \,|\,f(x) \not = 0\})\), i.e., it is the closure in the norm of X of the set of points where \(f \not = 0\). For instance, for the hat function \(f(x) = \max \{0, 1-|x|\}\) on \(X = \mathbb R\), the set where \(f \not = 0\) is the open interval \((-1,1)\), so \(\mathrm{supp}\,f = [-1,1]\).
3. We refer to O(1/L) and \(O(1/L^{1/2})\) as “linear” and “quadratic” rates with respect to L, respectively, for reasons that closely parallel, in simpler form, what is reported in Assumptions 2.10, 2.11, 2.12, and 2.13 (L takes the place of n; the dimension d is absent). Let Q be a quantity dependent on L in such a way that \(Q(L) = O(1/L)\) (Q(L) corresponds, e.g., to the left-hand sides of (2.48) and (2.70)). Then, there exists a constant \(c_1\) such that \(Q(L) \le c_1/L\). Let \({\varepsilon } >0\). To evaluate the rate at which L has to grow when \({\varepsilon } \rightarrow 0\), hence when \(1/{\varepsilon } \rightarrow \infty \), in such a way as to guarantee that \(Q(L) \le {\varepsilon }\) holds, we impose \(c_1/L \le {\varepsilon }\). This yields \(L \ge c_1/{\varepsilon }\), which means that when \(1/{\varepsilon } \rightarrow \infty \), \(L \rightarrow \infty \) linearly with respect to \(1/{\varepsilon }\). Analogously, \(Q(L) = O(1/L^{1/2})\) implies that there exists \(c_2\) such that \(Q(L) \le c_2/L^{1/2}\), and so, reasoning as above, we get \(L \ge (c_2/{\varepsilon })^2\), which expresses that, for \(1/{\varepsilon } \rightarrow \infty \), \(L \rightarrow \infty \) quadratically with respect to \(1/{\varepsilon }\) (a small numeric illustration of this growth follows these notes).
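As a numeric companion to Note 3, the following sketch (not part of the chapter) computes, for a few target accuracies \({\varepsilon }\), the smallest sample sizes L satisfying \(c_1/L \le {\varepsilon }\) and \(c_2/L^{1/2} \le {\varepsilon }\); the constants \(c_1\) and \(c_2\) are illustrative assumptions.

```python
# Numeric illustration of Note 3: the smallest L with c1/L <= eps grows
# linearly in 1/eps, while the smallest L with c2/sqrt(L) <= eps grows
# quadratically in 1/eps. The constants c1, c2 are illustrative.
import math

c1, c2 = 1.0, 1.0
for eps in (0.1, 0.01, 0.001):
    L_linear = math.ceil(c1 / eps)            # from Q(L) <= c1 / L
    L_quadratic = math.ceil((c2 / eps) ** 2)  # from Q(L) <= c2 / L**(1/2)
    print(f"eps = {eps:5}: L >= {L_linear:>7} (linear), "
          f"L >= {L_quadratic:>9} (quadratic)")
```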
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Zoppoli, R., Sanguineti, M., Gnecco, G., Parisini, T. (2020). Design of Mathematical Models by Learning From Data and FSP Functions. In: Neural Approximations for Optimal Control and Decision. Communications and Control Engineering. Springer, Cham. https://doi.org/10.1007/978-3-030-29693-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29691-9
Online ISBN: 978-3-030-29693-3
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)