Learning Theory
Introduction
How does a machine learn an abstract concept from examples? How can a machine generalize to previously unseen situations? Learning theory is the study of (formalized versions of) such questions. Because there are many possible ways to formalize them, the focus of this entry is on one particular formalism, known as PAC (probably approximately correct) learning. It turns out that PAC learning theory is rich enough to capture intuitive notions of what learning should mean in the context of applications and, at the same time, is amenable to formal mathematical analysis. There are several precise and complete treatments of PAC learning theory, many of which are cited in the bibliography; this article is therefore devoted to sketching some high-level ideas.
Problem Formulation
In the PAC formalism, the starting point is the premise that there is an unknown set, say an unknown convex polygon or an unknown half-plane. The unknown set cannot be completely unknown; rather, something should be specified about its nature, in order for the problem to be both meaningful and tractable. For instance, in the first example above, the learner knows that the unknown set is a convex polygon, though not which polygon it might be. Similarly, in the second example, the learner knows that the unknown set is a half-plane, though not which half-plane. The collection of all possible unknown sets is known as the “concept class,” and the particular unknown set is referred to as the “target concept.” In the first example, the concept class would be the set of all convex polygons, and in the second it would be the set of all half-planes. The unknown set cannot be directly observed, of course; otherwise, there would be nothing to learn. Rather, one is given clues about the target concept by an “oracle,” which informs the learner whether or not a particular element belongs to the target concept. Therefore, the information available to the learner is a collection of “labelled samples” of the form \(\{(x_{i},I_{T}(x_{i})),\; i = 1,\ldots,m\}\), where m is the total number of labelled samples and \(I_{T}(\cdot)\) is the indicator function of the target concept T. Based on this information, the learner is expected to generate a “hypothesis” \(H_{m}\) that is a good approximation to the unknown target concept T.
One of the main features of PAC learning theory that distinguishes it from its forerunners is the observation that, no matter how many training samples are available to the learner, the hypothesis \(H_{m}\) can never exactly equal the unknown target concept T. Rather, all that one can expect is that \(H_{m}\) converges to T in some appropriate metric. Since the purpose of machine learning is to generate a hypothesis \(H_{m}\) that can be used to approximate the unknown target concept T for prediction purposes, a natural candidate for the metric that measures the disparity between \(H_{m}\) and T is the so-called generalization error, defined as follows: Suppose that, after m training samples have led to the hypothesis \(H_{m}\), a testing sample x is generated at random. One can now ask: what is the probability that the hypothesis \(H_{m}\) misclassifies x? In other words, what is the value of \(\Pr \{I_{H_{m}}(x)\neq I_{T}(x)\}\)? This quantity is known as the generalization error, and the objective is to ensure that it approaches zero as m → ∞.
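As a concrete (and purely illustrative) instance of this definition, suppose the target concept is the half-plane x + y ≤ 1 in the unit square and the hypothesis is the half-plane x + y ≤ 0.9. The generalization error under the uniform distribution can then be estimated by Monte Carlo sampling; the names and numbers below are invented for this sketch, not taken from any particular algorithm.

```python
import random

def in_T(p):
    """Target concept T: the half-plane x + y <= 1.0 (unknown to a learner)."""
    return p[0] + p[1] <= 1.0

def in_H(p):
    """A fixed hypothesis H: the half-plane x + y <= 0.9."""
    return p[0] + p[1] <= 0.9

def generalization_error(n=100_000, seed=0):
    """Monte Carlo estimate of Pr{I_H(x) != I_T(x)} for x uniform on the unit square."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(n):
        p = (rng.random(), rng.random())
        if in_H(p) != in_T(p):
            mistakes += 1
    return mistakes / n
```

Since the area of the strip 0.9 < x + y ≤ 1 inside the unit square is 1/2 − 0.9²/2 = 0.095, the estimate should be close to 0.095.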
The manner in which the samples are generated leads to different models of learning. For instance, if the learner is able to choose the next sample \(x_{m+1}\) on the basis of the previous m labelled samples, which is then passed on to the oracle for labeling, this is known as “active learning.” More common is “passive learning,” in which the sequence of training samples \(\{x_{i}\}_{i\geq 1}\) is generated at random, in an independent and identically distributed (i.i.d.) fashion, according to some probability distribution P. In this case, even the hypothesis \(H_{m}\) and the generalization error are random, because they depend on the randomly generated training samples. This is the rationale behind the nomenclature “probably approximately correct.” The hypothesis \(H_{m}\) is not expected to equal the unknown target concept T exactly, only approximately. Even that is only probably true, because in principle the randomly generated training samples could be wholly unrepresentative and thus lead to a poor hypothesis. If we toss a fair coin many times, there is a small but always positive probability that it turns up heads every time. As the coin is tossed more and more times, this probability becomes smaller, but it never equals zero.
Examples
Example 1.
Suppose, as in the problem formulation above, that the concept class consists of all half-planes, and that the learner is given a collection of labelled samples. A reasonable approach is to choose some half-plane that is consistent with the data, i.e., correctly classifies every labelled sample. For instance, the well-known support vector machine (SVM) algorithm chooses the unique half-plane such that the closest sample to the dividing line is as far from it as possible; see the paper by Cortes and Vapnik (1995).
In Fig. 1, H denotes the boundary of the hypothesis, which is another half-plane. The shaded region is the symmetric difference \(T\Delta H\) between the two half-planes: the set of points that are misclassified by the hypothesis H. Of course, we do not know what this set is, because we do not know T. It can be shown that, whenever the hypothesis H is chosen to be consistent in the sense of correctly classifying all labelled samples, the generalization error goes to zero as the number of samples approaches infinity.
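The consistency property above can be demonstrated in a few lines if we deliberately restrict attention to a one-parameter sub-family of half-planes, those of the form x + y ≤ θ. This simplification, and all names below (TRUE_THETA, sample, etc.), are illustrative assumptions; this is not the SVM algorithm. A learner that outputs the tightest threshold consistent with the positively labelled samples is always consistent, and its generalization error shrinks as m grows.

```python
import random

TRUE_THETA = 1.0  # defines the target half-plane x + y <= 1.0; unknown to the learner

def sample(m, rng):
    """m i.i.d. points, uniform on the unit square, labelled by the oracle."""
    pts = [(rng.random(), rng.random()) for _ in range(m)]
    return [(p, p[0] + p[1] <= TRUE_THETA) for p in pts]

def learn_theta(labelled):
    """Consistent hypothesis: the tightest threshold that keeps every
    positively labelled sample inside the half-plane x + y <= theta."""
    pos = [p[0] + p[1] for p, lab in labelled if lab]
    return max(pos) if pos else 0.0

def gen_error(theta_hat, n=100_000, seed=1):
    """Monte Carlo estimate of the probability that the hypothesis
    x + y <= theta_hat disagrees with the target x + y <= TRUE_THETA."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(n):
        s = rng.random() + rng.random()
        if (s <= theta_hat) != (s <= TRUE_THETA):
            mistakes += 1
    return mistakes / n
```

Since the learned threshold never exceeds TRUE_THETA, the hypothesis is a subset of the target, and the generalization error is exactly the probability mass of the thin band between the two thresholds.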
Example 2.
Now suppose the concept class consists of all convex polygons in the unit square X, and let T denote the (unknown) target convex polygon. This situation is depicted on the right side of Fig. 1. This time, let us assume that the probability distribution that generates the samples is the uniform distribution on X. Given a set of positively and negatively labelled samples (with the same convention as in Example 1), let us choose the hypothesis H to be the convex hull of all positively labelled samples, as shown in the figure. Since every positively labelled sample belongs to T, and T is a convex set, it follows that H is a subset of T. Moreover, since \(T\Delta H = T\setminus H\), the generalization error is P(T ∖ H). It can be shown that this algorithm also “works,” in the sense that the generalization error goes to zero as the number of samples approaches infinity.
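A sketch of this convex-hull algorithm follows, using Andrew's monotone chain to compute the hull. The target square [0.2, 0.8]² is a hypothetical stand-in for the unknown convex polygon T, and all names here are invented for the illustration.

```python
import random

def cross(o, a, b):
    """2-D cross product of vectors (a - o) and (b - o)."""
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def in_hull(hull, p):
    """True iff p lies inside or on the CCW convex polygon `hull`."""
    if len(hull) < 3:
        return False
    n = len(hull)
    return all(cross(hull[i], hull[(i+1) % n], p) >= 0 for i in range(n))

def in_T(p):
    """Hypothetical target: the square [0.2, 0.8]^2, a convex polygon."""
    return 0.2 <= p[0] <= 0.8 and 0.2 <= p[1] <= 0.8

def learn_and_test(m, n_test=20_000, seed=0):
    """Hypothesis = convex hull of positive samples; Monte Carlo error estimate."""
    rng = random.Random(seed)
    train = [(rng.random(), rng.random()) for _ in range(m)]
    hull = convex_hull([p for p in train if in_T(p)])
    mistakes = 0
    for _ in range(n_test):
        q = (rng.random(), rng.random())
        if in_hull(hull, q) != in_T(q):
            mistakes += 1
    return hull, mistakes / n_test
```

Because every hull vertex is a positively labelled sample, the hypothesis is guaranteed to be a subset of T, so all misclassifications are of points in T ∖ H, exactly as in the argument above.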
Vapnik-Chervonenkis Dimension
Given any concept class \(\mathcal{C}\), there is a single integer that offers a measure of the richness of the class, known as the Vapnik-Chervonenkis (or VC) dimension, after its originators.
Definition 1.
A set \(S\subseteq X\) is said to be shattered by a concept class \(\mathcal{C}\) if, for every subset \(B\subseteq S\), there is a set \(A \in \mathcal{C}\) such that \(S \cap A = B\). The VC dimension of \(\mathcal{C}\) is the largest integer d such that there is a finite set of cardinality d that is shattered by \(\mathcal{C}\); if sets of every finite cardinality can be shattered, the VC dimension is said to be infinite.
Example 3.
It can be shown that the set of half-planes in \({\mathbb{R}}^{2}\) has VC dimension three. Choose a set S = { x, y, z} consisting of three points that are not collinear, as in Fig. 2. Then there are 2^{3} = 8 subsets of S. The point is to show that for each of these eight subsets, there is a half-plane that contains precisely that subset, nothing more and nothing less. That this is possible is shown in Fig. 2: four of the eight situations are depicted, and the remaining four can be covered by taking the complement of the half-plane shown. It is also necessary to show that no set with four or more elements can be shattered, but that step is omitted; the reader is instead referred to any standard text, such as Vidyasagar (1997). More generally, it can be shown that the set of half-planes in \({\mathbb{R}}^{k}\) has VC dimension k + 1.
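The shattering claim can be checked by brute force: a subset B of S is cut out by some half-plane exactly when B and S ∖ B are linearly separable. The sketch below tests separability by searching candidate separating axes (all pairwise difference directions and their normals), which suffices for small point sets in general position; the point sets and function names are invented for the illustration. The four corners of the unit square serve as a four-point set that is not shattered.

```python
from itertools import combinations

def separable(B, C):
    """Whether finite point sets B and C in the plane can be strictly
    separated by a line, via a separating-axis search over all pairwise
    difference directions and their normals (adequate for small sets
    in general position)."""
    if not B or not C:
        return True
    axes = []
    for p, q in combinations(list(B) + list(C), 2):
        dx, dy = q[0] - p[0], q[1] - p[1]
        axes.append((dx, dy))    # direction of the pair
        axes.append((-dy, dx))   # its normal
    for a in axes:
        if a == (0, 0):
            continue
        projB = [a[0]*p[0] + a[1]*p[1] for p in B]
        projC = [a[0]*p[0] + a[1]*p[1] for p in C]
        if max(projB) < min(projC) or max(projC) < min(projB):
            return True
    return False

def shattered_by_halfplanes(S):
    """True iff every subset B of S is cut out by some half-plane,
    i.e. B and S \\ B are linearly separable."""
    S = list(S)
    for r in range(len(S) + 1):
        for B in combinations(S, r):
            C = [p for p in S if p not in B]
            if not separable(list(B), C):
                return False
    return True
```

For the square, the diagonal pair {(0,0), (1,1)} cannot be separated from {(0,1), (1,0)}, which is why no four-point set in convex position can be shattered by half-planes.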
Example 4.
The set of convex polygons has infinite VC dimension. To see this, let K be a strictly convex set, as shown in Fig. 2b. (Recall that a set is “strictly convex” if none of its boundary points is a convex combination of other points in the set.) Choose any finite collection of n boundary points of K, say \(S =\{ x_{1},\ldots,x_{n}\}\). If B is a subset of S, then the convex hull of B is a convex polygon, and, by the strict convexity property, it contains no point of S other than those in B. Hence \(S\cap \mathrm{conv}(B) = B\), and S is shattered. Since this argument holds for every integer n, the class of convex polygons has infinite VC dimension.
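This argument can be checked numerically for points on a circle (a strictly convex curve): for every subset B, the convex hull of B contains no other sample point, so the finite set is shattered by convex polygons. The construction below is an illustrative sketch with invented names.

```python
import math
from itertools import combinations

def circle_points(n):
    """n distinct points on the unit circle, listed in counter-clockwise order."""
    return [(math.cos(2*math.pi*k/n), math.sin(2*math.pi*k/n)) for k in range(n)]

def cross(o, a, b):
    """2-D cross product of vectors (a - o) and (b - o)."""
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def in_convex_polygon(verts, p):
    """verts in CCW order; True iff p lies inside or on the polygon.
    A degenerate hull (a single circle point or a chord) contains no
    other point of the circle, so fewer than 3 vertices return False."""
    if len(verts) < 3:
        return False
    n = len(verts)
    return all(cross(verts[i], verts[(i+1) % n], p) >= 0 for i in range(n))

def circle_set_is_shattered(n):
    """Check S cap conv(B) == B for every subset B of n circle points."""
    S = circle_points(n)
    for r in range(n + 1):
        for idx in combinations(range(n), r):
            verts = [S[i] for i in idx]   # increasing index = CCW order
            others = [S[i] for i in range(n) if i not in idx]
            if any(in_convex_polygon(verts, p) for p in others):
                return False
    return True
```

By contrast, a point placed in the interior (for instance the center of the circle) lies in the hull of the boundary points, so such a configuration would not be shattered.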
Two Important Theorems
Theorem 1 (Blumer et al. (1989)).
A concept class is distribution-free PAC learnable if and only if it has finite VC dimension.
Theorem 2 (Benedek and Itai (1991)).
Suppose P is a fixed probability distribution. Then the concept class \(\mathcal{C}\) is PAC learnable if and only if, for every positive number ε, it is possible to cover \(\mathcal{C}\) by a finite number of balls of radius ε with respect to the pseudometric \(d_{P}\), defined by \(d_{P}(A,B) = P(A\Delta B)\).
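To make the covering condition concrete, consider the toy class of intervals [0, θ] under the uniform distribution on [0, 1] (an illustration, not an example from this article). Here \(d_{P}([0,a],[0,b]) = P([0,a]\Delta [0,b]) = |a-b|\), so balls of radius ε centered at ε, 3ε, 5ε, … form a finite cover of size ⌈1∕(2ε)⌉.

```python
import math

def d_P(a, b):
    """Pseudometric between the concepts [0, a] and [0, b]:
    d_P = P([0,a] symmetric-difference [0,b]) = |a - b| under uniform P."""
    return abs(a - b)

def epsilon_cover(eps):
    """Ball centers eps, 3*eps, 5*eps, ... covering the parameter range [0, 1]."""
    k = math.ceil(1 / (2 * eps))
    return [min((2*i + 1) * eps, 1.0) for i in range(k)]

def is_cover(centers, eps, grid=1000):
    """Check on a fine grid that every theta in [0, 1] is within eps of a center."""
    return all(min(d_P(t, c) for c in centers) <= eps + 1e-12
               for t in (j / grid for j in range(grid + 1)))
```

Since the cover is finite for every ε > 0, Theorem 2 implies this toy class is PAC learnable under the uniform distribution.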
Now let us return to the two examples studied previously. Since the set of half-planes has finite VC dimension, it is distribution-free PAC learnable. The set of convex polygons can be shown to satisfy the conditions of Theorem 2 when P is the uniform distribution and is therefore PAC learnable with respect to that distribution. However, since it has infinite VC dimension, it follows from Theorem 1 that it is not distribution-free PAC learnable.
Summary and Future Directions
The basic PAC learning framework described above has been extended in several directions, including:
- Learning under an “intermediate” family of probability distributions \(\mathcal{P}\) that is not necessarily equal to \({\mathcal{P}}^{{\ast}}\), the set of all distributions (Kulkarni and Vidyasagar 1997)
- Relaxing the requirement that the algorithm should work uniformly well for all target concepts, and requiring instead only that it work with high probability (Campi and Vidyasagar 2001)
- Relaxing the requirement that the training samples be independent of each other, permitting them instead to have Markovian dependence (Gamarnik 2003; Meir 2000) or β-mixing dependence (Vidyasagar 2003)
There is also considerable research into finding alternative sets of necessary and sufficient conditions for learnability. Unfortunately, many of these conditions are unverifiable and amount to tautological restatements of the problem under study.
Bibliography
 Anthony M, Bartlett PL (1999) Neural network learning: theoretical foundations. Cambridge University Press, Cambridge
 Anthony M, Biggs N (1992) Computational learning theory. Cambridge University Press, Cambridge
 Benedek G, Itai A (1991) Learnability by fixed distributions. Theor Comput Sci 86:377–389
 Blumer A, Ehrenfeucht A, Haussler D, Warmuth M (1989) Learnability and the Vapnik-Chervonenkis dimension. J ACM 36(4):929–965
 Campi M, Vidyasagar M (2001) Learning with prior information. IEEE Trans Autom Control 46(11):1682–1695
 Cortes C, Vapnik VN (1995) Support-vector networks. Mach Learn 20(3):273–297
 Devroye L, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York
 Gamarnik D (2003) Extension of the PAC framework to finite and countable Markov chains. IEEE Trans Inf Theory 49(1):338–345
 Kearns M, Vazirani U (1994) Introduction to computational learning theory. MIT Press, Cambridge
 Kulkarni SR, Vidyasagar M (1997) Learning decision rules under a family of probability measures. IEEE Trans Inf Theory 43(1):154–166
 Meir R (2000) Nonparametric time series prediction through adaptive model selection. Mach Learn 39(1):5–34
 Natarajan BK (1991) Machine learning: a theoretical approach. Morgan Kaufmann, San Mateo
 van der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes. Springer, New York
 Vapnik VN (1995) The nature of statistical learning theory. Springer, New York
 Vapnik VN (1998) Statistical learning theory. Wiley, New York
 Vidyasagar M (1997) A theory of learning and generalization. Springer, London
 Vidyasagar M (2003) Learning and generalization: with applications to neural networks. Springer, London