# Encyclopedia of Systems and Control

Living Edition | Editors: John Baillieul, Tariq Samad

# Learning Theory

• Mathukumalli Vidyasagar
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4471-5102-9_227-1

## Introduction

How does a machine learn an abstract concept from examples? How can a machine generalize to previously unseen situations? Learning theory is the study of (formalized versions of) such questions. There are many possible ways to formulate such questions. Therefore, the focus of this entry is on one particular formalism, known as PAC (probably approximately correct) learning. It turns out that PAC learning theory is rich enough to capture intuitive notions of what learning should mean in the context of applications and, at the same time, is amenable to formal mathematical analysis. There are several precise and complete studies of PAC learning theory, many of which are cited in the bibliography. Therefore, this article is devoted to sketching some high-level ideas.

## Problem Formulation

In the PAC formalism, the starting point is the premise that there is an unknown set, say an unknown convex polygon, or an unknown half-plane. The unknown set cannot be completely unknown; rather, something should be specified about its nature, in order for the problem to be both meaningful and tractable. For instance, in the first example above, the learner knows that the unknown set is a convex polygon, though it is not known which polygon it might be. Similarly, in the second example, the learner knows that the unknown set is a half-plane, though it is not known which half-plane. The collection of all possible unknown sets is known as the “concept class,” and the particular unknown set is referred to as the “target concept.” In the first example, the concept class would be the set of all convex polygons and in the second case it would be the set of half-planes. The unknown set cannot be directly observed of course; otherwise, there would be nothing to learn. Rather, one is given clues about the target concept by an “oracle,” which informs the learner whether or not a particular element belongs to the target concept. Therefore, the information available to the learner is a collection of “labelled samples,” in the form $$\{(x_{i},I_{T}(x_{i})),\ i = 1,\ldots,m\}$$, where m is the total number of labelled samples and $$I_{T}(\cdot)$$ is the indicator function of the target concept T. Based on this information, the learner is expected to generate a “hypothesis” $$H_{m}$$ that is a good approximation to the unknown target concept T.

One of the main features of PAC learning theory that distinguishes it from its forerunners is the observation that, no matter how many training samples are available to the learner, the hypothesis $$H_{m}$$ can never exactly equal the unknown target concept T. Rather, all that one can expect is that $$H_{m}$$ converges to T in some appropriate metric. Since the purpose of machine learning is to generate a hypothesis $$H_{m}$$ that can be used to approximate the unknown target concept T for prediction purposes, a natural candidate for the metric that measures the disparity between $$H_{m}$$ and T is the so-called generalization error, defined as follows: Suppose that, after m training samples that have led to the hypothesis $$H_{m}$$, a testing sample x is generated at random. One can now ask: what is the probability that the hypothesis $$H_{m}$$ misclassifies x? In other words, what is the value of $$\Pr \{I_{H_{m}}(x)\neq I_{T}(x)\}$$? This quantity is known as the generalization error, and the objective is to ensure that it approaches zero as $$m \rightarrow \infty$$.

The manner in which the samples are generated leads to different models of learning. For instance, if the learner is able to choose the next sample $$x_{m+1}$$ on the basis of the previous m labelled samples, which is then passed on to the oracle for labelling, this is known as “active learning.” More common is “passive learning,” in which the sequence of training samples $$\{x_{i}\}_{i\geq 1}$$ is generated at random, in an independent and identically distributed (i.i.d.) fashion, according to some probability distribution P. In this case, even the hypothesis $$H_{m}$$ and the generalization error are random, because they depend on the randomly generated training samples. This is the rationale behind the nomenclature “probably approximately correct.” The hypothesis $$H_{m}$$ is not expected to equal the unknown target concept T exactly, only approximately. Even that is only probably true, because in principle it is possible that the randomly generated training samples could be totally unrepresentative and thus lead to a poor hypothesis. If we toss a coin many times, there is a small but always positive probability that it could turn up heads every time. As the coin is tossed more and more times, this probability becomes smaller, but will never equal zero.
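The coin-tossing remark can be made concrete with a short calculation. As a sketch, assume the coin is fair (the text does not say so). The probability that m independent tosses all come up heads is then

$$\Pr \{\text{all } m \text{ tosses are heads}\} = {2}^{-m},$$

which is about $$9.8 \times {10}^{-4}$$ for m = 10 and about $$9.5 \times {10}^{-7}$$ for m = 20: positive for every finite m, yet decreasing to zero. In the same spirit, one common way of reading “probably approximately correct” uses an accuracy level ε and a confidence level δ, two symbols introduced here only for illustration. One asks that, for every ε, δ > 0, there is a sample size $$m_{0}(\epsilon,\delta)$$ such that

$$\Pr \{P(T\Delta H_{m}) >\epsilon \}\leq \delta \quad \text{for all } m \geq m_{0}(\epsilon,\delta),$$

where $$T\Delta H_{m}$$ is the set of points misclassified by $$H_{m}$$, so that $$P(T\Delta H_{m})$$ is the generalization error, and the outer probability is taken over the random training samples. “Approximately” corresponds to ε and “probably” to δ.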

## Examples

### Example 1.

Consider the situation where the concept class consists of all half-planes in $${\mathbb{R}}^{2}$$, as indicated in the left side of Fig. 1. Here the unknown target concept T is some fixed but unknown half-plane. The symbol T is next to the boundary of the half-plane, and all points to the right of the line constitute the target half-plane. The training samples, generated at random according to some unknown probability distribution P, are also shown in the figure. The samples that belong to T are shown as blue rectangles, while those that do not belong to T are shown as red dots. Knowing only these labelled samples, the learner is expected to guess what T might be.

A reasonable approach is to choose some half-plane that agrees with the data and correctly classifies the labelled data. For instance, the well-known support vector machine (SVM) algorithm chooses the unique half-plane such that the closest sample to the dividing line is as far as possible from it; see the paper by Cortes and Vapnik (1997).

The symbol H denotes the boundary of a hypothesis, which is another half-plane. The shaded region is the symmetric difference between the two half-planes. The set $$T\Delta H$$ is the set of points that are misclassified by the hypothesis H. Of course, we do not know what this set is, because we do not know T. It can be shown that, whenever the hypothesis H is chosen to be consistent in the sense of correctly classifying all labelled samples, the generalization error goes to zero as the number of samples approaches infinity.

### Example 2.

Now suppose the concept class consists of all convex polygons in the unit square, and let T denote the (unknown) target convex polygon. This situation is depicted in the right side of Fig. 1. This time let us assume that the probability distribution that generates the samples is the uniform distribution on the unit square X. Given a set of positively and negatively labelled samples (the same convention as in Example 1), let us choose the hypothesis H to be the convex hull of all positively labelled samples, as shown in the figure. Since every positively labelled sample belongs to T, and T is a convex set, it follows that H is a subset of T. Consequently the set of misclassified points $$T\Delta H$$ reduces to T ∖ H, and P(T ∖ H) is the generalization error. It can be shown that this algorithm also “works” in the sense that the generalization error goes to zero as the number of samples approaches infinity.
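The convex-hull learner of Example 2 can be simulated in a few lines. The following is a minimal sketch rather than a definitive implementation: the target quadrilateral `TARGET`, the function names, and the Monte Carlo test-set size are all assumptions chosen for illustration. The sketch draws uniform samples from the unit square, takes the convex hull of the positively labelled ones, and estimates the generalization error P(T ∖ H) on fresh samples.

```python
import random


def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_convex(poly, p):
    """True if p lies in the closed convex polygon `poly` (counter-clockwise vertices)."""
    n = len(poly)
    if n < 3:
        return False  # a point or segment has probability zero under the uniform law
    return all(cross(poly[i], poly[(i + 1) % n], p) >= 0 for i in range(n))


# Hypothetical target concept T: a convex quadrilateral inside the unit square.
TARGET = [(0.2, 0.1), (0.9, 0.3), (0.7, 0.9), (0.1, 0.6)]


def learn_hull(m, rng):
    """Draw m uniform samples, label them with T, return the hull of the positives."""
    samples = [(rng.random(), rng.random()) for _ in range(m)]
    positives = [x for x in samples if inside_convex(TARGET, x)]
    return convex_hull(positives)


def generalization_error(hull, rng, n_test=20000):
    """Monte Carlo estimate of the probability that the hypothesis misclassifies x."""
    errors = 0
    for _ in range(n_test):
        x = (rng.random(), rng.random())
        if inside_convex(TARGET, x) != inside_convex(hull, x):
            errors += 1
    return errors / n_test


if __name__ == "__main__":
    rng = random.Random(0)
    for m in (20, 200, 2000):
        h = learn_hull(m, rng)
        print(m, generalization_error(h, rng))
```

Because every hull vertex is a positively labelled sample, the hypothesis never leaves T, and the estimated error should shrink as m grows, in line with the claim in the text.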

## Vapnik-Chervonenkis Dimension

Given any concept class $$\mathcal{C}$$, there is a single integer that offers a measure of the richness of the class, known as the Vapnik-Chervonenkis (or VC) dimension, after its originators.

### Definition 1.

A set $$S\subseteq X$$ is said to be shattered by a concept class $$\mathcal{C}$$ if, for every subset $$B\subseteq S$$, there is a set $$A \in \mathcal{C}$$ such that $$S \cap A = B$$. The VC dimension of $$\mathcal{C}$$ is the largest integer d such that there is a finite set of cardinality d that is shattered by $$\mathcal{C}$$.

### Example 3.

It can be shown that the set of half-planes in $${\mathbb{R}}^{2}$$ has VC dimension three. Choose a set $$S =\{ x,y,z\}$$ consisting of three points that are not collinear, as in Fig. 2. Then there are $${2}^{3} = 8$$ subsets of S. The point is to show that for each of these eight subsets, there is a half-plane that contains precisely that subset, nothing more and nothing less. That this is possible is shown in Fig. 2. Four out of the eight situations are depicted in this figure, and the remaining four situations can be covered by taking the complement of the half-plane shown. It is also necessary to show that no set with four or more elements can be shattered, but that step is omitted; instead the reader is referred to any standard text such as Vidyasagar (1997). More generally, it can be shown that the set of half-planes in $${\mathbb{R}}^{k}$$ has VC dimension k + 1.
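The shattering condition of Definition 1 can be checked by brute force for small point sets in the plane. The sketch below is illustrative only: the function names and the tolerance `TOL` are assumptions, and the separability test relies on the observation that a separating direction, if one exists, can be found strictly between two consecutive "critical" directions (those normal to a line through two of the points). It is meant for small, well-spaced examples, not as a robust geometric routine.

```python
import math
from itertools import combinations

TOL = 1e-9  # margins below this are treated as numerical noise (an assumption)


def separable(inside, outside):
    """True if some half-plane contains all of `inside` and none of `outside`."""
    if not inside or not outside:
        return True  # a far-away half-plane contains none (or all) of a finite set
    # The answer can only change at directions normal to a line through two points.
    critical = set()
    for (px, py), (qx, qy) in combinations(inside + outside, 2):
        normal = math.atan2(qy - py, qx - px) + math.pi / 2
        critical.add(normal % (2 * math.pi))
        critical.add((normal + math.pi) % (2 * math.pi))
    angles = sorted(critical)
    angles.append(angles[0] + 2 * math.pi)  # close the circle of directions
    # Try one direction strictly between each pair of consecutive critical ones.
    for lo, hi in zip(angles, angles[1:]):
        theta = (lo + hi) / 2
        wx, wy = math.cos(theta), math.sin(theta)
        lowest_in = min(wx * x + wy * y for x, y in inside)
        highest_out = max(wx * x + wy * y for x, y in outside)
        if lowest_in - highest_out > TOL:
            return True  # a line normal to (wx, wy) separates the two groups
    return False


def is_shattered(points):
    """Check Definition 1 for half-planes by enumerating all subsets of `points`."""
    n = len(points)
    for mask in range(2 ** n):
        inside = [p for i, p in enumerate(points) if (mask >> i) & 1]
        outside = [p for i, p in enumerate(points) if not (mask >> i) & 1]
        if not separable(inside, outside):
            return False
    return True


if __name__ == "__main__":
    print(is_shattered([(0, 0), (1, 0), (0, 1)]))          # three non-collinear points
    print(is_shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # four corners of a square
```

Checking a handful of four-point sets this way does not prove that no four-point set can be shattered; it only illustrates why the two diagonal corners of a square, or a point inside a triangle, cannot be picked out by a half-plane.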

### Example 4.

The set of convex polygons has infinite VC dimension. To see this, take a strictly convex set, as shown in Fig. 2b. (Recall that a set is “strictly convex” if none of its boundary points is a convex combination of other points in the set.) Choose any finite collection of its boundary points, call it $$S =\{ x_{1},\ldots,x_{n}\}$$. If B is a subset of S, then the convex hull of B does not contain any other point of S, due to the strict convexity property. The convex hull of B is therefore a convex polygon whose intersection with S is exactly B, so S is shattered. Since this argument holds for every integer n, the class of convex polygons has infinite VC dimension.
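The argument of Example 4 can be checked numerically for a particular strictly convex set. The sketch below assumes the unit disk as that set and n equally spaced boundary points; both choices, and all function names, are illustrative assumptions. For every subset B it verifies that the convex hull of B contains the points of B and no other chosen point. Subsets with fewer than three points give a degenerate hull (empty, a point, or a segment), which is treated here as a limiting case of a thin polygon.

```python
import math


def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_hull(vertices, p, tol=1e-12):
    """Membership of p in the convex hull of `vertices`.

    `vertices` must be extreme points listed in counter-clockwise order.
    """
    n = len(vertices)
    if n == 0:
        return False
    if n == 1:
        return math.dist(vertices[0], p) <= tol
    if n == 2:
        a, b = vertices
        on_line = abs(cross(a, b, p)) <= tol
        in_box = (min(a[0], b[0]) - tol <= p[0] <= max(a[0], b[0]) + tol
                  and min(a[1], b[1]) - tol <= p[1] <= max(a[1], b[1]) + tol)
        return on_line and in_box
    return all(cross(vertices[i], vertices[(i + 1) % n], p) >= -tol
               for i in range(n))


def circle_points(n):
    """n equally spaced points on the unit circle, in counter-clockwise order."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]


def hulls_pick_out_every_subset(points):
    """True if, for each subset B, the hull of B meets `points` exactly in B."""
    n = len(points)
    for mask in range(2 ** n):
        # Taking points in index order keeps the subset counter-clockwise.
        subset = [p for i, p in enumerate(points) if (mask >> i) & 1]
        for i, p in enumerate(points):
            if in_hull(subset, p) != bool((mask >> i) & 1):
                return False
    return True


if __name__ == "__main__":
    print(hulls_pick_out_every_subset(circle_points(8)))
```

The same check can be run for larger n, at a cost that grows as the number of subsets, which mirrors the statement that the argument holds for every integer n.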

## Two Important Theorems

Out of the many important results in learning theory, two are noteworthy.

### Theorem 1 (Blumer et al. (1989)).

A concept class is distribution-free PAC learnable if and only if it has finite VC dimension.

### Theorem 2 (Benedek and Itai (1991)).

Suppose P is a fixed probability distribution. Then the concept class $$\mathcal{C}$$ is PAC learnable if and only if, for every positive number ε, it is possible to cover $$\mathcal{C}$$ by a finite number of balls of radius ε, with respect to the pseudometric $$d_{P}$$ defined by $$d_{P}(A,B) = P(A\Delta B)$$.

Now let us return to the two examples studied previously. Since the set of half-planes has finite VC dimension, it is distribution-free PAC learnable. The set of convex polygons can be shown to satisfy the conditions of Theorem 2 if P is the uniform distribution and is therefore PAC learnable. However, since it has infinite VC dimension, it follows from Theorem 1 that it is not distribution-free PAC learnable.
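Theorem 1 is qualitative, but the finite-VC-dimension case comes with explicit sample-size estimates. The sketch below evaluates one commonly quoted sufficient sample size for consistent learners, attributed to Blumer et al. (1989); the exact constants and the use of base-2 logarithms are stated here as assumptions and should be checked against that paper. The function name and the symbols ε (accuracy) and δ (confidence) are introduced only for this illustration.

```python
import math


def blumer_sample_size(epsilon, delta, vc_dim):
    """A sufficient number of samples for a consistent learner (assumed form).

    Assumed bound: m >= max((4/eps) * log2(2/delta), (8*d/eps) * log2(13/eps)),
    where d is the VC dimension of the concept class.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    if vc_dim < 1:
        raise ValueError("vc_dim must be a positive integer")
    confidence_term = (4 / epsilon) * math.log2(2 / delta)
    richness_term = (8 * vc_dim / epsilon) * math.log2(13 / epsilon)
    return math.ceil(max(confidence_term, richness_term))


if __name__ == "__main__":
    # Half-planes in the plane: VC dimension k + 1 = 3 with k = 2 (Example 3).
    print(blumer_sample_size(epsilon=0.1, delta=0.05, vc_dim=3))
```

The bound is finite for every finite VC dimension and grows without limit as the VC dimension does, which is consistent with the contrast drawn above between half-planes and convex polygons. Such bounds are typically conservative, and the actual number of samples needed in a given problem may be much smaller.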

## Summary and Future Directions

This brief entry presents only the most basic aspects of PAC learning theory. Many more results are known about PAC learning theory, and of course many interesting problems remain unsolved. Some of the known extensions are:
• Learning under an “intermediate” family of probability distributions $$\mathcal{P}$$ that is not necessarily equal to $${\mathcal{P}}^{{\ast}}$$, the set of all distributions (Kulkarni and Vidyasagar 1997)

• Relaxing the requirement that the algorithm should work uniformly well for all target concepts and requiring instead only that it should work with high probability (Campi and Vidyasagar 2001)

• Relaxing the requirement that the training samples are independent of each other and permitting them to have Markovian dependence (Gamarnik 2003; Meir 2000) or β-mixing dependence (Vidyasagar 2003)

There is considerable research in finding alternate sets of necessary and sufficient conditions for learnability. Unfortunately, many of these conditions are unverifiable and amount to tautological restatements of the problem under study.

## Bibliography

1. Anthony M, Bartlett PL (1999) Neural network learning: theoretical foundations. Cambridge University Press, Cambridge
2. Anthony M, Biggs N (1992) Computational learning theory. Cambridge University Press, Cambridge
3. Benedek G, Itai A (1991) Learnability by fixed distributions. Theor Comput Sci 86:377–389
4. Blumer A, Ehrenfeucht A, Haussler D, Warmuth M (1989) Learnability and the Vapnik-Chervonenkis dimension. J ACM 36(4):929–965
5. Campi M, Vidyasagar M (2001) Learning with prior information. IEEE Trans Autom Control 46(11):1682–1695
6. Devroye L, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York
7. Gamarnik D (2003) Extension of the PAC framework to finite and countable Markov chains. IEEE Trans Inf Theory 49(1):338–345
8. Kearns M, Vazirani U (1994) Introduction to computational learning theory. MIT, Cambridge
9. Kulkarni SR, Vidyasagar M (1997) Learning decision rules under a family of probability measures. IEEE Trans Inf Theory 43(1):154–166
10. Meir R (2000) Nonparametric time series prediction through adaptive model selection. Mach Learn 39(1):5–34
11. Natarajan BK (1991) Machine learning: a theoretical approach. Morgan-Kaufmann, San Mateo
12. van der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes. Springer, New York
13. Vapnik VN (1995) The nature of statistical learning theory. Springer, New York
14. Vapnik VN (1998) Statistical learning theory. Wiley, New York
15. Vidyasagar M (1997) A theory of learning and generalization. Springer, London
16. Vidyasagar M (2003) Learning and generalization: with applications to neural networks. Springer, London