Abstract
We introduce the speculate-correct method to derive error bounds for local classifiers. Using it, we show that k-nearest neighbor classifiers, in spite of their famously fractured decision boundaries, have exponential error bounds with \(\hbox {O} \left( \sqrt{(k + \ln n)/n} \right) \) range around an estimate of generalization error for n in-sample examples.
Introduction
Local classifiers use only a small subset of their examples to classify each input. The best-known local classifier is the nearest neighbor classifier. To classify an example, a k-nearest neighbor (k-nn) classifier uses a majority vote over the k in-sample examples closest to the example. Deriving error bounds for k-nn classifiers is a challenge, because they can have extremely fractured decision boundaries, making approaches based on hypothesis class size ineffective. For general information on k-nn classifiers, see the books by Devroye et al. (1996), Duda et al. (2001), and Hastie et al. (2009).
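As a concrete reference point, the k-nn prediction rule can be sketched in a few lines. This is a generic illustration (Euclidean distance, binary labels in {0, 1}), not code from this paper, and all names are ours:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote over its k nearest in-sample examples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to each example
    nearest = np.argsort(dists)[:k]               # indices of the k nearest
    return int(y_train[nearest].sum() * 2 > k)    # majority label (k odd)

# Tiny usage example: five points on a line, k = 3.
X = np.array([[0.0], [0.1], [0.9], [1.0], [1.1]])
y = np.array([0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.05]), k=3))  # nearest labels 0, 0, 1: prints 0
```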
The error bounds in this paper are probably approximately correct (PAC) bounds, consisting of a range of error rates and an upper bound on the probability that the out-of-sample error rate is outside the range. An effective PAC bound has a small range and a small bound failure probability. PAC error bounds include bounds based on Vapnik–Chervonenkis (VC) dimension (Vapnik and Chervonenkis 1971), bounds for concept learning by Valiant (1984), compression-based bounds by Littlestone and Warmuth (1986), Floyd and Warmuth (1995), Blum and Langford (2003), and Bax (2008), and bounds based on worst likely assignments (Bax and Callejas 2008). Langford (2005) gives an overview and comparison of some types of PAC bounds.
Exponential error bounds have range proportional to \(\sqrt{\ln (1/\delta )}\) as bound failure probability \(\delta \rightarrow 0\). Devroye et al. (1996) (page 414) give k-nn classifier error bounds that are non-exponential (range proportional to \(\sqrt{\frac{1}{\delta }}\) as \(\delta \rightarrow 0\)) and have range \(\hbox {O} \left( (\sqrt{k}/n)^{1/2} \right) \). They state: “Exponential upper bounds ... are typically much harder to obtain.” Then they present an exponential bound by Devroye and Wagner (1979) with range \(\hbox {O} \left( (k/n)^{1/3} \right) \) (Devroye et al. (1996) p. 415, Theorem 24.5). A more recent exponential bound has expected (but not guaranteed) error bound range \(\hbox {O} \left( (k/n)^{2/5} \right) \) (Bax 2012).
The great conundrum of classifier validation is that we want to use data that are independent of the classifier to estimate its error rate, but we also want to use all available data for the classifier. At each step, speculate-correct assumes that this problem does not exist, at least for some of the in-sample data. In subsequent steps, it corrects for its sometimes-false earlier assumptions. As it does this, the number of corrections grows, but the size of each correction shrinks.
To illustrate, for some value \(m \le \frac{1}{2} (n - k)\), let \(V_1\) be the first m and \(V_2\) be the second m in-sample examples. (Call \(V_1\) and \(V_2\) validation subsets.) Let g be the full classifier; our goal is to bound its error rate: \(Pr_{}\left\{ \overline{g}\right\} \). (Use \(Pr_{}\left\{ \right\} \) to indicate probability over out-of-sample examples, and use a bar on top to indicate classifier error.) Let \(g_{{S}}\) be the classifier formed by withholding the data sets indexed by S. For example, \(g_{\{1\}}\) is the classifier formed from all in-sample examples except those in \(V_1\). Then the speculate-correct process is:

1.
Speculate that withholding \(V_1\) does not affect classification: \(\forall x: g = g_{\{1\}}\). Compute \(Pr_{V_1}\left\{ \overline{g_{\{1\}}}\right\} \) as our initial estimate of \(Pr_{}\left\{ \overline{g}\right\} \). (Use \(Pr_{V_i}\left\{ \right\} \) to indicate empirical rate over examples in \(V_i\)—also called an empirical mean.) Split the probabilities by whether the speculation holds:
$$\begin{aligned} Pr_{}\left\{ \overline{g}\right\} = Pr_{}\left\{ g = g_{\{1\}} \wedge \overline{g}\right\} + Pr_{}\left\{ g \not = g_{\{1\}} \wedge \overline{g}\right\} , \end{aligned}$$(1)and
$$\begin{aligned} Pr_{}\left\{ \overline{g_{\{1\}}}\right\} = Pr_{}\left\{ g = g_{\{1\}} \wedge \overline{g_{\{1\}}}\right\} + Pr_{}\left\{ g \not = g_{\{1\}} \wedge \overline{g_{\{1\}}}\right\} . \end{aligned}$$(2)The RHS first terms are equal:
$$\begin{aligned} Pr_{}\left\{ g = g_{\{1\}} \wedge \overline{g}\right\} = Pr_{}\left\{ g = g_{\{1\}} \wedge \overline{g_{\{1\}}}\right\} , \end{aligned}$$(3)since
$$\begin{aligned} (g = g_{\{1\}}) \implies (\overline{g} = \overline{g_{\{1\}}}). \end{aligned}$$(4)So the bias in estimating \(Pr_{}\left\{ \overline{g}\right\} \) by \(Pr_{V_1}\left\{ \overline{g_{\{1\}}}\right\} \) is the other two RHS terms:
$$\begin{aligned} Pr_{}\left\{ \overline{g}\right\} - Pr_{}\left\{ \overline{g_{\{1\}}}\right\} = Pr_{}\left\{ g \not = g_{\{1\}} \wedge \overline{g}\right\} - Pr_{}\left\{ g \not = g_{\{1\}} \wedge \overline{g_{\{1\}}}\right\} . \end{aligned}$$(5)Note that \(g \not = g_{\{1\}}\) (failure of our speculation) is a condition in both bias terms. Also, the bias is in a range bounded by the probability that speculation fails:
$$\begin{aligned} \left[ -Pr_{}\left\{ g \not = g_{\{1\}}\right\} , Pr_{}\left\{ g \not = g_{\{1\}}\right\} \right] . \end{aligned}$$(6)
2.
To correct for bias due to failure of speculation in Step 1, now speculate that \(\forall x: g = g_{\{2\}}\) and \(g_{\{1\}} = g_{\{1,2\}}\), in other words, that removing \(V_2\) does not alter classifications. Then use empirical means over \(V_2\) to correct bias from Step 1:

(a)
Estimate \(Pr_{}\left\{ g \not = g_{\{1\}} \wedge \overline{g}\right\} \) by \(Pr_{V_2}\left\{ g_{\{2\}} \not = g_{\{1,2\}} \wedge \overline{g_{\{2\}}}\right\} \).

(b)
Estimate \(Pr_{}\left\{ g \not = g_{\{1\}} \wedge \overline{g_{\{1\}}}\right\} \) by \(Pr_{V_2}\left\{ g_{\{2\}} \not = g_{\{1,2\}} \wedge \overline{g_{\{1,2\}}}\right\} \).
Consider Estimate 2a. Split the probabilities by whether the new speculation holds:
$$\begin{aligned}&Pr_{}\left\{ g \not = g_{\{1\}} \wedge \overline{g}\right\} \end{aligned}$$(7)$$\begin{aligned}&\quad = Pr_{}\left\{ g \not = g_{\{1\}} \wedge (g = g_{\{2\}} \wedge g_{\{1\}} = g_{\{1,2\}}) \wedge \overline{g}\right\} \end{aligned}$$(8)$$\begin{aligned}&\qquad + Pr_{}\left\{ g \not = g_{\{1\}} \wedge \lnot (g = g_{\{2\}} \wedge g_{\{1\}} = g_{\{1,2\}}) \wedge \overline{g}\right\} , \end{aligned}$$(9)and
$$\begin{aligned}&Pr_{}\left\{ g_{\{2\}} \not = g_{\{1,2\}} \wedge \overline{g_{\{2\}}}\right\} \end{aligned}$$(10)$$\begin{aligned}&\quad = Pr_{}\left\{ g_{\{2\}} \not = g_{\{1,2\}} \wedge (g = g_{\{2\}} \wedge g_{\{1\}} = g_{\{1,2\}}) \wedge \overline{g_{\{2\}}}\right\} \end{aligned}$$(11)$$\begin{aligned}&\qquad + \,Pr_{}\left\{ g_{\{2\}} \not = g_{\{1,2\}} \wedge \lnot (g = g_{\{2\}} \wedge g_{\{1\}} = g_{\{1,2\}}) \wedge \overline{g_{\{2\}}}\right\} . \end{aligned}$$(12)The RHS first terms are equal, because \(g = g_{\{2\}} \wedge g_{\{1\}} = g_{\{1,2\}}\) implies \((g \not = g_{\{1\}}) = (g_{\{2\}} \not = g_{\{1,2\}})\) and \(\overline{g} = \overline{g_{\{2\}}}\). The RHS second terms contribute bias:
$$\begin{aligned}&Pr_{}\left\{ g \not = g_{\{1\}} \wedge \overline{g}\right\} - Pr_{}\left\{ g_{\{2\}} \not = g_{\{1,2\}} \wedge \overline{g_{\{2\}}}\right\} \end{aligned}$$(13)$$\begin{aligned}&\quad = Pr_{}\left\{ g \not = g_{\{1\}} \wedge \lnot (g = g_{\{2\}} \wedge g_{\{1\}} = g_{\{1,2\}}) \wedge \overline{g}\right\} \end{aligned}$$(14)$$\begin{aligned}&\qquad -\, Pr_{}\left\{ g_{\{2\}} \not = g_{\{1,2\}} \wedge \lnot (g = g_{\{2\}} \wedge g_{\{1\}} = g_{\{1,2\}}) \wedge \overline{g_{\{2\}}}\right\} . \end{aligned}$$(15)A similar analysis of Estimate 2b yields two more bias terms. All four bias terms have failure of both the first and second speculations in their conditions. So if we estimate \(Pr_{}\left\{ \overline{g}\right\} \) by
$$\begin{aligned}&Pr_{V_1}\left\{ \overline{g_{\{1\}}}\right\} \end{aligned}$$(16)$$\begin{aligned}&\quad + \,Pr_{V_2}\left\{ g_{\{2\}} \not = g_{\{1,2\}} \wedge \overline{g_{\{2\}}}\right\} - Pr_{V_2}\left\{ g_{\{2\}} \not = g_{\{1,2\}} \wedge \overline{g_{\{1,2\}}}\right\} , \end{aligned}$$(17)the estimate has four bias terms, and lies in a range determined by the probability that both speculations fail:
$$\begin{aligned}&[-2 Pr_{}\left\{ g \not = g_{\{1\}} \wedge \lnot (g = g_{\{2\}} \wedge g_{\{1\}} = g_{\{1,2\}})\right\} , \end{aligned}$$(18)$$\begin{aligned}&2 Pr_{}\left\{ g \not = g_{\{1\}} \wedge \lnot (g = g_{\{2\}} \wedge g_{\{1\}} = g_{\{1,2\}})\right\} ]. \end{aligned}$$(19)
3.
Continuing this for r steps, with r validation subsets \(V_1, \ldots , V_r\), produces a sum of \(2^r - 1\) estimates. All remaining bias depends on simultaneous failure of r speculations, but there are \(2^r\) bias terms. For k-nn, speculation can only fail for Step i if \(V_i\) has an example nearer to x than its kth nearest neighbor among the in-sample examples not in any validation subset. So the bias is at most \(2^r\) times the probability that x has a nearer neighbor in every validation subset than its kth nearest neighbor among the other in-sample examples.
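To make the two-step process concrete, the following sketch computes the \(r=2\) speculate-correct estimate (Expressions 16 and 17) for a k-nn classifier by brute force. It is purely illustrative: the function names are ours, and a practical implementation would reuse neighbor computations rather than reclassify from scratch.

```python
import numpy as np

def knn_vote(X, y, x, k):
    """Majority vote of the k nearest rows of X to x (Euclidean distance)."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return int(y[order].sum() * 2 > k)

def speculate_correct_r2(X, y, m, k):
    """Brute-force sketch of the r = 2 speculate-correct estimate:
    hold-out error of g_{1} over V_1, plus a correction over V_2 using
    g_{2} and g_{1,2} on examples where those two classifiers disagree.
    V_1 and V_2 are the first and second m examples."""
    n = len(y)
    rest = np.arange(2 * m, n)                            # F - V
    keep1 = np.concatenate([np.arange(m, 2 * m), rest])   # withhold V_1
    keep2 = np.concatenate([np.arange(0, m), rest])       # withhold V_2
    # Speculative estimate: empirical error of g_{1} over V_1.
    est = np.mean([knn_vote(X[keep1], y[keep1], X[j], k) != y[j]
                   for j in range(m)])
    # Correction over V_2, restricted to examples where g_{2} != g_{1,2}.
    for j in range(m, 2 * m):
        g2 = knn_vote(X[keep2], y[keep2], X[j], k)
        g12 = knn_vote(X[rest], y[rest], X[j], k)
        if g2 != g12:
            est += (int(g2 != y[j]) - int(g12 != y[j])) / m
    return float(est)
```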
To produce effective error bounds, we must use validation subsets small enough to make the probability of r simultaneous speculation failures small, and yet large enough that the sum of \(2^r - 1\) estimates is likely to have a small deviation from the sum that it estimates. (Using Hoeffding bounds (Hoeffding 1963), the range for the difference between each estimate \(Pr_{V_i}\left\{ \right\} \) and its corresponding out-of-sample probability \(Pr_{}\left\{ \right\} \) is \(\hbox {O} \left( \frac{1}{\sqrt{|V_i|}} \right) \).) We show that an appropriate choice of validation subset size gives error bound range:
and, for a choice of r based on n, the range is
The next section formally introduces the speculate-correct method to produce error bounds for local classifiers. Section 3 applies the method to k-nn classifiers. Section 4 shows how to compute the bounds. Section 5 shows how effective the bounds are for some actual classifiers. Section 6 concludes with potential directions for future work.
Speculate-Correct
Let F be the full set of n in-sample examples (x, y), drawn i.i.d. from a joint input-output distribution D. Inputs x are drawn from an arbitrary domain, and outputs y are drawn from \(\{0,1\}\) (binary classification). Assume there is some ordering of the examples in F, so that we may refer to examples 1 to n in F, treating F as a sequence.
Select \(r>0\) and \(m>0\) such that \(rm \le n - k\). For each \(i \in \{1, \ldots , r\}\), let validation subset \(V_i\) be the ith subset of m examples in F. For example, if \(r = 2\) and \(m = 1000\), then \(V_1\) is the first thousand examples in F and \(V_2\) is the second thousand. Let validation set \(V = V_1 \cup \ldots \cup V_r\). For convenience, define \(R \equiv \{1, \ldots , r\}\). For \(S \subseteq R\), let \(V_S\) be the union of validation subsets indexed by S.
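Under the ordering assumption above, the partition into validation subsets is straightforward to write down. A minimal sketch (0-indexed, names ours):

```python
def validation_subsets(n, r, m):
    """Split indices 0..n-1 (the ordered in-sample examples) into r
    validation subsets of m consecutive examples each, plus the held-in
    remainder F - V. Requires r * m <= n - k for the k of interest
    (not checked here)."""
    V = [list(range(i * m, (i + 1) * m)) for i in range(r)]
    rest = list(range(r * m, n))   # F - V
    return V, rest
```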
Our PAC error bounds have probability of bound failure over draws of F. Let the subscript \(F \sim D^n\) denote a probability or expectation over draws of F. We use no subscript for probabilities or expectations over out-of-sample examples \((x,y) \sim D\). For example,
denotes the out-of-sample error rate of \(g\), and it is the quantity we wish to bound. (It is sometimes called the conditional error rate, because it is the error rate conditioned on a set of in-sample examples F rather than the expected error rate over draws of F.)
Let \(A_i = \{1, \ldots , i\}\). Let \(a_1, \ldots , a_r\) be any series of conditions such that
i.e., \(a_i(x)\) implies that for any classifier formed by withholding any subset of \(\{V_1, \ldots , V_{i-1}\}\), withholding \(V_i\) too does not alter the classification of x.
Let \(b_i = \lnot a_1 \wedge \ldots \wedge \lnot a_i\). Define \(b_0\) to be true. The following theorem generalizes the speculate-correct formula for \(r=2\) that we developed in the previous section.
Theorem 1
Proof
Use induction. The base case is \(r=0\):
Next, to show that the result for r:
implies the result for \(r+1\):
subtract the result for r from the result for \(r+1\). The difference is
We will show that this difference is zero.
Since \(A_{r} = R\), the first and third sums are over the same indices, so combine them:
Expand the first sum’s probabilities around \(a_{r+1}\) values:
The first and third terms cancel, because \(a_{r+1} \implies g_{{(S \cup \{r+1\})}} = g_{{S}}\). The other terms have \(b_{r+1}\), since \(b_{r} \wedge \lnot a_{r+1} = b_{r+1}\). So the difference is:
The first sum cancels the second: for each S in the first sum, the first term cancels the term for \(S \cup \{r+1\}\) in the second sum, and the second term cancels the term for S in the second sum. \(\square \)
The formulation of the error rate in Theorem 1 is useful because the examples in each validation subset \(V_i\) are independent of the conditions in term i of the first sum. So the rates of the conditions over the validation subsets are unbiased estimates of the probabilities of those conditions over out-of-sample examples. There are no such validation data for the second sum. Instead of estimating the second sum, our error bounds bound each of its terms by \(Pr_{}\left\{ b_r\right\} \). We select validation subset sizes to mediate a tradeoff: large validation subsets give tight bounds on terms in the first sum, but small validation subsets make \(Pr_{}\left\{ b_r\right\} \) small.
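To make the first side of this tradeoff concrete: the two-sided Hoeffding radius for an empirical mean over m samples of a [0, 1]-valued quantity shrinks as \(1/\sqrt{m}\). A minimal sketch of the standard formula (not specific to this paper):

```python
import math

def hoeffding_radius(m, delta):
    """Two-sided Hoeffding deviation bound for an empirical mean of m
    i.i.d. [0, 1]-valued samples: with probability at least 1 - delta,
    |empirical mean - true mean| <= sqrt(ln(2/delta) / (2 m))."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))
```

For example, \(m = 1000\) and \(\delta = 0.05\) give a radius of about 0.043, and quadrupling m halves it.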
Error bounds for k-nn classifiers
Before introducing k-nn error bounds, we need a brief aside about tie-breaking. Assume k is odd and assume binary classification, so there are no ties in voting. To break ties over which in-sample examples are nearest neighbors and hence vote, use the method from Devroye and Wagner (1979): assign each example i in F a real value \(Z_i\) drawn uniformly at random from [0, 1] and do the same for each other draw x from the input space to give it a value Z. If the distance from example i in F to x is the same as the distance from example j in F to x, then declare i to be the closer example if \(|Z_i - Z| < |Z_j - Z|\) or if \(|Z_i - Z| = |Z_j - Z|\) and \(i<j\). Otherwise declare example j to be the closer example. This method returns the same ranking of distances to examples in F for the same input x every time the distances are measured, and it uses position within F to break a tie with probability zero.
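The tie-breaking rule can be written as a comparator. The sketch below is our illustrative rendering of the rule just described, with argument names of our choosing:

```python
def closer(i, j, dist, Z, z_x):
    """Return True if in-sample example i is 'closer' than example j to an
    input x, given distances dist[i], dist[j], random marks Z[i], Z[j]
    in [0, 1] for the examples, and mark z_x for x. Ties in distance are
    broken by |Z - z_x|, then by index, as in Devroye and Wagner (1979)."""
    if dist[i] != dist[j]:
        return dist[i] < dist[j]
    di, dj = abs(Z[i] - z_x), abs(Z[j] - z_x)
    if di != dj:
        return di < dj
    return i < j          # index break: a probability-zero event
```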
Now apply the speculate-correct concept to k-nn:
Corollary 2
Let \(a_i(x)\) be the condition that \(V_i\) does not have an example closer to x than the kth nearest neighbor to x in \(F - V\). Let
and
where I() is the indicator function: one if its argument is true and zero otherwise. Then
Proof
Our \(a_i\) for k-nn meet the conditions of Theorem 1. \(\square \)
Next, we show that \(p^*\) is the average of the RHS of Eq. 39 from Corollary 2 over all permutations of the in-sample examples. Permuting the examples places different examples into the validation subsets, because the ith validation subset is the ith set of m examples. We will use permutations to ensure that \(b_1, \ldots , b_r\) are rare enough to provide small error bound ranges.
Without permutations, even \(b_r\) may not be rare. For example, in-sample examples \(m, 2m, \ldots , rm\) may all be close to much of the input distribution, and the other in-sample examples may be far. Without permutations, we can develop a bound, but we can only show that it has a small error bound range in expectation. Permutations guarantee that the expectation is realized. In the next section, we show how to compute permutation-based bounds efficiently.
Lemma 3
Let P be the set of permutations of \(1, \ldots , n\). For each \(\sigma \in P\), let \(\sigma F\) be F permuted according to \(\sigma \): example j of \(\sigma F\) is the example of F indexed by element j of \(\sigma \). Let \(f_{i,\sigma }\) be \(f_i\), but with F replaced by \(\sigma F\), so that for \(i \in R\), \(V_i\) consists of the ith set of m examples in \(\sigma F\). Then
Proof
Corollary 2 holds for each partition of F into r size-m subsets \(V_1, \ldots , V_r\) and \(F - V\). Each permutation of F uses one of these partitions to define \(f_{i,\sigma }\). So the outer expectation is over quantities that are each \(p^*\). \(\square \)
We will use two more lemmas to form a bound based on permutations:
Lemma 4
For any set of permutations \(P'\),
For \(i=1\),
Proof
For \(i>1\),
Since \(f_{i,\sigma }\) is a sum of \(2^{i-1}\) terms, with half 0 or 1 and half 0 or \(-1\),
So
For \(i=1\), note that \(f_{i,\sigma }\) has a single term, with value zero or one. \(\square \)
Lemma 5
Let \(P'\) be a set of permutations such that for each \(\sigma \in P'\), positions \(1, \ldots , im, rm+1, \ldots , n\) of the permutations in \(P'\) contain all permutations of entries in those positions in \(\sigma \), equally many times. Then
Proof
The LHS is the probability that a random permutation in \(P'\) places at least one example in each of \(V_1, \ldots , V_{i-1}\) that is closer to (x, y) than the kth closest example to (x, y) in \(F - V\). Since determining positions in a random draw from \(P'\) is equivalent to drawing positions at random without replacement, the LHS is the probability of drawing at least one element from each set \(\{1, \ldots , m\}, \ldots , \{(i-2)m + 1, \ldots , (i-1)m\}\) before drawing k elements from \(\{rm+1, \ldots , n\}\).
The probability of drawing k elements from \(\{rm+1, \ldots , n\}\) before drawing any from one specific set in \(\{1, \ldots , m\}, \ldots , \{(i-2)m + 1, \ldots , (i-1)m\}\) is
Similarly, the probability of drawing k elements from \(\{rm+1, \ldots , n\}\) before drawing any elements from any specific h of the \(i-1\) sets in \(\{1, \ldots , m\}, \ldots , \{(i-2)m + 1, \ldots , (i-1)m\}\) is
So, by inclusion and exclusion, the probability of drawing at least one element from every set in \(\{1, \ldots , m\}, \ldots , \{(i-2)m + 1, \ldots , (i-1)m\}\) before drawing k examples from \(\{rm+1, \ldots , n\}\) is:
\(\square \)
Note that \(i = r + 1\) and \(P' = P\) gives a result for \(Pr_{\sigma \in P}\left\{ b_{r} \mid \sigma \right\} \).
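Since the expressions above come from inclusion and exclusion over orderings, a Monte Carlo check is easy to write. The sketch below estimates the Lemma 5 quantity directly from random orderings; it is only an illustration (names ours), not part of the bound computation:

```python
import random

def prob_hit_all_before_k(i, m, r, n, k, trials=20000, seed=0):
    """Estimate the probability of drawing at least one element from each of
    the blocks {1..m}, ..., {(i-2)m+1..(i-1)m} before drawing k elements
    from {rm+1..n}, when positions 1..im and rm+1..n are ordered uniformly
    at random (drawing without replacement)."""
    rng = random.Random(seed)
    pool = list(range(1, i * m + 1)) + list(range(r * m + 1, n + 1))
    hits = 0
    for _ in range(trials):
        rng.shuffle(pool)
        seen = set()
        tail = 0
        for v in pool:
            if v > r * m:
                tail += 1
                if tail == k:       # k-th tail element drawn; stop
                    break
            else:
                seen.add((v - 1) // m)   # which block v belongs to
        if all(b in seen for b in range(i - 1)):
            hits += 1
    return hits / trials
```

For example, with \(i=2\), \(m=2\), \(r=2\), \(n=8\), \(k=1\), the exact probability is 1/3 (the chance that one of the two block-one elements precedes all four tail elements), and the estimate lands near it.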
We need another lemma for the bound. This one is about averaging bounds on differences between empirical means and actual means.
Lemma 6
Suppose there is a finite set of distributions, each with range size (difference between maximum and minimum values in support) at most s. For each distribution i in the set, let \(\mu _i\) be the mean, and let \(\hat{\mu }_i\) be an empirical mean based on m i.i.d. samples from distribution i. Let \(E_{i}\left\{ \right\} \) denote expectation over the distributions. Then
where
Specifically, for \(c = 3\),
Proof
For the proof, refer to Bax and Kooti (2016), page 3, Inequalities 8 and 9. We use \(\frac{2}{\delta }\) in place of the \(\frac{1}{\delta }\) found there, because we have twosided bounds. \(\square \)
This lemma offers bounds on differences between averages of means and averages of estimates that are similar to the Hoeffding bound (Hoeffding 1963) on the difference between a single mean and estimate:
Now separate the RHS of Eq. 40 into terms with \(i \in R\) and a term with \(i=r+1\):
To develop an error bound, we will estimate \(p_I\) with empirical means over samples and bound \(p_{II}\). For each \(\sigma \in P\) and \(i \in R\), the examples in \(V_i \mid \sigma \) are independent of the function \(f_{i,\sigma }\), so we can use empirical means over \((x,y) \in V_i \mid \sigma \) to estimate means over \((x,y) \sim D\). First, we rewrite \(p_I\) in a form that allows estimation by empirical means over permutations, in the following lemma.
Lemma 7
\(\forall i \in R\), let \(|V_i| = m\). Let M be the set of size-m subsets of F: \(M = \{Q \mid Q \subseteq F \wedge |Q| = m\}\). Let P(Q, i) be the set of permutations of \(1, \ldots , n\) that have set Q as validation subset \(V_i\) in \(\sigma F\): \(P(Q,i) = \{\sigma \mid (V_i \mid \sigma )=Q\}\). Then
Proof
Compare the definition of \(p_I\) (the first term on the RHS of Eq. 54) to Eq. 56. The definition averages over permutations in P and \(i \in R\). In Eq. 56, the expectation over \(Q \in M\), \(i \in R\), and P(Q, i) covers all permutations P and \(i \in R\), each with equal frequency. \(\square \)
Now we develop an error bound.
Theorem 8
Let
and
Then
where
and
Proof
Note that
We will show that
and that \(p_{II} \le \epsilon _{II}\). (The formula for \(Pr_{\sigma \in P}\left\{ b_i \mid \sigma \right\} \), to specify values for \(Pr_{\sigma \in P}\left\{ b_{i-1} \mid \sigma \right\} \) in \(\epsilon _I\) and \(Pr_{\sigma \in P}\left\{ b_{r} \mid \sigma \right\} \) in \(\epsilon _{II}\), follows directly from Lemma 5.)
To prove Inequality 64, let
Then
since this is Inequality 56 from Lemma 7, with a different order of expectations.
For each \(Q \in M\), we will use each \(\hat{p}_Q\) to bound each \(p_Q\), using the fact that the examples in Q are independent of \(p_Q\). First, we need to bound the range of terms in the expectations \(p_Q\) and \(\hat{p}_Q\):
By Lemma 4,
Also, by Lemma 5
since both P(Q, i) and P meet the conditions for \(P'\) in Lemma 5. So
For \(i=1\), \(2^{i-1} Pr_{\sigma \in P}\left\{ b_{i-1} \mid \sigma \right\} = 1\). So the range is at most
Note that \(\hat{p}_Q\) is an average of this term (Expression 71) over \(|Q| = m\) i.i.d. (x, y) samples that are independent of the term (since \(Q = V_i \mid \sigma \) for \(\sigma \in P(Q,i)\) and the definition of \(f_{i,\sigma }\) depends only on \(V_1, \ldots , V_{i-1}, F - V \mid \sigma \).) Since \(p_I\) is the expectation of a finite set of means \(p_Q\) and \(\hat{p}_{I}\) is the expectation of corresponding empirical means \(\hat{p}_Q\), we can apply Lemma 6 to prove Inequality 64, showing that \(\hat{p}_{I}\) is an \(\epsilon _I\)-range estimate of \(p_I\).
For \(p_{II}\) and \(\epsilon _{II}\), apply Lemma 4:
\(\square \)
We will use the following lemma to prove results about the size of the error bound range \(\epsilon _I + \epsilon _{II}\).
Lemma 9
Proof
Define \(d_i(x) \mid \sigma \) to be the condition that the \(k+i-1\) nearest neighbors to x in F include at least i examples from \(V_1 \cup \ldots \cup V_i \mid \sigma \). Condition \(d_i \mid \sigma \) is a necessary condition for \(b_i \mid \sigma \), so
The probability of \(d_i \mid \sigma \) over \(\sigma \in P\) is the same as the probability of drawing \(k+i-1\) samples from \(1, \ldots , im, rm+1, \ldots , n\) uniformly without replacement and having at least i of those samples have values im or less. (The samples are the indices in \(\sigma F\) of the \(k+i-1\) nearest neighbors to x from positions \(1, \ldots , im, rm+1, \ldots , n\) in \(\sigma F\).) So the probability of \(d_i\) is the tail of a hypergeometric distribution:
Using a hypergeometric tail bound from Chvátal (1979) (see also Skala (2013)), this is
\(\square \)
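The hypergeometric tail in Lemma 9 can be evaluated exactly for small parameters, which is useful for sanity-checking the Chvátal-style bound (whose constants we do not reproduce here). A minimal sketch, with names ours:

```python
from math import comb

def hypergeom_tail(N, K, t, i):
    """P(X >= i) where X counts 'marked' items among t draws without
    replacement from a population of N items, K of them marked."""
    return sum(comb(K, j) * comb(N - K, t - j)
               for j in range(i, min(t, K) + 1)) / comb(N, t)

# In the Lemma 9 setting, the population is the i*m + (n - r*m) candidate
# positions, the marked items are the i*m validation positions, and we draw
# the k+i-1 nearest-neighbor positions, asking for at least i marks:
# hypergeom_tail(i*m + n - r*m, i*m, k + i - 1, i)
```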
Corollary 10
(of Theorem 8)
Proof
Recall that
By Lemma 9,
Apply the well-known identity for a sum of powers: \(1 + z + z^2 + \cdots + z^{r-1} = \frac{1-z^r}{1-z}\), with \(z = \frac{2e(k+r-2)m}{n}\):
So
For \(\epsilon _{II}\), apply Lemma 9:
\(\square \)
The following theorem and corollary are the main results for k-nn classifiers. The theorem allows r, k, and \(\delta \) to depend on the number of in-sample examples, n. The corollary uses the bound from the theorem with an appropriate growth rate for r as n increases.
Theorem 11
with
Proof
Let \(\epsilon _{r} = \epsilon _I + \epsilon _{II}\) in Theorem 8. Use Corollary 10 for \(\epsilon _I + \epsilon _{II}\). Select validation subset sizes m to balance \(\epsilon _I\) and \(\epsilon _{II}\):
Then
and
If we allow for the possibility of k and r growing with n, then
\(\square \)
Corollary 12
For a choice of r based on n,
with
Proof
If we set \(r = \lceil \frac{1}{4}(\ln n - 2) \rceil \), then
So
\(\square \)
An alternative proof of Corollary 12 uses a different value for m:
Proof
Let
Then
For \(\epsilon _I\), note that
Substitute the RHS for m in Inequality 86:
Let \(r = \lceil \ln \sqrt{n} \rceil \). Then
and
\(\square \)
Computation
It would be infeasible to compute the error bounds developed in this paper directly from their definitions. Instead, we can sample the bound terms to produce a bound. In this section, we outline a sampling procedure that requires O(\(n (\ln n)^2\)) computation (in addition to identifying up to \(k+r-1\) nearest neighbors in F for each example in F) and produces a bound with range \(\hbox {O} \left( \sqrt{(k + \ln n)/n} \right) \).
Note that
(For reference, \(\hat{p}_{I}\) is defined in Eqs. 57 and 58 of Theorem 8.) Let \(P((x,y), i) = \{\sigma \mid (x,y) \in (V_i \mid \sigma )\}\). Reordering expectations,
Rewrite \(f_{i,\sigma }\) as the expectation of its terms:
Estimate \(\hat{p}_{I}\) as defined in the previous two equations by taking an empirical mean over s random samples:
with (x, y) drawn uniformly at random from F, i uniformly at random from R, \(\sigma \) uniformly at random from P((x, y), i), and S uniformly at random from the power set of \(A_{i1}\). Each sample value is
Let \(p'_I\) be the empirical mean of these samples.
Computing values for samples in \(p'_I\) need not involve drawing complete permutations \(\sigma \). Instead, randomly determine set membership in \(F - V \mid \sigma \), \(V_i \mid \sigma \), ..., or \(V_r \mid \sigma \) for neighbors of the sample (x, y) and tabulate votes to determine \(I(\overline{g_{{(S \cup \{i\})}}} \mid \sigma )\), proceeding one neighbor at a time until the kth neighbor from \(F - V \mid \sigma \) is identified, as follows.
Let \(N_0(x) = \{(x,y)\}\). Let \(N_j(x)\) be (x, y) and the j nearest neighbors to (x, y) in F. At each step, let \(f = |(F - V \mid \sigma ) \cap N_j(x)|\). For each \(i \in R\), let \(v_{i} = |V_i \cap N_j(x)|\). Let b be the number of voting neighbors (b for “ballots”) among the j nearest neighbors to (x, y): \(b = |((F - V_i) - \cup _{h \in S} V_h \mid \sigma ) \cap N_j(x)|\). Let d be the number of those voters that have different labels than y.
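Under our illustrative assumptions, this per-sample walk (including the initialization, update, and stopping rules described next) can be sketched as follows. The subset-assignment probabilities mimic a uniform random permutation's marginals; the final scaling of the error indicator by the sample-value formula is not reproduced here, and all names are ours:

```python
import random

def neighbor_walk(neighbor_labels, y, counts, i, S, k, w, rng):
    """Walk the neighbors of a validation example (x, y) in order of
    distance. `neighbor_labels` lists their labels; `counts` maps each
    subset name ('rest' for F - V, or 1..r for V_1..V_r) to its remaining
    slots. Each neighbor joins a subset with probability proportional to
    remaining slots. Ballots are cast (while fewer than k) by neighbors in
    F - V or in V_h with h != i, h not in S. Return the error indicator of
    g_{S union {i}} at the k-th F - V neighbor, or None if truncated after
    w assignments (the modified procedure then scores the sample as 0)."""
    counts = dict(counts)       # local copy; decremented as slots are used
    f = b = d = 0               # F - V neighbors seen, ballots, disagreements
    for j, label in enumerate(neighbor_labels, start=1):
        names = [s for s in counts if counts[s] > 0]
        pick = rng.choices(names, weights=[counts[s] for s in names])[0]
        counts[pick] -= 1
        if b < k and (pick == 'rest' or (pick != i and pick not in S)):
            b += 1                       # another ballot is cast
            if label != y:
                d += 1                   # another disagreeing vote
        if pick == 'rest':
            f += 1
            if f == k:                   # k-th F - V neighbor identified
                return int(2 * d > k)    # error iff disagreements are a majority
        if j == w:
            return None                  # truncated case
    return None
```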
Initially, \(j=0\), \(f = 0\), \(v_{i} = 1\), \(\forall h \not = i: v_{h} = 0\), \(b = 0\), and \(d = 0\). Then, for each j starting with \(j=1\), select a set for the jth nearest neighbor at random and increment its counter:
If \(b<k\) (fewer than k votes cast) and the set is \(F - V \mid \sigma \) or \(V_h \mid \sigma \) for \(h \not = i\) and \(h \not \in S\), then \(b := b+1\) (another ballot is cast) and if the label of the jth nearest neighbor is not equal to y, then \(d:=d+1\) (another disagreeing vote). Stop when \(f=k\), and return the sample value:
This method may require up to O(\(rm+k\)) computation per sample, because it is possible (though extremely unlikely) for an example to have all validation examples in \(V \mid \sigma \) as nearer neighbors than the kth nearest neighbor from \(F - V \mid \sigma \). To reduce worst-case computation, select a value \(w > k\), stop computation for a sample if w neighbors are assigned to subsets \(V_1, \ldots , V_r, F - V \mid \sigma \) before the kth neighbor is assigned to \(F - V \mid \sigma \) (that is, if \(f<k\) and \(j = w\)), and return zero as the value for the sample. Then only the w nearest neighbors to each example need to be found, and the remaining computation is \(\hbox {O} \left( w \right) \) per sample. Call this the modified sampling procedure. We can use it as the basis for a bound that is feasible to compute:
Theorem 13
Let \(\hat{p}_s\) be the empirical mean of s i.i.d. samples of the modified sampling procedure. Let \(\hat{s}\) represent the modified sampling procedure. Then
where
and
where
Proof
Let \(p_c\) be \(p_I\), but with terms set to zero if \(\sigma \) places fewer than k of the w nearest neighbors to x from F into \(F - V \mid \sigma \). That is, \(p_c\) is \(p_I\), but with terms set to zero if they are set to zero in modified sampling. Then
Let \(\hat{p}_c\) be \(\hat{p}_{I}\), but with terms set to zero if they are set to zero in modified sampling. Then \(\hat{p}_c\) is an unbiased empiricalmean estimate of \(p_c\), and
So
To prove the theorem, we will show results for \(\epsilon _v\), \(\epsilon _r\), \(\epsilon _c\), and \(\epsilon _s\):
and
For \(\epsilon _v\), \(p_c\) and \(\hat{p}_c\) are \(p_I\) and \(\hat{p}_{I}\), with some \(f_{i,\sigma }\) set to zero, which can only reduce the ranges of the terms in \(p_Q\) and \(\hat{p}_Q\). So the result from Theorem 8 (Inequality 64):
also applies with \(p_c\) and \(\hat{p}_c\) in place of \(p_I\) and \(\hat{p}_{I}\). We use a probability of bound failure \(\frac{9}{10} \delta \) in place of \(\delta \), so \(\epsilon _v\) is \(\epsilon _I\), with \(\frac{9}{10} \delta \) in place of \(\delta \). (We preserve the other \(\frac{1}{10} \delta \) probability of bound failure for using \(\hat{p}_s\) to estimate \(\hat{p}_c\).) So
For \(\epsilon _r\), \(\epsilon _r\) = \(\epsilon _{II}\), so apply Inequality 74 directly:
For \(\epsilon _c\), we need the probability that fewer than k of the nearest w neighbors to an \((x,y) \in V \mid \sigma \) from \(F - (x,y)\) are in \(F - V \mid \sigma \). The probability that the i nearest neighbors are in \(F - V \mid \sigma \) is
Given this, the probability that the next \(w-i\) nearest neighbors are in \(V - (x,y) \mid \sigma \) is
So take the product of these two products. There are \(\binom{w}{i}\) different ways to choose positions for the i neighbors in \(F - V \mid \sigma \) among the first w neighbors. Each set of positions has equal probability. So multiply by \(\binom{w}{i}\). Sum over \(i<k\) and multiply by the maximum term value to get \(\epsilon _c\). So
For \(\epsilon _s\), apply a result from Maurer and Pontil (2009) (page 2, Theorem 3) derived from Bennett’s Inequality (Bennett 1962), on the difference between the mean \(\mu \) of a distribution and an empirical mean \(\hat{\mu }\) over s samples drawn i.i.d. according to the distribution:
where
q is the range of the distribution, and v is any upper bound on the variance of the distribution. The result is stronger than Hoeffding bounds when sample variance is small relative to the range. To get \(\epsilon = \epsilon _s\), apply this inequality with \(\mu = \hat{p}_c\), \(\hat{\mu } = \hat{p}_s\), \(\delta \) set to \(\frac{1}{10} \delta \), \(q = r 2^r\), and v set to an upper bound on the expectation of the square of sample values. To get such an upper bound, start with the sample values:
drop \(I( \overline{g_{{(S \cup \{i\})}}} \mid \sigma )\), and take the expectation of the square. \(\square \)
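The shape of \(\epsilon _s\) can be illustrated with the standard Bernstein deviation radius; we emphasize this is only the generic form, not the exact constants of the Maurer and Pontil (2009) result used in the proof:

```python
import math

def bernstein_radius(v, q, s, delta):
    """Two-sided Bernstein-style deviation radius for an empirical mean of
    s i.i.d. samples with range q and variance at most v: with probability
    at least 1 - delta, |empirical mean - mean| is at most
    sqrt(2 v ln(2/delta) / s) + 2 q ln(2/delta) / (3 s)."""
    L = math.log(2.0 / delta)
    return math.sqrt(2.0 * v * L / s) + 2.0 * q * L / (3.0 * s)
```

When v is much smaller than \(q^2\), the variance term dominates and the radius beats the corresponding Hoeffding radius \(q \sqrt{\ln (2/\delta )/(2s)}\), which is why a variance bound is worth having here.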
Similar to the result from Theorem 8:
Theorem 14
For some choices of m, r, w, and s,
Proof
Apply the alternative proof of Corollary 12, with \(\epsilon _v\) and \(\epsilon _r\) in place of \(\epsilon _I\) and \(\epsilon _{II}\), keeping \(r = \lceil \ln \sqrt{n} \rceil \), but with 3 in place of 2 in the value for m from Equality 98. Then
and the bound on \(\epsilon _{II}\) in the alternative proof becomes:
Changing from 2 to 3 in m only affects constants in the alternative proof’s result for \(\epsilon _I\), so:
For \(\epsilon _c\), recall that it is the maximum (absolute) value of a term times the probability that a random \(\sigma \in P\) places more than \(w-k\) of the w closest neighbors to an \((x,y) \in V \mid \sigma \) from \(F - (x,y)\) into \(V - (x,y) \mid \sigma \). In Lemma 9, \(d_i(x) \mid \sigma \) is the condition that the closest \(k+i-1\) neighbors in F include at least i in \(V \mid \sigma \). So apply the result from Lemma 9:
with \(k+i-1 = w\) and \(i = w - k +1\). (Using \(F - (x,y)\) in place of F and \(V - (x,y) \mid \sigma \) in place of \(V \mid \sigma \) can only decrease the probability, so just use m and n as in the lemma for this bound.) Let \(w=k+r-1\). Then the RHS is
Since each term is less than \(r 2^r\),
Using Inequality 136,
Note that \(r < 1.5^r\). (To prove it, show that \(r^{1/r} < 1.5\) by setting the derivative of \(\ln (r^{1/r}) = (1 / r) \ln r\) to zero. Solve: \(r=e\). So \(\max r^{1/r} = e^{1/e} < 1.5\).) Then
With \(r = \lceil \ln \sqrt{n} \rceil \),
We can make \(\epsilon _s\) arbitrarily small by increasing s. How many samples we need to achieve a given \(\epsilon _s\) depends on v. To bound v, note that \(b_{i-1}\) has probability at most
according to Lemma 9. So
So set
to ensure \(v \le r^2\). (The value for m in Equality 135 meets this condition.) Let \(s=rn\). Then
With \(r = \lceil \ln \sqrt{n} \rceil \),
So
\(\square \)
Appendix A presents methods to compute rather than estimate \(\hat{p}_{I}\) or \(\hat{p}_c\). The method to compute \(\hat{p}_c\) requires O(\(n \ln n\)) computation, like sampling, but it requires O(\((\ln n)^4\)) space, and it is more complicated than sampling.
Tests
To apply the error bound method from the previous section to some actual classifiers, we use three randomly generated in-sample data sets of different sizes. Each example input is drawn uniformly at random from \([-1, 1]^3\), and the label is set to one if an even number of coordinates are negative and to zero otherwise, giving each octant of the cube a different label than the octants bordering its faces. Then, to add some noise, with probability \(\frac{1}{10}\) the label is flipped: from one to zero or zero to one.
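As a concrete sketch, the generation procedure can be written as follows (the function and parameter names are ours, for illustration only):

```python
import random

def make_dataset(n, noise=0.1, seed=0):
    """Draw n examples: x uniform in [-1, 1]^3, label 1 iff an even
    number of coordinates are negative, then flip with prob. `noise`."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        y = 1 if sum(c < 0 for c in x) % 2 == 0 else 0
        if rng.random() < noise:
            y = 1 - y  # label noise
        data.append((x, y))
    return data

data = make_dataset(1000)
```

With the parity rule, each octant's clean label differs from the three octants that share one of its faces, so the decision boundary is maximally fractured across the cube.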
In this section, we compute bounds for specific values of n, k, and \(\delta \) rather than prove asymptotic results. So we use a tighter version of \(\epsilon _v\), instead of the more easily analyzed, but looser, form in Theorem 13. We use the dynamic programming procedure from an appendix of Bax and Kooti (2016), with their parameter \(t = 3\). The resulting bounds are about 1.5 times the corresponding Hoeffding bounds and about half the bounds using the value for \(\epsilon _v\) from the previous section. Call the optimized version \(\epsilon '_v\).
For each bound we use \(\epsilon _r\), \(\epsilon _c\), and \(\epsilon _s\) and the estimation procedure from Theorem 13. We set \(w = 29\), \(s = 10\) million, and \(\delta = 0.05\). With \(w = 29\), \(\epsilon _c < 0.000005\) for all bounds we computed, so we do not show it in the tables. (It would be displayed as 0.00000 for all entries.)
Tables 1, 2, and 3 show error bounds for the three data sets, with \(n = 20{,}000\), \(n = 50{,}000\), and \(n = 100{,}000\), respectively. For each data set and \(k \in \{3, 5, 7, 9, 11, 13\}\), we minimize \(\epsilon '_v + \epsilon _r + \epsilon _c + \epsilon _s\) over \((r, m) \in \{1, 2, \ldots , 10\} \times \{0.001 n, 0.002 n, \ldots , 0.099 n\}\). With the minimizing (r, m), we then use the data to perform the sampling procedure to compute \(\hat{p}_s\).
For each k, each table shows the minimizing values of r and m, the values of \(\epsilon '_v\), \(\epsilon _r\), \(\epsilon _c\), and \(\epsilon _s\), the sample-based estimate \(\hat{p}_s\), and the sum \(\epsilon = \epsilon '_v + \epsilon _r + \epsilon _c + \epsilon _s\). The bound is \(\hat{p}_s \pm \epsilon \). For comparison, we also include an estimate of the out-of-sample error rate \(p^*\), which is the average error over 10 million out-of-sample examples drawn i.i.d. from the same distribution as the in-sample examples.
Overall, the error estimates \(\hat{p}_s\) are close to the estimated out-of-sample error rates \(p^*\), with mean absolute differences \(0.29\%\) for \(n = 20{,}000\), \(0.12\%\) for \(n = 50{,}000\), and \(0.07\%\) for \(n = 100{,}000\). The error bound ranges \(\epsilon \) run from about \(5\%\) to about \(17.5\%\), growing with k and shrinking as the number of in-sample examples increases. The optimal r value is 4 for \(k=3\) and \(n=20{,}000\) and 5 for the other bounds. This shows that moving beyond the previous \(r=2\)-style bounds (Bax 2012) can strengthen bounds, even for moderate numbers of in-sample examples.
In general, \(\epsilon '_v\) is the main contributor to error bound range \(\epsilon \), with \(\epsilon _r\) and \(\epsilon _s\) contributing less than \(0.4\%\) in every case. The small contributions from \(\epsilon _r\) may seem surprising, since the choice of m mediates a tradeoff between \(\epsilon '_v\) and \(\epsilon _r\). However, increasing m decreases \(\epsilon '_v\) slowly (approximately \(\hbox {O} \left( 1/\sqrt{m} \right) \)) but increases \(\epsilon _r\) quickly (as \(\hbox {O} \left( m^r \right) \)). Optimizing m means balancing the derivatives with respect to m of \(\epsilon '_v\) and \(\epsilon _r\), not their values, and this occurs at a large value of \(\epsilon '_v\) relative to \(\epsilon _r\).
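To illustrate, a toy version of this tradeoff (using stand-in scalings \(\epsilon '_v \approx a/\sqrt{m}\) and \(\epsilon _r \approx b\,m^r\), our own simplification, not the paper's exact bounds) shows that the minimizer of the sum balances derivatives and leaves \(\epsilon '_v\) about \(2r\) times \(\epsilon _r\):

```python
# Toy illustration: with eps_v(m) ~ a/sqrt(m) and eps_r(m) ~ b*m**r,
# setting the derivative of the sum to zero gives eps_r = eps_v / (2r),
# so eps_v dominates at the optimal m.
a, b, r = 1.0, 1e-12, 5

def total(m):
    return a / m ** 0.5 + b * m ** r

m_star = min(range(1, 10000), key=total)   # grid search over m
ev, er = a / m_star ** 0.5, b * m_star ** r
assert er < ev / 5  # eps_r is a small fraction of eps_v at the optimum
```

Here the continuous optimum is \(m = 100\), where \(\epsilon '_v = 0.1\) and \(\epsilon _r = 0.01 = \epsilon '_v/(2r)\), matching the pattern in the tables.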
The optimal values of m are small relative to n. They range from about \(1\%\) of the data to about \(3\%\). With \(r=5\), that is about \(5\%\) to \(15\%\) of the in-sample examples in V. The fraction shrinks as k increases, because m must be smaller to avoid having a large probability of all validation data sets having examples closer to a random input than the k closest examples in \(F-V\)—to avoid a large \(\epsilon _r\). The small values of m still produce moderately small \(\epsilon '_v\), because we are using a bound for a single estimate rather than a uniform bound over error estimates for a large class of classifiers as is the case for traditional VC-style bounds (Vapnik and Chervonenkis 1971).
Conclusion
We have shown that k-nearest neighbor classifiers have exponential PAC error bounds with
error bound ranges. The bounds are quite general. They apply to any type of inputs, because they are based on probability rather than geometry. As a result, they have no terms that increase with the number of dimensions or other properties of the input space. The bounds do not require the k-nn classifier’s method to compute distances among examples to be symmetric or to obey the triangle inequality—it need not be a metric in the mathematical sense. It can be any function on two example inputs that returns a number.
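For example, a k-nn vote runs unchanged with an arbitrary, even asymmetric, dissimilarity function; a minimal sketch (all names are ours, for illustration):

```python
# A k-nn vote using an arbitrary dissimilarity function -- here an
# asymmetric one -- to illustrate that no metric properties are needed.
def knn_predict(query, examples, k, dissim):
    """Majority vote over the k examples with smallest dissim(query, x)."""
    nearest = sorted(examples, key=lambda ex: dissim(query, ex[0]))[:k]
    votes = sum(y for _, y in nearest)
    return 1 if 2 * votes > k else 0

# Not symmetric, no triangle inequality: just a function of two inputs.
def asym(a, b):
    return (a - b) ** 2 if a < b else abs(a - b)

examples = [(0.1, 0), (0.2, 0), (0.9, 1), (1.0, 1), (1.1, 1)]
assert knn_predict(1.05, examples, 3, asym) == 1
assert knn_predict(0.15, examples, 3, asym) == 0
```

Nothing in the vote uses symmetry or the triangle inequality, which is why the bounds carry over to any such pair function.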
We average bounds over all choices of validation subsets so that we can prove the resulting bound has a small range. If, instead, we use a single random choice of validation subsets, then we can also produce an exponential PAC error bound. To do this, use each validation subset \(V_i\) to validate \(f_i()\), and use a random subset of the remaining in-sample examples to validate the rate of all validation subsets having a neighbor closer to an input than the \(k\)th nearest neighbors among the other in-sample examples. (In a transductive setting (Vapnik 1998), or if unlabeled inputs are otherwise available, use them for this validation.) This bound has \(\hbox {O} \left( \sqrt{(k + \ln n)/n} \right) \) range in expectation. We average over choices of validation subsets to guarantee that we realize the expectation.
We use bounds on \(Pr_{\sigma \in P}\left\{ b_{i-1}\sigma \right\} \) to bound the range of the random variables in \(\epsilon _v\) and to bound the variance in \(\epsilon _s\). If the classifier is accurate (and the votes are mostly not near-ties), then \(I( \overline{g_{{(S \cup \{i\})}}}\sigma )\) tends to be zero for a large portion of \((S, i, \sigma )\). So the variance among terms in \(\hat{p}_{I}\) and among terms in \(\hat{p}_s\) tends to be very small. In those cases, using empirical Bernstein bounds (Audibert 2004), such as those by Maurer and Pontil (2009) (Theorem 3, page 2), can significantly shrink \(\epsilon _v\) and \(\epsilon _s\), because those bounds scale with \(\sqrt{\hat{v}/m}\), where \(\hat{v}\) is the sample variance. To shrink the variance in those cases, tighten \(a_i\) to be the RHS of Expression 23. We can use the resulting definition of \(b_{i-1}\) to validate terms in \(p_I\), but we still need to use \(b_r\) as defined in this paper to bound \(p_{II}\), keeping \(\epsilon _{II}\) and \(\epsilon _r\) the same.
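As a sketch of how such a bound behaves, the following implements a Maurer–Pontil style radius \(\sqrt{2 \hat{v} \ln (2/\delta )/m} + 7 \ln (2/\delta )/(3(m-1))\) for values in [0, 1] (our transcription of their result; treat the exact constants as assumptions) and compares it to a Hoeffding radius on a low-variance sample:

```python
# Empirical Bernstein radius in the style of Maurer and Pontil (2009),
# for i.i.d. values in [0, 1]; v is the unbiased sample variance.
# Constants transcribed from their paper -- treat as assumptions.
import math

def empirical_bernstein_radius(xs, delta):
    n = len(xs)
    mean = sum(xs) / n
    v = sum((x - mean) ** 2 for x in xs) / (n - 1)  # unbiased sample variance
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * v * log_term / n) + 7.0 * log_term / (3.0 * (n - 1))

def hoeffding_radius(n, delta):
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# With mostly-zero samples (low variance), Bernstein beats Hoeffding.
xs = [0.0] * 9900 + [1.0] * 100  # error indicators, rate 1%
assert empirical_bernstein_radius(xs, 0.05) < hoeffding_radius(len(xs), 0.05)
```

The advantage grows as the sample variance shrinks, which is exactly the accurate-classifier regime described above.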
We showed how to use sampling to “estimate the estimates” of the error bounds. We also showed (in the appendix) an efficient, but more complex and space-consuming, method to compute an estimate. It may be possible to improve or simplify that procedure by gathering terms in a different way. In the future, it would be interesting to explore how close the estimate developed in this paper tends to be to the actual error rate for practical problems, and whether it tends to outperform the leave-one-out estimate.
It would be interesting to extend the k-nearest neighbor error bounds from this paper to cover selection of a distance metric from a parameterized set of “hypothesis” metrics (Kedem et al. 2012). One approach might be to use uniform bounds of the type derived in this paper over the class of potential metrics. The bounds might depend on some notion of the complexity of the class of potential metrics.
Finally, it would be interesting to apply the speculate-correct technique from this paper to derive error bounds for classifiers other than nearest neighbors. Other local classifiers include some collective classifiers (Sen et al. 2008; Macskassy and Provost 2007), such as network classifiers based only on neighbors or neighbors of neighbors in a graph. (For some background on error bounds for network classifiers, refer to London et al. (2012), Li et al. (2012) and Bax et al. (2013).) It may also be possible to apply the speculate-correct method to other types of classifiers that are typically based on small subsets of the in-sample examples, such as support vector machines (Vapnik 1998; Cristianini and Shawe-Taylor 2000; Joachims 2002) and set-covering machines (Marchand and Shawe-Taylor 2001).
References
Audibert, J. Y. (2004). PAC-Bayesian statistical learning theory. Ph.D. thesis, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 and Paris 7. http://cermis.enpc.fr/~audibert/ThesePack.zip.
Bax, E. (2008). Nearly uniform validation improves compressionbased error bounds. Journal of Machine Learning Research, 9, 1741–1755.
Bax, E. (2012). Validation of \(k\)-nearest neighbor classifiers. IEEE Transactions on Information Theory, 58(5), 3225–3234.
Bax, E., & Callejas, A. (2008). An error bound based on a worst likely assignment. Journal of Machine Learning Research, 9, 581–613.
Bax, E., Li, J., Sonmez, A., & Cataltepe, Z. (2013). Validating collective classification using cohorts. In NIPS workshop on frontiers of network analysis: methods, models, and applications.
Bax, E., & Kooti, F. (2016). Ensemble validation: Selectivity has a price, but variety is free (pg. 3, Inequalities 8 and 9). Baylearn 2016. https://arxiv.org/pdf/1610.01234.pdf.
Bennett, G. (1962). Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57(297), 33–45.
Blum, A., & Langford, J. (2003). PAC-MDL bounds. In Proceedings of the 16th annual conference on computational learning theory (COLT) (pp. 344–357).
Chvátal, V. (1979). The tail of the hypergeometric distribution. Discrete Mathematics, 25(3), 285–287.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
Devroye, L., & Wagner, T. (1979). Distribution-free inequalities for the deleted and holdout estimates. IEEE Transactions on Information Theory, 25, 202–207.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Berlin: Springer.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley.
Floyd, S., & Warmuth, M. (1995). Sample compression, learnability, and the Vapnik–Chervonenkis dimension. Machine Learning, 21(3), 1–36.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Berlin: Springer.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.
Joachims, T. (2002). Learning to classify text using support vector machines. London: Kluwer Academic Publishers.
Kedem, D., Tyree, S., Sha, F., Lanckriet, G. R. & Weinberger, K. Q. (2012). Nonlinear metric learning. In Pereira, F., Burges, C. J. C., Bottou, L., & Weinberger, K. Q. (Eds.), Advances in neural information processing systems (Vol. 25, pp. 2573–2581). Curran Associates, Inc. http://papers.nips.cc/paper/4840nonlinearmetriclearning.pdf.
Langford, J. (2005). Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6, 273–306.
Li, J., Sonmez, A., Cataltepe, Z., & Bax, E. (2012). Validation of network classifiers. Structural, Syntactic, and Statistical Pattern Recognition Lecture Notes in Computer Science, 7626, 448–457.
Littlestone, N., & Warmuth, M. (1986). Relating data compression and learnability. Unpublished manuscript, University of California, Santa Cruz.
London, B., Huang, B., & Getoor, L. (2012). Improved generalization bounds for largescale structured prediction. In NIPS workshop on algorithmic and statistical approaches for large social networks.
Macskassy, S. A., & Provost, F. (2007). Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8, 935–983.
Marchand, M., & ShaweTaylor, J. (2001). Learning with the set covering machine. In Proceedings of the eighteenth international conference on machine learning (ICML 2001) (pp. 345–352).
Maurer, A., & Pontil, M. (2009). Empirical Bernstein bounds and sample-variance penalization. In 22nd annual conference on learning theory (COLT). http://www0.cs.ucl.ac.uk/staff/M.Pontil/reading/svpfinal.pdf.
Mullin, M., & Sukthankar, R. (2000). Complete cross-validation for nearest neighbor classifiers. In Proceedings of the seventeenth international conference on machine learning (pp. 639–646).
Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., & EliassiRad, T. (2008). Collective classification in network data. AI Magazine, 29(3), 93–106.
Skala, M. (2013). Hypergeometric tail inequalities: Ending the insanity. arXiv arXiv:1311.5939v1. https://arxiv.org/abs/1311.5939v1.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142. https://doi.org/10.1145/1968.1972. (ISSN: 0001-0782).
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16, 264–280.
Acknowledgements
We thank the anonymous referees for their detailed and extremely helpful corrections on the main results and advice on testing and presentation.
Editor: Tapio Elomaa.
A Method to compute \(\hat{p}_{I}\) and \(\hat{p}_c\)
By gathering terms rather than sampling, we can compute \(\hat{p}_{I}\) and \(\hat{p}_c\) exactly. In this appendix, we show how to compute \(\hat{p}_{I}\) exactly and how to compute \(\hat{p}_c\) in O(\(n \ln n\)) time and O(\((\ln n)^4\)) space, assuming \(w = k + r - 1\) and \(k+r \in \hbox {O}(\ln n)\), and ignoring any time and space required to find the \(k+r\) nearest neighbors to each in-sample example. The methods in this section are inspired by a similar approach for a single validation subset by Mullin and Sukthankar (2000).
Recall from Eqs. 105 and 106 that
and
Use the symmetry of permutations over same-size subsets S to compute only for \(S = \{1, \ldots , |S|\}\), and use s to index values of |S|. Note that
Let \(A_s = \{1, \ldots , s\}\). Let
Then
Refer to the jth nearest neighbor to (x, y) in \(F - \{(x,y)\}\) as neighbor j. Let \(c_{t,u,v}(\sigma )\) be the condition that a permutation \(\sigma \) assigns the neighbors to (x, y) to sets \(F, V_1, \ldots , V_r\sigma \) such that there are exactly k voters (in \(F - (V_1 \cup \ldots \cup V_s \cup V_i)\sigma \)) among neighbors 1 to t, neighbor t is a voter, there are k neighbors from \(F-V\sigma \) among neighbors 1 to u, neighbor u is from \(F-V\sigma \), and there are v voters among neighbors 1 to u. Let
and
Then
where
To see why, compare this to Eq. 156. For each (t, u, v), we multiply the probability of \(c_{t,u,v}\), which is \(p_{t,u,v}\), by \(p_s\), \(p_{i-1-s}\), and \(p_g\), each conditioned on \(c_{t,u,v}\). (Taking probabilities over \(\hat{P}\) conditions on \(c_{t,u,v}\).) Together, the conditions in \(p_s\), \(p_{i-1-s}\), and \(p_g\) are equivalent to the condition in Eq. 156, because \(b_s \wedge \lnot a_{s+1} \wedge \ldots \wedge \lnot a_{i-1} \sigma \) equals \(b_{i-1}\sigma \). The limits of summation for t and u follow from the fact that, with \((x,y) \in V_i\sigma \), there are \((s+1)m-1\) remaining nonvoter assignments and \(rm-1\) remaining validation subset assignments for each \(\sigma \) in P((x, y), i).
Probabilities \(p_s\), \(p_{i-1-s}\), and \(p_g\), each conditioned on \(c_{t,u,v}\), are independent of each other: \(c_{t,u,v}\) specifies that there are \(u-v\) nonvoters before the \(k\)th neighbor in \(F-V\), so it completely determines \(p_s\). Also, \(c_{t,u,v}\) specifies that neighbor t is the \(k\)th voter, so each size \(k-1\) subset of the first \(t-1\) is equally likely to be the other voters that determine \(\overline{g_{{A_s \cup \{i\}}}}\) in \(p_g\), no matter how the v voters are allocated among \(V_{s+1}\sigma , \ldots , V_{i-1}\sigma \) in \(p_{i-1-s}\).
To compute \(p_{t,u,v}\), note that with \((x,y) \in V_i\sigma \), for \(\sigma \in P((x,y),i)\), there are \(n-1\) remaining assignments, including \(m-1\) to \(V_i\sigma \), m for each other validation subset, and \(n - rm\) for \(F-V\sigma \). This includes \(n - (s+1)m\) voters and \((s+1)m-1\) nonvoters. So
The terms are the probabilities of the following conditions, respectively, each conditioned on the previous terms’ conditions:

1.
There are exactly k voters among the first t neighbors.

2.
Neighbor t is one of those k voters.

3.
The first u neighbors include exactly v voters.

4.
Exactly k of the v voters are in \(F-V\sigma \).

5.
Neighbor u is from \(F-V\sigma \). (The sum is over the number z of neighbors \(t+1\) to u in \(F-V\sigma \).) If \(\frac{z}{u-t} = \frac{0}{0}\), then treat it as one.
Now consider the three probabilities \(p_s\), \(p_{i-1-s}\), and \(p_g\). The first is the probability that the validation subsets \(V_1\sigma , \ldots , V_s\sigma \) are all represented among the nearer neighbors to (x, y) than the \(k\)th nearest neighbor from \(F-V\sigma \). Since we condition on \(c_{t,u,v}\) (by taking the probability only over \(\sigma \in \hat{P}\), for which \(c_{t,u,v}\) holds), the condition is that among the neighbors assigned \(u-v\) of the \((s+1)m-1\) nonvoter positions, each of s sets of m positions is represented. Use inclusion and exclusion, counting all ways to select the \(u-v\) neighbors, subtracting ways to select the \(u-v\) neighbors without drawing from each set \(V_1\sigma , \ldots , V_s\sigma \), adding those that avoid drawing from each pair of sets, and so on:
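Numerically, the inclusion-exclusion sum can be sketched and checked against brute-force enumeration for small parameters (the counting below is our reading of the construction: s sets of m nonvoter positions each, plus the \(m-1\) remaining positions of \(V_i\)):

```python
# Inclusion-exclusion for the probability that a uniformly random
# (u - v)-subset of the (s+1)*m - 1 nonvoter positions hits each of the
# s validation subsets V_1..V_s (m positions apiece; V_i contributes the
# remaining m - 1 positions). Verified against brute force below.
from itertools import combinations
from math import comb

def p_s_inclusion_exclusion(s, m, d):
    total = (s + 1) * m - 1
    return sum((-1) ** j * comb(s, j) * comb(total - j * m, d)
               for j in range(s + 1)) / comb(total, d)

def p_s_brute_force(s, m, d):
    # Label position p by the subset it belongs to; label s is V_i (m-1 slots).
    labels = [p // m for p in range((s + 1) * m - 1)]
    hits = count = 0
    for chosen in combinations(range(len(labels)), d):
        count += 1
        hits += set(labels[p] for p in chosen) >= set(range(s))
    return hits / count

assert abs(p_s_inclusion_exclusion(2, 2, 3) - p_s_brute_force(2, 2, 3)) < 1e-12
```

For example, with \(s=2\), \(m=2\), and \(u-v=3\), both computations give 0.8: of the 10 ways to pick 3 of 5 positions, 8 hit both \(V_1\) and \(V_2\).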
Similarly, the condition for \(p_{i-1-s}\), given \(c_{t,u,v}\), is that all of \(V_{s+1}\sigma , \ldots , V_{i-1}\sigma \) are represented among the \(v-k\) voters with positions in \(V_{s+1} \cup \ldots \cup V_{i-1} \cup V_{i+1} \cup \ldots \cup V_r\sigma \). (The other k voters are in \(F-V\sigma \).) Once again, use inclusion and exclusion:
The condition for \(p_g\), given \(c_{t,u,v}\), is that at least \(\frac{k+1}{2}\) of the nearest k voters, of which the last is neighbor t, have labels that disagree with y. Let \(y_j\) be the label of neighbor j. Let \(d_j\) count the labels among neighbors 1 to j that disagree with y. Use b to count how many neighbors with labels that disagree with y are among the \(k-1\) voters nearer to (x, y) than neighbor t. Then
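A sketch of this computation (our reconstruction from the description above, with our own function names): given \(c_{t,u,v}\), the other \(k-1\) voters are a uniform \((k-1)\)-subset of the first \(t-1\) neighbors, so b is hypergeometric in \(d_{t-1}\):

```python
# Reconstruction of p_g: b of the k-1 voters before neighbor t disagree
# with y (hypergeometric over the first t-1 neighbors, d of which
# disagree); the vote errs when b + I(y_t != y) reaches the majority
# threshold (k+1)/2 for odd k. Names and structure are ours.
from math import comb

def p_g(t, k, d, last_disagrees):
    """Probability that at least (k+1)/2 of the k voters disagree with y."""
    need = (k + 1) // 2  # majority threshold for odd k
    total = comb(t - 1, k - 1)
    hits = sum(comb(d, b) * comb(t - 1 - d, k - 1 - b)
               for b in range(k)
               if b + int(last_disagrees) >= need)
    return hits / total

# Sanity checks: valid probabilities, monotone in the disagreement count d.
assert 0.0 <= p_g(7, 3, 2, False) <= 1.0
assert p_g(7, 3, 4, True) >= p_g(7, 3, 2, True)
```

For instance, with \(t=7\), \(k=3\), \(d_{t-1}=2\), and \(y_t = y\), both of the other voters must disagree, giving \(\binom{2}{2}/\binom{6}{2} = 1/15\).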
Substitute Eq. 160 into Eq. 157 to get an equation for \(\hat{p}_{I}\):
For \(\hat{p}_c\) with \(w = k+r-1\), reduce the upper limits of summation for t and u to \(k+r-1\):
To compute this value, notice that only \(p_g\) depends on values that are specific to each example (x, y)—the values \(d_{t-1}\) and \(I(y_t\not =y)\). Since \(p_g\) depends only on those values and t, we can rearrange the sum:
where
To compute q(t), first compute and store \(p_{i-1-s}\) for all feasible (i, s, v) and \(p_s\) for all feasible (s, u, v). Next, compute and store the last term of \(p_{t,u,v}\) for all feasible \((v, u-t)\), then use those values to compute and store \(p_{t,u,v}\) for all feasible (s, t, u, v). This requires O(\(r^4\)) computation and storage. Then compute q(t) for each \(t \in \{k, \ldots , k+r-1\}\) by iterating through the sums and using the precomputed values for \(p_{t,u,v}\), \(p_s\), and \(p_{i-1-s}\). This requires O(\(r^4\)) computation.
To compute \(\hat{p}_c\), first compute \(p_g\) for all feasible \((t,d_{t-1},I(y_t\not =y))\). This requires O(\(rk(k+r)\)) computation and O(\(r(k+r)\)) storage. Then, for each \((x,y) \in F\), find its \(k+r-1\) nearest neighbors in \(F-V\), and use the neighbors’ labels to compute \(d_{t-1}\) and \(I(y_t\not =y)\) for \(t \in \{k, \ldots , k+r-1\}\). This requires O(\(k+r\)) computation. Then compute the sum over t in Eq. 172, using the \(d_{t-1}\) and \(I(y_t\not =y)\) values to select precomputed \(p_g\) values and using the precomputed q(t) values. This produces a sample value for (x, y). Average those sample values over \((x,y) \in F\) to compute \(\hat{p}_c\).
Using this method, aside from the time to find the \(k+r-1\) nearest neighbors to each in-sample example, the time complexity is O(\(\max (r^4, rk(k+r), n (k+r))\)) and the storage complexity is O(\(\max (r^4, r(k+r))\)). If \(k \in \hbox {O}(\ln n)\) and \(r \in \hbox {O}(\ln n)\) and \(n > (\ln n)^3\), then this is O(\(n \ln n\)) time and O(\((\ln n)^4\)) storage.
Cite this article
Bax, E., Weng, L. & Tian, X. Speculate-correct error bounds for k-nearest neighbor classifiers. Mach Learn 108, 2087–2111 (2019). https://doi.org/10.1007/s10994-019-05814-1
Keywords
 Nearest neighbors
 Error bounds
 Generalization