
Machine Learning, Volume 92, Issue 2–3, pp 349–376

Conditional validity of inductive conformal predictors

  • Vladimir Vovk

Abstract

Conformal predictors are set predictors that are automatically valid in the sense of having coverage probability equal to or exceeding a given confidence level. Inductive conformal predictors are a computationally efficient version of conformal predictors satisfying the same property of validity. However, inductive conformal predictors have only been known to control unconditional coverage probability. This paper explores various versions of conditional validity and various ways to achieve them using inductive conformal predictors and their modifications. In particular, it discusses a convenient expression of one of the modifications in terms of ROC curves.

Keywords

Inductive conformal predictors · Conditional validity · Batch mode of learning · ROC curves · Boosting · MART · Spam detection

1 Introduction

This paper continues the study of the method of conformal prediction, introduced in Vovk et al. (1999) and Saunders et al. (1999) and further developed in Vovk et al. (2005). An advantage of the method is that its predictions (which are set rather than point predictions) automatically satisfy a finite-sample property of validity. Its disadvantage is its relative computational inefficiency in many situations. A modification of conformal predictors, called inductive conformal predictors, was proposed in Papadopoulos et al. (2002a, 2002b) with the purpose of improving on the computational efficiency of conformal predictors. For further information on conformal predictors and inductive conformal predictors see, e.g., Balasubramanian et al. (2013) and Papadopoulos et al. (2013).

Most of the literature on conformal prediction studies the behavior of set predictors in the online mode of prediction, perhaps because the property of validity can be stated in an especially strong form in the online mode (as first shown in Vovk 2002). The online mode, however, is much less popular in applications of machine learning than the batch mode of prediction. This paper follows the recent papers by Lei et al. (2013) and Lei and Wasserman (2013) studying properties of conformal prediction in the batch mode; we, however, concentrate on inductive conformal prediction. The performance of inductive conformal predictors in the batch mode is illustrated using the well-known Spambase data set; for earlier empirical studies of conformal prediction in the batch mode see, e.g., Vanderlooy et al. (2007).

We will usually be making the assumption of randomness, which is standard in machine learning and nonparametric statistics: the available data is a sequence of examples generated independently from the same probability distribution Q. (In some cases we will make the weaker assumption of exchangeability; for some of our results even weaker assumptions, such as conditional randomness or exchangeability, would have been sufficient.) Each example consists of two components: an object and a label. We are given a training set of examples and a new object, and our goal is to predict the label of the new object. (If we have a whole test set of new objects, we can apply the procedure for predicting one new label to each of the objects in the test set.)

The two desiderata for inductive conformal predictors are their validity and efficiency: validity requires that the coverage probability of the prediction sets should be at least equal to a preset confidence level, and efficiency requires that the prediction sets should be as small as possible. However, there is a wide variety of notions of validity, since the “coverage probability” is, in general, a conditional probability. The simplest case is where we condition on the trivial σ-algebra, i.e., the probability is in fact unconditional probability, but several other notions of conditional validity are depicted in Fig. 1, where T refers to conditioning on the training set, O to conditioning on the test object, and L to conditioning on the test label. The arrows in Fig. 1 lead from stronger to weaker notions of conditional validity; U is the sink and TOL is the source (the latter is not shown).
Fig. 1

Eight notions of conditional validity. The visible vertices of the cube are U (unconditional), T (training conditional), O (object conditional), L (label conditional), OL (example conditional), TL (training and label conditional), TO (training and object conditional). The invisible vertex is TOL (and corresponds to conditioning on everything)

Inductive conformal predictors (slightly generalized as compared with the standard version) will be defined in Sect. 2. They are automatically valid, in the sense of unconditional validity. It should be said that, in general, the unconditional error probability is easier to deal with than conditional error probabilities; e.g., the standard statistical methods of cross-validation and bootstrap provide decent estimates of the unconditional error probability but poor estimates for the training conditional error probability: see Hastie et al. (2009), Sect. 7.12.

In Sect. 3 we explore training conditional validity of inductive conformal predictors. Our simple results (Theorem 1 and Corollaries 1 and 2) are of the PAC type, involving two parameters: the target training conditional coverage probability 1−ϵ and the probability 1−δ with which 1−ϵ is attained. They show that inductive conformal predictors achieve training conditional validity automatically (whereas for other notions of conditional validity the method has to be modified). We give a self-contained proof of Theorem 1, but Appendix A explains how its significant part can be deduced from classical results about tolerance regions.

In the following section, Sect. 4, we introduce a conditional version of inductive conformal predictors and explain, in particular, how it achieves label conditional validity. Label conditional validity is important as it allows the learner to control the set-prediction analogues of false positive and false negative rates. Section 5 is about object conditional validity and its main result (a version of a lemma in Lei and Wasserman 2013) is negative: precise object conditional validity cannot be achieved in a useful way unless the test object has a positive probability. Whereas precise object conditional validity is usually not achievable, we should aim for approximate and asymptotic object conditional validity when given enough data (cf. Lei and Wasserman 2013).

Section 6 reports on the results of empirical studies for the standard Spambase data set (see, e.g., Hastie et al. 2009, Chap. 1, Example 1, and Sect. 9.1.2). Section 7 discusses close connections between an important class of label conditional ICPs and ROC curves. Section 8 concludes the main part of the paper, and two appendixes are devoted to related approaches to set prediction. Appendix A discusses connections with the classical theory of tolerance regions (in particular, it explains how part of Theorem 1 can be deduced from classical results about the training conditional validity of tolerance regions). Appendix B discusses training conditional validity of conformal predictors.

2 Inductive conformal predictors

The example space will be denoted Z; it is the Cartesian product X×Y of two measurable spaces, the object space X and the label space Y. In other words, each example z∈Z consists of two components: z=(x,y), where x∈X is its object and y∈Y is its label. Two important special cases are the problem of classification, where Y is a finite set (equipped with the discrete σ-algebra), and the problem of regression, where Y is the real line \(\mathbb{R}\).

Various predictors defined and discussed in this paper are randomized: they depend, in addition to the data, on an element \(\omega\in\bar{\varOmega}\) of a measurable space \(\bar{\varOmega}\) equipped with a probability distribution R (the “coin-tossing” distribution). This is important to cover various predictors based on the MART procedure, which is randomized and used in our computational experiments in Sect. 6.

Let (z 1,…,z l ) be the training set, z i =(x i ,y i )∈Z. We split it into two parts, the proper training set (z 1,…,z m ) of size m<l and the calibration set of size n:=l−m. An inductive conformity m-measure is a measurable function \(A:\mathbf{Z}^{m}\times\mathbf{Z}\times\bar{\varOmega}\to\mathbb{R}\); the idea behind the conformity score A((z 1,…,z m ),z,ω) is that it should measure how well z conforms to the proper training set. We omit “m-” when it is clear from the context. A standard choice of an inductive conformity measure is
$$ A\bigl((z_1,\ldots,z_m),(x,y),\omega \bigr) := \varDelta\bigl(y,f(x)\bigr), $$
(1)
where f:X→Y′ is a prediction rule found (perhaps using a randomized procedure) from (z 1,…,z m ) as the training set and \(\varDelta:\mathbf{Y}\times\mathbf{Y}'\to\mathbb{R}\) is a measure of similarity between a label and a prediction. Allowing Y′ to be different from Y (often Y′⊃Y) may be useful when the underlying prediction method provides additional information beyond the predicted label; e.g., the MART procedure used in Sect. 6 gives the logit of the predicted probability that the label is 1.

Remark 1

The idea behind the term “calibration set” is that this set allows us to calibrate the conformity scores of test examples by translating them into a probability-type scale.

The inductive conformal predictor (ICP) corresponding to A is defined as the set predictor
$$ \varGamma^{\epsilon}(z_1,\ldots,z_l,x, \omega) := \bigl\{y \mid p^y>\epsilon\bigr\}, $$
(2)
where ϵ∈(0,1) is the chosen significance level (1−ϵ is known as the confidence level), the p-values p y , yY, are defined by
$$ p^y := \frac {\vert \{ i=m+1,\ldots,l \mid \alpha_i\le\alpha^y \}\vert + 1 }{ l-m+1}, $$
(3)
where the conformity scores are defined by
$$ \alpha_i := A\bigl((z_1,\ldots,z_m),z_i,\omega\bigr), \quad i=m+1,\ldots,l, \qquad \alpha^y := A\bigl((z_1,\ldots,z_m),(x,y),\omega\bigr). $$
(4)
Given the training set and a new object x the ICP predicts its label y; it makes an error if y∉Γ ϵ (z 1,…,z l ,x,ω). All predictors considered in this paper are randomized, and so we omit the word “randomized”.
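To make (2) and (3) concrete, here is a minimal Python sketch of an ICP for classification; the scoring function `score` plays the role of A((z 1,…,z m ),·,ω) with the proper training set and ω fixed, and all names and toy data are illustrative rather than from the paper.

```python
# Sketch of an inductive conformal predictor (equations (2)-(3)).
# `score` stands for the conformity measure with the proper training
# set and omega already fixed; higher scores mean "conforms better".

def icp_prediction_set(score, calib, x, labels, epsilon):
    """Return the prediction set {y : p^y > epsilon} of equation (2).

    calib -- calibration examples (x_i, y_i), i = m+1, ..., l
    """
    alphas = [score(z) for z in calib]          # alpha_i, i = m+1, ..., l
    n = len(alphas)
    prediction = set()
    for y in labels:
        alpha_y = score((x, y))                 # alpha^y for the postulated label
        p_y = (sum(a <= alpha_y for a in alphas) + 1) / (n + 1)   # equation (3)
        if p_y > epsilon:
            prediction.add(y)
    return prediction
```

Note that a label that conforms worse than every calibration example receives the smallest possible p-value 1/(n+1), so it is excluded at any significance level above 1/(n+1).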

We consider a canonical probability space Δ whose elements are all possible sequences z i =(x i ,y i ), i=1,…,l+1, of l+1 examples and which is equipped with a probability distribution P. Random variables Z i =(X i ,Y i ), i=1,…,l+1, are projections of this probability space onto its ith coordinate: Z i (z 1,…,z l+1):=z i , X i (z 1,…,z l+1):=x i , and Y i (z 1,…,z l+1):=y i . We often let x i , y i , and z i stand for realizations of the random variables X i , Y i , and Z i , respectively. Our overall probability space is \(\varDelta\times\bar{\varOmega}\times[0,1]\), and it is equipped with the product measure P×R×U, where R is the coin-tossing distribution mentioned above and U is the uniform probability distribution on [0,1] (we will need U in the definition of “smoothed” ICP below). The generic element of \(\varDelta\times\bar{\varOmega}\times[0,1]\) will usually be denoted (z 1,…,z l+1,ω,θ), and the projections onto the last two components will be denoted Ω(z 1,…,z l+1,ω,θ):=ω and Θ(z 1,…,z l+1,ω,θ):=θ; Z i will also be regarded as random variables on the overall probability space that ignore the last two coordinates. In cases where θ is irrelevant we will also consider the probability space \(\varDelta\times\bar{\varOmega}\) equipped with the probability distribution P×R. It will always be clear from the context which of the three probability spaces we are talking about.

Smoothed inductive conformal predictors are defined as ICPs except that (3) is replaced by
$$ p^y := \frac { \vert \{ i=m+1,\ldots,l \mid \alpha_i<\alpha^y \}\vert + \theta ( \vert \{ i=m+1,\ldots,l \mid \alpha_i=\alpha^y \}\vert + 1 ) }{ l-m+1}; $$
(5)
therefore, Γ ϵ now depends on θ as well (remember that θ stands for values taken by the random variable Θ distributed uniformly on [0,1]).
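In code, the smoothed p-value (5) differs from (3) only in how ties are treated; a sketch (the function name is mine), with θ passed in explicitly:

```python
# Smoothed p-value of equation (5): ties alpha_i == alpha^y are broken
# by theta, a realization of the uniform random variable Theta on [0, 1].

def smoothed_p_value(alphas, alpha_y, theta):
    n = len(alphas)
    below = sum(a < alpha_y for a in alphas)    # strictly smaller scores
    tied = sum(a == alpha_y for a in alphas)    # ties; "+ 1" counts alpha^y itself
    return (below + theta * (tied + 1)) / (n + 1)
```

With θ=0 and θ=1 this recovers the two extreme ways of ranking the tied scores; averaging over θ is what makes the smoothed p-value behave like a uniform random variable under exchangeability.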

Remark 2

The smoothed inductive conformal predictors defined in this section are more general than the corresponding smoothed predictors considered in Vovk et al. (2005): the former involve not only the tie-breaking random variable Θ but also randomized conformity measures. However, this generalization is straightforward: we get it essentially for free.

Proposition 1

(Vovk et al. 2005, Proposition 4.1)

Let random examples Z m+1,…,Z l ,Z l+1=(X l+1,Y l+1) be exchangeable (i.e., their distribution P is invariant under permutations). The probability of error Y l+1∉Γ ϵ (Z 1,…,Z l ,X l+1,Ω) does not exceed ϵ for any ϵ and any inductive conformal predictor Γ. The probability of error Y l+1∉Γ ϵ (Z 1,…,Z l ,X l+1,Ω,Θ) is equal to ϵ for any ϵ and any smoothed inductive conformal predictor Γ.

This simple proposition of validity is proved in Vovk et al. (2005) for inductive conformal predictors based on deterministic inductive conformity measures, but integration over \(\bar{\varOmega}\) immediately yields Proposition 1. In practice the probability of error is usually close to ϵ even for unsmoothed ICPs (as we will see in Sect. 6 and Appendix B).

In conclusion of this section, let me give two specific examples of ICPs. Since an ICP is determined by its inductive conformity measure, it suffices to specify the latter.
  • In the case of regression, \(\mathbf{Y}=\mathbb{R}\), we can define the inductive conformity measure by (1) where Δ(y,f(x)):=−|y−f(x)| and f is the prediction rule found by using ridge regression from (z 1,…,z m ) as the training set. This ICP is the inductive counterpart of the Ridge Regression Confidence Machine (Vovk et al. 2005, Sect. 2.3).

  • An example not covered by the scheme (1) is the 1-Nearest Neighbor ICP, whose inductive conformity measure is
    $$ A\bigl((z_1,\ldots,z_m),(x,y),\omega \bigr) := \frac{\min_{i=1,\ldots,m:y_i\ne y}d(x,x_i)}{\min_{i=1,\ldots,m:y_i=y}d(x,x_i)}, $$
    (6)
    where d is a distance on X. Intuitively, an example conforms to the proper training set if it is closer to the examples labeled in the same way than to those labeled differently. In the case of classification, this ICP will be called the 1-Nearest Neighbor ICP.
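The 1-Nearest Neighbor conformity measure (6) can be sketched as follows; `d` is any distance on the object space, and the function name is mine:

```python
# 1-Nearest Neighbor inductive conformity measure, equation (6):
# distance to the nearest differently labeled proper-training example
# divided by the distance to the nearest identically labeled one.

def nn_conformity(proper_train, x, y, d):
    other = min(d(x, xi) for xi, yi in proper_train if yi != y)
    same = min(d(x, xi) for xi, yi in proper_train if yi == y)
    return other / same   # large when (x, y) conforms to the proper training set
```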
Another example, based on boosting, will be given in Sect. 6. For numerous other examples, see Vovk et al. (2005), Sect. 4.2.

3 Training conditional validity

As discussed in Sect. 1, the standard property of validity of inductive conformal predictors is unconditional. The property of training conditional validity can be formalized using a PAC-type 2-parameter definition. It will be convenient to represent the ICP (2) in a slightly different form downplaying the structure (x i ,y i ) of z i . Define Γ ϵ (z 1,…,z l ,ω):={(x,y)∣p y >ϵ}, where p y is defined, as before, by (3) and (4) (therefore, p y depends implicitly on x). In this notation the first part of Proposition 1 can be restated by saying that the probability of error Z l+1∉Γ ϵ (Z 1,…,Z l ,Ω) does not exceed ϵ provided Z 1,…,Z l+1 are exchangeable. We will also use similar conventions in the smoothed case.

A set predictor Γ (outputting a subset of Z given l examples and measurable in the sense of the set {Z l+1∈Γ(Z 1,…,Z l ,Ω,Θ)} being measurable) is (ϵ,δ)-valid with respect to a probability distribution Q on Z if
$$\bigl(Q^{l+1}\times R\times U\bigr) \bigl( Q\bigl(\varGamma(Z_1,\ldots,Z_l,\varOmega,\varTheta)\bigr)\ge1-\epsilon \bigr) \ge 1-\delta $$
(we will apply this definition to both smoothed and unsmoothed ICPs, even though the latter in fact do not depend on θ). We say that Γ is (ϵ,δ)-valid if it is (ϵ,δ)-valid with respect to any probability distribution Q on Z. Our next result (Theorem 1 below) says that ICPs satisfy this property for suitable ϵ and δ; we will see, however, that this is not true for smoothed ICPs in general. Some conditions in the statement of Theorem 1 are not straightforward to interpret; for more explicit conditions, see Corollaries 1 and 2.

Let Z be the random variable Z(z):=z on the measurable space Z (equipped with a probability distribution usually denoted Q). We will say that an inductive conformity measure is continuous under a probability distribution Q on Z if, for Q m -almost all (z 1,…,z m )∈Z m and R-almost all \(\omega\in\bar{\varOmega}\), the random variable A((z 1,…,z m ),Z,ω) on the probability space (Z,Q) is continuous.

Theorem 1

Let \(\operatorname{bin}_{n,E}\) be the cumulative binomial distribution function with n trials and probability of success E; set \(\operatorname{bin}_{n,E}(-1):=0\).
  1. (a)
    Let Γ be an inductive conformal predictor. Suppose that ϵ,δ,E∈(0,1) satisfy
    $$ \delta \ge \mathop{\mathrm{bin}}\limits_{n,E} \bigl( \bigl\lfloor\epsilon(n+1)-1 \bigr\rfloor \bigr), $$
    (7)
    where n:=lm is the size of the calibration set. The set predictor Γ ϵ is then (E,δ)-valid. Moreover, for any probability distribution Q on Z, any proper training set (z 1,…,z m )∈Z m , and any \(\omega\in\bar{\varOmega}\),
    $$ Q^{l+1} \bigl( Q\bigl(\varGamma^{\epsilon}(z_1, \ldots,z_m,Z_{m+1},\ldots,Z_l,\omega)\bigr)\ge1-E \bigr) \ge 1-\delta. $$
    (8)
    If Γ is based on an inductive conformity measure that is continuous under Q, Γ ϵ is (E,δ)-valid with respect to Q if and only if (7) holds.
     
  2. (b)
    Let Q be a probability distribution on Z and Γ be a smoothed inductive conformal predictor based on an inductive conformity measure continuous under Q. Suppose ϵ,δ,E∈(0,1) satisfy
    $$ \delta \ge \mathop{\mathrm{bin}}\limits_{n,E} \bigl( \bigl\lfloor \epsilon(n+1)\bigr\rfloor \bigr). $$
    (9)
    The set predictor Γ ϵ is (E,δ)-valid with respect to Q. Moreover, for Q m -almost all proper training sets (z 1,…,z m )∈Z m , R-almost all ω, and all θ∈[0,1],
    $$ Q^{l+1} \bigl( Q\bigl( \varGamma^{\epsilon}(z_1,\ldots,z_m,Z_{m+1}, \ldots,Z_l,\omega,\theta)\bigr)\ge1-E \bigr) \ge 1-\delta. $$
    (10)
    The set predictor Γ ϵ is not (E,δ)-valid with respect to Q unless ϵ,δ,E satisfy (7).
     

In the case of smoothed ICPs there is a gap between the sufficient condition (9) and the necessary condition (7), but it does not appear excessive. More worrying is the requirement that the inductive conformity measure be continuous under the unknown data-generating distribution Q. Unfortunately, without this or a similar requirement there are no meaningful guarantees of training conditional validity. Indeed, consider the trivial smoothed ICP based on the inductive conformity measure identically equal to 0. At significance level ϵ, it has coverage probability 1 with probability 1−ϵ and coverage probability 0 with probability ϵ. Therefore, it cannot be (E,δ)-valid for E<1 unless δ≥ϵ. This contrasts with the case of unsmoothed ICPs, where very small δ are achievable: see, e.g., Fig. 8 below. Another natural way to define smoothed ICPs is to use different random variables Θ when computing p y for different labels y∈Y; however, this version also encounters similar problems with training conditional validity when the inductive conformity measure is not required to be continuous under Q.
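Condition (7) is easy to check numerically. The sketch below (helper names are mine) computes the exact binomial CDF and finds, by bisection, the smallest E for which Theorem 1(a) guarantees (E,δ)-validity, using the fact (Lemma 1 below) that \(\operatorname{bin}_{n,E}(k)\) is decreasing in E:

```python
# Numerical check of condition (7): delta >= bin_{n,E}(floor(eps*(n+1) - 1)).

from math import comb, floor

def binom_cdf(n, p, k):
    """Cumulative binomial distribution function bin_{n,p}(k), with bin_{n,p}(-1) = 0."""
    if k < 0:
        return 0.0
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def smallest_valid_E(n, eps, delta, tol=1e-9):
    """Smallest E such that (7) holds, found by bisection (the CDF decreases in E)."""
    k = floor(eps * (n + 1) - 1)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(n, mid, k) > delta:
            lo = mid        # (7) fails at mid: E must be larger
        else:
            hi = mid
    return hi
```

For a calibration set of size n=99 and ϵ=δ=0.05 this returns a value strictly between ϵ and the Hoeffding-style bound (11) of Corollary 1, illustrating that the exact binomial condition is tighter.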

Proof of Theorem 1

We start from part (a), namely, from proving (8). By (2) and (3), the set predictor Γ ϵ makes an error, z l+1∉Γ ϵ (z 1,…,z l ,ω), if and only if the number of i=m+1,…,l such that α i ≤α y is at most ⌊ϵ(n+1)−1⌋; in other words, if and only if α y <α (k), where α (k) is the kth smallest α i and k:=⌊ϵ(n+1)−1⌋+1. (Formally, α (k) is defined by the requirement that |{i∣α i <α (k)}|<k≤|{i∣α i ≤α (k)}|; in other words, α (k) is the kth order statistic.) Therefore, the Q-probability of the complement of Γ ϵ (z 1,…,z l ,ω) is Q(A((z 1,…,z m ),Z,ω)<α (k)), where A is the inductive conformity measure. Set
$$ \alpha^* := \sup \bigl\{ \alpha \mid Q\bigl(A\bigl((z_1,\ldots,z_m),Z,\omega\bigr)<\alpha\bigr) \le E \bigr\}, \qquad E' := Q\bigl(A\bigl((z_1,\ldots,z_m),Z,\omega\bigr)<\alpha^*\bigr), \qquad E'' := Q\bigl(A\bigl((z_1,\ldots,z_m),Z,\omega\bigr)\le\alpha^*\bigr). $$
The σ-additivity of measures implies that E′≤E≤E″, and E′=E=E″ unless α ∗ is an atom of the distribution of A((z 1,…,z m ),Z,ω). Both when E′=E and when E′<E, the probability of error will exceed E if and only if α (k)>α ∗; in other words, if and only if we have at most k−1 of the α i below or equal to α ∗. The probability that at most k−1=⌊ϵ(n+1)−1⌋ values of the α i are below or equal to α ∗ equals \(\operatorname{\mathbb{P}}(B''_{n}\le\lfloor\epsilon(n+1)-1\rfloor)\le \operatorname{\mathbb{P}}(B_{n}\le\lfloor\epsilon(n+1)-1\rfloor)\), where \(B''_{n}\sim \operatorname{bin}_{n,E''}\), \(B_{n}\sim \operatorname{bin}_{n,E}\), and \(\operatorname{bin}_{n,p}\) is also allowed to stand for the binomial distribution with parameters (n,p). (For the inequality, see Lemma 1 below.) This completes the proof of (8) and, therefore, the first two statements of part (a). And the last statement of part (a) follows from the fact that E″=E unless α ∗ is an atom of the distribution of A((z 1,…,z m ),Z,ω).

Let us now prove part (b), starting from (10). We will assume that the distribution of A((z 1,…,z m ),Z,ω) is continuous (we can do so since (10) is required to hold only for almost all proper training sets and ω). By (5), the set predictor Γ ϵ can make an error only if the number of i=m+1,…,l such that α i <α y is at most ⌊ϵ(n+1)⌋ (set θ:=0 in (5) and combine this with p y ≤ϵ); in other words, only if α y ≤α (k), where α (k) is the kth smallest α i and k:=⌊ϵ(n+1)⌋+1. Therefore, the Q-probability of the complement of Γ ϵ (z 1,…,z l ,ω,θ) is at most Q(A((z 1,…,z m ),Z,ω)≤α (k)). Define α ∗ ,E′,E″ as before; now we know that E′=E=E″. The probability of error can exceed E only if α (k)>α ∗. In other words, only if we have at most k−1 of the α i below or at α ∗. The probability that at most k−1=⌊ϵ(n+1)⌋ values of the α i are below or at α ∗ equals \(\operatorname{\mathbb{P}}(B_{n}\le\lfloor\epsilon(n+1)\rfloor)\), where \(B_{n}\sim \operatorname{bin}_{n,E}\). This proves (10).

The last statement of part (b) follows immediately from what we have already proved. □

In the proof of Theorem 1 we used the first statement of the following lemma.

Lemma 1

Fix the number of trials n. The distribution function \(\operatorname{bin}_{n,p}(K)\) of the binomial distribution is decreasing in the probability of success p for a fixed K∈{0,…,n}. It is strictly decreasing unless K=n.

Proof

For the first statement of the lemma, it suffices to check that
$$\frac{d\mathop{\mathrm{bin}}\nolimits_{n,p}(K)}{dp} = \frac{d}{dp} \sum_{k=0}^K\binom{n}{k}p^k(1-p)^{n-k} = \sum_{k=0}^K \frac{k-np}{p(1-p)} \binom{n}{k}p^{k}(1-p)^{n-k} $$
is nonpositive for p∈(0,1). The last sum has the same sign as the mean of the function f(k):=k−np over the set k∈{0,…,K} with respect to the binomial distribution, and so it remains to notice that the overall mean of f is 0 and that the function f is increasing. This proves the first statement, and the second statement is now obvious. □

The following corollary makes (7) and (9) in Theorem 1 less precise but more explicit using Hoeffding’s inequality.

Corollary 1

Let ϵ,δ,E∈(0,1).
  1. (a)
    If Γ is an inductive conformal predictor, the set predictor Γ ϵ is (E,δ)-valid provided
    $$ E \ge \epsilon + \sqrt{\frac{-\ln\delta}{2n}}. $$
    (11)
     
  2. (b)
    If Γ is a smoothed inductive conformal predictor based on an inductive conformity measure continuous under Q, the set predictor Γ ϵ is (E,δ)-valid with respect to Q provided
    $$ E \ge \biggl(1+\frac{1}{n} \biggr)\epsilon + \sqrt{ \frac{-\ln\delta}{2n}}. $$
    (12)
     

This corollary gives the following recipe for constructing (ϵ,δ)-valid set predictors. The recipe only works if the training set is sufficiently large; in particular, its size l should significantly exceed \(N:=(-\ln\delta)/(2\epsilon^{2})\). Choose an ICP Γ with the size n of the calibration set exceeding N. Then the set predictor \(\varGamma^{\epsilon-\sqrt{(-\ln\delta)/(2n)}} \) will be (ϵ,δ)-valid.
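The recipe can be written down directly as a small helper (the function name is mine), assuming the size n of the calibration set is fixed in advance:

```python
# Significance level at which to run the ICP so that the resulting set
# predictor is (eps, delta)-valid, per Corollary 1(a): eps - sqrt(-ln(delta)/(2n)).

from math import log, sqrt

def adjusted_significance(eps, delta, n):
    slack = sqrt(-log(delta) / (2 * n))
    if eps <= slack:
        raise ValueError("calibration set too small: need n > (-ln delta) / (2 eps^2)")
    return eps - slack
```

E.g., with ϵ=δ=0.05 one needs n exceeding N ≈ 599; a calibration set of size n=5000 leaves a usable significance level of about 0.033.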

Proof of Corollary 1

Suppose E>ϵ. Combining (7) with Hoeffding’s inequality (see, e.g., Vovk et al. 2005, p. 287), we can see that the probability of error Q(Z∉Γ ϵ (Z 1,…,Z l ,Ω)) for an ICP will exceed E with probability at most
$$ \operatorname{\mathbb{P}}\bigl(B_n\le\bigl\lfloor\epsilon(n+1)-1\bigr\rfloor\bigr) \le \operatorname{\mathbb{P}}(B_n\le\epsilon n) \le e^{-2(E-\epsilon)^2n}, $$
where \(B_{n}\sim \operatorname{bin}_{n,E}\) and ϵ is the significance level. Solving \(e^{-2(E-\epsilon)^{2}n} = \delta \) we obtain that Γ ϵ is (E,δ)-valid whenever (11) is satisfied.
Analogously, in the case of a smoothed ICP and (9) we have
$$ \operatorname{\mathbb{P}}\bigl(B_n\le\bigl\lfloor\epsilon(n+1)\bigr\rfloor\bigr) \le \operatorname{\mathbb{P}}\bigl(B_n\le(1+1/n)\epsilon n\bigr) \le e^{-2(E-(1+1/n)\epsilon)^2n}, $$
and solving \(e^{-2(E-(1+1/n)\epsilon)^{2}n} = \delta \) leads to (12). □

Remark 3

The training conditional guarantees discussed in this section are very similar to those for the hold-out estimate of the probability of error of a classifier: compare, e.g., Theorem 1(a) above and Theorem 3.3 in Langford (2005). The former says that Γ ϵ is (E,δ)-valid for
$$ E := \mathop{\overline{\mathrm{bin}}}\limits_{n,\delta} \bigl( \bigl\lfloor \epsilon(n+1)-1\bigr\rfloor \bigr) \le \mathop{\overline{\mathrm{bin}}}\limits_{n,\delta} ( \epsilon n ) $$
(13)
where \(\operatorname{\overline{bin}}\) is the inverse function to \(\operatorname{bin}\):
$$ \mathop{\overline{\mathrm{bin}}}\limits_{n,\delta}(k) := \max\bigl\{p\mid \mathop{\mathrm{bin}}\limits_{n,p}(k)\ge\delta\bigr\} $$
(14)
(unless k=n, we can also say that \(\operatorname{\overline{bin}}_{n,\delta}(k)\) is the only value of p such that \(\operatorname{bin}_{n,p}(k)=\delta\): cf. Lemma 1 above). And the latter says that a point predictor’s error probability (over the test example) does not exceed
$$ \mathop{\overline{\mathrm{bin}}}\limits_{n,\delta} ( k ) $$
(15)
with probability at least 1−δ (over the training set), where k is the number of errors on a held-out set of size n. The main difference between (13) and (15) is that whereas one inequality contains the approximate expected number of errors ϵn for n new examples the other contains the actual number of errors k on n examples. Several researchers have found that the hold-out estimate is surprisingly difficult to beat; however, like the ICP of this section, it is not example conditional at all.
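Since \(\operatorname{bin}_{n,p}(k)\) is decreasing in p (Lemma 1), the inverse function (14) can be computed by bisection; a sketch with names of my choosing:

```python
# Upper inverse binomial of equation (14):
# bar_bin_{n,delta}(k) = max{p : bin_{n,p}(k) >= delta}.

from math import comb

def bin_cdf(n, p, k):
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def bar_bin(n, delta, k, tol=1e-9):
    lo, hi = 0.0, 1.0   # bin_cdf(n, ., k) decreases from 1 to 0 on [0, 1] when k < n
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bin_cdf(n, mid, k) >= delta:
            lo = mid    # still inside {p : bin_{n,p}(k) >= delta}
        else:
            hi = mid
    return lo
```

Plugging k=⌊ϵ(n+1)−1⌋ into `bar_bin` reproduces the bound (13), and plugging the observed number of hold-out errors k reproduces (15).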

In conclusion of this section we give a statement intermediate between Theorem 1 and Corollary 1.

Corollary 2

Let ϵ,δ,E∈(0,1).
  1. (a)
    If Γ is an inductive conformal predictor, the set predictor Γ ϵ is (E,δ)-valid provided
    $$ E \ge \epsilon + \sqrt{\frac{-2\epsilon\ln\delta}{n}} - \frac{2\ln\delta}{n}. $$
     
  2. (b)
    If Γ is a smoothed inductive conformal predictor based on an inductive conformity measure continuous under Q, the set predictor Γ ϵ is (E,δ)-valid with respect to Q provided
    $$ E \ge (1+1/n)\epsilon + \sqrt{\frac{-2(1+1/n)\epsilon\ln\delta}{n}} - \frac{2\ln\delta}{n}. $$
     

Proof

Inequality (7) can be rewritten as
$$ E \ge \mathop{\overline{\mathrm{bin}}}\limits_{n,\delta} \bigl( \bigl\lfloor\epsilon(n+1)-1\bigr\rfloor \bigr) $$
(using the notation (14)). In combination with inequality (2) in Langford (2005), p. 278, this leads to the first statement. The second statement follows by replacing ϵ with (1+1/n)ϵ. □

4 Conditional inductive conformal predictors

The motivation behind conditional inductive conformal predictors is that ICPs do not always achieve the required probability ϵ of error Y l+1∉Γ ϵ (Z 1,…,Z l ,X l+1,Ω) conditional on (X l+1,Y l+1)∈E for important sets E⊆Z. This is often undesirable. If, e.g., our set predictor is valid at the significance level 5 % but makes an error with probability 10 % for men and 0 % for women, both men and women can be unhappy with calling 5 % the probability of error. Moreover, in many problems we might want different significance levels for different regions of the example space: e.g., in the problem of spam detection (considered in Sects. 6 and 7) classifying spam as email usually does much less harm than classifying email as spam.

An inductive m-taxonomy is a measurable function K:Z m ×Z→K, where K is a measurable space. Usually the category K((z 1,…,z m ),z) of an example z is a kind of classification of z, which may depend on the proper training set (z 1,…,z m ).

The conditional inductive conformal predictor (conditional ICP) corresponding to K and an inductive conformity measure A is defined as the set predictor (2), where the p-values p y are now defined by
$$ p^y := \frac { \vert \{ i=m+1,\ldots,l \mid \kappa_i=\kappa^y \;\&\; \alpha_i\le\alpha^y \}\vert + 1 }{ \vert \{ i=m+1,\ldots,l \mid \kappa_i=\kappa^y \}\vert + 1 }, $$
(16)
the categories κ are defined by
$$ \kappa_i := K\bigl((z_1, \ldots,z_m),z_i\bigr), \quad i=m+1,\ldots,l, \qquad \kappa^y := K\bigl((z_1,\ldots,z_m),(x,y) \bigr), $$
and the conformity scores α are defined as before by (4). A label conditional ICP is a conditional ICP with the inductive m-taxonomy K(⋅,(x,y)):=y; this notion is useful only in classification problems.
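For the label conditional ICP the p-value (16) compares the candidate's conformity score only with calibration scores in the same category; a minimal Python sketch (function and variable names are mine):

```python
# Label conditional p-value, equation (16), with the taxonomy K(., (x, y)) := y.

def label_conditional_p(calib_scores, calib_labels, alpha_y, y):
    """p^y computed only against calibration examples whose label equals y."""
    same = [s for s, lab in zip(calib_scores, calib_labels) if lab == y]
    below = sum(s <= alpha_y for s in same)
    return (below + 1) / (len(same) + 1)
```

Because each label's scores are ranked within their own category, the error probability is controlled separately for each label, at the price of a smaller effective calibration set per label.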

The following proposition is the conditional analogue of the first part of Proposition 1; in particular, it shows that in classification problems label conditional ICPs achieve label conditional validity.

Proposition 2

If random examples Z m+1,…,Z l ,Z l+1=(X l+1,Y l+1) are exchangeable, the probability of error Y l+1∉Γ ϵ (Z 1,…,Z l ,X l+1,Ω) given the category K((Z 1,…,Z m ),Z l+1) of Z l+1 does not exceed ϵ for any ϵ and any conditional inductive conformal predictor Γ corresponding to K.

We refrain from giving the definition of smoothed conditional ICPs, which is straightforward. The categories can also be made dependent on \(\omega\in\bar{\varOmega}\).

5 Object conditional validity

In this section we prove a negative result (a version of Lemma 1 in Lei and Wasserman 2013) which says that the requirement of precise object conditional validity cannot be satisfied in a non-trivial way for rich object spaces (such as \(\mathbb{R}\)). If Q is a probability distribution on Z, we let Q X stand for its marginal distribution on X: Q X (A):=Q(A×Y). In this section we consider only set predictors that do not depend on θ, but the case of set predictors depending on θ (such as smoothed ICPs) is also covered by redefining ω:=(ω,θ).

Let us say that a set predictor Γ has 1−ϵ object conditional validity, where ϵ∈(0,1), if, for all probability distributions Q on Z and Q X -almost all xX,
$$ \bigl(Q^{l+1}\times R\bigr) \bigl( Y_{l+1}\in \varGamma(Z_1,\ldots,Z_l,X_{l+1},\varOmega) \mid X_{l+1}=x \bigr) \ge 1-\epsilon. $$
If P is a probability distribution on X, we say that a property F of elements of X holds for P-almost all elements of a measurable set E⊆X if P(E∖F)=0; a P-non-atom is an element x∈X such that P({x})=0. The Lebesgue measure on \(\mathbb{R}\) will be denoted Λ, and the convex hull of \(E\subseteq\mathbb{R}\) will be denoted \(\operatorname{co}E\).

Theorem 2

Suppose X is a separable metric space equipped with the Borel σ-algebra. Let ϵ∈(0,1). Suppose that a set predictor Γ has 1−ϵ object conditional validity. In the case of regression, we have, for all probability distributions Q on Z and for Q X -almost all Q X -non-atoms xX,
$$ \bigl(Q^{l+1}\times R\bigr) \bigl( \varLambda\bigl(\varGamma(Z_1, \ldots,Z_l,x,\varOmega)\bigr) = \infty \bigr) \ge 1-\epsilon $$
(17)
and
$$ \bigl(Q^{l+1}\times R\bigr) \bigl( \operatorname{co}\varGamma(Z_1,\ldots,Z_l,x,\varOmega) = \mathbb{R} \bigr) \ge 1-2\epsilon. $$
(18)
In the case of classification, we have, for all Q, all yY, and Q X -almost all Q X -non-atoms x,
$$ \bigl(Q^{l+1}\times R\bigr) \bigl( y\in\varGamma(Z_1,\ldots,Z_l,x, \varOmega) \bigr) \ge 1-\epsilon. $$
(19)
The constant ϵ in each of (17), (18), and (19) is optimal, in the sense that it cannot be replaced by a smaller constant.

We are mainly interested in the case of a small ϵ (corresponding to high confidence), and in this case (17) implies that, in the case of regression, the prediction interval (i.e., the convex hull of the prediction set) can be expected to be infinitely long unless the test object is an atom. Even an infinitely long prediction interval can be somewhat informative, providing a one-sided bound on the label of the test example; (18) says that, with probability at least 1−2ϵ, the prediction interval is completely uninformative unless the test object is an atom. In the case of classification, (19) says that each particular y∈Y is likely to be included in the prediction set, and so the prediction set is likely to be large. In particular, (19) implies that the expected size of the prediction set is at least (1−ϵ)|Y|.

Of course, the condition that the test object x be a non-atom is essential: if \(Q_{\mathbf{X}}(\{x\})>0\), an inductive conformal predictor that ignores all examples with objects different from the current test object can have 1−ϵ object conditional validity and still produce a small prediction set for a test object x, provided the training set is big enough to contain many examples with x as their object.

Remark 4

Nontrivial set predictors having 1−ϵ object conditional validity are constructed by McCullagh et al. (2009) assuming the Gauss linear model.

Proof of Theorem 2

The proof will be based on the ideas of Lei and Wasserman (2013, the proof of Lemma 1).

We start by showing that the ϵ in (17), (18), and (19) cannot be replaced by a smaller constant. For (17) and (19) this follows from the fact that the trivial set predictor predicting Y with probability 1−ϵ and ∅ with probability ϵ has 1−ϵ object conditional validity. In the case of (18) the bound 1−2ϵ is attained by the set predictor predicting \(\mathbb{R}\) with probability 1−2ϵ, [0,∞) with probability ϵ, and (−∞,0] with probability ϵ (this assumes ϵ<1/2; the case ϵ≥1/2 is trivial). This predictor’s conditional probability of error given all l+1 examples is at most ϵ (0 if \(y_{l+1}=0\) and ϵ otherwise); therefore, the conditional probability of error will be at most ϵ given the test object.

Next we prove the first statement about regression. Suppose (17) does not hold on a measurable set E of \(Q_{\mathbf{X}}\)-non-atoms x∈X such that \(Q_{\mathbf{X}}(E)>0\). Shrink E in such a way that \(Q_{\mathbf{X}}(E)>0\) still holds but there exist δ>0 and C>0 such that, for each x∈E,
$$ \bigl(Q^{l+1}\times R\bigr) \bigl( \varLambda\bigl( \varGamma(Z_1,\ldots,Z_l,x,\varOmega)\bigr) \le C \bigr) \ge \epsilon+\delta. $$
(20)
Let V be the total variation distance between probability measures, \(V(P,Q):=\sup_{A}\vert P(A)-Q(A)\vert\); we then have
$$V\bigl(P^l,Q^l\bigr) \le \sqrt{2}\sqrt{1-\bigl(1-V(P,Q)\bigr)^l} $$
(this follows from the connection of V with the Hellinger distance: see, e.g., Tsybakov 2010, Sect. 2.4). Shrink E further so that \(Q_{\mathbf{X}}(E)>0\) still holds but
$$ \sqrt{2}\sqrt{1-\bigl(1-Q_{\mathbf{X}}(E) \bigr)^l} \le \delta/2. $$
(21)
(This can be done under our assumption that X is a separable metric space: see Lemma 2 below.) Define another probability distribution P on Z by the requirements that P(A×B)=Q(A×B) for all measurable A⊆(X∖E), \(B\subseteq\mathbb{R}\), and that \(P(A\times B)=Q_{\mathbf{X}}(A)U(B)\) for all measurable A⊆E, \(B\subseteq\mathbb{R}\), where U is the uniform probability distribution on the interval [−DC,DC] and D>0 will be chosen below. Since \(V(P,Q)\le Q_{\mathbf{X}}(E)\), we have \(V(P^l,Q^l)\le\delta/2\), which implies \(V(P^l\times R,Q^l\times R)\le\delta/2\); therefore, by (20),
$$ \bigl(P^{l+1}\times R\bigr) \bigl( \varLambda\bigl( \varGamma(Z_1,\ldots,Z_l,x,\varOmega)\bigr) \le C \bigr) \ge \epsilon+\delta/2 $$
for each x∈E. The last inequality implies, by Fubini’s theorem,
$$ \bigl(P^{l+1}\times R\bigr) \bigl( \varLambda\bigl( \varGamma(Z_1,\ldots,Z_l,X_{l+1},\varOmega)\bigr) \le C \;\&\; X_{l+1}\in E \bigr) \ge ( \epsilon+\delta/2 ) P_{\mathbf{X}}(E), $$
where \(P_{\mathbf{X}}(E)=Q_{\mathbf{X}}(E)>0\) is the marginal P-probability of E. When D (depending on δ and \(P_{\mathbf{X}}(E)\)) is sufficiently large, this in turn implies
$$ \bigl(P^{l+1}\times R\bigr) \bigl( Y_{l+1}\notin \varGamma(Z_1,\ldots,Z_l,X_{l+1},\varOmega) \;\& \; X_{l+1}\in E \bigr) \ge ( \epsilon+\delta/4 ) P_{\mathbf{X}}(E). $$
However, the last inequality contradicts
$$ \frac { (P^{l+1}\times R) ( Y_{l+1}\notin\varGamma(Z_1,\ldots,Z_l,X_{l+1},\varOmega) \;\&\; X_{l+1}\in E ) }{ P_{\mathbf{X}}(E)} \le \epsilon, $$
(22)
which follows from Γ having 1−ϵ object conditional validity and the definition of conditional probability.
For the second statement about regression, suppose (18) does not hold on a measurable set E of \(Q_{\mathbf{X}}\)-non-atoms x∈X such that \(Q_{\mathbf{X}}(E)>0\). In other words, for all x∈E,
$$ \bigl(Q^{l+1}\times R\bigr) \bigl( \sup\varGamma(Z_1, \ldots,Z_l,x,\varOmega)<\infty \quad \text{or}\quad \inf\varGamma(Z_1, \ldots,Z_l,x,\varOmega)>-\infty \bigr) > 2\epsilon. $$
For each x∈E we have either
$$ \bigl(Q^{l+1}\times R\bigr) \bigl( \sup \varGamma(Z_1,\ldots,Z_l,x,\varOmega)<\infty \bigr) > \epsilon $$
(23)
or
$$ \bigl(Q^{l+1}\times R\bigr) \bigl( \inf \varGamma(Z_1,\ldots,Z_l,x,\varOmega)>-\infty \bigr) > \epsilon. $$
(24)
Therefore, either (23) or (24) holds on a subset of E of positive \(Q_{\mathbf{X}}\)-probability. Suppose, for concreteness, that (23) does. Shrink E in such a way that \(Q_{\mathbf{X}}(E)>0\) still holds and (23) holds for all x∈E. Shrink E further in such a way that \(Q_{\mathbf{X}}(E)>0\) still holds but there exist δ>0 and C>0 such that, for each x∈E,
$$ \bigl(Q^{l+1}\times R\bigr) \bigl( \sup \varGamma(Z_1,\ldots,Z_l,x,\varOmega)\le C \bigr) \ge \epsilon+\delta. $$
(25)
Shrink E further so that both \(Q_{\mathbf{X}}(E)>0\) and (21) hold. Define a probability distribution P on Z by the requirements that P(A×B)=Q(A×B) for all measurable A⊆(X∖E) and \(B\subseteq\mathbb{R}\) and that \(P(A\times\{C+1\})=Q_{\mathbf{X}}(A)\) for all measurable A⊆E (i.e., modify Q by setting the conditional distribution of Y given X∈E to the unit mass concentrated at C+1). Since \(V(P^l\times R,Q^l\times R)\le\delta/2\), (25) implies
$$\bigl(P^{l+1}\times R\bigr) \bigl( \sup\varGamma(Z_1,\ldots,Z_l,x,\varOmega)\le C \bigr) \ge \epsilon+\delta/2 $$
for all x∈E, which in turn implies
$$\bigl(P^{l+1}\times R\bigr) \bigl( \sup\varGamma(Z_1,\ldots,Z_l,X_{l+1},\varOmega)\le C \;\&\; X_{l+1}\in E \bigr) \ge (\epsilon+\delta/2) P_{\mathbf{X}}(E), $$
which in turn implies
$$\bigl(P^{l+1}\times R\bigr) \bigl( Y_{l+1}\notin\varGamma(Z_1,\ldots,Z_l,X_{l+1},\varOmega) \;\&\; X_{l+1}\in E \bigr) \ge (\epsilon+\delta/2) P_{\mathbf{X}}(E), $$
which contradicts (22).
It remains to prove the statement about classification. Suppose (19) does not hold on a measurable set E of \(Q_{\mathbf{X}}\)-non-atoms x∈X such that \(Q_{\mathbf{X}}(E)>0\). Shrink E in such a way that \(Q_{\mathbf{X}}(E)>0\) still holds but there exists δ>0 such that, for each x∈E,
$$ \bigl(Q^{l+1}\times R\bigr) \bigl( y\in\varGamma(Z_1, \ldots,Z_l,x,\varOmega) \bigr) \le 1-\epsilon-\delta. $$
Without loss of generality we further assume that (21) also holds. Define a probability distribution P on Z by the requirements that P(A×B)=Q(A×B) for all measurable A⊆(X∖E) and all B⊆Y and that \(P(A\times\{y\})=Q_{\mathbf{X}}(A)\) for all measurable A⊆E (i.e., modify Q by setting the conditional distribution of Y given X∈E to the unit mass concentrated at y). Then for each x∈E we have
$$ \bigl(P^{l+1}\times R\bigr) \bigl( y\in\varGamma(Z_1, \ldots,Z_l,x,\varOmega) \bigr) \le 1-\epsilon-\delta/2, $$
which implies
$$ \bigl(P^{l+1}\times R\bigr) \bigl( Y_{l+1}\in \varGamma(Z_1,\ldots,Z_l,X_{l+1},\varOmega) \;\& \; X_{l+1}\in E \bigr) \le ( 1-\epsilon-\delta/2 ) P_{\mathbf{X}}(E). $$
The last inequality contradicts Γ having 1−ϵ object conditional validity. □

In the proof of Theorem 2 we used the following lemma.

Lemma 2

If Q is a probability measure on X, which is assumed to be a separable metric space, E is a set of Q-non-atoms such that Q(E)>0, and δ>0 is an arbitrarily small number, then there is E′⊆E such that 0<Q(E′)<δ.

Proof

We can take as E′ the intersection of E and an open ball centered at any element of X for which all such intersections have a positive Q-probability. Let us prove that such elements exist. Suppose they do not.

Fix a countable dense subset \(A_1\) of X. Let \(A_2\) be the union of all open balls B with rational radii centered at points in \(A_1\) such that Q(B∩E)=0. On one hand, the σ-additivity of measures implies \(Q(A_2\cap E)=0\). On the other hand, \(A_2=X\): indeed, for each x∈X there is an open ball B of some radius ϵ>0 centered at x that satisfies Q(B∩E)=0; since x belongs to the radius ϵ/2 open ball centered at a point of \(A_1\) at a distance of less than ϵ/2 from x, we have \(x\in A_2\). This contradicts Q(E)>0. □

Theorem 2 demonstrates an interesting all-or-nothing phenomenon for set predictors having 1−ϵ object conditional validity: each such predictor produces hopelessly large prediction sets with probability at least 1−ϵ; on the other hand, even the trivial predictor of this kind mentioned in the proof produces the smallest possible prediction sets with probability ϵ.

The theorem does not prevent the existence of efficient set predictors that are object conditionally valid in an asymptotic sense; indeed, the paper by Lei and Wasserman (2013) is devoted to constructing asymptotically efficient and asymptotically object conditionally valid set predictors in the case of regression.

6 Experiments

This section describes some simple experiments on the well-known Spambase data set contributed by George Forman to the UCI Machine Learning Repository (Frank and Asuncion 2010). Its overall size is 4601 examples, and it contains examples of two classes: email (also written as 0) and spam (also written as 1). Hastie et al. (2009) report results of several machine-learning algorithms on this data set split randomly into a training set of size 3065 and a test set of size 1536. The best result is achieved by MART (multiple additive regression trees; 4.5 % error rate according to the second edition of Hastie et al. 2009). The R programs used in the experiments described in this section and the next use the gbm package with virtually all parameters set to their default values (given in the description provided in response to help("gbm")).

All our experiments are for (unsmoothed) ICPs. We randomly permute the data set and divide it into 2602 examples for the proper training set, 999 for the calibration set, and 1000 for the test set. Our split between the proper training, calibration, and test sets, approximately 2:1:1, is inspired by the standard recommendation for the allocation of data into training, validation, and test sets (see, e.g., Hastie et al. 2009, Sect. 7.2). We consider the ICP whose conformity measure is defined by (1) where f is output by MART and
$$ \varDelta\bigl(y,f(x)\bigr) := \begin{cases} f(x) & \text{if}\ y=1\\ -f(x) & \text{if}\ y=0. \end{cases} $$
(26)
MART’s output f(x) models the log-odds of spam vs email,
$$ f(x) = \log\frac {P(1\mid x)}{ P(0\mid x)}, $$
which makes the interpretation of (26) as a conformity score very natural.
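To make the calibration step concrete: the experiments below were run in R with gbm, but the p-value computation itself takes only a few lines in any language. The following Python sketch is purely illustrative (the function names are mine, not from the paper's code); it computes the conformity score (26) and the unsmoothed ICP p-value of (3), i.e., the add-one fraction of calibration conformity scores not exceeding the test conformity score.

```python
def conformity(score, label):
    # Conformity per (26): a high score f(x) conforms with spam (1),
    # a low score with email (0).
    return score if label == 1 else -score

def icp_p_value(cal_scores, cal_labels, test_score, postulated_label):
    # Unsmoothed ICP p-value: the add-one fraction of calibration conformity
    # scores that are at most the conformity score of the postulated label.
    a = conformity(test_score, postulated_label)
    alphas = [conformity(s, y) for s, y in zip(cal_scores, cal_labels)]
    return (sum(alpha <= a for alpha in alphas) + 1) / (len(alphas) + 1)
```

At significance level ϵ the prediction set then keeps every label whose p-value exceeds ϵ.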
The upper left plot in Fig. 2 is the scatter plot of the pairs \((p^{{\rm email}},p^{{\rm spam}})\) produced by the ICP for all examples in the test set. Email is shown as (blue) noughts and spam as (red) crosses (and when the figure is viewed in color, it is noticeable that the noughts were drawn after the crosses). The other two plots in the upper row are for email and spam separately. Ideally, email should be close to the horizontal axis and spam to the vertical axis; we can see that this is often true, with a few exceptions. The picture for the label conditional ICP looks almost identical: see the lower row of Fig. 2. However, on the log scale the difference becomes more noticeable: see Fig. 3.
Fig. 2

Scatter plots of the pairs \((p^{{\rm email}},p^{{\rm spam}})\) for all examples in the test set (left plots), for email only (middle), and for spam only (right). Email is shown as (blue) noughts and spam as (red) crosses. The three upper plots are for the ICP and the three lower ones are for the label conditional ICP

Fig. 3

The analogue of Fig. 2 on the log scale

Table 1 gives some statistics for the numbers of errors, multiple set predictions {0,1}, and empty set predictions ∅ in the case of the (unconditional) ICP \(\varGamma^{5\,\%}\) at significance level 5 % (we obtain different numbers not only because of different splits but also because MART is randomized; the columns of the table correspond to the random number generator seeds 0, 1, 2, etc.). The table demonstrates the validity, (lack of) conditional validity, and efficiency of the algorithm (the latter is of course inherited from the efficiency of MART). We give two kinds of conditional figures: the percentages of errors, multiple predictions, and empty predictions for the two labels and for two kinds of objects. The two kinds of objects are obtained by splitting the object space X by the value of an attribute that we denote $: it shows the percentage of the character $ in the text of the message. The condition $<5.55 % was the root of the decision tree chosen both by Hastie et al. (2009, Sect. 9.2.5), who use all attributes in their analysis, and by Maindonald and Braun (2007, Chap. 11), who use 6 attributes chosen by them manually. (Both books use the rpart R package for decision trees.)
Table 1

Percentages of errors, multiple predictions, and empty predictions at significance level 5 % on the full test set and separately on email and spam and on two kinds of objects. The results are given for the first 100 seeds for the R (pseudo)random number generator (RNG); column “Average” gives the average percentages for all 100 seeds 0–99, and column “St. dev.” gives usual estimates of the standard deviations (namely, the square roots of the standard unbiased estimates of the variances) of the percentages for the 100 seeds

RNG seed            0        1        2        …   99       Average   St. dev.
errors overall      4.1 %    6.9 %    4.6 %        4.2 %    5.08 %    1.00 %
 for email          2.44 %   4.61 %   2.26 %       2.82 %   3.35 %    0.92 %
 for spam           6.77 %   10.43 %  8.42 %       6.30 %   7.74 %    1.64 %
 for $<5.55 %       4.36 %   7.91 %   5.15 %       4.34 %   5.76 %    1.24 %
 for $>5.55 %       3.29 %   4.12 %   2.69 %       3.75 %   2.96 %    1.02 %
multiple overall    2.7 %    0 %      0.1 %        1.2 %    0.86 %    0.98 %
 for email          2.11 %   0 %      0.16 %       0.83 %   0.60 %    0.68 %
 for spam           3.65 %   0 %      0 %          1.76 %   1.26 %    1.52 %
 for $<5.55 %       3.04 %   0 %      0.13 %       1.18 %   0.98 %    1.15 %
 for $>5.55 %       1.65 %   0 %      0 %          1.25 %   0.49 %    0.68 %
empty overall       0 %      2.7 %    0 %          0 %      0.31 %    0.63 %
 for email          0 %      1.48 %   0 %          0 %      0.24 %    0.47 %
 for spam           0 %      4.58 %   0 %          0 %      0.42 %    0.96 %
 for $<5.55 %       0 %      3.14 %   0 %          0 %      0.36 %    0.73 %
 for $>5.55 %       0 %      1.50 %   0 %          0 %      0.14 %    0.40 %

Notice that the numbers of errors, multiple predictions, and empty predictions tend to be greater for spam than for email. Somewhat counter-intuitively, they also tend to be greater for “email-like” objects containing few $ characters than for “spam-like” objects. The percentages of multiple and empty predictions are relatively small since the error rate of the underlying predictor happens to be close to our significance level of 5 %.

In practice, using a fixed significance level (such as the standard 5 %) is not a good idea; we should at least pay attention to what happens at several significance levels. However, experimenting with prediction sets at a fixed significance level facilitates a comparison with theoretical results.

Table 2 gives similar statistics in the case of the label conditional ICP. The error rates are now about equal for email and spam, as expected. We refrain from giving similar predictable results for “object conditional” ICP with $<5.55 % and $>5.55 % as categories.
Table 2

The analogue of a subset of Table 1 in the case of the label conditional ICP

RNG seed            0        1        2        …   99       Average   St. dev.
errors overall      3.4 %    6.0 %    3.8 %        3.6 %    4.92 %    0.91 %
 for email          3.73 %   6.92 %   3.87 %       3.48 %   4.97 %    1.15 %
 for spam           2.86 %   4.58 %   3.68 %       3.78 %   4.82 %    1.33 %
multiple overall    4.2 %    0 %      4.0 %        2.6 %    1.68 %    1.54 %
 for email          3.90 %   0 %      5.48 %       2.49 %   1.94 %    1.86 %
 for spam           4.69 %   0 %      1.58 %       2.77 %   1.28 %    1.26 %
empty overall       0 %      1.0 %    0 %          0 %      0.15 %    0.45 %
 for email          0 %      1.48 %   0 %          0 %      0.15 %    0.47 %
 for spam           0 %      0.25 %   0 %          0 %      0.15 %    0.47 %

We define the calibration plot of an ICP Γ on a test set as the percentage of errors made by \(\varGamma^{\epsilon}\) plotted against ϵ∈(0,1). Figure 4 gives three calibration plots for the ICP: for the full test set and for email and spam separately. It shows approximate validity even for email and spam separately, except for the all-important lower left corners. The latter are shown separately in Fig. 5, where the lack of conditional validity becomes evident; cf. Fig. 6 for the label conditional ICP.
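In code, the calibration plot reduces to one observation: \(\varGamma^{\epsilon}\) errs on a test example exactly when the p-value of the example's true label is at most ϵ. A Python sketch (the paper's plots were produced in R; the function name is mine):

```python
def calibration_plot(true_label_p_values, eps_grid):
    # Empirical error rate of the set predictor at each significance level:
    # an error occurs at level eps iff the true label's p-value is <= eps.
    n = len(true_label_p_values)
    return [sum(p <= eps for p in true_label_p_values) / n for eps in eps_grid]
```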
Fig. 4

The calibration plot for the test set overall, the email in the test set, and the spam in the test set (for the first 8 seeds, 0–7)

Fig. 5

The lower left corners of the plots in Fig. 4

Fig. 6

The analogue of Fig. 5 for the label conditional ICP

From the numbers in the “errors overall” row of Table 1 (both given and hidden in the … part) we can extract the corresponding confidence intervals for the probability of error conditional on the training set and MART’s internal coin tosses; these are shown in Fig. 7. It can be seen that training conditional validity is not grossly violated. (Notice that the 100 training sets used for producing this figure are not completely independent. Besides, the assumption of randomness might not be completely satisfied: permuting the data set ensures exchangeability but not necessarily randomness.) It is instructive to compare Fig. 7 with the “theoretical” Fig. 8 obtained from Theorem 1(a) (the thick black line), Corollary 1(a) (the thin solid line, which may be shown in red), and Corollary 2(a) (the thin dashed line, which may be shown in blue). The dotted black line corresponds to the significance level 5 %. There is no obvious discrepancy between Figs. 7 and 8.
Fig. 7

Confidence intervals for training conditional error probabilities: 95 % shown as thin lines (in black) and 80 % shown as thick lines (perhaps in blue). The 5 % significance level is shown as the horizontal dotted black line

Fig. 8

The upper bounds on the training conditional probability of error vs δ given by Theorem 1(a) (the thick black line), Corollary 1(a) (the thin solid line, perhaps shown in red), and Corollary 2(a) (the thin dashed line, perhaps shown in blue), where ϵ=5 % and n=999

Figure 8 gives bounds on the training conditional error probability as a function of δ for a fixed size n=999 of the calibration set. Figure 9, on the other hand, gives bounds on the training conditional error probability as a function of the size n of the calibration set for a fixed δ, namely for δ=1 %.
Fig. 9

The upper bounds on the training conditional probability of error vs n in the same format as in Fig. 8, except that now δ is fixed at 1 % and n ranges between 19 (the smallest value giving non-trivial prediction sets) and 1500; as before, ϵ=5 %

Figure 10 is the analogue of Fig. 8 for significance level ϵ=1 %. Notice that the thin solid line (corresponding to Corollary 1(a) and perhaps shown in red) simply shifts down by 4 %. However, the quality of the thick black line (corresponding to Theorem 1(a)) and the thin dashed line (corresponding to Corollary 2(a) and perhaps shown in blue) becomes significantly better than that.
Fig. 10

The analogue of Fig. 8 for ϵ=1 %

7 ICPs and ROC curves

This section discusses a close connection between an important class of ICPs (“scoring-type” label conditional ICPs) and ROC curves. (For a previous study of connections between conformal prediction and ROC curves, see Vanderlooy and Sprinkhuizen-Kuyper 2007.) Let us say that an ICP or a label conditional ICP is scoring-type if its inductive conformity measure is defined by (1) where f takes values in \(\mathbb{R}\) and Δ is defined by (26).

The reader might have noticed that the two leftmost plots in Fig. 2 look similar to a ROC curve. The following proposition will show that this is not coincidental in the case of the lower left one. However, before we state it, we need a few definitions. We will now consider a general binary classification problem and will denote the labels as 0 and 1. For a threshold \(c\in\mathbb{R}\), the type I error on the calibration set is
$$ \alpha(c) := \frac {\vert \{i=m+1,\ldots,l\mid f(x_i)\ge c\;\&\;y_i=0\}\vert }{ \vert \{i=m+1,\ldots,l\mid y_i=0\}\vert } $$
(27)
and the type II error on the calibration set is
$$ \beta(c) := \frac {\vert \{i=m+1,\ldots,l\mid f(x_i)\le c\;\&\;y_i=1\}\vert }{ \vert \{i=m+1,\ldots,l\mid y_i=1\}\vert } $$
(28)
(with 0/0 set, e.g., to 1/2). Intuitively, these are the error rates for the classifier that predicts 1 when f(x)>c and predicts 0 when f(x)<c (our definition is conservative in that it counts the prediction as an error whenever f(x)=c); namely, α(c) is the false positive rate and β(c) is the false negative rate. The empirical ROC curve is the parametric curve
$$ \bigl\{\bigl(\alpha(c),\beta(c)\bigr)\mid c\in\mathbb{R}\bigr\} \subseteq [0,1]^2. $$
(29)
(Our version of ROC curves is the original version reflected in the line y=1/2; in deviating from the original version we follow Hastie et al. 2009, whose version is the original one reflected in the line x=1/2, and many other books and papers; see, e.g., Bengio et al. 2005, Fig. 1.) Since α(c) and β(c) take only finitely many values, the empirical ROC curve (along with its modifications introduced below) is not continuous but consists of discrete points.
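Definitions (27) and (28) translate directly into code. A Python sketch, assuming `cal_scores` holds the values \(f(x_i)\) on the calibration set (the function name is mine):

```python
def empirical_roc_point(cal_scores, cal_labels, c):
    # Type I error alpha(c) per (27) and type II error beta(c) per (28),
    # with 0/0 set to 1/2 as in the text.
    neg = [s for s, y in zip(cal_scores, cal_labels) if y == 0]
    pos = [s for s, y in zip(cal_scores, cal_labels) if y == 1]
    alpha = sum(s >= c for s in neg) / len(neg) if neg else 0.5
    beta = sum(s <= c for s in pos) / len(pos) if pos else 0.5
    return alpha, beta
```

Sweeping c over the observed scores traces out the finite set of points forming the empirical ROC curve (29).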

Proposition 3

In the case of a scoring-type label conditional ICP, for any object x∈X, the distance between the pair \((p^0,p^1)\) (see (16)) and the empirical ROC curve is at most
$$ \sqrt { \frac{1}{(n^0+1)^2} + \frac{1}{(n^1+1)^2} }, $$
(30)
where \(n^y\) is the number of examples in the calibration set labeled as y.

Proof

Let c:=f(x). Then we have
$$ \bigl(p^0,p^1\bigr) = \biggl( \frac{n^0_{\ge}+1}{n^0+1}, \frac{n^1_{\le}+1}{n^1+1} \biggr) $$
(31)
where \(n^{0}_{\ge}\) is the number of examples \((x_i,y_i)\) in the calibration set such that \(y_i=0\) and \(f(x_i)\ge c\), and \(n^{1}_{\le}\) is the number of examples in the calibration set such that \(y_i=1\) and \(f(x_i)\le c\). It remains to notice that the point \(( n^{0}_{\ge}/n^{0}, n^{1}_{\le}/n^{1} ) \) belongs to the empirical ROC curve: the horizontal (resp. vertical) distance between this point and (31) does not exceed \(1/(n^0+1)\) (resp. \(1/(n^1+1)\)), and the overall Euclidean distance does not exceed (30). □
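The pair (31) is equally easy to compute directly; the following Python sketch (the function name is mine) makes the correspondence with \(n^{0}_{\ge}\) and \(n^{1}_{\le}\) explicit:

```python
def label_conditional_p_values(cal_scores, cal_labels, test_score):
    # The pair (p0, p1) of (31): p0 counts only calibration emails (label 0)
    # with score >= c, p1 only calibration spam (label 1) with score <= c.
    c = test_score
    n0 = sum(y == 0 for y in cal_labels)
    n1 = sum(y == 1 for y in cal_labels)
    n0_ge = sum(y == 0 and s >= c for s, y in zip(cal_scores, cal_labels))
    n1_le = sum(y == 1 and s <= c for s, y in zip(cal_scores, cal_labels))
    return (n0_ge + 1) / (n0 + 1), (n1_le + 1) / (n1 + 1)
```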
So far we have discussed the empirical ROC curve: (27) and (28) are the empirical probabilities of errors of the two types on the calibration set. It corresponds to the estimate k/n of the parameter of the binomial distribution based on observing k successes out of n trials. The minimax estimate is (k+1/2)/(n+1), and the corresponding ROC curve (29), where α(c) and β(c) are defined by (27) and (28) with the numerators increased by \(\frac{1}{2}\) and the denominators increased by 1, will be called the minimax ROC curve. Notice that for the minimax ROC curve we can put a coefficient of \(\frac{1}{2}\) in front of (30). Similarly, when using the Laplace estimate (k+1)/(n+2), we obtain the Laplace ROC curve. See the left panel of Fig. 11 for the lower left corner of the lower left plot of Fig. 2 with different ROC curves added to it.
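For reference, here are the four binomial-parameter estimates behind the four ROC-curve variants, side by side (a Python sketch; the dictionary keys are mine):

```python
def binomial_estimates(k, n):
    # Estimates of a binomial parameter after k successes in n trials,
    # one per ROC-curve variant discussed in the text.
    return {
        "empirical": k / n,               # empirical ROC curve
        "minimax": (k + 0.5) / (n + 1),   # minimax ROC curve
        "laplace": (k + 1) / (n + 2),     # Laplace ROC curve
        "upper_venn": (k + 1) / (n + 1),  # upper Venn ROC curve
    }
```

For large n and small k/n the Laplace and upper Venn estimates nearly coincide, which is why the corresponding curves are very close in Fig. 11.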
Fig. 11

Left panel: the lower left corner of the lower left plot of Fig. 2 with the empirical (solid), minimax (dashed), and Laplace (dotted) ROC curves. Right panel: the lower left corner of the lower left plot of Fig. 2 with the upper Venn ROC curve and the partition of the plane corresponding to the label conditional ICP with significance level 5 %

The non-standard estimate (k+1)/(n+1) of the parameter of the binomial distribution leads to a version of ROC curve that is connected to the label conditional ICP in the most direct way. Let us call this estimate the upper Venn estimate and the corresponding ROC curve the upper Venn ROC curve (cf. the discussion of the Venn predictor in Vovk et al. 2005, pp. 159–160). (The upper Venn estimate is unusual in that the estimate of the probability of an event plus the estimate of the probability of its complement is different from 1.) Notice that the upper Venn ROC curve lies Northeast of all three ROC curves discussed earlier. In the square [0,0.5]×[0,0.5] the order of the ROC curves from Southwest to Northeast is: empirical, minimax, Laplace, and upper Venn; the last two are very close to each other for large n 0 and n 1 and small ratios \(n^{0}_{\ge}/n^{0}\) and \(n^{1}_{\le}/n^{1}\), as in Fig. 11.

The rest of this section is devoted to a discussion of the upper Venn ROC curve. Remember that it is defined as the parametric curve (29), where now
$$ \alpha(c) := \frac {\vert \{i=m+1,\ldots,l\mid f(x_i)\ge c\;\&\;y_i=0\}\vert + 1 }{ \vert \{i=m+1,\ldots,l\mid y_i=0\}\vert + 1 } \quad\text{and}\quad \beta(c) := \frac {\vert \{i=m+1,\ldots,l\mid f(x_i)\le c\;\&\;y_i=1\}\vert + 1 }{ \vert \{i=m+1,\ldots,l\mid y_i=1\}\vert + 1 }. $$
The pair \((p^0,p^1)\) of p-values for any test example belongs to the upper Venn ROC curve; therefore, this curve passes through all test examples in Fig. 11. The curve can serve as a convenient classification of all possible test objects: each of them corresponds to a point on the curve.

The label conditional ICP can also be conveniently described in terms of the upper Venn ROC curve. An example is given in the right panel of Fig. 11. Each test object is represented by a point \((p^0,p^1)\). Let ϵ be the significance level; it is 5 % in Fig. 11 (but, as mentioned earlier, there is no need to use the same significance level for email and spam). If the point (ϵ,ϵ) lies Southwest of the curve, the label conditional ICP can produce multiple predictions but never produces empty predictions. If it lies Northeast of the curve, the predictor can produce empty predictions but never produces multiple predictions. In particular, it is impossible to produce both multiple and empty predictions for the same calibration set, which is demonstrated by columns 0–99 of Table 2. (Lying on the curve is regarded as a special case of lying Northeast of it. Because of the discreteness of the upper Venn ROC curve it is also possible that (ϵ,ϵ) lies neither Northeast nor Southwest of it; in this case predictions are always singletons.)

If the test object is in the Northeast region NE with respect to (ϵ,ϵ) (i.e., \(p^0>\epsilon\) and \(p^1>\epsilon\)), the prediction set is multiple, {0,1}. If it is in the region SW (i.e., \(p^0\le\epsilon\) and \(p^1\le\epsilon\)), the prediction set is empty. Otherwise the prediction set is a singleton: {1} if it is in NW (\(p^0\le\epsilon\) and \(p^1>\epsilon\)) and {0} if it is in SE (\(p^0>\epsilon\) and \(p^1\le\epsilon\)). This is shown in the right panel of Fig. 11.
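The four regions translate into a one-line rule: keep each label whose label conditional p-value exceeds ϵ. A Python sketch (the function name is mine):

```python
def two_sided_prediction(p0, p1, eps):
    # NE -> {0, 1} (multiple), SW -> empty set, NW -> {1}, SE -> {0};
    # lying on a boundary counts as p <= eps, matching the text.
    return {label for label, p in ((0, p0), (1, p1)) if p > eps}
```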

However, a one-sided approach may be more appropriate in the case of the Spambase data set. There is a clear asymmetry between the two kinds of error in spam detection: classifying email as spam is much more harmful than letting occasional spam in. A reasonable approach is to start from a small number ϵ>0, the maximum tolerable percentage of email classified as spam, and then to try to minimize the percentage of spam classified as email under this constraint. For example, we can use the “one-sided label conditional ICP” classifying x as spam if and only if1 \(p^0\le\epsilon\) for x; otherwise, x is classified as email. In the case of ϵ=5 %, this means classifying a test object as spam if and only if it lands to the left of (or onto) the vertical dotted line in the right panel of Fig. 11.
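The one-sided rule is even simpler; a Python sketch (the function name is mine), with the guarantee of Proposition 2 bounding the probability of flagging genuine email as spam by ϵ:

```python
def one_sided_classify(p0, eps):
    # Flag as spam only when the email p-value p0 is at most eps.
    return "spam" if p0 <= eps else "email"
```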

Both our procedures, two-sided and one-sided, look very similar to the standard uses of ROC curves. However, the standard justification of these uses presupposes that we know the true ROC curve. In practice, we only have access to an estimate of the true ROC curve, and the error of estimation is usually very significant. The upper Venn ROC curve is defined in terms of the data rather than the unknown true distribution. Despite this, we still have guarantees of validity. For example, our one-sided procedure guarantees that the (unconditional) probability of mistaking email for spam is at most ϵ (see Proposition 2).

This section of the paper raises a large number of questions. Not all inductive conformity measures are scoring-type; can other types be analyzed using the notion of ROC curves? Can other kinds of conditional ICPs be analyzed this way? What about smoothed ICPs? And even in the case of scoring-type label conditional ICPs we have not proved their property of training conditional validity (i.e., the version of Theorem 1 for label conditional ICPs).

8 Conclusion

The goal of this paper has been to explore various versions of the requirement of conditional validity. With a small training set, we have to content ourselves with unconditional validity (or abandon any formal requirement of validity altogether). For bigger training sets, training conditional validity will be approached by ICPs automatically, and we can approach example conditional validity by using conditional ICPs while making sure that the size of a typical category does not become too small (say, less than 100). In problems of binary classification, we can control false positive and false negative rates by using label conditional ICPs.

The known property of validity of inductive conformal predictors (Proposition 1) can be stated in the traditional statistical language (see, e.g., Fraser 1957 and Guttman 1970) by saying that they are 1−ϵ expectation tolerance regions, where ϵ is the significance level. In classical statistics, however, there are two kinds of tolerance regions: 1−ϵ expectation tolerance regions and PAC-type 1−δ tolerance regions for a proportion 1−ϵ, in the terminology of Fraser (1957). We have seen (Theorem 1) that inductive conformal predictors are tolerance regions in the second sense as well (cf. Appendix A).

A disadvantage of inductive conformal predictors is their potential predictive inefficiency: indeed, the calibration set is wasted as far as the development of the prediction rule f in (1) is concerned, and the proper training set is wasted as far as the calibration (3) of conformity scores into p-values is concerned. Conformal predictors use the full training set for both purposes, and so can be expected to be significantly more efficient. (There have been reports of comparable and even better predictive efficiency of ICPs as compared to conformal predictors but they may be unusual artefacts of the methods used and particular data sets.) It is an open question whether we can guarantee training conditional validity under (11) or a similar condition for conformal predictors different from classical tolerance regions. Perhaps no universal results of this kind exist, and different families of conformal predictors will require different methods. See Appendix B for an empirical study of a simple conformal predictor.

Footnotes

  1. In practice, we might want to improve the predictor by adding another step and changing the classification from spam to email if \(p^1\) is also small, in which case x looks neither like spam nor like email. This step can usually be disregarded for scoring-type ICPs unless ϵ is very lax.

Notes

Acknowledgements

I am grateful to Bob Williamson for a useful discussion. Many thanks to the reviewers of this paper and Vovk (2012) for their suggestions, which led, in particular, to Appendix B and Figs. 9 and 10. The empirical studies described in this paper used the R system, the gbm package for R written by Greg Ridgeway (based on the work of Freund and Schapire 1997 and Friedman 2001, 2002), MATLAB, and the C program for computing tangent distance written by Daniel Keysers and adapted to MATLAB by Aditi Krishn.

References

  1. Balasubramanian, V. N., Ho, S. S., & Vovk, V. (Eds.) (2013). Conformal prediction for reliable machine learning: theory, adaptations, and applications. Waltham: Elsevier (to appear).
  2. Bengio, S., Mariéthoz, J., & Keller, M. (2005). The expected performance curve. In Proceedings of the ICML 2005 workshop on ROC analysis in machine learning. URL http://users.dsic.upv.es/~flip/ROCML2005/.
  3. Frank, A., & Asuncion, A. (2010). UCI machine learning repository. URL http://archive.ics.uci.edu/ml.
  4. Fraser, D. A. S. (1957). Nonparametric methods in statistics. New York: Wiley.
  5. Fraser, D. A. S., & Wormleighton, R. (1951). Nonparametric estimation IV. The Annals of Mathematical Statistics, 22, 294–298.
  6. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.
  7. Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29, 1189–1232.
  8. Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38, 367–378.
  9. Guttman, I. (1970). Statistical tolerance regions: classical and Bayesian. London: Griffin.
  10. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction (2nd ed.). New York: Springer.
  11. Langford, J. (2005). Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6, 273–306.
  12. Lei, J., & Wasserman, L. (2013). Distribution free prediction bands for nonparametric regression. Journal of the Royal Statistical Society B (to appear). Preliminary version published as Technical Report, arXiv:1203.5422 [stat.ME].
  13. Lei, J., Robins, J., & Wasserman, L. (2013). Distribution free prediction sets. Journal of the American Statistical Association, 108, 278–287. Preliminary version published as Technical Report, arXiv:1111.1418 [math.ST].
  14. Maindonald, J., & Braun, J. (2007). Data analysis and graphics using R: an example-based approach (2nd ed.). Cambridge: Cambridge University Press.
  15. McCullagh, P., Vovk, V., Nouretdinov, I., Devetyarov, D., & Gammerman, A. (2009). Conditional prediction intervals for linear regression. In Proceedings of the eighth international conference on machine learning and applications, December 13–15, Miami, FL (pp. 131–138). Available from http://www.stat.uchicago.edu/~pmcc/reports/predict.pdf.
  16. National Institute of Standards and Technology (2012). Digital library of mathematical functions. URL http://dlmf.nist.gov/.
  17. Nouretdinov, I. R. (2008). Offline Nearest Neighbour transductive Confidence Machine. In Poster and workshop proceedings of the eighth industrial conference on data mining (pp. 16–24).
  18. Papadopoulos, H., Proedrou, K., Vovk, V., & Gammerman, A. (2002a). Inductive confidence machines for regression. In T. Elomaa, H. Mannila, & H. Toivonen (Eds.), Lecture notes in computer science: Vol. 2430. Proceedings of the thirteenth European conference on machine learning, August 19–23, 2002, Helsinki (pp. 345–356). Berlin: Springer.
  19. Papadopoulos, H., Vovk, V., & Gammerman, A. (2002b). Qualified predictions for large data sets in the case of pattern recognition. In Proceedings of the first international conference on machine learning and applications, June 24–27, 2002, Las Vegas, NV (pp. 159–163). Las Vegas: CSREA Press.
  20. Papadopoulos, H., Gammerman, A., & Vovk, V. (Eds.) (2013). Special issue of the Annals of Mathematics and Artificial Intelligence on conformal prediction and its applications. Springer (to appear).
  21. Saunders, C., Gammerman, A., & Vovk, V. (1999). Transduction with confidence and credibility. In T. Dean (Ed.), Proceedings of the sixteenth international joint conference on artificial intelligence, July 31–August 6, 1999, Stockholm (Vol. 2, pp. 722–726). San Francisco: Morgan Kaufmann.
  22. Scheffé, H., & Tukey, J. W. (1945). Nonparametric estimation I: Validation of order statistics. The Annals of Mathematical Statistics, 16, 187–192.
  23. Tsybakov, A. B. (2010). Introduction to nonparametric estimation. New York: Springer.
  24. Tukey, J. W. (1947). Nonparametric estimation II: Statistically equivalent blocks and tolerance regions – the continuous case. The Annals of Mathematical Statistics, 18, 529–539.
  25. Tukey, J. W. (1948). Nonparametric estimation III: Statistically equivalent blocks and tolerance regions – the discontinuous case. The Annals of Mathematical Statistics, 19, 30–39.
  26. Vanderlooy, S., & Sprinkhuizen-Kuyper, I. G. (2007). A comparison of two approaches to classify with guaranteed performance. In J. N. Kok, J. Koronacki, R. L. de Mántaras, S. Matwin, D. Mladenic, & A. Skowron (Eds.), Lecture notes in computer science: Vol. 4702. Proceedings of the eleventh European conference on principles and practice of knowledge discovery in databases, September 17–21, 2007, Warsaw (pp. 288–299). Berlin: Springer.
  27. Vanderlooy, S., van der Maaten, L., & Sprinkhuizen-Kuyper, I. (2007). Off-line learning with Transductive Confidence Machines: an empirical evaluation. In P. Perner (Ed.), Lecture notes in artificial intelligence: Vol. 4571. Proceedings of the fifth international conference on machine learning and data mining in pattern recognition, July 18–20, 2007, Leipzig, Germany (pp. 310–323). Berlin: Springer.
  28. Vovk, V. (2002). On-line Confidence Machines are well-calibrated. In Proceedings of the forty-third annual symposium on foundations of computer science, November 16–19, 2002, Vancouver (pp. 187–196). Los Alamitos: IEEE Computer Society.
  29. Vovk, V. (2012). Conditional validity of inductive conformal predictors. In S. C. H. Hoi & W. Buntine (Eds.), Asian conference on machine learning: Vol. 25. JMLR: Workshop and conference proceedings (pp. 475–490).
  30. Vovk, V., Gammerman, A., & Saunders, C. (1999). Machine-learning applications of algorithmic randomness. In Proceedings of the sixteenth international conference on machine learning, June 27–30, 1999, Bled, Slovenia (pp. 444–453). San Francisco: Morgan Kaufmann.
  31. Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world. New York: Springer.
  32. Wilks, S. S. (1941). Determination of sample sizes for setting tolerance limits. The Annals of Mathematical Statistics, 12, 91–96.

Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. Computer Learning Research Centre, Department of Computer Science, Royal Holloway, University of London, Egham, UK