Conditional validity of inductive conformal predictors
Abstract
Conformal predictors are set predictors that are automatically valid in the sense of having coverage probability equal to or exceeding a given confidence level. Inductive conformal predictors are a computationally efficient version of conformal predictors satisfying the same property of validity. However, inductive conformal predictors have only been known to control unconditional coverage probability. This paper explores various versions of conditional validity and various ways to achieve them using inductive conformal predictors and their modifications. In particular, it discusses a convenient expression of one of the modifications in terms of ROC curves.
Keywords
Inductive conformal predictors · Conditional validity · Batch mode of learning · ROC curves · Boosting · MART · Spam detection
1 Introduction
This paper continues the study of the method of conformal prediction, introduced in Vovk et al. (1999) and Saunders et al. (1999) and further developed in Vovk et al. (2005). An advantage of the method is that its predictions (which are set rather than point predictions) automatically satisfy a finite-sample property of validity. Its disadvantage is its relative computational inefficiency in many situations. A modification of conformal predictors, called inductive conformal predictors, was proposed in Papadopoulos et al. (2002a, 2002b) with the purpose of improving the computational efficiency of conformal predictors. For further information on conformal predictors and inductive conformal predictors see, e.g., Balasubramanian et al. (2013) and Papadopoulos et al. (2013).
Most of the literature on conformal prediction studies the behavior of set predictors in the online mode of prediction, perhaps because the property of validity can be stated in an especially strong form in the online mode (as first shown in Vovk 2002). The online mode, however, is much less popular in applications of machine learning than the batch mode of prediction. This paper follows the recent papers by Lei et al. (2013) and Lei and Wasserman (2013) studying properties of conformal prediction in the batch mode; we, however, concentrate on inductive conformal prediction. The performance of inductive conformal predictors in the batch mode is illustrated using the well-known Spambase data set; for earlier empirical studies of conformal prediction in the batch mode see, e.g., Vanderlooy et al. (2007).
We will usually be making the assumption of randomness, which is standard in machine learning and nonparametric statistics: the available data is a sequence of examples generated independently from the same probability distribution Q. (In some cases we will make the weaker assumption of exchangeability; for some of our results even weaker assumptions, such as conditional randomness or exchangeability, would have been sufficient.) Each example consists of two components: an object and a label. We are given a training set of examples and a new object, and our goal is to predict the label of the new object. (If we have a whole test set of new objects, we can apply the procedure for predicting one new label to each of the objects in the test set.)
Inductive conformal predictors (slightly generalized as compared with the standard version) will be defined in Sect. 2. They are automatically valid, in the sense of unconditional validity. It should be said that, in general, the unconditional error probability is easier to deal with than conditional error probabilities; e.g., the standard statistical methods of cross-validation and the bootstrap provide decent estimates of the unconditional error probability but poor estimates of the training conditional error probability: see Hastie et al. (2009), Sect. 7.12.
In Sect. 3 we explore the training conditional validity of inductive conformal predictors. Our simple results (Theorem 1 and Corollaries 1 and 2) are of the PAC type, involving two parameters: the target training conditional coverage probability 1−ϵ and the probability 1−δ with which 1−ϵ is attained. They show that inductive conformal predictors achieve training conditional validity automatically (whereas for other notions of conditional validity the method has to be modified). We give a self-contained proof of Theorem 1, but Appendix A explains how a significant part of it can be deduced from classical results about tolerance regions.
In the following section, Sect. 4, we introduce a conditional version of inductive conformal predictors and explain, in particular, how it achieves label conditional validity. Label conditional validity is important as it allows the learner to control the set-prediction analogues of the false positive and false negative rates. Section 5 is about object conditional validity, and its main result (a version of a lemma in Lei and Wasserman 2013) is negative: precise object conditional validity cannot be achieved in a useful way unless the test object has a positive probability. Since precise object conditional validity is usually not achievable, we should instead aim for approximate and asymptotic object conditional validity when given enough data (cf. Lei and Wasserman 2013).
Section 6 reports on the results of empirical studies for the standard Spambase data set (see, e.g., Hastie et al. 2009, Chap. 1, Example 1, and Sect. 9.1.2). Section 7 discusses close connections between an important class of label conditional ICPs and ROC curves. Section 8 concludes the main part of the paper, and two appendixes are devoted to related approaches to set prediction. Appendix A discusses connections with the classical theory of tolerance regions (in particular, it explains how part of Theorem 1 can be deduced from classical results about the training conditional validity of tolerance regions). Appendix B discusses training conditional validity of conformal predictors.
2 Inductive conformal predictors
The example space will be denoted Z; it is the Cartesian product X×Y of two measurable spaces, the object space X and the label space Y. In other words, each example z∈Z consists of two components: z=(x,y), where x∈X is its object and y∈Y is its label. Two important special cases are the problem of classification, where Y is a finite set (equipped with the discrete σ-algebra), and the problem of regression, where Y is the real line \(\mathbb{R}\).
Various predictors defined and discussed in this paper are randomized: they depend, in addition to the data, on an element \(\omega\in\bar{\varOmega}\) of a measurable space \(\bar{\varOmega}\) equipped with a probability distribution R (the “coin-tossing” distribution). This is important for covering predictors based on the MART procedure, which is randomized and is used in our computational experiments in Sect. 6.
Remark 1
The idea behind the term “calibration set” is that this set allows us to calibrate the conformity scores of test examples by translating them into a probabilitytype scale.
We consider a canonical probability space Δ whose elements are all possible sequences z _{ i }=(x _{ i },y _{ i }), i=1,…,l+1, of l+1 examples and which is equipped with a probability distribution P. Random variables Z _{ i }=(X _{ i },Y _{ i }), i=1,…,l+1, are the projections of this probability space onto its ith coordinate: Z _{ i }(z _{1},…,z _{ l+1}):=z _{ i }, X _{ i }(z _{1},…,z _{ l+1}):=x _{ i }, and Y _{ i }(z _{1},…,z _{ l+1}):=y _{ i }. We often let x _{ i }, y _{ i }, and z _{ i } stand for realizations of the random variables X _{ i }, Y _{ i }, and Z _{ i }, respectively. Our overall probability space is \(\varDelta\times\bar{\varOmega}\times[0,1]\), and it is equipped with the product measure P×R×U, where R is the coin-tossing distribution mentioned above and U is the uniform probability distribution on [0,1] (we will need U in the definition of “smoothed” ICP below). The generic element of \(\varDelta\times\bar{\varOmega}\times[0,1]\) will usually be denoted (z _{1},…,z _{ l+1},ω,θ), and the projections onto the last two components will be denoted Ω(z _{1},…,z _{ l+1},ω,θ):=ω and Θ(z _{1},…,z _{ l+1},ω,θ):=θ; Z _{ i } will also be regarded as random variables on the overall probability space that ignore the last two coordinates. In cases where θ is irrelevant we will also consider the probability space \(\varDelta\times\bar{\varOmega}\) equipped with the probability distribution P×R. It will always be clear from the context which of the three probability spaces we are talking about.
Remark 2
The smoothed inductive conformal predictors defined in this section are more general than the corresponding smoothed predictors considered in Vovk et al. (2005): the former involve not only the tiebreaking random variable Θ but also randomized conformity measures. However, this generalization is straightforward: we get it essentially for free.
Proposition 1
(Vovk et al. 2005, Proposition 4.1)
Let random examples Z _{ m+1},…,Z _{ l },Z _{ l+1}=(X _{ l+1},Y _{ l+1}) be exchangeable (i.e., their distribution P is invariant under permutations). The probability of error Y _{ l+1}∉Γ ^{ ϵ }(Z _{1},…,Z _{ l },X _{ l+1},Ω) does not exceed ϵ for any ϵ and any inductive conformal predictor Γ. The probability of error Y _{ l+1}∉Γ ^{ ϵ }(Z _{1},…,Z _{ l },X _{ l+1},Ω,Θ) is equal to ϵ for any ϵ and any smoothed inductive conformal predictor Γ.
This simple proposition of validity is proved in Vovk et al. (2005) for inductive conformal predictors based on deterministic inductive conformity measures, but integration over \(\bar{\varOmega}\) immediately yields Proposition 1. In practice the probability of error is usually close to ϵ even for unsmoothed ICPs (as we will see in Sect. 6 and Appendix B).
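Since the defining equations (2)–(5) are not reproduced in this excerpt, the following sketch (plain Python, not the authors' R code) assumes the standard construction of an unsmoothed ICP: the p-value of a postulated label y is p^y = (#{i : α_i ≤ α^y} + 1)/(n+1), where α_{m+1},…,α_l are the conformity scores of the calibration examples, and the prediction set keeps every label with p^y > ϵ. The function name and the toy scores are hypothetical.

```python
def icp_prediction_set(calib_scores, test_scores_by_label, eps):
    """Prediction set of an (unsmoothed) ICP at significance level eps.

    calib_scores: conformity scores alpha_{m+1}, ..., alpha_l of the
    calibration examples; test_scores_by_label maps each postulated
    label y to the conformity score alpha^y of the test object under
    that label.  A label is kept iff its p-value exceeds eps.
    """
    n = len(calib_scores)
    pred = set()
    for y, alpha_y in test_scores_by_label.items():
        # p^y = (#{i : alpha_i <= alpha^y} + 1) / (n + 1)
        p_y = (sum(1 for a in calib_scores if a <= alpha_y) + 1) / (n + 1)
        if p_y > eps:
            pred.add(y)
    return pred

calib = [0.11, 0.25, 0.30, 0.47, 0.52, 0.61, 0.74, 0.86, 0.93]  # n = 9
scores = {"email": 0.55, "spam": 0.05}  # hypothetical conformity scores
print(icp_prediction_set(calib, scores, eps=0.20))  # -> {'email'}
```

By Proposition 1, the event that the true label falls outside such a set has probability at most ϵ under exchangeability, whatever the conformity scores are.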

In the case of regression, \(\mathbf{Y}=\mathbb{R}\), we can define the inductive conformity measure by (1), where Δ(y,f(x)):=−|y−f(x)| and f is the prediction rule found by using ridge regression with (z _{1},…,z _{ m }) as the training set. This ICP is the inductive counterpart of the Ridge Regression Confidence Machine (Vovk et al. 2005, Sect. 2.3).
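As an illustration of this ridge-based conformity measure, here is a hedged sketch of the resulting prediction interval. It uses a 1-D ridge fit computed in closed form (any regression rule fitted on the proper training set would do) and the standard reduction of {y : p^y > ϵ} to a quantile of the absolute calibration residuals; all names are illustrative, not the authors' implementation.

```python
import math

def ridge_icp_interval(proper, calib, x_test, eps, a=1.0):
    """Prediction interval of the ICP with conformity A = -|y - f(x)|,
    where f(x) = w*x is a 1-D ridge fit (penalty a) on the proper
    training set; proper and calib are lists of (x, y) pairs."""
    sxx = sum(x * x for x, _ in proper)
    sxy = sum(x * y for x, y in proper)
    w = sxy / (sxx + a)                      # closed-form ridge solution
    # {y : p^y > eps} is the interval centered at f(x_test) whose
    # half-width is the ceil((1-eps)(n+1))-th smallest absolute
    # calibration residual (infinite when eps is too small for n).
    r = sorted(abs(y - w * x) for x, y in calib)
    k = math.ceil((1 - eps) * (len(r) + 1)) - 1   # 0-based index
    half = r[k] if k < len(r) else math.inf
    return (w * x_test - half, w * x_test + half)
```

Note how, with n calibration examples, the interval becomes infinite whenever ϵ < 1/(n+1): a small calibration set cannot support high confidence.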
An example not covered by the scheme (1) is the 1-Nearest Neighbor ICP, whose inductive conformity measure is
$$ A\bigl((z_1,\ldots,z_m),(x,y),\omega \bigr) := \frac{\min_{i=1,\ldots,m:y_i\ne y}d(x,x_i)}{\min_{i=1,\ldots,m:y_i=y}d(x,x_i)}, $$(6)
where d is a distance on X. Intuitively, an example conforms to the proper training set if it is closer to the examples labeled in the same way than to those labeled differently.
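A direct transcription of (6), assuming by default the absolute-value distance on a 1-D object space (the helper name is ours; proper-training examples are (x, y) pairs):

```python
def nn_conformity(proper, x, y, d=lambda u, v: abs(u - v)):
    """Eq. (6): distance to the nearest differently labeled
    proper-training object divided by the distance to the nearest
    identically labeled one; larger values mean better conformity.
    (Assumes both minima are over non-empty sets and the denominator
    is non-zero.)"""
    diff = min(d(x, xi) for xi, yi in proper if yi != y)
    same = min(d(x, xi) for xi, yi in proper if yi == y)
    return diff / same
```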
3 Training conditional validity
As discussed in Sect. 1, the standard property of validity of inductive conformal predictors is unconditional. The property of training conditional validity can be formalized using a PAC-type two-parameter definition. It will be convenient to represent the ICP (2) in a slightly different form, downplaying the structure (x _{ i },y _{ i }) of z _{ i }. Define Γ ^{ ϵ }(z _{1},…,z _{ l },ω):={(x,y)∣p ^{ y }>ϵ}, where p ^{ y } is defined, as before, by (3) and (4) (therefore, p ^{ y } depends implicitly on x). In this notation the first part of Proposition 1 can be restated by saying that the probability of error Z _{ l+1}∉Γ ^{ ϵ }(Z _{1},…,Z _{ l },Ω) does not exceed ϵ provided Z _{1},…,Z _{ l+1} are exchangeable. We will use similar conventions in the smoothed case.
Let Z be the random variable Z(z):=z on the measurable space Z (equipped with a probability distribution usually denoted Q). We will say that an inductive conformity measure is continuous under a probability distribution Q on Z if, for Q ^{ m }almost all (z _{1},…,z _{ m })∈Z ^{ m } and Ralmost all \(\omega\in\bar{\varOmega}\), the random variable A((z _{1},…,z _{ m }),Z,ω) on the probability space (Z,Q) is continuous.
Theorem 1
 (a) Let Γ be an inductive conformal predictor. Suppose that ϵ,δ,E∈(0,1) satisfy
$$ \delta \ge \mathop{\mathrm{bin}}\limits_{n,E} \bigl( \bigl\lfloor\epsilon(n+1)-1 \bigr\rfloor \bigr), $$(7)
where n:=l−m is the size of the calibration set. The set predictor Γ ^{ ϵ } is then (E,δ)-valid. Moreover, for any probability distribution Q on Z, any proper training set (z _{1},…,z _{ m })∈Z ^{ m }, and any \(\omega\in\bar{\varOmega}\),
$$ Q^{l-m} \bigl( Q\bigl(\varGamma^{\epsilon}(z_1, \ldots,z_m,Z_{m+1},\ldots,Z_l,\omega)\bigr)\ge 1-E \bigr) \ge 1-\delta. $$(8)
If Γ is based on an inductive conformity measure that is continuous under Q, Γ ^{ ϵ } is (E,δ)-valid with respect to Q if and only if (7) holds.
 (b) Let Q be a probability distribution on Z and let Γ be a smoothed inductive conformal predictor based on an inductive conformity measure continuous under Q. Suppose ϵ,δ,E∈(0,1) satisfy
$$ \delta \ge \mathop{\mathrm{bin}}\limits_{n,E} \bigl( \bigl\lfloor \epsilon(n+1)\bigr\rfloor \bigr). $$(9)
The set predictor Γ ^{ ϵ } is then (E,δ)-valid with respect to Q. Moreover, for Q ^{ m }-almost all proper training sets (z _{1},…,z _{ m })∈Z ^{ m }, R-almost all ω, and all θ∈[0,1],
$$ Q^{l-m} \bigl( Q\bigl( \varGamma^{\epsilon}(z_1,\ldots,z_m,Z_{m+1}, \ldots,Z_l,\omega,\theta)\bigr)\ge 1-E \bigr) \ge 1-\delta. $$(10)
The set predictor Γ ^{ ϵ } is not (E,δ)-valid with respect to Q unless ϵ,δ,E satisfy (7).
In the case of smoothed ICPs there is a gap between the sufficient condition (9) and the necessary condition (7), but it does not appear excessive. More worrying is the requirement that the inductive conformity measure be continuous under the unknown data-generating distribution Q. Unfortunately, without this or a similar requirement there are no meaningful guarantees of training conditional validity. Indeed, consider the trivial smoothed ICP based on the inductive conformity measure identically equal to 0. At significance level ϵ, it has coverage probability 1 with probability 1−ϵ and coverage probability 0 with probability ϵ. Therefore, it cannot be (E,δ)-valid for E<1 unless δ≥ϵ. This contrasts with the case of unsmoothed ICPs, where very small δ are achievable: see, e.g., Fig. 8 below. Another natural way to define smoothed ICPs is to use different random variables Θ when computing p ^{ y } for different labels y∈Y; however, this version encounters similar problems with training conditional validity when the inductive conformity measure is not required to be continuous under Q.
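Conditions (7) and (9) are easy to evaluate numerically. The sketch below (plain Python, exact binomial distribution function; the function names are ours) computes, for given n, ϵ, and E, the smallest δ for which the respective condition of Theorem 1 holds; here bin_{n,E} denotes, as in the theorem, the distribution function of the binomial distribution with n trials and success probability E.

```python
from math import comb, floor

def bin_cdf(n, p, k):
    """P(B <= k) for B ~ binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def training_conditional_delta(n, eps, E, smoothed=False):
    """Smallest delta satisfying condition (7) (unsmoothed ICPs) or
    condition (9) (smoothed ICPs) of Theorem 1."""
    k = floor(eps * (n + 1)) if smoothed else floor(eps * (n + 1) - 1)
    if k < 0:
        return 0.0
    return bin_cdf(n, E, k)
```

By Lemma 1 the returned δ decreases as E grows, so for fixed n, ϵ, δ one can also solve for the smallest admissible E by bisection.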
Proof of Theorem 1
Let us now prove part (b), starting with (10). We will assume that the distribution of A((z _{1},…,z _{ m }),Z,ω) is continuous (we can do so since (10) is required to hold only for almost all proper training sets and ω). By (5), the set predictor Γ ^{ ϵ } can make an error only if the number of i=m+1,…,l such that α _{ i }<α ^{ y } is at most ⌊ϵ(n+1)⌋ (set θ:=0 in (5) and combine this with p ^{ y }≤ϵ); in other words, only if α ^{ y }≤α _{(k)}, where α _{(k)} is the kth smallest α _{ i } and k:=⌊ϵ(n+1)⌋+1. Therefore, the Q-probability of the complement of Γ ^{ ϵ }(z _{1},…,z _{ l },ω,θ) is at most Q(A((z _{1},…,z _{ m }),Z,ω)≤α _{(k)}). Define α ^{∗},E′,E″ as before; now we know that E′=E=E″. The probability of error can exceed E only if α _{(k)}>α ^{∗}, that is, only if at most k−1 of the α _{ i } are below or at α ^{∗}. The probability that at most k−1=⌊ϵ(n+1)⌋ of the α _{ i } are below or at α ^{∗} equals \(\operatorname{\mathbb{P}}(B_{n}\le\lfloor\epsilon(n+1)\rfloor)\), where \(B_{n}\sim \operatorname{bin}_{n,E}\). This proves (10).
The last statement of part (b) follows immediately from what we have already proved. □
In the proof of Theorem 1 we used the first statement of the following lemma.
Lemma 1
Fix the number of trials n. The distribution function \(\operatorname{bin}_{n,p}(K)\) of the binomial distribution is decreasing in the probability of success p for a fixed K∈{0,…,n}. It is strictly decreasing unless K=n.
Proof
The following corollary makes (7) and (9) in Theorem 1 less precise but more explicit using Hoeffding’s inequality.
Corollary 1
 (a) If Γ is an inductive conformal predictor, the set predictor Γ ^{ ϵ } is (E,δ)-valid provided
$$ E \ge \epsilon + \sqrt{\frac{-\ln\delta}{2n}}. $$(11)
 (b) If Γ is a smoothed inductive conformal predictor based on an inductive conformity measure continuous under Q, the set predictor Γ ^{ ϵ } is (E,δ)-valid with respect to Q provided
$$ E \ge \biggl(1+\frac{1}{n} \biggr)\epsilon + \sqrt{\frac{-\ln\delta}{2n}}. $$(12)
This corollary gives the following recipe for constructing (ϵ,δ)-valid set predictors. The recipe only works if the training set is sufficiently large; in particular, its size l should significantly exceed N:=(−lnδ)/(2ϵ ^{2}). Choose an ICP Γ with the size n of the calibration set exceeding N. Then the set predictor \(\varGamma^{\epsilon-\sqrt{(-\ln\delta)/(2n)}}\) will be (ϵ,δ)-valid.
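The recipe can be sketched as follows (illustrative Python, not from the paper; the function name is ours): compute the threshold N from ϵ and δ, check that the calibration set is large enough, and return the adjusted significance level at which to run the ICP.

```python
from math import ceil, log, sqrt

def adjusted_significance(eps, delta, n):
    """Significance level at which to run an ICP so that, by
    Corollary 1(a), the resulting set predictor is (eps, delta)-valid;
    requires the calibration set size n to exceed
    N = (-ln delta) / (2 eps^2), else the level would be <= 0."""
    N = -log(delta) / (2 * eps**2)
    if n <= N:
        raise ValueError(f"need a calibration set larger than {ceil(N)}")
    return eps - sqrt(-log(delta) / (2 * n))
```

For instance, for ϵ=δ=0.05 the threshold N is about 599, so a calibration set of a few thousand examples leaves a usable adjusted level.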
Proof of Corollary 1
Remark 3
To conclude this section, we give a statement intermediate between Theorem 1 and Corollary 1.
Corollary 2
 (a) If Γ is an inductive conformal predictor, the set predictor Γ ^{ ϵ } is (E,δ)-valid provided
$$ E \ge \epsilon + \sqrt{\frac{-2\epsilon\ln\delta}{n}} - \frac{2\ln\delta}{n}. $$
 (b) If Γ is a smoothed inductive conformal predictor based on an inductive conformity measure continuous under Q, the set predictor Γ ^{ ϵ } is (E,δ)-valid with respect to Q provided
$$ E \ge (1+1/n)\epsilon + \sqrt{\frac{-2(1+1/n)\epsilon\ln\delta}{n}} - \frac{2\ln\delta}{n}. $$
Proof
4 Conditional inductive conformal predictors
The motivation behind conditional inductive conformal predictors is that ICPs do not always achieve the required probability ϵ of error Y _{ l+1}∉Γ ^{ ϵ }(Z _{1},…,Z _{ l },X _{ l+1},Ω) conditional on (X _{ l+1},Y _{ l+1})∈E for important sets E⊆Z. This is often undesirable. If, e.g., our set predictor is valid at the significance level 5 % but makes an error with probability 10 % for men and 0 % for women, both men and women can be unhappy with calling 5 % the probability of error. Moreover, in many problems we might want different significance levels for different regions of the example space: e.g., in the problem of spam detection (considered in Sects. 6 and 7), classifying spam as email usually does much less harm than classifying email as spam.
An inductive mtaxonomy is a measurable function K:Z ^{ m }×Z→K, where K is a measurable space. Usually the category K((z _{1},…,z _{ m }),z) of an example z is a kind of classification of z, which may depend on the proper training set (z _{1},…,z _{ m }).
The following proposition is the conditional analogue of the first part of Proposition 1; in particular, it shows that in classification problems label conditional ICPs achieve label conditional validity.
Proposition 2
If random examples Z _{ m+1},…,Z _{ l },Z _{ l+1}=(X _{ l+1},Y _{ l+1}) are exchangeable, the probability of error Y _{ l+1}∉Γ ^{ ϵ }(Z _{1},…,Z _{ l },X _{ l+1},Ω) given the category K((Z _{1},…,Z _{ m }),Z _{ l+1}) of Z _{ l+1} does not exceed ϵ for any ϵ and any conditional inductive conformal predictor Γ corresponding to K.
We refrain from giving the definition of smoothed conditional ICPs, which is straightforward. The categories can also be made dependent on \(\omega\in\bar{\varOmega}\).
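In the special case of label conditional ICPs in binary classification, the category of an example is simply its label, so the p-value of a postulated label is computed from the calibration examples carrying that label only. A minimal sketch (assuming the same p-value convention as before; the function name is ours, and calibration examples are (conformity score, label) pairs):

```python
def label_conditional_pvalue(calib, alpha_y, y):
    """p-value of the postulated label y in a label conditional ICP:
    computed, as in the unconditional case, by ranking alpha_y among
    the calibration conformity scores, but only those of calibration
    examples whose label is y."""
    scores = [a for a, yi in calib if yi == y]
    return (sum(1 for a in scores if a <= alpha_y) + 1) / (len(scores) + 1)
```

Proposition 2 then guarantees that the error probability conditional on the true label being y is at most ϵ, which is exactly the control of false positive and false negative rates discussed above.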
5 Object conditional validity
In this section we prove a negative result (a version of Lemma 1 in Lei and Wasserman 2013) which says that the requirement of precise object conditional validity cannot be satisfied in a nontrivial way for rich object spaces (such as \(\mathbb{R}\)). If Q is a probability distribution on Z, we let Q _{ X } stand for its marginal distribution on X: Q _{ X }(A):=Q(A×Y). In this section we consider only set predictors that do not depend on θ, but the case of set predictors depending on θ (such as smoothed ICPs) is also covered by redefining ω:=(ω,θ).
Theorem 2
We are mainly interested in the case of a small ϵ (corresponding to high confidence), and in this case (17) implies that, in the case of regression, the prediction interval (i.e., the convex hull of the prediction set) can be expected to be infinitely long unless the test object is an atom. Even an infinitely long prediction interval can be somewhat informative, providing a one-sided bound on the label of the test example; (18) says that, with probability at least 1−2ϵ, the prediction interval is completely uninformative unless the test object is an atom. In the case of classification, (19) says that each particular y∈Y is likely to be included in the prediction set, and so the prediction set is likely to be large. In particular, (19) implies that the expected size of the prediction set is at least (1−ϵ)|Y|.
Of course, the condition that the test object x be a nonatom is essential: if Q _{ X }({x})>0, an inductive conformal predictor that ignores all examples with objects different from the current test object can have 1−ϵ object conditional validity and still produce a small prediction set for a test object x if the training set is big enough to contain many examples with x as their object.
Remark 4
Nontrivial set predictors having 1−ϵ object conditional validity are constructed by McCullagh et al. (2009) assuming the Gauss linear model.
Proof of Theorem 2
The proof will be based on the ideas of Lei and Wasserman (2013, the proof of Lemma 1).
We start by showing that the ϵ in (17), (18), and (19) cannot be replaced by a smaller constant. For (17) and (19) this follows from the fact that the trivial set predictor predicting Y with probability 1−ϵ and ∅ with probability ϵ has 1−ϵ object conditional validity. In the case of (18) the bound 1−2ϵ is attained by the set predictor predicting \(\mathbb{R}\) with probability 1−2ϵ, [0,∞) with probability ϵ, and (−∞,0] with probability ϵ (this assumes ϵ<1/2; the case ϵ≥1/2 is trivial). This predictor’s conditional probability of error given all l+1 examples is at most ϵ (0 if y _{ l+1}=0 and ϵ otherwise); therefore, the conditional probability of error will be at most ϵ given the test object.
In the proof of Theorem 2 we used the following lemma.
Lemma 2
If Q is a probability measure on X, which is assumed to be a separable metric space, E is a set of Qnonatoms such that Q(E)>0, and δ>0 is an arbitrarily small number, then there is E′⊆E such that 0<Q(E′)<δ.
Proof
We can take as E′ the intersection of E and an open ball centered at any element of X for which all such intersections have a positive Qprobability. Let us prove that such elements exist. Suppose they do not.
Fix a countable dense subset A _{1} of X. Let A _{2} be the union of all open balls B with rational radii centered at points of A _{1} such that Q(B∩E)=0. On the one hand, the σ-additivity of measures implies Q(A _{2}∩E)=0. On the other hand, A _{2}=X: indeed, for each x∈X there is an open ball B of some radius ϵ>0 centered at x that satisfies Q(B∩E)=0; taking a point a∈A _{1} at distance less than ϵ/2 from x and a rational radius r∈(d(a,x),ϵ/2), we obtain a ball centered at a that contains x and is contained in B, so that x∈A _{2}. This contradicts Q(E)>0. □
Theorem 2 demonstrates an interesting allornothing phenomenon for set predictors having 1−ϵ object conditional validity: each such predictor produces hopelessly large prediction sets with probability at least 1−ϵ; on the other hand, already a trivial predictor of this kind (mentioned in the proof) produces the smallest possible prediction sets with probability ϵ.
The theorem does not prevent the existence of efficient set predictors that are object conditionally valid in an asymptotic sense; indeed, the paper by Lei and Wasserman (2013) is devoted to constructing asymptotically efficient and asymptotically object conditionally valid set predictors in the case of regression.
6 Experiments
This section describes some simple experiments on the well-known Spambase data set contributed by George Forman to the UCI Machine Learning Repository (Frank and Asuncion 2010). Its overall size is 4601 examples, and it contains examples of two classes: email (also written as 0) and spam (also written as 1). Hastie et al. (2009) report results of several machine-learning algorithms on this data set split randomly into a training set of size 3065 and a test set of size 1536. The best result is achieved by MART (multiple additive regression trees; 4.5 % error rate according to the second edition of Hastie et al. 2009). The R programs used in the experiments described in this and the next section use the gbm package with virtually all parameters set to their default values (given in the description provided in response to help("gbm")).
Table 1 Percentages of errors, multiple predictions, and empty predictions at significance level 5 % on the full test set, separately on email and spam, and separately on two kinds of objects. The results are given for the first 100 seeds of the R (pseudo)random number generator (RNG); the column “Average” gives the average percentages over all 100 seeds 0–99, and the column “St. dev.” gives the usual estimates of the standard deviations (namely, the square roots of the standard unbiased estimates of the variances) of the percentages over the 100 seeds
RNG seed  0  1  2  …  99  Average  St. dev. 

errors overall  4.1 %  6.9 %  4.6 %  …  4.2 %  5.08 %  1.00 % 
for email  2.44 %  4.61 %  2.26 %  …  2.82 %  3.35 %  0.92 % 
for spam  6.77 %  10.43 %  8.42 %  …  6.30 %  7.74 %  1.64 % 
for $<5.55 %  4.36 %  7.91 %  5.15 %  …  4.34 %  5.76 %  1.24 % 
for $>5.55 %  3.29 %  4.12 %  2.69 %  …  3.75 %  2.96 %  1.02 % 
multiple overall  2.7 %  0 %  0.1 %  …  1.2 %  0.86 %  0.98 % 
for email  2.11 %  0 %  0.16 %  …  0.83 %  0.60 %  0.68 % 
for spam  3.65 %  0 %  0 %  …  1.76 %  1.26 %  1.52 % 
for $<5.55 %  3.04 %  0 %  0.13 %  …  1.18 %  0.98 %  1.15 % 
for $>5.55 %  1.65 %  0 %  0 %  …  1.25 %  0.49 %  0.68 % 
empty overall  0 %  2.7 %  0 %  …  0 %  0.31 %  0.63 % 
for email  0 %  1.48 %  0 %  …  0 %  0.24 %  0.47 % 
for spam  0 %  4.58 %  0 %  …  0 %  0.42 %  0.96 % 
for $<5.55 %  0 %  3.14 %  0 %  …  0 %  0.36 %  0.73 % 
for $>5.55 %  0 %  1.50 %  0 %  …  0 %  0.14 %  0.40 % 
Notice that the numbers of errors, multiple predictions, and empty predictions tend to be greater for spam than for email. Somewhat counterintuitively, they also tend to be greater for “email-like” objects containing few $ characters than for “spam-like” objects. The percentage of multiple and empty predictions is relatively small since the error rate of the underlying predictor happens to be close to our significance level of 5 %.
In practice, using a fixed significance level (such as the standard 5 %) is not a good idea; we should at least pay attention to what happens at several significance levels. However, experimenting with prediction sets at a fixed significance level facilitates a comparison with theoretical results.
Table 2 The analogue of a subset of Table 1 in the case of the label conditional ICP
RNG seed  0  1  2  …  99  Average  St. dev. 

errors overall  3.4 %  6.0 %  3.8 %  …  3.6 %  4.92 %  0.91 % 
for email  3.73 %  6.92 %  3.87 %  …  3.48 %  4.97 %  1.15 % 
for spam  2.86 %  4.58 %  3.68 %  …  3.78 %  4.82 %  1.33 % 
multiple overall  4.2 %  0 %  4.0 %  …  2.6 %  1.68 %  1.54 % 
for email  3.90 %  0 %  5.48 %  …  2.49 %  1.94 %  1.86 % 
for spam  4.69 %  0 %  1.58 %  …  2.77 %  1.28 %  1.26 % 
empty overall  0 %  1.0 %  0 %  …  0 %  0.15 %  0.45 % 
for email  0 %  1.48 %  0 %  …  0 %  0.15 %  0.47 % 
for spam  0 %  0.25 %  0 %  …  0 %  0.15 %  0.47 % 
7 ICPs and ROC curves
This section discusses a close connection between an important class of ICPs (“scoring-type” label conditional ICPs) and ROC curves. (For a previous study of connections between conformal prediction and ROC curves, see Vanderlooy and Sprinkhuizen-Kuyper 2007.) Let us say that an ICP or a label conditional ICP is scoring-type if its inductive conformity measure is defined by (1), where f takes values in \(\mathbb{R}\) and Δ is defined by (26).
Proposition 3
Proof
The nonstandard estimate (k+1)/(n+1) of the parameter of the binomial distribution leads to a version of ROC curve that is connected to the label conditional ICP in the most direct way. Let us call this estimate the upper Venn estimate and the corresponding ROC curve the upper Venn ROC curve (cf. the discussion of the Venn predictor in Vovk et al. 2005, pp. 159–160). (The upper Venn estimate is unusual in that the estimate of the probability of an event plus the estimate of the probability of its complement is different from 1.) Notice that the upper Venn ROC curve lies Northeast of all three ROC curves discussed earlier. In the square [0,0.5]×[0,0.5] the order of the ROC curves from Southwest to Northeast is: empirical, minimax, Laplace, and upper Venn; the last two are very close to each other for large n ^{0} and n ^{1} and small ratios \(n^{0}_{\ge}/n^{0}\) and \(n^{1}_{\le}/n^{1}\), as in Fig. 11.
The pair (p ^{0},p ^{1}) of pvalues for any test example belongs to the upper Venn ROC curve; therefore, this curve passes through all test examples in Fig. 11. The curve can serve as a convenient classification of all possible test objects: each of them corresponds to a point on the curve.
The label conditional ICP can also be conveniently described in terms of the upper Venn ROC curve. An example is given as the right panel of Fig. 11. Each test object is represented by a point (p ^{0},p ^{1}). Let ϵ be the significance level; it is 5 % in Fig. 11 (but as mentioned earlier, there is no need to have the same significance level for email and spam). If the point (ϵ,ϵ) lies Southwest of the curve, the label conditional ICP can produce multiple predictions but never produces empty predictions. If it lies Northeast of the curve, the predictor can produce empty predictions but never produces multiple predictions. In particular, it is impossible to produce both multiple and empty predictions for the same calibration set, which is demonstrated by columns 0–99 of Table 2. (Lying on the curve is regarded as a special case of lying Northeast of it. Because of the discreteness of the upper Venn ROC curve it is also possible that (ϵ,ϵ) lies neither Northeast nor Southwest of it; in this case predictions are always singletons.)
If the test object is in the Northeast region NE with respect to (ϵ,ϵ) (i.e., p ^{0}>ϵ and p ^{1}>ϵ), the prediction set is multiple, {0,1}. If it is in the region SW (i.e., p ^{0}≤ϵ and p ^{1}≤ϵ), the prediction set is empty. Otherwise the prediction set is a singleton: {1} if it is in NW (p ^{0}≤ϵ and p ^{1}>ϵ) and {0} if it is in SE (p ^{0}>ϵ and p ^{1}≤ϵ). This is shown in the right panel of Fig. 11.
However, a one-sided approach may be more appropriate in the case of the Spambase data set. There is a clear asymmetry between the two kinds of error in spam detection: classifying email as spam is much more harmful than letting occasional spam in. A reasonable approach is to start from a small number ϵ>0, the maximum tolerable percentage of email classified as spam, and then to try to minimize the percentage of spam classified as email under this constraint. For example, we can use the “one-sided label conditional ICP” classifying x as spam if and only if p ^{0}≤ϵ for x; otherwise, x is classified as email. In the case of ϵ=5 %, this means classifying a test object as spam if and only if it lands to the left of (or onto) the vertical dotted line in the right panel of Fig. 11.
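The two-sided and one-sided decision rules just described reduce to simple comparisons of the pair (p^0, p^1) of label conditional p-values with ϵ; a sketch (function names are ours; 0 stands for email and 1 for spam, as above):

```python
def two_sided_prediction(p0, p1, eps):
    """Prediction set of the label conditional ICP from the pair of
    p-values, following the four regions around the upper Venn ROC
    curve: NE -> {0, 1}, SW -> empty, NW -> {1}, SE -> {0}."""
    pred = set()
    if p0 > eps:
        pred.add(0)
    if p1 > eps:
        pred.add(1)
    return pred

def one_sided_spam(p0, eps):
    """One-sided rule: classify as spam (1) iff p^0 <= eps,
    controlling the probability of mistaking email for spam."""
    return 1 if p0 <= eps else 0
```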
Both our procedures, two-sided and one-sided, look very similar to the standard uses of ROC curves. However, the standard justification of these uses presupposes that we know the true ROC curve. In practice, we only have access to an estimate of the true ROC curve, and the estimation error is usually very significant. The upper Venn ROC curve is defined in terms of the data rather than the unknown true distribution. Despite this, we still have guarantees of validity. For example, our one-sided procedure guarantees that the (unconditional) probability of mistaking email for spam is at most ϵ (see Proposition 2).
This section of the paper raises a large number of questions. Not all inductive conformity measures are scoring-type; can other types be analyzed using the notion of ROC curves? Can other kinds of conditional ICPs be analyzed this way? What about smoothed ICPs? And even in the case of scoring-type label conditional ICPs, we have not proved their property of training conditional validity (i.e., the version of Theorem 1 for label conditional ICPs).
8 Conclusion
The goal of this paper has been to explore various versions of the requirement of conditional validity. With a small training set, we have to content ourselves with unconditional validity (or abandon any formal requirement of validity altogether). For bigger training sets, training-conditional validity will be approached by ICPs automatically, and we can approach example-conditional validity by using conditional ICPs while making sure that the size of a typical category does not become too small (say, smaller than 100). In problems of binary classification, we can control the false positive and false negative rates by using label-conditional ICPs.
The known property of validity of inductive conformal predictors (Proposition 1) can be stated in the traditional statistical language (see, e.g., Fraser 1957 and Guttman 1970) by saying that they are 1−ϵ expectation tolerance regions, where ϵ is the significance level. In classical statistics, however, there are two kinds of tolerance regions: 1−ϵ expectation tolerance regions and PAC-type 1−δ tolerance regions for a proportion 1−ϵ, in the terminology of Fraser (1957). We have seen (Theorem 1) that inductive conformal predictors are tolerance regions in the second sense as well (cf. Appendix A).
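Writing Γ for the set predictor and z for a fresh test example, with the inner probability taken conditionally on the training set and the outer expectation or probability taken over training sets, the two notions can be contrasted as follows (notation ours, for illustration):

```latex
% 1 - epsilon expectation tolerance region (validity on average):
\mathbb{E}\bigl[\Pr(z \in \Gamma \mid \text{training set})\bigr] \ge 1 - \epsilon
% PAC-type 1 - delta tolerance region for a proportion 1 - epsilon
% (training-conditional validity):
\Pr\bigl(\Pr(z \in \Gamma \mid \text{training set}) \ge 1 - \epsilon\bigr) \ge 1 - \delta
```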
A disadvantage of inductive conformal predictors is their potential predictive inefficiency: the calibration set is wasted as far as the development of the prediction rule f in (1) is concerned, and the proper training set is wasted as far as the calibration (3) of conformity scores into p-values is concerned. Conformal predictors use the full training set for both purposes, and so can be expected to be significantly more efficient. (There have been reports of comparable and even better predictive efficiency of ICPs as compared to conformal predictors, but these may be artefacts of the particular methods and data sets used.) It is an open question whether we can guarantee training-conditional validity under (11) or a similar condition for conformal predictors different from classical tolerance regions. Perhaps no universal results of this kind exist, and different families of conformal predictors will require different methods. See Appendix B for an empirical study of a simple conformal predictor.
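The calibration step mentioned here can be sketched as follows, assuming conformity scores in which larger values indicate greater conformity; the function and variable names are ours, and the formula is the standard inductive one rather than a transcription of (3).

```python
# Minimal sketch of inductive calibration: the prediction rule trained on
# the proper training set assigns conformity scores alpha_1, ..., alpha_q
# to the q calibration examples; a test example's score alpha becomes
#     p = (|{i : alpha_i <= alpha}| + 1) / (q + 1),
# i.e., the (smoothed-free) fraction of calibration scores at most alpha.

def icp_p_value(calibration_scores, test_score):
    q = len(calibration_scores)
    n_at_most = sum(1 for a in calibration_scores if a <= test_score)
    return (n_at_most + 1) / (q + 1)

print(icp_p_value([0.1, 0.2, 0.3, 0.4], 0.25))  # 0.6
```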
Footnotes
 1.
In practice, we might want to improve the predictor by adding another step and changing the classification from spam to email if p^1 is also small, in which case x looks neither like spam nor like email. This step can usually be disregarded for scoring-type ICPs unless ϵ is very lax.
Acknowledgements
I am grateful to Bob Williamson for a useful discussion. Many thanks to the reviewers of this paper and Vovk (2012) for their suggestions, which led, in particular, to Appendix B and Figs. 9 and 10. The empirical studies described in this paper used the R system, the gbm package for R written by Greg Ridgeway (based on the work of Freund and Schapire 1997 and Friedman 2001, 2002), MATLAB, and the C program for computing tangent distance written by Daniel Keysers and adapted to MATLAB by Aditi Krishn.
References
 Balasubramanian, V. N., Ho, S. S., & Vovk, V. (Eds.) (2013). Conformal prediction for reliable machine learning: theory, adaptations, and applications. Waltham: Elsevier (to appear).
 Bengio, S., Mariéthoz, J., & Keller, M. (2005). The expected performance curve. In Proceedings of the ICML 2005 workshop on ROC analysis in machine learning. URL http://users.dsic.upv.es/~flip/ROCML2005/.
 Frank, A., & Asuncion, A. (2010). UCI machine learning repository. URL http://archive.ics.uci.edu/ml.
 Fraser, D. A. S. (1957). Nonparametric methods in statistics. New York: Wiley.
 Fraser, D. A. S., & Wormleighton, R. (1951). Nonparametric estimation IV. The Annals of Mathematical Statistics, 22, 294–298.
 Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.
 Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29, 1189–1232.
 Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38, 367–378.
 Guttman, I. (1970). Statistical tolerance regions: classical and Bayesian. London: Griffin.
 Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction (2nd ed.). New York: Springer.
 Langford, J. (2005). Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6, 273–306.
 Lei, J., & Wasserman, L. (2013). Distribution free prediction bands for nonparametric regression. Journal of the Royal Statistical Society B (to appear); preliminary version published as Technical Report arXiv:1203.5422 [stat.ME].
 Lei, J., Robins, J., & Wasserman, L. (2013). Distribution free prediction sets. Journal of the American Statistical Association, 108, 278–287. Preliminary version published as Technical Report arXiv:1111.1418 [math.ST].
 Maindonald, J., & Braun, J. (2007). Data analysis and graphics using R: an example-based approach (2nd ed.). Cambridge: Cambridge University Press.
 McCullagh, P., Vovk, V., Nouretdinov, I., Devetyarov, D., & Gammerman, A. (2009). Conditional prediction intervals for linear regression. In Proceedings of the eighth international conference on machine learning and applications, December 13–15, Miami, FL (pp. 131–138). Available from http://www.stat.uchicago.edu/~pmcc/reports/predict.pdf.
 National Institute of Standards and Technology (2012). Digital library of mathematical functions. URL http://dlmf.nist.gov/.
 Nouretdinov, I. R. (2008). Offline Nearest Neighbour transductive Confidence Machine. In Poster and workshop proceedings of the eighth industrial conference on data mining (pp. 16–24).
 Papadopoulos, H., Proedrou, K., Vovk, V., & Gammerman, A. (2002a). Inductive confidence machines for regression. In T. Elomaa, H. Mannila, & H. Toivonen (Eds.), Lecture notes in computer science: Vol. 2430. Proceedings of the thirteenth European conference on machine learning, August 19–23, 2002, Helsinki (pp. 345–356). Berlin: Springer.
 Papadopoulos, H., Vovk, V., & Gammerman, A. (2002b). Qualified predictions for large data sets in the case of pattern recognition. In Proceedings of the first international conference on machine learning and applications, June 24–27, 2002, Las Vegas, NV (pp. 159–163). Las Vegas: CSREA Press.
 Papadopoulos, H., Gammerman, A., & Vovk, V. (Eds.) (2013). Special issue of the Annals of Mathematics and Artificial Intelligence on conformal prediction and its applications. Springer (to appear).
 Saunders, C., Gammerman, A., & Vovk, V. (1999). Transduction with confidence and credibility. In T. Dean (Ed.), Proceedings of the sixteenth international joint conference on artificial intelligence, July 31–August 6, 1999, Stockholm (Vol. 2, pp. 722–726). San Francisco: Morgan Kaufmann.
 Scheffé, H., & Tukey, J. W. (1945). Nonparametric estimation I: Validation of order statistics. The Annals of Mathematical Statistics, 16, 187–192.
 Tsybakov, A. B. (2010). Introduction to nonparametric estimation. New York: Springer.
 Tukey, J. W. (1947). Nonparametric estimation II: Statistically equivalent blocks and tolerance regions – the continuous case. The Annals of Mathematical Statistics, 18, 529–539.
 Tukey, J. W. (1948). Nonparametric estimation III: Statistically equivalent blocks and tolerance regions – the discontinuous case. The Annals of Mathematical Statistics, 19, 30–39.
 Vanderlooy, S., & Sprinkhuizen-Kuyper, I. G. (2007). A comparison of two approaches to classify with guaranteed performance. In J. N. Kok, J. Koronacki, R. L. de Mántaras, S. Matwin, D. Mladenic, & A. Skowron (Eds.), Lecture notes in computer science: Vol. 4702. Proceedings of the eleventh European conference on principles and practice of knowledge discovery in databases, September 17–21, 2007, Warsaw (pp. 288–299). Berlin: Springer.
 Vanderlooy, S., van der Maaten, L., & Sprinkhuizen-Kuyper, I. (2007). Off-line learning with Transductive Confidence Machines: an empirical evaluation. In P. Perner (Ed.), Lecture notes in artificial intelligence: Vol. 4571. Proceedings of the fifth international conference on machine learning and data mining in pattern recognition, July 18–20, 2007, Leipzig, Germany (pp. 310–323). Berlin: Springer.
 Vovk, V. (2002). On-line Confidence Machines are well-calibrated. In Proceedings of the forty-third annual symposium on foundations of computer science, November 16–19, 2002, Vancouver (pp. 187–196). Los Alamitos: IEEE Computer Society.
 Vovk, V. (2012). Conditional validity of inductive conformal predictors. In S. C. H. Hoi & W. Buntine (Eds.), Asian conference on machine learning: Vol. 25. JMLR: Workshop and conference proceedings (pp. 475–490).
 Vovk, V., Gammerman, A., & Saunders, C. (1999). Machine-learning applications of algorithmic randomness. In Proceedings of the sixteenth international conference on machine learning, June 27–30, 1999, Bled, Slovenia (pp. 444–453). San Francisco: Morgan Kaufmann.
 Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world. New York: Springer.
 Wilks, S. S. (1941). Determination of sample sizes for setting tolerance limits. The Annals of Mathematical Statistics, 12, 91–96.