## Abstract

In many classification systems, sensing modalities have different acquisition costs. It is often *unnecessary* to use every modality to classify a majority of examples. We study a multi-stage system in a prediction time cost reduction setting, where the full data is available for training, but, for a test example, measurements in a new modality can be acquired at each stage for an additional cost. We seek decision rules to reduce the average measurement acquisition cost. We formulate an empirical risk minimization (ERM) problem for a multi-stage reject classifier, wherein the stage *k* classifier either classifies a sample using only the measurements acquired so far or rejects it to the next stage, where more attributes can be acquired for a cost. To solve the ERM problem in the binary classification setting, we show that the optimal reject classifier at each stage is a combination of two binary classifiers, one biased towards positive examples and the other biased towards negative examples. We use this parameterization to construct a stage-by-stage global surrogate risk, develop an iterative algorithm in the boosting framework, and present convergence and generalization results. We test our work on synthetic, medical and explosives detection datasets. Our results demonstrate that substantial cost reduction without a significant sacrifice in accuracy is achievable.

## Keywords

Multi-stage classification, Sequential decision, Boosting, Cost sensitive learning

## 1 Introduction

In many applications, including homeland security and medical diagnosis, decision systems are composed of an ordered sequence of stages. Each stage is associated with a sensor or a physical sensing modality. Typically, a less informative sensor is cheap (or fast), while a more informative sensor is either expensive or requires more time to acquire a measurement. In practice, a measurement budget (or throughput constraint) does not allow all the modalities to be used simultaneously in making decisions. The goal in these scenarios is to classify examples with low cost sensors whenever possible and limit the number of examples for which a more expensive or time-consuming sensor is required.

For example, in explosives detection, in the first stage, an infrared imager or a metal detector can be used with high throughput and low cost. A second stage could be the use of a slower, more expensive active millimeter wave (AMMW) scanner. The final third stage is a time consuming human inspection. In medical applications, first stages are typically non-invasive procedures (such as a physical exam) followed by more expensive tests (blood test, CT scan etc) and the final stages are invasive (surgical) procedures.

- (A) *Sensors & ordered stages*: Each stage is associated with a new sensor measurement or sensing modality. Multiple stages form an ordered sequence of sensors or sensing modalities, with later stages corresponding to expensive or time-consuming measurements. In many situations there is some flexibility in choosing a sensing modality from a collection of possible modalities; in these cases, the optimal choice of sensing actions also becomes an issue. While our methodology can be modified to account for this more general setting, we primarily consider a fixed order of stages and sensing modalities in this paper. This is justified because many of the situations we have come across involve only a handful of sensors or sensing modalities; for these, choosing a sensor ordering requires no special treatment, since one can enumerate and optimize over the different orderings by brute force.
- (B) *Reject classifiers*: Our sequential decision rules either attempt to fully classify an instance at each stage or "reject" the instance to the next stage for more measurements in case of ambiguity. For example, in explosives detection, a decision rule in the first stage, based on an IR scan, would attempt to detect whether or not a person is a threat and identify the explosive type/location in case of a threat. If the person is identified as a threat at the first stage, it is unnecessary (and indeed dangerous, since the explosive could be detonated) to seek more information. Similarly, in medical diagnosis, if a disease is diagnosed at an early stage, it makes sense to begin early treatment rather than waiting for more conclusive tests.
- (C) *Information vs. computation*: Our setup can only use the partial measurements acquired up to a stage in making a decision. In other methods, such as detection cascades (Viola and Jones 2004), the full measurement, and therefore all the information, is available to every stage. Any region in the feature space can then be carved out with more complex regions in the measurement space, or equivalently, complex features can be extracted, but at a higher cost. In contrast, we have only partial measurements (or information), so any feature or classifier that we employ has to be agnostic to measurements unavailable at that stage.

Our approach is based on the so-called *Prediction Time Cost Reduction approach* (Kanani and Melville 2008). Specifically, we assume a set of training examples in which measurements from all the sensors or sensing modalities, as well as the ground truth labels, are available. Our goal is to derive *sequential reject classifiers* that reduce the cost of measurement acquisition and the classification error in the *prediction (or testing) phase*.

We show that this sequential reject classifier problem can be formulated as an instance of a *partially observable Markov Decision Process* (POMDP) (Kaelbling et al. 1998) when the class-specific probability models for the different sensor measurements are known. In this case the optimal sequential classifier can be cast as a solution to a Dynamic Program (DP). The DP solution is a sequence of *stage-wise optimization* problems, where each stage problem is a combination of the cost from the current stage and the cost-to-go function that is carried on from later stages.

However, class probability models are typically unknown, and our scenarios produce high-dimensional sensor data (such as images) for which they are difficult to estimate. Consequently, unlike some of the conventional approaches (Ji and Carin 2007), where probability models are first estimated in order to solve POMDPs, we adopt a non-parametric *discriminative learning* approach. We utilize the structure of the POMDP solution to empirically approximate the value of the cost-to-go function only at a discrete subset of the data-space. Then, instead of interpolating or parameterizing the cost-to-go function and learning it from data, we formulate an empirical discriminative objective that utilizes point-wise cost-to-go estimates evaluated on the training set and directly learn classifiers that minimize this objective. Using this decomposition, we formulate a novel *multi-stage empirical risk minimization (ERM) problem*.

We solve this ERM problem at each stage by first factoring the cost function into classification and rejection decisions. When probability models are known, optimal strategies for a multi-class setting are given by the DP solution, but it is unclear how to mimic these strategies in the empirical setting. However, if we restrict ourselves to a binary classification setting then we can transform reject decisions into binary classification problems. Specifically, we show that the optimal reject classifier at each stage is a combination of two binary classifiers, one biased towards positive examples and the other biased towards negative examples. The disagreement region of the two then defines the reject region.

We then approximate this empirical risk with a global surrogate. We present an iterative solution, obtained in a boosting framework, and demonstrate local convergence properties. We then extend well-known margin-based generalization bounds (Schapire et al. 1998) to this multi-stage setting. We tested our methods on synthetic, medical and explosives datasets. Our results demonstrate the advantage of multistage classifiers: cost reduction without a significant sacrifice in accuracy.

### 1.1 Related work

#### 1.1.1 Active feature acquisition (AFA)

The subject of this paper is not new and has been studied in the machine learning community as early as MacKay (1992). Our work is closely related to the so-called prediction time active feature acquisition (AFA) approach in the area of cost-sensitive learning. The goal there is to make sequential decisions of whether or not to acquire a new feature to improve prediction accuracy. A natural approach is to formalize the problem as a POMDP. Ji and Carin (2007) and Kapoor and Horvitz (2009) model the decision process and infer feature dependencies while taking acquisition costs into account. Sheng and Ling (2006), Bilgic and Getoor (2007), and Zubek and Dietterich (2002) study strategies for optimizing decision trees while minimizing acquisition costs; the construction is usually based on some purity metric such as entropy. Kanani and Melville (2008) propose a method that acquires an attribute if it increases an expected utility. However, all these methods require estimating the likelihood that a certain feature value occurs given the features collected so far. While surrogates based on classifiers or regressors can be employed to estimate likelihoods, this approach requires discrete, binary or quantized attributes. In contrast, our problem domain deals with high dimensional measurements (images consisting of millions of pixels), so we develop a discriminative learning approach and formulate a multi-stage empirical risk minimization problem to reduce measurement costs and misclassification errors. At each stage, we solve the reject classification problem by factorizing the cost function into classification and rejection decisions. We then embed the rejection decision into a binary classification problem.

#### 1.1.2 Single stage reject classifiers

Single stage classifiers with a reject option have been analyzed in a Bayesian setting (Chow 1970): given an example *x*, a reject cost *δ*, and *J* classes, reject *x* if the maximum of the class posteriors is less than the reject cost: \(\max_{j=1,\ldots,J} \mathrm{P}(y=j \mid x) < \delta\). In the context of machine learning, the posterior distributions are not known, and a decision rule is estimated directly. One popular approach is to reject examples with a small margin. Specifically, in the context of support vector machine classifiers, Yuan and Casasent (2003), Bartlett and Wegkamp (2008), Rodríguez-Díaz and Castañón (2009), and Grandvalet et al. (2008) define a reject region to lie within a small distance (margin) of the separating hyperplane and embed this in the hinge loss of the SVM formulation. El-Yaniv and Wiener (2011) propose a reject criterion motivated by active learning, but its implementation turns out to be computationally impractical. In contrast, we consider multiple stages of reject classifiers. We assume an error prone second stage, which occurs in such fields as threat detection and medical imaging. In this scenario, rejecting in the margin is not always meaningful. Figure 3 illustrates that thresholding the margin to reject can lead to significant degradation. This usually happens when stage measurements are complementary; then, examples within a small margin of the 1st stage boundary may not be the meaningful ones to reject. Multiple stages of margin based reject classifiers have been considered by Liu et al. (2008) using SVMs in image classification. Their method does not take into account the cost of later stages and is similar to the myopic method that we compare against in the Experiments section.

#### 1.1.3 Detection cascades

Our multi-stage sequential reject classifiers bear a close resemblance to detection cascades. There is much literature on cascade design (see Zhang and Zhang 2010; Chen et al. 2012 and references therein), but most cascades roughly follow the set-up introduced by Viola and Jones (2004) to reduce computation cost during classification. At each stage in a cascade, there is a binary classifier with a very high detection rate and a mediocre false alarm rate. Each stage makes a partial decision: it either declares an instance negative or passes it on to the next stage. Only the last stage in the cascade makes a full decision, namely, whether the example belongs to the positive or negative class.

There are several fundamental differences between detection cascades and multi-stage reject classifiers (MSRCs). A key difference is the system architecture. Detection cascades make partial binary decisions, delaying a positive decision until the final stage. In contrast, MSRCs can make full classification decisions at any stage. Conceptually, this distinction requires a fundamentally new approach: detection cascades work because their focus is on unbalanced problems with few positives and a large number of negatives, so the goal at each stage is to admit a large number of false positives while keeping missed detections negligible. Consequently, each stage can be associated with a binary classification problem that is acutely sensitive to missed detections. In contrast, our scheme at each stage is a composite scheme composed of a classifier as well as a rejection decision; the rejection decision is itself a binary classification problem. In practice, MSRCs arise in important areas such as medical diagnosis and explosives detection, as we argued in Sect. 1, item (B). As a performance metric, detection cascades trade off missed detections at the final stage against average computation. MSRCs trade off average misclassification error against the number of examples that reach later stages (i.e. require more sensors or sensing modalities). For these reasons, it is difficult to directly compare algorithms developed for MSRCs to those developed for detection cascades. Nevertheless, our goals and resulting algorithms are similar to some of the issues that arise in cascade design (see Chen et al. 2012 and references therein), namely, performing a joint optimization over all the stages in a cascade given a cost structure for different features.

#### 1.1.4 Other cost sensitive methods

Network intrusion detection systems (IDS) are an area where sequential decision systems have been explored (see Fan et al. 2000; Lee et al. 2002; Cordella and Sansone 2007). In IDS, features have different computation costs. For each cost level, a ruleset is learned, and the goal is to use as many low cost rules as possible. In a related set-up, Fan et al. (2002) and Wang et al. (2003) consider a more general ensemble of base classifiers and explore how to minimize the ensemble size without sacrificing performance. In the test phase, another classifier is added to the ensemble for a sample if the confidence of the current classification is low. Here, similar to detection cascades, the goal is to reduce computation time. As we described in Sect. 1, item (C), the important distinction is that, in our setting, a decision is based only on the partial information acquired up to a stage. In a computation driven method, a stage (or base classifier) decides using features computed from the full measurement vector.

## 2 Problem statement

Let (**x**, *y*) be distributed according to an unknown distribution \(\mathcal{D}\). A data point has *K* features, **x**={*x* _{1},*x* _{2},…,*x* _{ K }}, and belongs to one of *C* classes indicated by its label *y*. The *k*th feature is extracted from a measurement acquired at the *k*th stage; *x* _{ k } is allowed to be a vector. We define the truncated feature vector at the *k*th stage: **x** ^{ k }={*x* _{1},*x* _{2},…,*x* _{ k }}. Let \(\mathcal{X}^{k}\) be the space of the first *k* features, so that \(\mathbf{x}^{k} \in \mathcal{X}^{k}\).

The system has *K* stages, the order of the stages is fixed, and the *k*th stage acquires the *k*th measurement. At each stage *k*, there is a decision rule with a reject option, *f* ^{ k }. It can either classify an example, \(f^{k}(\mathbf{x}^{k}) \in \{1,\ldots,C\}\), or delay the decision until the next stage, *f* ^{ k }(**x** ^{ k })=*r*, and incur a penalty of *δ* ^{ k+1}. Here, *r* indicates the "reject" decision. *f* ^{ k } has to make a decision using only the first *k* sensing modalities. The last stage, *K*, is terminal: a standard classifier without a reject option. Define the system risk to be

\[ R\bigl(f^{1},\ldots,f^{K}\bigr) = \mathbf{E}\Biggl[\, \sum_{k=1}^{K} S^{k}\bigl(\mathbf{x}^{k}\bigr)\, R_{k}\bigl(\mathbf{x}^{k}, y, f^{k}\bigr) \Biggr]. \]

Here, *R* _{ k } is the cost of classifying at the *k*th stage, and *S* ^{ k }(**x** ^{ k })∈{0,1} is the binary state variable indicating whether **x** has been rejected up to the *k*th stage. If **x** is active and is misclassified, the penalty is 1.^{1} If it is rejected, then the system incurs a penalty of *δ* ^{ k+1}, and the state variable for that example remains at 1:

\[ R_{k}\bigl(\mathbf{x}^{k}, y, f^{k}\bigr) = \mathbb{1}_{\left[f^{k}(\mathbf{x}^{k}) \neq y\right]}\, \mathbb{1}_{\left[f^{k}(\mathbf{x}^{k}) \neq r\right]} + \delta^{k+1}\, \mathbb{1}_{\left[f^{k}(\mathbf{x}^{k}) = r\right]}, \qquad S^{k+1}\bigl(\mathbf{x}^{k+1}\bigr) = S^{k}\bigl(\mathbf{x}^{k}\bigr)\, \mathbb{1}_{\left[f^{k}(\mathbf{x}^{k}) = r\right]}. \]
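The per-stage costs and the forward state recursion described above can be sketched in a few lines. This is a minimal illustration under our own naming conventions (`multistage_risk`, a `delta` mapping, and the string `"r"` for the reject decision), not the paper's implementation:

```python
def multistage_risk(stages, delta, X, y):
    """Empirical system risk: an example pays the reject penalty delta[k+1]
    at every stage that rejects it, plus a 0/1 misclassification penalty
    at the stage that finally classifies it (the example then leaves)."""
    total = 0.0
    for x_full, label in zip(X, y):
        for k, f in enumerate(stages):
            decision = f(x_full[: k + 1])  # only the first k+1 measurements
            if decision == "r":            # reject: pay stage cost, stay active
                total += delta[k + 1]
            else:                          # classify: 0/1 loss, example exits
                total += 0.0 if decision == label else 1.0
                break
    return total / len(X)
```

A two stage toy system, where the first stage always rejects and the last stage always predicts +1, illustrates the bookkeeping.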

### 2.1 Bayesian setting

In this section, we will digress from the discriminative setting and analyze the problem under the assumption that the underlying distribution \(\mathcal{D}\) is known. In doing so, we hope to discover some fundamental structure that will simplify our empirical risk formulation in the next section.

### Theorem 1

*The optimal solution* \(f^{1}, f^{2}, \ldots, f^{K}\) *to the multi-stage risk in Eq*. (4) *decomposes into single stage optimizations*,

\[ f^{k} = \arg\min_{f}\; \mathbf{E}\bigl[\, \tilde{R}_{k}\bigl(\mathbf{x}^{k}, y, f, \tilde{\delta}^{k}\bigr) \bigm| \mathbf{x}^{k} \,\bigr], \quad k = 1, \ldots, K, \quad (5) \]

*where the modified risk* \(\tilde{R}_{k}\) *is the stage risk* \(R_{k}\) *with the constant reject cost replaced by the cost-to-go* \(\tilde{\delta}^{k}(\mathbf{x}^{k})\).

### Proof

For the *k*th stage minimization, *f* ^{ k } can take *C*+1 possible values, {1,2,…,*C*,*r*}, and the stage objective *J* _{ k }(**x** ^{ k },*S* ^{ k }) can be recast as a conditional expected risk minimization. Defining the cost-to-go \(\tilde{\delta}^{k}(\mathbf{x}^{k})\) as the expected risk incurred by the optimal policy from stage *k*+1 onward, the stage-wise form follows by backward induction. □

The main implication of this result is that if the cost-to-go function \(\tilde{\delta}^{k}(\mathbf{x}^{k})\) is known then the risk \(\tilde{R}_{k}(\cdot)\) is only a function of the current stage decision *f* ^{ k }. Therefore, we can ignore all of the other stages and minimize a single stage risk. Effectively, we decomposed the multi-stage problem in Eq. (4) into a stage-wise optimization in Eq. (5).

Note that the modified risk functional, \(\tilde{R}_{k}\), is remarkably similar to *R* _{ k }, except that the modified reject cost \(\tilde{\delta}^{k}(\mathbf{x}^{k})\) replaces the constant stage cost *δ* ^{ k }. Also, consider the range for which \(\tilde{\delta}^{k}(\mathbf{x}^{k})\) is meaningful. If we have *C* classes, then a random guessing strategy incurs an average risk of \(1-\frac{1}{C}\). Therefore, rejecting is only a meaningful option if \(\tilde{\delta}^{k}(\mathbf{x}^{k}) \leq 1-\frac{1}{C}\). The work in Chow (1970) contains a detailed analysis of a single stage reject classifier in a Bayesian setting.

In the analysis of the POMDP, we allowed multiple classes because it is a natural extension of the binary case. However, each stage still has *C*+1 decisions, and it is unclear how to parameterize such a multi-class classifier with a reject option in an empirical setting. Parameterizing regular multi-class learning is a difficult problem in itself, and most existing techniques (Allwein et al. 2001) reduce the problem to a series of binary learning problems. In our setting, the reject option cannot be treated as an additional class, since there are no ground truth labels for which examples should be rejected. So, in forming the empirical risk problem, we restrict ourselves to the binary setting, since it allows for an intuitive parametrization of a reject option, which we describe in the next section. We leave the multi-class setting as a subject of future research.

### Reject classifier as two binary decisions

Consider the stage *k* classifier with a reject option from Theorem 1 in a binary classification setting, *y*∈{−1,+1}. It is clear from the expression that we can express the decision regions in terms of two binary classifiers, *f* _{ n } and *f* _{ p }. Observe that, for a given reject cost \(\tilde{\delta}^{k}(\mathbf{x}^{k})\), the reject region is an intersection of two binary decision regions. To this end, we further modify the risk function in terms of the agreement and disagreement regions of the two classifiers, *f* _{ n } and *f* _{ p }, namely,

\[ L_{k}\bigl(\mathbf{x}^{k}, y, f_{n}, f_{p}, \tilde{\delta}^{k}\bigr) = \mathbb{1}_{\left[f_{p}(\mathbf{x}^{k}) = f_{n}(\mathbf{x}^{k})\right]}\, \mathbb{1}_{\left[y \neq f_{p}(\mathbf{x}^{k})\right]} + \tilde{\delta}^{k}\bigl(\mathbf{x}^{k}\bigr)\, \mathbb{1}_{\left[f_{p}(\mathbf{x}^{k}) \neq f_{n}(\mathbf{x}^{k})\right]}. \]

Note that the above loss function is symmetric between *f* _{ n } and *f* _{ p }, and so any optimal solution can be interchanged. Nevertheless, we claim:

### Theorem 2

*Suppose* *f* _{ n } *and* *f* _{ p } *are two binary classifiers that minimize* \(\mathbf{E}[ L_{k}(\mathbf{x}^{k},y,f_{n},f_{p},\tilde{\delta}^{k}) \mid \mathbf{x}^{k} ]\) *over all binary classifiers* *f* _{ n } *and* *f* _{ p }. *Then the resulting reject classifier*

\[ f^{k}\bigl(\mathbf{x}^{k}\bigr) = \begin{cases} +1, & f_{p}(\mathbf{x}^{k}) = f_{n}(\mathbf{x}^{k}) = +1, \\ -1, & f_{p}(\mathbf{x}^{k}) = f_{n}(\mathbf{x}^{k}) = -1, \\ \;\,r, & f_{p}(\mathbf{x}^{k}) \neq f_{n}(\mathbf{x}^{k}) \end{cases} \]

*is the minimizer of* \(\mathbf{E}[ \tilde{R}_{k}(\mathbf{x}^{k},y,f,\tilde{\delta}^{k}) \mid \mathbf{x}^{k} ]\) *in Theorem* 1 *and the* k*th stage minimizer in Eq*. (3).

### Proof

Fix **x** ^{ k } and \(\tilde{\delta}(\mathbf{x}^{k})\). By inspection, the decomposition in (15) is the optimal Bayesian classifier minimizing \(\mathbf{E}_{y} [ {\tilde{R}_{k}(\mathbf{x}^{k},y,f,\tilde{\delta}^{k})} \mid{ \mathbf{x}^{k}} ]\). The derivation uses an identity, relating agreement and disagreement indicators, that holds for any binary variables *a*, *b*, *c*. □
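The case structure of Theorem 2 is straightforward to operationalize; a minimal sketch (function names ours, with the string `"r"` standing for the reject decision):

```python
def reject_classifier(f_p, f_n):
    """Combine two binary classifiers into a classifier with a reject
    option: if they agree, output the common label; if they disagree,
    reject the example to the next stage."""
    def f(x):
        yp, yn = f_p(x), f_n(x)
        return yp if yp == yn else "r"
    return f
```

With `f_p` biased towards positives and `f_n` biased towards negatives, the disagreement region between the two thresholds becomes the reject region.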

### 2.2 Stage-wise empirical minimization

In this section, we assume that the probability model \(\mathcal{D}\) is no longer known and cannot be estimated due to the high dimensionality of the data. Instead, our task is to find multi-stage decision rules based on a given training set: (**x** _{1},*y* _{1}),(**x** _{2},*y* _{2}),…,(**x** _{ N },*y* _{ N }). Here, we consider the binary classification setting: *y* _{ i }∈{+1,−1}.

Our goal is to minimize the empirical version of *L* _{ k }(⋅) in Eq. (16). However, this requires knowledge of the cost-to-go, \(\tilde{\delta}^{k}(\mathbf{x}^{k})\). Instead of trying to learn this complex function, we define a point-wise empirical estimate of the cost-to-go on the training data. Given the current decision rules for the later stages, *f* ^{ k+1},…,*f* ^{ K }, the cost-to-go estimate \(\tilde{\delta}^{k}_{i}\) for each training example is conveniently defined by recursion: it is the empirical risk that example *i* accumulates if it is rejected at stage *k* and then processed by the remaining stages.

We minimize the resulting empirical risk at each stage *k* over some family of functions, \(\mathcal{F}^{k}\). Observe that, as in the standard setting, we need to constrain the class of decision rules. This is because, with no constraints, the minimum risk is equal to zero and can be achieved in the first stage itself.

Note that our stage-wise decomposition significantly simplifies the ERM. The objective in Eq. (18) is only a function of \(f_{p}^{k},f_{n}^{k}\) given \(\tilde{\delta}^{k}_{i}\) and the state \(S^{k}_{i}\). Minimizing an empirical version of the multi-stage risk in Eq. (3) directly is much more difficult due to stage interdependencies.

At the *k*th stage, we can solve (18) by alternating between \(f^{k}_{p}\) and \(f^{k}_{n}\). To solve for \(f^{k}_{p}\), we fix \(f^{k}_{n}\) and minimize a weighted classification error. We can solve for *f* _{ n } in the same fashion by fixing *f* _{ p }. To derive these expressions from (18), we used another identity that holds for any binary variables *a*, *b*, *c*.

Note the advantage of our parametrization from Theorem 2. We converted the problem from learning a complicated three region decision to learning two binary classifiers (*f* _{ p },*f* _{ n }), where learning each of the binary classifiers reduces to solving a weighted binary classification problem. This is desirable since binary classification is a very well studied problem, and existing machine learning techniques can be utilized here, as we will demonstrate in the next section.
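To make the reduction concrete, here is a sketch of how per-example weights for the \(f_{p}\) subproblem can be read off the agreement/disagreement loss with \(f_{n}\) fixed. The weight formula is our illustrative derivation (assuming \(\tilde{\delta}_{i} \leq \frac{1}{2}\), the meaningful range for binary problems), not a verbatim transcription of Eq. (19):

```python
def weights_for_fp(X, y, f_n, delta):
    """Per-example weights for fitting f_p with f_n fixed.
    If f_n already classifies x_i correctly, having f_p disagree with y_i
    only risks the reject cost delta_i; if f_n is wrong, agreeing with it
    costs a full error (1) while disagreeing (rejecting) costs delta_i,
    so predicting y_i is favored by weight 1 - delta_i."""
    w = []
    for x_i, y_i, d_i in zip(X, y, delta):
        w.append(d_i if f_n(x_i) == y_i else 1.0 - d_i)
    return w
```

Minimizing \(\sum_{i} w_{i}\, \mathbb{1}_{[f_{p}(\mathbf{x}_{i}) \neq y_{i}]}\) with these weights is then a standard weighted binary classification problem.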

## 3 Algorithm

Minimizing the indicator loss is a hard problem. Instead, we take the usual ERM (empirical risk minimization; Friedman et al. 2001) approach and replace the indicator with a surrogate. We introduce an algorithm in the boosting framework based on the analysis from the previous section. Boosting is just one of many possible machine learning approaches that could be used here; we use it because it is easy to implement and is known to have good performance.

Boosting is a way to combine simple classifiers to form a strong classifier. We are given a set of such weak classifiers, \(\mathcal{H}\); the set need not be finite. Also, denote by \(\mathcal{H}^{k} \subseteq \mathcal{H}\) the subset of weak classifiers that operate only on the first *k* measurements of **x**.

At each boosting iteration, a weak classifier *h* _{ j }(**x**) (a descent direction) is selected to be added to the linear combination classifier, *F*(**x**). Then a weight *q* _{ j } (an optimal step size in that direction) is computed. This is repeated until a termination criterion is reached. For details on boosting various losses, refer to Rosset et al. (2004).

### Global surrogate

In our algorithm, we use the sigmoid loss function \(\mathbf{C}(z)={1 \over1+\exp(z)} \) to approximate the indicator. Similar sigmoid based losses have been used in boosting before (Masnadi-Shirazi and Vasconcelos 2009). Each subproblem (19) reduces to boosting a weighted loss.
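A minimal implementation of this surrogate:

```python
import math

def sigmoid_loss(z):
    """Smooth surrogate C(z) = 1 / (1 + exp(z)) for the 0/1 indicator:
    close to 1 for very negative margins z, close to 0 for large
    positive margins, and differentiable everywhere."""
    return 1.0 / (1.0 + math.exp(z))
```

Unlike the exponential loss of AdaBoost, this surrogate is bounded, which makes it less sensitive to outliers.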

When optimizing a stage *k*, we keep the rest of the stages constant. To find \(f^{k}_{p}=\sum_{j} q_{j} h_{j}(\mathbf{x})\), we fix \(f_{n}^{k}\) and solve the resulting weighted surrogate minimization. Note that the weights *w* _{ i }, the state variables \(S^{k}_{i}\), and the cost-to-go \(\tilde{\delta}^{k}_{i}\) are also expressed in terms of **C**(*z*) instead of the indicator. To solve for \(f_{n}^{k}\), we solve the same problem but keep \(f_{p}^{k}\) constant instead. Note that the terms \(\tilde{\delta}_{i}^{k}\) and \(S_{i}^{k}\) do not depend on stage *k* and remain constant when solving for \(f_{p}^{k}\) and \(f_{n}^{k}\).

For ease of notation, we define a new term, **C** _{ r }, that indicates whether **x** _{ i } is rejected at the *k*th stage. The term is close to one if \(f^{k}_{p}\) and \(f^{k}_{n}\) disagree (reject) and small if they agree. Similarly, the state variable \(S^{k}_{i}\) is close to one if **x** _{ i } is rejected at every stage before the *k*th. The last two terms in the expression for the cost-to-go at the *k*th stage are simply a surrogate for *L* _{ k }(⋅) from (16) in terms of **C**(⋅). For the last stage, *K*, we fix the first *K*−1 stages and solve a standard weighted binary boosting problem.

Our algorithms performs cyclical optimization over the stages. To initialize \(f^{k}_{n},~f^{k}_{p}~\forall k\), we simply hard code \(f^{k}_{p}\) to classify any **x** as +1 and \(f^{k}_{n}\) as −1 so that all **x**’s are rejected to the last stage. Using these nominal classifiers, we compute \(S^{k}_{i}\) and \(\delta^{k}_{i}\) according to Eqs. (25) and (26), respectively.

At a stage *k*, for a fixed \(\delta^{k}_{i}\) and \(S^{k}_{i}\), we alternate between minimizing \(f^{k}_{p}\) and \(f^{k}_{n}\) according to Eqs. (22) and (24). In practice, we found that one iteration is sufficient.

Given a new estimate of stage *k*, we update \(\delta^{s}_{i}\) for *s*<*k* (the cost-to-go flows backward) and \(S^{s}_{i}\) for *s*>*k* (the state flows forward), and then move on to optimizing another stage *k*′. Given an estimate for stage *k*′, we again update the state variables and cost-to-go for the rest of the system.
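The cyclical sweep described above can be sketched as follows; `update_stage` and `update_state` are hypothetical hooks standing in for the boosted stage fit and the state/cost-to-go refresh of Eqs. (25) and (26):

```python
def cyclic_optimize(stages, update_stage, update_state, n_outer=10):
    """Outer loop of the global surrogate algorithm: sweep the stages
    cyclically; after re-fitting one stage, refresh the state variables
    and cost-to-go estimates for the rest of the system."""
    for _ in range(n_outer):
        for k in range(len(stages)):
            stages[k] = update_stage(k, stages)  # re-fit stage k
            update_state(k, stages)              # propagate its effect
    return stages
```

With stub hooks, the loop visits each stage `n_outer` times in order, which is all the orchestration the algorithm requires.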

Our formulation allows us to form a surrogate for the entire risk in Eq. (1), not just for each subproblem. This enables us to prove the following theorem,

### Theorem 3

*Our global surrogate algorithm converges to a local minimum*.

### Proof

This follows from the fact that we are minimizing a global smooth cost function by coordinate descent over \(\mathbf{q}_{p}^{1}, \mathbf{q}^{1}_{n}, \mathbf{q}_{p}^{2}, \mathbf{q}^{2}_{n}, \ldots, \mathbf{q}^{K}\). Here, \(\mathbf{q}^{k}_{p}\) is the vector of weak learner weights parameterizing \(f^{k}_{p}\). For the derivation of the global cost for a three stage system, refer to Appendix B. □

However, since the global loss and the loss for each subproblem are non-convex programs, there is no global optimality guarantee. Theorem 3 ensures that our algorithm terminates.

### Regularization to reduce overfitting

To reduce overtraining, we introduce a simple but effective regularization. For any loss **C**(*z*) and a parameter *λ*, we introduce a multiplicative term into the cost function: \(\sum_{i=1}^{n} \mathbf{C}(y_{i} f(\mathbf{x}_{i}))\,\exp(\lambda|\mathbf{q}|)\). The term exp(*λ*|**q**|) limits how large a step size for a weak hypothesis can become. It also introduces a simple stopping criterion: abort if \({\sum_{i=1}^{n} C'(y_{i} f_{t}(x_{i})) y_{i} h_{t+1}(x_{i}) \over\sum_{i=1}^{n} \mathbf{C}(y_{i} f_{t}(x_{i}))} \leq\lambda\), where \(C'(z)=\frac{dC(z)}{dz}\). This corresponds to the situation when no descent direction (weak hypothesis *h* _{ t+1}) can be found that decreases the regularized cost function.

## 4 Generalization error

### Theorem 4

*Let* \(\mathcal{D}\) *be a distribution on* \(\mathcal{X} \times \{-1,+1\}\), *and let* *S* *be a sample of* *m* *examples chosen independently at random according to* \(\mathcal{D}\), *with a rejected subsample of size* *m* _{ r }. *Assume that the base-classifier spaces* \(\mathcal{H}^{1}\) *and* \(\mathcal{H}^{2}\) *are finite*, *and let* *δ*>0. *Then, with probability at least* 1−*δ* *over the random choice of the training set* *S*, *all boosted classifiers* \(f^{1}_{n},f^{1}_{p},f^{2}\) *satisfy, for all* *θ* _{1}>0 *and* *θ* _{2}>0, *a bound consisting of the empirical margin errors at the two stages plus complexity terms that decrease with* *m* *and* *m* _{ r } *and are inversely proportional to the margins* *θ* _{1} *and* *θ* _{2}.

### Proof

The proof extends the approach in Schapire et al. (1998) to a two stage system. For complete details, please refer to the appendix. □

In words, the generalization error of the multistage classifier *F*(**x**) is bounded by the empirical margin error over the training set *S* plus a term that is inversely proportional to the margins and to the number of training samples at that stage. An interesting observation is that *m* _{ r }, the number of samples that reach the 2nd stage, depends on the reject classifier at the 1st stage. So, if very few examples make it to the second stage, then we do not have strong generalization for that stage.

## 5 Experiments

The goal is to demonstrate that a large fraction of data can be classified at an early stage using a cheap modality. In our experiments, we use four real life datasets with measurements arising from meaningful stages.

### 5.1 Related algorithms

We compare our algorithm to two methods:

### Myopic

The absolute margin of a classifier measures how confident the classifier is on an example. Examples with a small margin have low confidence and should be rejected to the next stage to acquire more features. This approach is based on reject classification (Bartlett and Wegkamp 2008). We know from Theorem 1 that the optimal classifier is a threshold of the posterior. For each stage, we obtain a binary boosted classifier, *f* ^{ k }(⋅), trained on all the data. We then threshold the margin of the classifier, |*f* ^{ k }(**x** ^{ k })|. It is known that, given an infinite amount of training data, boosting certain losses (the sigmoid loss in our case) approaches the log likelihood ratio, \(f(\mathbf{x})={1 \over2} \log{ \mathrm{P}(y=1|\mathbf{x}) \over \mathrm{P}(y=-1|\mathbf{x})}\) (Masnadi-Shirazi and Vasconcelos 2009). So a reject region for a given threshold *t* _{ k } is defined: {**x**∣|*f* ^{ k }(**x**)|≤*t* _{ k }}. This is a completely myopic approach, as the rejection does not take into account the performance of later stages. This method is very similar to TEFE (Liu et al. 2008), which also uses the absolute margin as a measure for rejection. The difference is that our myopic strategy uses a boosted classifier rather than the SVM used in TEFE.
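The myopic rule reduces to a one-line threshold test; a minimal sketch (names ours, with `"r"` standing for the reject decision):

```python
def myopic_reject(f_k, x_k, t_k):
    """Myopic rule: reject to the next stage when the classifier's
    absolute margin (confidence) falls below the threshold t_k;
    otherwise classify by the sign of the margin."""
    margin = f_k(x_k)
    if abs(margin) <= t_k:
        return "r"
    return 1 if margin > 0 else -1
```

Note that `t_k` is chosen per stage and nothing in the rule accounts for how well later stages handle the rejected examples.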

### Expected utility/margin

As in the myopic approach, for each stage we train a boosted classifier, *f* ^{ k }(**x** ^{ k }), on all the data. Given the measurements acquired up to the current stage, **x** ^{ k }, we compute an expected utility, *U*(**x** ^{ k }): the expected change in normalized margin from acquiring the next measurement *x* _{ k+1}, where the expectation is taken over the possible values that *x* _{ k+1} can take. An **x** ^{ k } is rejected to the next stage if its utility is greater than a threshold: *U*(**x** ^{ k })≥*t* _{ k }. Note that this approach requires estimating P(*x* _{ k+1}|**x** ^{ k }),^{2} so the (*k*+1)th measurement has to be discrete, or its distribution needs to be parametrized. Due to this limitation, we only compare against this method on two datasets.

### 5.2 Simulations

### Performance metric

A natural performance metric is the trade-off between system error and measurement cost. Note that, for the utility and myopic methods, it is unclear how to set the thresholds *t* _{ k } for each stage given a measurement cost *δ* _{ k }. For this reason, we only compare them in a two stage system. More than two stages is not practical, because we would need to test every possible *t* _{ k } for every stage *k*.

In a two stage setting, since every example has to pass through the first stage, only the cost of the second stage, *δ* _{2}, affects the performance. The average measurement cost of the system is proportional to *δ* _{1}+ (the fraction of examples rejected to the second stage) ×*δ* _{2}. So knowing the exact cost of the second stage sensor, *δ*=*δ* _{2}, is not necessary. In our algorithm, we vary *δ* to generate a system error vs. reject rate plot. For the margin and utility methods, we sweep a threshold *t* _{ k }. System error is the sum of the 1st stage and 2nd stage errors. Reject rate is the fraction of examples rejected to the 2nd stage, i.e. requiring additional measurements. A low reject rate (cost) corresponds to a higher error rate, as most of the data is classified at the first stage using less informative measurements. A high reject rate yields performance similar to a centralized classifier, as most examples are classified at the 2nd stage.
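The sweep that produces the error vs. reject rate curves can be sketched as follows, for the two stage case (an illustration with our own names; `stage1` returns a real-valued margin, `stage2` a label in {+1, −1}):

```python
def error_vs_reject_curve(stage1, stage2, thresholds, X1, X2, y):
    """Sweep the reject threshold of a two stage system and record
    (reject rate, system error) pairs: small-margin examples at the
    1st stage are rejected and classified by the 2nd stage."""
    curve = []
    for t in thresholds:
        errors = rejected = 0
        for x1, x2, label in zip(X1, X2, y):
            m = stage1(x1)
            if abs(m) <= t:               # reject to the 2nd stage
                rejected += 1
                pred = stage2(x2)
            else:                         # classify at the 1st stage
                pred = 1 if m > 0 else -1
            errors += pred != label
        curve.append((rejected / len(y), errors / len(y)))
    return curve
```

Sweeping `thresholds` from 0 to a large value traces the curve from "classify everything at stage 1" to the centralized limit.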

### Set up

In all our experiments, we use stumps as weak learners. A stump classifier *h* _{ d,g,s }∈{+1,−1} is parametrized by a threshold *g* on the *d*th dimension and a sign variable *s*∈{+1,−1}: *h* _{ d,g,s }(**x**)=*s*×sgn[*x* _{ d }−*g*]. We chose stumps for their simplicity, computational speed and relatively good performance. While more complicated weak learners, such as decision trees, could be used, they would only change the absolute performance of our experiments; the entire curves would simply shift vertically. Our goal is to demonstrate the advantage of a multi-stage classifier relative to the centralized system (a system that uses all the measurements for all examples).
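A minimal sketch of the stump and of the brute-force weighted-error search a boosting round would use (the helper names are ours, and we adopt the convention sgn(0)=+1):

```python
import numpy as np

def stump(x, d, g, s):
    """Decision stump h_{d,g,s}(x) = s * sgn(x[d] - g), with sgn(0) taken as +1."""
    return s if x[d] - g >= 0 else -s

def best_stump(X, y, w):
    """Brute-force search for the stump minimizing the weighted 0-1 error.

    X: (n, D) data matrix; y: labels in {+1, -1}; w: example weights.
    Returns ((d, g, s), weighted_error) for the best stump found.
    """
    n, D = X.shape
    best_params, best_err = None, np.inf
    for d in range(D):
        for g in np.unique(X[:, d]):       # candidate thresholds at data values
            for s in (+1, -1):
                preds = np.where(X[:, d] - g >= 0, s, -s)
                err = w[preds != y].sum()  # weighted misclassification cost
                if err < best_err:
                    best_params, best_err = (d, g, s), err
    return best_params, best_err
```

With equal weights this reduces to minimizing the training error; boosting re-weights `w` between rounds.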

The number of boosting iterations is set to *T*=50. In our global surrogate algorithm, the number of outer loop iterations is set to *D*=10. Refer to Table 1 for dataset descriptions and to Table 2 for a summary of our experiments.

Table 1 Dataset descriptions

| Name | Size | 1st stage | 2nd stage |
|---|---|---|---|
| Gaussian mixture | 1000 | 1st dim | 2nd dim |
| Mammogram mass | 830 | 3 CAD meas. | Radiologist rating |
| Pima diabetes | 810 | 6 simple tests: BMI, sex, … | 2 blood tests |
| Polyps | 310 | 12 freq. bins | 126 freq. bins |
| Threat | 1300 | Images in IR, PMMW | Images in AMMW |

Table 2 Performance summary for the different datasets (a quantitative view of the curves). Each dataset has 2 sensing modalities. *Centralized* denotes the test error obtained with all modalities. The *last three columns* report, for each approach, the average fraction of examples requiring the 2nd stage to achieve an error close to centralized. The utility approach does not apply to the last three datasets due to high dimensionality. Note the significant gains of our approach over the competing ones on many of the datasets

| Name | Centralized | Utility | Myopic | Ours |
|---|---|---|---|---|
| 2D Gaussian Mix | 0.09 | 50 % | – | 30 % |
| Mammogram | 0.165 | 60 % | – | 15 % |
| Pima diabetes | 0.26 | – | 60 % | 45 % |
| Polyps | 0.24 | – | 75 % | 50 % |
| Threat | 0.185 | – | 50 % | 45 % |

### Discrete valued data experiments

To compare our method to the utility approach, we consider discrete data. The first dataset is a quantized (with 20 levels) synthetic Gaussian mixture in two dimensions. The 1st dimension is stage one; the 2nd dimension is stage two. The second dataset is Mammogram Mass from the UCI Machine Learning Repository. It is used to predict the severity of a mammographic mass lesion (malignant or benign). It contains 3 attributes extracted from the CAD image, an evaluation by a radiologist on a confidence scale, and the true biopsy results. The first stage consists of the features extracted from the CAD image, and the second stage is the expert confidence, rated on a discrete scale of 1–5. Automatic analysis of the CAD image is cheaper than soliciting the opinion of a radiologist.
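A sketch of how such a quantized two-dimensional mixture could be generated (the class means, spread, and uniform quantization scheme here are our assumptions for illustration, not the paper's exact data):

```python
import numpy as np

def make_quantized_gaussian_mixture(n, levels=20, seed=0):
    """Two-class 2D Gaussian mixture, each dimension quantized to `levels` values.

    Assumed setup: class y in {-1, +1}, class-conditional mean (y, y), unit variance.
    """
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)
    X = rng.normal(loc=np.outer(y, [1.0, 1.0]), scale=1.0)
    # Uniform quantization of each dimension into `levels` bins.
    lo, hi = X.min(axis=0), X.max(axis=0)
    Xq = np.floor((X - lo) / (hi - lo + 1e-12) * levels)
    Xq = np.minimum(Xq, levels - 1)
    return Xq, y
```

Dimension 0 of `Xq` then serves as the stage-one measurement and dimension 1 as stage two.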

### Continuous valued data experiments

We compare our global method to the myopic method on three datasets. The Pima Indians Diabetes Dataset (UCI MLR) consists of 8 measurements. Since the stages are not specified in this dataset, we group measurements with similar costs into separate modalities. 6 of the measurements are inexpensive to acquire and consist of simple tests such as body mass index, age, and pedigree; these we designate as the first stage. The other two measurements constitute the second stage and require more expensive procedures.

The polyp dataset consists of hyper-spectral measurements of colon polyps collected during colonoscopies (Rodríguez-Díaz and Castañón 2009). The attributes are the measured intensities at 126 equally spaced frequencies. Finer resolution requires a higher photon count, which is proportional to acquisition time. For the first stage, we use a coarse measurement downsampled to only 12 frequency bins. The second stage is the full resolution frequency response. Using the coarse measurements is cheaper than acquiring the full resolution.

The threat dataset contains images taken of people wearing various explosive devices. The imaging is done in three modalities: infrared (IR), passive millimeter wave (PMMW), and active millimeter wave (AMMW). All the images are registered. We extract many patches from the images and use them as our training data. A patch carries a binary label: it either contains a threat or is clean. IR and PMMW are the fastest modalities but also the least informative. AMMW requires raster scanning a person and is slow but also the most useful.

The goal is to reach the performance of a centralized classifier (100 % reject rate) while utilizing the 2nd stage sensor for only a small fraction of examples. Overall, the results demonstrate the benefit of multi-stage classification: the rejection rate can be set to less than 50 % with only a small sacrifice in performance. For the mammogram data, this implies that for half of the patients a diagnosis can be made solely by automatic analysis of a CAD image, without the expensive opinion of a radiologist. For the Pima data, similar error can be achieved without the expensive medical procedures. For the polyps dataset, a fast low resolution measurement is enough to classify a large fraction of patients. In the threat dataset, IR and PMMW are sufficient to decide whether or not a threat is present for the majority of instances, without requiring a person to go through the slower AMMW scanner.

### Unbalanced false positive and false negative penalties

We consider unequal error penalties: *w* _{ p } for a Type I error and *w* _{ n } for a Type II error. The experiment in Fig. 7 demonstrates our global algorithm in this scenario. For each reject cost *δ*, we compute an ROC curve. This allows us to select an operating point of the system with a desired false alarm or detection rate. We also compute the corresponding average reject rate for each value of *δ*. The highest reject rate corresponds to the best performance but also to the highest acquisition cost incurred by the system. Note that very good performance can be achieved by requesting only 50 % of instances to be measured at the second stage.

### Three stages

We vary the reject costs for acquiring PMMW (2nd stage), *δ* _{1}, and AMMW (3rd stage), *δ* _{2}, to generate an error map (color in Fig. 8). A point on the map corresponds to the performance of a particular multistage classification strategy. The vertical axis is the fraction of examples for which only IR and PMMW measurements are used in making a decision. The horizontal axis is the fraction of examples for which all three modalities are used. For example, the red point in the figure, {.4,.15,.195}, corresponds to a system where 40 % of examples use IR and PMMW, 15 % use only IR, and the rest of the data (45 %) use all the modalities. This strategy achieves a system error rate of 19.5 %. Note that the support lies below the diagonal because the sum of the reject rates has to be less than one. The results yield some interesting observations. While the best performance (about 19 %) is achieved when all the modalities are used for every example, we can move along the vertical lines and allow a fraction of examples to be classified by IR and PMMW alone, avoiding AMMW altogether. This strategy achieves performance comparable to a centralized system (IR+PMMW+AMMW).
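The usage fractions plotted on such a map can be computed directly from the per-stage reject decisions; by construction they sum to one, which is why the support lies below the diagonal. A sketch with hypothetical names:

```python
import numpy as np

def stage_usage_fractions(reject1, reject2):
    """Fractions of examples stopping at each stage of a 3-stage cascade.

    reject1[i]: stage 1 (IR only) rejects example i to stage 2 (adds PMMW)
    reject2[i]: stage 2 rejects example i to stage 3 (adds AMMW);
                only meaningful when reject1[i] is True.
    """
    reject1 = np.asarray(reject1, bool)
    reject2 = np.asarray(reject2, bool)
    stop1 = ~reject1              # decided with IR only
    stop2 = reject1 & ~reject2    # decided with IR + PMMW
    stop3 = reject1 & reject2     # all three modalities used
    return stop1.mean(), stop2.mean(), stop3.mean()
```

Each (*δ* _{1}, *δ* _{2}) pair induces one such triple of fractions, i.e., one point on the error map.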

## 6 Conclusion

In this paper, we propose a general framework for a sequential decision system in a non-parametric setting. Starting from basic principles, we derive the Bayesian optimal solution. Then, to simplify the problem, we parameterize the classifier at each stage in terms of two binary decisions. We formulate an ERM problem and optimize it by alternately minimizing one stage at a time. Remarkably, all subproblems turn out to be weighted binary error minimizations. We introduce a practical boosting algorithm that minimizes a global surrogate of the empirical risk and test it on several datasets. The results show the advantage of our formulation over more heuristic approaches. Overall, our experiments demonstrate how multi-stage classifiers can achieve good performance by acquiring full measurements for only a fraction of samples.

## Footnotes

- 1.
To simplify our discussion, we consider equal error penalties. However, our approach can be easily extended to unbalanced error penalties as we will demonstrate in the experiments section.

- 2.
While there are many ways to estimate the likelihood, we used a Gaussian mixture due to its computational efficiency. The number of mixture components is equal to the number of discrete values that *x*_{2} can take from an alphabet 𝒳_{2}. The conditional P(*x*_{1}∣*x*_{2}=*j*) is a Gaussian whose parameters are learned from the training set. Using Bayes rule with the empirical priors, \(\mathrm{P}(x_{2} \mid x_{1})=\frac{\mathrm{P}(x_{1} \mid x_{2})\,\mathrm{P}(x_{2})}{\sum_{x' \in \mathcal{X}_{2}} \mathrm{P}(x_{1} \mid x_{2}=x')\,\mathrm{P}(x_{2}=x')}\).
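A sketch of this posterior computation (hypothetical helper; one univariate Gaussian per discrete value of *x*_{2}, with priors estimated from training data):

```python
import numpy as np

def posterior_x2_given_x1(x1, means, stds, priors):
    """P(x2 = j | x1) by Bayes rule, with Gaussian likelihoods P(x1 | x2 = j).

    means, stds, priors: per-component Gaussian parameters and priors,
    one entry per discrete value j of x2, learned from the training set.
    """
    x1 = np.asarray(x1, float)
    # Gaussian likelihood of x1 under each mixture component.
    lik = np.exp(-0.5 * ((x1 - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    post = lik * priors
    return post / post.sum()  # normalize over the alphabet of x2
```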

## Notes

### Acknowledgements

This work is partially supported by the U.S. DHS Award 2008-ST-061-ED000, NSF Grant 0932114 and NGA Grant HM1582-09-1-0037.

## References

- Allwein, E. L., Schapire, R. E., & Singer, Y. (2001). Reducing multiclass to binary: a unifying approach for margin classifiers. *Journal of Machine Learning Research*, *1*, 113–141.
- Bartlett, P. L., & Wegkamp, M. H. (2008). Classification with a reject option using a hinge loss. *The Journal of Machine Learning Research*, *9*, 1823–1840.
- Bilgic, M., & Getoor, L. (2007). Voila: efficient feature-value acquisition for classification. In *AAAI conference on artificial intelligence*.
- Chen, M., Xu, Z., Weinberger, K. Q., Chapelle, O., & Kedem, D. (2012). Classifier cascade: tradeoff between accuracy and feature evaluation cost. In *International conference on artificial intelligence and statistics*.
- Chow, C. (1970). On optimum recognition error and reject tradeoff. *IEEE Transactions on Information Theory*, *16*(1), 41–46. doi:10.1109/TIT.1970.1054406.
- Cordella, L. P., & Sansone, C. (2007). A multi-stage classification system for detecting intrusions in computer networks. *Pattern Analysis & Applications*, *10*(2), 83–100.
- El-Yaniv, R., & Wiener, Y. (2011). Agnostic selective classification. In *Advances in neural information processing systems*.
- Fan, W., Chu, F., Wang, H., & Yu, P. S. (2002). Pruning and dynamic scheduling of cost-sensitive ensembles. In *AAAI conference on artificial intelligence*.
- Fan, W., Lee, W., Stolfo, S. J., & Miller, M. (2000). A multiple model cost-sensitive approach for intrusion detection. In *European conference on machine learning*.
- Friedman, J., Hastie, T., & Tibshirani, R. (2001). *The elements of statistical learning*. Springer series in statistics: Vol. 1. Berlin: Springer.
- Grandvalet, Y., Rakotomamonjy, A., Keshet, J., & Canu, S. (2008). Support vector machines with a reject option. In *Advances in neural information processing systems*.
- Ji, S., & Carin, L. (2007). Cost-sensitive feature acquisition and classification. *Pattern Recognition*, *40*(5), 1474–1485.
- Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. *Artificial Intelligence*, *101*(1), 99–134.
- Kanani, P., & Melville, P. (2008). Prediction-time active feature-value acquisition for cost-effective customer targeting. In *Advances in neural information processing systems*.
- Kapoor, A., & Horvitz, E. (2009). Breaking boundaries: active information acquisition across learning and diagnosis. In *Advances in neural information processing systems*.
- Lee, W., Fan, W., Miller, M., Stolfo, S. J., & Zadok, E. (2002). Toward cost-sensitive modeling for intrusion detection and response. *Journal of Computer Security*, *10*(1), 5–22.
- Liu, L. P., Yu, Y., Jiang, Y., & Zhou, Z. H. (2008). TEFE: a time-efficient approach to feature extraction. In *International conference on data mining*.
- MacKay, D. J. (1992). Information-based objective functions for active data selection. *Neural Computation*, *4*(4), 590–604.
- Masnadi-Shirazi, H., & Vasconcelos, N. (2009). On the design of loss functions for classification: theory, robustness to outliers, and savageboost. In *Advances in neural information processing systems*.
- Rodríguez-Díaz, E., & Castañón, D. A. (2009). Support vector machine classifiers for sequential decision problems. In *IEEE conference on decision and control*.
- Rosset, S., Zhu, J., & Hastie, T. (2004). Boosting as a regularized path to a maximum margin classifier. *The Journal of Machine Learning Research*, *5*, 941–973.
- Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. *The Annals of Statistics*, *26*(5), 1651–1686.
- Sheng, V. S., & Ling, C. X. (2006). Feature value acquisition in testing: a sequential batch test algorithm. In *International conference on machine learning* (pp. 809–816).
- Trapeznikov, K., Saligrama, V., & Castañon, D. A. (2012). Multi-stage classifier design. In *Asian conference on machine learning*.
- Viola, P., & Jones, M. J. (2004). Robust real-time face detection. *International Journal of Computer Vision*, *57*(2), 137–154.
- Wang, H., Fan, W., Yu, P. S., & Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In *Knowledge discovery and data mining*.
- Yuan, C., & Casasent, D. (2003). A novel support vector classifier with better rejection performance. In *Computer vision and pattern recognition*.
- Zhang, C., & Zhang, Z. (2010). *A survey of recent advances in face detection* (Microsoft Research technical report).
- Zubek, V. B., & Dietterich, T. G. (2002). Pruning improves heuristic search for cost-sensitive learning. In *International conference on machine learning*.