
1 Introduction

The branch-and-bound technique is a useful tool in computer science, with applications in fields as remote as DNA regulatory motif finding [8] and \(\alpha \)-\(\beta \) pruning in game playing. In this paper we adopt the technique to train cascades of classifiers.

Cascades were designed to work as classifying systems operating under the following two conditions: (1) a very large number of incoming requests, (2) significant class imbalance. The second condition should not be seen as a difficulty but rather as a favorable setting that makes the whole idea viable. Namely, a cascade should vary its computational effort depending on the contents of the object to be classified. Objects that are obvious negatives (non-targets) should be recognized quickly, using only a few extracted features. Targets, or objects resembling them, are allowed to consume more features and computation time.

Despite the development of deep learning, recent literature shows that cascades of classifiers are still widely applied in detection systems or batch classification jobs. Let us list a few examples: crowd analysis and people counting [1], human detection in thermal images [6], localization of white blood cells [4], eye tracking [9, 11], detection of birds near high power electric lines [7].

There exists a certain average value of the computational cost incurred by an operating cascade. It can be defined mathematically as an expected value and, in fact, calculated explicitly for a given cascade (we do this in Sect. 2.3) in terms of: the number of features applied on successive stages, the false alarm and detection rates on successive stages, and the probability distribution from which the data is drawn. Since the true distribution underlying the data is typically unknown in practice, the exact expected value cannot be determined. Interestingly though, it can be accurately approximated using just the feature counts and false alarm rates.

Training procedures for cascades are time-consuming, taking days or even weeks. As Viola and Jones noted in their pioneering work [18], cascade training is a difficult combinatorial optimization involving many parameters: the number of stages, the number of features on successive stages, the selection of those features, and finally the decision thresholds. The problem has not been ultimately solved yet. Viola and Jones tackled it by imposing the final requirements the whole cascade should meet in order to be accepted, defined by a pair of numbers (A, D), where A denotes the largest allowed false alarm rate (FAR), and D the smallest allowed detection rate (sensitivity). Due to the probabilistic properties of the cascade structure, one can translate the final requirements into per-stage requirements as geometric means: \(a_{\max }\!=\!A^{1/K}\) and \(d_{\min }\!=\!D^{1/K}\), where K is the fixed number of stages.

Many modifications to cascade training have been introduced over the years. Most of them try out different feature selection approaches or subsampling methods, or are simply tailored to a particular type of features [3, 10, 12, 17] (e.g. Haar, HOG, LBP, etc.). Some authors obtain modified cascades by designing new boosting algorithms that underlie the training [14, 15], but due to mathematical difficulties, the expected number of features is seldom the main optimization criterion. One of the few exceptions is the elegant work by Saberian and Vasconcelos [14]. The authors use gradient descent to explicitly optimize a Lagrangian representing the trade-off between the cascade’s error rate and its operating cost (expected value). They use a trick that translates non-differentiable recursive formulas into smooth ones using hyperbolic tangent approximations. The approach is analytically tractable but expensive, because all cascade stages are kept open during training. In every step one has to check the variational derivatives based on the features at disposal for all open stages.

The main contribution of this paper is an algorithm—or in fact a general framework—for training cascades of classifiers via a tree search approach and the branch-and-bound technique. Successive tree levels correspond to successive cascade stages. Sibling nodes represent variants of the same stage with different numbers of features applied. We provide suitable formulas for lower bounds on the expected value being optimized. During the ongoing search, we observe the lower bounds, and whenever a bound for some tree branch is greater than (or equal to) the best-so-far expectation, the branch is pruned. Once the search is finished, one of the paths from the root to some terminal node indicates the cascade with the smallest expected number of features. Apart from the exact approach to pruning, we additionally propose an approximate one, based on suitable predictions of expected values.

2 Preliminaries

2.1 Notation

Throughout this paper we use the following notation:

  • K — number of cascade stages,

  • \(n=(n_1, n_2, \ldots , n_K)\)—numbers of features used on successive stages,

  • \((a_1, a_2, \ldots , a_K)\)—FAR values on successive stages (false alarm rates),

  • \((d_1, d_2, \ldots , d_K)\)—sensitivities on successive stages (detection rates),

  • A—required FAR for the whole cascade,

  • D—required detection rate (sensitivity) for the whole cascade,

  • \(F=(F_1, F_2, \ldots , F_K)\)—ensemble classifiers on successive stages (the cascade),

  • \(A_k\)—FAR observed up to k-th stage of cascade (\(A_k=\prod _{1\leqslant i \leqslant k} a_i\)),

  • \(D_k\)—sensitivity observed up to k-th stage of cascade (\(D_k=\prod _{1\leqslant i \leqslant k} d_i\)),

  • \((p, 1-p)\)—true probability distribution of classes (unknown in practice),

  • \(\mathcal {D}, \mathcal {V}\)—training and validation data sets,

  • \(\#\)—set size operator (cardinality of a set),

  • \(\Vert \)—concatenation operator (to concatenate cascade stages).

The probabilistic meaning of relevant quantities is as follows. The final requirements (A, D) demand that: \(P\left( F(\mathbf {x}){=}+|y{=}-\right) {\leqslant } A\) and \(P\left( F(\mathbf {x}){=}+|y{=}+\right) {\geqslant } D\), whereas false alarm and detection rates observed on particular stages are, respectively, equal to:

$$\begin{aligned} a_k&= P\left( F_k(\mathbf {x}){=}+|y{=}-,F_{1}(\mathbf {x}){=}\cdots {=}F_{k-1}(\mathbf {x}){=}+\right) , \nonumber \\ d_k&= P\left( F_k(\mathbf {x}){=}+|y{=}+,F_{1}(\mathbf {x}){=}\cdots {=}F_{k-1}(\mathbf {x})=+\right) . \end{aligned}$$
(1)

2.2 Classical Cascade Training Algorithm (Viola-Jones Style)

The classical cascade training algorithm given below (Algorithm 1) can be treated as a reference for new algorithms we propose.

Algorithm 1 (pseudocode figure)

Please note, in the final line of the pseudocode, that we return \((F_1, F_2, \ldots , F_k)\) rather than \((F_1, F_2, \ldots , F_K)\). This is because the training procedure can potentially stop earlier, when \(k<K\), provided that the final requirements (A, D) for the entire cascade are already satisfied, i.e. \(A_k\leqslant A\) and \(D_k\geqslant D\).

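To make the flow of Algorithm 1 concrete, the following minimal Python sketch reproduces the classical loop as described above. The helpers train_stage, adjust_threshold, measure_rates and resample_negatives are hypothetical placeholders for the corresponding pseudocode steps; boosting details are omitted.

```python
def train_classical_cascade(data, A, D, K, valid):
    a_max, d_min = A ** (1.0 / K), D ** (1.0 / K)    # per-stage requirements (geometric means)
    cascade, A_k, D_k = [], 1.0, 1.0
    for k in range(1, K + 1):
        F_k = train_stage(data, a_max, d_min)        # boost an ensemble for stage k
        adjust_threshold(F_k, data, d_min)           # see the explanation below
        a_k, d_k = measure_rates(F_k, valid)         # observed FAR and sensitivity of stage k
        A_k, D_k = A_k * a_k, D_k * d_k
        cascade.append(F_k)
        if A_k <= A and D_k >= D:                    # final requirements already satisfied,
            break                                    # so (F_1, ..., F_k) with k < K is returned
        data = resample_negatives(data, cascade)     # keep only negatives passing all stages so far
    return cascade
```
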
The step “Adjust decision threshold” requires a more detailed explanation. The real-valued response of any stage can be suitably thresholded to obtain a desired sensitivity or FAR. Hence, the resulting \(\{-1,+1\}\)-decision of a stage is, in fact, calculated as the sign of the expression

$$\begin{aligned} F_k(\mathbf {x})-\theta _k, \end{aligned}$$

where \(\theta _k\) represents the decision threshold. Suppose \((v_1,v_2,\ldots ,v_{\#\mathcal {P}})\) denotes a sequence of sorted, \(v_i\leqslant v_{i+1}\), real-valued responses of a new cascade stage \(F_{k+1}\) obtained on positive examples (subset \(\mathcal {P}\)). Then, the \(d_{\min }\) per-stage requirement can be satisfied by simply choosing: \(\theta _{k+1} = v_{\lfloor (1-d_{\min }) \cdot \#\mathcal {P}\rfloor }\).

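For illustration, the threshold choice above can be sketched as follows (a minimal sketch assuming 0-based indexing of the sorted responses; responses_on_positives is a hypothetical helper).

```python
import math

def adjust_threshold(stage, positives, d_min):
    """Pick theta_{k+1} so that (at least) a d_min fraction of positives passes the stage.
    responses_on_positives is a hypothetical helper returning F_{k+1}(x) for x in P."""
    v = sorted(responses_on_positives(stage, positives))   # v_1 <= ... <= v_{#P}
    idx = math.floor((1.0 - d_min) * len(v))               # roughly (1 - d_min)#P positives fall below
    stage.theta = v[idx]                                   # stage decision: sign(F_{k+1}(x) - theta)
    return stage.theta
```
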
2.3 Expected Number of Extracted Features

Definition-Based Formula. A cascade stops operating after a certain number of stages. It does not stop in the middle of a stage. Therefore the possible outcomes of the random variable of interest, describing the disjoint events, are: \(n_1\), \(n_1+n_2\), ..., \(n_1 + n_2 + \cdots + n_K\). Hence, by the definition of expected value, the expected number of features can be calculated as follows:

$$\begin{aligned} E(n)=\sum _{1\leqslant k \leqslant K} \Bigl (\sum _{1\leqslant i\leqslant k} n_i\Bigr ) \Biggl (p \bigl (\prod _{1\leqslant i<k} d_i\bigr ) (1-d_k)^{[k<K]} +(1-p) \bigl (\prod _{1\leqslant i<k} a_i\bigr ) (1-a_k)^{[k<K]}\Biggr ), \end{aligned}$$
(2)

where \([\cdot ]\) is an indicator function.

Incremental Formula and Its Approximation. By grouping the terms in (2) with respect to \(n_k\) the following alternative formula can be derived:

$$\begin{aligned} E(n)=\sum _{1\leqslant k \leqslant K} n_k \left( p \prod _{1\leqslant i<k} d_i + (1-p) \prod _{1\leqslant i<k} a_i\right) . \end{aligned}$$
(3)

Obviously, in practical applications the true probability distribution underlying the data is unknown. Since the probability p of the positive class is very small (typically \(p<10^{-4}\)), the expected value can be accurately approximated using only the summands related to the negative class as follows:

$$\begin{aligned} \widehat{E}(n)=\sum _{1\leqslant k \leqslant K} n_k \prod _{1\leqslant i<k} a_i \approx E(n). \end{aligned}$$
(4)

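For concreteness, formulas (3) and (4) can be evaluated directly from the per-stage quantities; the sketch below assumes plain Python sequences for n, a and d.

```python
def expected_features(n, a, d, p):
    """Exact expectation E(n) according to formula (3)."""
    E, pass_pos, pass_neg = 0.0, 1.0, 1.0        # running products of d_i and a_i for i < k
    for n_k, a_k, d_k in zip(n, a, d):
        E += n_k * (p * pass_pos + (1.0 - p) * pass_neg)
        pass_pos *= d_k
        pass_neg *= a_k
    return E

def approx_expected_features(n, a):
    """Approximation (4): the contribution of the (rare) positive class is neglected."""
    E, pass_neg = 0.0, 1.0
    for n_k, a_k in zip(n, a):
        E += n_k * pass_neg
        pass_neg *= a_k
    return E
```

For instance, for \(n=(10,20,40)\), \(a=(0.3,0.3,0.3)\) and \(p\approx 0\), formula (4) gives \(10 + 20\cdot 0.3 + 40\cdot 0.09 = 19.6\) features extracted on average.
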
It is also interesting to remark that in the original Viola and Jones’ paper [18] the authors proposed an incorrect formula to estimate the expected number of features, namely:

$$\begin{aligned} E_\text {VJ}(n) = \sum \limits _{k=1}^K n_k \prod \limits _{i=1}^{k-1} r_i, \end{aligned}$$
(5)

where \(r_i\) represents the “positive rate” of the i-th stage. This is equivalent to

$$\begin{aligned} E_\text {VJ}(n) = \sum \limits _{k=1}^K n_k \prod \limits _{i=1}^{k-1} (pd_i + (1-p)a_i). \end{aligned}$$
(6)

Please note that by multiplying the positive rates of stages, one obtains mixed terms of the form \(d_i\cdot a_j\) that do not have any probabilistic sense. For example, for \(k = 3\) the product under the summation becomes \(\left( p d_1 + (1-p)a_1\right) \left( p d_2+(1-p)a_2\right) \), with the terms \(d_1 a_2\) and \(a_1 d_2\) being meaningless, because a fixed data point does not change its class label while traveling along the cascade.

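Under the same conventions as above, the estimate (6) can be sketched as follows; note that each factor of the accumulated product mixes the two classes, which is where the meaningless cross-terms come from.

```python
def expected_features_vj(n, a, d, p):
    """The Viola-Jones estimate (6); expanding the product of positive rates
    introduces cross-terms d_i * a_j without probabilistic meaning."""
    E, pass_rate = 0.0, 1.0                      # running product of positive rates r_i, i < k
    for n_k, a_k, d_k in zip(n, a, d):
        E += n_k * pass_rate
        pass_rate *= p * d_k + (1.0 - p) * a_k
    return E
```
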
3 Cascade Training as a Tree Search

In stage-wise training procedures, each stage, once fixed, must not be altered. The paper [14], discussed in the introduction, represents the opposite approach, where any stage can be extended with a weak classifier at any time. The approach we propose is in-between the two mentioned above. It provides more flexibility than stage-wise training and simultaneously avoids the high complexity of [14].

We treat cascade training as a tree search process. The root of the tree represents an empty cascade. Successive tree levels correspond to successive cascade stages. Each non-terminal tree node has an odd number of child nodes, representing variants of the subsequent stage with slightly different numbers of features. The children are processed recursively from left to right until the stop condition is met. It should be understood that the nodes are not simply generated mechanically but, in fact, trained as ensemble classifiers.

The size of the tree is controlled by two integer parameters L and C, predefined by the user. To keep the tree fairly small, the branching of variants takes place only at the L top-most levels, e.g. \(L=2\). At those levels the branching factor is equal to C, an odd number, e.g. \(C=5\). At deeper levels the branching factor is one. Therefore, the actual branching affects only the initial stages, which have the largest impact on the expected number of features. Once the tree search is finished, one of the paths from the root to some terminal node indicates the best cascade, i.e. the one with the smallest expectation.

For notation purposes, child nodes that are variants of the same stage use an additional subindex. For example, the classifier \(F_{1,0}\) denotes the main variant of the first stage (using a certain number of features) and is graphically represented as the middle child. Its left siblings \(F_{1,-1},F_{1,-2},\ldots \) denote classifiers using fewer features (one less, two less, etc.). The right siblings \(F_{1,+1},F_{1,+2},\ldots \) use more features than the middle child (one more, two more, etc.). This notation is used only locally, within single recursive calls (globally it would be ambiguous).

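As a small illustration of the branching scheme, the feature counts of the C sibling variants can be derived from the middle child by symmetric offsets; the exact offsets (one feature per sibling in this sketch) follow the description above and remain a design choice.

```python
def sibling_feature_counts(n_middle, C):
    """Feature counts of the C variants of one stage (C odd): the middle child keeps
    n_middle features, left siblings drop one each, right siblings add one each."""
    half = C // 2
    return [max(1, n_middle + offset) for offset in range(-half, half + 1)]

# e.g. sibling_feature_counts(12, 5) -> [10, 11, 12, 13, 14]
```
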
3.1 Pruning Search Tree Using Current Partial Expectations—Exact Branch-and-bound

During an ongoing tree search (combined with cascade training) one can observe partial values for the expected value of interest — formula (4). Suppose a new \((k+1)\)-th stage has been completed, revealing \(n_{k+1}\) features. The formula

$$\begin{aligned} \widehat{E}\Bigl ((n_1, \ldots , n_{k + 1})\Bigr ) = \sum _{1\leqslant j \leqslant k} n_j \prod _{1\leqslant i< j} a_i + n_{k+1} \prod _{1\leqslant i< k+1} a_i = \widehat{E}\Bigl ((n_1, \ldots , n_k)\Bigr ) + n_{k+1} \prod _{1\leqslant i < k+1} a_i \end{aligned}$$
(7)

expresses the partial expectation for the extended cascade in an incremental manner. It should be clear that whenever a partial expectation for some tree branch is greater than (or equal to) the best-so-far exact expectation, say \(\widehat{E}\bigl ((n_1, \ldots , n_{k + 1})\bigr )\geqslant \widehat{E}^*\), then there is no point in pursuing that branch further down the tree. In other words, pruning can be applied because formula (7) provides a lower bound on the final unknown expectation.

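In code, the incremental update (7) and the exact pruning test amount to a few lines; prod_a stands for the accumulated product \(a_1\cdots a_k\).

```python
def extend_partial_expectation(E_partial, prod_a, n_next):
    """Incremental update (7): contribution of a newly trained stage with n_next features."""
    return E_partial + n_next * prod_a          # prod_a = a_1 * ... * a_k

def should_prune(E_extended, E_best):
    """Exact test: (7) lower-bounds the final expectation, so >= E_best means prune."""
    return E_extended >= E_best
```
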
Figure 1 provides a symbolic illustration of a search tree with pruning. In the figure, the subindexes \(E_1,E_2,\ldots \) are meant to indicate chronologically the partial expected values observed on the successive branches as the tree is being traversed from left to right. Crossed-out lines represent the pruned branches.

Fig. 1. Cascade training as a tree search with pruning—example illustration.

Algorithm 2 states the approach as a recursion. A single recursive call can be summarized as follows. It takes as input a partial cascade F with k stages and trains the new \((k+1)\)-th stage in its main variant \(F_{k+1,0}\), which we refer to as the middle child. Then, the algorithm “branches” the stage (if the level is not greater than L) by creating clones of the middle child with fewer features: \(F_{k+1,-1},F_{k+1,-2},\ldots \) (left children), and with more features: \(F_{k+1,+1},F_{k+1,+2},\ldots \) (right children). The algorithm iterates over all children and performs recursive calls to train their subsequent stages, provided that the lower bound (7) on the final expectation is not worse than the best expectation \(\widehat{E}^*\) so far. A recursion path, representing some cascade, reaches its stopping point when the final requirements (A, D) are satisfied and its expected value is strictly less than \(\widehat{E}^*\) (initially set to \(\infty \)). The outermost recursive call is

$$\begin{aligned} \textsc {TrainTreeCascade}\left( \mathcal {D},A,D,K,0,\mathcal {V},(),L,C,\text {null},\infty \right) \end{aligned}$$

yielding a pair of results: the best cascade \(F^*\) and its expectation \(\widehat{E}^*\).

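The sketch below condenses our reading of Algorithm 2 into Python; train_stage_variants, per_stage_requirements, satisfied and resample_negatives are hypothetical placeholders for the corresponding subroutines, and bookkeeping details (validation splits, stage thresholds) are omitted.

```python
def train_tree_cascade(data, A, D, K, k, valid, F, L, C, best, E_best,
                       E_partial=0.0, prod_a=1.0):
    # Per-stage requirements for stage k+1, computed as in (8)-(10).
    a_max, d_min = per_stage_requirements(A, D, K, k, F)
    # Train the middle child and, at the top L levels, its C - 1 sibling variants.
    children = train_stage_variants(data, a_max, d_min,
                                    branching=C if k < L else 1)
    for F_child in children:                          # left-to-right traversal
        n_child, a_child = F_child.n_features, F_child.far(valid)
        E_ext = E_partial + n_child * prod_a          # incremental lower bound (7)
        if E_ext >= E_best:
            continue                                  # prune this branch
        F_ext = F + [F_child]
        if satisfied(F_ext, A, D, valid):             # final requirements (A, D) met
            best, E_best = F_ext, E_ext               # new best cascade found
        elif k + 1 < K:
            best, E_best = train_tree_cascade(
                resample_negatives(data, F_ext), A, D, K, k + 1, valid,
                F_ext, L, C, best, E_best, E_ext, prod_a * a_child)
    return best, E_best

# Outermost call, mirroring the one above:
# best, E_best = train_tree_cascade(train_data, A, D, K, 0, valid, [], L, C,
#                                   None, float("inf"))
```
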
Inside the subroutine TrainStage we train a single ensemble using per-stage requirements. They can be calculated as standard geometric means (classical VJ style), leading to constant per-stage requirements throughout the training, or as updated geometric means (UGM), either uniform or greedy. The formulas below represent the three options.

$$\begin{aligned} \text {VJ}:\quad a_{\text {max},k+1}= A^{1/K},\qquad d_{\text {min},k+1}= D^{1/K}. \end{aligned}$$
(8)
$$\begin{aligned} \text {UGM}:\quad a_{\text {max},k+1}= \Bigl (A \Big / \prod \limits _{1\leqslant i\leqslant k} a_i\Bigr )^{1/(K-k)},\qquad d_{\text {min},k+1}= \Bigl (D \Big / \prod \limits _{1\leqslant i\leqslant k} d_i\Bigr )^{1/(K-k)}. \end{aligned}$$
(9)
$$\begin{aligned} \text {UGM-G}:\quad a_{\text {max},k+1}= A^{(k+1)/K} \Big / \prod \limits _{1\leqslant i\leqslant k} a_i,\qquad d_{\text {min},k+1}= D^{(k+1)/K} \Big / \prod \limits _{1\leqslant i\leqslant k} d_i. \end{aligned}$$
(10)

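A compact sketch of the three options (8)–(10) for the FAR requirement is given below; the \(d_{\min}\) counterpart is analogous, with D and the detection rates.

```python
def per_stage_far_requirement(A, a_so_far, K, k, mode="VJ"):
    """FAR requirement a_max,k+1 according to (8)-(10); a_so_far = (a_1, ..., a_k)."""
    prod_a = 1.0
    for a_i in a_so_far:
        prod_a *= a_i
    if mode == "VJ":        # (8): constant geometric mean of the total budget
        return A ** (1.0 / K)
    if mode == "UGM":       # (9): spread the remaining budget uniformly over K - k stages
        return (A / prod_a) ** (1.0 / (K - k))
    if mode == "UGM-G":     # (10): greedily claim the whole budget allotted up to stage k+1
        return A ** ((k + 1) / K) / prod_a
    raise ValueError(mode)
```
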
3.2 Pruning Search Tree Using Expectation Predictions—Approximate Branch-and-bound

Suppose we have completed the training of stage \(k+1\) and would like to make a prediction about the partial expectation for stage \(k+2\) without training it. Obviously, the training of any stage is time-consuming, hence a significant gain can be obtained by not wasting time on a stage that is not going to improve the best-so-far expectation. Observe that when stage \(k+1\) is completed, we get to know two new pieces of information: \(n_{k+1}\) and \(a_{k+1}\). The second piece is not needed to calculate formula (7) for stage \(k+1\), but it is needed for stage \(k+2\). Therefore, the only unknown preventing us from calculating the exact partial expectation for stage \(k+2\) is \(n_{k+2}\). We are going to approximate it.

Algorithm 2 (pseudocode figure)

As cascade experiments on real data show, the feature counts \((n_k)_{k=1,\ldots ,K}\) typically form a non-decreasing sequence. Counter-examples exist, but in the vast majority of cases \(n_{k+1}\geqslant n_k\) holds. Therefore, to build our prediction it could be sufficient to lower-bound \(n_{k+2}\) by \(n_{k+1}\). Instead, we prefer a safer, parameterized approach, assuming:

$$\begin{aligned} n_{k+2} \geqslant \alpha ~\! n_{k+1}, \end{aligned}$$
(11)

where the parameter \(\alpha \) can be selected, e.g., from the interval [0.5, 1.5]. The following lines demonstrate explicitly the prediction we apply:

$$\begin{aligned}&\widehat{E}\Bigl ((n_1, \ldots , n_{k + 2})\Bigr ) = \widehat{E}\Bigl ((n_1, \ldots , n_{k})\Bigr ) + n_{k+1} \prod _{1\leqslant i< k + 1} a_i + n_{k+2} \prod _{1\leqslant i< k+2} a_i \nonumber \\&\qquad \qquad \quad \approx \widehat{E}\Bigl ((n_1, \ldots , n_{k})\Bigr ) + n_{k+1} \prod _{1\leqslant i< k+1} a_i + \alpha \, n_{k+1} \Bigl (\prod _{1\leqslant i < k+1} a_i\Bigr ) a_{k+1} \equiv \widehat{E}_\alpha . \end{aligned}$$
(12)

The influence of the parameter \(\alpha \) can be described as follows. By lowering \(\alpha \), one decreases the risk of pruning a branch incorrectly, but simultaneously strengthens the underestimation of the expected value, which can lead to continuing the training despite a negligible chance of improvement. In contrast, higher \(\alpha \) values lead to more pruning, but with some risk of missing the optimal solution. Additionally, it is worth remarking that the prediction is made only one stage ahead, ignoring all subsequent stages. Since those stages will also contribute their summands to the final expectation, high \(\alpha \) values should still be safe, especially at the initial levels.

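The resulting approximate test is cheap to evaluate; the sketch below computes \(\widehat{E}_\alpha \) from quantities already known after training stage \(k+1\).

```python
def predicted_expectation(E_partial_k, n_k1, prod_a_k, a_k1, alpha):
    """Approximate look-ahead (12): partial expectation after stage k+2, predicted
    without training it, under the assumption n_{k+2} >= alpha * n_{k+1}.
    prod_a_k is the product a_1 * ... * a_k."""
    term_k1 = n_k1 * prod_a_k                  # exact contribution of stage k+1, as in (7)
    term_k2 = alpha * n_k1 * prod_a_k * a_k1   # predicted contribution of stage k+2
    return E_partial_k + term_k1 + term_k2

# Approximate branch-and-bound: skip training stage k+2 whenever
# predicted_expectation(...) >= E_best.
```
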
Algorithm 3 represents the described approach for cascade training based on tree search and approximate pruning.

Algorithm 3 (pseudocode figure)

4 Experiments

In all experiments we apply RealBoost+bins [13] as the main learning algorithm, producing ensembles of weak classifiers as successive cascade stages. Each weak classifier is based on a single selected feature.

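For readers unfamiliar with bin-based real-valued weak learners, the sketch below shows a common way such a classifier is fitted on a single feature (weighted class masses per bin, half log-ratio response); the exact variant used in [13] may differ in details.

```python
import numpy as np

def fit_binned_weak_classifier(feature_values, labels, weights, B=8, eps=1e-6):
    """Fit one bin-based real-valued weak classifier on a single feature:
    split the feature range into B bins and respond with half the log-ratio
    of the weighted class masses falling into each bin."""
    lo, hi = feature_values.min(), feature_values.max()
    bins = np.clip(((feature_values - lo) / (hi - lo + eps) * B).astype(int), 0, B - 1)
    w_pos = np.bincount(bins, weights * (labels == +1), minlength=B)
    w_neg = np.bincount(bins, weights * (labels == -1), minlength=B)
    responses = 0.5 * np.log((w_pos + eps) / (w_neg + eps))
    return lo, hi, responses

def weak_response(x, lo, hi, responses, eps=1e-6):
    """Real-valued response of the fitted weak classifier for a new feature value x."""
    B = len(responses)
    j = int(np.clip((x - lo) / (hi - lo + eps) * B, 0, B - 1))
    return responses[j]
```
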
Experiments on two collections of images are carried out. Firstly, we test the proposed approach in the face detection task, using Haar-like features (HFs) as input information. Secondly, we experiment with synthetic images representing letters (computer fonts originally prepared by T.E. de Campos et al. [5]), treating the letter ‘A’ as our target object. In that experiment we expect to detect the targets regardless of their rotation. To do so, we apply rotationally invariant features based on Zernike moments (ZMs) [2]. In both cases, feature extraction is backed by integral images (complex-valued for ZMs).

In the experiments we used a machine with an Intel Core i7-4790K CPU (4 cores / 8 threads, 8 MB cache). For a clear interpretation of time measurements, we report detection times using only a single thread [ST]. The software was written in C#, with key computational procedures implemented in C++ as a DLL library.

Experiment: “Faces” (Haar-like features). Training faces were cropped from \(3\,000\) images found via Google Images, yielding \(7\,258\) face examples described by \(14\,406\) HFs. The test set contained \(3\,014\) examples from the Essex Face Data [16]. Validation sets contained \(1\,000\) examples. The number of negatives in the test set was constant and equal to \(1\,000\,000\). To reduce training time, the number of negatives in the training and validation sets was gradually reduced for successive stages, as described in Table 1. Detection times, reported later, were determined as averages over 200 executions of the detection procedure.

Table 1. “Faces”: experimental setup.

We start reporting results by showing some visual examples of detection outcomes obtained by the two best detectors (in terms of the expected number of features), trained to satisfy the \(A = 10^{-4}\) and \(A = 10^{-5}\) requirements, respectively; see Fig. 2.

Fig. 2. “Faces”: detection examples (false alarms marked in yellow).

Table 2 provides detailed information about cascades trained with the \(A=10^{-3}\) requirement. Every row contains a cascade, represented by two sequences: the feature counts \(n_k\) on successive stages (top) and the false alarm rates \(a_k\) (bottom). The third column reports the expected value \(\widehat{E}(n)\) calculated according to (4). The right-most columns provide information about the effectiveness of tree pruning, showing how many nodes were in fact trained with respect to the potential total. We allow ourselves to report approximate pruning (for both \(\alpha =0.8\) and \(\alpha =1.2\)) in the same rows as exact pruning, because in all experiments the approximate pruning never led to a suboptimal solution. The table shows clearly that, in general, the greater the “bushiness” of the tree, the better the expected value we try to minimize—an increase in either the C or the L parameter leads to an improvement. Additionally, owing to pruning, the time needed to train cascades involving wider trees did not increase proportionally to the overall number of nodes. One should realize that nodes (stages) lying deeper in the tree, with low effective FAR resulting from the chain multiplication of \(a_k\) rates, require much time for resampling, since only a small fraction of negative examples reaches those stages. That is why it is so important to prune redundant nodes. In particular, for TREE-C3-L2-UGM-G an exhaustive search would require 84 nodes; exact pruning reduces this number to 60, whereas approximate pruning cuts it further down to 57 (for \(\alpha =0.8\)) and 55 (for \(\alpha =1.2\)).

Table 2. “Faces”: cascades trained for \(A=10^{-3}\) (pruning information in last columns).
Table 3. “Faces”: VJ vs tree-based cascades with \(K{=}5\) (left) and \(K{=}10\) (right) stages.

Table 3 compares cascades trained traditionally (VJ) against selected best cascades trained via tree search. The comparison pertains to accuracy and detection times. This time we show three variants of the A requirement: \(10^{-3}\), \(10^{-4}\) and \(10^{-5}\) (the last setting only for cascades with 10 stages). In addition, the theoretical expected value for each cascade can be compared against the average observed on the test set (column \(\bar{n}\)). We remark that the tree-based approach combined with greedy per-stage requirements—TREE-C3-L1-UGM-G—produced the best cascades (marked with dark gray), having the smallest expected values. Savings in detection time per image with respect to the VJ approach are at the level of \(\approx 7.5\) ms (about \(8\%\) per thread). This may not seem large, but we remind the reader that the measurements are for single-threaded executions [ST]. For example, if 8 threads are used, this translates into a gain of \(\approx 4\) FPS.

Experiment: “Synthetic A letters” (Zernike Moments). Table 4 lists the details of the experimental setup. In the training images, only objects with limited rotations were allowed (\(\pm 45^\circ \) with respect to their upright positions). In contrast, in the test images, rotations within the full range of \(360^\circ \) were allowed. During the training, 540 features were at disposal [2].

Table 4. “Synthetic A letters”: experimental setup.

Figure 3 presents examples of detection outcomes obtained by the best detectors trained to satisfy the \(10^{-3}\) and \(10^{-4}\) FAR requirements. As it turned out, for this data the cascades did not need many stages or features. Table 5 compares VJ against tree-based cascades. One can note that despite the small feature counts (compared with the previous experiment), the proposed method still allows to reduce the expectations. The smallest values were achieved by the TREE-C3-L1-UGM-G variant, yielding 2.5682 and 2.9910 for \(A=10^{-3}\) and \(A=10^{-4}\), respectively.

Fig. 3. “Synthetic A letters”: detection examples.

Table 5. “Synthetic A letters”: VJ vs tree-based cascades with \(K = 5\) stages.

5 Conclusion

Training a cascade of classifiers is a difficult optimization problem that, in our opinion, should always be carried out with a primary focus on the expected number of extracted features. This quantity directly reflects how fast an operating cascade is. Our proposition of tree search-based training allows one to ‘track’ more than one variant of a cascade. Potentially, this approach can be computationally expensive, but we have managed to reduce the cost with suitable branch-and-bound techniques. By pruning some of the subtrees, we save both the training time and the resampling time needed by later cascade stages. To our knowledge, no such proposition regarding the cascade structure has been tried out before. In future research we plan to investigate the approximate variant further, trying to predict partial expectations more than one stage ahead.