# Branch-and-Bound Search for Training Cascades of Classifiers


## Abstract

We propose a general algorithm that treats cascade training as a tree search process working according to the *branch-and-bound* technique. The algorithm makes it possible to reduce the *expected number of features* used by an operating cascade, a key quantity we focus on in this paper. While searching, we observe suitable lower bounds on partial expectations and prune tree branches that cannot improve the best-so-far result. Both exact and approximate variants of the approach are formulated. Experiments pertain to cascades trained as face or letter detectors, with Haar-like features or Zernike moments, respectively, as the input information. The results confirm shorter operating times of the obtained cascades, owing to the reduction in the number of extracted features.

## Keywords

Cascade of classifiers · Branch-and-bound tree search · Expected number of features

## 1 Introduction

The branch-and-bound technique is a useful tool in computer science. Multiple application examples can be named; let us mention DNA regulatory motif finding [8] and \(\alpha \)-\(\beta \) pruning in games, just to give two examples from quite remote fields. In this paper we adopt the technique to train cascades of classifiers.

Cascades were in principle designed to work as classifying systems operating under the following two conditions: (1) a very large number of incoming requests, (2) a significant class imbalance. The second condition should not be seen as a difficulty but rather as a favorable setting that makes the whole idea viable. Namely, a cascade should vary its computational effort depending on the contents of the object to be classified. Objects that are obvious negatives (non-targets) should be recognized fast, using only a few extracted features. Targets, or objects resembling them, are allowed to employ more features and computation time.

Despite the development of deep learning, recent literature shows that cascades of classifiers are still widely applied in detection systems or batch classification jobs. Let us list a few examples: crowd analysis and people counting [1], human detection in thermal images [6], localization of white blood cells [4], eye tracking [9, 11], detection of birds near high power electric lines [7].

There exists a certain *average* value of the computational cost incurred by an operating cascade. It can be mathematically defined as an *expected value* and, in fact, calculated explicitly for a given cascade (we do this in Sect. 2.3) in terms of: the numbers of features applied on successive stages, the false alarm and detection rates of successive stages, and the probability distribution from which the data is drawn. Since the true distribution underlying the data is typically unknown in practice, the exact expected value cannot be determined. Interestingly though, it can be accurately approximated using just the feature counts and false alarm rates.

Training procedures for cascades are time-consuming, taking days or even weeks. As Viola and Jones noted in their pioneering work [18], cascade training is a difficult combinatorial optimization problem involving many parameters: the number of stages, the number of features on successive stages, the selection of those features, and finally the decision thresholds. The problem has not been ultimately solved yet. Viola and Jones tackled it by imposing final requirements the whole cascade should meet in order to be accepted, defined by a pair of numbers (*A*, *D*), where *A* denotes the largest allowed false alarm rate (FAR) and *D* the smallest allowed detection rate (sensitivity). Due to the probabilistic properties of the cascade structure, one can translate the final requirements into per-stage requirements as geometric means: \(a_{\max }\!=\!A^{1/K}\) and \(d_{\min }\!=\!D^{1/K}\), where *K* is the fixed number of stages.
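As a quick illustration, the geometric-mean translation of final requirements into per-stage requirements can be computed directly (a minimal sketch; the function name is ours):

```python
# Translate final cascade requirements (A, D) into per-stage requirements
# via geometric means: a_max = A^(1/K), d_min = D^(1/K).
def per_stage_requirements(A, D, K):
    a_max = A ** (1.0 / K)  # largest allowed per-stage false alarm rate
    d_min = D ** (1.0 / K)  # smallest allowed per-stage detection rate
    return a_max, d_min

# For A = 1e-3, D = 0.9 and K = 10 stages, each stage may pass roughly
# half of the negatives (a_max ~ 0.5012) but must keep d_k >= ~0.9895.
a_max, d_min = per_stage_requirements(1e-3, 0.9, 10)
```

Note how mild the per-stage requirements become: ten stages, each rejecting about half of the negatives, already yield the overall \(10^{-3}\) false alarm rate.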

Many modifications to cascade training have been introduced over the years. Most of them try out different feature selection approaches or subsampling methods, or are simply tailored to a particular type of features [3, 10, 12, 17] (e.g. Haar, HOG, LBP). Some authors obtain modified cascades by designing new boosting algorithms that underlie the training [14, 15], but due to mathematical difficulties, the expected number of features is seldom the main optimization criterion. One of the few exceptions is the elegant work by Saberian and Vasconcelos [14]. The authors use gradient descent to explicitly optimize a Lagrangian representing the trade-off between the cascade's error rate and its operating cost (expected value). They use a trick that translates non-differentiable recursive formulas into smooth ones using hyperbolic tangent approximations. The approach is analytically tractable but expensive, because all cascade stages are kept open while training: in every step one has to check *all* variational derivatives based on the features at disposal for *all* open stages.

The main contribution of this paper is an algorithm, or in fact a general framework, for training cascades of classifiers via a tree search approach and the *branch-and-bound* technique. Successive tree levels correspond to successive cascade stages. Sibling nodes represent variants of the same stage with different numbers of features applied. We provide suitable formulas for lower bounds on the expected value being optimized. During an ongoing search, we observe these lower bounds, and whenever a bound for some tree branch is greater than (or equal to) the best-so-far expectation, the branch becomes pruned. Once the search is finished, one of the paths from the root to some terminal node indicates the cascade with the smallest expected number of features. Apart from the exact approach to pruning, we additionally propose an approximate one, using suitable predictions of expected values.

## 2 Preliminaries

### 2.1 Notation

- *K*—number of cascade stages,
- \(n=(n_1, n_2, \ldots , n_K)\)—numbers of features used on successive stages,
- \((a_1, a_2, \ldots , a_K)\)—FAR values on successive stages (false alarm rates),
- \((d_1, d_2, \ldots , d_K)\)—sensitivities on successive stages (detection rates),
- *A*—required FAR for the whole cascade,
- *D*—required detection rate (sensitivity) for the whole cascade,
- \(F=(F_1, F_2, \ldots , F_K)\)—ensemble classifiers on successive stages (the cascade),
- \(A_k\)—FAR observed up to the *k*-th stage of the cascade (\(A_k=\prod _{1\leqslant i \leqslant k} a_i\)),
- \(D_k\)—sensitivity observed up to the *k*-th stage of the cascade (\(D_k=\prod _{1\leqslant i \leqslant k} d_i\)),
- \((p, 1-p)\)—true probability distribution of classes (unknown in practice),
- \(\mathcal {D}, \mathcal {V}\)—training and validation data sets,
- \(\#\)—set size operator (cardinality of a set),
- \(\Vert \)—concatenation operator (to concatenate cascade stages).

The final requirements (*A*, *D*) demand that: \(P\left( F(\mathbf {x}){=}+|y{=}-\right) {\leqslant } A\) and \(P\left( F(\mathbf {x}){=}+|y{=}+\right) {\geqslant } D\), whereas false alarm and detection rates observed on particular stages are, respectively, equal to:

$$ a_k = P\bigl(F_k(\mathbf {x}){=}+ \mid y{=}-,\; F_1(\mathbf {x}){=}\cdots{=}F_{k-1}(\mathbf {x}){=}+\bigr), \qquad d_k = P\bigl(F_k(\mathbf {x}){=}+ \mid y{=}+,\; F_1(\mathbf {x}){=}\cdots{=}F_{k-1}(\mathbf {x}){=}+\bigr). \quad (1) $$

### 2.2 Classical Cascade Training Algorithm (Viola-Jones Style)

Please note, in the final line of the pseudocode, that we return \((F_1, F_2, \ldots , F_k)\) rather than \((F_1, F_2, \ldots , F_K)\). This is because the training procedure can potentially stop earlier, when \(k<K\), provided that the final requirements (*A*, *D*) for the entire cascade are already satisfied, i.e. \(A_k\leqslant A\) and \(D_k\geqslant D\).
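The early-stopping behavior described above can be illustrated with a minimal, runnable sketch of the classical training loop. Actual stage training (boosting plus threshold adjustment) is replaced here by a stub that simply returns assumed per-stage rates \((a_k, d_k)\), so only the control flow is genuine:

```python
# Sketch of the Viola-Jones style training loop with early stopping.
# stage_rates[k] -> (a_k, d_k) stands in for training stage k+1.
def train_cascade_vj(stage_rates, A, D, K):
    stages, A_k, D_k = [], 1.0, 1.0
    for k in range(K):
        a_k, d_k = stage_rates[k]      # stand-in for boosting a new stage
        stages.append((a_k, d_k))
        A_k *= a_k                      # cumulative false alarm rate
        D_k *= d_k                      # cumulative sensitivity
        if A_k <= A and D_k >= D:       # final requirements already met:
            break                       # return (F_1, ..., F_k) with k < K
    return stages

# With a_k = 0.4 per stage, A = 1e-3 is reached after 8 stages even if K = 10.
cascade = train_cascade_vj([(0.4, 0.999)] * 10, A=1e-3, D=0.9, K=10)
```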

### 2.3 Expected Number of Extracted Features

**Definition-Based Formula.** A cascade stops operating after a certain number of stages; it does not stop in the middle of a stage. Therefore the possible outcomes of the random variable of interest, describing the disjoint events, are: \(n_1\), \(n_1+n_2\), ..., \(n_1 + n_2 + \cdots + n_K\). Hence, by the definition of expected value, the expected number of features can be calculated as follows:

$$ E(n)=\sum _{k=1}^{K-1}(n_1+\cdots +n_k)\bigl [p\,D_{k-1}(1-d_k)+(1-p)\,A_{k-1}(1-a_k)\bigr ] + (n_1+\cdots +n_K)\bigl [p\,D_{K-1}+(1-p)\,A_{K-1}\bigr ], \quad (2) $$

where \(A_0=D_0=1\).

**Incremental Formula and Its Approximation.** By grouping the terms in (2) with respect to \(n_k\), the following alternative formula can be derived:

$$ E(n)=\sum _{k=1}^{K} n_k\bigl [p\,D_{k-1}+(1-p)\,A_{k-1}\bigr ]. \quad (3) $$

Since the prior probability *p* of the positive class is very small (typically \(p<10^{-4}\)), the expected value can be accurately approximated using only the summands related to the negative class as follows:

$$ \widehat{E}(n)=\sum _{k=1}^{K} n_k A_{k-1}. \quad (4) $$

In this approximation, \(A_{i-1}\) plays the role of the probability that a negative example reaches the *i*-th stage. This is equivalent to weighting each stage's feature count by the fraction of negatives surviving all preceding stages.
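A short numerical sketch of both quantities (with the convention \(A_0=D_0=1\); function names are ours) shows how close the approximation is for a small prior *p*:

```python
# Exact expected number of extracted features (incremental form) and its
# approximation using only negative-class summands; assumes A_0 = D_0 = 1.
def expected_features(n, a, d, p):
    E, A_prev, D_prev = 0.0, 1.0, 1.0
    for n_k, a_k, d_k in zip(n, a, d):
        E += n_k * (p * D_prev + (1 - p) * A_prev)  # prob. of reaching stage k
        A_prev *= a_k
        D_prev *= d_k
    return E

def approx_expected_features(n, a):
    E, A_prev = 0.0, 1.0
    for n_k, a_k in zip(n, a):
        E += n_k * A_prev  # only negatives, weighted by survival probability
        A_prev *= a_k
    return E

# Illustrative cascade: growing stages, each halving the negatives.
n, a, d = [10, 20, 40, 80, 160], [0.5] * 5, [0.99] * 5
# For p = 1e-5 the exact and approximate values agree to well under 0.1%.
```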

## 3 Cascade Training as a Tree Search

In stage-wise training procedures, each stage, once fixed, must not be altered. The paper [14], discussed in the introduction, represents the opposite approach, where all stages can be extended with a weak classifier at any time. The approach we propose lies in-between the two mentioned above. It provides more flexibility than stage-wise training and simultaneously avoids the high complexity of [14].

We treat cascade training as a tree search process. The root of the tree represents an empty cascade. Successive tree levels correspond to successive cascade stages. Each non-terminal tree node has an odd number of children. The children represent variants of the subsequent stage with slightly different numbers of features. They are processed recursively from left to right until the stop condition is met. It should be understood that the nodes are not simply generated mechanically but, in fact, trained as ensemble classifiers.

The size of the tree is controlled by two integer parameters *L* and *C*, predefined by the user. To keep the tree fairly small, the branching of variants takes place only at the *L* top-most levels, e.g. \(L=2\). At those levels the branching factor equals *C*, an odd number, e.g. \(C=5\). At deeper levels the branching factor is one. Therefore, the actual branching affects only the initial stages, which have the largest impact on the expected number of features. Once the tree search is finished, one of the paths from the root to some terminal node indicates the best cascade, i.e. the one having the smallest expectation.
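Under this scheme, the total number of nodes (i.e. stages to be trained in an exhaustive search) is \(\sum _{1\leqslant l\leqslant L} C^l + C^L(K-L)\). This formula is our own reading of the construction, but it agrees with the 84-node exhaustive count reported later in the experiments for \(C=3\), \(L=2\), \(K=10\):

```python
# Number of nodes (trained stages) in the search tree: branching factor C
# at the top L levels, single children below, for a K-stage cascade.
def tree_nodes(C, L, K):
    branched = sum(C ** l for l in range(1, L + 1))  # levels 1..L
    deep = (C ** L) * (K - L)                        # one node per path per deeper level
    return branched + deep

# C=3, L=2, K=10 gives 3 + 9 + 9*8 = 84 nodes, while plain stage-wise
# training (C=1) corresponds to just K nodes.
```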

For notation purposes, children nodes being variants of the same stage use an additional subindex. For example, the classifier \(F_{1,0}\) denotes the main variant of the first stage (using a certain number of features) and is graphically represented as the *middle* child. Its *left* siblings \(F_{1,-1},F_{1,-2},\ldots \) denote classifiers using fewer features (one less, two less, etc.). The *right* siblings \(F_{1,+1},F_{1,+2},\ldots \) use more features than the middle child (one more, two more, etc.). This notation will be used only locally within single recursive calls (due to global ambiguity).

### 3.1 Pruning Search Tree Using Current Partial Expectations—Exact Branch-and-bound

While searching, we keep track of *partial* values of the expected value of interest, given by formula (4). Suppose a new \((k+1)\)-th stage has been completed, revealing \(n_{k+1}\) features. The formula

$$ \widehat{E}_{k+1}=\sum _{1\leqslant i\leqslant k+1} n_i A_{i-1} \quad (7) $$

then constitutes a lower bound on the final expectation, because every stage trained later can only contribute non-negative summands.

The recursive procedure receives a cascade *F* with *k* stages and trains the new \((k+1)\)-th stage in its main variant \(F_{k+1,0}\). We refer to it as the *middle child*. Then, the algorithm "branches" the stage (if the level is not greater than *L*) by creating clones of the middle child with fewer features: \(F_{k+1,-1},F_{k+1,-2},\ldots \) (*left children*), and with more features: \(F_{k+1,+1},F_{k+1,+2},\ldots \) (*right children*). The algorithm iterates over all children and performs recursive calls to train their subsequent stages, provided that the lower bound (7) on the final expectation is not worse than the best expectation \(\widehat{E}^*\) so far. A recursion path, representing some cascade, reaches its stopping point when the final requirements (*A*, *D*) are satisfied and when its expected value is strictly less than \(\widehat{E}^*\) (initially set to \(\infty \)). The outermost recursion call is made on an empty cascade.
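The exact branch-and-bound recursion can be sketched as follows. Real stage training is replaced by a stub returning assumed \((n_k, a_k)\) pairs, so only the search and pruning logic is illustrated; `best` plays the role of \(\widehat{E}^*\):

```python
# Sketch of the exact branch-and-bound tree search. train_stage(k, off)
# stands in for training the off-th variant of stage k+1 and returns
# (n_k, a_k). The partial expectation sum(n_i * A_{i-1}) is used as the
# lower bound for pruning.
def search(train_stage, K, L, C, k=0, A_k=1.0, E_partial=0.0, path=(),
           best=None):
    if best is None:
        best = {"E": float("inf"), "path": ()}
    if k == K:                                   # a complete cascade
        if E_partial < best["E"]:
            best["E"], best["path"] = E_partial, path
        return best
    # branch into C variants at the top L levels, single child below
    offsets = range(-(C // 2), C // 2 + 1) if k < L else (0,)
    for off in offsets:                          # left, middle, right children
        n_k, a_k = train_stage(k, off)           # stand-in for boosting a variant
        E_new = E_partial + n_k * A_k            # partial expectation = lower bound
        if E_new >= best["E"]:                   # bound cannot improve the best:
            continue                             # prune this branch
        search(train_stage, K, L, C, k + 1, A_k * a_k, E_new,
               path + (n_k,), best)
    return best

# Stub: stage k in variant `off` uses 10*(k+1)+off features with FAR 0.5.
result = search(lambda k, off: (10 * (k + 1) + off, 0.5), K=4, L=2, C=3)
# Smallest-variant stages win: path (9, 19, 30, 40) with expectation 31.0.
```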

### 3.2 Pruning Search Tree Using Expectation Predictions—Approximate Branch-and-bound

Suppose we have completed the training of stage \(k+1\) and would like to make a prediction about the partial expectation for stage \(k+2\) without training it. Obviously, the training of any stage is time-consuming, hence a significant gain can be achieved by not wasting time on a stage that is not going to improve the best-so-far expectation. Observe that when stage \(k+1\) is completed, we get to know two new pieces of information: \(n_{k+1}\) and \(a_{k+1}\). The second piece is not needed to calculate formula (7) for stage \(k+1\), but it is needed for stage \(k+2\). Therefore, the only unknown preventing us from calculating the exact partial expectation for stage \(k+2\) is \(n_{k+2}\). We are going to approximate it.

The prediction is governed by a multiplicative coefficient \(\alpha \), and it looks only *one* stage ahead, ignoring all subsequent stages. Since those stages shall too contribute their summands to the final expectation, high \(\alpha \) values should still be safe, especially for initial levels.

Algorithm 3 represents the described approach for cascade training based on tree search and approximate pruning.
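A minimal sketch of the prediction-based check is given below. Treating the predicted feature count \(\widehat{n}_{k+2}\) as \(\alpha \) times \(n_{k+1}\) is purely our illustrative assumption; the point is only that a branch can be discarded before the costly training of its next stage:

```python
# Approximate pruning: test the *predicted* partial expectation for stage
# k+2 against the best-so-far value before actually training that stage.
def prune_by_prediction(E_partial, n_next_pred, A_through_k1, E_best):
    """True if the predicted bound already rules this branch out."""
    E_pred = E_partial + n_next_pred * A_through_k1
    return E_pred >= E_best

# Illustrative numbers: stage k+1 used 30 features with a_{k+1} = 0.4,
# cumulative FAR up to stage k was 0.2, alpha = 1.2 (our assumption).
n_k1, a_k1, alpha = 30, 0.4, 1.2
n_pred = alpha * n_k1                      # predicted n_{k+2} = 36 features
skip = prune_by_prediction(55.0, n_pred, 0.2 * a_k1, 60.0)
# Predicted bound 55 + 36*0.08 = 57.88 < 60, so the stage is worth training.
```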

## 4 Experiments

In all experiments we apply *RealBoost+bins* [13] as the main learning algorithm, producing ensembles of weak classifiers as successive cascade stages. Each weak classifier is based on a single selected feature.

Experiments on two collections of images are carried out. Firstly, we test the proposed approach in a face detection task, using Haar-like features (HFs) as the input information. Secondly, we experiment with synthetic images representing letters (computer fonts originally prepared by T.E. de Campos et al. [5]), treating the letter 'A' as our target object. In this experiment we expect to detect the targets regardless of their rotation. To do so, we apply rotationally invariant features based on Zernike moments (ZMs) [2]. In both cases, feature extraction is backed with integral images (complex-valued for ZMs).

In the experiments we used a machine with an Intel Core i7-4790K (4/8 cores/threads, 8 MB cache). For a clear interpretation of time measurements, we report detection times using only a single thread [ST]. The software was programmed in C#, with key computational procedures implemented in C++ as a DLL library.

**Experiment: “Faces” (Haar-like features).** Training faces were cropped from \(3\,000\) images, looked up using *Google Images*, yielding \(7\,258\) face examples described by \(14\,406\) HFs. The test set contained \(3\,014\) examples from *Essex Face Data* [16]. Validation sets contained \(1\,000\) examples. The number of negatives in the test set was constant and equal to \(1\,000\,000\). To reduce training time, the number of negatives in training and validation sets was gradually reduced for successive stages, as described in Table 1. Detection times, reported later, were determined as averages over 200 executions of the detection procedure.

Table 1. “Faces”: experimental setup.

| Train data | | Validation data | | Test data | | Detection procedure | |
|---|---|---|---|---|---|---|---|
| qty./parameter | value | qty./parameter | value | qty./parameter | value | qty./parameter | value |
| no. of positives | \(7\,258\) | no. of positives | \(1\,000\) | no. of positives | \(3\,014\) | no. of repetitions | 200 |
| no. of negatives | \(139\,373\) | no. of negatives | \(40\,000\) | no. of negatives | \(1\,000\,000\) | image resolution | \(600 \times 480\) |
| ” 2nd stage | \(42\,742\) | ” other stages | \(24\,000\) | total set size | \(1\,003\,014\) | no. of detection scales | 5 |
| ” other stages | \(27\,742\) | total set size | \(41\,000\) | | | window growing coef. | 1.2 |
| total set size | \(146\,631\) | ” other stages | \(25\,000\) | | | smallest window | \(48\times 48\) |
| ” 2nd stage | \(50\,000\) | | | | | largest window size | \(100\times 100\) |
| ” other stages | \(35\,000\) | | | | | window jumping coef. | 0.05 |

Increasing the *C* or *L* parameter did not always lead to an improvement. Additionally, owing to pruning, the time needed to train cascades involving wider trees did not increase proportionally to the overall number of nodes. One should realize that nodes (stages) lying deeper in the tree, with a low effective FAR resulting from the chain multiplication of \(a_k\) rates, require much time for resampling, since only a small fraction of negative examples reaches those stages. That is why it is so important to prune redundant nodes. In particular, for TREE-C3-L2-UGM-G an exhaustive search would require 84 nodes; exact pruning reduces this number to 60, whereas approximate pruning cuts it further down to 57 (for \(\alpha =0.8\)) and 55 (for \(\alpha =1.2\)).

Table 2. “Faces”: cascades trained for \(A=10^{-3}\) (pruning information in last columns).

Table 3. “Faces”: VJ vs tree-based cascades with \(K{=}5\) (left) and \(K{=}10\) (right) stages.

Table 3 compares cascades trained traditionally (VJ) against selected best cascades trained via tree search. The comparison pertains to accuracy and detection times. This time we show three variants of the *A* requirement: \(10^{-3}\), \(10^{-4}\) and \(10^{-5}\) (the last setting only for cascades with 10 stages). In addition, the theoretical expected value for each cascade can be compared against the average observed on the test set (column \(\bar{n}\)). We remark that the tree-based approach combined with greedy per-stage requirements, TREE-C3-L1-UGM-G, produced the best cascades (marked with dark gray), having the smallest expected values. Savings in detection time per image with respect to the VJ approach are at the level of \(\approx 7.5\) ms (about \(8\%\) per thread). This may seem small, but recall that the measurements are for single-threaded executions [ST]. For example, with 8 threads this translates into an increase of \(\approx 4\) FPS.

**Experiment: “Synthetic A letters” (Zernike Moments).** Table 4 lists the details of the experimental setup for this experiment. In the train images, only objects with limited rotations were allowed (\(\pm 45^\circ \) with respect to their upright positions). In contrast, in the test images, rotations within the full range of \(360^\circ \) were allowed. During training, 540 features were at our disposal [2].

Table 4. “Synthetic A letters”: experimental setup.

| Train data | | Validation data | | Test data | | Detection procedure | |
|---|---|---|---|---|---|---|---|
| qty./parameter | value | qty./parameter | value | qty./parameter | value | qty./parameter | value |
| no. of positives | \(20\,384\) | no. of positives | \(1\,000\) | no. of positives | \(20\,000\) | no. of repetitions | 200 |
| no. of negatives | \(50\,546\) | no. of negatives | \(10\,000\) | no. of negatives | \(1\,000\,000\) | image resolution | \(600 \times 480\) |
| total set size | \(70\,930\) | total set size | \(11\,000\) | total set size | \(1\,020\,000\) | no. of detection scales | 5 |
| | | | | | | window growing coef. | 1.2 |
| | | | | | | smallest window | \(100\times 100\) |
| | | | | | | largest window size | \(208\times 208\) |
| | | | | | | window jumping coef. | 0.05 |

Table 5. “Synthetic A letters”: VJ vs tree-based cascades with \(K = 5\) stages.

## 5 Conclusion

Training a cascade of classifiers is a difficult optimization problem that, in our opinion, should always be carried out with a primary focus on the *expected number of extracted features*. This quantity directly reflects how fast an operating cascade is. Our proposition of tree search-based training makes it possible to 'track' more than one variant of a cascade. Potentially, this approach can be computationally expensive, but we have managed to reduce the cost with suitable branch-and-bound techniques. Being able to prune some of the subtrees, we save both the training and the resampling time needed by later cascade stages. To our knowledge, no such proposition regarding the cascade structure has been tried out before. In future research we plan to investigate the approximate variant further, trying to predict partial expectations more than one stage ahead.

## References

- 1. Abbas, S., et al.: Crowd detection and management using cascade classifier on ARMv8 and OpenCV-Python. In: 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), pp. 1–6 (2017)
- 2. Bera, A., Klęsk, P., Sychel, D.: Constant-time calculation of Zernike moments for detection with rotational invariance. IEEE Trans. Pattern Anal. Mach. Intell. **41**(3), 537–551 (2019)
- 3. Bourdev, L., Brandt, J.: Robust object detection via soft cascade. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 236–243. IEEE Computer Society (2005)
- 4. Budiman, R.A.M., Achmad, B., Faridah, Arif, A., Nopriadi, Zharif, L.: Localization of white blood cell images using Haar cascade classifiers. In: 2016 1st International Conference on Biomedical Engineering (IBIOMED), pp. 1–5 (2016)
- 5. de Campos, T.E., et al.: Character recognition in natural images. In: International Conference on Computer Vision Theory and Applications, Portugal, pp. 273–280 (2009)
- 6. Setjo, C.H., et al.: Thermal image human detection using Haar-cascade classifier. In: 2017 7th International Annual Engineering Seminar (InAES), pp. 1–6 (2017)
- 7. Lu, J., et al.: Detection of bird's nest in high power lines in the vicinity of remote campus based on combination features and cascade classifier. IEEE Access **6**, 39063–39071 (2018)
- 8. Jones, N., Pevzner, P.: An Introduction to Bioinformatics Algorithms. MIT Press, Cambridge (2002)
- 9. Cuimei, L., et al.: Human face detection algorithm via Haar cascade classifier combined with three additional classifiers. In: 2017 13th IEEE International Conference on Electronic Measurement & Instruments (ICEMI), pp. 483–487 (2017)
- 10. Li, J., Zhang, Y.: Learning SURF cascade for fast and accurate object detection. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), pp. 3468–3475. IEEE Computer Society (2013)
- 11. Li, Y., Xu, X., Mu, N., Chen, L.: Eye-gaze tracking system by Haar cascade classifier. In: 2016 IEEE 11th Conference on Industrial Electronics and Applications (ICIEA), pp. 564–567 (2016)
- 12. Pham, M., Cham, T.: Fast training and selection of Haar features using statistics in boosting-based face detection. In: IEEE 11th International Conference on Computer Vision (ICCV 2007), pp. 1–7 (2007)
- 13. Rasolzadeh, B., et al.: Response binning: improved weak classifiers for boosting. In: IEEE Intelligent Vehicles Symposium, pp. 344–349 (2006)
- 14. Saberian, M., Vasconcelos, N.: Boosting algorithms for detector cascade learning. J. Mach. Learn. Res. **15**, 2569–2605 (2014)
- 15. Shen, C., Wang, P., Paisitkriangkrai, S., van den Hengel, A.: Training effective node classifiers for cascade classification. Int. J. Comput. Vis. **103**(3), 326–347 (2013). https://doi.org/10.1007/s11263-013-0608-1
- 16. University of Essex: Face Recognition Data. https://cswww.essex.ac.uk/mv/allfaces/faces96.html (1997). Accessed 11 May 2019
- 17. Vallez, N., Deniz, O., Bueno, G.: Sample selection for training cascade detectors. PLoS ONE **10**, e0133059 (2015)
- 18. Viola, P., Jones, M.: Robust real-time face detection. Int. J. Comput. Vis. **57**(2), 137–154 (2004). https://doi.org/10.1023/B:VISI.0000013087.49260.fb