
1 Introduction

One of the major and fundamental challenges in object detection is to increase localization accuracy, which indicates the detector’s ability to predict correct regions of target objects. This accuracy is typically measured by bounding box overlap, i.e., the intersection over union (IoU) of the ground-truth and predicted bounding boxes. While previous challenges (e.g., PASCAL VOC [3] and KITTI [5]) normally require an IoU threshold of 0.5 for a detection to be considered correct, real-world applications usually call for higher accuracy (e.g., IoU \(\ge \) 0.7). For example, vehicle and pedestrian detection in autonomous driving requires accurate distance estimation from real-time road traffic captures.

Recent literature has focused on modifying region-based detection models at the post-recognition level to boost localization accuracy [4, 6, 7]. However, limited work has addressed the problem from a data perspective. Data matters: rapid advances in data collection, storage, and processing technology have made machine learning, especially deep learning, much easier by lightening the burden of generalizing well to unseen data from a limited amount of training data [10].

However, the challenge of learning from imbalanced data [12] still exists. Within the “Recognition Using Regions” paradigm [11], the training set of object detection is divided into two distinct groups, annotated objects and background regions, and the numbers of examples in these groups are hugely imbalanced. Online Hard Example Mining (OHEM) [27] was proposed to overcome this data imbalance by integrating the bootstrapping technique [30] with region-based detectors, and it can be implemented effortlessly on most region-based detectors.

Fig. 1. Architecture of the Stratified Online Hard Example Mining algorithm (S-OHEM). We use the parameter notation from [27]. In each mini-batch iteration, N is the number of images sampled from the dataset, R is the number of forward-propagated RoIs, and B is the number of subsampled RoIs fed into backpropagation. We denote the classification loss by \(L_{cls}\) and the localization loss by \(L_{loc}\). S-OHEMiner conducts stratified sampling over the R region proposals according to the sampling distribution at the current training stage and produces B RoIs to be fed into backpropagation. We maintain a read-only RoI network and a standard RoI network with shared weights for efficient memory allocation, derived from [27]. The blue solid stream indicates the forward-propagation process and the green dashed stream the backpropagation process. More details are described in Sect. 3.3.

In this paper, we propose S-OHEM, the Stratified Online Hard Example Mining algorithm for training region-based deep convolutional network detectors to enhance localization accuracy, as shown in Fig. 1. The intuition behind our method is that feeding hard examples to the backpropagation process can overcome the problem of imbalanced data, resulting in a more efficient and effective training process [27]. In object detection, a hard example is defined as a region proposal with a high training loss. Thus, previous hard example mining methods (e.g., OHEM) sample region proposals according to a distribution that favors high-loss instances. However, the training loss defined in previous work is the multitask loss with equal weights across all loss types (e.g., classification, localization, mask [13], or rigid and non-rigid categories). This approach ignores the influence of different loss types throughout the training process, which we found essential to training efficacy (e.g., localization loss is more important during the latter part of the training period). Therefore, maintaining a sampling distribution that follows this influence during hard example mining should enhance the performance of object detectors.

S-OHEM exploits stratified sampling, a sampling method involving the division of a population into distinct groups known as strata [21] (homogeneous subgroups whose inner items are similar to each other). During each mini-batch iteration, S-OHEM first assigns candidate examples (in the form of Regions of Interest, RoIs) to different strata according to the ratio between classification and localization loss. The RoIs are then subsampled according to a dynamic distribution and fed into the backpropagation process. With an increasing focus on the localization loss, S-OHEM predicts more accurate bounding boxes and therefore enhances localization accuracy. We apply S-OHEM to the standard Fast R-CNN [8] and Faster R-CNN [26] detection methods and evaluate it on the PASCAL VOC 2007 and KITTI datasets. Our systematic experimental analysis shows that S-OHEM yields an AP improvement of 0.5% on the rigid categories of PASCAL VOC 2007 for both IoU thresholds of 0.6 and 0.7; for KITTI 2012, the improvement is 1.6% at both thresholds. Regarding mAP, a relative increase of 0.3% and 0.5% (1% and 0.5%) is observed for VOC07 (KITTI12) at the same IoU thresholds.

The remainder of this paper is structured as follows. In Sect. 2, we compare our work with related research, focusing on the improvement of localization accuracy and the use of data in object detection. In Sect. 3, we describe the design of the algorithm. In Sect. 4, we present the experimental results, and in Sect. 5, we conclude this work.

2 Related Work

Object detection has significantly benefited from advances in the image classification task. The remarkable feature extraction ability of deep convolutional networks [16, 20, 28, 31, 32] provides abundant information for the classification of region proposals. In addition, continuously developing practical strategies (e.g., activation functions [15, 24, 34], regularization [18, 29, 30], and optimization [2, 17, 19]) further contribute to the efficacy of deep neural networks.

Several region-based detectors rely on the strong classification capability of deep convolutional networks to evaluate generated RoIs. R-CNN was the first to adopt this approach, evaluating each RoI separately. Fast R-CNN [8] improved on this by enabling computation sharing through projecting RoIs onto a shared feature map (the RoIPool layer, derived from SPPnets [14]), resulting in better speed and accuracy. It was then integrated with the region proposal module (the Region Proposal Network, RPN) by sharing their convolutional features, extending it to a unified network with an “attention” [1] mechanism and leading to further speedup and accuracy gains. R-FCN [22] eliminates the fully-connected layers of region-based detectors and makes the whole model fully convolutional, using the backbones of state-of-the-art image classifiers [16, 32] to fully share computation, contributing to a significant speedup. Mask R-CNN [13], which adds a small Fully Convolutional Network (FCN) [23] as a parallel branch to the standard Faster R-CNN and replaces the RoIPool layer with the RoIAlign layer, is the latest descendant of this stream and achieves significant advances on several benchmarks for both detection and segmentation tasks. However, most of these models use the multitask loss with equal weights and do not consider the influence of different loss types throughout the training process.

Recent work has focused on the post-recognition level of region-based detection models to boost localization accuracy. Gidaris and Komodakis [6] proposed a CNN model for bounding box regression, used with iterative localization and bounding box voting. LocNet [7] aims to enhance localization accuracy by assigning a probability to each border of a loosely localized search region for being related to the object’s bounding box. It differs from the bounding box regression approach [4] adopted by most of the aforementioned region-based detectors and can serve as an effective alternative.

However, little work has focused on advancing region-based detectors from a data perspective. Online Hard Example Mining (OHEM) [27] integrates bootstrapping [30] (or hard example mining) with region-based detectors at a small extra computational cost, but it still lacks sufficient focus on localization accuracy because of the imbalance in the multi-task loss it inherits. Further discussion is given in Sect. 3.

3 Model Design

In this section, we argue that the current way of choosing hard examples lacks sufficient focus on localization accuracy and is suboptimal, and we show that our approach results in better training (lower training loss), higher localization performance, and higher average precision. First, we discuss the design motivation. Then we give a brief introduction to stratified sampling and define the stratum constraints used in this work. Finally, we present the design and implementation of our Stratified Online Hard Example Mining algorithm (S-OHEM).

3.1 Motivation

Most region-based detectors derive their multitask learning from Fast R-CNN and assume equal contributions of classification loss and localization loss throughout the training process. However, this assumption often does not hold. We apply the original OHEM to the standard Fast R-CNN and Faster R-CNN, then report the classification and localization loss throughout the training process on the PASCAL VOC and KITTI datasets, respectively.

As illustrated in Fig. 2, the classification loss is consistently larger than the localization loss (more than double on average). This can lead to a problem. Consider a situation where we have two region proposals, RoI A and RoI B, and must choose one as the hard example for backpropagation. Based on the preliminary experimental results shown in Fig. 2, we assume the training losses for RoI A and RoI B are \(L_{cls}(A)=0.21\), \(L_{loc}(A)=0.11\) and \(L_{cls}(B)=0.19\), \(L_{loc}(B)=0.12\), respectively. Recall that the classification loss is defined as the log loss \(L_{cls}(p,u)=-\log p_u\) for the true class u [8], so the probability of the true class is 61.5% for RoI A and 64.5% for RoI B. This is not a significant gap in class prediction probability, so the two RoIs can be considered to perform similarly on the classification task.

Fig. 2. Influence of different loss types throughout the training process. For better visualization, we average the training loss over every 1000 iterations.

Regarding the localization loss, the gap between RoI A and RoI B is 0.01 \((L_{loc}(B) - L_{loc}(A) = 0.12 - 0.11 = 0.01)\). Under the smooth \(L_{1}\) loss [8], this gap translates to a difference of about 0.14 between the ground-truth and predicted bounding box offsets. This gap is quite significant under the parameterization of bounding box offsets given in [9], so we should choose RoI B as the hard example to obtain better localization accuracy and prediction quality. However, under the equal-weight multitask loss, RoI A is chosen as the hard example instead. Thus, the previous hard example mining approach lacks focus on localization accuracy.
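To make this translation concrete, the short sketch below (ours, not from the original implementation) evaluates the smooth \(L_{1}\) loss and inverts its quadratic branch, showing that a loss gap of 0.01 corresponds to an offset gap of roughly \(\sqrt{2 \times 0.01} \approx 0.14\).

```python
import math

def smooth_l1(x):
    """Smooth L1 loss from Fast R-CNN: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

# A localization-loss gap of 0.01 lies on the quadratic branch (|x| < 1),
# so the corresponding offset gap is sqrt(2 * 0.01) ~= 0.14.
loss_gap = 0.12 - 0.11
offset_gap = math.sqrt(2.0 * loss_gap)
print(round(offset_gap, 2))        # 0.14
print(round(smooth_l1(0.14), 4))   # 0.0098, i.e. back to about 0.01
```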

3.2 Stratified Sampling

Stratified sampling is a sampling method involving the division of a population into distinct groups known as strata [21]. These strata are homogeneous subgroups of the original data whose inner items are similar to each other. Stratified sampling achieves higher statistical precision because the variability within subgroups sharing the same properties is lower than that of the entire population [33]. Therefore, stratified sampling improves representativeness by reducing sampling error.

Each stratum constraint \(s_k\) is denoted by \(s_k = (p_k, f_k)\), where \(p_k\) is a propositional formula and \(f_k\) is the required sample size. In this work, the four stratum constraints are defined by the combination of classification loss (\(L_{cls}\)) and localization loss (\(L_{loc}\)) levels: \(s_1 = (\text{high } L_{cls} \text{ and high } L_{loc}, f_1)\), \(s_2 = (\text{high } L_{cls} \text{ and low } L_{loc}, f_2)\), \(s_3 = (\text{low } L_{cls} \text{ and high } L_{loc}, f_3)\), and \(s_4 = (\text{low } L_{cls} \text{ and low } L_{loc}, f_4)\). The required sample sizes and the threshold for high loss (hard examples) change dynamically throughout the training process.
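As an illustration, the snippet below (a minimal sketch; the function name and threshold values are ours and purely illustrative) assigns a single RoI to one of the four strata by comparing its classification and localization losses against the current “high loss” thresholds.

```python
def assign_stratum(l_cls, l_loc, t_cls, t_loc):
    """Map one RoI to a stratum index 1..4 by its loss combination.

    t_cls / t_loc separate "high" from "low" loss; in S-OHEM these
    thresholds change dynamically throughout training.
    """
    if l_cls >= t_cls and l_loc >= t_loc:
        return 1  # s1: high L_cls, high L_loc
    if l_cls >= t_cls:
        return 2  # s2: high L_cls, low  L_loc
    if l_loc >= t_loc:
        return 3  # s3: low  L_cls, high L_loc
    return 4      # s4: low  L_cls, low  L_loc

# The two RoIs from Sect. 3.1, with illustrative thresholds:
print(assign_stratum(0.21, 0.11, t_cls=0.20, t_loc=0.12))  # RoI A -> stratum 2
print(assign_stratum(0.19, 0.12, t_cls=0.20, t_loc=0.12))  # RoI B -> stratum 3
```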

3.3 Stratified Online Hard Example Mining Algorithm

Given the observation that the previous hard example mining approach ignores the influence of different loss types throughout the training process and lacks focus on localization accuracy, we now demonstrate our approach of Stratified Online Hard Example Mining (S-OHEM).

The architecture of S-OHEM is shown in Fig. 1. In each mini-batch iteration, S-OHEM first generates region proposals for the input images, forward-propagates all of them through the region-based detector, and gathers the training loss of each RoI. Each RoI is then assigned to one of the four strata defined in Sect. 3.2. The combination of loss types reflects how well the current detector performs on the classification and localization tasks for each RoI. Inside each stratum, hard examples are chosen by sorting the RoIs by loss. After that, all RoIs are subsampled according to a dynamic distribution, and a total of B hard examples are fed into the backpropagation process. The sampling distribution over strata changes dynamically throughout the learning process, because each loss type contributes differently to the detector at different training stages. Specifically, the classification loss matters more at the beginning, while the localization loss contributes more at later training stages.

For implementation, we keep a record of the training loss history and start to change the sampling distribution once the loss becomes stable (e.g., after 40K iterations, as shown in Fig. 2). At the beginning of training, we sample only the B RoIs with the highest \(L_{cls}\) (i.e., sample from \(s_{12}\), the union of strata \(s_1\) and \(s_2\)). Once the loss becomes stable, we gradually shift toward RoIs with high \(L_{loc}\) (i.e., sample from the union of \(s_2\) and \(s_3\), denoted by \(s_{23}\)) by increasing the sampling ratio between \(s_{23}\) and \(s_1\). Because of this gradually increasing focus on the localization loss, S-OHEM can predict more accurate bounding boxes and thus enhance localization accuracy.
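The following sketch (our simplification; the function name, the linear ramp, and the use of 40K iterations as a default are illustrative choices) captures this schedule: before the loss stabilizes, all B hard examples are drawn by classification loss (the \(s_{12}\) side), and afterwards a growing fraction is drawn by localization loss (the \(s_{23}\) side).

```python
def sample_hard_examples(rois, B, iteration, stable_iter=40000, max_loc_share=0.5):
    """Pick B hard RoIs for backpropagation with a dynamic stratum split.

    `rois` is a list of dicts with keys 'l_cls' and 'l_loc'. Before
    `stable_iter`, all B examples are the highest-L_cls RoIs (the s12 side);
    afterwards a linearly growing share (capped at `max_loc_share`) is taken
    from the highest-L_loc RoIs (the s23 side).
    """
    if iteration < stable_iter:
        loc_share = 0.0
    else:
        loc_share = min(max_loc_share,
                        max_loc_share * (iteration - stable_iter) / stable_iter)

    n_loc = int(round(B * loc_share))
    n_cls = B - n_loc

    by_cls = sorted(rois, key=lambda r: r['l_cls'], reverse=True)
    by_loc = sorted(rois, key=lambda r: r['l_loc'], reverse=True)

    selected = by_cls[:n_cls]
    chosen = set(map(id, selected))
    # Fill the remaining slots with the highest-L_loc RoIs not yet selected.
    selected += [r for r in by_loc if id(r) not in chosen][:n_loc]
    return selected
```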

An equivalent formulation is available. For simplicity, we denote the contribution coefficients of \(L_{cls}\) and \(L_{loc}\) to hard example selection by \(\alpha \) and \(\beta \), respectively. Our approach then aims to find the optimal values of \(\alpha \) and \(\beta \) in Formula (1) at different training stages. \(L_{select}\) is used only for hard example mining; the actual loss backpropagated through the network is not affected.

$$\begin{aligned} L_{select} = \alpha L_{cls} + \beta L_{loc} \end{aligned}$$
(1)

When training begins, we sample only the B RoIs with the highest \(L_{cls}\) by setting \(\alpha \) and \(\beta \) in Formula (1) to 1 and 0, respectively. When the loss becomes stable, we gradually shift toward RoIs with high \(L_{loc}\) by decreasing \(\alpha \) and increasing \(\beta \) in Formula (1).
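A minimal sketch of this alternative formulation (the function name is ours and the selection rule is illustrative): score every forward-propagated RoI with \(L_{select}\) and keep the top B.

```python
def select_by_weighted_loss(rois, B, alpha, beta):
    """Rank RoIs by L_select = alpha * L_cls + beta * L_loc and keep the top B.

    L_select is used only to pick hard examples; the loss actually
    backpropagated through the network is left unchanged.
    """
    scored = sorted(rois,
                    key=lambda r: alpha * r['l_cls'] + beta * r['l_loc'],
                    reverse=True)
    return scored[:B]

# Early training: alpha = 1, beta = 0 (classification-driven selection).
# After the loss stabilizes, beta is gradually increased (e.g., toward the
# values reported in Sect. 4.1) so that localization loss drives selection.
```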

S-OHEM does not significantly affect training time, because most of the forward computation is shared between RoIs [8] and the number of backpropagated examples is much smaller than the total number of region proposals for the input images. To avoid co-located RoIs and double-counting of their loss, we follow the solution of [27] and apply non-maximum suppression (NMS) [25] to deduplicate RoIs before the sampling procedure. NMS works by iteratively selecting the highest-loss RoI and eliminating all lower-loss RoIs that have high overlap with the selected region. We also adopt their design of maintaining a read-only RoI network and a standard RoI network with shared weights for efficient memory allocation. It is also worth noting that S-OHEM can be combined with any of the post-recognition regressors introduced in Sect. 2, because it enhances localization accuracy purely from the data perspective.
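For reference, a simplified version of that deduplication step might look as follows (our own sketch, assuming axis-aligned boxes given as [x1, y1, x2, y2]; the overlap threshold is illustrative):

```python
def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms_by_loss(boxes, losses, overlap_thresh=0.7):
    """Greedy NMS that ranks RoIs by training loss instead of detection score."""
    order = sorted(range(len(boxes)), key=lambda i: losses[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < overlap_thresh]
    return keep
```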

4 Experiments and Results

In this section, we conduct systematic experiments to evaluate the proposed S-OHEM and compare it with the original OHEM. We describe the experimental setup in Sect. 4.1 and demonstrate the efficiency and accuracy of the algorithm by examining the training loss and average precision.

4.1 Experimental Setup

We use the standard and popular CNN architecture VGG16 from [28] and evaluate the algorithms on the PASCAL VOC 2007 and KITTI Object Detection Evaluation 2012 datasets. In the PASCAL VOC experiment, training is done on the trainval set and testing on the test set. In the KITTI 2012 experiment, we use the first 5000 images to form the training set and the remaining 2481 images for testing. All models are trained with SGD for 80K mini-batch iterations under the setup described below. For average precision, we report results with IoU thresholds of 0.5, 0.6, and 0.7 to evaluate localization accuracy over a wider range of IoU thresholds. We use Fast R-CNN [8] as the detector base for the PASCAL VOC experiment and Faster R-CNN [26] for the KITTI 2012 experiment to demonstrate the general applicability of our approach. The initial learning rate is set to 0.001 and dropped in “steps” by a factor of 0.1 every 30K iterations. We process 2 images in each mini-batch iteration and subsample 128 RoIs to feed into backpropagation. Note that the OHEM baselines reported in Table 2 (rows 1–2) were reproduced by us and are slightly higher than those reported in [27].
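For reference, these settings correspond to a solver configuration along the following lines (an illustrative sketch; the field names are ours, not taken from any actual configuration file):

```python
# Illustrative hyper-parameters mirroring Sect. 4.1 (field names are ours).
SOLVER = {
    "backbone": "VGG16",
    "max_iters": 80000,     # SGD mini-batch iterations
    "base_lr": 0.001,
    "lr_policy": "step",    # drop the learning rate by a factor of 0.1
    "gamma": 0.1,
    "stepsize": 30000,      # ... every 30K iterations
    "ims_per_batch": 2,     # N images per mini-batch
    "rois_per_batch": 128,  # B RoIs fed into backpropagation
}
```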

For both experiments, we follow the procedure described in Sect. 3.3 to control the contribution coefficients of \(L_{cls}\) and \(L_{loc}\). When training starts, \(\alpha \) and \(\beta \) are set to 1 and 0, respectively. Once the loss becomes stable, we gradually increase \(\beta \) toward the observed ratio between classification and localization loss; specifically, \(\beta \) increases to 1.9 and 2.3 for the VOC07 and KITTI12 experiments, respectively.

4.2 Results and Analysis

Training Convergence. We first analyze the training loss for both methods by logging the average training loss every 10K steps. Figure 3 shows the average loss per RoI for VGG16 with the settings presented in Sect. 4.1. S-OHEM yields lower loss than the original OHEM in both classification and localization, validating our claim that S-OHEM leads to better training than OHEM. The results also indicate that S-OHEM achieves better classification confidence and localization accuracy during training.

Fig. 3. Training loss for S-OHEM and OHEM. We show the average loss per RoI for VGG16. These results indicate that S-OHEM achieves better classification confidence and localization accuracy during training.

VOC 2007. Table 1 shows that on VOC07, S-OHEM improves the mAP of OHEM from 71% to 71.1% for an IoU threshold of 0.5, with improvements of 0.4% and 0.3% for IoU thresholds of 0.6 and 0.7, respectively. Regarding category-specific improvements, S-OHEM performs well on most of the rigid categories (bold categories in Table 1) across all three IoU thresholds, especially for IoU 0.7.

As listed in Table 3(a), we compute the mAP over the rigid categories and observe increases of 0.1%, 0.5%, and 0.5% for IoU thresholds of 0.5, 0.6, and 0.7, respectively. It is also interesting that S-OHEM performs particularly well at detecting cats for an IoU threshold of 0.6, which indicates that S-OHEM generates better bounding boxes in this setting.

Table 1. VOC 2007 test detection average precision (%). All methods use VGG16 and bounding-box regression. Legend: IoU: IoU threshold.

KITTI 2012. The evaluation results on KITTI 2012 are shown in Table 2. S-OHEM improves the mAP of OHEM from 63.9% to 64.9% for an IoU threshold of 0.6, with an improvement of 0.5% for IoU 0.7. We also compute the mAP over the rigid categories and list the results in Table 3(b). Note that the misc category is classified as rigid based on our observation of the dataset. We observe an increase of 1.6% for both IoU thresholds of 0.6 and 0.7.

Table 2. KITTI 2012 test detection average precision (%). All methods use VGG16 and bounding-box regression. Legend: IoU: IoU threshold.
Table 3. Category specific mean average precision (%). All methods use VGG16 and bounding-box regression. Legend: IoU: IoU threshold. (a) On VOC 2007 test set. (b) On KITTI 2012 test set.

Rigid and Non-rigid Categories. Our experimental results show that S-OHEM performs quite well on the rigid categories of both the VOC07 and KITTI12 datasets. The reason is that rigid bodies achieve better classification accuracy on pre-trained deep convolutional networks owing to their strong resistance to deformation. Therefore, the shift in the contribution of different loss types throughout the training process (as described in Sect. 3.1) is more likely to occur for rigid bodies. Also, the border distributions of rigid bodies are more similar to one another and are thus easier to learn.

5 Conclusion

In this paper, we proposed the Stratified Online Hard Example Mining (S-OHEM) algorithm, a simple and effective method for training region-based deep convolutional network detectors that enhances localization accuracy. During hard example mining, S-OHEM exploits stratified sampling and accounts for the influence of different loss types throughout the training process. Experimental analysis shows that S-OHEM outperforms OHEM in terms of training convergence and localization accuracy, and achieves AP improvements on the rigid categories of PASCAL VOC 2007 and KITTI 2012. Moreover, S-OHEM addresses the localization-enhancement problem purely from the data perspective and can be easily plugged into existing region-based detectors. Furthermore, the state-of-the-art Mask R-CNN [13] also inherits the equal-weight multi-task loss, with an additional semantic segmentation task, and could therefore be improved by S-OHEM. S-OHEM can also be applied to other multi-task losses, including those for semantic segmentation, keypoint detection, etc.