
1 Introduction

One of the major and fundamental challenges in object detection is to increase localization accuracy, which indicates the detector’s ability to predict correct regions of target objects. This accuracy is typically measured by bounding box overlap, i.e., the intersection over union (IoU) of the ground-truth and predicted bounding boxes. While previous challenges (e.g., PASCAL VOC [3] and KITTI [5]) normally require an IoU threshold of 0.5 for a detection to be considered correct, real-world applications usually call for higher accuracy (e.g., IoU \(\ge \) 0.7). For example, vehicle and pedestrian detection in autonomous driving requires accurate distance estimation from real-time road traffic captures.

Recent literature has focused on modifying region-based detection models at the post-recognition level to boost localization accuracy [4, 6, 7]. However, limited work has addressed the problem from a data perspective. Data matters: rapid advances in data collection, storage, and processing technology have made machine learning, especially deep learning, much easier by lightening the burden of generalizing well to unseen data from a limited amount of training data [10].

However, the challenge of learning from imbalanced data [12] still exists. Within the “Recognition Using Regions” paradigm [11], the training set of object detection is divided into two distinct groups, annotated objects and background regions, and the numbers of examples in these groups are hugely imbalanced. Online Hard Example Mining (OHEM) [27] was proposed to overcome this data imbalance by integrating the bootstrapping technique [30] with region-based detectors, and it can be implemented effortlessly on most region-based detectors.

Fig. 1. Architecture of the Stratified Online Hard Example Mining algorithm (S-OHEM). We use the parameter notation from [27]. In each mini-batch iteration, N is the number of images sampled from the dataset, R is the number of forward-propagated RoIs, and B is the number of subsampled RoIs fed into backpropagation. We denote the classification loss by \(L_{cls}\) and the localization loss by \(L_{loc}\). S-OHEMiner conducts stratified sampling over the R region proposals according to the sampling distribution at the current training stage and produces B RoIs to be fed into backpropagation. We maintain a read-only RoI network and a standard RoI network with shared weights for efficient memory allocation, derived from [27]. The blue solid stream indicates the forward-propagation process and the green dashed stream the backpropagation process. More details are described in Sect. 3.3.

In this paper, we propose S-OHEM, the Stratified Online Hard Example Mining algorithm for training region-based deep convolutional network detectors to enhance localization accuracy, as shown in Fig. 1. The intuition behind our method is that feeding hard examples to the backpropagation process can overcome the problem of imbalanced data, resulting in a more efficient and effective training process [27]. In object detection, a hard example is defined as a region proposal with a high training loss. Thus, previous hard example mining methods (e.g., OHEM) sample region proposals according to a distribution that favors high-loss instances. However, the training loss defined in previous work is the multitask loss with equal weights across all loss types (e.g., classification, localization, mask [13], or rigid and non-rigid categories). This approach ignores the influence of different loss types throughout the training process, which we found essential to training efficacy (e.g., localization loss is more important during the latter part of the training period). Therefore, maintaining a sampling distribution that follows this influence during hard example mining should enhance the performance of object detectors.

S-OHEM exploits stratified sampling, a sampling method involving the division of a population into distinct groups known as strata [21] (homogeneous subgroups whose inner items are similar to each other). During each mini-batch iteration, S-OHEM first assigns candidate examples (in the form of Regions of Interest, RoIs) to different strata according to the ratio between classification and localization loss. The RoIs are then subsampled according to a dynamic distribution and fed into the backpropagation process. With an increasing focus on the localization loss, S-OHEM predicts more accurate bounding boxes and therefore enhances localization accuracy. We apply S-OHEM to the standard Fast R-CNN [8] and Faster R-CNN [26] detection methods and evaluate it on the PASCAL VOC 2007 and KITTI datasets. Our systematic experimental analysis shows that S-OHEM yields an AP improvement of 0.5% on the rigid categories of PASCAL VOC 2007 for both IoU thresholds of 0.6 and 0.7; for KITTI 2012, the improvement is 1.6% at both thresholds. Regarding mAP, a relative increase of 0.3% and 0.5% (1% and 0.5%) is observed for VOC07 (KITTI12) at the same IoU thresholds.

The remainder of this paper is structured as follows. In Sect. 2, we compare our work with related research, focusing on the improvement of localization accuracy and the use of data in object detection. In Sect. 3, we describe the design of the algorithm. In Sect. 4, we present the experimental results, and in Sect. 5, we conclude this work.

2 Related Work

Object detection has significantly benefited from advances in the image classification task. The remarkable feature extraction ability of deep convolutional networks [16, 20, 28, 31, 32] provides abundant information for the classification of region proposals. In addition, continuously developing practical strategies (e.g., activation functions [15, 24, 34], regularization [18, 29, 30], and optimization [2, 17, 19]) further contribute to the efficacy of deep neural networks.

Several region-based detectors rely on the strong classification capability of deep convolutional networks to evaluate generated RoIs. R-CNN was the first to adopt this approach, evaluating each RoI separately. Fast R-CNN [8] improved on this by enabling computation sharing through projecting RoIs onto a shared feature map (the RoIPool layer, derived from SPPnets [14]), resulting in better speed and accuracy. It was then integrated with the region proposal module (the Region Proposal Network, RPN) by sharing their convolutional features, extending it to a unified network with an “attention” [1] mechanism and leading to further speedup and accuracy gains. R-FCN [22] eliminates the fully-connected layers of region-based detectors and makes the whole model fully convolutional, using the backbones of state-of-the-art image classifiers [16, 32] to fully share computation, contributing to a significant speedup. Mask R-CNN [13], which adds a small Fully Convolutional Network (FCN) [23] as a parallel branch to the standard Faster R-CNN and replaces the RoIPool layer with the RoIAlign layer, is the latest descendant of this stream and achieves significant advances on several benchmarks for both detection and segmentation tasks. However, most of these models use the multitask loss with equal weights and do not consider the influence of different loss types throughout the training process.

Recent work has focused on the post-recognition level of region-based detection models to boost localization accuracy. Gidaris and Komodakis [6] proposed a CNN model for bounding box regression, used with iterative localization and bounding box voting. LocNet [7] aims to enhance localization accuracy by assigning a probability to each border of a loosely localized search region for being related to the object’s bounding box. It differs from the bounding box regression approach [4] adopted by most of the aforementioned region-based detectors and can serve as an effective alternative.

However, little work has focused on advancing region-based detectors from a data perspective. Online Hard Example Mining (OHEM) [27] integrates bootstrapping [30] (or hard example mining) with region-based detectors at a small extra computational cost, but it still lacks sufficient focus on localization accuracy because of the imbalance in the multi-task loss it inherits. Further discussion is given in Sect. 3.

3 Model Design

In this section, we argue that the current way of choosing hard examples lacks sufficient focus on localization accuracy and is suboptimal, and we show that our approach results in better training (lower training loss), higher localization performance, and higher average precision. First, we discuss the design motivation. Then we give a brief introduction to stratified sampling and define the stratum constraints used in this work. Finally, we present the design and implementation of our Stratified Online Hard Example Mining algorithm (S-OHEM).

3.1 Motivation

Most region-based detectors derive their multitask learning from Fast R-CNN and assume equal contributions of classification loss and localization loss throughout the training process. However, this assumption often does not hold. We apply the original OHEM to the standard Fast R-CNN and Faster R-CNN, then report the classification and localization loss throughout the training process on the PASCAL VOC and KITTI datasets, respectively.

As illustrated in Fig. 2, the classification loss is consistently larger than the localization loss (more than double on average). This can lead to a problem. Consider a situation where we have two region proposals, RoI A and RoI B, and must choose one as the hard example for backpropagation. Based on the preliminary experimental results shown in Fig. 2, we assume the training losses for RoI A and RoI B are \(L_{cls}(A)=0.21\), \(L_{loc}(A)=0.11\) and \(L_{cls}(B)=0.19\), \(L_{loc}(B)=0.12\), respectively. Recall that the classification loss is defined as the log loss \(L_{cls}(p,u)=-\log p_u\) for the true class u [8], so the probability of the true class is 61.5% for RoI A and 64.5% for RoI B. This is not a significant gap in class prediction probability, so the two RoIs can be considered to perform similarly on the classification task.

Fig. 2. Influence of different loss types throughout the training process. For better visualization, we average the training loss over every 1000 iterations.

Regarding the localization loss, the gap between RoI A and RoI B is 0.01 \((L_{loc}(B) - L_{loc}(A) = 0.12 - 0.11 = 0.01)\). Under the smooth \(L_{1}\) loss [8], this gap translates to a difference of about 0.14 between the ground-truth and predicted bounding box offsets. This gap is quite significant under the parameterization of bounding box offsets given in [9], so we should choose RoI B as the hard example to obtain better localization accuracy and prediction quality. However, under the equal-weight multitask loss, RoI A is chosen as the hard example instead. Thus, the previous hard example mining approach lacks focus on localization accuracy.
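To make this translation concrete, the short sketch below (ours, not from the original implementation) evaluates the smooth \(L_{1}\) loss and inverts its quadratic branch, showing that a loss gap of 0.01 corresponds to an offset gap of roughly \(\sqrt{2 \times 0.01} \approx 0.14\).

```python
import math

def smooth_l1(x):
    """Smooth L1 loss from Fast R-CNN: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

# A localization-loss gap of 0.01 lies on the quadratic branch (|x| < 1),
# so the corresponding offset gap is sqrt(2 * 0.01) ~= 0.14.
loss_gap = 0.12 - 0.11
offset_gap = math.sqrt(2.0 * loss_gap)
print(round(offset_gap, 2))        # 0.14
print(round(smooth_l1(0.14), 4))   # 0.0098, i.e. back to about 0.01
```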

3.2 Stratified Sampling

Stratified sampling is a sampling method involving the division of a population into distinct groups known as strata [21]. These strata are homogeneous subgroups of the original data whose inner items are similar to each other. Stratified sampling achieves higher statistical precision because the variability within subgroups sharing the same properties is lower than that of the entire population [33]. Therefore, stratified sampling improves representativeness by reducing sampling error.

Each stratum constraint \(s_k\) is denoted by \(s_k = (p_k, f_k)\), where \(p_k\) is a propositional formula and \(f_k\) is the required sample size. In this work, the four stratum constraints are defined by the combination of classification loss (\(L_{cls}\)) and localization loss (\(L_{loc}\)) levels: \(s_1 = (\text{high } L_{cls} \text{ and high } L_{loc}, f_1)\), \(s_2 = (\text{high } L_{cls} \text{ and low } L_{loc}, f_2)\), \(s_3 = (\text{low } L_{cls} \text{ and high } L_{loc}, f_3)\), and \(s_4 = (\text{low } L_{cls} \text{ and low } L_{loc}, f_4)\). The required sample sizes and the threshold for high loss (hard examples) change dynamically throughout the training process.
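As an illustration, the snippet below (a minimal sketch; the function name and threshold values are ours and purely illustrative) assigns a single RoI to one of the four strata by comparing its classification and localization losses against the current “high loss” thresholds.

```python
def assign_stratum(l_cls, l_loc, t_cls, t_loc):
    """Map one RoI to a stratum index 1..4 by its loss combination.

    t_cls / t_loc separate "high" from "low" loss; in S-OHEM these
    thresholds change dynamically throughout training.
    """
    if l_cls >= t_cls and l_loc >= t_loc:
        return 1  # s1: high L_cls, high L_loc
    if l_cls >= t_cls:
        return 2  # s2: high L_cls, low  L_loc
    if l_loc >= t_loc:
        return 3  # s3: low  L_cls, high L_loc
    return 4      # s4: low  L_cls, low  L_loc

# The two RoIs from Sect. 3.1, with illustrative thresholds:
print(assign_stratum(0.21, 0.11, t_cls=0.20, t_loc=0.12))  # RoI A -> stratum 2
print(assign_stratum(0.19, 0.12, t_cls=0.20, t_loc=0.12))  # RoI B -> stratum 3
```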

3.3 Stratified Online Hard Example Mining Algorithm

Given the observation that the previous hard example mining approach ignores the influence of different loss types throughout the training process and lacks focus on localization accuracy, we now demonstrate our approach of Stratified Online Hard Example Mining (S-OHEM).

The architecture of S-OHEM is shown in Fig. 1. In each mini-batch iteration, S-OHEM first generates region proposals for the input images, forward-propagates all of them through the region-based detector, and gathers the training loss of each RoI. Each RoI is then assigned to one of the four strata defined in Sect. 3.2. The combination of loss types reflects how well the current detector performs on the classification and localization tasks for each RoI. Inside each stratum, hard examples are chosen by sorting the RoIs by loss. After that, all RoIs are subsampled according to a dynamic distribution, and a total of B hard examples are fed into the backpropagation process. The sampling distribution over strata changes dynamically throughout the learning process, because each loss type contributes differently to the detector at different training stages. Specifically, the classification loss matters more at the beginning, while the localization loss contributes more at later training stages.

For implementation, we keep a record of the training loss history and start to change the sampling distribution once the loss becomes stable (e.g., after 40K iterations, as shown in Fig. 2). At the beginning of training, we sample only the B RoIs with the highest \(L_{cls}\) (i.e., sample from \(s_{12}\), the union of strata \(s_1\) and \(s_2\)). Once the loss becomes stable, we gradually shift toward RoIs with high \(L_{loc}\) (i.e., sample from the union of \(s_2\) and \(s_3\), denoted by \(s_{23}\)) by increasing the sampling ratio between \(s_{23}\) and \(s_1\). Because of this gradually increasing focus on the localization loss, S-OHEM can predict more accurate bounding boxes and thus enhance localization accuracy.
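The following sketch (our simplification; the function name, the linear ramp, and the use of 40K iterations as a default are illustrative choices) captures this schedule: before the loss stabilizes, all B hard examples are drawn by classification loss (the \(s_{12}\) side), and afterwards a growing fraction is drawn by localization loss (the \(s_{23}\) side).

```python
def sample_hard_examples(rois, B, iteration, stable_iter=40000, max_loc_share=0.5):
    """Pick B hard RoIs for backpropagation with a dynamic stratum split.

    `rois` is a list of dicts with keys 'l_cls' and 'l_loc'. Before
    `stable_iter`, all B examples are the highest-L_cls RoIs (the s12 side);
    afterwards a linearly growing share (capped at `max_loc_share`) is taken
    from the highest-L_loc RoIs (the s23 side).
    """
    if iteration < stable_iter:
        loc_share = 0.0
    else:
        loc_share = min(max_loc_share,
                        max_loc_share * (iteration - stable_iter) / stable_iter)

    n_loc = int(round(B * loc_share))
    n_cls = B - n_loc

    by_cls = sorted(rois, key=lambda r: r['l_cls'], reverse=True)
    by_loc = sorted(rois, key=lambda r: r['l_loc'], reverse=True)

    selected = by_cls[:n_cls]
    chosen = set(map(id, selected))
    # Fill the remaining slots with the highest-L_loc RoIs not yet selected.
    selected += [r for r in by_loc if id(r) not in chosen][:n_loc]
    return selected
```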

An equivalent formulation is available. For simplicity, we denote the contribution coefficients of \(L_{cls}\) and \(L_{loc}\) to hard example selection by \(\alpha \) and \(\beta \), respectively. Our approach then aims to find the optimal values of \(\alpha \) and \(\beta \) in Formula (1) at different training stages. \(L_{select}\) is used only for hard example mining; the actual loss backpropagated through the network is not affected.

$$\begin{aligned} L_{select} = \alpha L_{cls} + \beta L_{loc} \end{aligned}$$
(1)

When training begins, we sample only the B RoIs with the highest \(L_{cls}\) by setting \(\alpha \) and \(\beta \) in Formula (1) to 1 and 0, respectively. When the loss becomes stable, we gradually shift toward RoIs with high \(L_{loc}\) by decreasing \(\alpha \) and increasing \(\beta \) in Formula (1).
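A minimal sketch of this alternative formulation (the function name is ours and the selection rule is illustrative): score every forward-propagated RoI with \(L_{select}\) and keep the top B.

```python
def select_by_weighted_loss(rois, B, alpha, beta):
    """Rank RoIs by L_select = alpha * L_cls + beta * L_loc and keep the top B.

    L_select is used only to pick hard examples; the loss actually
    backpropagated through the network is left unchanged.
    """
    scored = sorted(rois,
                    key=lambda r: alpha * r['l_cls'] + beta * r['l_loc'],
                    reverse=True)
    return scored[:B]

# Early training: alpha = 1, beta = 0 (classification-driven selection).
# After the loss stabilizes, beta is gradually increased (e.g., toward the
# values reported in Sect. 4.1) so that localization loss drives selection.
```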

S-OHEM does not significantly affect training time, because most of the forward computation is shared between RoIs [8] and the number of backpropagated examples is much smaller than the total number of region proposals for the input images. To avoid co-located RoIs and double-counting of their loss, we follow the solution of [27] and apply non-maximum suppression (NMS) [25] to deduplicate RoIs before the sampling procedure. NMS works by iteratively selecting the highest-loss RoI and eliminating all lower-loss RoIs that have high overlap with the selected region. We also adopt their design of maintaining a read-only RoI network and a standard RoI network with shared weights for efficient memory allocation. It is also worth noting that S-OHEM can be combined with any of the post-recognition regressors introduced in Sect. 2, because it enhances localization accuracy purely from the data perspective.
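For reference, a simplified version of that deduplication step might look as follows (our own sketch, assuming axis-aligned boxes given as [x1, y1, x2, y2]; the overlap threshold is illustrative):

```python
def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms_by_loss(boxes, losses, overlap_thresh=0.7):
    """Greedy NMS that ranks RoIs by training loss instead of detection score."""
    order = sorted(range(len(boxes)), key=lambda i: losses[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < overlap_thresh]
    return keep
```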

4 Experiments and Results

In this section, we conduct systematic experiments to evaluate the proposed S-OHEM and compare it with the original OHEM. We describe the experimental setup in Sect. 4.1 and demonstrate the efficiency and accuracy of the algorithm by examining the training loss and average precision.

4.1 Experimental Setup

We use the standard and popular CNN architecture VGG16 from [28] and evaluate the algorithms on the PASCAL VOC 2007 and KITTI Object Detection Evaluation 2012 datasets. In the PASCAL VOC experiment, training is done on the trainval set and testing on the test set. In the KITTI 2012 experiment, we use the first 5000 images to form the training set and the remaining 2481 images for testing. All models are trained with SGD for 80K mini-batch iterations under the setup described below. For average precision, we report results with IoU thresholds of 0.5, 0.6, and 0.7 to evaluate localization accuracy over a wider range of IoU thresholds. We use Fast R-CNN [8] as the detector base for the PASCAL VOC experiment and Faster R-CNN [26] for the KITTI 2012 experiment to demonstrate the general applicability of our approach. The initial learning rate is set to 0.001 and dropped in “steps” by a factor of 0.1 every 30K iterations. We process 2 images in each mini-batch iteration and subsample 128 RoIs to feed into backpropagation. Note that the OHEM baselines reported in Table 2 (rows 1–2) were reproduced by us and are slightly higher than those reported in [27].
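For reference, these settings correspond to a solver configuration along the following lines (an illustrative sketch; the field names are ours, not taken from any actual configuration file):

```python
# Illustrative hyper-parameters mirroring Sect. 4.1 (field names are ours).
SOLVER = {
    "backbone": "VGG16",
    "max_iters": 80000,     # SGD mini-batch iterations
    "base_lr": 0.001,
    "lr_policy": "step",    # drop the learning rate by a factor of 0.1
    "gamma": 0.1,
    "stepsize": 30000,      # ... every 30K iterations
    "ims_per_batch": 2,     # N images per mini-batch
    "rois_per_batch": 128,  # B RoIs fed into backpropagation
}
```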

For both experiments, we follow the procedure described in Sect. 3.3 to control the contribution coefficients of \(L_{cls}\) and \(L_{loc}\). When training starts, \(\alpha \) and \(\beta \) are set to 1 and 0, respectively. Once the loss becomes stable, we gradually increase \(\beta \) toward the observed ratio between classification and localization loss; specifically, \(\beta \) increases to 1.9 and 2.3 for the VOC07 and KITTI12 experiments, respectively.

4.2 Results and Analysis

Training Convergence. We first analyze the training loss for both methods by logging the average training loss every 10K steps. Figure 3 shows the average loss per RoI for VGG16 with the settings presented in Sect. 4.1. S-OHEM yields lower loss than the original OHEM in both classification and localization, validating our claim that S-OHEM leads to better training than OHEM. The results also indicate that S-OHEM achieves better classification confidence and localization accuracy during training.

Fig. 3. Training loss for S-OHEM and OHEM. We show the average loss per RoI for VGG16. These results indicate that S-OHEM achieves better classification confidence and localization accuracy during training.

VOC 2007. Table 1 shows that on VOC07, S-OHEM improves the mAP of OHEM from 71% to 71.1% for an IoU threshold of 0.5, with improvements of 0.4% and 0.3% for IoU thresholds of 0.6 and 0.7, respectively. Regarding category-specific improvements, S-OHEM performs well on most of the rigid categories (bold categories in Table 1) across all three IoU thresholds, especially for IoU 0.7.

As listed in Table 3(a), we compute the mAP over the rigid categories and observe increases of 0.1%, 0.5%, and 0.5% for IoU thresholds of 0.5, 0.6, and 0.7, respectively. It is also interesting that S-OHEM performs particularly well at detecting cats for an IoU threshold of 0.6, which indicates that S-OHEM generates better bounding boxes in this setting.

Table 1. VOC 2007 test detection average precision (%). All methods use VGG16 and bounding-box regression. Legend: IoU: IoU threshold.

KITTI 2012. The evaluation results on KITTI 2012 are shown in Table 2. S-OHEM improves the mAP of OHEM from 63.9% to 64.9% for an IoU threshold of 0.6, with an improvement of 0.5% for IoU 0.7. We also compute the mAP over the rigid categories and list the results in Table 3(b). Note that the misc category is classified as rigid based on our observation of the dataset. We observe an increase of 1.6% for both IoU thresholds of 0.6 and 0.7.

Table 2. KITTI 2012 test detection average precision (%). All methods use VGG16 and bounding-box regression. Legend: IoU: IoU threshold.
Table 3. Category specific mean average precision (%). All methods use VGG16 and bounding-box regression. Legend: IoU: IoU threshold. (a) On VOC 2007 test set. (b) On KITTI 2012 test set.

Rigid and Non-rigid Categories. Our experimental results show that S-OHEM performs quite well on the rigid categories of both the VOC07 and KITTI12 datasets. The reason is that rigid bodies achieve better classification accuracy on pre-trained deep convolutional networks owing to their strong resistance to deformation. Therefore, the shift in the contribution of different loss types throughout the training process (as described in Sect. 3.1) is more likely to occur for rigid bodies. Also, the border distributions of rigid bodies are more similar to one another and are thus easier to learn.

5 Conclusion

In this paper, we proposed the Stratified Online Hard Example Mining (S-OHEM) algorithm, a simple and effective method for training region-based deep convolutional network detectors that enhances localization accuracy. During hard example mining, S-OHEM exploits stratified sampling and accounts for the influence of different loss types throughout the training process. Experimental analysis shows that S-OHEM outperforms OHEM in terms of training convergence and localization accuracy, and achieves AP improvements on the rigid categories of PASCAL VOC 2007 and KITTI 2012. Moreover, S-OHEM addresses the localization-enhancement problem purely from the data perspective and can be easily plugged into existing region-based detectors. Furthermore, the state-of-the-art Mask R-CNN [13] also inherits the equal-weight multi-task loss, with an additional semantic segmentation task, and could therefore be improved by S-OHEM. S-OHEM can also be applied to other multi-task losses, including those for semantic segmentation, keypoint detection, etc.