Unsupervised Hard Example Mining from Videos for Improved Object Detection

  • SouYoung Jin
  • Aruni RoyChowdhury
  • Huaizu Jiang
  • Ashish Singh
  • Aditya Prasad
  • Deep Chakraborty
  • Erik Learned-Miller
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11217)


Important gains have recently been obtained in object detection by using training objectives that focus on hard negative examples, i.e., negative examples that are currently rated as positive or ambiguous by the detector. These examples can strongly influence parameters when the network is trained to correct them. Unfortunately, they are often sparse in the training data, and are expensive to obtain. In this work, we show how large numbers of hard negatives can be obtained automatically by analyzing the output of a trained detector on video sequences. In particular, detections that are isolated in time, i.e., that have no associated preceding or following detections, are likely to be hard negatives. We describe simple procedures for mining large numbers of such hard negatives (and also hard positives) from unlabeled video data. Our experiments show that retraining detectors on these automatically obtained examples often significantly improves performance. We present experiments on multiple architectures and multiple data sets, including face detection, pedestrian detection and other object categories.
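The core heuristic above, flagging detections that have no counterpart in the preceding or following frame, can be illustrated with a minimal sketch. The abstract does not specify how detections are associated across frames, so the IoU-based matching, the 0.5 threshold, and the names `iou` and `isolated_detections` below are assumptions made for illustration, not the authors' procedure. The first and last frames are skipped because they lack one of the two neighbours.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical box format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def isolated_detections(
    frames: List[List[Box]], iou_thresh: float = 0.5
) -> List[Tuple[int, Box]]:
    """Return (frame_index, box) pairs with no overlapping detection
    in either the previous or the next frame.

    Assumption: "associated" is approximated by IoU >= iou_thresh.
    """
    isolated = []
    # Only interior frames have both a preceding and a following frame.
    for t in range(1, len(frames) - 1):
        neighbours = frames[t - 1] + frames[t + 1]
        for box in frames[t]:
            if not any(iou(box, other) >= iou_thresh for other in neighbours):
                isolated.append((t, box))
    return isolated


# Toy input: one box persists across three frames, one appears only in frame 1.
frames = [
    [(10, 10, 50, 50)],
    [(12, 10, 52, 50), (200, 200, 240, 240)],
    [(14, 10, 54, 50)],
]
candidates = isolated_detections(frames)
print(candidates)
```

In this toy input only the box that appears in the middle frame alone is returned as a hard-negative candidate; the box that persists across all three frames is left out.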


Keywords: Object detection, Face detection, Pedestrian detection, Semi-supervised learning, Hard negative mining



This research is based in part upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA) under contract number 2014-14071600010 and in part on research sponsored by the Air Force Research Laboratory and DARPA under agreement number FA8750-18-2-0126. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, the Air Force Research Laboratory and DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Supplementary material

Supplementary material 1 (pdf 2161 KB)

Supplementary material 2 (mp4 20529 KB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • SouYoung Jin (1)
  • Aruni RoyChowdhury (1)
  • Huaizu Jiang (1)
  • Ashish Singh (1)
  • Aditya Prasad (1)
  • Deep Chakraborty (1)
  • Erik Learned-Miller (1)

  1. College of Information and Computer Sciences, University of Massachusetts, Amherst, USA
