Frustratingly Easy Trade-off Optimization Between Single-Stage and Two-Stage Deep Object Detectors

  • Petru Soviany
  • Radu Tudor Ionescu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)

Abstract

State-of-the-art object detectors fall into two main types. On the one hand, there are two-stage detectors, such as Faster R-CNN (Region-based Convolutional Neural Networks) or Mask R-CNN, that (i) use a Region Proposal Network to generate regions of interest in the first stage and (ii) send the region proposals down the pipeline for object classification and bounding-box regression. Such models reach the highest accuracy rates, but are typically slower. On the other hand, there are single-stage detectors, such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector), that treat object detection as a simple regression problem, taking an input image and learning the class probabilities and bounding-box coordinates. Such models reach lower accuracy rates, but are much faster than two-stage object detectors. In this paper, we propose and evaluate four simple and straightforward approaches to achieve an optimal trade-off between accuracy and speed in object detection. All the approaches are based on separating the test images into two batches: an easy batch that is fed to a faster single-stage detector and a difficult batch that is fed to a more accurate two-stage detector. The four approaches differ in the criterion used for splitting the images into the two batches. The criteria are the image difficulty score (easier images go into the easy batch), the number of detected objects (images with fewer objects go into the easy batch), the average size of the detected objects (images with bigger objects go into the easy batch), and the number of detected objects divided by their average size (images with fewer and bigger objects go into the easy batch). The first approach is based on an image difficulty predictor, while the other three approaches employ a faster single-stage detector to determine the approximate number of objects and their sizes.
Our experiments on PASCAL VOC 2007 show that using image difficulty compares favorably to a random split of the images. However, splitting the images based on the number of objects divided by their average size, an approach that is frustratingly easy to implement, produces even better results. Remarkably, it shortens the processing time by nearly half, while reducing the mean Average Precision of Faster R-CNN by only \(0.5\%\).
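
The batch-splitting idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Detection` type, the function names, the use of normalized box area as "size", and the `easy_fraction` parameter are all assumptions introduced here for illustration. The sketch scores each image from the boxes returned by a fast single-stage detector, sorts the images by score, and sends the lowest-scoring fraction to the easy batch.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Detection:
    """One box from the fast single-stage detector (hypothetical type)."""
    width: float   # box width as a fraction of the image width
    height: float  # box height as a fraction of the image height


def mean_area(dets: Sequence[Detection]) -> float:
    """Average normalized box area; 0.0 when nothing was detected."""
    if not dets:
        return 0.0
    return sum(d.width * d.height for d in dets) / len(dets)


def count_over_size(dets: Sequence[Detection]) -> float:
    """Number of detected objects divided by their average size.

    Few, large objects give a low score (easy image); many, small
    objects give a high score (difficult image).
    """
    area = mean_area(dets)
    if area == 0.0:
        return 0.0  # assumption: an image with no detections counts as easy
    return len(dets) / area


def split_batches(scores: Sequence[float],
                  easy_fraction: float) -> Tuple[List[int], List[int]]:
    """Return (easy, hard) image indices; lower scores go to the easy batch."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = int(easy_fraction * len(scores))
    return order[:k], order[k:]


# Toy example: four images with made-up detections.
images = [
    [Detection(0.8, 0.8)],                       # one large object
    [Detection(0.1, 0.1)] * 5,                   # five small objects
    [Detection(0.5, 0.5), Detection(0.5, 0.5)],  # two medium objects
    [Detection(0.2, 0.2)] * 3,                   # three small objects
]
scores = [count_over_size(dets) for dets in images]
easy, hard = split_batches(scores, easy_fraction=0.5)
print("easy batch:", easy)  # images for the single-stage detector
print("hard batch:", hard)  # images for the two-stage detector
```

The other criteria fit the same `split_batches` function by swapping the score: the number of detections alone, the negated average area (so that images with bigger objects sort first), or the output of an image difficulty predictor. Varying `easy_fraction` between 0 and 1 moves between running only the two-stage detector and running only the single-stage detector.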

Keywords

Object detection, Deep neural networks, Single-shot multibox detector, Faster R-CNN

Notes

Acknowledgments

The work of Petru Soviany was supported through project grant PN-III-P2-2.1-PED-2016-1842. The work of Radu Tudor Ionescu was supported through project grant PN-III-P1-1.1-PD-2016-0787.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Bucharest, Bucharest, Romania
  2. Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE
