Single-Shot Scale-Aware Network for Real-Time Face Detection

International Journal of Computer Vision

Abstract

In this work, we describe a single-shot scale-aware convolutional neural network based face detector (SFDet). Compared with state-of-the-art anchor-based face detection methods, our method has four main advantages. (1) We propose a scale-aware detection network that uses layers spanning a wide range of scales, each associated with anchors of appropriate scales, to handle faces of various scales, and describe a new equal density principle that ensures anchors of different scales are evenly distributed on the image. (2) To improve the recall rates of faces with certain scales (e.g., faces whose scales are quite different from those of the designed anchors), we design a new anchor matching strategy with scale compensation. (3) We introduce an IoU-aware weighting scheme for each training sample in the classification loss calculation to encode samples accurately during training. (4) To address the class imbalance issue, a max-out background strategy is used to reduce false positives. Experiments on challenging public face detection datasets, i.e., WIDER FACE, AFW, PASCAL Face, FDDB, and MAFA, demonstrate that the proposed method achieves state-of-the-art results and runs at 82.1 FPS on VGA-resolution images.
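
The IoU mentioned in contribution (3) is the intersection-over-union (Jaccard overlap) between two regions. As a minimal sketch of that quantity only, and not of the IoU-aware weighting scheme itself (whose form is not given here), the overlap of two axis-aligned boxes could be computed as follows. The `Box` layout `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions made for illustration.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Return the intersection-over-union of two axis-aligned boxes."""
    # Corners of the intersection rectangle (may be empty).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero so that disjoint boxes give zero intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0.0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes sharing a 1x1 corner: intersection 1, union 7.
    print(iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```

In an anchor-based detector, a value of this kind is typically what is compared against a threshold when deciding whether an anchor matches a ground truth face.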

Notes

  1. We denote the reference bounding box as “anchor box”, which is also called “anchor” for simplicity, as in Ren et al. (2017). However, in Liu et al. (2016), it is also called “default box”.

  2. We refer to very tiny faces, and to faces whose scales fall roughly midway between two adjacent scales of the designed anchors, as outlier faces.

  3. The Jaccard overlap (Erhan et al. 2014) is also known as intersection-over-union, which is used to measure the overlap rate between two regions here.

  4. Since the ratio between positive and negative anchors is set to 1:3, we use \(\lambda =4\) to balance the classification and regression losses in training.

  5. Empirically, we set \(C=3\) in our experiments.

  6. A negative anchor is an anchor that is not matched to any ground truth bounding box.

  7. http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/index.html.

  8. https://www.cs.cmu.edu/~peiyunh/tiny/.

  9. http://web.engr.illinois.edu/~dhoiem/projects/detectionAnalysis.

  10. http://www.ics.uci.edu/~xzhu/face/.

  11. http://host.robots.ox.ac.uk/pascal/VOC/voc2012/.

  12. http://vis-www.cs.umass.edu/fddb/index.html.

  13. http://www.escience.cn/people/geshiming/mafa.html.

  14. https://github.com/BVLC/caffe/pull/2213.

  15. For example, our FPN architecture involves five main steps: (1) take the low-level features L (\(15\times 15\)) as input, (2) apply an operation with stride 2 to obtain the high-level features H (\(8\times 8\)), (3) upsample H to H-up (\(16\times 16\)), (4) crop H-up to H-up-crop (\(15\times 15\)), and (5) add L and H-up-crop.

  16. https://bitbucket.org/deeplab/deeplab-public.
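
The five steps in note 15 can be traced on a toy single-channel feature map to check that the spatial sizes line up. This is only a shape-level sketch under stated assumptions: plain subsampling stands in for the unspecified stride-2 operation, nearest-neighbour repetition stands in for the upsampling, and the crop keeps the top-left region, none of which is confirmed by the note. All function names are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]  # a single-channel feature map as nested lists


def downsample_stride2(x: Grid) -> Grid:
    """Stand-in for a stride-2 operation: keep every second row and column."""
    return [row[::2] for row in x[::2]]


def upsample_2x(x: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling: repeat each value in a 2x2 block."""
    out: Grid = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def crop_to(x: Grid, height: int, width: int) -> Grid:
    """Crop to the top-left height x width region (assumed crop position)."""
    return [row[:width] for row in x[:height]]


def add(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two feature maps of identical size."""
    if shape(a) != shape(b):
        raise ValueError("feature maps must have the same size")
    return [[va + vb for va, vb in zip(ra, rb)] for ra, rb in zip(a, b)]


def shape(x: Grid) -> Tuple[int, int]:
    return (len(x), len(x[0]))


low = [[1.0] * 15 for _ in range(15)]      # (1) low-level features L, 15x15
high = downsample_stride2(low)             # (2) high-level features H, 8x8
high_up = upsample_2x(high)                # (3) H-up, 16x16
high_up_crop = crop_to(high_up, 15, 15)    # (4) H-up-crop, 15x15
fused = add(low, high_up_crop)             # (5) L + H-up-crop

if __name__ == "__main__":
    for name, grid in [("L", low), ("H", high), ("H-up", high_up),
                       ("H-up-crop", high_up_crop), ("fused", fused)]:
        print(name, shape(grid))
```

The crop is needed because an odd input size (15) rounds up to 8 under stride 2, so doubling it overshoots by one row and one column.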

References

  • Barbu, A., Lay, N., & Gramajo, G. (2014). Face detection with a 3d model. CoRR arXiv:abs/1404.35968.

  • Bell, S., Zitnick, C. L., Bala, K., & Girshick, R. B. (2016). Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 2874–2883).

  • Brubaker, S. C., Wu, J., Sun, J., Mullin, M. D., & Rehg, J. M. (2008). On the design of cascades of boosted ensembles for face detection. International Journal of Computer Vision, 77(1–3), 65–86.

  • Cai, Z., Fan, Q., Feris, R. S., & Vasconcelos, N. (2016). A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of European conference on computer vision (pp. 354–370).

  • Chen, D., Hua, G., Wen, F., & Sun, J. (2016). Supervised transformer network for efficient face detection. In Proceedings of European conference on computer vision (pp. 122–138).

  • Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2015). Semantic image segmentation with deep convolutional nets and fully connected crfs. In International conference on learning representations.

  • Chen, Y., Song, L., & He, R. (2017). Masquer hunter: Adversarial occlusion-aware face detection. CoRR arXiv:abs/1709.05188.

  • Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. In D. D. Lee, M. Sugiyama, V. von Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, Barcelona, Spain (pp. 379–387).

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 886–893).

  • Erhan, D., Szegedy, C., Toshev, A., & Anguelov, D. (2014). Scalable object detection using deep neural networks. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 2155–2162).

  • Farfade, S. S., Saberian, M. J., & Li, L. (2015). Multi-view face detection using deep convolutional neural networks. In ACM on international conference on multimedia retrieval (pp. 643–650).

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D. A., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

  • Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.

  • Fu, C., Liu, W., Ranga, A., Tyagi, A., & Berg, A. C. (2017). DSSD: Deconvolutional single shot detector. CoRR arXiv:abs/1701.06659.

  • Ge, S., Li, J., Ye, Q., & Luo, Z. (2017). Detecting masked faces in the wild with lle-cnns. CVPR (pp. 426–434).

  • Ghiasi, G., & Fowlkes, C. C. (2015). Occlusion coherence: Detecting and localizing occluded faces. CoRR arXiv:abs/1506.08347.

  • Gidaris, S., & Komodakis, N. (2015). Object detection via a multi-region and semantic segmentation-aware CNN model. In Proceedings of IEEE international conference on computer vision (pp. 1134–1142).

  • Girshick, R. B. (2015). Fast R-CNN. In Proceedings of IEEE international conference on computer vision (pp. 1440–1448).

  • Girshick, R. B., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 580–587).

  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics (pp. 249–256).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. In Proceedings of European conference on computer vision (pp. 346–361).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 770–778).

  • Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in object detectors. In ECCV (pp. 340–353).

  • Howard, A. G. (2013). Some improvements on deep convolutional neural network based image classification. CoRR arXiv:abs/1312.5402.

  • Hu, P., & Ramanan, D. (2017). Finding tiny faces. In CVPR (pp. 1522–1530).

  • Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., et al. (2016). Speed/accuracy trade-offs for modern convolutional object detectors. CoRR arXiv:abs/1611.10012.

  • Jain, V., & Learned-Miller, E. (2010). Fddb: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst.

  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R. B., et al. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM international conference on multimedia (pp. 675–678).

  • Jiang, H., & Learned-Miller, E. (2016). Face detection with the faster r-cnn. CoRR arXiv:abs/1606.03473.

  • Jiang, H., & Learned-Miller, E. G. (2017). Face detection with the faster R-CNN. In Proceedings of IEEE international conference on automatic face and gesture recognition (pp. 650–657).

  • Kalal, Z., Matas, J., & Mikolajczyk, K. (2008). Weighted sampling for large-scale boosting. In Proceedings of British machine vision conference (pp. 1–10).

  • Kong, T., Sun, F., Yao, A., Liu, H., Lu, M., & Chen, Y. (2017). RON: Reverse connection with objectness prior networks for object detection. In Proceedings of IEEE conference on computer vision and pattern recognition.

  • Kong, T., Yao, A., Chen, Y., & Sun, F. (2016). Hypernet: Towards accurate region proposal generation and joint object detection. In Proceedings of IEEE conference on computer vision and pattern recognition.

  • Kumar, V., Namboodiri, A. M., & Jawahar, C. V. (2015). Visual phrases for exemplar face detection. In Proceedings of IEEE international conference on computer vision (pp. 1994–2002).

  • Lee, H., Eum, S., & Kwon, H. (2017). ME R-CNN: Multi-expert region-based CNN for object detection. In Proceedings of IEEE international conference on computer vision.

  • Li, H., Lin, Z., Brandt, J., Shen, X., & Hua, G. (2014). Efficient boosted exemplar-based face detection. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 1843–1850).

  • Li, H., Lin, Z., Shen, X., Brandt, J., & Hua, G. (2015). A convolutional neural network cascade for face detection. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 5325–5334).

  • Li, J., & Zhang, Y. (2013). Learning SURF cascade for fast and accurate object detection. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 3468–3475).

  • Li, Y., Sun, B., Wu, T., & Wang, Y. (2016). Face detection with end-to-end integration of a convnet and a 3d model. In Proceedings of European conference on computer vision (pp. 420–436).

  • Liao, S., Jain, A. K., & Li, S. Z. (2016). A fast and accurate unconstrained face detector. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2), 211–223.

  • Lin, T., Dollár, P., Girshick, R. B., He, K., Hariharan, B., & Belongie, S. J. (2017a). Feature pyramid networks for object detection. In Proceedings of IEEE conference on computer vision and pattern recognition.

  • Lin, T., Goyal, P., Girshick, R. B., He, K., & Dollár, P. (2017b). Focal loss for dense object detection. In Proceedings of IEEE international conference on computer vision.

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.E., Fu, C., et al. (2016). SSD: Single shot multibox detector. In Proceedings of European conference on computer vision (pp. 21–37).

  • Liu, W., Rabinovich, A., & Berg, A. C. (2015). Parsenet: Looking wider to see better. CoRR arXiv:abs/1506.04579.

  • Liu, Y., Li, H., Yan, J., Wei, F., Wang, X., & Tang, X. (2017). Recurrent scale approximation for object detection in CNN. In ICCV.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

  • Luo, W., Li, Y., Urtasun, R., & Zemel, R. S. (2016). Understanding the effective receptive field in deep convolutional neural networks. In Advances in neural information processing systems (pp. 4898–4906).

  • Mathias, M., Benenson, R., Pedersoli, M., & Gool, L. J. V. (2014). Face detection without bells and whistles. In Proceedings of European conference on computer vision.

  • Najibi, M., Samangouei, P., Chellappa, R., & Davis, L. S. (2017). SSH: Single stage headless face detector. In ICCV.

  • Ohn-Bar, E., & Trivedi, M. M. (2016). To boost or not to boost? On the limits of boosted trees for object detection. In International conference on pattern recognition.

  • Pham, M., & Cham, T. (2007). Fast training and selection of haar features using statistics in boosting-based face detection. In Proceedings of IEEE international conference on computer vision (pp. 1–7).

  • Qin, H., Yan, J., Li, X., & Hu, X. (2016). Joint training of cascaded CNN for face detection. In Proceedings of IEEE conference on computer vision and pattern recognition.

  • Ranjan, R., Patel, V. M., & Chellappa, R. (2015). A deep pyramid deformable part model for face detection. In IEEE international conference on biometrics theory, applications and systems (pp. 1–8).

  • Ranjan, R., Patel, V. M., & Chellappa, R. (2016). Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. CoRR arXiv:abs/1603.01249.

  • Redmon, J., Divvala, S. K., Girshick, R. B., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 779–788).

  • Redmon, J., & Farhadi, A. (2016). YOLO9000: Better, faster, stronger. CoRR arXiv:abs/1612.08242.

  • Ren, S., He, K., Girshick, R. B., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.

  • Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2014). Overfeat: Integrated recognition, localization and detection using convolutional networks. In International conference on learning representations.

  • Shen, X., Lin, Z., Brandt, J., & Wu, Y. (2013). Detecting and aligning faces by image retrieval. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 3460–3467).

  • Shen, Z., Liu, Z., Li, J., Jiang, Y., Chen, Y., & Xue, X. (2017). DSOD: Learning deeply supervised object detectors from scratch. In Proceedings of IEEE international conference on computer vision.

  • Shrivastava, A., & Gupta, A. (2016). Contextual priming and feedback for faster R-CNN. In Proceedings of European conference on computer vision (pp. 330–348).

  • Shrivastava, A., Gupta, A., & Girshick, R. B. (2016a). Training region-based object detectors with online hard example mining. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 761–769).

  • Shrivastava, A., Sukthankar, R., Malik, J., & Gupta, A. (2016b). Beyond skip connections: Top-down modulation for object detection. CoRR arXiv:abs/1612.06851.

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR arXiv:abs/1409.1556.

  • Sun, X., Wu, P., & Hoi, S. C. H. (2017). Face detection using deep learning: An improved faster RCNN approach. CoRR arXiv:abs/1701.08289.

  • Triantafyllidou, D., & Tefas, A. (2016). A fast deep convolutional neural network for face detection in big visual data. In INNS conference on big data (pp. 61–70).

  • Uijlings, J. R. R., van de Sande, K. E. A., Gevers, T., & Smeulders, A. W. M. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.

  • Viola, P. A., & Jones, M. J. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2), 137–154.

  • Wan, S., Chen, Z., Zhang, T., Zhang, B., & Wong, K. (2016). Bootstrapping face detection with hard negative examples. CoRR arXiv:abs/1608.02236.

  • Wang, H., Li, Z., Ji, X., & Wang, Y. (2017a). Face R-CNN. CoRR arXiv:abs/1706.01061.

  • Wang, J., Yuan, Y., & Yu, G. (2017b). Face attention network: An effective face detector for the occluded faces. CoRR arXiv:abs/1711.07246.

  • Wang, X., Shrivastava, A., & Gupta, A. (2017c). A-fast-rcnn: Hard positive generation via adversary for object detection. In Proceedings of IEEE conference on computer vision and pattern recognition.

  • Wang, X., Zhang, S., Lei, Z., Liu, S., Guo, X., & Li, S. Z. (2018). Ensemble soft-margin softmax loss for image classification. In IJCAI (pp. 992–998).

  • Wang, Y., Ji, X., Zhou, Z., Wang, H., & Li, Z. (2017d). Detecting faces using region-based fully convolutional networks. CoRR arXiv:abs/1709.05256.

  • Yan, J., Lei, Z., Wen, L., & Li, S. Z. (2014a). The fastest deformable part model for object detection. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 2497–2504).

  • Yan, J., Zhang, X., Lei, Z., & Li, S. Z. (2014b). Face detection by structural models. Image Vision Computing, 32(10), 790–799.

  • Yang, B., Yan, J., Lei, Z., & Li, S. Z. (2014). Aggregate channel features for multi-view face detection. In International joint conference on biometrics (pp. 1–8).

  • Yang, B., Yan, J., Lei, Z., & Li, S. Z. (2015a). Convolutional channel features. In Proceedings of IEEE international conference on computer vision (pp. 82–90).

  • Yang, S., Luo, P., Loy, C. C., & Tang, X. (2015b). From facial parts responses to face detection: A deep learning approach. In Proceedings of IEEE international conference on computer vision (pp. 3676–3684).

  • Yang, S., Luo, P., Loy, C. C., & Tang, X. (2016). WIDER FACE: A face detection benchmark. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 5525–5533).

  • Yang, S., Xiong, Y., Loy, C. C., & Tang, X. (2017). Face detection through scale-friendly deep convolutional networks. CoRR arXiv:abs/1706.02863.

  • Yu, J., Jiang, Y., Wang, Z., Cao, Z., & Huang, T. S. (2016). Unitbox: An advanced object detection network. In ACM conference on multimedia conference (pp. 516–520).

  • Zeng, X., Ouyang, W., Yang, B., Yan, J., & Wang, X. (2016). Gated bi-directional CNN for object detection. In Proceedings of European conference on computer vision (pp. 354–369).

  • Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499–1503.

  • Zhang, K., Zhang, Z., Wang, H., Li, Z., Qiao, Y., & Liu, W. (2017a). Detecting faces using inside cascaded contextual cnn. In ICCV.

  • Zhang, S., Wen, L., Bian, X., Lei, Z., & Li, S. Z. (2017b). Single-shot refinement neural network for object detection. CoRR arXiv:abs/1711.06897.

  • Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., & Li, S. Z. (2017c). Faceboxes: A CPU real-time face detector with high accuracy. In International joint conference on biometrics.

  • Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., & Li, S. Z. (2017d). S\(^{3}\)FD: Single shot scale-invariant face detector. In Proceedings of IEEE international conference on computer vision.

  • Zhu, C., Zheng, Y., Luu, K., & Savvides, M. (2016). CMS-RCNN: Contextual multi-scale region-based CNN for unconstrained face detection. CoRR arXiv:abs/1606.05413.

  • Zhu, X., & Ramanan, D. (2012). Face detection, pose estimation, and landmark localization in the wild. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 2879–2886).

  • Zhu, Y., Zhao, C., Wang, J., Zhao, X., Wu, Y., & Lu, H. (2017). Couplenet: Coupling global structure with local parts for object detection. In Proceedings of IEEE international conference on computer vision.

  • Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In Proceedings of European conference on computer vision (pp. 391–405).

Acknowledgements

This work was supported by the Chinese National Natural Science Foundation Projects #61876178, #61473291, #61806196, the National Key Research and Development Plan (Grant No. 2016YFC0801002), JD Grapevine Plan and AuthenMetric R&D Funds.

Author information

Corresponding author

Correspondence to Zhen Lei.

Additional information

Communicated by Xiaoou Tang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was done when Hailin Shi worked in CBSR, CASIA.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 24401 KB)

About this article

Cite this article

Zhang, S., Wen, L., Shi, H. et al. Single-Shot Scale-Aware Network for Real-Time Face Detection. Int J Comput Vis 127, 537–559 (2019). https://doi.org/10.1007/s11263-019-01159-3

  • DOI: https://doi.org/10.1007/s11263-019-01159-3
