Abstract
Vision-based person, hand or face detection approaches have achieved incredible success in recent years with the development of deep convolutional neural network (CNN). In this paper, we take the inherent correlation between the body and body parts into account and propose a new framework to boost up the detection performance of the multi-level objects. In particular, we adopt region-based object detection structure with two carefully designed detectors to separately pay attention to the human body and body parts in a coarse-to-fine manner, which we call Detector-in-Detector network (DID-Net). The first detector is designed to detect human body, hand and face. The second detector, based on the body detection results of the first detector, mainly focus on detection of small hand and face inside each body. The framework is trained in an end-to-end way by optimizing a multi-task loss. Due to the lack of human body, face and hand detection dataset, we have collected and labeled a new large dataset named Human-Parts with 14,962 images and 106,879 annotations. Experiments show that our method can achieve excellent performance on Human-Parts.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Ribeiro, D., Mateus, A., Nascimento, J.C., Miraldo, P.: A real-time pedestrian detector using deep learning for human-aware navigation (2016)
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation (2017)
Papandreou, G., et al.: Towards accurate multi-person pose estimation in the wild, pp. 3711–3719 (2017)
Xiao, S., et al.: Recurrent 3D–2D dual learning for large-pose facial landmark detection. In: IEEE International Conference on Computer Vision, pp. 1642–1651 (2017)
Dhawan, A., Honrao, V.: Implementation of hand detection based techniques for human computer interaction. Comput. Sci. 72, 6–13 (2013)
Ghorban, F., Marín, J., Yu, S., Colombo, A., Kummert, A.: Aggregated channels network for real-time pedestrian detection (2018)
Samangouei, P., Najibi, M., Davis, L., Chellappa, R.: Face-MagNet: magnifying feature maps to detect small faces (2018)
Zhu, C., Zheng, Y., Luu, K., Savvides, M.: CMS-RCNN: contextual multi-scale region-based CNN for unconstrained face detection (2017)
Deng, X., et al.: Joint hand detection and rotation estimation by using CNN. IEEE Trans. Image Process. 27 (2016)
Le, T.H.N., Quach, K.G., Zhu, C., Chi, N.D., Luu, K., Savvides, M.: Robust hand detection and classification in vehicles and in the wild. In: Computer Vision and Pattern Recognition Workshops, pp. 1203–1210 (2017)
Mittal, A., Zisserman, A., Torr, P.: Hand detection using multiple proposals. In: British Machine Vision Conference, pp. 75.1–75.11 (2011)
Zhao, K., Zhang, W., Jiang, Y.: Semantic interactions in multi-level objects segmentation. In: International Conference on Computational and Information Sciences, pp. 665–668 (2010)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Girshick, R.: Fast R-CNN. Comput. Sci. (2015)
Liu, W., et al.: SSD: single shot multibox detector, pp. 21–37 (2015)
Li, Z., Zhou, F.: FSSD: feature fusion single shot multibox detector (2017)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection, pp. 779–788 (2015)
Jiang, H., Learnedmiller, E.: Face detection with the faster R-CNN, pp. 650–657 (2016)
He, K., Fu, Y., Xue, X.: A jointly learned deep architecture for facial attribute analysis and face detection in the wild (2017)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: International Conference on Neural Information Processing Systems, pp. 91–99 (2015)
Hu, P., Ramanan, D.: Finding tiny faces (2016)
Najibi, M., Samangouei, P., Chellappa, R., Davis, L.S.: SSH: single stage headless face detector, pp. 4885–4894 (2017)
Li, H., Lin, Z., Shen, X., Brandt, J., Hua, G.: A convolutional neural network cascade for face detection. In: Computer Vision and Pattern Recognition, pp. 5325–5334 (2015)
Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Sign. Process. Lett. 23, 1499–1503 (2016)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN (2017)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Li, F.F.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 248–255 (2009)
Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks (2016)
Girshick, R., Iandola, F., Darrell, T., Malik, J.: Deformable part models are convolutional neural networks, pp. 437–446 (2014)
Dollar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: a benchmark. In: Proceedings Conference on Computer Vision Pattern Recognition, pp. 304–311 (2009)
Zhang, S., Benenson, R., Schiele, B.: CityPersons: a diverse dataset for pedestrian detection (2017)
Bambach, S., Lee, S., Crandall, D.J., Yu, C.: Lending a hand: detecting hands and recognizing activities in complex egocentric interactions. In: IEEE International Conference on Computer Vision (2016)
Jain, V., Learned-Miller, E.: FDDB: a benchmark for face detection in unconstrained settings (2010)
Yang, S., Luo, P., Chen, C.L., Tang, X.: WIDER FACE: a face detection benchmark. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525–5533 (2016)
Wu, J., et al.: AI challenger: a large-scale dataset for going deeper in image understanding (2017)
Everingham, M., Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88, 303–338 (2010)
Qin, H., Yan, J., Li, X., Hu, X.: Joint training of cascaded CNN for face detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3456–3465 (2016)
Yang, S., Luo, P., Loy, C.C., Tang, X.: Faceness-Net: face detection through deep facial part responses. IEEE Trans. Pattern Anal. Mach. Intell. PP, 1 (2017)
Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining, pp. 761–769 (2016)
Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection, pp. 936–944 (2016)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Li, X., Yang, L., Song, Q., Zhou, F. (2019). Detector-in-Detector: Multi-level Analysis for Human-Parts. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science(), vol 11362. Springer, Cham. https://doi.org/10.1007/978-3-030-20890-5_15
Download citation
DOI: https://doi.org/10.1007/978-3-030-20890-5_15
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20889-9
Online ISBN: 978-3-030-20890-5
eBook Packages: Computer ScienceComputer Science (R0)