Abstract
Upper body detection in images is a challenging task in practical application scenarios and shares all the difficulties of general object detection. This paper focuses on the problems that arise when detecting multiple upper bodies: diverse appearances, widely varying object scales, and frequent occlusions. To address these problems, we divide upper body detection into two stages, which together form a Candidate-Region Convolutional Neural Network (CR-CNN). In the candidate generation stage, we propose a deep hierarchical model built on a graphical model that contains an appearance model and a deformable model. The appearance model is built on the feature maps of a CNN, and the deformable model is defined for each pair of connected parts to capture their relative spatial information in the graphical model. In the candidate refining stage, the detected bounding boxes serve as candidate regions and are refined in the CR-CNN. Moreover, multiple convolutional features are introduced into the CR-CNN to provide both local and contextual information. The proposed method is compared with state-of-the-art methods on the TV Human Interaction (TVHI) dataset and the HollywoodHeads dataset. The experimental results demonstrate the effectiveness of the proposed method.
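To make the candidate generation stage more concrete, the following is a minimal sketch of how a part-based graphical model can combine an appearance term with a pairwise deformation term. It assumes the standard pictorial-structures form (unary appearance scores minus a quadratic penalty on the offset between connected parts); the abstract does not give the exact formulation, so the function name `configuration_score`, the part names `head` and `shoulders`, the anchor offsets, and the weights are all hypothetical and chosen only for illustration.

```python
import numpy as np


def configuration_score(appearance_maps, locations, edges, deform_weights):
    """Score one placement of parts in a tree-structured part model.

    appearance_maps: dict part -> 2D array of per-location appearance scores
                     (assumed here to be derived from CNN feature maps).
    locations:       dict part -> (row, col) position chosen for that part.
    edges:           list of (parent, child, (anchor_row, anchor_col)), where
                     the anchor is the assumed ideal child-minus-parent offset.
    deform_weights:  dict (parent, child) -> (w_row, w_col), non-negative
                     weights of the quadratic deformation penalty.
    """
    score = 0.0

    # Appearance term: how well each part matches at its chosen location.
    for part, (r, c) in locations.items():
        score += appearance_maps[part][r, c]

    # Deformation term: penalize deviation from the ideal relative offset
    # for every pair of connected parts.
    for parent, child, (anchor_r, anchor_c) in edges:
        pr, pc = locations[parent]
        cr, cc = locations[child]
        dr = (cr - pr) - anchor_r
        dc = (cc - pc) - anchor_c
        w_r, w_c = deform_weights[(parent, child)]
        score -= w_r * dr ** 2 + w_c * dc ** 2

    return float(score)


# Toy example with two hypothetical parts on a 5x5 grid.
head_map = np.zeros((5, 5))
head_map[1, 2] = 2.0
shoulders_map = np.zeros((5, 5))
shoulders_map[3, 2] = 1.5

maps = {"head": head_map, "shoulders": shoulders_map}
edges = [("head", "shoulders", (2, 0))]  # shoulders expected 2 rows below head
weights = {("head", "shoulders"): (0.5, 0.5)}

aligned = configuration_score(
    maps, {"head": (1, 2), "shoulders": (3, 2)}, edges, weights
)
shifted = configuration_score(
    maps, {"head": (1, 2), "shoulders": (3, 3)}, edges, weights
)
print(aligned, shifted)
```

In this toy setting the aligned placement scores higher than the shifted one, because the shifted placement both loses the shoulders appearance score and pays a deformation penalty. A full detector would search over all locations (typically with dynamic programming on the tree) instead of scoring a single placement.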
Acknowledgements
This work is partially supported by the National Natural Science Foundation of China (61503017, 61702150) and the Aeronautical Science Foundation of China (2016ZC51022).
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zhu, A., Wang, T. & Qiao, T. Multiple human upper bodies detection via candidate-region convolutional neural network. Multimed Tools Appl 78, 16077–16096 (2019). https://doi.org/10.1007/s11042-018-6964-7
DOI: https://doi.org/10.1007/s11042-018-6964-7