Multimedia Tools and Applications, Volume 78, Issue 12, pp 16077–16096

Multiple human upper bodies detection via candidate-region convolutional neural network

  • Aichun Zhu
  • Tian Wang
  • Tong Qiao


Upper body detection in images is a challenging task in practical application scenarios and shares all the difficulties of object detection. This paper focuses on the problems of detecting multiple upper bodies, including the diversity of appearances, the variation in object scale, and frequent occlusion. To address these problems, we divide upper body detection into two stages, which together form a Candidate-Region Convolutional Neural Network (CR-CNN). In the upper body candidate generation stage, a deep hierarchical model is proposed. This model is built on a graphical model that contains an appearance model and a deformable model. The appearance model is built on the feature maps of a CNN, and the deformable model is defined for each pair of connected parts to compute the relative spatial information in the graphical model. In the upper body candidate refining stage, the detected bounding boxes serve as candidate regions and are refined in the CR-CNN. Moreover, multiple convolutional features are introduced into the CR-CNN to provide both local and contextual information. The proposed method is compared with state-of-the-art methods on the TV Human Interaction (TVHI) dataset and the HollywoodHeads dataset. The experimental results demonstrate the effectiveness of the proposed method.
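
The candidate generation stage combines an appearance term per part with a deformation term per pair of connected parts. The abstract does not give the exact formulation, so the following is only a minimal sketch of a generic part-based score of that kind, not the authors' implementation. The part names, the quadratic deformation penalty, the toy 3x3 score maps, and all function and variable names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]                      # (x, y) cell on a score map
Weights = Tuple[float, float, float, float]  # weights for dx, dx^2, dy, dy^2


def deformation_cost(parent: Point, child: Point,
                     anchor: Point, weights: Weights) -> float:
    """Penalty for the child part drifting from its expected offset (assumed quadratic)."""
    dx = child[0] - parent[0] - anchor[0]
    dy = child[1] - parent[1] - anchor[1]
    w_dx, w_dx2, w_dy, w_dy2 = weights
    return w_dx * dx + w_dx2 * dx * dx + w_dy * dy + w_dy2 * dy * dy


def configuration_score(
    appearance: Dict[str, List[List[float]]],
    locations: Dict[str, Point],
    edges: Dict[Tuple[str, str], Tuple[Point, Weights]],
) -> float:
    """Sum of per-part appearance scores minus pairwise deformation costs."""
    total = sum(appearance[part][y][x] for part, (x, y) in locations.items())
    for (parent, child), (anchor, weights) in edges.items():
        total -= deformation_cost(locations[parent], locations[child],
                                  anchor, weights)
    return total


# Hypothetical per-part score maps; in practice these would come from CNN feature maps.
appearance = {
    "head":      [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]],
    "shoulders": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.5, 0.0]],
}
# The shoulders are expected one cell below the head (anchor offset (0, 1)).
edges = {("head", "shoulders"): ((0, 1), (0.0, 0.5, 0.0, 0.5))}

aligned = configuration_score(
    appearance, {"head": (1, 1), "shoulders": (1, 2)}, edges)
shifted = configuration_score(
    appearance, {"head": (1, 1), "shoulders": (2, 2)}, edges)
print(aligned, shifted)  # 3.5 1.5
```

In this toy case the aligned configuration scores higher because both parts sit on their appearance peaks and the pair incurs no deformation cost. A detector of this kind would search over part locations for high-scoring configurations and pass the resulting boxes on as candidate regions.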


Upper body detection, Convolutional neural network, Candidate regions



This work is partially supported by the National Natural Science Foundation of China (61503017, 61702150) and the Aeronautical Science Foundation of China (2016ZC51022).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. The School of Computer Science and Technology, Nanjing Tech University, Nanjing Shi, China
  2. School of Automation Science and Electrical Engineering, Beihang University, Beijing, China
  3. School of Cyberspace, Hangzhou Dianzi University, Zhejiang Sheng, China
