Multiple human upper bodies detection via candidate-region convolutional neural network

Multimedia Tools and Applications

Abstract

Upper body detection in images is a challenging task in practical application scenarios and shares all the difficulties of general object detection. This paper focuses on the problems that arise when detecting multiple upper bodies, including the diversity of appearances, the variation in object scale, and frequent occlusions. To address these problems, we divide upper body detection into two stages that together form a Candidate-Region Convolutional Neural Network (CR-CNN). In the upper body candidate generation stage, a deep hierarchical model is proposed. This model is built on a graphical model that contains an appearance model and a deformable model. The appearance model is built on the feature maps of a CNN, and the deformable model is defined on each pair of connected parts to compute the relative spatial information in the graphical model. In the upper body candidate refining stage, the detected bounding boxes serve as candidate regions and are refined in the CR-CNN. Moreover, multiple convolutional features are introduced into the CR-CNN to provide local and contextual information. The proposed method is compared with state-of-the-art methods on the TV Human Interaction (TVHI) dataset and the HollywoodHeads dataset. The experimental results demonstrate the effectiveness of the proposed method.
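
The candidate generation stage described above combines per-part appearance scores with pairwise spatial terms between connected parts. The sketch below illustrates that kind of scoring in the general pictorial-structures form. It is not the paper's formulation: the quadratic deformation cost, the part names, the toy score maps (which stand in for CNN feature responses), and the helper names `deformation_cost` and `configuration_score` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]                      # (x, y) on a score-map grid
Weights = Tuple[float, float, float, float]  # weights for dx, dx^2, dy, dy^2


def deformation_cost(parent: Point, child: Point,
                     anchor: Point, weights: Weights) -> float:
    """Cost of the child's displacement from its ideal offset to the parent."""
    dx = child[0] - parent[0] - anchor[0]
    dy = child[1] - parent[1] - anchor[1]
    w_dx, w_dx2, w_dy, w_dy2 = weights
    return w_dx * dx + w_dx2 * dx * dx + w_dy * dy + w_dy2 * dy * dy


def configuration_score(
    appearance: Dict[str, List[List[float]]],
    locations: Dict[str, Point],
    edges: Dict[Tuple[str, str], Tuple[Point, Weights]],
) -> float:
    """Sum of part appearance scores minus pairwise deformation costs."""
    # Appearance term: look up each part's score at its chosen location.
    score = sum(appearance[part][y][x] for part, (x, y) in locations.items())
    # Deformation term: one cost per pair of connected parts.
    for (parent, child), (anchor, weights) in edges.items():
        score -= deformation_cost(locations[parent], locations[child],
                                  anchor, weights)
    return score


# Hypothetical 3x3 score maps for two parts (rows are y, columns are x).
appearance = {
    "head":  [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    "torso": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 1.0]],
}
# The torso is expected two cells below the head; deviations are penalized.
edges = {("head", "torso"): ((0, 2), (0.0, 0.5, 0.0, 0.5))}

aligned = configuration_score(appearance, {"head": (1, 0), "torso": (1, 2)}, edges)
shifted = configuration_score(appearance, {"head": (1, 0), "torso": (2, 2)}, edges)
print(aligned, shifted)
```

In this toy setup the aligned configuration scores higher than the shifted one, both because the torso response is stronger at the expected position and because the shifted placement pays a deformation penalty. In a detector of this kind, the highest-scoring configurations would yield the bounding boxes passed on as candidate regions.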



Acknowledgements

This work is partially supported by the National Natural Science Foundation of China (61503017, 61702150) and the Aeronautical Science Foundation of China (2016ZC51022).

Author information

Corresponding authors

Correspondence to Aichun Zhu or Tian Wang.

About this article

Cite this article

Zhu, A., Wang, T. & Qiao, T. Multiple human upper bodies detection via candidate-region convolutional neural network. Multimed Tools Appl 78, 16077–16096 (2019). https://doi.org/10.1007/s11042-018-6964-7

  • DOI: https://doi.org/10.1007/s11042-018-6964-7
