Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

Research Article - Computer Engineering and Computer Science

Abstract

Body pose is an important indicator of human actions. Existing pose-based action recognition approaches are typically designed for a single human body and require a fixed-size (e.g., \(13\times 2\)) input vector. This requirement is restrictive and may degrade recognition accuracy, particularly for real-world videos, in which scenes with multiple people or partially visible bodies are common. Inspired by the recent success of convolutional neural networks (CNNs) in various computer vision tasks, in this work we propose a deep neural network architecture for 2D pose-based action recognition. To this end, we design a human pose encoding scheme that eliminates the above requirement and provides a general representation of 2D human body joints, which can be used directly as CNN input. We also propose a weighted fusion scheme that integrates RGB and optical flow with human pose features to perform action classification. We evaluate our approach on two real-world datasets and achieve better performance than state-of-the-art approaches.
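
The abstract does not spell out the encoding or the fusion, so the following Python sketch is only one plausible reading: joints are rasterised into per-joint-type Gaussian heatmaps, yielding a fixed-size CNN input regardless of how many people or joints are visible, and per-class scores from the pose, RGB, and flow streams are combined by weighted late fusion. All names, the 13-joint vocabulary, and the fusion weights are illustrative assumptions, not the authors' actual design.

    import numpy as np

    def encode_pose_heatmap(joints, size=56, sigma=2.0, num_types=13):
        """Rasterise 2D joints into a fixed-size stack of heatmaps,
        one channel per joint type (a sketch, not the paper's scheme).

        joints: iterable of (joint_type, x, y), with x, y in [0, 1].
        Missing joints are simply absent, so partially visible bodies
        and multi-person scenes need no padding or truncation.
        """
        heatmap = np.zeros((num_types, size, size), dtype=np.float32)
        ys, xs = np.mgrid[0:size, 0:size]
        for t, x, y in joints:
            cx, cy = x * (size - 1), y * (size - 1)
            g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
            # max (not sum) keeps responses bounded when people overlap
            heatmap[t] = np.maximum(heatmap[t], g)
        return heatmap

    def fuse_scores(pose, rgb, flow, w=(0.4, 0.3, 0.3)):
        """Weighted late fusion of per-class softmax scores from three streams."""
        fused = (w[0] * np.asarray(pose) + w[1] * np.asarray(rgb)
                 + w[2] * np.asarray(flow))
        return int(np.argmax(fused))

    # Example: two people, four visible joints in total
    pose_input = encode_pose_heatmap([(0, 0.30, 0.20), (5, 0.35, 0.60),
                                      (0, 0.70, 0.25), (5, 0.72, 0.65)])
    assert pose_input.shape == (13, 56, 56)

Rendering joints as heatmaps rather than a flat \(13\times 2\) coordinate vector is what removes the fixed-size requirement; the fusion weights here are fixed by hand but could equally be selected by cross-validation, as is common in two-stream methods.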

Keywords

Pose-based action recognition · Convolutional neural networks · Human pose encoding scheme · 2D human pose · RGB videos

Acknowledgements

The research is supported in part by NSFC (61572424) and the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme FP7 (2007–2013) under REA Grant Agreement No. 612627 "AniNex". Min Tang is supported in part by NSFC (61572423) and Zhejiang Provincial NSFC (LZ16F020003).


Copyright information

© King Fahd University of Petroleum & Minerals 2018

Authors and Affiliations

  1. State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China
