
Multimedia Tools and Applications, Volume 77, Issue 17, pp 22901–22921

3D skeleton based action recognition by video-domain translation-scale invariant mapping and multi-scale dilated CNN

  • Bo Li
  • Mingyi He
  • Yuchao Dai
  • Xuelian Cheng
  • Yucheng Chen
Article

Abstract

In this paper, we present an image classification approach to action recognition with 3D skeleton videos. First, we propose a video-domain translation-scale invariant image mapping, which transforms 3D skeleton videos into color images, which we call skeleton images. Second, we design a multi-scale dilated convolutional neural network (CNN) to classify the skeleton images. The multi-scale dilated CNN improves frequency adaptiveness and exploits the discriminative temporal-spatial cues in the skeleton images. Even though skeleton images are very different from natural images, we show that the fine-tuning strategy still works well. Furthermore, we propose several data augmentation strategies to improve the generalization and robustness of our method. Experimental results on popular benchmark datasets such as NTU RGB+D, UTD-MHAD, MSRC-12 and G3D demonstrate the superiority of our approach, which outperforms the state-of-the-art methods by a large margin.
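The abstract does not spell out the mapping, so the following is only a minimal sketch of how a video-domain translation-scale invariant mapping could look, not the authors' implementation. It assumes a clip stored as an array of shape (frames, joints, 3), a per-axis minimum and a single shared scale computed over the whole clip, and a layout with joints as rows, frames as columns and the (x, y, z) coordinates as the three color channels. The function name `skeleton_to_image` and all of these layout choices are hypothetical.

```python
import numpy as np

def skeleton_to_image(video):
    """Map a skeleton clip of shape (frames, joints, 3) to a uint8 color image.

    Assumed layout (not taken from the paper): rows are joints, columns are
    frames, and the x, y, z coordinates become the three color channels.
    """
    video = np.asarray(video, dtype=np.float64)
    # Video-domain statistics: minima are taken over all frames and joints,
    # so shifting the whole clip in space leaves the result unchanged.
    axis_min = video.min(axis=(0, 1))                  # shape (3,)
    # One shared scale for all three axes keeps the body proportions
    # and removes the dependence on the overall size of the skeleton.
    span = (video.max(axis=(0, 1)) - axis_min).max()
    if span == 0:
        span = 1.0                                     # degenerate static clip
    normalized = (video - axis_min) / span             # values in [0, 1]
    image = np.rint(255.0 * normalized).astype(np.uint8)
    return image.transpose(1, 0, 2)                    # (joints, frames, 3)
```

Under these assumptions, translating the clip or rescaling it uniformly changes `axis_min` and `span` but not the normalized values, so the resulting image is the same up to floating-point rounding. Because the statistics are computed once per clip rather than per frame, the relative motion between frames is preserved in the pixel intensities.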

Keywords

3D skeleton, CNN, Image mapping, Recognition

Notes

Acknowledgements

This work was supported in part by Natural Science Foundation of China grants (61420106007, 61671387) and an Australian Research Council grant (DE140100180).


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018
Corrected publication February 2018

Authors and Affiliations

  1. Northwestern Polytechnical University, Xi’an, China
