Spatial-Temporal Attention Res-TCN for Skeleton-Based Dynamic Hand Gesture Recognition

  • Jingxuan Hou
  • Guijin Wang (Email author)
  • Xinghao Chen
  • Jing-Hao Xue
  • Rui Zhu
  • Huazhong Yang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11134)


Dynamic hand gesture recognition is a crucial yet challenging task in computer vision. The key to this task lies in effectively extracting discriminative spatial and temporal features that model how different gestures evolve. In this paper, we propose an end-to-end Spatial-Temporal Attention Residual Temporal Convolutional Network (STA-Res-TCN) for skeleton-based dynamic hand gesture recognition, which learns different levels of attention and assigns them to each spatial-temporal feature extracted by the convolution filters at each time step. The proposed attention branch helps the network adaptively focus on the informative time frames and features while excluding the irrelevant ones, which often introduce unnecessary noise. Moreover, the proposed STA-Res-TCN is a lightweight model that can be trained and tested in an extremely short time. Experiments on the DHG-14/28 dataset and the SHREC’17 Track dataset show that STA-Res-TCN outperforms state-of-the-art methods on both the 14-gesture setting and the more complicated 28-gesture setting.
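The attention idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the form of the mask, so this sketch assumes a sigmoid mask produced by a second temporal convolution and applied element-wise to the main-branch features. The function names (`temporal_conv`, `sta_block`, `rand_weights`), the toy dimensions, and the random weights are all hypothetical, and the residual connections of the full model are omitted for brevity.

```python
import math
import random


def temporal_conv(x, weights, bias):
    """Zero-padded ('same') 1D convolution over the time axis.

    x: T x C_in nested list, weights: K x C_in x C_out, bias: C_out.
    Returns a T x C_out nested list.
    """
    T, K = len(x), len(weights)
    c_in, c_out = len(weights[0]), len(bias)
    pad = K // 2
    out = []
    for t in range(T):
        row = list(bias)
        for k in range(K):
            s = t + k - pad
            if 0 <= s < T:  # frames outside the sequence count as zeros
                for i in range(c_in):
                    for o in range(c_out):
                        row[o] += x[s][i] * weights[k][i][o]
        out.append(row)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def sta_block(x, w_main, b_main, w_att, b_att):
    """Hypothetical attention-gated block (an assumption, not the paper's exact layer).

    The main branch extracts features with a temporal convolution and ReLU.
    The attention branch predicts one weight in (0, 1) per feature per time
    step, and the two are multiplied element-wise.
    """
    feats = [[max(0.0, v) for v in row] for row in temporal_conv(x, w_main, b_main)]
    mask = [[sigmoid(v) for v in row] for row in temporal_conv(x, w_att, b_att)]
    gated = [[f * m for f, m in zip(f_row, m_row)] for f_row, m_row in zip(feats, mask)]
    return gated, mask


def rand_weights(k, c_in, c_out, rng):
    """Random K x C_in x C_out kernel, standing in for learned weights."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(c_out)]
             for _ in range(c_in)] for _ in range(k)]


# Toy sizes (assumed): 8 frames, 6 input coordinates, 4 filters, kernel size 3.
rng = random.Random(0)
T, C_IN, C_OUT, K = 8, 6, 4, 3
x = [[rng.uniform(-1.0, 1.0) for _ in range(C_IN)] for _ in range(T)]
out, mask = sta_block(
    x,
    rand_weights(K, C_IN, C_OUT, rng), [0.0] * C_OUT,
    rand_weights(K, C_IN, C_OUT, rng), [0.0] * C_OUT,
)
```

Because the mask has the same shape as the feature map (time steps by filters), a single mask can down-weight both uninformative frames and uninformative feature channels, which is the behaviour the abstract attributes to the attention branch.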


Dynamic hand gesture recognition; Spatial-temporal attention; Temporal convolutional networks



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Jingxuan Hou (1)
  • Guijin Wang (1, Email author)
  • Xinghao Chen (1)
  • Jing-Hao Xue (2)
  • Rui Zhu (3)
  • Huazhong Yang (1)
  1. Tsinghua University, Beijing, China
  2. University College London, London, UK
  3. University of Kent, Kent, UK
