
Spatio-temporal SRU with global context-aware attention for 3D human action recognition


Abstract

3D action recognition has attracted much attention in the machine learning field in recent years, and recurrent neural networks (RNNs) have been widely used for 3D action recognition because of their efficiency in processing sequential data. However, to achieve good performance, traditional RNN architectures are usually time-consuming in both training and inference. To address this problem, a global context-aware attention spatio-temporal SRU (GCA-ST-SRU) method is proposed and applied to 3D action recognition in this paper, extending the original simple recurrent unit (SRU) algorithm to the joint spatio-temporal domain with an attention mechanism. First, deep neural networks were employed to learn the features of the skeleton joints at each frame, and then these high-level feature sequences were classified with the GCA-ST-SRU method, which learns the spatio-temporal dependence between different joints in the same frame and pays more attention to informative joints. Extensive experiments were conducted on the UT-Kinect and SBU-Kinect Interaction datasets to evaluate the effectiveness of the proposed method. Compared with several existing algorithms, including SRU, long short-term memory (LSTM), spatio-temporal LSTM (ST-LSTM) and global context-aware attention LSTM (GCA-LSTM), our method exhibits better classification accuracy and computational efficiency. The experimental results demonstrate the effectiveness and practicality of our algorithm: compared to methods with similar performance, it reduces training time and improves inference speed, thus achieving a balance between speed and accuracy.


References

1. Ahmed F, Paul PP, Gavrilova ML (2016) Joint-triplet motion image and local binary pattern for 3D action recognition using Kinect. In: Proceedings of the 29th International Conference on Computer Animation and Social Agents. ACM, pp 111–119

2. Baradel F, Wolf C, Mille J (2017) Human action recognition: pose-based attention draws focus to hands. In: Proceedings of the IEEE International Conference on Computer Vision, pp 604–613

3. Chen C, Liu K, Kehtarnavaz N (2016) Real-time human action recognition based on depth motion maps. J Real-Time Image Proc 12(1):155–163

4. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078

5. De Mulder W, Bethard S, Moens MF (2015) A survey on the application of recurrent neural networks to statistical language modeling. Comput Speech Lang 30(1):61–98

6. Di Gangi MA, Federico M (2018) Deep neural machine translation with weakly-recurrent unit. arXiv preprint arXiv:1805.04185

7. Gal Y, Ghahramani Z (2016) A theoretically grounded application of dropout in recurrent neural networks. In: Advances in Neural Information Processing Systems, pp 1019–1027

8. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

9. Huang H, Wang H, Mak B (2019) Recurrent Poisson process unit for speech recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 6538–6545

10. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980

11. Kong Y, Fu Y (2018) Human action recognition and prediction: a survey. arXiv preprint arXiv:1806.11230

12. Koniusz P, Cherian A, Porikli F (2016) Tensor representations via kernel linearization for action recognition from 3D skeletons. In: European Conference on Computer Vision. Springer, Cham, pp 37–53

13. Lei T, Zhang Y, Wang SI, Dai H, Artzi Y (2018) Simple recurrent units for highly parallelizable recurrence. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp 4470–4481

14. Liu J, Shahroudy A, Xu D, Wang G (2016) Spatio-temporal LSTM with trust gates for 3D human action recognition. In: European Conference on Computer Vision. Springer, Cham, pp 816–833

15. Liu J, Wang G, Duan LY, Abdiyeva K, Kot AC (2017) Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Trans Image Process 27(4):1586–1599

16. Mabrouk AB, Zagrouba E (2018) Abnormal behavior recognition for intelligent video surveillance systems: a review. Expert Syst Appl 91(1):480–491

17. Park J, Boo Y, Choi I, Shin S, Sung W (2018) Fully neural network based speech recognition on mobile and embedded devices. In: Advances in Neural Information Processing Systems, pp 10620–10630

18. Park C, Lee C, Hong L, Hwang Y, Yoo T, Jang J, ... Kim HK (2019) S2-Net: machine reading comprehension with SRU-based self-matching networks. ETRI J 41(3):371–382

19. Presti LL, La Cascia M (2016) 3D skeleton-based human action classification: a survey. Pattern Recogn 53(5):130–147

20. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp 568–576

21. Slama R, Wannous H, Daoudi M et al (2015) Accurate 3D action recognition using learning on the Grassmann manifold. Pattern Recogn 48(2):556–567

22. Tamamori A, Hayashi T, Toda T, Takeda K (2018) Daily activity recognition based on recurrent neural network using multi-modal signals. APSIPA Transactions on Signal and Information Processing 7:E21. https://doi.org/10.1017/ATSIP.2018.25

23. Tang Y, Tian Y, Lu J, Li P, Zhou J (2018) Deep progressive reinforcement learning for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5323–5332

24. Vemulapalli R, Arrate F, Chellappa R (2014) Human action recognition by representing 3D skeletons as points in a Lie group. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 588–595

25. Wang D, Nyberg E (2015) A long short-term memory model for answer sentence selection in question answering. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing 2:707–712

26. Xi R, Li M, Hou M, Fu M, Qu H, Liu D, Haruna CR (2018) Deep dilation on multimodality time series for human activity recognition. IEEE Access 6(1):53381–53396

27. Xia L, Chen CC, Aggarwal JK (2012) View invariant human action recognition using histograms of 3D joints. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp 20–27

28. Yang Z, Bu L, Wang T, Ouyang J, Yuan P (2018) Fire alarm for video surveillance based on convolutional neural network and SRU. In: 2018 5th International Conference on Information Science and Control Engineering (ICISCE), pp 232–236

29. Yun K, Honorio J, Chattopadhyay D, Berg TL, Samaras D (2012) Two-person interaction detection using body-pose features and multiple instance learning. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp 28–35

30. Zhang S, Liu X, Xiao J (2017) On geometric features for skeleton-based action recognition using multilayer LSTM networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp 148–157

31. Zhang P, Xue J, Lan C, Zeng W, Gao Z, Zheng N (2018) Adding attentiveness to the neurons in recurrent neural networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 135–151

32. Zheng Z, An G, Ruan Q (2017) Multi-level recurrent residual networks for action recognition. arXiv preprint arXiv:1711.08238

33. Zhu Y, Chen W, Guo G (2013) Fusing spatiotemporal features and joints for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 486–491

34. Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, Xie X (2016) Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: Thirtieth AAAI Conference on Artificial Intelligence


Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 61871427. The authors would like to acknowledge the UT-Kinect and SBU-Kinect Interaction datasets, which were used to test the algorithms proposed in this study.

Author information

Corresponding author

Correspondence to Qingshan She.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1. Long short-term memory network (LSTM)

The LSTM network is a popular model for processing sequential data and performs well at modeling long-term temporal dependencies. The computation of the LSTM network is unrolled into as many steps as the length of the sequence, and the computation at each step can be formulated as follows:

$$ \left(\begin{array}{c} i_t \\ f_t \\ o_t \\ u_t \end{array}\right)=\left(\begin{array}{c} \sigma \\ \sigma \\ \sigma \\ \tanh \end{array}\right)\left(M\left(\begin{array}{c} x_t \\ h_{t-1} \end{array}\right)\right), $$
(18)
$$ c_t = i_t \odot u_t + f_t \odot c_{t-1}, $$
(19)
$$ h_t = o_t \odot \tanh\left(c_t\right), $$
(20)

where \( M \) is a model parameter matrix, \( x_t \) is the input at the \( t \)th step, \( h_{t-1} \) is the output state at the \( (t-1) \)th step, and \( \sigma \) denotes the sigmoid activation function. The detailed process is as follows. The input gate \( i_t \), forget gate \( f_t \), output gate \( o_t \) and input information \( u_t \) are first calculated according to Eq. (18); the elements of the three gates lie between 0 and 1, while the elements of \( u_t \) lie between −1 and 1. Then, the internal state \( c_t \) is updated according to Eq. (19), where \( \odot \) denotes the element-wise product, \( i_t \) weights \( u_t \), and \( f_t \) weights the previous internal state \( c_{t-1} \). Finally, the output state \( h_t \) is obtained by Eq. (20).
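As a concrete illustration, the following minimal NumPy sketch performs one LSTM step following Eqs. (18)–(20). The function name lstm_step, the layout of the stacked parameter matrix and the omission of bias terms are illustrative assumptions, not details taken from the original implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, M, d):
    # M has shape (4d, input_dim + d): one matrix product gives the
    # pre-activations of all four blocks in Eq. (18).
    z = M @ np.concatenate([x_t, h_prev])
    i_t = sigmoid(z[0:d])            # input gate
    f_t = sigmoid(z[d:2 * d])        # forget gate
    o_t = sigmoid(z[2 * d:3 * d])    # output gate
    u_t = np.tanh(z[3 * d:4 * d])    # input information
    c_t = i_t * u_t + f_t * c_prev   # Eq. (19), internal state
    h_t = o_t * np.tanh(c_t)         # Eq. (20), output state
    return h_t, c_t

The single stacked matrix M mirrors the block form of Eq. (18); in practice a bias vector would normally be added before the activations.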

Appendix 2. Spatio-temporal long short-term memory network (ST-LSTM)

The ST-LSTM is a spatial extension of LSTM that captures the dependence of joints in the temporal and spatial domains simultaneously. The ST-LSTM transition equations are formulated as follows:

$$ \left(\begin{array}{c} i_{j,t} \\ f_{j,t}^S \\ f_{j,t}^T \\ o_{j,t} \\ u_{j,t} \end{array}\right)=\left(\begin{array}{c} \sigma \\ \sigma \\ \sigma \\ \sigma \\ \tanh \end{array}\right)\left(M\left(\begin{array}{c} x_{j,t} \\ h_{j-1,t} \\ h_{j,t-1} \end{array}\right)\right), $$
(21)
$$ c_{j,t} = i_{j,t} \odot u_{j,t} + f_{j,t}^S \odot c_{j-1,t} + f_{j,t}^T \odot c_{j,t-1}, $$
(22)
$$ h_{j,t} = o_{j,t} \odot \tanh\left(c_{j,t}\right), $$
(23)

where \( x_{j,t} \) is the input at the spatio-temporal step \( (j, t) \), corresponding to the \( j \)th joint in the \( t \)th frame. Eq. (21) extends \( f_t \) in Eq. (18) to \( f_{j,t}^S \) and \( f_{j,t}^T \), which are the forget gates in the spatial and temporal domains, respectively. Similarly, in Eq. (22) the computation of the internal state \( c_{j,t} \) depends on \( c_{j-1,t} \) from the previous spatial step (the preceding joint in the same frame) and on \( c_{j,t-1} \) from the previous temporal step (the same joint in the preceding frame).
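For clarity, a minimal sketch of one ST-LSTM step at position \( (j, t) \) is given below; the function name st_lstm_step, the stacked layout of M and the omission of bias terms are again illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def st_lstm_step(x_jt, h_prev_joint, h_prev_frame,
                 c_prev_joint, c_prev_frame, M, d):
    # h_prev_joint / c_prev_joint come from joint j-1 in the same frame
    # (spatial direction); h_prev_frame / c_prev_frame come from the same
    # joint in frame t-1 (temporal direction). M has shape (5d, input_dim + 2d).
    z = M @ np.concatenate([x_jt, h_prev_joint, h_prev_frame])  # Eq. (21)
    i   = sigmoid(z[0:d])            # input gate
    f_s = sigmoid(z[d:2 * d])        # spatial forget gate
    f_t = sigmoid(z[2 * d:3 * d])    # temporal forget gate
    o   = sigmoid(z[3 * d:4 * d])    # output gate
    u   = np.tanh(z[4 * d:5 * d])    # input information
    c = i * u + f_s * c_prev_joint + f_t * c_prev_frame          # Eq. (22)
    h = o * np.tanh(c)                                           # Eq. (23)
    return h, c

In a full network this step would be applied by iterating over joints within a frame and over frames within a sequence, passing the spatial and temporal states along their respective directions.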

Appendix 3. Simple recurrent unit (SRU)

The SRU is a recurrent network like LSTM and GRU, but the majority of the computation at each step is independent of the recurrence and can be easily parallelized [13]. The transition equations are formulated as follows:

$$ \tilde{x}_t = W x_t, $$
(24)
$$ f_t = \sigma\left(W_f x_t + b_f\right), $$
(25)
$$ r_t = \sigma\left(W_r x_t + b_r\right), $$
(26)
$$ c_t = f_t \odot c_{t-1} + \left(1 - f_t\right) \odot \tilde{x}_t, $$
(27)
$$ h_t = r_t \odot \tanh\left(c_t\right) + \left(1 - r_t\right) \odot x_t, $$
(28)

where \( x_t \) denotes the input at time \( t \), \( \tilde{x}_t \) is a linear transformation of \( x_t \), \( f_t \) is the forget gate, \( r_t \) is the reset gate, \( c_t \) denotes the internal state, and \( h_t \) denotes the output state. Unlike LSTM, the computations of \( \tilde{x}_t \), \( f_t \) and \( r_t \) in SRU do not depend on the internal state \( c_{t-1} \) at the previous time step, so these computations can be parallelized to improve inference speed.
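A minimal per-step sketch of Eqs. (24)–(28) is given below. The function name sru_step and the assumption that the input and hidden dimensions are equal (so the highway term \( (1 - r_t) \odot x_t \) in Eq. (28) is well defined without an extra projection) are illustrative choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_step(x_t, c_prev, W, W_f, b_f, W_r, b_r):
    # The three projections below depend only on x_t, not on c_prev.
    x_tilde = W @ x_t                               # Eq. (24)
    f_t = sigmoid(W_f @ x_t + b_f)                  # Eq. (25), forget gate
    r_t = sigmoid(W_r @ x_t + b_r)                  # Eq. (26), reset gate
    # Only these element-wise updates are truly recurrent.
    c_t = f_t * c_prev + (1.0 - f_t) * x_tilde      # Eq. (27), internal state
    h_t = r_t * np.tanh(c_t) + (1.0 - r_t) * x_t    # Eq. (28), output state
    return h_t, c_t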

The difference between the computation processes of LSTM and SRU is shown in Fig. 6. For LSTM, the symbols in each rectangle represent the variables that must be computed at each time step \( t = 1, 2, \dots, n \). The computation of \( h_t \) cannot start until the previous time step has completed, and these sequential dependencies limit the inference speed of the LSTM. For SRU, the computations of \( \tilde{x}_t \), \( f_t \) and \( r_t \) depend only on the input \( x_t \) at each time step, so they are independent of the recurrence, as shown in the dotted-line box; the remaining recurrent computations of \( c_t \) and \( h_t \) involve only element-wise products and are therefore lightweight and fast. Parallelizing the majority of the per-step computation significantly improves the inference speed of SRU.
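The following sketch makes this scheduling explicit for a whole sequence: the projections of Eqs. (24)–(26) are computed for all time steps at once with batched matrix products (the parallel part in Fig. 6b), and only the element-wise updates of Eqs. (27)–(28) remain sequential. The function name sru_forward and the single-sequence formulation are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward(X, W, W_f, b_f, W_r, b_r):
    # X has shape (T, d); W, W_f, W_r have shape (d, d).
    # Parallel part: all three projections for every time step at once.
    X_tilde = X @ W.T                    # Eq. (24) for all t
    F = sigmoid(X @ W_f.T + b_f)         # Eq. (25) for all t
    R = sigmoid(X @ W_r.T + b_r)         # Eq. (26) for all t

    T, d = X.shape
    c = np.zeros(d)
    H = np.empty((T, d))
    # Sequential part: only cheap element-wise operations remain.
    for t in range(T):
        c = F[t] * c + (1.0 - F[t]) * X_tilde[t]          # Eq. (27)
        H[t] = R[t] * np.tanh(c) + (1.0 - R[t]) * X[t]    # Eq. (28)
    return H

Because the matrix multiplications dominate the cost and are hoisted out of the loop, the sequential portion of the forward pass is small, which is the source of the speed advantage discussed above.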

Fig. 6

Illustration of the compute process of LSTM and SRU. a Recurrent computation of LSTM, b Recurrent computation of SRU

About this article


Cite this article

She, Q., Mu, G., Gan, H. et al. Spatio-temporal SRU with global context-aware attention for 3D human action recognition. Multimed Tools Appl 79, 12349–12371 (2020). https://doi.org/10.1007/s11042-019-08587-w
