Multimedia Tools and Applications, Volume 77, Issue 24, pp 32275–32285

Time-varying LSTM networks for action recognition

  • Zichao Ma
  • Zhixin Sun


Abstract

We describe an architecture of Time-Varying Long Short-Term Memory recurrent neural networks (TV-LSTMs) for human action recognition. The main innovation of this architecture is its use of hybrid weights, a combination of shared and non-shared weights that we refer to as varying weights. Varying weights enhance the ability of LSTMs to represent videos and other sequential data. We evaluate TV-LSTMs on the UCF-11, HMDB-51, and UCF-101 human action datasets, achieving top-1 accuracies of 99.64%, 57.52%, and 85.06%, respectively. The model performs competitively against models that use both RGB and other features, such as optical flow and improved Dense Trajectories. We also propose and analyze methods for selecting varying weights.
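The abstract's central idea, mixing shared weights with timestep-specific (non-shared) weights inside an LSTM cell, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the class name, the additive mixing of a shared matrix with a small per-timestep correction, and all shapes and scales are assumptions introduced here for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TimeVaryingLSTMCell:
    """Sketch of an LSTM cell whose gate weights combine a shared
    component (used at every timestep) with a non-shared, per-timestep
    component -- the "varying weights" idea, under assumed details."""

    def __init__(self, input_size, hidden_size, num_steps, seed=0):
        rng = np.random.default_rng(seed)
        h, d = hidden_size, input_size
        # Shared weights: one set of gate parameters for all timesteps.
        self.W_shared = rng.standard_normal((4 * h, d + h)) * 0.1
        # Varying weights: a small non-shared correction per timestep
        # (one assumed way to realize "hybrid" weights).
        self.W_vary = rng.standard_normal((num_steps, 4 * h, d + h)) * 0.01
        self.b = np.zeros(4 * h)
        self.h_size = h

    def step(self, x_t, h_prev, c_prev, t):
        # Hybrid weight at step t = shared part + timestep-specific part.
        W_t = self.W_shared + self.W_vary[t]
        z = W_t @ np.concatenate([x_t, h_prev]) + self.b
        h = self.h_size
        i, f, g, o = z[:h], z[h:2*h], z[2*h:3*h], z[3*h:]
        # Standard LSTM gate equations (Hochreiter & Schmidhuber).
        c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
        return sigmoid(o) * np.tanh(c), c

    def forward(self, xs):
        # xs: array of shape (num_steps, input_size); returns final hidden state.
        h = np.zeros(self.h_size)
        c = np.zeros(self.h_size)
        for t, x_t in enumerate(xs):
            h, c = self.step(x_t, h, c, t)
        return h
```

In this sketch the shared matrix carries most of the capacity while the per-timestep corrections let the cell adapt its transformation across the sequence; the paper's actual scheme for choosing which weights vary may differ.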


Keywords: RNNs, CNNs, LSTMs, TV-LSTMs, Action recognition



This work was supported by the National Natural Science Foundation of China (No. 61672299). We would like to thank Songle Chen for his valuable advice.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Nanjing University of Posts and Telecommunications, Nanjing, China
  2. Key Lab of Broadband Wireless Communication and Sensor Network Technology, Ministry of Education, Nanjing, China
