Abstract
Two-stream Convolutional Networks (ConvNets) have achieved great success in video action recognition, and research shows that early fusion of the two streams can further boost performance. Existing fusion methods, however, operate on short snippets and thus fail to learn global representations of videos. We introduce a Convolutional Gated Recurrent Units (ConvGRU) fusion method that models long-term dependencies within actions. It combines the strengths of Recurrent Neural Network (RNN) models, which handle long-term dependencies in sequence modeling well, and of early fusion architectures, which learn the joint evolution of appearance and motion features. We further propose an end-to-end architecture based on this fusion method and evaluate it on the widely used UCF101 action recognition dataset. Investigating different input lengths and fusion layers, we find that fusing at the last convolutional layer with an input length of 10 entries yields the best accuracy (93.0%), which is comparable to the state of the art.
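The ConvGRU fusion described above replaces the fully connected transforms of a standard GRU with convolutions, so the hidden state remains a spatial feature map that accumulates information across frames. The sketch below is a minimal single-channel NumPy illustration under stated assumptions: the 3x3 kernels, the single channel, and the elementwise addition used to fuse the appearance and motion maps are simplifications for clarity, not the authors' exact architecture.

```python
import numpy as np


def conv2d_same(x, k):
    """2-D cross-correlation with zero padding; output has the same shape as x."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def convgru_step(x, h, p):
    """One ConvGRU update: the usual GRU gates, computed with convolutions."""
    z = sigmoid(conv2d_same(x, p["Wz"]) + conv2d_same(h, p["Uz"]))  # update gate
    r = sigmoid(conv2d_same(x, p["Wr"]) + conv2d_same(h, p["Ur"]))  # reset gate
    h_tilde = np.tanh(conv2d_same(x, p["W"]) + conv2d_same(r * h, p["U"]))  # candidate
    return (1.0 - z) * h + z * h_tilde  # blend old state and candidate


rng = np.random.default_rng(0)
# Illustrative random 3x3 kernels; in the paper these are learned end-to-end.
p = {k: rng.normal(scale=0.1, size=(3, 3))
     for k in ("Wz", "Uz", "Wr", "Ur", "W", "U")}

# 10 input entries of 8x8 feature maps; fuse the two streams by addition here.
appearance = rng.normal(size=(10, 8, 8))
motion = rng.normal(size=(10, 8, 8))

h = np.zeros((8, 8))  # hidden state is itself a spatial feature map
for t in range(10):
    h = convgru_step(appearance[t] + motion[t], h, p)
# h now summarizes the whole 10-entry input as one feature map
```

Because the candidate state passes through tanh and the update gate stays in (0, 1), the hidden map remains bounded while still integrating evidence from every timestep, which is what lets the recurrence capture longer-range structure than snippet-level fusion.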
Acknowledgement
This paper is supported by NSFC (No. 61772330, 61272247, 61533012, 61472075), the 863 National High Technology Research and Development Program of China (SS2015AA020501), the Basic Research Project of Innovation Action Plan (16JC1402800) and the Major Basic Research Program (15JC1400103) of Shanghai Science and Technology Committee.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Huang, B., Huang, H., Lu, H. (2017). Convolutional Gated Recurrent Units Fusion for Video Action Recognition. In: Liu, D., Xie, S., Li, Y., Zhao, D., El-Alfy, ES. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science(), vol 10636. Springer, Cham. https://doi.org/10.1007/978-3-319-70090-8_12
DOI: https://doi.org/10.1007/978-3-319-70090-8_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70089-2
Online ISBN: 978-3-319-70090-8