Abstract
Two-stream Convolutional Networks (ConvNets) have achieved great success in video action recognition, and research shows that early fusion of the two streams can further boost performance. Existing fusion methods, however, operate on short snippets and thus fail to learn global representations of videos. We introduce a Convolutional Gated Recurrent Units (ConvGRU) fusion method that models long-term dependencies within actions. It combines the strengths of Recurrent Neural Network (RNN) models, which handle long-term dependencies in sequence modeling well, and of early fusion architectures, which learn the joint evolution of appearance and motion features. We further propose an end-to-end architecture based on this fusion method and evaluate it on the widely used UCF101 action recognition dataset. Investigating different input lengths and fusion layers, we find that fusing at the last convolutional layer with an input length of 10 entries yields the best accuracy (93.0%), which is comparable to the state of the art.
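The ConvGRU fusion described above replaces the fully connected transforms of a standard GRU with convolutions, so the hidden state remains a spatial feature map that accumulates information across frames. The sketch below is a minimal single-channel NumPy illustration under stated assumptions: the 3x3 kernels, the single channel, and the elementwise addition used to fuse the appearance and motion maps are simplifications for clarity, not the authors' exact architecture.

```python
import numpy as np


def conv2d_same(x, k):
    """2-D cross-correlation with zero padding; output has the same shape as x."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def convgru_step(x, h, p):
    """One ConvGRU update: the usual GRU gates, computed with convolutions."""
    z = sigmoid(conv2d_same(x, p["Wz"]) + conv2d_same(h, p["Uz"]))  # update gate
    r = sigmoid(conv2d_same(x, p["Wr"]) + conv2d_same(h, p["Ur"]))  # reset gate
    h_tilde = np.tanh(conv2d_same(x, p["W"]) + conv2d_same(r * h, p["U"]))  # candidate
    return (1.0 - z) * h + z * h_tilde  # blend old state and candidate


rng = np.random.default_rng(0)
# Illustrative random 3x3 kernels; in the paper these are learned end-to-end.
p = {k: rng.normal(scale=0.1, size=(3, 3))
     for k in ("Wz", "Uz", "Wr", "Ur", "W", "U")}

# 10 input entries of 8x8 feature maps; fuse the two streams by addition here.
appearance = rng.normal(size=(10, 8, 8))
motion = rng.normal(size=(10, 8, 8))

h = np.zeros((8, 8))  # hidden state is itself a spatial feature map
for t in range(10):
    h = convgru_step(appearance[t] + motion[t], h, p)
# h now summarizes the whole 10-entry input as one feature map
```

Because the candidate state passes through tanh and the update gate stays in (0, 1), the hidden map remains bounded while still integrating evidence from every timestep, which is what lets the recurrence capture longer-range structure than snippet-level fusion.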
Acknowledgement
This paper is supported by NSFC (No. 61772330, 61272247, 61533012, 61472075), the 863 National High Technology Research and Development Program of China (SS2015AA020501), the Basic Research Project of Innovation Action Plan (16JC1402800) and the Major Basic Research Program (15JC1400103) of Shanghai Science and Technology Committee.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Huang, B., Huang, H., Lu, H. (2017). Convolutional Gated Recurrent Units Fusion for Video Action Recognition. In: Liu, D., Xie, S., Li, Y., Zhao, D., El-Alfy, ES. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science(), vol 10636. Springer, Cham. https://doi.org/10.1007/978-3-319-70090-8_12
DOI: https://doi.org/10.1007/978-3-319-70090-8_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70089-2
Online ISBN: 978-3-319-70090-8