
Multi-stream with Deep Convolutional Neural Networks for Human Action Recognition in Videos

  • Xiao Liu
  • Xudong Yang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11301)

Abstract

Recently, convolutional neural networks (CNNs) have been widely applied to human action recognition in videos, fusing appearance and motion information through two-stream networks. However, performance on video action recognition still lags far behind that of still-image recognition because temporal information is difficult to extract. In this paper, we propose a multi-stream convolutional neural network architecture for human action recognition in videos that extracts richer temporal features. We make three contributions: (a) we present a multi-stream architecture of 3D and 2D convolutional neural networks that takes still RGB frames, dense optical flow and gradient maps as separate network inputs; (b) we propose a novel 3D convolutional neural network with residual blocks and use a deep 2D convolutional neural network, augmented with attention blocks, as the pre-trained network to extract the dominant motion information; (c) we fuse the multi-stream networks with weights assigned not only to each network but also to every action category, so as to exploit the best performance of each network. Our networks are trained and evaluated on the standard video action benchmarks UCF-101 and HMDB-51, and the results show that our method achieves recognition performance comparable to the state of the art.
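
As a rough illustration of contribution (c), the sketch below fuses hypothetical per-stream softmax scores using weights defined per stream and per action category. The array shapes, weight values and variable names are illustrative assumptions, not the authors' exact fusion scheme (which the paper derives from the per-category performance of each trained stream).

import numpy as np

# Hypothetical per-stream, per-category softmax scores for one video clip.
# Rows: streams (RGB frames, optical flow, gradient maps); columns: action classes.
num_classes = 101  # e.g. UCF-101
rng = np.random.default_rng(0)
stream_scores = rng.random((3, num_classes))
stream_scores /= stream_scores.sum(axis=1, keepdims=True)  # normalise each stream's scores

# Weights per stream AND per category (shape: streams x classes); in practice these
# could reflect each stream's validation accuracy on each class. Values here are made up.
category_weights = rng.random((3, num_classes))
category_weights /= category_weights.sum(axis=0, keepdims=True)  # weights sum to 1 per class

# Fuse: weight every stream's score for every class, then sum over streams.
fused_scores = (category_weights * stream_scores).sum(axis=0)
predicted_class = int(fused_scores.argmax())
print(predicted_class)

Weighting per category, rather than with a single scalar per network, lets a stream that is strong on some actions (e.g. motion-dominated ones for the optical-flow stream) dominate those classes without dragging down classes where another stream is more reliable.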

Keywords

Action recognition · Multi-stream · 3D CNNs · Attention · Category weights


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. National Engineering Laboratory for Integrated Command and Dispatch Technology, School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
