A Novel Scheme for Training Two-Stream CNNs for Action Recognition

  • Reinier Oves García
  • Eduardo F. Morales
  • L. Enrique Sucar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11896)


Human action recognition from realistic video data is a challenging and relevant research area. The state of the art is led by methods based on Convolutional Neural Networks (CNNs), especially two-stream CNNs (appearance and motion). In this paper we present a novel scheme for training two-stream CNNs that increases the accuracy of the fusion when one of the channels does not perform as well as the other, and that reduces the total time needed to train the entire architecture. In addition, we introduce a new descriptor for motion representation that improves on the state of the art. Based on this more efficient scheme, we developed an early recognition system. The proposed approach is evaluated on the UCF101 data set with competitive results.
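To make the notion of fusing two streams concrete, the following is a minimal sketch of generic late fusion, in which the class probabilities of an appearance stream and a motion stream are combined by a weighted average. It is not the training scheme proposed in the paper, whose details are not given in this abstract. The function names and the `appearance_weight` parameter are assumptions introduced purely for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(appearance_logits, motion_logits, appearance_weight=0.5):
    """Weighted average of the per-class probabilities of the two streams.

    appearance_weight is a hypothetical parameter (not from the paper):
    1.0 trusts only the appearance stream, 0.0 only the motion stream.
    """
    if len(appearance_logits) != len(motion_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= appearance_weight <= 1.0:
        raise ValueError("appearance_weight must lie in [0, 1]")
    p_appearance = softmax(appearance_logits)
    p_motion = softmax(motion_logits)
    w = appearance_weight
    return [w * a + (1.0 - w) * m for a, m in zip(p_appearance, p_motion)]


def predict(appearance_logits, motion_logits, appearance_weight=0.5):
    """Index of the class with the highest fused probability."""
    fused = fuse_scores(appearance_logits, motion_logits, appearance_weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

With equal weights, a confident but weaker stream can override the stronger one; shifting the weight toward the more reliable stream is one simple way to address the situation the abstract describes, where one channel does not perform as well as the other.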


Keywords: Human Action Recognition, Convolutional Neural Networks



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Reinier Oves García (1)
  • Eduardo F. Morales (1)
  • L. Enrique Sucar (1)
  1. Instituto Nacional de Astrofísica, Óptica y Electrónica, San Andrés Cholula, Mexico
