
Multi-level Three-Stream Convolutional Networks for Video-Based Action Recognition

  • Yijing Lv
  • Huicheng Zheng
  • Wei Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11257)

Abstract

Deep convolutional neural networks (ConvNets) have shown remarkable capability for visual feature learning and representation. In the field of video-based action recognition, much progress has been made with the development of ConvNets. However, mainstream ConvNets used for video-based action recognition, such as two-stream ConvNets and 3D ConvNets, still lack the ability to represent fine-grained features. In this paper, we propose a novel architecture named multi-level three-stream convolutional network (MLTSN), which contains three streams, i.e., the spatial stream, the temporal stream, and the multi-level correlation stream (MLCS). The MLCS contains several correlation modules, which fuse appearance and motion features at the same levels to obtain spatial-temporal correlation maps. These correlation maps are then fed into several convolutional layers to produce refined features. The whole network is trained in a multi-step manner. Extensive experimental results show that the performance of the proposed network is competitive with state-of-the-art methods on HMDB51 and UCF101.
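
As a concrete illustration of the correlation module described above, the following is a minimal sketch, not the authors' implementation: the abstract does not specify the exact fusion operation, so element-wise multiplication of same-level appearance and motion feature maps is assumed, and all names and tensor shapes (CorrelationModule, 256-channel 28x28 features) are hypothetical.

```python
# Hypothetical sketch of one correlation module in the MLCS: it fuses
# same-level appearance (spatial-stream) and motion (temporal-stream)
# feature maps into a correlation map and refines it with a small stack
# of convolutions. The element-wise product used for fusion is an
# assumption; the abstract does not give the exact correlation operation.
import torch
import torch.nn as nn


class CorrelationModule(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Refinement convolutions applied to the fused correlation map.
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # Same-level features from the two streams share shape (N, C, H, W).
        corr = appearance * motion  # assumed fusion: element-wise correlation
        return self.refine(corr)


# Usage: fuse mid-level features from the spatial and temporal streams.
spatial_feat = torch.randn(2, 256, 28, 28)
temporal_feat = torch.randn(2, 256, 28, 28)
module = CorrelationModule(256)
out = module(spatial_feat, temporal_feat)  # refined correlation features
```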

Keywords

Action recognition · Convolutional networks · Multi-level correlation mechanism

Acknowledgments

This work was supported by the National Natural Science Foundation of China (U1611461), the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (second phase, No. U1501501), and the Science and Technology Program of Guangzhou (No. 201803030029).

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
  2. Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, Guangzhou, China
  3. Guangdong Key Laboratory of Information Security Technology, Guangzhou, China