Residual Gating Fusion Network for Human Action Recognition

  • Junxuan Zhang
  • Haifeng Hu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10996)


Most recent works leverage the two-stream framework to model spatiotemporal information for video action recognition and achieve remarkable performance. In this paper, we propose a novel convolutional architecture, called Residual Gating Fusion Network (RGFN), to improve the performance of such methods by fully exploring the spatiotemporal information in residual signals. To further exploit the local details of low-level layers, we introduce Multi-Scale Convolution Fusion (MSCF) to implement spatiotemporal fusion at multiple levels. Since RGFN is an end-to-end network, it can be trained on various kinds of video datasets and applied to other video analysis tasks. We evaluate RGFN on two standard benchmarks, UCF101 and HMDB51, and analyze the design choices of the convolutional network. Experimental results demonstrate the advantages of RGFN, which achieves state-of-the-art performance.


Human action recognition, Video analysis, Spatiotemporal fusion, Convolutional neural network



This work was supported in part by the National Natural Science Foundation of China under Grant 61673402, Grant 61273270, and Grant 60802069, in part by the Natural Science Foundation of Guangdong under Grant 2017A030311029, Grant 2016B010109002, Grant 2015B090912001, Grant 2016B010123005, and Grant 2017B090909005, in part by the Science and Technology Program of Guangzhou under Grant 201704020180 and Grant 201604020024, and in part by the Fundamental Research Funds for the Central Universities of China.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. School of Electronic and Information Engineering, Sun Yat-sen University, Guangzhou, China
