Multi-fiber Networks for Video Recognition

  • Yunpeng Chen
  • Yannis Kalantidis
  • Jianshu Li
  • Shuicheng Yan
  • Jiashi Feng
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11205)

Abstract

In this paper, we aim to reduce the computational cost of spatio-temporal deep neural networks, making them run as fast as their 2D counterparts while preserving state-of-the-art accuracy on video recognition benchmarks. To this end, we present the novel Multi-Fiber architecture that slices a complex neural network into an ensemble of lightweight networks, or fibers, that run through the network. To facilitate information flow between fibers, we further incorporate multiplexer modules and end up with an architecture that reduces the computational cost of 3D networks by an order of magnitude while increasing recognition performance at the same time. Extensive experimental results show that our multi-fiber architecture significantly boosts the efficiency of existing convolutional networks for both image and video recognition tasks, achieving state-of-the-art performance on the UCF-101, HMDB-51, and Kinetics datasets. Our proposed model requires over 9× and 13× fewer computations than the I3D [1] and R(2+1)D [2] models, respectively, while providing higher accuracy.
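
To make the slicing idea concrete: a dense 3D convolution over C channels costs on the order of C^2 multiply-adds per output position, whereas slicing it into N parallel fibers of C/N channels each costs N × (C/N)^2 = C^2/N, an N-fold reduction. The sketch below is a minimal, illustrative multi-fiber residual unit in PyTorch: the fibers are realized as the groups of a grouped convolution, and the multiplexer as a pair of 1×1×1 convolutions that re-mix information across the otherwise isolated fibers. All layer sizes, the fiber count, and the exact composition of the unit are our own assumptions for illustration, not the authors' released architecture.

import torch
import torch.nn as nn

class MultiFiberUnit(nn.Module):
    """Illustrative multi-fiber residual unit (hypothetical sizes).

    Each "fiber" is one group of a grouped 3D convolution, so the fibers
    compute in parallel without exchanging information; the multiplexer
    (1x1x1 reduce-then-expand convolutions) mixes features across fibers
    before they are sliced apart again.
    """

    def __init__(self, channels, num_fibers=16, mux_ratio=4):
        super().__init__()
        mid = channels // mux_ratio
        # Multiplexer: cheap full-channel 1x1x1 convolutions that route
        # information across fibers.
        self.multiplexer = nn.Sequential(
            nn.Conv3d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm3d(mid),
            nn.ReLU(inplace=True),
            nn.Conv3d(mid, channels, kernel_size=1, bias=False),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
        )
        # Fibers: grouped 3x3x3 convolutions; with `num_fibers` groups the
        # multiply-add count drops by a factor of `num_fibers` versus a
        # dense convolution of the same width.
        self.fibers = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1,
                      groups=num_fibers, bias=False),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1,
                      groups=num_fibers, bias=False),
            nn.BatchNorm3d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual unit: multiplex across fibers, run the fibers, add input.
        return self.relu(self.fibers(self.multiplexer(x)) + x)

# Example on a clip batch of shape (N, C, T, H, W).
unit = MultiFiberUnit(channels=96, num_fibers=16)
clip = torch.randn(2, 96, 8, 56, 56)
print(unit(clip).shape)  # torch.Size([2, 96, 8, 56, 56])

In a full network one would stack many such units with temporal and spatial striding between stages; the channel counts are kept fixed here purely for clarity.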

Keywords

Deep learning · Neural networks · Video classification · Action recognition

Acknowledgements

Jiashi Feng was partially supported by NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133 and MOE Tier-II R-263-000-D17-112.

References

  1. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4733. IEEE (2017)
  2. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition (2017). arXiv preprint: arXiv:1711.11248
  3. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
  4. Girshick, R.: Fast R-CNN (2015). arXiv preprint: arXiv:1504.08083
  5. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs (2016). arXiv preprint: arXiv:1606.00915
  6. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
  7. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497. IEEE (2015)
  8. Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning for video understanding (2017). arXiv preprint: arXiv:1712.04851
  9. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv preprint: arXiv:1409.1556
  11. Shou, Z., Wang, D., Chang, S.F.: Temporal action localization in untrimmed videos via multi-stage CNNs. In: CVPR (2016)
  12. Shou, Z., Chan, J., Zareian, A., Miyazawa, K., Chang, S.F.: CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: CVPR (2017)
  13. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
  14. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  15. Ng, J.Y.H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: deep networks for video classification. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4694–4702. IEEE (2015)
  16. Wang, L., Li, W., Li, W., Van Gool, L.: Appearance-and-relation networks for video classification (2017). arXiv preprint: arXiv:1711.09125
  17. Tran, A., Cheong, L.F.: Two-stream flow-guided convolutional attention networks for action recognition. In: International Conference on Computer Vision (2017)
  18. Wu, C.Y., Zaheer, M., Hu, H., Manmatha, R., Smola, A.J., Krähenbühl, P.: Compressed video action recognition (2017). arXiv preprint: arXiv:1712.00636
  19. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR (2014)
  20. Kay, W., et al.: The Kinetics human action video dataset (2017). arXiv preprint: arXiv:1705.06950
  21. Shou, Z., Gao, H., Zhang, L., Miyazawa, K., Chang, S.F.: AutoLoc: weakly-supervised temporal action localization in untrimmed videos. In: Ferrari, V., et al. (eds.) ECCV 2018, Part XVI. LNCS, vol. 11220, pp. 162–179. Springer, Cham (2018)
  22. Shou, Z., et al.: Online detection of action start in untrimmed, streaming videos. In: Ferrari, V., et al. (eds.) ECCV 2018, Part III. LNCS, vol. 11207, pp. 551–568. Springer, Cham (2018)
  23. Tran, D., Ray, J., Shou, Z., Chang, S.F., Paluri, M.: Convnet architecture search for spatiotemporal feature learning (2017). arXiv preprint: arXiv:1708.05038
  24. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  25. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications (2017). arXiv preprint: arXiv:1704.04861
  26. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Inverted residuals and linear bottlenecks: mobile networks for classification, detection and segmentation (2018). arXiv preprint: arXiv:1801.04381
  27. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices (2017). arXiv preprint: arXiv:1707.01083
  28. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5987–5995. IEEE (2017)
  29. Ahmed, K., Torresani, L.: MaskConnect: connectivity learning by gradient descent. In: Ferrari, V., et al. (eds.) ECCV 2018, Part V. LNCS, vol. 11209, pp. 362–378. Springer, Cham (2018)
  30. Chollet, F.: Xception: deep learning with depthwise separable convolutions (2017). arXiv preprint: arXiv:1610.02357
  31. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
  32. Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. In: Advances in Neural Information Processing Systems, pp. 4470–4478 (2017)
  33. Chen, T., et al.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems (2015). arXiv preprint: arXiv:1512.01274
  34. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild (2012). arXiv preprint: arXiv:1212.0402
  35. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 2556–2563. IEEE (2011)
  36. Paszke, A., Gross, S., Chintala, S., Chanan, G.: PyTorch (2017)
  37. Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal multiplier networks for video action recognition. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7445–7454. IEEE (2017)
  38. Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016, Part VIII. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_2

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Yunpeng Chen (1)
  • Yannis Kalantidis (2)
  • Jianshu Li (1)
  • Shuicheng Yan (1, 3)
  • Jiashi Feng (1)
  1. National University of Singapore, Singapore, Singapore
  2. Facebook Research, Menlo Park, USA
  3. Qihoo 360 AI Institute, Beijing, China