NeXtVLAD: An Efficient Neural Network to Aggregate Frame-Level Features for Large-Scale Video Classification

  • Rongcheng Lin
  • Jing Xiao
  • Jianping Fan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)

Abstract

This paper introduces a fast and efficient network architecture, NeXtVLAD, for aggregating frame-level features into a compact feature vector for large-scale video classification. The basic idea is to decompose a high-dimensional feature into a group of relatively low-dimensional vectors with attention before applying NetVLAD aggregation over time. This NeXtVLAD approach turns out to be both effective and parameter-efficient for aggregating temporal information. In the 2nd Youtube-8M video understanding challenge, a single NeXtVLAD model with fewer than 80M parameters achieves a GAP score of 0.87846 on the private leaderboard. A mixture of 3 NeXtVLAD models reaches 0.88722, ranked 3rd among 394 teams. The code is publicly available at https://github.com/linrongc/youtube-8m.
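The group decomposition described in the abstract can be sketched in NumPy. This is a simplified illustration, not the authors' TensorFlow implementation: the dimensions (`N`, `G`, `K`), the shared soft-assignment weights `W_a`, the sigmoid group-attention weights `W_g`, and the cluster anchors `centers` are hypothetical stand-ins for learned parameters, and the paper's initial feature expansion and per-group cluster assignment are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 100, 1024      # frames, frame-feature dimension
G, K = 8, 64          # groups, cluster centers (illustrative settings)
D = N // G            # per-group dimension after the split

X = rng.standard_normal((M, N))        # frame-level features
# hypothetical learned parameters
centers = rng.standard_normal((K, D))  # cluster anchors
W_a = rng.standard_normal((N, K))      # soft-assignment weights
W_g = rng.standard_normal((N, G))      # group-attention weights

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# soft assignment of each frame to each cluster (shared across groups here)
alpha = softmax(X @ W_a)                        # (M, K)
# sigmoid attention over groups, down-weighting uninformative groups
attn = 1.0 / (1.0 + np.exp(-(X @ W_g)))         # (M, G)

# decompose each high-dimensional feature into G low-dimensional vectors
Xg = X.reshape(M, G, D)                         # (M, G, D)

# residuals to each center, weighted by assignment and group attention,
# summed over frames and groups -> VLAD-style (K, D) descriptor
resid = Xg[:, :, None, :] - centers[None, None, :, :]  # (M, G, K, D)
w = alpha[:, None, :, None] * attn[:, :, None, None]   # (M, G, K, 1)
vlad = (w * resid).sum(axis=(0, 1))                    # (K, D)

# intra-normalize per cluster, then L2-normalize the flattened vector
vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-6
v = vlad.ravel()
v /= np.linalg.norm(v) + 1e-6
print(v.shape)  # (8192,) -- a compact K*D video descriptor
```

Note where the parameter efficiency comes from: the aggregated descriptor has K·(N/G) dimensions rather than K·N, so the fully connected layer that follows it is roughly G times smaller, consistent with the efficiency claimed in the abstract.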

Keywords

Neural network · VLAD · Video classification · Youtube8M

Acknowledgement

The authors would like to thank Kaggle and the Google team for hosting the Youtube-8M video understanding challenge and providing the Youtube-8M Tensorflow Starter Code.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of North Carolina at Charlotte, Charlotte, USA
