Learnable Pooling Methods for Video Classification

  • Sebastian Kmiec
  • Juhan Bae
  • Ruijian An
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)

Abstract

We introduce modifications to state-of-the-art approaches to aggregating local video descriptors by using attention mechanisms and function approximations. Rather than using ensembles of existing architectures, we provide insight into creating new architectures. We demonstrate our solutions in “The 2nd YouTube-8M Video Understanding Challenge” using frame-level video and audio descriptors. We obtain testing accuracy similar to the state of the art while meeting budget constraints, and we touch upon strategies to improve the state of the art. Model implementations are available at https://github.com/pomonam/LearnablePoolingMethods.
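As a rough illustration of what aggregating frame-level descriptors with an attention mechanism can look like, the sketch below pools a sequence of frame vectors into a single video-level vector. It is a generic single-query attention pooling written for this summary, not one of the architectures in the paper or its repository; the names `attention_pool` and `query` are hypothetical, and in a real model the query would be a learned parameter.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Aggregate T frame descriptors (each of length D) into one vector.

    frames: list of T lists of D floats (frame-level descriptors).
    query:  list of D floats; a hypothetical attention vector that would
            be learned in practice, fixed here for illustration.
    """
    # One relevance score per frame: dot product with the query vector.
    scores = [sum(f_d * q_d for f_d, q_d in zip(f, query)) for f in frames]
    # Normalise the scores so the frame weights sum to 1.
    weights = softmax(scores)
    dim = len(query)
    # Video-level descriptor: attention-weighted sum of the frames.
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A zero query scores every frame equally, which reduces to average pooling.
avg_pooled, avg_weights = attention_pool(frames, [0.0, 0.0])

# A query aligned with the first dimension favours frames active there.
focused_pooled, focused_weights = attention_pool(frames, [10.0, 0.0])
```

With a zero query the weights are uniform, so the result is plain average pooling; a non-zero query shifts weight toward the frames that match it, which is the behaviour a learned pooling layer can exploit.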

Keywords

Video classification · Youtube-8M · NetVLAD · Attention · Pooling · Aggregation

Copyright information

© Springer Nature Switzerland AG 2019
