Abstract
Most recent approaches to action recognition from video leverage deep architectures to encode a video clip into a fixed-length representation vector that is then used for classification. For this to be successful, the network must be capable of suppressing irrelevant scene background and extracting the representation from the most discriminative part of the video. Our contribution builds on the observation that the spatio-temporal patterns characterizing actions in videos are highly correlated with objects and their locations in the video. We propose Top-down Attention Action VLAD (TA-VLAD), a deep recurrent architecture with built-in spatial attention that performs temporally aggregated VLAD encoding for action recognition from videos. We adopt a top-down approach to attention: class-specific activation maps obtained from a deep CNN pre-trained for image classification are used to weight appearance features before encoding them into a fixed-length video descriptor using Gated Recurrent Units. Our method achieves state-of-the-art recognition accuracy on the HMDB51 and UCF101 benchmarks.
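As a rough illustration of the pipeline the abstract describes, the sketch below combines class-activation-map (CAM) based spatial attention, NetVLAD-style soft-assignment encoding, and GRU temporal aggregation in PyTorch. All module names, tensor sizes, and the cluster count are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch, assuming per-frame CNN feature maps and matching class
# activation maps are precomputed. Feature dimension, cluster count K, GRU
# hidden size, and class count are placeholder values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionVLADSketch(nn.Module):
    def __init__(self, feat_dim=512, num_clusters=32, num_classes=51):
        super().__init__()
        self.K, self.D = num_clusters, feat_dim
        # NetVLAD-style learnable cluster centres and soft-assignment conv.
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))
        self.assign = nn.Conv2d(feat_dim, num_clusters, kernel_size=1)
        # GRU aggregates one K*D VLAD vector per frame into a clip descriptor.
        self.gru = nn.GRU(num_clusters * feat_dim, 512, batch_first=True)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, feats, cams):
        # feats: (B, T, D, H, W) per-frame CNN feature maps
        # cams:  (B, T, H, W) class activation maps from a pretrained CNN
        B, T, D, H, W = feats.shape
        vlads = []
        for t in range(T):
            x = feats[:, t].reshape(B, D, -1)            # (B, D, H*W)
            # Renormalise the CAM into spatial attention weights and apply
            # them to the appearance features (top-down attention step).
            a = torch.softmax(cams[:, t].reshape(B, -1), dim=1)
            x = x * a.unsqueeze(1)
            # Soft-assign each location to the K clusters (sketch choice:
            # assignment is computed from the unweighted feature map).
            s = self.assign(feats[:, t]).reshape(B, self.K, -1)
            s = torch.softmax(s, dim=1)
            # Assignment-weighted residuals to the cluster centres: (B, K, D)
            r = torch.einsum('bkn,bdn->bkd', s, x) \
                - s.sum(-1).unsqueeze(-1) * self.centers
            r = F.normalize(r.reshape(B, -1), dim=1)     # L2-normalised VLAD
            vlads.append(r)
        _, h = self.gru(torch.stack(vlads, dim=1))       # temporal aggregation
        return self.classifier(h[-1])                    # (B, num_classes)
```

Calling the module on a `(B, T, D, H, W)` feature tensor together with matching CAMs yields class scores for the whole clip; the recurrent aggregation replaces the simple averaging of per-frame VLAD vectors used in earlier encoding schemes.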
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Sudhakaran, S., Lanz, O. (2018). Top-Down Attention Recurrent VLAD Encoding for Action Recognition in Videos. In: Ghidini, C., Magnini, B., Passerini, A., Traverso, P. (eds.) AI*IA 2018 – Advances in Artificial Intelligence. Lecture Notes in Computer Science, vol. 11298. Springer, Cham. https://doi.org/10.1007/978-3-030-03840-3_28
DOI: https://doi.org/10.1007/978-3-030-03840-3_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03839-7
Online ISBN: 978-3-030-03840-3