Interaction-Aware Spatio-Temporal Pyramid Attention Networks for Action Classification

  • Yang Du
  • Chunfeng Yuan (email author)
  • Bing Li
  • Lili Zhao
  • Yangxi Li
  • Weiming Hu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11220)


Local features at neighboring spatial positions in feature maps are highly correlated because their receptive fields often overlap. Self-attention typically computes the weight score of each local feature from a weighted sum (or another function) of that feature's internal elements alone, ignoring interactions among local features. To address this, we propose an effective interaction-aware self-attention model, inspired by PCA, to learn attention maps. Furthermore, since different layers of a deep network capture feature maps at different scales, we use these feature maps to construct a spatial pyramid and exploit the multi-scale information to obtain more accurate attention scores, which are then used to weight the local features at all spatial positions of the feature maps when computing attention maps. Moreover, our spatial pyramid attention places no restriction on the number of input feature maps, so it extends easily to a spatio-temporal version. Finally, we embed our model in general CNNs to form end-to-end attention networks for action classification. Experimental results show that our method achieves state-of-the-art results on UCF101, HMDB51 and the untrimmed Charades dataset.
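The abstract's two main ideas, scoring each spatial position through interactions among local features rather than each feature in isolation, and pooling feature maps from several layers into a spatial pyramid, can be illustrated with a minimal NumPy sketch. This is a hypothetical, simplified rendering, not the authors' exact formulation: feature maps are given as (positions, channels) arrays, the "interaction" term is a covariance across positions, and the PCA-flavoured scoring direction is the covariance's leading eigenvector.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pyramid_attention(feature_maps):
    """Hypothetical sketch of interaction-aware spatial pyramid attention.

    Each element of `feature_maps` is a (positions, channels) array from a
    different network layer. All maps are average-pooled to the coarsest
    spatial size, concatenated along channels, and each position is scored
    by projecting onto the leading eigenvector of the feature covariance
    (a PCA-style interaction among local features)."""
    # Pool every map down to the coarsest spatial grid.
    target = min(f.shape[0] for f in feature_maps)
    pooled = []
    for f in feature_maps:
        positions, channels = f.shape
        step = positions // target
        pooled.append(
            f[: target * step].reshape(target, step, channels).mean(axis=1)
        )
    x = np.concatenate(pooled, axis=1)            # (target, total channels)

    # Interaction term: covariance among local features across positions.
    xc = x - x.mean(axis=0, keepdims=True)
    cov = xc.T @ xc / x.shape[0]

    # Leading eigenvector gives a data-dependent scoring direction.
    w = np.linalg.eigh(cov)[1][:, -1]             # eigh sorts ascending
    scores = softmax(x @ w)                       # one weight per position
    attended = (scores[:, None] * x).sum(axis=0)  # attention-weighted sum
    return scores, attended
```

Because the pooling step only equalizes spatial size, the function accepts any number of input maps, which mirrors the property the abstract uses to extend the spatial pyramid to a spatio-temporal one.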



This work is supported by the National Key R&D Plan (No. 2017YFB1002801, 2016QY01W0106), NSFC (Grant No. 61472420, 61472063, 61751212, 61472421, 61772225, U1736106, U1636218), the Key Research Program of Frontier Sciences, CAS (Grant No. XDB02070003, QYZDJ-SSW-JSC040), the CAS external cooperation key project, and the Research Project of State Grid General Aviation Company Limited (No. 5201/2018-44001B). Bing Li is also supported by the Youth Innovation Promotion Association, CAS.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Yang Du (affiliations 1, 2, 3)
  • Chunfeng Yuan (affiliation 2, email author)
  • Bing Li (affiliation 2)
  • Lili Zhao (affiliation 3)
  • Yangxi Li (affiliation 4)
  • Weiming Hu (affiliation 2)
  1. University of Chinese Academy of Sciences, Beijing, China
  2. CAS Center for Excellence in Brain Science and Intelligence Technology, National Laboratory of Pattern Recognition, Institute of Automation, CAS, Beijing, China
  3. Meitu, China
  4. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, China