stagNet: An Attentive Semantic RNN for Group Activity Recognition

  • Mengshi Qi
  • Jie Qin
  • Annan Li
  • Yunhong Wang
  • Jiebo Luo
  • Luc Van Gool
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11214)


Group activity recognition plays a fundamental role in a variety of applications, e.g. sports video analysis and intelligent surveillance. Modeling the spatio-temporal contextual information in a scene remains a crucial yet challenging issue. We propose a novel attentive semantic recurrent neural network (RNN), dubbed stagNet, for understanding group activities in videos, based on spatio-temporal attention and a semantic graph. The semantic graph is explicitly modeled to describe the spatial context of the whole scene, and is further integrated with the temporal factor via a structural-RNN. Benefiting from the ‘factor sharing’ and ‘message passing’ mechanisms, our model is capable of extracting discriminative spatio-temporal features and capturing inter-group relationships. Moreover, we adopt a spatio-temporal attention model to attend to key persons and frames for improved performance. Two widely used datasets are employed for performance evaluation, and extensive experimental results demonstrate the superiority of our method.
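To make the idea of attending to key persons and frames concrete, the following is a minimal sketch of generic soft attention pooling, first over the persons in each frame and then over the frames of a clip. It is not the stagNet implementation: the abstract does not specify the scoring function, so the dot-product scoring and the names `attend`, `spatio_temporal_pool`, `person_query`, and `frame_query` are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(vectors, weights):
    """Weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) for d in range(dim)]


def attend(vectors, query):
    """Soft attention: score each vector against a query, normalise, pool.

    Dot-product scoring is an assumption made for this sketch only.
    """
    weights = softmax([dot(v, query) for v in vectors])
    return weighted_sum(vectors, weights), weights


def spatio_temporal_pool(video, person_query, frame_query):
    """Pool a clip into one vector; video[t][i] is person i's feature in frame t."""
    frame_feats, person_weights = [], []
    for persons in video:
        # Spatial attention: weight the persons within this frame.
        feat, w = attend(persons, person_query)
        frame_feats.append(feat)
        person_weights.append(w)
    # Temporal attention: weight the per-frame features across the clip.
    clip_feat, frame_weights = attend(frame_feats, frame_query)
    return clip_feat, person_weights, frame_weights


if __name__ == "__main__":
    # Toy clip: 2 frames, 2 persons per frame, 2-dimensional features.
    toy_video = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
    clip, pw, fw = spatio_temporal_pool(toy_video, [1.0, 0.0], [0.0, 1.0])
    print("clip feature:", clip)
    print("person weights per frame:", pw)
    print("frame weights:", fw)
```

In a trained model the queries would be learned parameters, or replaced by a small scoring network, and the person features would come from the recurrent units rather than being fixed inputs. The sketch only shows how normalised weights let a few persons or frames dominate the pooled representation.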


Group activity recognition · Spatio-temporal attention · Semantic graph · Scene understanding



This work was partly supported by the National Natural Science Foundation of China (No. 61573045) and the Foundation for Innovative Research Groups through the National Natural Science Foundation of China (No. 61421003). Jiebo Luo thanks New York State for its support through the Goergen Institute for Data Science, and the NSF for Award No. 1722847. Mengshi Qi acknowledges financial support from the China Scholarship Council.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Mengshi Qi (1)
  • Jie Qin (2, 3)
  • Annan Li (1)
  • Yunhong Wang (1, corresponding author)
  • Jiebo Luo (4)
  • Luc Van Gool (2)

  1. Beijing Advanced Innovation Center for Big Data and Brain Computing, School of Computer Science and Engineering, Beihang University, Beijing, China
  2. Computer Vision Laboratory, ETH Zurich, Zurich, Switzerland
  3. Inception Institute of Artificial Intelligence, Abu Dhabi, UAE
  4. Department of Computer Science, University of Rochester, Rochester, USA
