Supervised Learning of Group Activity

  • Shaogang Gong
  • Tao Xiang


In a public space, the actions of individuals are commonly observed as elements of group activities, which are likely to involve multiple objects interacting or co-existing in a shared space. Group activity modelling is concerned with modelling not only the actions of individual objects in isolation, but also the interactions and causal relationships among those actions. To make semantic sense of visual observations of group activities, a supervised learning model first segments a video stream temporally and automatically into plausible activity elements, then constructs a model from the visual data observed so far to describe different categories of activities, and finally recognises a new instance of activity by classifying it into one of the known categories. To this end, three problems need to be addressed: (1) how to select visual features that best represent activities; (2) how to perform automatic video segmentation; and (3) how to model the temporal and causal correlations among objects whose actions are considered to form meaningful group activities. In this chapter, we describe a contextual event based group activity representation, two different methods for activity based video segmentation, and a dynamic Bayesian network for supervised learning of an activity model whose structure is learned automatically from visual observations.
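
The abstract does not say how the structure of the dynamic Bayesian network is scored or searched. As a minimal sketch only, assuming candidate structures are compared with Schwarz's Bayesian Information Criterion (BIC = -2 ln L + k ln N, lower is better), selection among already fitted candidates could look like the following. The function names, candidate labels, log-likelihoods and parameter counts are all hypothetical and are not taken from the chapter.

```python
import math


def bic_score(log_likelihood: float, num_params: int, num_samples: int) -> float:
    """Schwarz's BIC in 'lower is better' form: -2 ln L + k ln N."""
    return -2.0 * log_likelihood + num_params * math.log(num_samples)


def select_structure(candidates: dict, num_samples: int):
    """Pick the candidate with the lowest BIC.

    candidates maps a label to (maximised log-likelihood, free parameter count).
    Returns the winning label and the full table of scores.
    """
    scores = {
        name: bic_score(log_lik, k, num_samples)
        for name, (log_lik, k) in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fitted models: (log-likelihood, number of free parameters).
candidates = {
    "2_states": (-1250.0, 10),
    "3_states": (-1180.0, 18),
    "4_states": (-1175.0, 28),
}

best, scores = select_structure(candidates, num_samples=500)
for name, score in scores.items():
    print(f"{name}: BIC = {score:.2f}")
print("selected:", best)
```

With these made-up numbers the four-state model fits slightly better than the three-state one, but its extra parameters are penalised more heavily, so the three-state candidate is selected. The penalty term is what lets a criterion of this kind trade goodness of fit against model complexity.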


Hidden Markov Model, Bayesian Information Criterion, Video Content, Hidden State, Dynamic Bayesian Network



Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  1. School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK
  2. School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK
