Motion Context: A New Representation for Human Action Recognition

  • Ziming Zhang
  • Yiqun Hu
  • Syin Chan
  • Liang-Tien Chia
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5305)


Abstract

One of the key challenges in human action recognition from video sequences is how to model an action adequately. In this paper we propose a novel motion-based representation called Motion Context (MC), which is insensitive to the scale and direction of an action, by employing image representation techniques. An MC captures the distribution of motion words (MWs) over relative locations in a local region of the motion image (MI) around a reference point, and thus summarizes the local motion information in a rich 3D MC descriptor. In this way, any human action can be represented as a 3D descriptor by summing up all the MC descriptors of the action. For recognition, we propose 4 different configurations: MW+pLSA, MW+SVM, MC+w³-pLSA (a new directed graphical model extending pLSA), and MC+SVM. We test our approach on two human action video datasets, from KTH and the Weizmann Institute of Science (WIS), with promising results. On the KTH dataset, the proposed MC representation achieves the highest performance when combined with the proposed w³-pLSA. On the WIS dataset, the best performance of the proposed MC is comparable to the state of the art.
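The MC construction described above (binning motion words by their relative location around a reference point, then summing the resulting 3D histograms into one action descriptor) can be sketched as follows. This is a minimal illustration assuming shape-context-style log-polar binning; the function names, parameters, and the final normalisation are assumptions for the sketch, not the paper's actual implementation.

```python
import numpy as np

def motion_context(points, words, ref, n_r=3, n_theta=8, n_words=50, r_max=1.0):
    """Illustrative sketch of an MC descriptor: a 3D histogram
    (log-radius bin x angle bin x motion-word id) of motion words
    at relative locations around a reference point."""
    hist = np.zeros((n_r, n_theta, n_words))
    d = np.asarray(points, dtype=float) - np.asarray(ref, dtype=float)
    r = np.hypot(d[:, 0], d[:, 1])
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    # Log-polar spatial bins (shape-context style); clip keeps far points
    # in the outermost ring rather than discarding them.
    r_bin = np.clip((np.log1p(r) / np.log1p(r_max) * n_r).astype(int), 0, n_r - 1)
    t_bin = (theta / (2 * np.pi) * n_theta).astype(int) % n_theta
    for rb, tb, w in zip(r_bin, t_bin, words):
        hist[rb, tb, w] += 1
    return hist

def action_descriptor(mc_list):
    """Sum the MC descriptors of one action, then L1-normalise."""
    total = np.sum(mc_list, axis=0)
    return total / max(total.sum(), 1e-12)
```

The summed, normalised 3D descriptor could then be flattened and fed to an SVM, or its word counts used as input to a pLSA-style topic model, mirroring the four recognition configurations listed in the abstract.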



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Ziming Zhang¹
  • Yiqun Hu¹
  • Syin Chan¹
  • Liang-Tien Chia¹
  1. Center for Multimedia and Network Technology, School of Computer Engineering, Nanyang Technological University, Singapore