Abstract
We tackle the challenging problem of human activity recognition in realistic video sequences. Unlike local feature-based methods or global template-based methods, we propose to represent a video sequence by a set of middle-level parts. A part, or component, has consistent spatial structure and consistent motion. We first segment the visual motion patterns and generate a set of middle-level components by clustering keypoint-based trajectories extracted from the video. To further exploit the interdependencies of the moving parts, we then define spatio-temporal relationships between pairwise components. The resulting descriptive middle-level components and pairwise components capture the essential motion characteristics of human activities, and also give a very compact representation of the video. We apply our framework to popular and challenging video datasets: the Weizmann dataset and the UT-Interaction dataset. We demonstrate experimentally that our middle-level representation, combined with a χ²-SVM classifier, matches or outperforms the state-of-the-art results on these datasets.
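The final classification step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each video has already been reduced to a normalized histogram over middle-level component (and pairwise-component) "words", and uses scikit-learn's `chi2_kernel` with a precomputed-kernel SVM; the descriptor dimensionality and data here are synthetic placeholders.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical middle-level descriptors: one normalized histogram per video
# (20 bins here is an arbitrary choice for illustration).
X_train = rng.random((40, 20))
X_train /= X_train.sum(axis=1, keepdims=True)
y_train = rng.integers(0, 2, size=40)  # two activity classes, synthetic labels

X_test = rng.random((10, 20))
X_test /= X_test.sum(axis=1, keepdims=True)

# chi-squared kernel between histograms, passed to the SVM as a
# precomputed Gram matrix.
K_train = chi2_kernel(X_train, X_train)
K_test = chi2_kernel(X_test, X_train)

clf = SVC(kernel="precomputed").fit(K_train, y_train)
pred = clf.predict(K_test)
```

The χ² kernel is a standard choice for histogram descriptors because it weights bin differences relative to bin magnitudes, which suits normalized bag-of-components representations.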
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Yuan, F., Prinet, V., Yuan, J. (2012). Middle-Level Representation for Human Activities Recognition: The Role of Spatio-Temporal Relationships. In: Kutulakos, K.N. (eds) Trends and Topics in Computer Vision. ECCV 2010. Lecture Notes in Computer Science, vol 6553. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35749-7_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35748-0
Online ISBN: 978-3-642-35749-7