Abstract
We tackle the challenging problem of human activity recognition in realistic video sequences. Unlike local-feature-based or global-template-based methods, we propose to represent a video sequence by a set of middle-level parts. A part, or component, has consistent spatial structure and consistent motion. We first segment the visual motion patterns and generate a set of middle-level components by clustering keypoint-based trajectories extracted from the video. To further exploit the interdependencies of the moving parts, we then define spatio-temporal relationships between pairwise components. The resulting descriptive middle-level components and pairwise components thereby capture the essential motion characteristics of human activities. They also give a very compact representation of the video. We apply our framework to popular and challenging video datasets: the Weizmann dataset and the UT-Interaction dataset. We demonstrate experimentally that our middle-level representation, combined with a χ²-SVM classifier, matches or outperforms the state-of-the-art results on these datasets.
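The abstract mentions classification with a χ²-SVM, i.e. an SVM whose kernel compares histogram-like video representations with the exponential chi-squared distance. The sketch below is illustrative only and is not the authors' code: the toy histograms, the variable names, and the `gamma` value are assumptions; only the kernel formula itself is standard.

```python
import math

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel: exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)).

    x and y are non-negative histogram representations of two videos
    (e.g., normalized counts of middle-level component types).
    """
    d = 0.0
    for xi, yi in zip(x, y):
        if xi + yi > 0:  # skip empty bins to avoid division by zero
            d += (xi - yi) ** 2 / (xi + yi)
    return math.exp(-gamma * d)

# Hypothetical component histograms for two videos.
h_wave = [0.5, 0.3, 0.2, 0.0]
h_walk = [0.1, 0.1, 0.3, 0.5]

print(chi2_kernel(h_wave, h_wave))  # identical histograms -> 1.0
print(chi2_kernel(h_wave, h_walk))  # dissimilar histograms -> a value in (0, 1)
```

In practice such a kernel would be precomputed over all training pairs and passed to an SVM solver that accepts custom kernel matrices.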
© 2012 Springer-Verlag Berlin Heidelberg
Yuan, F., Prinet, V., Yuan, J. (2012). Middle-Level Representation for Human Activities Recognition: The Role of Spatio-Temporal Relationships. In: Kutulakos, K.N. (eds) Trends and Topics in Computer Vision. ECCV 2010. Lecture Notes in Computer Science, vol 6553. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35749-7_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35748-0
Online ISBN: 978-3-642-35749-7