Coupled Action Recognition and Pose Estimation from Multiple Views

Yao, Angela; Gall, Juergen; Van Gool, Luc

doi:10.1007/s11263-012-0532-9

Published: 30 May 2012

Volume 100, pages 16–37, (2012)

International Journal of Computer Vision

Angela Yao^1,
Juergen Gall^1,2 &
Luc Van Gool^1,3

Abstract

Action recognition and pose estimation are two closely related topics in understanding human body movements; information from one task can be leveraged to assist the other, yet the two are often treated separately. We present here a framework for coupled action recognition and pose estimation by formulating pose estimation as an optimization over a set of action-specific manifolds. The framework allows for integration of a 2D appearance-based action recognition system as a prior for 3D pose estimation and for refinement of the action labels using relational pose features based on the extracted 3D poses. Our experiments show that our pose estimation system is able to estimate body poses with high degrees of freedom using very few particles and can achieve state-of-the-art results on the HumanEva-II benchmark. We also thoroughly investigate the impact of pose estimation and action recognition accuracy on each other on the challenging TUM kitchen dataset. We demonstrate not only the feasibility of using extracted 3D poses for action recognition, but also improved performance in comparison to action recognition using low-level appearance features.

Notes

We have kept all planes to be defined by joints at t₂, though planes can in theory be defined in space-time by joints at different time points.
We have used tracking results to create the training data since the motion capture data for HumanEva-II is withheld for evaluation purposes. Note that training data from markerless tracking approaches is in general noisier and less accurate than data from marker-based systems.
The original model has 28 joints but we do not consider the gaze since it has 0 DOF. The root joint is represented by the global orientation and position (6 DOF).
“Take object” always occurs between “reach” and “idle/carry” while “release grasp” always occurs before “idle/carry”, after interacting with an object, the drawer or the cupboard.
Note that the worst-case scenario would be if the action recognition is biased and always misclassifies certain actions as others.
This is equivalent to summing the weights of the particles before resampling.
We use a lower depth than the trees trained for 2D appearance-based features since the possible number of unique \(\mathcal{F}_{i}\) for pose-based features is much smaller than that of appearance-based features.
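
The first note above refers to planes defined by joints, and the abstract mentions relational pose features computed from the extracted 3D poses. As a rough illustration only, one common plane-type relational feature is the signed distance of a joint from the plane spanned by three other joints. The sketch below assumes 3D joint positions given as (x, y, z) tuples; the function names, the threshold parameter, and the example coordinates are hypothetical and are not taken from the article.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    """Component-wise difference a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec3, b: Vec3) -> Vec3:
    """Cross product a x b."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def signed_plane_distance(p1: Vec3, p2: Vec3, p3: Vec3, q: Vec3) -> float:
    """Signed distance from joint q to the plane through joints p1, p2, p3.

    The sign follows the normal (p2 - p1) x (p3 - p1).
    """
    normal = _cross(_sub(p2, p1), _sub(p3, p1))
    length = math.sqrt(_dot(normal, normal))
    if length == 0.0:
        raise ValueError("the three plane joints are collinear")
    return _dot(normal, _sub(q, p1)) / length


def plane_feature(p1: Vec3, p2: Vec3, p3: Vec3, q: Vec3,
                  threshold: float = 0.0) -> bool:
    """Hypothetical binary feature: is q farther than `threshold` in front of the plane?"""
    return signed_plane_distance(p1, p2, p3, q) > threshold


# Example with made-up coordinates: the plane is z = 0, the test joint sits at z = 2.
d = signed_plane_distance((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
                          (0.3, 0.2, 2.0))
```

In this sketch all four joints come from the same frame, which matches the note's statement that planes are defined by joints at a single time point; taking the plane joints from a different frame than the test joint would give the space-time variant the note mentions as a possibility.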

Acknowledgements

This work has been supported by funding from the Swiss National Foundation NCCR project IM2 as well as the EC projects IURO, TANGO and RADHAR. Angela Yao was also supported by funding from NSERC Canada.

Author information

Authors and Affiliations

Computer Vision Laboratory, ETH Zurich, Sternwartstrasse 7, 8092, Zurich, Switzerland
Angela Yao, Juergen Gall & Luc Van Gool
Max Planck Institute for Intelligent Systems, Spemannstrasse 41, 72076, Tubingen, Germany
Juergen Gall
Department of Electrical Engineering/IBBT, K.U. Leuven, Kasteelpark Arenberg 10, 3001, Heverlee, Belgium
Luc Van Gool

Corresponding author

Correspondence to Angela Yao.

About this article

Cite this article

Yao, A., Gall, J. & Van Gool, L. Coupled Action Recognition and Pose Estimation from Multiple Views. Int J Comput Vis 100, 16–37 (2012). https://doi.org/10.1007/s11263-012-0532-9

Received: 08 September 2011
Accepted: 27 April 2012
Published: 30 May 2012
Issue Date: October 2012
DOI: https://doi.org/10.1007/s11263-012-0532-9
