
Coupled Action Recognition and Pose Estimation from Multiple Views

Published in: International Journal of Computer Vision

Abstract

Action recognition and pose estimation are two closely related topics in understanding human body movements; information from one task can be leveraged to assist the other, yet the two are often treated separately. We present here a framework for coupled action recognition and pose estimation by formulating pose estimation as an optimization over a set of action-specific manifolds. The framework allows for integration of a 2D appearance-based action recognition system as a prior for 3D pose estimation and for refinement of the action labels using relational pose features based on the extracted 3D poses. Our experiments show that our pose estimation system is able to estimate body poses with high degrees of freedom using very few particles and can achieve state-of-the-art results on the HumanEva-II benchmark. We also thoroughly investigate the impact of pose estimation and action recognition accuracy on each other on the challenging TUM kitchen dataset. We demonstrate not only the feasibility of using extracted 3D poses for action recognition, but also improved performance in comparison to action recognition using low-level appearance features.
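The coupling described in the abstract can be read as an alternation between action-weighted pose estimation and pose-based action refinement. The control flow below is only an illustrative sketch of that reading; the callables (`recognize_2d`, `optimize_pose`, `recognize_3d`) are hypothetical placeholders supplied by the caller, not the authors' implementation.

```python
def coupled_inference(frames, manifolds, recognize_2d, optimize_pose,
                      recognize_3d, n_rounds=2):
    """Alternate between pose estimation and action recognition.

    Sketch of the loop implied by the abstract: 2D appearance-based
    action recognition acts as a prior for 3D pose estimation over the
    action-specific manifolds, and the estimated 3D poses are then
    used to refine the action labels. All callables are assumptions.
    """
    action_dist = recognize_2d(frames)      # prior over action labels
    poses = None
    for _ in range(n_rounds):
        # Pose estimation weighted by the current action distribution.
        poses = optimize_pose(frames, manifolds, action_dist)
        # Refine the action labels from the estimated 3D poses.
        action_dist = recognize_3d(poses)
    return poses, action_dist
```

In this reading, `optimize_pose` would correspond to the paper's particle-based optimization over the set of action-specific manifolds.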



Notes

  1. We define all planes by joints at the same time point t_2, though planes can in theory be defined in space-time by joints at different time points.

  2. We have used tracking results to create the training data since the motion capture data for HumanEva-II is withheld for evaluation purposes. Note that training data from markerless tracking approaches is in general noisier and less accurate than data from marker-based systems.

  3. The original model has 28 joints but we do not consider the gaze since it has 0 DOF. The root joint is represented by the global orientation and position (6 DOF).

  4. “Take object” always occurs between “reach” and “idle/carry”, while “release grasp” always occurs before “idle/carry” and after interacting with an object, the drawer or the cupboard.

  5. Note that the worst-case scenario would be if the action recognition were biased and always misclassified certain actions as others.

  6. This is equivalent to summing the weights of the particles before resampling.

  7. We use a lower tree depth than for the trees trained on 2D appearance-based features, since the number of possible unique \(\mathcal{F}_{i}\) for pose-based features is much smaller than for appearance-based features.
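Footnote 1 concerns planes spanned by triples of joints. One common relational pose feature of this kind is the signed distance of a fourth joint to such a plane; the sketch below is a generic illustration under that assumption, not code from the paper.

```python
def plane_feature(p1, p2, p3, q):
    """Signed distance of joint position q to the plane spanned by
    joint positions p1, p2, p3 (each a 3-tuple of coordinates).

    Illustrative relational pose feature; which joints to use for a
    given action is an assumption of this sketch, not specified here.
    """
    u = [p2[i] - p1[i] for i in range(3)]
    v = [p3[i] - p1[i] for i in range(3)]
    # Plane normal n = u x v (cross product).
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = sum(c * c for c in n) ** 0.5
    w = [q[i] - p1[i] for i in range(3)]
    return sum(n[i] * w[i] for i in range(3)) / norm
```

For instance, one might measure the distance of a hand joint to a plane through the torso joints to separate reaching forward from reaching sideways; the specific joint choices would be design decisions of the feature set.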
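Footnote 6 notes that the quantity in question equals the sum of the particle weights before resampling. As a generic illustration of that step in an importance-sampling particle filter (not the authors' code; names and interfaces are assumptions):

```python
import random

def likelihood_and_resample(particles, unnorm_weights, rng=None):
    """Return the sum of unnormalized particle weights, which is
    proportional to the observation-likelihood estimate mentioned in
    footnote 6, together with a particle set resampled with
    probability proportional to the weights."""
    rng = rng or random.Random(0)
    total = sum(unnorm_weights)                 # likelihood estimate
    probs = [w / total for w in unnorm_weights]
    resampled = rng.choices(particles, weights=probs, k=len(particles))
    return total, resampled
```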


Acknowledgements

This work has been supported by funding from the Swiss National Foundation NCCR project IM2 as well as the EC projects IURO, TANGO and RADHAR. Angela Yao was also supported by funding from NSERC Canada.


Corresponding author

Correspondence to Angela Yao.


Cite this article

Yao, A., Gall, J. & Van Gool, L. Coupled Action Recognition and Pose Estimation from Multiple Views. Int J Comput Vis 100, 16–37 (2012). https://doi.org/10.1007/s11263-012-0532-9

