Abstract
Action recognition and pose estimation are two closely related topics in understanding human body movements; information from one task can be leveraged to assist the other, yet the two are often treated separately. We present here a framework for coupled action recognition and pose estimation by formulating pose estimation as an optimization over a set of action-specific manifolds. The framework allows for integration of a 2D appearance-based action recognition system as a prior for 3D pose estimation and for refinement of the action labels using relational pose features based on the extracted 3D poses. Our experiments show that our pose estimation system is able to estimate body poses with high degrees of freedom using very few particles and can achieve state-of-the-art results on the HumanEva-II benchmark. We also thoroughly investigate the impact of pose estimation and action recognition accuracy on each other on the challenging TUM kitchen dataset. We demonstrate not only the feasibility of using extracted 3D poses for action recognition, but also improved performance in comparison to action recognition using low-level appearance features.
Similar content being viewed by others
Notes
We have kept all planes to be defined by joints at t 2, though planes can in theory be defined in space-time by joints at different time points.
We have used tracking results to create the training data since the motion capture data for HumanEva II is withheld for evaluation purposes. Note that training data from markerless tracking approaches is in general noisier and less accurate than data from marker-based systems.
The original model has 28 joints but we do not consider the gaze since it has 0 DOF. The root joint is represented by the global orientation and position (6 DOF).
“Take object” always occurs between “reach” and “idle/carry” while “release grasp” always occurs before “idle/carry”, after interacting with an object, the drawer or the cupboard.
Note that the worst-case scenario would be if the action recognition is biased and always misclassified certain actions as others.
This is equivalent to summing the weights of the particles before resampling.
We use a lower depth than the trees trained for 2D appearance-based features since the possible number of unique \(\mathcal{F}_{i}\) for pose-based features is much smaller than that of appearance-based features.
References
Agarwal, A., & Triggs, B. (2006). Recovering 3d human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1), 44–58.
Aggarwal, J., & Ryoo, M. (2010). Human activity analysis: a review. ACM Computing Surveys.
Ali, S., Basharat, A., & Shah, M. (2007). Chaotic invariants for human action recognition. In Proceedings international conference on computer vision.
Andriluka, M., Roth, S., & Schiele, B. (2010). Monocular 3d pose estimation and tracking by detection. In Proceedings IEEE conference on computer vision and pattern recognition.
Baak, A., Rosenhahn, B., Mueller, M., & Seidel, H. P. (2009). Stabilizing motion tracking using retrieved motion priors. In Proceedings international conference on computer vision.
Baumberg, A., & Hogg, D. (1994). An efficient method for contour tracking using active shape models. In Proceeding of the workshop on motion of nonrigid and articulated objects. Los Alamitos: IEEE Computer Society.
Belkin, M., & Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In Neural information processing systems.
Bergtholdt, M., Kappes, J., Schmidt, S., & Schnörr, C. (2010). A study of parts-based object class detection using complete graphs. International Journal of Computer Vision, 87, 93–117.
Blank, M., Gorelick, L., Shechtman, E., Irani, M., & Basri, R. (2005). Actions as space-time shapes. In Proceedings international conference on computer vision.
Bo, L., & Sminchisescu, C. (2010). Twin Gaussian processes for structured prediction. International Journal of Computer Vision, 87, 28–52.
Bobick, A., & Davis, J. (2001). The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3), 257–267.
Brox, T., Bruhn, A., Papenberg, N., & Weickert, J. (2004). High accuracy optical flow estimation based on a theory for warping. In Proceedings European conference on computer vision.
Brubaker, M., Fleet, D., & Hertzmann, A. (2010). Physics-based person tracking using the anthropomorphic walker. International Journal of Computer Vision, 87, 140–155.
Campbell, L., & Bobick, A. (1995). Recognition of human body motion using phase space constraints. In Proceedings international conference on computer vision.
Chen, J., Kim, M., Wang, Y., & Ji, Q. (2009). Switching Gaussian process dynamic models for simultaneous composite motion tracking and recognition. In Proceedings IEEE conference on computer vision and pattern recognition.
Corazza, S., Mündermann, L., Gambaretto, E., Ferrigno, G., & Andriacchi, T. (2010). Markerless motion capture through visual hull, articulated icp and subject specific model generation. International Journal of Computer Vision, 87, 156–169.
Darby, J., Li, B., & Costen, N. (2010). Tracking human pose with multiple activity models. Pattern Recognition, 43, 3042–3058.
Del Moral, P. (2004). Feynman-Kac formulae. Genealogical and interacting particle systems with applications. New York: Springer.
Deutscher, J., & Reid, I. (2005). Articulated body motion capture by stochastic search. International Journal of Computer Vision, 61, 2.
Dollar, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. In IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance (VS-PETS).
Efros, A., Berg, A., Mori, G., & Malik, J. (2003). Recognizing action at a distance. In Proceedings international conference on computer vision.
Elgammal, A., & Lee, C. S. (2004). Inferring 3d body pose from silhouettes using activity manifold learning. In Proceedings IEEE conference on computer vision and pattern recognition.
Forsyth, D., Arikan, O., Ikemoto, L., O’Brien, J., & Ramanan, D. (2006). Computational studies of human motion: Part 1, tracking and motion synthesis. Foundations and Trends in Computer Graphics and Vision, 1.
Gall, J., Rosenhahn, B., & Seidel, H. P. (2008a). Drift-free tracking of rigid and articulated objects. In Proceedings IEEE conference on computer vision and pattern recognition.
Gall, J., Rosenhahn, B., & Seidel, H. P. (2008b). An introduction to interacting simulated annealing. In Human motion: understanding, modelling, capture and animation (pp. 319–343). Berlin: Springer.
Gall, J., Stoll, C., de Aguiar, E., Theobalt, C., Rosenhahn, B., & Seidel, H. P. (2009). Motion capture using joint skeleton tracking and surface estimation. In Proceedings IEEE conference on computer vision and pattern recognition (pp. 1746–1753).
Gall, J., Rosenhahn, B., Brox, T., & Seidel, H. P. (2010a). Optimization and filtering for human motion capture—a multi-layer framework. International Journal of Computer Vision, 87, 75–92.
Gall, J., Yao, A., & Van Gool, L. (2010b). 2d action recognition serves 3d human pose estimation. In Proceedings European conference on computer vision.
Gall, J., Yao, A., Razavi, N., Van Gool, L., & Lempitsky, V. (2011). Hough forests for object detection, tracking, and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Gavrila, D., & Davis, L. (1995). Towards 3-d model-based tracking and recognition of human movement: a multi-view approach. In International workshop on face and gesture recognition.
Geiger, A., Urtasun, R., & Darrell, T. (2009). Rank priors for continuous non-linear dimensionality reduction. In Proceedings IEEE conference on computer vision and pattern recognition.
Hou, S., Galata, A., Caillette, F., Thacker, N., & Bromiley, P. (2007). Real-time body tracking using a Gaussian process latent variable model. In Proceedings international conference on computer vision.
Husz, Z. L., Wallace, A. M., & Green, P. R. (2011) Behavioural analysis with movement cluster model for concurrent actions. EURASIP Journal on Image and Video Processing.
Jaeggli, T., Koller-Meier, E., & Van Gool, L. (2009). Learning generative models for multi-activity body pose estimation. International Journal of Computer Vision, 83(2), 121–134.
Jenkins, O. C., Serrano, G. G., & Loper, M. M. (2007). Interactive human pose and action recognition using dynamical motion primitives. International Journal of Humanoid Robotics, 4(2), 365–385.
Jhuang, H., Serre, T., Wolf, L., & Poggio, T. (2007). A biologically inspired system for action recognition. In Proceedings international conference on computer vision.
Kittler, J., Hatef, M., Duin, R., & Matas, J. (1998). On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 226–239.
Kläser, A., Marszałek, M., Schmid, C., & Zisserman, A. (2010). Human focused action localization in video. In International workshop on sign, gesture, and activity.
Kovar, L., & Gleicher, M. (2004). Automated extraction and parameterization of motions in large data sets. ACM Transactions on Graphics, 23, 559–568.
Laptev, I., & Lindeberg, T. (2003). Space-time interest points. In Proceedings international conference on computer vision.
Laptev, I., Marszałek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In Proceedings IEEE conference on computer vision and pattern recognition.
Lawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6, 1783–1816.
Lee, C., & Elgammal, A. (2010). Coupled visual and kinematic manifold models for tracking. International Journal of Computer Vision, 87, 118–139.
Li, R., Tian, T., & Sclaroff, S. (2007). Simultaneous learning of non-linear manifold and dynamical models for high-dimensional time series. In Proceedings international conference on computer vision.
Li, R., Tian, T., Sclaroff, S., & Yang, M. (2010). 3d human motion tracking with a coordinated mixture of factor analyzers. International Journal of Computer Vision, 87, 170–190.
Lin, R., Liu, C., Yang, M., Ahja, N., & Levinson, S. (2006). Learning nonlinear manifolds from time series. In Proceedings European conference on computer vision.
Liu, J., Luo, J., & Shah, M. (2009). Recognizing realistic actions from videos ‘in the wild’. In Proceedings IEEE conference on computer vision and pattern recognition.
Lv, F., & Nevatia, R. (2007). Single view human action recognition using key pose matching and Viterbi path searching. In Proceedings IEEE conference on computer vision and pattern recognition.
Maji, S., Bourdev, L., & Malik, J. (2011). Action recognition from a distributed representation of pose and appearance. In Proceedings IEEE conference on computer vision and pattern recognition.
Mitra, S., & Acharya, T. (2007). Gesture recognition: a survey. IEEE Transactions on Systems, Man and Cybernetics - Part C, 37(3), 311–324.
Moeslund, T., Hilton, A., & Krüger, V. (2006). A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104(2), 90–126.
Moon, K., & Pavlovic, V. (2006). Impact of dynamics on subspace embedding and tracking of sequences. In Proceedings IEEE conference on computer vision and pattern recognition (pp. 198–205).
Müller, M., Röder, T., & Clausen, M. (2005). Efficient content-based retrieval of motion capture data. ACM Transactions on Graphics, 24, 677–685.
Natarajan, P., Singh, V., & Nevatia, R. (2010). Learning 3d action models from a few 2d videos for view invariant action recognition. In Proceedings IEEE conference on computer vision and pattern recognition.
Pavlovic, V., Rehg, J., & Maccormick, J. (2000). Learning switching linear models of human motion. In Neural information processing systems (pp. 981–987).
Peursum, P., Venkatesh, S., & West, G. (2010). A study on smoothing for particle-filtered 3d human body tracking. International Journal of Computer Vision, 87, 53–74.
Poppe, R. (2010). A survey on vision-based human action recognition. Image and Vision Computing.
Rao, C., Yilmaz, A., & Shah, M. (2002). View-invariant representation and recognition of actions. International Journal of Computer Vision, 50(2), 203–226.
Raskin, L., Rudzsky, M., & Rivlin, E. (2011). Dimensionality reduction using a Gaussian process annealed particle filter for tracking and classification of articulated body motions. Computer Vision and Image Understanding, 115(4), 503–519.
Rasmussen, C., & Williams, C. (2006). Gaussian processes for machine learning. Cambridge: MIT Press.
Rodriguez, M., Ahmed, J., & Shah, M. (2008). Action Mach: a spatio-temporal maximum average correlation height filter for action recognition. In Proceedings IEEE conference on computer vision and pattern recognition.
Rosales, R., & Sclaroff, S. (2001). Learning body pose via specialized maps. In Neural information processing systems.
Rosenhahn, B., Brox, T., & Seidel, H. P. (2007). Scaled motion dynamics for markerless motion capture. In Proceedings IEEE conference on computer vision and pattern recognition.
Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally Linear embedding. Science, 290(5500), 2323–2326.
Schindler, K., & Van Gool, L. (2008). Action snippets: how many frames does human action recognition require. In Proceedings IEEE conference on computer vision and pattern recognition.
Schmaltz, C., Rosenhahn, B., Brox, T., & Weickert, J. (2011). Region-based pose tracking with occlusions using 3d models. In Machine vision and applications (pp. 1–21).
Schuldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: a local svm approach. In Proceedings international conference on pattern recognition.
Shaheen, M., Gall, J., Strzodka, R., Van Gool, L., & Seidel, H. P. (2009). A comparison of 3d model-based tracking approaches for human motion capture in uncontrolled environments. In IEEE workshop on applications of computer vision.
Sidenbladh, H., Black, M., & Fleet, D. (2000). Stochastic tracking of 3d human figures using 2d image motion. In Proceedings European conference on computer vision.
Sidenbladh, H., Black, M., & Sigal, L. (2002). Implicit probabilistic models of human motion for synthesis and tracking. In Proceedings European conference on computer vision (pp. 784–800).
Sigal, L., Balan, A., & Black, M. (2010). Humaneva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1–2), 4–27.
Sminchisescu, C., & Jepson, A. (2004). Generative modeling for continuous non-linearly embedded visual inference. In Proceedings international conference on machine learning.
Sminchisescu, C., Kanaujia, A., & Metaxas, D. (2007). Bm3e: discriminative density propagation for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11), 2030–2044.
Taycher, L., Demirdjian, D., Darrell, T., & Shakhnarovich, G. (2006). Conditional random people: tracking humans with crfs and grid filters. In Proceedings IEEE conference on computer vision and pattern recognition (pp. 222–229).
Taylor, G., Sigal, L., Fleet, D., & Hinton, G. (2010). Dynamical binary latent variable models for 3d human pose tracking. In Proceedings IEEE conference on computer vision and pattern recognition.
Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Chicago: Science.
Tenorth, M., Bandouch, J., & Beetz, M. (2009). The TUM kitchen data set of everyday manipulation activities for motion tracking and action recognition. In IEEE workshop on tracking humans for the evaluation of their motion in image sequences.
Thurau, C., & Hlavac, V. (2008). Pose primitive based human action recognition in videos or still images. In Proceedings IEEE conference on computer vision and pattern recognition.
Ukita, N., Hirai, M., & Kidode, M. (2009). Complex volume and pose tracking with probabilistic dynamical model and visual hull constraint. In Proceedings international conference on computer vision.
Urtasun, R., Fleet, D., & Fua, P. (2006). 3d people tracking with Gaussian process dynamical models. In Proceedings IEEE conference on computer vision and pattern recognition.
Urtasun, R., Fleet, D., Hertzman, A., & Fua, P. (2005). Priors for people tracking from small training sets. In Proceedings international conference on computer vision.
Wang, J., Fleet, D., & Hertzmann, A. (2008). Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2), 283–298.
Weinland, D., & Boyer, E. (2008). Action recognition using exemplar-based embedding. In Proceedings IEEE conference on computer vision and pattern recognition.
Weinland, D., Boyer, E., & Ronfard, R. (2007). Action recognition from arbitrary views using 3d exemplars. In Proceedings international conference on computer vision.
Willems, G., Becker, J., Tuytelaars, T., & Van Gool, L. (2009). Exemplar-based action recognition in video. In Proceedings British machine vision conference.
Yacoob, Y., & Black, M. (1999). Parameterized modeling and recognition of activities. Computer Vision and Image Understanding, 73(2), 232–247.
Yang, W., Wang, Y., & Mori, G. (2010). Recognizing human actions from still images with latent poses. In Proceedings IEEE conference on computer vision and pattern recognition.
Yao, A., Gall, J., & Van Gool, L. (2010). A hough transform-based voting framework for action recognition. In Proceedings IEEE conference on computer vision and pattern recognition.
Yao, A., Gall, J., Fanelli, G., & Van Gool, L. (2011). Does human action recognition benefit from pose estimation. In Proceedings British machine vision conference.
Yilmaz, A., & Shah, M. (2005). Recognizing human actions in videos acquired by uncalibrated moving cameras. In Proceedings international conference on computer vision.
Acknowledgements
This work has been supported by funding from the Swiss National Foundation NCCR project IM2 as well as the EC projects IURO, TANGO and RADHAR. Angela Yao was also supported by funding from NSERC Canada.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Yao, A., Gall, J. & Van Gool, L. Coupled Action Recognition and Pose Estimation from Multiple Views. Int J Comput Vis 100, 16–37 (2012). https://doi.org/10.1007/s11263-012-0532-9
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11263-012-0532-9