2D Human Pose Estimation in TV Shows

  • Vittorio Ferrari
  • Manuel Marín-Jiménez
  • Andrew Zisserman
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5604)


The goal of this work is fully automatic 2D human pose estimation in unconstrained TV shows and feature films. Direct pose estimation on such uncontrolled material is often too difficult, especially when nothing is known about the location, scale, pose, and appearance of the person, or even whether a person is present in the frame at all.

We propose an approach that progressively reduces the search space for body parts, which greatly facilitates the task of the pose estimator. Moreover, when video is available, we propose methods that exploit the temporal continuity of both appearance and pose to improve on estimates obtained from individual frames.

The method is fully automatic and self-initializing, and explains the spatio-temporal volume covered by a person moving in a shot by soft-labeling every pixel as belonging to a particular body part or to the background. We demonstrate upper-body pose estimation by running our system on four episodes of the TV series Buffy the Vampire Slayer (i.e. three hours of video). Our approach is evaluated quantitatively on several hundred video frames, based on ground-truth annotation of 2D poses. Finally, we present an application to full-body action recognition on the Weizmann dataset.
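
As an illustration of what soft-labeling a pixel can mean, the sketch below turns raw per-pixel scores for each body part and for the background into values that sum to 1 at every pixel. This is only a minimal sketch under stated assumptions: the softmax normalization, the function name `soft_label`, the label names, and the input scores are all hypothetical and are not taken from the paper, which does not describe its labeling procedure in this abstract.

```python
import math


def soft_label(scores):
    """Normalize per-pixel scores into a soft labeling (hypothetical helper).

    scores: dict mapping a label name (a body part or "background") to a
    2D list of floats; all maps must have the same shape.
    Returns a dict of the same shape whose values sum to 1 across labels
    at every pixel.
    """
    labels = list(scores)
    height = len(scores[labels[0]])
    width = len(scores[labels[0]][0])
    out = {name: [[0.0] * width for _ in range(height)] for name in labels}
    for y in range(height):
        for x in range(width):
            # Subtract the largest score before exponentiating, for stability.
            peak = max(scores[name][y][x] for name in labels)
            weights = {
                name: math.exp(scores[name][y][x] - peak) for name in labels
            }
            total = sum(weights.values())
            for name in labels:
                out[name][y][x] = weights[name] / total
    return out


# Made-up scores for a 1x2 image and three labels (assumption, not real data).
example = {
    "head": [[2.0, 0.0]],
    "torso": [[0.0, 0.0]],
    "background": [[0.0, 3.0]],
}
soft = soft_label(example)
```

In this toy input the first pixel leans toward "head" and the second toward "background", while every label keeps a non-zero share, which is the difference between a soft and a hard labeling.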


Keywords (added by machine, not by the authors): Body Part, Action Recognition, Appearance Model, Part Position, Human Action Recognition





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Vittorio Ferrari, ETH Zurich, Switzerland
  • Manuel Marín-Jiménez, University of Granada, Spain
  • Andrew Zisserman, University of Oxford, UK
