Exploring the Spatial Hierarchy of Mixture Models for Human Pose Estimation

  • Yuandong Tian
  • C. Lawrence Zitnick
  • Srinivasa G. Narasimhan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7576)


Human pose estimation requires a versatile yet well-constrained spatial model for grouping locally ambiguous parts together to produce a globally consistent hypothesis. Previous works either use local deformable models deviating from a certain template, or use a global mixture representation in the pose space. In this paper, we propose a new hierarchical spatial model that can capture an exponential number of poses with a compact mixture representation on each part. Using latent nodes, it can represent high-order spatial relationships among parts with exact inference. Unlike recent hierarchical models that associate each latent node with a mixture of appearance templates (like HoG), we use the hierarchical structure as a pure spatial prior, avoiding the large and often confounding appearance space. We verify the effectiveness of this model in three ways. First, samples representing human-like poses can be drawn from our model, showing its ability to capture high-order dependencies of parts. Second, our model reconstructs unseen poses more accurately than a nearest-neighbor pose representation. Finally, our model achieves state-of-the-art performance on three challenging datasets, and substantially outperforms recent hierarchical models.
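As a rough illustration of why a tree of nodes with per-node mixture types can cover exponentially many joint configurations while still admitting exact inference, the sketch below runs generic max-sum dynamic programming on a small tree. It is not the model proposed in the paper: the tree shape, the number of types, the scores, and the names `map_on_tree` and `brute_force` are all assumptions made up for this example.

```python
from itertools import product


def map_on_tree(children, unary, pairwise, root=0):
    """Exact MAP by max-sum dynamic programming on a tree (generic sketch).

    children[i]            : list of child node ids of node i
    unary[i][k]            : score of node i taking mixture type k
    pairwise[(i, j)][a][b] : score of parent i in type a with child j in type b
    """
    best_child_type = {}  # (parent, child, parent_type) -> best child type

    def upward(i):
        # message[a] = best score of the subtree rooted at i when i has type a
        message = list(unary[i])
        for j in children[i]:
            child_msg = upward(j)
            for a in range(len(message)):
                scores = [pairwise[(i, j)][a][b] + child_msg[b]
                          for b in range(len(child_msg))]
                b_star = max(range(len(scores)), key=scores.__getitem__)
                best_child_type[(i, j, a)] = b_star
                message[a] += scores[b_star]
        return message

    root_msg = upward(root)
    # Backtrack from the root to recover the best type of every node.
    assignment = {root: max(range(len(root_msg)), key=root_msg.__getitem__)}
    stack = [root]
    while stack:
        i = stack.pop()
        for j in children[i]:
            assignment[j] = best_child_type[(i, j, assignment[i])]
            stack.append(j)
    return max(root_msg), assignment


def brute_force(children, unary, pairwise):
    """Enumerate every joint type assignment; only feasible for tiny trees."""
    n = len(unary)
    best_score = None
    for types in product(*(range(len(u)) for u in unary)):
        s = sum(unary[i][types[i]] for i in range(n))
        s += sum(pairwise[(i, j)][types[i]][types[j]]
                 for i in range(n) for j in children[i])
        if best_score is None or s > best_score:
            best_score = s
    return best_score


# Hypothetical toy tree: nodes 0 and 1 play the role of latent nodes
# (zero unary score), nodes 2, 3, 4 are leaves with made-up scores.
K = 3
children = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
unary = [[0, 0, 0], [0, 0, 0], [1, 4, 2], [3, 0, 1], [2, 2, 5]]
pairwise = {
    (i, j): [[((2 * a + 3 * b + i + j) % 5) - 2 for b in range(K)]
             for a in range(K)]
    for i in children for j in children[i]
}

score, types = map_on_tree(children, unary, pairwise)
print("joint configurations:", K ** len(unary))
print("best score:", score, "types:", types)
```

With five nodes and three types per node there are 3^5 = 243 joint configurations, yet the dynamic program only touches each edge's 3 x 3 score table once. The `brute_force` helper is included purely as a sanity check for the toy case.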


Keywords: Mixture Model, Leaf Node, Hierarchical Model, Training Image, Hide Node
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Yuandong Tian (1)
  • C. Lawrence Zitnick (2)
  • Srinivasa G. Narasimhan (1)

  1. Carnegie Mellon University, Pittsburgh, USA
  2. Microsoft Research, Redmond, USA
