Structured Output Prediction and Learning for Deep Monocular 3D Human Pose Estimation

  • Stefan Kinauer
  • Riza Alp Güler
  • Siddhartha Chandra
  • Iasonas Kokkinos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10746)

Abstract

In this work we address the problem of estimating 3D human pose from a single RGB image by blending a feed-forward CNN with a graphical model that couples the 3D positions of parts. The CNN populates a volumetric output space that represents the possible positions of 3D human joints, and also regresses the estimated displacements between pairs of parts. These constitute the ‘unary’ and ‘pairwise’ terms of the energy of a graphical model that resides in a 3D label space and delivers an optimal 3D pose configuration at its output. The CNN is trained on the Human3.6M 3D human pose dataset, and the graphical model is trained jointly with the CNN in an end-to-end manner, allowing us to exploit both the discriminative power of CNNs and the top-down information pertaining to human pose. We (a) introduce memory-efficient methods for obtaining accurate voxel estimates for parts by blending quantization with regression, (b) employ efficient structured prediction algorithms for 3D pose estimation using branch-and-bound, and (c) develop a framework for qualitative and quantitative comparison of competing graphical models. We evaluate our work on the Human3.6M dataset, demonstrating that exploiting the structure of the human pose in 3D yields systematic gains.
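
As a rough illustration of how unary and pairwise terms can combine into a single energy over a 3D label space, the following minimal sketch scores candidate poses on a tiny voxel grid. It rests on several assumptions that the abstract does not state: the pairwise term is taken to be a squared distance between the observed and the predicted displacement, lower energy is taken to be better, and the minimiser is found by exhaustive search rather than the branch-and-bound procedure used in the paper. All names (`pose_energy`, `brute_force_map`, `weight`) are hypothetical.

```python
import itertools
import numpy as np

def pose_energy(pose, unary, offsets, edges, weight=1.0):
    """Energy of one candidate pose (lower is better, by assumption).

    pose    : sequence of voxel index triples, one per part
    unary   : list of 3D arrays, unary[p][x, y, z] = cost of part p at that voxel
    offsets : dict mapping an edge (i, j) to the predicted displacement i -> j, in voxels
    edges   : list of (i, j) pairs of coupled parts
    """
    energy = sum(float(unary[p][tuple(x)]) for p, x in enumerate(pose))
    for (i, j) in edges:
        # Assumed quadratic penalty on deviation from the predicted displacement.
        diff = np.asarray(pose[j]) - np.asarray(pose[i]) - offsets[(i, j)]
        energy += weight * float(diff @ diff)
    return energy

def brute_force_map(unary, offsets, edges, weight=1.0):
    """Exhaustive minimisation; only feasible for toy grids, unlike branch-and-bound."""
    voxels = list(np.ndindex(*unary[0].shape))
    return min(
        itertools.product(voxels, repeat=len(unary)),
        key=lambda pose: pose_energy(pose, unary, offsets, edges, weight),
    )

# Toy example: two parts on a 2x2x2 grid.
grid = (2, 2, 2)
unary = [np.ones(grid), np.zeros(grid)]
unary[0][0, 0, 0] = 0.0                       # part 0 is confidently at the origin
offsets = {(0, 1): np.array([1.0, 0.0, 1.0])}  # part 1 is predicted one voxel away in x and z
edges = [(0, 1)]

best = brute_force_map(unary, offsets, edges)
best_energy = pose_energy(best, unary, offsets, edges)
```

In this toy setup the unary term for part 1 is uninformative, so its position is resolved entirely by the pairwise term. This is the kind of top-down coupling the abstract refers to, although the actual model and its inference procedure may differ substantially from this sketch.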

Notes

Acknowledgements

This work has been funded by the European Horizon 2020 programme under grant agreement no. 643666 (I-Support).

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. CentraleSupélec, INRIA Saclay, Gif-sur-Yvette, France
  2. Facebook AI Research, Paris, France
  3. University College London, London, UK
