3D Ego-Pose Estimation via Imitation Learning

  • Ye Yuan
  • Kris Kitani
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11220)


Ego-pose estimation, i.e., estimating a person’s 3D pose with a single wearable camera, has many potential applications in activity monitoring. For these applications, estimates that are both accurate and physically plausible are desired, yet existing work often overlooks physical plausibility. Traditional computer vision-based approaches that use temporal smoothing take into account only the kinematics of the motion, not the physics that underlies its dynamics, which leads to pose estimates that are physically invalid. Motivated by this, we propose a novel control-based approach that models human motion with physics simulation and uses imitation learning to learn a video-conditioned control policy for ego-pose estimation. Our imitation learning framework allows us to perform domain adaptation to transfer our policy trained on simulation data to real-world data. Our experiments with real egocentric videos show that our method can estimate both accurate and physically plausible 3D ego-pose sequences without observing the camera wearer’s body.


First-person vision, Pose estimation, Imitation learning



This work was sponsored in part by JST CREST (JPMJCR14E1) and IARPA (D17PC00340).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Carnegie Mellon University, Pittsburgh, USA
