Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation

  • Helge Rhodin
  • Mathieu Salzmann
  • Pascal Fua
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11214)

Abstract

Modern 3D human pose estimation techniques rely on deep networks, which require large amounts of training data. Weakly-supervised methods reduce this annotation burden by exploiting 2D poses or unannotated multi-view imagery, but they still need a sufficiently large set of samples with 3D annotations for learning to succeed.

In this paper, we propose to overcome this problem by learning a geometry-aware body representation from multi-view images without annotations. To this end, we use an encoder-decoder that predicts an image from one viewpoint given an image from another viewpoint. Because this representation encodes 3D geometry, using it in a semi-supervised setting makes it easier to learn a mapping from it to 3D human pose. As evidenced by our experiments, our approach significantly outperforms fully-supervised methods given the same amount of labeled data, and improves over other semi-supervised methods while using as little as 1% of the labeled data.
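The core idea above is that an encoder maps an image to a latent code with explicit 3D structure, the code is transformed by the relative camera rotation between the two viewpoints, and a decoder renders the target view. The paper's networks are not reproduced here; the following is a minimal NumPy sketch of that pipeline, with toy linear maps (`W_enc`, `W_dec`) standing in for the deep encoder and decoder, and the latent represented as K hypothetical 3D points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a flattened image of D pixels, a latent code of K 3D points.
D, K = 64, 8

# Hypothetical linear encoder/decoder weights (stand-ins for deep networks).
W_enc = rng.standard_normal((3 * K, D)) * 0.1
W_dec = rng.standard_normal((D, 3 * K)) * 0.1

def encode(image):
    """Map an image (D,) to K latent 3D points, shape (K, 3)."""
    return (W_enc @ image).reshape(K, 3)

def decode(points):
    """Map K latent 3D points back to an image, shape (D,)."""
    return W_dec @ points.reshape(3 * K)

def rotate(points, R):
    """Apply the relative camera rotation to the latent geometry."""
    return points @ R.T

# Relative rotation between input and target views (90 degrees about z).
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])

image_view_a = rng.standard_normal(D)
latent = encode(image_view_a)                 # geometry-aware code
predicted_view_b = decode(rotate(latent, R))  # novel-view prediction
```

Training would then minimize the reconstruction error between `predicted_view_b` and the real image from the second camera, so that 3D structure is learned without any pose annotations; the rotation acts directly on the latent points, which is what makes the representation geometry-aware.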

Keywords

3D reconstruction · Semi-supervised training · Representation learning · Monocular human pose reconstruction

Notes

Acknowledgment

This work was supported in part by a Microsoft Joint Research Project.


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. CVLab, EPFL, Lausanne, Switzerland