Towards Learning a Realistic Rendering of Human Behavior

  • Patrick Esser
  • Johannes Haux
  • Timo Milbich
  • Björn Ommer
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11130)


Realistic rendering of human behavior is of great interest for applications such as video animation, virtual reality, and gaming engines. Commonly, animations of persons performing actions are rendered by articulating explicit 3D models based on sequences of coarse body-shape representations that simulate a certain behavior. While the simulation of natural behavior can be learned efficiently, the corresponding 3D models are typically designed in manual, laborious processes or reconstructed from costly (multi-)sensor data. In this work, we present an approach towards a holistic learning framework for rendering human behavior in which all components are learned from easily available data. To enable control over the generated behavior, we utilize motion capture data and generate realistic motions based on user inputs. Alternatively, we can directly copy behavior from videos and learn a rendering of characters using RGB camera data only. Our experiments show that we can further improve data efficiency by training on multiple characters at the same time. Overall, our approach shows a new path towards easily available, personalized avatar creation.
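As background for how a rendering of characters can be conditioned on behavior: pose-guided generation pipelines of this kind commonly represent a body pose as per-joint Gaussian heatmaps that are fed to the generator alongside an appearance encoding. The abstract does not specify the paper's exact encoding, so the sketch below is a minimal, illustrative version of that common representation; the function name and the `sigma` parameter are assumptions, not taken from the paper.

```python
import numpy as np

def pose_to_heatmaps(keypoints, height, width, sigma=3.0):
    """Convert 2D joint coordinates into per-joint Gaussian heatmaps.

    keypoints: array of shape (J, 2) holding (x, y) pixel coordinates.
    Returns an array of shape (J, height, width) with values in [0, 1],
    peaking at 1.0 at each joint location.
    """
    ys = np.arange(height)[:, None]  # column vector of row indices
    xs = np.arange(width)[None, :]   # row vector of column indices
    maps = np.empty((len(keypoints), height, width))
    for j, (x, y) in enumerate(keypoints):
        # isotropic Gaussian centered on the joint, via broadcasting
        maps[j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps

# Example: two joints on a 64x64 canvas
hm = pose_to_heatmaps(np.array([[16.0, 16.0], [48.0, 40.0]]), 64, 64)
```

Such heatmap stacks (one channel per joint) are what a U-Net-style generator typically receives as the pose input, making the rendering controllable by whatever produces the keypoints, whether a motion model driven by user inputs or a pose estimator run on a source video.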


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Patrick Esser (1)
  • Johannes Haux (1)
  • Timo Milbich (1)
  • Björn Ommer (1)

  1. HCI, IWR, Heidelberg University, Heidelberg, Germany
