General Automatic Human Shape and Motion Capture Using Volumetric Contour Cues

  • Helge Rhodin
  • Nadia Robertini
  • Dan Casas
  • Christian Richardt
  • Hans-Peter Seidel
  • Christian Theobalt
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9909)

Abstract
Markerless motion capture algorithms require a 3D body model with properly personalized skeleton dimensions and/or body shape and appearance to successfully track a person. Unfortunately, many tracking methods treat model personalization as a separate problem and rely on manual or semi-automatic model initialization, which greatly reduces applicability. In this paper, we propose a fully automatic algorithm that jointly creates a rigged actor model commonly used for animation – skeleton, volumetric shape, appearance, and optionally a body surface – and estimates the actor’s motion from multi-view video input only. The approach is rigorously designed to work on footage of general outdoor scenes recorded with very few cameras and without background subtraction. Our method uses a new image formation model with analytic visibility and an analytically differentiable alignment energy. For reconstruction, 3D body shape is approximated as a Gaussian density field. For pose and shape estimation, we minimize a new edge-based alignment energy inspired by volume ray casting in an absorbing medium. We further propose a new statistical human body model that represents the body surface, volumetric Gaussian density, and variability in skeleton shape. Given any multi-view sequence, our method jointly optimizes the pose and shape parameters of this model fully automatically in a spatiotemporal way.
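
The idea of ray casting through an absorbing Gaussian density can be illustrated with a small sketch. It is not the authors' implementation: it assumes isotropic Gaussians, each given as a hypothetical (mean, sigma, weight) tuple, and uses the standard closed-form line integral of a Gaussian to obtain the fraction of light that survives along a camera ray. The function names `ray_optical_depth` and `transmittance` are made up for this example.

```python
import math

def ray_optical_depth(origin, direction, mean, sigma, weight, s):
    """Integrate weight * exp(-|x - mean|^2 / (2 sigma^2)) along the ray
    x(t) = origin + t * direction for t in [0, s] (direction is normalized here)."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = [c / norm for c in direction]
    diff = [m - o for m, o in zip(mean, origin)]
    t0 = sum(a * b for a, b in zip(diff, d))            # ray parameter of closest approach
    r2 = max(sum(a * a for a in diff) - t0 * t0, 0.0)   # squared distance of ray to centre
    scale = weight * math.exp(-r2 / (2.0 * sigma ** 2)) * sigma * math.sqrt(math.pi / 2.0)
    k = sigma * math.sqrt(2.0)
    return scale * (math.erf((s - t0) / k) - math.erf(-t0 / k))

def transmittance(origin, direction, blobs, s):
    """Fraction of light that survives travelling distance s through all Gaussians
    (Beer-Lambert attenuation in a purely absorbing medium)."""
    depth = sum(ray_optical_depth(origin, direction, mu, sigma, w, s)
                for mu, sigma, w in blobs)
    return math.exp(-depth)

# Two illustrative Gaussians in front of a camera looking along +z.
blobs = [((0.0, 0.0, 2.0), 0.3, 5.0), ((0.1, 0.0, 3.0), 0.2, 2.0)]
for s in (1.0, 2.0, 5.0):
    print(s, transmittance((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), blobs, s))
```

Because the line integral is expressed through `erf`, the transmittance is smooth in the Gaussian parameters, which is what makes an analytic visibility term convenient for gradient-based optimization.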


Keywords: Motion Capture · Body Model · Bone Length · Mesh Vertex · Joint Location



We thank PerceptiveCode, in particular Arjun Jain and Jonathan Tompson, for providing and installing the ConvNet detector, Ahmed Elhayek, Jürgen Gall, Peng Guan, Hansung Kim, Armin Mustafa and Leonid Sigal for providing their data and test sequences, The Foundry for license support, and all our actors. This research was funded by the ERC Starting Grant project CapReal (335545).

Supplementary material

Supplementary material 1 (PDF, 3232 KB)


Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Helge Rhodin (1)
  • Nadia Robertini (1, 2)
  • Dan Casas (1)
  • Christian Richardt (1, 2)
  • Hans-Peter Seidel (1)
  • Christian Theobalt (1)

  1. MPI Informatik, Saarbrücken, Germany
  2. Intel Visual Computing Institute, Saarbrücken, Germany
