AvatarCap: Animatable Avatar Conditioned Monocular Human Volumetric Capture

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13661))

Abstract

To address the ill-posed problem caused by partial observations in monocular human volumetric capture, we present AvatarCap, a novel framework that introduces animatable avatars into the capture pipeline for high-fidelity reconstruction in both visible and invisible regions. Our method first creates an animatable avatar for the subject from a small number (\(\sim\)20) of 3D scans as a prior. Then, given a monocular RGB video of this subject, it integrates information from both the image observation and the avatar prior, and accordingly reconstructs high-fidelity 3D textured models with dynamic details regardless of visibility. To learn an effective avatar for volumetric capture from only a few samples, we propose GeoTexAvatar, which leverages both geometry and texture supervision to constrain the pose-dependent dynamics in a decomposed implicit manner. We further propose an avatar-conditioned volumetric capture method, involving a canonical normal fusion and a reconstruction network, to integrate both image observations and avatar dynamics for high-fidelity reconstruction in both observed and invisible regions. Overall, our method enables monocular human volumetric capture with detailed and pose-dependent dynamics, and the experiments show that our method outperforms the state of the art.
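
The abstract does not spell out how image observations and the avatar prior are combined, so the following is only a toy sketch of the general idea, not the paper's canonical normal fusion. It assumes per-pixel surface normals from two sources and a per-pixel visibility weight in [0, 1]; the function names `normalize` and `fuse_normals` and the linear blending rule are hypothetical choices made for illustration.

```python
import math


def normalize(v):
    """Return v scaled to unit length; zero vectors are returned unchanged."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v) if n > 0 else tuple(v)


def fuse_normals(image_normals, avatar_normals, visibility):
    """Blend two lists of 3D normals using a visibility weight per pixel.

    Hypothetical illustration (an assumption, not the paper's method):
    weight 1.0 keeps the image-observed normal, weight 0.0 falls back
    to the avatar-prior normal, and values in between blend linearly
    before renormalizing to unit length.
    """
    if not (len(image_normals) == len(avatar_normals) == len(visibility)):
        raise ValueError("inputs must have the same length")
    fused = []
    for n_img, n_avt, w in zip(image_normals, avatar_normals, visibility):
        blended = tuple(w * a + (1.0 - w) * b for a, b in zip(n_img, n_avt))
        fused.append(normalize(blended))
    return fused


if __name__ == "__main__":
    observed = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
    prior = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    # First pixel is fully visible, second is fully occluded.
    print(fuse_normals(observed, prior, [1.0, 0.0]))
```

In this sketch a visible pixel takes its normal from the image, while an occluded pixel relies entirely on the avatar prior, which mirrors the abstract's goal of reconstructing both observed and invisible regions.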

Acknowledgement

This paper is supported by the National Key R&D Program of China (2021ZD0113501) and the NSFC project No. 62125107.

Author information

Corresponding author

Correspondence to Zhe Li.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Z., Zheng, Z., Zhang, H., Ji, C., Liu, Y. (2022). AvatarCap: Animatable Avatar Conditioned Monocular Human Volumetric Capture. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13661. Springer, Cham. https://doi.org/10.1007/978-3-031-19769-7_19

  • DOI: https://doi.org/10.1007/978-3-031-19769-7_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19768-0

  • Online ISBN: 978-3-031-19769-7

  • eBook Packages: Computer Science, Computer Science (R0)
