Multimedia Tools and Applications, Volume 77, Issue 17, pp 22231–22246

3D facial feature and expression computing from Internet image or video

  • Shan Wang
  • Xukun Shen
  • Yan Zhang


Large-scale multimedia datasets such as Internet image and video collections provide new opportunities to understand and analyze human actions, among which facial performance is one of the most interesting types. In this paper, we present an automatic system for reconstructing detailed facial performances. Many existing facial performance reconstruction systems rely on data captured in controlled environments with densely spaced cameras and lights. In contrast, our system reconstructs detailed facial geometry from a single image or a monocular video sequence with unknown lighting. To achieve this, we first track sparse 2D and 3D features simultaneously, then reconstruct the low-frequency facial geometry through a 2D-3D feature trajectory fusion optimization, which we formulate as a linear problem that can be solved efficiently. Finally, we use a per-pixel shape-from-shading algorithm to estimate fine-scale geometric details such as wrinkles, further improving reconstruction fidelity. We demonstrate the accuracy of our system with reconstruction results on both single images and monocular video sequences.
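To illustrate why a linear formulation is efficient to solve, the sketch below fits the weights of a linear shape model to 2D feature positions by regularized least squares. It is not the formulation used in the paper, which this abstract does not spell out. It assumes a blendshape-style linear model, a known scaled orthographic projection and no translation. The names `fit_linear_shape` and `synthetic_demo`, and all parameters, are hypothetical.

```python
import numpy as np


def fit_linear_shape(landmarks_2d, mean_shape, basis, proj, lam=0.0):
    """Fit linear-model weights to 2D features (illustrative assumption only).

    landmarks_2d: (N, 2) observed 2D feature positions
    mean_shape:   (N, 3) mean 3D positions of the same features
    basis:        (K, N, 3) linear deformation basis
    proj:         (2, 3) known scaled orthographic projection (assumed given)
    lam:          Tikhonov regularization weight
    """
    num_basis = basis.shape[0]
    # Project every basis shape to 2D; each one becomes a column of A (2N x K).
    A = (basis @ proj.T).reshape(num_basis, -1).T
    # Residual between the observations and the projected mean shape (2N,).
    b = (landmarks_2d - mean_shape @ proj.T).reshape(-1)
    # Regularized normal equations: (A^T A + lam I) w = A^T b.
    lhs = A.T @ A + lam * np.eye(num_basis)
    rhs = A.T @ b
    return np.linalg.solve(lhs, rhs)


def synthetic_demo(seed=0):
    """Generate noise-free synthetic data and recover the weights."""
    rng = np.random.default_rng(seed)
    num_points, num_basis = 30, 4
    mean_shape = rng.normal(size=(num_points, 3))
    basis = rng.normal(size=(num_basis, num_points, 3))
    # Scaled orthographic projection: keep x and y, scale by 1.5.
    proj = 1.5 * np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    w_true = rng.normal(size=num_basis)
    shape = mean_shape + np.tensordot(w_true, basis, axes=1)
    landmarks_2d = shape @ proj.T
    w_est = fit_linear_shape(landmarks_2d, mean_shape, basis, proj)
    return w_true, w_est
```

Because the unknown weights enter the projected positions linearly, the fit reduces to one small linear solve. With noisy features, a positive `lam` keeps the weights from growing without bound.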


  • 3D understanding of multimedia data
  • Image/video based 3D face acquisition
  • 2D & 3D facial feature computing



This work is supported by the National Key R&D Program of China (2017YFB1002702).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
  2. School of New Media Art and Design, Beihang University, Beijing, China
