Visual Speech Animation

Reference work entry

Abstract

Visual speech animation (VSA) has many potential applications in human-computer interaction, assisted language learning, entertainment, and other areas. It is also one of the most challenging tasks in human motion animation because of the complex mechanisms underlying speech production and facial motion. This chapter surveys the basic principles, state-of-the-art technologies, and featured applications in this area. Specifically, after introducing the basic concepts and the building blocks of a typical VSA system, we showcase a state-of-the-art approach to audio-to-visual mapping based on deep bidirectional long short-term memory (DBLSTM) recurrent neural networks (RNNs), which aims to create a video-realistic talking head. Finally, the Engkoo project from Microsoft is highlighted as a practical application of visual speech animation in language learning.
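To make the audio-to-visual mapping idea concrete, below is a minimal sketch of a DBLSTM regressor in PyTorch: stacked bidirectional LSTM layers map a sequence of acoustic frames (e.g., MFCCs) to a same-length sequence of visual parameters (e.g., PCA or AAM coefficients of mouth shape). All dimensions, layer counts, and names here are illustrative assumptions, not the exact configuration of the system the chapter describes.

```python
import torch
import torch.nn as nn

class AudioToVisualBLSTM(nn.Module):
    """Minimal DBLSTM regressor: audio feature frames -> visual parameters.

    Illustrative sketch only: feature dimensions, depth, and hidden size
    are assumptions, not the configuration of the surveyed system.
    """
    def __init__(self, audio_dim=39, visual_dim=30, hidden=256, layers=3):
        super().__init__()
        # Stacked bidirectional LSTM: each frame sees both past and future
        # acoustic context, which helps model coarticulation effects.
        self.blstm = nn.LSTM(audio_dim, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        # Linear head maps the concatenated forward/backward hidden states
        # to visual parameters (e.g., PCA coefficients of the mouth region).
        self.head = nn.Linear(2 * hidden, visual_dim)

    def forward(self, audio_frames):
        # audio_frames: (batch, time, audio_dim), e.g., MFCC frames
        states, _ = self.blstm(audio_frames)
        return self.head(states)  # (batch, time, visual_dim)

if __name__ == "__main__":
    model = AudioToVisualBLSTM()
    mfcc = torch.randn(1, 100, 39)   # 100 frames of 39-dim MFCCs
    visual = model(mfcc)             # predicted visual parameter trajectory
    print(visual.shape)              # torch.Size([1, 100, 30])
```

The predicted parameter trajectory would then drive rendering, for instance by reconstructing mouth images from the PCA/AAM basis or by selecting matching video frames from a recorded corpus.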

Keywords

Visual speech animation · Visual speech synthesis · Talking head · Talking face · Talking avatar · Facial animation · Audio-visual speech · Audio-to-visual mapping · Deep learning · Deep neural network

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. School of Computer Science, Northwestern Polytechnical University (NWPU), Xi’an, P. R. China
  2. Microsoft Research, Redmond, USA
  3. School of Computer Science, Northwestern Polytechnical University, Xi’an, China

Section editors and affiliations

  • Zhigang Deng
  1. Department of Computer Science, University of Houston, Houston, USA
