
Visual Speech Animation

  • Living reference work entry

Abstract

Visual speech animation (VSA) has many potential applications in human-computer interaction, assisted language learning, entertainment, and other areas. It is also one of the most challenging tasks in human motion animation because of the complex mechanisms of speech production and facial motion. This chapter surveys the basic principles, state-of-the-art technologies, and featured applications in this area. Specifically, after introducing the basic concepts and the building blocks of a typical VSA system, we showcase a state-of-the-art approach based on deep bidirectional long short-term memory (DBLSTM) recurrent neural networks (RNNs) for audio-to-visual mapping, which aims to create a video-realistic talking head. Finally, the Engkoo project from Microsoft is highlighted as a practical application of visual speech animation in language learning.


Notes

  1. Also called visual speech synthesis, talking face, talking head, talking avatar, speech animation, and mouth animation.

  2. Sometimes this task is called lip synchronization, or lip sync for short.

  3. FBB128 denotes two BLSTM layers stacked on top of one feed-forward layer, with 128 nodes in each layer (see the sketch after these notes).
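To make the FBB128 notation concrete, here is a minimal PyTorch sketch of such a network for audio-to-visual mapping: one feed-forward layer followed by two stacked BLSTM layers with 128 units each, plus a linear regression head that outputs per-frame visual parameters. The input and output dimensions (39 audio features, 60 visual parameters) and the class name are hypothetical placeholders, not values taken from the chapter, and the chapter does not specify whether the 128 nodes count one or both LSTM directions.

```python
# Hedged sketch of an FBB128-style audio-to-visual mapping network:
# F (feed-forward) + B + B (two BLSTM layers), 128 units per layer.
# Feature sizes below are hypothetical, not from the chapter.
import torch
import torch.nn as nn

class FBB128(nn.Module):
    def __init__(self, audio_dim=39, visual_dim=60):  # hypothetical dims
        super().__init__()
        # F: frame-wise feed-forward layer mapping audio features to 128 units
        self.ff = nn.Sequential(nn.Linear(audio_dim, 128), nn.Tanh())
        # BB: two stacked bidirectional LSTM layers, 128 units per direction
        # (assumption: 128 per direction; the note only says "128 nodes")
        self.blstm = nn.LSTM(input_size=128, hidden_size=128, num_layers=2,
                             bidirectional=True, batch_first=True)
        # Regression head to visual parameters (e.g., lip/face coefficients)
        self.out = nn.Linear(2 * 128, visual_dim)

    def forward(self, x):           # x: (batch, frames, audio_dim)
        h = self.ff(x)              # frame-wise feed-forward transform
        h, _ = self.blstm(h)        # bidirectional context over the utterance
        return self.out(h)          # per-frame visual parameters

# Usage: map 2 utterances of 100 audio frames to visual trajectories
y = FBB128()(torch.randn(2, 100, 39))  # -> shape (2, 100, 60)
```

The bidirectional layers let each output frame condition on both past and future audio context, which is the property the chapter's DBLSTM approach relies on for smooth, coarticulation-aware mouth motion.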


Author information


Correspondence to Lei Xie, PhD.


Copyright information

© 2016 Springer International Publishing AG

About this entry

Cite this entry

Xie, L., Wang, L., Yang, S. (2016). Visual Speech Animation. In: Müller, B., et al. Handbook of Human Motion. Springer, Cham. https://doi.org/10.1007/978-3-319-30808-1_1-1


  • DOI: https://doi.org/10.1007/978-3-319-30808-1_1-1


  • Publisher Name: Springer, Cham

  • Online ISBN: 978-3-319-30808-1

  • eBook Packages: Springer Reference Engineering; Reference Module Computer Science and Engineering
