
Visual Speech Animation

  • Living reference work entry

Abstract

Visual speech animation (VSA) has many potential applications in human-computer interaction, assisted language learning, entertainment, and other areas. It is also one of the most challenging tasks in human motion animation because of the complex mechanisms of speech production and facial motion. This chapter surveys the basic principles, state-of-the-art technologies, and featured applications in this area. Specifically, after introducing the basic concepts and the building blocks of a typical VSA system, we showcase a state-of-the-art approach based on deep bidirectional long short-term memory (DBLSTM) recurrent neural networks (RNNs) for audio-to-visual mapping, which aims to create a video-realistic talking head. Finally, the Engkoo project from Microsoft is highlighted as a practical application of visual speech animation in language learning.


Notes

  1. Also called visual speech synthesis, talking face, talking head, talking avatar, speech animation, and mouth animation.

  2. Sometimes this task is called lip synchronization, or lip sync for short.

  3. FBB128 denotes two BLSTM layers stacked on top of one feed-forward layer, with 128 nodes in each layer (see the sketch after these notes).
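To make the FBB128 notation concrete, here is a minimal PyTorch sketch of such a network for audio-to-visual mapping: one feed-forward layer followed by two stacked BLSTM layers with 128 units each, plus a linear regression head that outputs per-frame visual parameters. The input and output dimensions (39 audio features, 60 visual parameters) and the class name are hypothetical placeholders, not values taken from the chapter, and the chapter does not specify whether the 128 nodes count one or both LSTM directions.

```python
# Hedged sketch of an FBB128-style audio-to-visual mapping network:
# F (feed-forward) + B + B (two BLSTM layers), 128 units per layer.
# Feature sizes below are hypothetical, not from the chapter.
import torch
import torch.nn as nn

class FBB128(nn.Module):
    def __init__(self, audio_dim=39, visual_dim=60):  # hypothetical dims
        super().__init__()
        # F: frame-wise feed-forward layer mapping audio features to 128 units
        self.ff = nn.Sequential(nn.Linear(audio_dim, 128), nn.Tanh())
        # BB: two stacked bidirectional LSTM layers, 128 units per direction
        # (assumption: 128 per direction; the note only says "128 nodes")
        self.blstm = nn.LSTM(input_size=128, hidden_size=128, num_layers=2,
                             bidirectional=True, batch_first=True)
        # Regression head to visual parameters (e.g., lip/face coefficients)
        self.out = nn.Linear(2 * 128, visual_dim)

    def forward(self, x):           # x: (batch, frames, audio_dim)
        h = self.ff(x)              # frame-wise feed-forward transform
        h, _ = self.blstm(h)        # bidirectional context over the utterance
        return self.out(h)          # per-frame visual parameters

# Usage: map 2 utterances of 100 audio frames to visual trajectories
y = FBB128()(torch.randn(2, 100, 39))  # -> shape (2, 100, 60)
```

The bidirectional layers let each output frame condition on both past and future audio context, which is the property the chapter's DBLSTM approach relies on for smooth, coarticulation-aware mouth motion.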


Author information


Correspondence to Lei Xie, PhD.


Copyright information

© 2016 Springer International Publishing AG

About this entry

Cite this entry

Xie, L., Wang, L., Yang, S. (2016). Visual Speech Animation. In: Müller, B., et al. Handbook of Human Motion. Springer, Cham. https://doi.org/10.1007/978-3-319-30808-1_1-1


  • DOI: https://doi.org/10.1007/978-3-319-30808-1_1-1


  • Publisher Name: Springer, Cham

  • Online ISBN: 978-3-319-30808-1

  • eBook Packages: Springer Reference Engineering; Reference Module Computer Science and Engineering
