Abstract
The presence of a corresponding talking face has been shown to significantly improve speech intelligibility in noisy conditions and for hearing-impaired listeners. In this paper, we present a system that generates the landmark points of a talking face from acoustic speech in real time. The system uses a long short-term memory (LSTM) network and is trained on frontal videos of 27 different speakers with automatically extracted face landmarks. After training, it can produce talking face landmarks from the acoustic speech of unseen speakers and utterances. The training phase contains three key steps. We first transform the landmarks of the first video frame so that the two eye landmarks are pinned to predefined locations, and apply the same transformation to all subsequent frames. We then remove identity information by warping the landmarks onto the mean face shape of the entire training dataset. Finally, we train an LSTM network that takes the first- and second-order temporal differences of the log-mel spectrogram as input and predicts the face landmarks in each frame. We evaluate our system using the mean-squared error (MSE) between the predicted and ground-truth lip landmarks, as well as between their first- and second-order temporal differences. We further evaluate our system with subjective tests, in which subjects try to distinguish real from generated videos of talking face landmarks. Both tests show promising results.
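To make the pipeline described above concrete, the following is a minimal sketch (our illustration, not the authors' released code) of the input/output structure: log-mel features and their first- and second-order temporal differences drive an LSTM that predicts 2-D face landmarks per frame. The specific sizes (64 mel bands, a single 256-unit LSTM layer, 68 landmark points) are assumptions for illustration, not values reported in the paper.

```python
# Sketch of the speech-to-landmarks pipeline described in the abstract.
# Hyperparameters (n_mels=64, hidden=256, n_landmarks=68) are illustrative.
import librosa
import numpy as np
import torch
import torch.nn as nn

def speech_features(wav_path, n_mels=64):
    """Log-mel spectrogram deltas: the network input named in the abstract."""
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)            # (n_mels, T)
    d1 = librosa.feature.delta(logmel, order=1)  # first-order temporal difference
    d2 = librosa.feature.delta(logmel, order=2)  # second-order temporal difference
    return np.concatenate([d1, d2], axis=0).T    # (T, 2 * n_mels)

class LandmarkLSTM(nn.Module):
    """LSTM that maps per-frame speech features to per-frame landmarks."""
    def __init__(self, in_dim=128, hidden=256, n_landmarks=68):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_landmarks * 2)  # (x, y) per landmark

    def forward(self, feats):                    # feats: (B, T, in_dim)
        h, _ = self.lstm(feats)
        return self.out(h)                       # (B, T, n_landmarks * 2)

# Training would minimize nn.MSELoss() between these outputs and the
# eye-pinned, mean-face-normalized ground-truth landmarks.
```

In this reading, the eye-pinning and mean-face normalization happen offline on the training targets, so the network only has to learn speech-driven motion rather than head pose or speaker identity.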
Z. Duan—This work is supported by the University of Rochester Pilot Award Program in AR/VR and the National Science Foundation grant No. 1741471.
Cite this paper
Eskimez, S.E., Maddox, R.K., Xu, C., Duan, Z. (2018). Generating Talking Face Landmarks from Speech. In: Deville, Y., Gannot, S., Mason, R., Plumbley, M., Ward, D. (eds) Latent Variable Analysis and Signal Separation. LVA/ICA 2018. Lecture Notes in Computer Science, vol 10891. Springer, Cham. https://doi.org/10.1007/978-3-319-93764-9_35