Generating Talking Face Landmarks from Speech

  • Conference paper
  • Latent Variable Analysis and Signal Separation (LVA/ICA 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10891)

Abstract

The presence of a corresponding talking face has been shown to significantly improve speech intelligibility in noisy conditions and for the hearing-impaired population. In this paper, we present a system that generates landmark points of a talking face from acoustic speech in real time. The system uses a long short-term memory (LSTM) network and is trained on frontal videos of 27 different speakers with automatically extracted face landmarks. After training, it can produce talking face landmarks from the acoustic speech of unseen speakers and utterances. The training phase contains three key steps. We first transform the landmarks of the first video frame to pin the two eye points to two predefined locations and apply the same transformation to all subsequent video frames. We then remove identity information by transforming the landmarks into a mean face shape computed across the entire training dataset. Finally, we train an LSTM network that takes the first- and second-order temporal differences of the log-mel spectrogram as input to predict the face landmarks in each frame. We evaluate our system using the mean-squared error (MSE) between the predicted and ground-truth lip landmarks, as well as between their first- and second-order temporal differences. We further evaluate our system with subjective tests, in which subjects try to distinguish real from fake videos of talking face landmarks. Both tests show promising results.
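
The pipeline the abstract describes is concrete enough to sketch: pin the eye points of the first frame with a similarity transform, compute first- and second-order differences of a log-mel spectrogram as the network input, and regress landmark coordinates with an LSTM under an MSE loss. The sketch below (Python with librosa and PyTorch, which the paper does not necessarily use) illustrates that data flow; the sample rate, mel-band count, hop size, layer sizes, target eye locations, and dlib-style 68-point indexing are all illustrative assumptions, not the paper's published settings.

```python
# Minimal sketch of the three training steps named in the abstract.
# Everything marked "assumed" is a placeholder, not the authors' setting.
import numpy as np
import librosa
import torch
import torch.nn as nn

N_POINTS = 68  # assumed dlib-style 68-point face landmarks


def delta_logmel(wav_path, sr=16000, n_mels=64, hop_length=160):
    """First- and second-order temporal differences of the log-mel
    spectrogram, stacked framewise as the LSTM input (assumed layout)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    logmel = librosa.power_to_db(mel)
    d1 = librosa.feature.delta(logmel, order=1)
    d2 = librosa.feature.delta(logmel, order=2)
    # shape: (frames, 2 * n_mels)
    return np.concatenate([d1, d2], axis=0).T.astype(np.float32)


def eye_pinning(first_frame_lm, eye_a=(0.35, 0.40), eye_b=(0.65, 0.40)):
    """Similarity transform pinning the two eye centers of the first video
    frame to two predefined locations; apply the returned function to every
    subsequent frame. The two-point similarity fit (scale, rotation,
    translation) is solved by treating 2-D points as complex numbers."""
    s1 = complex(*first_frame_lm[36:42].mean(axis=0))  # one eye contour mean
    s2 = complex(*first_frame_lm[42:48].mean(axis=0))  # other eye contour mean
    t1, t2 = complex(*eye_a), complex(*eye_b)
    a = (t2 - t1) / (s2 - s1)  # combined scale and rotation
    b = t1 - a * s1            # translation

    def apply(lm):
        z = lm[:, 0] + 1j * lm[:, 1]
        w = a * z + b
        return np.stack([w.real, w.imag], axis=1)

    return apply


class LandmarkLSTM(nn.Module):
    """LSTM regressor from delta log-mel frames to flattened (x, y)
    landmark coordinates; layer sizes are assumed, not the paper's."""

    def __init__(self, n_feat=128, hidden=256, n_points=N_POINTS):
        super().__init__()
        self.lstm = nn.LSTM(n_feat, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2 * n_points)

    def forward(self, x):        # x: (batch, frames, n_feat)
        h, _ = self.lstm(x)
        return self.head(h)      # (batch, frames, 2 * n_points)
```

Training would pair each feature sequence with its aligned landmark sequence and minimize `nn.MSELoss()` between prediction and ground truth. The identity-removal step (retargeting each speaker's aligned landmarks onto the training set's mean face shape) sits between alignment and training and is omitted here for brevity.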

Z. Duan—This work is supported by the University of Rochester Pilot Award Program in AR/VR and the National Science Foundation grant No. 1741471.

Notes

  1. http://www.ece.rochester.edu/projects/air/projects/talkingface.html.
  2. http://www.ece.rochester.edu/projects/air/projects/talkingface.html.

Author information

Correspondence to Sefik Emre Eskimez.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Eskimez, S.E., Maddox, R.K., Xu, C., Duan, Z. (2018). Generating Talking Face Landmarks from Speech. In: Deville, Y., Gannot, S., Mason, R., Plumbley, M., Ward, D. (eds) Latent Variable Analysis and Signal Separation. LVA/ICA 2018. Lecture Notes in Computer Science, vol 10891. Springer, Cham. https://doi.org/10.1007/978-3-319-93764-9_35

  • DOI: https://doi.org/10.1007/978-3-319-93764-9_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93763-2

  • Online ISBN: 978-3-319-93764-9

  • eBook Packages: Computer Science, Computer Science (R0)
