Facial expression GAN for voice-driven face generation


Cross-modal audiovisual generation is an emerging topic in machine learning. Voice-to-face generation, which aims to synthesize faces from human voice clips, is one of its most active branches. Most recent work on voice-to-face generation does not take emotion information into account, yet expressions are widely observed to be key facial attributes for reconstructing sharper and more discriminative faces. In this paper, we propose a novel facial expression GAN (FE-GAN) that incorporates emotion and expression into face generation. To achieve this, we use two auxiliary classifiers to learn richer emotion and identity representations across modalities, and we design two discriminators, each focusing on a different aspect of the faces, to measure identity and emotion semantic relevance during generation. A triple loss is designed to make FE-GAN robust to voice variation and to keep the two modalities in balance. Extensive experiments on two real datasets demonstrate the effectiveness of FE-GAN from both quantitative and qualitative perspectives. The results show that FE-GAN not only outperforms previous models in terms of FID and IS, but also generates more realistic face images.
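As a rough illustration of the objective described above, the sketch below combines two adversarial terms (an identity discriminator and an emotion discriminator) with two AC-GAN-style auxiliary classification losses, and also shows the Inception Score used in evaluation. All function names and the weighting scheme are hypothetical; this is a minimal numerical sketch, not the authors' implementation.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (auxiliary-classifier loss)."""
    z = logits - logits.max()                      # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def adversarial_loss(d_real, d_fake):
    """Standard GAN discriminator loss; d_real/d_fake are sigmoid outputs."""
    return -(np.log(d_real) + np.log(1.0 - d_fake))

def dual_discriminator_objective(d_id, d_em, id_logits, id_label,
                                 em_logits, em_label,
                                 lambda_id=1.0, lambda_em=1.0):
    """Discriminator-side total: identity and emotion adversarial terms
    plus two auxiliary classification terms (hypothetical weighting)."""
    adv = adversarial_loss(*d_id) + adversarial_loss(*d_em)
    aux = (lambda_id * cross_entropy(id_logits, id_label)
           + lambda_em * cross_entropy(em_logits, em_label))
    return adv + aux

def inception_score(p_yx, eps=1e-12):
    """IS = exp(E_x KL(p(y|x) || p(y))), from classifier probabilities
    p_yx of shape (num_images, num_classes)."""
    p_y = p_yx.mean(axis=0, keepdims=True)         # marginal label dist.
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

At the chance level (all discriminator outputs 0.5, uniform classifier logits), every term reduces to a multiple of log 2, which makes the sketch easy to sanity-check by hand.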






Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant 61761166005), the Ministry of Science and Technology, Taiwan (MOST 106-2218-E-032-003-MY3), the Natural Science Foundation of Zhejiang Province (Grant LY20F020007), the Ningbo Science and Technology Plan projects (Grant 2019B10032), and the K.C. Wong Magna Fund in Ningbo University.

Author information



Corresponding author

Correspondence to Zhen Liu.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Fang, Z., Liu, Z., Liu, T. et al. Facial expression GAN for voice-driven face generation. Vis Comput (2021). https://doi.org/10.1007/s00371-021-02074-w

Keywords


  • Expression reconstruction
  • Cross-modal generation
  • Voice-to-face generation
  • Generative adversarial networks