DNN-Based Talking Movie Generation with Face Direction Consideration

  • Toru Ishikawa
  • Takashi Nose
  • Akinori Ito
Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 110)


In this paper, we propose a method for generating a talking-head animation that takes the direction of the face into account. The proposed method parametrizes a facial image using the active appearance model (AAM) and models the AAM parameters with a feedforward deep neural network. Because the AAM is a two-dimensional face model, conventional AAM-based methods assume only a frontal face. As a result, when the generated face is combined with other parts such as the head and body, the directions of the face and the head are often inconsistent. The proposed method models the shape parameters of the AAM using principal component analysis (PCA) so that the face direction and the movement of individual facial parts are modeled separately. We then substitute the face direction of the generated animation with that of the head part so that the directions of the face and the head coincide. We conducted an experiment to demonstrate that the proposed method can generate face animation with the proper face direction.
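The direction-substitution idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that shapes are flattened 2-D landmark vectors, that PCA is computed by a plain SVD, and that the first two principal components carry the face direction (the abstract does not say which components do). The function names, the synthetic data, and `direction_idx` are all hypothetical, and the random arrays merely stand in for the DNN output and the head-pose track.

```python
import numpy as np

def fit_pca(shapes, n_components):
    """Fit PCA to flattened landmark shapes (one row per frame)."""
    mean = shapes.mean(axis=0)
    # SVD of the centered data; rows of vt are orthonormal principal axes.
    _, _, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(shapes, mean, basis):
    """Map shapes to PCA coefficients."""
    return (shapes - mean) @ basis.T

def reconstruct(coeffs, mean, basis):
    """Map PCA coefficients back to landmark shapes."""
    return coeffs @ basis + mean

def substitute_direction(generated, head, direction_idx):
    """Copy the (assumed) direction coefficients from `head` into `generated`."""
    out = generated.copy()
    out[:, direction_idx] = head[:, direction_idx]
    return out

# Synthetic stand-in data: 50 frames of 68 two-dimensional landmarks.
rng = np.random.default_rng(0)
train_shapes = rng.normal(size=(50, 2 * 68))
mean, basis = fit_pca(train_shapes, n_components=10)

# Stand-ins for the DNN-generated face track and the target head track.
generated_coeffs = project(train_shapes[:5], mean, basis)
head_coeffs = project(train_shapes[5:10], mean, basis)

# ASSUMPTION: components 0 and 1 encode face direction.
direction_idx = [0, 1]
merged_coeffs = substitute_direction(generated_coeffs, head_coeffs, direction_idx)
merged_shapes = reconstruct(merged_coeffs, mean, basis)
```

In this sketch only the coefficients listed in `direction_idx` are replaced, so the remaining components, which here stand for the movement of individual facial parts, are left as generated.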


Keywords: Photo-realistic facial animation, Face image synthesis, Deep neural network



Part of this work was supported by JSPS KAKENHI Grant Number JP17H00823.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Graduate School of Engineering, Tohoku University, Sendai, Japan
