Deep Neural Network Based 3D Articulatory Movement Prediction Using Both Text and Audio Inputs

  • Lingyun Yu
  • Jun Yu
  • Qiang Ling
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11295)


Robust and accurate prediction of articulatory movements has important applications, such as human-machine interaction. Various approaches have been proposed to solve the acoustic-articulatory mapping problem, but their precision is limited when only acoustic features are available. Recently, deep neural networks (DNNs) have brought tremendous success in many fields. To increase accuracy, we propose a new network architecture called the bottleneck squeeze-and-excitation recurrent convolutional neural network (BSERCNN) for articulatory movement prediction. On the one hand, by introducing the squeeze-and-excitation (SE) module, BSERCNN can model the interdependencies between channels, which makes the model more efficient. On the other hand, phoneme-level text features and acoustic features are integrated as inputs to BSERCNN for better performance. Experiments show that BSERCNN achieves a state-of-the-art root-mean-squared error (RMSE) of 0.563 mm and a correlation coefficient of 0.954 with both text and audio inputs.
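The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes the textbook definitions of RMSE and the Pearson correlation coefficient, applied to a single articulator channel. The abstract does not say how the scores are averaged across channels or utterances, so that step is left out, and the function names and toy trajectory below are hypothetical.

```python
import math


def rmse(predicted, measured):
    """Root-mean-squared error, in the same unit as the inputs (e.g. mm)."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson_r(predicted, measured):
    """Pearson correlation coefficient between two trajectories."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    covariance = sum((p - mean_p) * (m - mean_m)
                     for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_p * var_m)


# Hypothetical toy data: one articulator coordinate over four frames (mm).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]  # same shape, shifted by a constant 0.5 mm

print(rmse(predicted, measured))       # error magnitude of the prediction
print(pearson_r(predicted, measured))  # similarity of the trajectory shapes
```

In this toy case the prediction follows the measured trajectory exactly but with a constant offset, so the correlation is perfect while the RMSE is not zero. The two measures capture different aspects of prediction quality, which is one reason to report both.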


Keywords: Deep Neural Network · Squeeze-and-excitation module · Bottleneck network · Articulatory movement prediction



This work is supported by the National Natural Science Foundation of China (U1736123, 61572450), the Anhui Provincial Natural Science Foundation (1708085QF138), and the Fundamental Research Funds for the Central Universities (WK2350000002).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Automation, University of Science and Technology of China, Hefei, China
