Deep Neural Network Based 3D Articulatory Movement Prediction Using Both Text and Audio Inputs

Yu, Lingyun; Yu, Jun; Ling, Qiang

doi:10.1007/978-3-030-05710-7_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11295)

Included in the following conference series:

International Conference on Multimedia Modeling

Abstract

Robust and accurate prediction of articulatory movements has various important applications, such as human-machine interaction. Various approaches have been proposed to solve the acoustic-articulatory mapping problem. However, their precision is not high enough when only acoustic features are available. Recently, deep neural networks (DNNs) have brought tremendous success in many fields. To increase the accuracy, we propose a new network architecture called bottleneck squeeze-and-excitation recurrent convolutional neural network (BSERCNN) for articulatory movement prediction. On the one hand, by introducing the squeeze-and-excitation (SE) module, BSERCNN can model the interdependencies between channels, which makes the model more efficient. On the other hand, phoneme-level text features and acoustic features are integrated as inputs to BSERCNN for better performance. Experiments show that BSERCNN achieves a state-of-the-art root-mean-squared error (RMSE) of 0.563 mm and a correlation coefficient of 0.954 with both text and audio inputs.
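
To make the SE idea and the two reported metrics concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: it assumes the commonly described squeeze-and-excitation recipe (average each channel, pass the result through a small two-layer gate, rescale the channels), omits biases, and treats the feature map as a channels-by-frames list of lists. The function names `se_recalibrate`, `rmse`, and `pearson_r` are hypothetical, and the abstract does not say how the metrics are aggregated across articulator channels.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def se_recalibrate(features: Matrix, w_reduce: Matrix, w_expand: Matrix) -> Matrix:
    """Rescale each channel of a (channels x frames) map by a learned gate.

    Assumed shapes: w_reduce is (hidden x channels), w_expand is
    (channels x hidden). Biases are left out to keep the sketch short.
    """
    # Squeeze: summarise every channel by its mean over all frames.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation, step 1: bottleneck projection followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    # Excitation, step 2: project back to one value per channel, then sigmoid.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: multiply every frame of a channel by that channel's gate.
    return [[g * x for x in ch] for g, ch in zip(gates, features)]


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root-mean-squared error between two equally long trajectories."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pearson_r(pred: Sequence[float], target: Sequence[float]) -> float:
    """Pearson correlation coefficient between two trajectories."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


if __name__ == "__main__":
    # Toy input: 2 channels, 2 frames, a 1-unit bottleneck (all values made up).
    feats = [[1.0, 3.0], [2.0, 2.0]]
    print(se_recalibrate(feats, w_reduce=[[0.5, -0.5]], w_expand=[[1.0], [-1.0]]))
    print(rmse([1.0, 2.0], [1.0, 4.0]), pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

In this sketch the gate depends only on per-channel averages, so it adds few parameters relative to the convolutional layers it would sit between. For the metrics, RMSE is in the same unit as the trajectories (millimetres in the abstract), while the correlation coefficient is unitless.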

Notes

1. The demo can be downloaded from https://pan.baidu.com/s/1q0YLvdqq8WVU5OOLj0ru1w.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (U1736123, 61572450), the Anhui Provincial Natural Science Foundation (1708085QF138), and the Fundamental Research Funds for the Central Universities (WK2350000002).

Author information

Authors and Affiliations

Department of Automation, University of Science and Technology of China, Hefei, 230027, Anhui, China
Lingyun Yu, Jun Yu & Qiang Ling

Corresponding authors

Correspondence to Jun Yu or Qiang Ling.

Editor information

Editors and Affiliations

Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Ioannis Kompatsiaris
EURECOM, Sophia Antipolis, France
Benoit Huet
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Vasileios Mezaris
Dublin City University, Dublin, Ireland
Cathal Gurrin
National Chiao Tung University, Hsinchu, Taiwan
Wen-Huang Cheng
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Stefanos Vrochidis

Cite this paper

Yu, L., Yu, J., Ling, Q. (2019). Deep Neural Network Based 3D Articulatory Movement Prediction Using Both Text and Audio Inputs. In: Kompatsiaris, I., Huet, B., Mezaris, V., Gurrin, C., Cheng, WH., Vrochidis, S. (eds) MultiMedia Modeling. MMM 2019. Lecture Notes in Computer Science, vol 11295. Springer, Cham. https://doi.org/10.1007/978-3-030-05710-7_6

DOI: https://doi.org/10.1007/978-3-030-05710-7_6
Published: 08 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05709-1
Online ISBN: 978-3-030-05710-7
eBook Packages: Computer Science, Computer Science (R0)
