
Prediction of F0 Based on Articulatory Features Using DNN

  • Conference paper
Studies on Speech Production (ISSP 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10733)


Abstract

In this paper, articulatory-to-F0 prediction is divided into two parts: articulatory-to-voiced/unvoiced flag classification and articulatory-to-F0 mapping for voiced frames. This paper explores several types of articulatory features to determine the most suitable one for F0 prediction using deep neural networks (DNNs) and long short-term memory (LSTM) networks. Moreover, whereas the conventional method for articulatory-to-F0 mapping trains the model on interpolated F0 contours, in this paper only the F0 values at voiced frames are used for training. Experimental results on the test set of the MNGU0 database show that: (1) the velocity and acceleration of articulatory movements are highly effective for articulatory-to-F0 prediction; (2) acoustic features estimated from articulatory features with neural networks perform slightly better than a fusion of those acoustic features with articulatory features; (3) LSTM models achieve better articulatory-to-F0 prediction than DNNs; and (4) the voiced-only training method outperforms the conventional method.
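
The two-stage pipeline described in the abstract, voiced/unvoiced classification followed by F0 regression on voiced frames only, can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes PyTorch, a hypothetical articulatory feature dimension (e.g. EMA trajectories with appended velocity and acceleration), and random tensors standing in for MNGU0 data. The masked loss illustrates the voiced-only training idea, in which unvoiced frames contribute nothing to the F0 loss instead of training on interpolated F0 contours.

```python
# Minimal sketch (not the authors' implementation) of a two-stage
# articulatory-to-F0 predictor, assuming PyTorch and hypothetical dimensions.
import torch
import torch.nn as nn

ART_DIM = 3 * 36   # hypothetical: articulatory positions + velocity + acceleration

class VUVClassifier(nn.Module):
    """Stage 1: classify each frame as voiced or unvoiced from articulatory features."""
    def __init__(self, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(ART_DIM, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, frames, ART_DIM)
        h, _ = self.lstm(x)
        return self.out(h).squeeze(-1)     # per-frame voiced/unvoiced logits

class F0Regressor(nn.Module):
    """Stage 2: regress F0 from articulatory features; trained on voiced frames only."""
    def __init__(self, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(ART_DIM, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.out(h).squeeze(-1)     # per-frame predicted (log-)F0

def voiced_only_f0_loss(f0_pred, f0_target, vuv_flags):
    """MSE over voiced frames only (the voiced-only training strategy)."""
    mask = vuv_flags.bool()
    return ((f0_pred[mask] - f0_target[mask]) ** 2).mean()

# Toy usage with random tensors in place of MNGU0 features.
art = torch.randn(4, 200, ART_DIM)         # articulatory feature sequences
vuv = (torch.rand(4, 200) > 0.4).float()   # reference voiced/unvoiced flags
f0  = torch.rand(4, 200) * 100 + 100       # reference F0 (Hz) at voiced frames

vuv_loss = nn.functional.binary_cross_entropy_with_logits(VUVClassifier()(art), vuv)
f0_loss = voiced_only_f0_loss(F0Regressor()(art), f0, vuv)
```

The sketch uses LSTMs for both stages since the abstract reports that LSTM models outperform DNNs; either stage could equally be a feed-forward DNN over spliced frames.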



Acknowledgements

This research was supported in part by the National Basic Research Program of China (No. 2013CB329303), the National Natural Science Foundation of China (Nos. 61233009 and 61771333), and JSPS KAKENHI Grant 16K00297.

Author information


Correspondence to Longbiao Wang or Jianwu Dang.



Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhao, C., Wang, L., Dang, J., Yu, R. (2018). Prediction of F0 Based on Articulatory Features Using DNN. In: Fang, Q., Dang, J., Perrier, P., Wei, J., Wang, L., Yan, N. (eds) Studies on Speech Production. ISSP 2017. Lecture Notes in Computer Science, vol 10733. Springer, Cham. https://doi.org/10.1007/978-3-030-00126-1_6


  • DOI: https://doi.org/10.1007/978-3-030-00126-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00125-4

  • Online ISBN: 978-3-030-00126-1

  • eBook Packages: Computer Science, Computer Science (R0)
