Abstract
In this paper, articulatory-to-F0 prediction is divided into two parts: articulatory-to-voiced/unvoiced (V/UV) flag classification, and articulatory-to-F0 mapping for voiced frames. Several types of articulatory features are explored to determine the most suitable one for F0 prediction using deep neural networks (DNNs) and long short-term memory (LSTM) networks. In addition, whereas the conventional method for articulatory-to-F0 mapping trains the model on interpolated F0 values, in this paper only the F0 values at voiced frames are used for training. Experimental results on the test set of the MNGU0 database show that: (1) the velocity and acceleration of articulatory movements are highly effective for articulatory-to-F0 prediction; (2) acoustic features estimated from articulatory features with neural networks perform slightly better than the fusion of those acoustic features with the articulatory features themselves; (3) LSTM models achieve better articulatory-to-F0 prediction than DNNs; and (4) the voiced-only training method outperforms the conventional method.
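The two-stage decomposition described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the use of NumPy boolean masks are assumptions; the neural-network models themselves are omitted. It shows how the training targets are split (V/UV labels for the classifier, F0 restricted to voiced frames for the regressor, rather than interpolating F0 through unvoiced regions) and how the two stages recombine at prediction time.

```python
import numpy as np

def split_f0_targets(f0, vuv):
    """Build the two training targets of the two-stage pipeline:
    V/UV labels for the classifier, and F0 values restricted to
    voiced frames only (no interpolation over unvoiced frames)."""
    f0 = np.asarray(f0, dtype=float)
    vuv = np.asarray(vuv, dtype=bool)
    return vuv.astype(int), f0[vuv]

def combine_predictions(vuv_pred, f0_voiced_pred):
    """Recombine the two stages at test time: F0 is emitted only
    where the classifier predicts 'voiced'; unvoiced frames get 0."""
    vuv_pred = np.asarray(vuv_pred, dtype=bool)
    out = np.zeros(len(vuv_pred))
    out[vuv_pred] = f0_voiced_pred
    return out
```

For example, with per-frame F0 `[120, 0, 130, 0]` and V/UV flags `[1, 0, 1, 0]`, the regressor is trained only on the targets `[120, 130]`, and at test time the classifier's V/UV decision gates where those regression outputs are placed.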
Acknowledgements
This research was partially supported by the National Basic Research Program of China (No. 2013CB329303), the National Natural Science Foundation of China (No. 61233009 and No. 61771333), and JSPS KAKENHI Grant (16K00297).
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Zhao, C., Wang, L., Dang, J., Yu, R. (2018). Prediction of F0 Based on Articulatory Features Using DNN. In: Fang, Q., Dang, J., Perrier, P., Wei, J., Wang, L., Yan, N. (eds) Studies on Speech Production. ISSP 2017. Lecture Notes in Computer Science(), vol 10733. Springer, Cham. https://doi.org/10.1007/978-3-030-00126-1_6
DOI: https://doi.org/10.1007/978-3-030-00126-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00125-4
Online ISBN: 978-3-030-00126-1
eBook Packages: Computer Science, Computer Science (R0)