
Prediction of F0 Based on Articulatory Features Using DNN

  • Conference paper
Studies on Speech Production (ISSP 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10733)


Abstract

In this paper, articulatory-to-F0 prediction is divided into two parts: articulatory-to-voiced/unvoiced flag classification and articulatory-to-F0 mapping for voiced frames. This paper explores several types of articulatory features to determine the most suitable one for F0 prediction using deep neural networks (DNNs) and long short-term memory (LSTM) networks. Moreover, whereas the conventional method for articulatory-to-F0 mapping trains the model on interpolated F0 contours, in this paper only the F0 values at voiced frames are used for training. Experimental results on the test set of the MNGU0 database show that: (1) the velocity and acceleration of articulatory movements are highly effective for articulatory-to-F0 prediction; (2) acoustic features estimated from articulatory features with neural networks perform slightly better than a fusion of those acoustic features with articulatory features; (3) LSTM models achieve better articulatory-to-F0 prediction than DNNs; and (4) the voiced-only training method outperforms the conventional method.
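
The two-stage pipeline described in the abstract, voiced/unvoiced classification followed by F0 regression on voiced frames only, can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes PyTorch, a hypothetical articulatory feature dimension (e.g. EMA trajectories with appended velocity and acceleration), and random tensors standing in for MNGU0 data. The masked loss illustrates the voiced-only training idea, in which unvoiced frames contribute nothing to the F0 loss instead of training on interpolated F0 contours.

```python
# Minimal sketch (not the authors' implementation) of a two-stage
# articulatory-to-F0 predictor, assuming PyTorch and hypothetical dimensions.
import torch
import torch.nn as nn

ART_DIM = 3 * 36   # hypothetical: articulatory positions + velocity + acceleration

class VUVClassifier(nn.Module):
    """Stage 1: classify each frame as voiced or unvoiced from articulatory features."""
    def __init__(self, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(ART_DIM, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, frames, ART_DIM)
        h, _ = self.lstm(x)
        return self.out(h).squeeze(-1)     # per-frame voiced/unvoiced logits

class F0Regressor(nn.Module):
    """Stage 2: regress F0 from articulatory features; trained on voiced frames only."""
    def __init__(self, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(ART_DIM, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.out(h).squeeze(-1)     # per-frame predicted (log-)F0

def voiced_only_f0_loss(f0_pred, f0_target, vuv_flags):
    """MSE over voiced frames only (the voiced-only training strategy)."""
    mask = vuv_flags.bool()
    return ((f0_pred[mask] - f0_target[mask]) ** 2).mean()

# Toy usage with random tensors in place of MNGU0 features.
art = torch.randn(4, 200, ART_DIM)         # articulatory feature sequences
vuv = (torch.rand(4, 200) > 0.4).float()   # reference voiced/unvoiced flags
f0  = torch.rand(4, 200) * 100 + 100       # reference F0 (Hz) at voiced frames

vuv_loss = nn.functional.binary_cross_entropy_with_logits(VUVClassifier()(art), vuv)
f0_loss = voiced_only_f0_loss(F0Regressor()(art), f0, vuv)
```

The sketch uses LSTMs for both stages since the abstract reports that LSTM models outperform DNNs; either stage could equally be a feed-forward DNN over spliced frames.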



Acknowledgements

This research was supported in part by the National Basic Research Program of China (No. 2013CB329303), the National Natural Science Foundation of China (Nos. 61233009 and 61771333), and JSPS KAKENHI Grant 16K00297.

Author information


Correspondence to Longbiao Wang or Jianwu Dang.



Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhao, C., Wang, L., Dang, J., Yu, R. (2018). Prediction of F0 Based on Articulatory Features Using DNN. In: Fang, Q., Dang, J., Perrier, P., Wei, J., Wang, L., Yan, N. (eds) Studies on Speech Production. ISSP 2017. Lecture Notes in Computer Science, vol 10733. Springer, Cham. https://doi.org/10.1007/978-3-030-00126-1_6


  • DOI: https://doi.org/10.1007/978-3-030-00126-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00125-4

  • Online ISBN: 978-3-030-00126-1

  • eBook Packages: Computer Science, Computer Science (R0)
