\(F_{0}\) Modeling Using DNN for Arabic Parametric Speech Synthesis

  • Imene Zangar
  • Zied Mnasri
  • Vincent Colotte
  • Denis Jouvet
Conference paper
Part of the Proceedings of the International Neural Networks Society book series (INNS, volume 1)


Deep neural networks (DNN) are gaining increasing interest in speech processing applications, especially in text-to-speech synthesis. Indeed, state-of-the-art speech generation tools, such as Merlin and WaveNet, are entirely DNN-based. However, each language has to be modeled on its own with DNN. One of the key components of a speech synthesis system is the module that generates prosodic parameters from contextual input features, and more particularly the fundamental frequency (\(F_{0}\)) generation module. Since \(F_{0}\) is responsible for intonation, it must be accurately modeled to produce intelligible and natural speech. However, \(F_{0}\) modeling is highly language-dependent, so language-specific characteristics have to be taken into account. In this paper, we model \(F_{0}\) for Arabic speech synthesis with feedforward and recurrent DNN, using features specific to Arabic, such as vowel quantity and gemination, in order to improve the quality of Arabic parametric speech synthesis.
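The approach described above, predicting \(F_{0}\) from contextual input features extended with Arabic-specific flags, can be illustrated with a minimal sketch. This is not the authors' implementation: the feature layout (generic phonetic context plus binary vowel-quantity and gemination flags), the network size, and the synthetic training data are all assumptions chosen only to make the idea concrete, using a one-hidden-layer feedforward regressor trained by plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_features(n):
    """Hypothetical per-frame contextual features: generic phonetic context
    plus Arabic-specific binary flags (vowel quantity, gemination)."""
    phonetic = rng.normal(size=(n, 8))             # generic context features
    vowel_quantity = rng.integers(0, 2, (n, 1))    # 0 = short, 1 = long vowel
    gemination = rng.integers(0, 2, (n, 1))        # 0 = simple, 1 = geminated
    return np.hstack([phonetic, vowel_quantity, gemination]).astype(float)

class FeedforwardF0Model:
    """Minimal one-hidden-layer feedforward regressor for (log-)F0."""
    def __init__(self, n_in, n_hidden=16, lr=0.05):
        self.W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
        self.b2 = np.zeros(1)
        self.lr = lr

    def forward(self, X):
        self.h = np.tanh(X @ self.W1 + self.b1)    # hidden activations
        return self.h @ self.W2 + self.b2          # predicted F0 value

    def train_step(self, X, y):
        pred = self.forward(X)
        err = pred - y                             # gradient of 0.5*MSE
        gW2 = self.h.T @ err / len(X)
        gb2 = err.mean(0)
        gh = (err @ self.W2.T) * (1 - self.h ** 2) # backprop through tanh
        gW1 = X.T @ gh / len(X)
        gb1 = gh.mean(0)
        for p, g in [(self.W1, gW1), (self.b1, gb1),
                     (self.W2, gW2), (self.b2, gb2)]:
            p -= self.lr * g                       # gradient-descent update
        return float((err ** 2).mean())

# Synthetic data: target F0 loosely influenced by the contextual features.
X = make_features(256)
y = X @ rng.normal(scale=0.3, size=(10, 1)) + 5.0
model = FeedforwardF0Model(n_in=10)
losses = [model.train_step(X, y) for _ in range(200)]
```

In the paper's setting, the same input vector would instead feed a recurrent (e.g. LSTM) network so that \(F_{0}\) prediction can exploit the sequential context of neighboring phonemes; the feedforward variant above only shows the feature-to-\(F_{0}\) mapping at its simplest.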


Keywords: Arabic parametric speech synthesis · Fundamental frequency (\(F_{0}\)) · Deep neural networks · Recurrent neural networks



This research work was conducted in the framework of the PHC-Utique Program, financed by CMCU (Comité mixte de coopération universitaire), grant No. 15G1405.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Imene Zangar (1)
  • Zied Mnasri (1, 3)
  • Vincent Colotte (2)
  • Denis Jouvet (2)

  1. Electrical Engineering Department, University Tunis El Manar, Ecole Nationale d'Ingénieurs de Tunis, Tunis, Tunisia
  2. Université de Lorraine, CNRS, Inria, LORIA, Nancy, France
  3. Università degli studi di Genova, DIBRIS, Genoa, Italy
