WaveNet-Based Speech Synthesis Applied to Czech

A Comparison with the Traditional Synthesis Methods
  • Zdeněk Hanzlíček
  • Jakub Vít
  • Daniel Tihelka
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11107)


WaveNet is a recently developed deep neural network for generating high-quality synthetic speech. It directly produces raw audio samples. This paper describes the first application of WaveNet-based speech synthesis to the Czech language. We used the basic WaveNet architecture. The phone durations and the fundamental frequency required for local conditioning were estimated by additional LSTM networks. We conducted a MUSHRA listening test to compare WaveNet with two traditional synthesis methods: unit selection and HMM-based synthesis. Experiments were performed on four large speech corpora. Unlike the results reported in other studies, our implementation of WaveNet did not outperform the unit selection method; nevertheless, there is still considerable scope for improving it, whereas unit selection TTS has probably reached its quality limit.


Speech synthesis, WaveNet, Deep neural network, Unit selection, HMM-based speech synthesis



This research was supported by the Czech Science Foundation (GACR), project No. GA16-04420S, and by the grant of the University of West Bohemia, project No. SGS-2016-039. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the program “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Zdeněk Hanzlíček (1)
  • Jakub Vít (1)
  • Daniel Tihelka (1)
  1. NTIS - New Technology for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
