WaveNet-Based Speech Synthesis Applied to Czech
WaveNet is a recently developed deep neural network for generating high-quality synthetic speech; it produces raw audio samples directly. This paper describes the first application of WaveNet-based speech synthesis to the Czech language. We used the basic WaveNet architecture; the durations of individual phones and the fundamental frequency required for local conditioning were estimated by additional LSTM networks. We conducted a MUSHRA listening test to compare WaveNet with two traditional synthesis methods: unit selection and HMM-based synthesis. Experiments were performed on four large speech corpora. Although, unlike in other studies, our implementation of WaveNet did not outperform the unit selection method, there is still much room for improvement, whereas unit selection TTS has probably reached its quality limit.
Keywords: Speech synthesis · WaveNet · Deep neural network · Unit selection · HMM-based speech synthesis
This research was supported by the Czech Science Foundation (GACR), project No. GA16-04420S, and by the grant of the University of West Bohemia, project No. SGS-2016-039. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the program “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.
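The basic WaveNet architecture mentioned above is built from stacks of dilated causal convolutions with gated activation units and residual connections; stacking layers with exponentially growing dilations makes the receptive field grow exponentially with depth. The following is a minimal NumPy sketch of that core building block, not the paper's actual implementation; the function names, kernel size of 2, and random weights are illustrative assumptions.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution with kernel size 2 and a given dilation.

    x: input signal, shape (T,); w: kernel weights, shape (2,).
    Output at time t depends only on x[t] and x[t - dilation],
    so no future samples leak into the prediction (causality).
    """
    T = len(x)
    padded = np.concatenate([np.zeros(dilation), x])  # left-pad to stay causal
    # padded[:T][t] = x[t - dilation] (or 0), padded[dilation:][t] = x[t]
    return w[0] * padded[:T] + w[1] * padded[dilation:dilation + T]

def gated_residual_block(x, w_filter, w_gate, dilation):
    """WaveNet-style gated activation with a residual connection:
    z = tanh(W_f * x) * sigmoid(W_g * x); output = x + z."""
    f = np.tanh(causal_dilated_conv(x, w_filter, dilation))
    g = 1.0 / (1.0 + np.exp(-causal_dilated_conv(x, w_gate, dilation)))
    return x + f * g

# Stack blocks with dilations 1, 2, 4, 8: the receptive field then
# covers 2**4 = 16 past samples after four layers.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
for d in (1, 2, 4, 8):
    x = gated_residual_block(x,
                             rng.standard_normal(2) * 0.1,
                             rng.standard_normal(2) * 0.1,
                             d)
print(x.shape)
```

In the full model, a softmax output layer predicts a distribution over quantized sample values, and local conditioning features (here, the LSTM-estimated phone durations and F0) would be added inside the gate; those parts are omitted from this sketch.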