Speaker Adaptation with Continuous Vocoder-Based DNN-TTS

  • Conference paper
  • In: Speech and Computer (SPECOM 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12997)

Abstract

Traditional vocoder-based statistical parametric speech synthesis can be advantageous in applications that require low computational complexity. Recent neural vocoders, which can produce highly natural speech, still cannot fulfill the requirement of real-time synthesis. In this paper, we experiment with our earlier continuous vocoder, in which the excitation is modeled with two one-dimensional parameters: continuous F0 and Maximum Voiced Frequency (MVF). We show on the data of 9 speakers that an average voice can be trained for DNN-TTS, and that speaker adaptation is feasible with 400 utterances (about 14 min). Objective experiments support that the quality of speaker adaptation with Continuous Vocoder-based DNN-TTS is similar to that of speaker adaptation with a WORLD Vocoder-based baseline.
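
For readers unfamiliar with the parametrization, the following minimal NumPy sketch illustrates the kind of excitation model the abstract names: harmonics of the continuous F0 below the time-varying MVF form the periodic component, and noise weighted toward the band above the MVF forms the aperiodic component. This is an illustrative sketch only, not the authors' implementation; the 16 kHz sampling rate, 5 ms frame shift, the function name mixed_excitation, and the crude noise weighting are all assumptions made here for clarity.

import numpy as np

def mixed_excitation(cont_f0, mvf, fs=16000, frame_shift=0.005):
    """Sketch: build an excitation signal from per-frame continuous F0
    and Maximum Voiced Frequency tracks (both in Hz)."""
    hop = int(fs * frame_shift)
    n = len(cont_f0) * hop
    # Upsample the frame-level tracks to sample level by repetition.
    f0 = np.repeat(np.asarray(cont_f0, dtype=float), hop)[:n]
    mvf_s = np.repeat(np.asarray(mvf, dtype=float), hop)[:n]

    # Periodic component: sum the harmonics of the (always defined)
    # continuous F0, keeping only those below the time-varying MVF.
    phase = 2.0 * np.pi * np.cumsum(f0) / fs
    max_harm = int((fs / 2.0) // max(float(np.min(f0)), 1.0))
    voiced = np.zeros(n)
    for k in range(1, max_harm + 1):
        voiced += np.where(k * f0 < mvf_s, np.cos(k * phase), 0.0)

    # Aperiodic component: white noise scaled by the bandwidth above the
    # MVF (a real vocoder would high-pass filter the noise instead).
    noise = np.random.randn(n) * (fs / 2.0 - mvf_s) / (fs / 2.0)
    return voiced / max_harm + noise

# Example: 1 s of excitation with a flat 120 Hz contour and a 4 kHz MVF.
exc = mixed_excitation(np.full(200, 120.0), np.full(200, 4000.0))

In a full pipeline, these two parameter tracks would be predicted by the DNN-TTS acoustic model, and speaker adaptation then amounts to fine-tuning that network on the target speaker's utterances.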



Acknowledgments

The research was partly supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 825619 (AI4EU), and by the National Research, Development and Innovation Office of Hungary (FK 124584 and PD 127915). The Titan X GPU used in this research was donated by NVIDIA Corporation. We would like to thank the subjects for participating in the listening test.

Author information

Corresponding author: Ali Raheem Mandeel.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Mandeel, A.R., Al-Radhi, M.S., Csapó, T.G. (2021). Speaker Adaptation with Continuous Vocoder-Based DNN-TTS. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_37

  • DOI: https://doi.org/10.1007/978-3-030-87802-3_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87801-6

  • Online ISBN: 978-3-030-87802-3

  • eBook Packages: Computer Science, Computer Science (R0)
