Improved Prosodic Clustering for Multispeaker and Speaker-Independent Phoneme-Level Prosody Control

Abstract
This paper presents a method for phoneme-level prosody control of F0 and duration in a multispeaker text-to-speech setup, based on prosodic clustering. An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder. Several improvements over the basic single-speaker method are proposed that increase the prosodic control range and coverage. More specifically, we employ data augmentation, F0 normalization, balanced clustering for duration, and speaker-independent prosodic clustering. These modifications enable fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining each speaker's identity. The model is also fine-tuned to unseen speakers with limited amounts of data and is shown to maintain its prosody control capabilities, verifying that the speaker-independent prosodic clustering is effective. Experimental results confirm that the model maintains high output speech quality and that the proposed method allows efficient prosody control within each speaker's range, despite the variability that a multispeaker setting introduces.
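Two of the ingredients named above, per-speaker F0 normalization and balanced clustering for duration, can be sketched as follows. This is an illustrative assumption of how such a pipeline might look, not the paper's exact procedure: `normalize_f0` z-scores log-F0 per speaker so that cluster boundaries can be shared across speakers, and `balanced_cluster_edges` uses quantile edges so each duration cluster receives roughly the same number of samples.

```python
import numpy as np

def normalize_f0(f0, eps=1e-8):
    # Per-speaker z-score normalization of voiced log-F0 values,
    # so that discrete prosody clusters are speaker-independent.
    log_f0 = np.log(f0)
    return (log_f0 - log_f0.mean()) / (log_f0.std() + eps)

def balanced_cluster_edges(values, n_clusters):
    # "Balanced" (equal-population) clustering via quantile edges:
    # each cluster covers roughly the same fraction of the data.
    qs = np.linspace(0.0, 1.0, n_clusters + 1)[1:-1]
    return np.quantile(values, qs)

def assign_tokens(values, edges):
    # Map each phoneme-level value to a discrete prosody token id.
    return np.searchsorted(edges, values)

if __name__ == "__main__":
    durations = np.arange(100.0)  # toy per-phoneme durations
    edges = balanced_cluster_edges(durations, 4)
    tokens = assign_tokens(durations, edges)
    print(np.bincount(tokens))  # roughly equal cluster sizes
```

The quantile-based scheme guarantees coverage of each speaker's range: no token goes unused, which is one motivation the abstract gives for balancing the duration clusters.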
M. Christidou and A. Vioni: equal contribution.
© 2021 Springer Nature Switzerland AG
Cite this paper
Christidou, M. et al. (2021). Improved Prosodic Clustering for Multispeaker and Speaker-Independent Phoneme-Level Prosody Control. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science(), vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_11
DOI: https://doi.org/10.1007/978-3-030-87802-3_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3