Optimal Feature Set and Minimal Training Size for Pronunciation Adaptation in TTS

Tahon, Marie; Qader, Raheel; Lecorvé, Gwénolé; Lolive, Damien

doi:10.1007/978-3-319-45925-7_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9918)

Included in the following conference series:

International Conference on Statistical Language and Speech Processing

Abstract

Text-to-Speech (TTS) systems rely on a grapheme-to-phoneme converter which is built to produce canonical, or statically stylized, pronunciations. As a result, TTS quality drops when the phoneme sequences generated by this converter are inconsistent with those labeled in the speech corpus on which the TTS system is built, or when a particular expressive style is desired. To address this problem, the present work automatically adapts generated pronunciations to a given style by training a phoneme-to-phoneme conditional random field (CRF). More precisely, it investigates (i) the choice of optimal features among acoustic, articulatory, phonological and linguistic ones, and (ii) the selection of a minimal data size to train the CRF. As a case study, adaptation to a TTS-dedicated speech corpus is performed. Cross-validation experiments show that small training corpora can be used without much degradation in performance. Beyond improving TTS quality, these results open interesting perspectives for more complex adaptation scenarios towards expressive speech synthesis.
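
The training-size question in point (ii) can be illustrated with a small cross-validated learning curve. The sketch below is not the authors' method: it replaces the CRF with a hypothetical context-free substitution table (each canonical phoneme maps to its most frequent realization), assumes canonical and realized sequences are already aligned one-to-one with `_` marking a deletion, and runs on invented toy data. All function names are illustrative.

```python
from collections import Counter, defaultdict


def train_mapping(pairs):
    """Stand-in for the CRF (assumption): map each canonical phoneme
    to the realization it is most often aligned with."""
    counts = defaultdict(Counter)
    for canonical, realized in pairs:
        for c, r in zip(canonical, realized):
            counts[c][r] += 1
    return {c: ctr.most_common(1)[0][0] for c, ctr in counts.items()}


def adapt(model, canonical):
    """Rewrite a canonical sequence; unseen phonemes are left unchanged."""
    return [model.get(p, p) for p in canonical]


def phoneme_error_rate(model, pairs):
    """Fraction of aligned positions where the prediction differs
    from the realized phoneme."""
    errors = total = 0
    for canonical, realized in pairs:
        predicted = adapt(model, canonical)
        errors += sum(p != r for p, r in zip(predicted, realized))
        total += len(realized)
    return errors / total if total else 0.0


def learning_curve(pairs, sizes, k=5):
    """k-fold cross-validation: for each size n, train on the first n
    utterances of the training folds and score on the held-out fold."""
    folds = [pairs[i::k] for i in range(k)]
    curve = {}
    for n in sizes:
        rates = []
        for i in range(k):
            train = [p for j, f in enumerate(folds) if j != i for p in f]
            model = train_mapping(train[:n])
            rates.append(phoneme_error_rate(model, folds[i]))
        curve[n] = sum(rates) / k
    return curve


# Invented toy corpus: the schwa is always deleted in the realized form.
toy = [(["p", "ə", "t", "i"], ["p", "_", "t", "i"])] * 10
curve = learning_curve(toy, sizes=[0, 2, 8], k=5)
```

Reading `curve` from the smallest size upward shows where the error rate stops improving, which is one simple way to pick a minimal training size. A real CRF would also condition on neighbouring phonemes and on the feature groups listed in point (i).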

Notes

1. http://www-expression.irisa.fr/demos/: Corpus-specific adaptation.

Acknowledgments

This study was carried out under the ANR (French National Research Agency) project SynPaFlex ANR-15-CE23-0015.

Author information

Authors and Affiliations

IRISA/University of Rennes 1, 6 Rue de Kérampont, 22300, Lannion, France
Marie Tahon, Raheel Qader, Gwénolé Lecorvé & Damien Lolive

Corresponding author

Correspondence to Marie Tahon.

Editor information

Editors and Affiliations

University of West Bohemia, Plzen, Czech Republic
Pavel Král
Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide

About this paper

Cite this paper

Tahon, M., Qader, R., Lecorvé, G., Lolive, D. (2016). Optimal Feature Set and Minimal Training Size for Pronunciation Adaptation in TTS. In: Král, P., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2016. Lecture Notes in Computer Science, vol 9918. Springer, Cham. https://doi.org/10.1007/978-3-319-45925-7_9

DOI: https://doi.org/10.1007/978-3-319-45925-7_9
Published: 21 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45924-0
Online ISBN: 978-3-319-45925-7
eBook Packages: Computer Science, Computer Science (R0)