Abstract
Text-to-speech (TTS) systems rely on a grapheme-to-phoneme converter built to produce canonical, or statically stylized, pronunciations. Hence, TTS quality drops when the phoneme sequences generated by this converter are inconsistent with those labeled in the speech corpus on which the TTS system is built, or when a particular expressivity is desired. To solve this problem, the present work aims to automatically adapt generated pronunciations to a given style by training a phoneme-to-phoneme conditional random field (CRF). Specifically, our work investigates (i) the choice of optimal features among acoustic, articulatory, phonological and linguistic ones, and (ii) the selection of a minimal data size to train the CRF. As a case study, adaptation to a TTS-dedicated speech corpus is performed. Cross-validation experiments show that small training corpora can be used without substantially degrading performance. Apart from improving TTS quality, these results open interesting perspectives for more complex adaptation scenarios towards expressive speech synthesis.
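To make the phoneme-to-phoneme setting more concrete, the sketch below shows one way the input to such a model could be prepared and scored. It is an illustration under stated assumptions, not the authors' implementation: the function names (`phoneme_features`, `edit_distance`, `phoneme_error_rate`) are hypothetical, the only features used are neighbouring phoneme identities (a simple stand-in for the acoustic, articulatory, phonological and linguistic features studied in the paper), and the CRF training step itself is omitted. The error rate shown is a generic edit-distance measure between two phoneme sequences, chosen here for illustration.

```python
from typing import Dict, List, Sequence


def phoneme_features(canonical: Sequence[str], i: int, window: int = 2) -> Dict[str, str]:
    """Describe position i of a canonical phoneme sequence by its neighbours.

    Hypothetical feature template: only phoneme identities in a context
    window are used; "<s>" and "</s>" pad the sequence boundaries.
    """
    feats = {"bias": "1", "ph[0]": canonical[i]}
    for offset in range(1, window + 1):
        left, right = i - offset, i + offset
        feats[f"ph[-{offset}]"] = canonical[left] if left >= 0 else "<s>"
        feats[f"ph[+{offset}]"] = canonical[right] if right < len(canonical) else "</s>"
    return feats


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Total edit distance divided by the total number of reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Illustrative data: a canonical pronunciation with a schwa and a realised
# pronunciation in which the schwa is dropped.
canonical = ["p", "@", "t", "i"]
realised = ["p", "t", "i"]

features = [phoneme_features(canonical, i, window=1) for i in range(len(canonical))]
per = phoneme_error_rate([realised], [canonical])
```

In a full pipeline, each feature dictionary would be paired with the realised phoneme (or a deletion symbol) aligned to that position, and the resulting sequences would be passed to a CRF toolkit. Repeating the evaluation with progressively smaller training subsets is one way to explore the minimal training size question raised above.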
Notes
1. http://www-expression.irisa.fr/demos/: Corpus-specific adaptation.
Acknowledgments
This study was carried out under the ANR (French National Research Agency) project SynPaFlex (ANR-15-CE23-0015).
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Tahon, M., Qader, R., Lecorvé, G., Lolive, D. (2016). Optimal Feature Set and Minimal Training Size for Pronunciation Adaptation in TTS. In: Král, P., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2016. Lecture Notes in Computer Science, vol 9918. Springer, Cham. https://doi.org/10.1007/978-3-319-45925-7_9
DOI: https://doi.org/10.1007/978-3-319-45925-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45924-0
Online ISBN: 978-3-319-45925-7
eBook Packages: Computer Science (R0)