Statistical Pronunciation Adaptation for Spontaneous Speech Synthesis

  • Raheel Qader
  • Gwénolé Lecorvé
  • Damien Lolive
  • Marie Tahon
  • Pascale Sébillot
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10415)


To bring more expressiveness into text-to-speech systems, this paper presents a new pronunciation variant generation method which works by adapting standard, i.e., dictionary-based, pronunciations to a spontaneous style. Its strength and originality lie in exploiting a wide range of linguistic, articulatory and prosodic features, and in using a probabilistic machine learning framework, namely conditional random fields and phoneme-based n-gram models. Extensive experiments on the Buckeye corpus of English conversational speech demonstrate the effectiveness of the approach through objective and perceptual evaluations.
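
As a rough illustration of the phoneme-based n-gram component mentioned in the abstract, the following Python sketch trains a phoneme bigram model on a few spontaneous-style pronunciations and uses it to rank candidate variants of a word. It is not the authors' implementation: the toy phoneme sequences, the function names (`train_bigram`, `log_prob`, `rank_variants`) and the add-one smoothing are all assumptions made for illustration, and the conditional random field stage that would propose the candidates is not shown.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence-boundary padding symbols (assumed convention)


def train_bigram(sequences):
    """Collect history and bigram counts over boundary-padded phoneme sequences."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = {EOS}
    for seq in sequences:
        padded = [BOS] + list(seq) + [EOS]
        vocab.update(seq)
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts, vocab


def log_prob(seq, model):
    """Add-one smoothed bigram log-probability of one phoneme sequence.

    Phonemes never seen in training simply receive the smoothed floor
    probability; a real system would use a proper smoothing scheme.
    """
    history_counts, bigram_counts, vocab = model
    padded = [BOS] + list(seq) + [EOS]
    v = len(vocab)
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        numerator = bigram_counts[(prev, cur)] + 1
        denominator = history_counts[prev] + v
        total += math.log(numerator / denominator)
    return total


def rank_variants(candidates, model):
    """Sort candidate pronunciations from most to least likely under the model."""
    return sorted(candidates, key=lambda c: log_prob(c, model), reverse=True)


# Hypothetical toy data: spontaneous-style realizations of "probably".
spontaneous_training = [
    ["p", "r", "aa", "b", "l", "iy"],
    ["p", "r", "aa", "l", "iy"],
    ["p", "r", "aa", "b", "l", "iy"],
]
canonical = ["p", "r", "aa", "b", "ah", "b", "l", "iy"]  # dictionary-like form
reduced = ["p", "r", "aa", "b", "l", "iy"]               # reduced form

model = train_bigram(spontaneous_training)
ranking = rank_variants([canonical, reduced], model)
```

With this toy data, the reduced form is ranked above the dictionary-like form because all of its phoneme transitions were observed in the spontaneous training sequences. Unnormalized log-probabilities also favour shorter sequences, so a practical system would likely need to account for length when comparing variants.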


Keywords: Speech synthesis, Spontaneous speech, Pronunciation modeling, Statistical adaptation, Conditional random field



This study was carried out under the ANR (French National Research Agency) project SynPaFlex (ANR-15-CE23-0015).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Raheel Qader (1)
  • Gwénolé Lecorvé (1)
  • Damien Lolive (1)
  • Marie Tahon (1)
  • Pascale Sébillot (2)
  1. IRISA/University of Rennes 1 (ENSSAT), Lannion, France
  2. IRISA/INSA Rennes, Rennes, France
