Statistical Pronunciation Adaptation for Spontaneous Speech Synthesis

Qader, Raheel; Lecorvé, Gwénolé; Lolive, Damien; Tahon, Marie; Sébillot, Pascale

doi:10.1007/978-3-319-64206-2_11

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10415)

Included in the following conference series: International Conference on Text, Speech, and Dialogue

Abstract

To bring more expressiveness into text-to-speech systems, this paper presents a new pronunciation variant generation method which works by adapting standard, i.e., dictionary-based, pronunciations to a spontaneous style. Its strength and originality lie in exploiting a wide range of linguistic, articulatory and prosodic features, and in using a probabilistic machine learning framework, namely conditional random fields and phoneme-based n-gram models. Extensive experiments on the Buckeye corpus of English conversational speech demonstrate the effectiveness of the approach through objective and perceptual evaluations.
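
As a rough illustration of the phoneme-based n-gram component mentioned above, the following minimal sketch rescores candidate pronunciations with a bigram model. It is not the authors' implementation: the add-one smoothing, the length normalisation, the toy ARPAbet-like training sequences and the helper names (`train_bigram`, `log_prob`, `best_variant`) are all assumptions made for this example, and the CRF stage that would propose the candidates is not reproduced.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence-boundary symbols (illustrative choice)


def train_bigram(sequences):
    """Count phoneme histories and bigrams over padded training sequences."""
    histories, bigrams = Counter(), Counter()
    vocab = {EOS}
    for seq in sequences:
        padded = [BOS] + list(seq) + [EOS]
        vocab.update(seq)
        for prev, cur in zip(padded, padded[1:]):
            histories[prev] += 1
            bigrams[(prev, cur)] += 1
    return histories, bigrams, vocab


def log_prob(seq, model):
    """Add-one smoothed bigram log-probability of a phoneme sequence."""
    histories, bigrams, vocab = model
    padded = [BOS] + list(seq) + [EOS]
    v = len(vocab)
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (histories[prev] + v))
        for prev, cur in zip(padded, padded[1:])
    )


def best_variant(candidates, model):
    """Pick the candidate with the highest per-transition log-probability.

    Dividing by the number of transitions is an assumption of this sketch,
    used so that shorter variants are not favoured merely for being short.
    """
    return max(candidates, key=lambda s: log_prob(s, model) / (len(s) + 1))


# Hypothetical toy data: two reduced and one canonical realisation of one word.
reduced = ["p", "r", "aa", "b", "l", "iy"]
canonical = ["p", "r", "aa", "b", "ax", "b", "l", "iy"]
model = train_bigram([reduced, reduced, canonical])
chosen = best_variant([canonical, reduced], model)
```

With this toy training set the reduced form is selected, because the model has seen it more often. A real system would train on aligned canonical and realised transcriptions from a corpus such as Buckeye and would likely use a higher-order model.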

Notes

1. CRFs allow dependencies between predicted phonemes, but preliminary work suggested that a separate phonological model better avoids overfitting the training data.
2. Binomial test with \(\alpha = 0.1\), with votes for “No preference” spread equally over A and B, following the methodology proposed in [23].
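
The significance test described in note 2 can be sketched as follows. This is one possible reading, not the exact procedure of [23]: the two-sided alternative, the dropping of one leftover vote when the “No preference” count is odd, and the function name `ab_preference_test` are assumptions made for this example.

```python
from math import comb


def ab_preference_test(votes_a, votes_b, no_pref, alpha=0.1):
    """Exact two-sided binomial test of A vs. B under H0: p = 0.5.

    "No preference" votes are spread equally over A and B. If their count is
    odd, the single leftover vote is dropped (an assumption of this sketch).
    Returns (p_value, significant).
    """
    shared = no_pref // 2
    a, b = votes_a + shared, votes_b + shared
    n = a + b
    k = max(a, b)
    # Upper-tail probability P(X >= k) for X ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # The distribution is symmetric at p = 0.5, so the two-sided p-value
    # is twice the tail, capped at 1.
    p_value = min(1.0, 2 * tail)
    return p_value, p_value < alpha


# Hypothetical vote counts, not taken from the paper.
p_value, significant = ab_preference_test(votes_a=30, votes_b=10, no_pref=10)
```

Spreading the neutral votes evenly enlarges the sample without shifting the balance between A and B, which makes the test more conservative than discarding those votes.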

References

1. Tajchman, G., Foster, E., Jurafsky, D.: Building multiple pronunciation models for novel words using exploratory computational phonology. In: Proceedings of Eurospeech (1995)
2. Giachin, E., Rosenberg, A., Lee, C.H.: Word juncture modeling using phonological rules for HMM-based continuous speech recognition. In: Proceedings of ICASSP (1990)
3. Oshika, B.T., Zue, V.W., Weeks, R.V., Neu, H., Aurbach, J.: The role of phonological rules in speech understanding research. IEEE Trans. Acoust. Speech Signal Process. 23, 104–112 (1975)
4. Goronzy, S., Rapp, S., Kompe, R.: Generating non-native pronunciation variants for lexicon adaptation. Speech Commun. 42(1), 109–123 (2004)
5. Vazirnezhad, B., Almasganj, F., Ahadi, S.M.: Hybrid statistical pronunciation models designed to be trained by a medium-size corpus. Comput. Speech Lang. 23, 1–24 (2009)
6. Dilts, P.C.: Modelling phonetic reduction in a corpus of spoken English using random forests and mixed-effects regression. Ph.D. thesis, University of Alberta (2013)
7. Chen, K., Hasegawa-Johnson, M.: Modeling pronunciation variation using artificial neural networks for English spontaneous speech. In: Proceedings of Interspeech (2004)
8. Karanasou, P., Yvon, F., Lavergne, T., Lamel, L.: Discriminative training of a phoneme confusion model for a dynamic lexicon in ASR. In: Proceedings of Interspeech (2013)
9. Prahallad, K., Black, A.W., Mosur, R.: Sub-phonetic modeling for capturing pronunciation variations for conversational speech synthesis. In: Proceedings of ICASSP (2006)
10. Qader, R., Lecorvé, G., Lolive, D., Sébillot, P.: Probabilistic speaker pronunciation adaptation for spontaneous speech synthesis using linguistic features. In: Dediu, A.-H., Martín-Vide, C., Vicsi, K. (eds.) SLSP 2015. LNCS (LNAI), vol. 9449, pp. 229–241. Springer, Cham (2015). doi:10.1007/978-3-319-25789-1_22
11. Tahon, M., Qader, R., Lecorvé, G., Lolive, D.: Improving TTS with corpus-specific pronunciation adaptation. In: Proceedings of Interspeech (2016)
12. Bell, A., Brenier, J.M., Gregory, M., Girand, C., Jurafsky, D.: Predictability effects on durations of content and function words in conversational English. J. Mem. Lang. 60, 92–111 (2009)
13. Bates, R., Ostendorf, M.: Modeling pronunciation variation in conversational speech using prosody. In: Proceedings of ISCA Tutorial and Research Workshop (ITRW) on Pronunciation Modeling and Lexicon Adaptation for Spoken Language Technology (2002)
14. Livescu, K., Jyothi, P., Fosler-Lussier, E.: Articulatory feature-based pronunciation modeling. Comput. Speech Lang. 36, 165–172 (2016)
15. Rasipuram, R., Doss, M.M.: Articulatory feature based continuous speech recognition using probabilistic lexical modeling. Comput. Speech Lang. 36, 165–172 (2016)
16. Pitt, M.A., Johnson, K., Hume, E., Kiesling, S., Raymond, W.: The Buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability. Speech Commun. 45, 89–95 (2005)
17. Jiampojamarn, S., Kondrak, G., Sherif, T.: Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion. In: Proceedings of NAACL-HLT (2007)
18. Rosti, A.V.I., Matsoukas, S.: Combining outputs from multiple machine translation systems. In: Proceedings of NAACL-HLT (2007)
19. Huet, S., Gravier, G., Sébillot, P.: Morpho-syntactic post-processing of N-best lists for improved French automatic speech recognition. Comput. Speech Lang. 24(4), 663–684 (2010)
20. Stolcke, A., Zheng, J., Wang, W., Abrash, V.: SRILM at sixteen: update and outlook. In: Proceedings of IEEE ASRU Workshop (2011)
21. Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A.W., Tokuda, K.: The HMM-based speech synthesis system (HTS) version 2.0. In: Proceedings of SSW (2007)
22. King, S., Karaiskos, V.: The Blizzard challenge 2012. In: Proceedings of Blizzard Challenge 2012 Workshop (2012)
23. Karhila, R., Remes, U., Kurimo, M.: Noise in HMM-based speech synthesis adaptation: analysis, evaluation methods and experiments. IEEE J. Sel. Top. Signal Process. 8(2), 285–295 (2014)

Acknowledgments

This study was carried out under the ANR (French National Research Agency) project SynPaFlex (ANR-15-CE23-0015).

Author information

Authors and Affiliations

IRISA/University of Rennes 1 (ENSSAT), Lannion, France
Raheel Qader, Gwénolé Lecorvé, Damien Lolive & Marie Tahon
IRISA/INSA Rennes, Rennes, France
Pascale Sébillot

Corresponding author

Correspondence to Gwénolé Lecorvé.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein
University of West Bohemia, Pilsen, Czech Republic
Václav Matoušek

About this paper

Cite this paper

Qader, R., Lecorvé, G., Lolive, D., Tahon, M., Sébillot, P. (2017). Statistical Pronunciation Adaptation for Spontaneous Speech Synthesis. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_11

DOI: https://doi.org/10.1007/978-3-319-64206-2_11
Published: 29 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64205-5
Online ISBN: 978-3-319-64206-2
eBook Packages: Computer Science, Computer Science (R0)
