Speech Synthesis: Text-To-Speech Conversion and Artificial Voices

Trouvain, Jürgen; Möbius, Bernd

doi:10.1007/978-3-030-02438-3_168

Abstract

The artificial generation of speech has fascinated mankind since ancient times. The robotic-sounding artificial voices of the last century have now been replaced by more natural-sounding voices based on pre-recorded human speech. Significant progress in data processing has led to qualitative leaps in intelligibility and naturalness. Apart from a sizable amount of data from the voice donor, a fully fledged text-to-speech (TTS) synthesizer requires further linguistic resources and natural language processing components. These include dictionaries with information on pronunciation, word prosody, morphological structure, and parts of speech, as well as procedures for automatically chunking texts into smaller parts and for morpho-syntactic parsing. TTS technology can be used in many application domains, for instance as a communicative aid for those who cannot speak or cannot see, and in situations characterized as “hands busy, eyes busy,” often as part of spoken dialog systems. One big remaining challenge is evaluating the quality of synthetic speech output and its appropriateness for the needs of the user. There are also promising developments in speech synthesis that go beyond the purely acoustic channel. Multimodal synthesis includes the visual channel, e.g., in talking heads, whereas silent-speech interfaces and brain-to-speech conversion convert articulatory gestures and brain waves, respectively, into spoken output. Although quality has improved considerably in the last decade, often by processing enormous amounts of data, TTS is today available for only relatively few languages (probably fewer than 50, with a dominance of English). Thus, a major task will be to find or create linguistic resources and make them available for more languages and language varieties.
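
As a rough illustration of the linguistic front end mentioned in the abstract (text chunking followed by dictionary lookup of pronunciation and part of speech), the following minimal Python sketch may help. It is not the authors' system: the lexicon entries, the ARPAbet-style phoneme strings, and the function names `chunk_text`, `tokenize`, and `front_end` are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical toy lexicon: word -> (phonemes, part of speech).
# The phoneme strings are rough ARPAbet-style guesses for illustration only.
LEXICON = {
    "speech": ("S P IY CH", "NOUN"),
    "is": ("IH Z", "VERB"),
    "natural": ("N AE CH ER AH L", "ADJ"),
}


def chunk_text(text):
    """Split raw text into sentence-like chunks after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(chunk):
    """Lowercase a chunk and return its word tokens."""
    return re.findall(r"[a-z']+", chunk.lower())


def front_end(text):
    """Return, per chunk, a list of (word, phonemes, pos) triples.

    Words missing from the lexicon get None for both fields; a real
    system would fall back to letter-to-sound rules at this point.
    """
    result = []
    for chunk in chunk_text(text):
        entries = []
        for word in tokenize(chunk):
            phonemes, pos = LEXICON.get(word, (None, None))
            entries.append((word, phonemes, pos))
        result.append(entries)
    return result
```

Calling `front_end("Speech is natural. Is speech natural?")` yields two chunks of annotated words. The gaps left by out-of-lexicon words hint at why the abstract stresses that such resources must exist for each language before a TTS system can be built for it.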

Author information

Authors and Affiliations

Department of Language Science and Technology, Saarland University, Saarbrücken, Germany
Jürgen Trouvain & Bernd Möbius

Corresponding author

Correspondence to Jürgen Trouvain.

Editor information

Editors and Affiliations

Department of Geography, University of Kentucky, Lexington, KY, USA
Stanley D. Brunn
Research Centre Deutscher Sprachatlas, Philipps University, Marburg, Germany
Roland Kehrein

About this entry

Cite this entry

Trouvain, J., Möbius, B. (2020). Speech Synthesis: Text-To-Speech Conversion and Artificial Voices. In: Brunn, S., Kehrein, R. (eds) Handbook of the Changing World Language Map. Springer, Cham. https://doi.org/10.1007/978-3-030-02438-3_168

DOI: https://doi.org/10.1007/978-3-030-02438-3_168
Published: 23 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02437-6
Online ISBN: 978-3-030-02438-3
eBook Packages: Social Sciences; Reference Module Humanities and Social Sciences; Reference Module Business, Economics and Social Sciences
