Text-To-Speech Synthesis

Quality of Experience

Abstract

This chapter addresses the quality experienced when listening to speech generated from text by state-of-the-art synthesis systems. Such systems are used, e.g., in information and navigation systems, but also for generating audiobooks. We describe both auditory evaluation methods and instrumental models that predict perceived QoE. Besides overall perceived quality, we focus on perceptual quality features that can be used for diagnosis and system optimization.

Notes

  1. An extensive collection of speech produced by German-speaking synthesizers can be found in [4].

References

  1. ASA S3.2-2009 (2009) American national standard method for measuring the intelligibility of speech over communication systems. American National Standards of the Acoustical Society of America, Washington

  2. Benoit C, Grice M, Hazan V (1996) The SUS test: a method for the assessment of text-to-speech synthesis intelligibility using semantically unpredictable sentences. Speech Communication 18(4):381–392

  3. Black AW, Taylor PA (1994) CHATR: a generic speech synthesis system. In: COLING 1994, vol 2. pp 983–986

  4. Burkhardt F (2013) Comparison of German TTS-systems. Cited 20 Apr 2013. http://syntheticspeech.de/index.html

  5. Cernak M, Rusko M (2005) An evaluation of synthetic speech using the PESQ measure. In: Proceedings of forum acusticum, Budapest, Hungary, pp 2725–2728

  6. Chu M, Peng H (2001) An objective measure for estimating MOS of synthesized speech. In: Proceedings of the 7th international conference on speech communication and technology (Eurospeech 2001), Aalborg, Denmark, pp 2087–2090

  7. Côté N (2011) Integral and diagnostic intrusive prediction of speech quality. Springer, Heidelberg

  8. Falk TH, Möller S (2008) Towards signal-based instrumental quality diagnosis for text-to-speech systems. IEEE Signal Processing Letters 15:781–784

  9. Fujisaki H (1981) Dynamic characteristics of voice fundamental frequency in speech and singing. Acoustical analysis and physiological interpretations. In: STL-QPSR, vol 22. pp 1–20

  10. Gibbon D, Moore R, Winski R (1997) Handbook of standards and resources for spoken language systems. De Gruyter Mouton, Berlin, Boston

  11. Hinterleitner F, Möller S, Norrenbrock C, Heute U (2011) Perceptual quality dimensions of text-to-speech systems. In: Proceedings of the 12th annual conference of the international speech communication association (Interspeech 2011), Florence, Italy, pp 2177–2180

  12. Hinterleitner F, Neitzel G, Möller S, Norrenbrock C (2011) An evaluation protocol for the subjective assessment of text-to-speech in audiobook reading tasks. In: Proceedings of the Blizzard challenge workshop, Florence, Italy

  13. Hinterleitner F, Zabel S, Möller S, Leutelt L, Norrenbrock C (2011) Predicting the quality of synthesized speech using reference-based prediction measures. In: Proceedings of the 22nd Konferenz Elektronische Sprachsignalverarbeitung (ESSV 2011), Aachen, Germany, pp 99–106

  14. Hinterleitner F, Norrenbrock C, Möller S (2012) On the use of Fujisaki parameters for the quality prediction of synthetic speech. In: Proceedings of the 23rd Konferenz Elektronische Sprachsignalverarbeitung (ESSV 2012), Cottbus, Germany, pp 112–119

  15. Hinterleitner F, Norrenbrock C, Möller S, Heute U (2012) What makes this voice sound so bad? A multidimensional analysis of state-of-the-art text-to-speech systems. In: Proceedings of the 2012 IEEE workshop on spoken language technology (SLT), Miami, USA, pp 240–245

  16. Hinterleitner F, Norrenbrock C, Möller S (2013) Perceptual quality dimensions of text-to-speech in audiobook reading tasks. In: Proceedings of the 24th Konferenz Elektronische Sprachsignalverarbeitung (ESSV 2013), Bielefeld, Germany, pp 44–49

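  17. Hinterleitner F, Norrenbrock C, Möller S, Heute U (2013) Predicting the quality of text-to-speech systems from a large-scale feature set. In: Proceedings of the 14th annual conference of the international speech communication association (Interspeech 2013), Lyon, France, pp 383–387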

  18. ITU-T Recommendation P.85 (1994) A method for subjective performance assessment of the quality of speech voice output devices. International Telecommunication Union, Geneva

  19. ITU-T Recommendation P.563 (2004) Single ended method for objective speech quality assessment in narrow-band telephony. International Telecommunication Union, Geneva

  20. ITU-T Recommendation P.862 (2001) Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva

  21. ITU-T Recommendation P.863 (2011) Perceptual objective listening quality assessment (POLQA). International Telecommunication Union, Geneva

  22. Jekosch U (1993) Speech quality assessment and evaluation. In: Proceedings of Eurospeech, Berlin, Germany, pp 1387–1394

  23. Klatt DH (1980) Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America 67(3):971–995

  24. Kraft V, Portele T (1995) Quality evaluation of five German speech synthesis systems. Acta Acustica 3:351–365

  25. Mariniak A (1993) A global framework for the assessment of synthetic speech without subjects. In: Proceedings of the 3rd European conference on speech processing and technology (Eurospeech), Berlin, Germany, pp 1683–1686

  26. Mayo C, Clark RAJ, King S (2005) Listener’s weighting of acoustic cues to synthetic speech naturalness: a multidimensional scaling analysis. In: Proceedings of the 6th annual conference of the international speech communication association (Interspeech), Lisbon, Portugal, pp 1725–1728

  27. Minker W, Lee GG, Mariani J, Nakamura S (2010) Salient features for anger recognition in German and English IVR portals. Spoken dialogue systems technology and design. Springer

  28. Möller S, Hinterleitner F (2013) ITU-T Contribution COM 12–37: proposal for an appendix to Rec. P.85 of the evaluation of speech output for audiobook reading tasks. Deutsche Telekom AG, ITU-T SG12 meeting 19–28 Mar 2013, Geneva

  29. Möller S, Hinterleitner F, Falk TH, Polzehl T (2010) Comparison of approaches for instrumentally predicting the quality of text-to-speech systems. In: Proceedings of the 11th annual conference of the international speech communication association (Interspeech 2010), Makuhari, Japan, pp 1325–1328

  30. Moulines E, Charpentier F (1990) Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication 9(5/6):453–467

  31. Norrenbrock C, Hinterleitner F, Heute U, Möller S (2012) Instrumental assessment of prosodic quality for text-to-speech signals. IEEE Signal Processing Letters 19:255–258

  32. Norrenbrock C, Hinterleitner F, Heute U, Möller S (2012) Quality analysis of macroprosodic \(F_{0}\) dynamics in text-to-speech signals. In: Proceedings of the 13th annual conference of the international speech communication association (Interspeech 2012), Portland, USA, pp 454–457

  33. Norrenbrock C, Hinterleitner F, Heute U, Möller S (2012) Towards perceptual quality modeling of synthesized audiobooks. In: Proceedings of the blizzard challenge workshop, Portland, USA

  34. Rabiner L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2):257–286

  35. Sityaev D, Knill K, Burrows T (2006) Comparison of the ITU-T P.85 standard to other methods for the evaluation of text-to-speech systems. In: Proceedings of the 9th international conference on spoken language processing (Interspeech), Pittsburgh, USA, pp 1077–1080

  36. Tokuda K, Zen H, Black AW (2002) An HMM-based speech synthesis system applied to English. In: Proceedings of 2002 IEEE speech synthesis workshop, Santa Monica, USA, pp 227–230

  37. Tsogo L, Masson MH, Bardot A (2000) Multidimensional scaling methods for many-objects sets: a review. Multivariate Behavioral Research 35(3):307–319

  38. Viswanathan M, Viswanathan M (2005) Measuring speech quality for text-to-speech systems: development and assessment of a modified mean opinion score (MOS) scale. Computer Speech and Language 19(1):55–83

Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft (DFG), grants MO-1138/11-1, MO-1138/11-2, HE-4465/4-1 and HE-4465/4-2.

Author information

Corresponding author

Correspondence to Florian Hinterleitner.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Hinterleitner, F., Norrenbrock, C., Möller, S., Heute, U. (2014). Text-To-Speech Synthesis. In: Möller, S., Raake, A. (eds) Quality of Experience. T-Labs Series in Telecommunication Services. Springer, Cham. https://doi.org/10.1007/978-3-319-02681-7_13

  • DOI: https://doi.org/10.1007/978-3-319-02681-7_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-02680-0

  • Online ISBN: 978-3-319-02681-7

  • eBook Packages: Engineering, Engineering (R0)
