Expressive Speech Synthesis: Past, Present, and Possible Futures

Abstract

Approaches to adding expressivity to synthetic speech have changed considerably over the last 20 years. Early systems, including formant and diphone systems, were built around “explicit control” models; early unit selection systems adopted a “playback” approach. Currently, various approaches are being pursued to increase flexibility of expression while maintaining the quality of state-of-the-art systems, among them a new “implicit control” paradigm in statistical parametric speech synthesis, which provides control over expressivity by combining and interpolating between statistical models trained on different expressive databases. The present chapter provides an overview of past and present approaches and ventures a look into possible future developments.

Copyright information

© 2009 Springer-Verlag London Limited

About this chapter

Cite this chapter

Schröder, M. (2009). Expressive Speech Synthesis: Past, Present, and Possible Futures. In: Tao, J., Tan, T. (eds) Affective Information Processing. Springer, London. https://doi.org/10.1007/978-1-84800-306-4_7

  • DOI: https://doi.org/10.1007/978-1-84800-306-4_7

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84800-305-7

  • Online ISBN: 978-1-84800-306-4

  • eBook Packages: Computer Science (R0)
