Expressive Speech Synthesis: Past, Present, and Possible Futures

Schröder, Marc

doi:10.1007/978-1-84800-306-4_7

Abstract

Approaches to adding expressivity to synthetic speech have changed considerably over the last 20 years. Early systems, including formant and diphone systems, were built around “explicit control” models; early unit selection systems adopted a “playback” approach. Currently, various approaches are being pursued to increase flexibility of expression while maintaining the quality of state-of-the-art systems, among them a new “implicit control” paradigm in statistical parametric speech synthesis, which provides control over expressivity by combining and interpolating between statistical models trained on different expressive databases. This chapter gives an overview of past and present approaches and ventures a look at possible future developments.

Author information

Authors and Affiliations

DFKI GmbH, Saarbrücken, Germany
Marc Schröder

Editor information

Editors and Affiliations

Institute of Automation, Chinese Academy of Sciences, 95 Zhongguancun East Road, Haidian, Beijing, 100080, P.R. China
Jianhua Tao & Tieniu Tan BSc, MSc, PhD

Cite this chapter

Schröder, M. (2009). Expressive Speech Synthesis: Past, Present, and Possible Futures. In: Tao, J., Tan, T. (eds) Affective Information Processing. Springer, London. https://doi.org/10.1007/978-1-84800-306-4_7

DOI: https://doi.org/10.1007/978-1-84800-306-4_7
Publisher Name: Springer, London
Print ISBN: 978-1-84800-305-7
Online ISBN: 978-1-84800-306-4
eBook Packages: Computer Science, Computer Science (R0)
