Abstract
Approaches to adding expressivity to synthetic speech have changed considerably over the last 20 years. Early systems, including formant and diphone systems, were centered on "explicit control" models; early unit selection systems adopted a "playback" approach. Currently, various approaches are being pursued to increase expressive flexibility while maintaining the quality of state-of-the-art systems, among them a new "implicit control" paradigm in statistical parametric speech synthesis, which provides control over expressivity by combining and interpolating between statistical models trained on different expressive databases. The present chapter provides an overview of past and present approaches, and ventures a look at possible future developments.
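The "implicit control" idea mentioned above can be illustrated with a minimal sketch: two sets of statistical model parameters, trained on different expressive databases, are blended by linear interpolation to obtain graded expressivity. The function name, the parameter vectors, and their interpretation as per-state log-F0 means are all illustrative assumptions, not the method of any particular system described in the chapter.

```python
# Hedged sketch of style interpolation in statistical parametric synthesis.
# All names and numbers below are hypothetical; real systems interpolate
# full Gaussian parameter sets (means, variances, durations) per HMM state.

def interpolate_models(neutral, expressive, alpha):
    """Linearly blend corresponding model parameters.

    alpha = 0.0 reproduces the neutral style, alpha = 1.0 the expressive
    one; intermediate values yield graded degrees of expressivity.
    """
    return [(1.0 - alpha) * n + alpha * e
            for n, e in zip(neutral, expressive)]

# Illustrative parameter vectors (e.g., mean log-F0 per model state).
neutral_means = [4.8, 4.9, 5.0]      # calm style: lower pitch targets
expressive_means = [5.3, 5.6, 5.4]   # aroused style: raised pitch targets

# A half-way blend lands between the two trained styles.
half_expressive = interpolate_models(neutral_means, expressive_means, 0.5)
```

The attraction of this scheme, as the abstract notes, is that expressivity is controlled by choosing and weighting trained models rather than by hand-written rules (explicit control) or by replaying fixed recordings (playback).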
Copyright information
© 2009 Springer-Verlag London Limited
Cite this chapter
Schröder, M. (2009). Expressive Speech Synthesis: Past, Present, and Possible Futures. In: Tao, J., Tan, T. (eds) Affective Information Processing. Springer, London. https://doi.org/10.1007/978-1-84800-306-4_7
Print ISBN: 978-1-84800-305-7
Online ISBN: 978-1-84800-306-4