Abstract
This chapter builds on research projects at KTH concerned with the development of multi-modal speech synthesis. The chosen synthesis strategy is model-based parametric speech synthesis for both the auditory and the visual modality. Both modalities are controlled from the same rule-synthesis framework, and the visual model can also be controlled directly for aspects that are not phonetic in nature. This flexible set-up has made it possible to exploit the technology in several applications, such as spoken dialogue systems. In the Teleface project, the synthetic face is evaluated as lip-reading support for hard-of-hearing persons; within that project, several studies of multi-modal speech intelligibility have been carried out using different combinations of natural and synthetic, auditory and visual speech.
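The central architectural idea in the abstract, a single phonetic rule pass emitting synchronized parameter tracks for both the auditory and the visual channel, can be illustrated with a minimal sketch. All names, parameters, and values below are illustrative inventions for this sketch, not the actual KTH rule system; a real rule synthesizer would handle many more parameters and a proper coarticulation model.

```python
# Minimal sketch of the shared-rule-framework idea: one phonetic rule pass
# produces frame-synchronized parameter tracks for the auditory channel
# (here: one formant target) and the visual channel (here: jaw opening).
# The rule table and parameter values are hypothetical.

from dataclasses import dataclass

@dataclass
class Segment:
    phone: str
    dur_ms: int

# Hypothetical per-phone targets: first formant (Hz) and jaw opening (0..1).
RULES = {
    "a": {"F1": 700.0, "jaw": 0.8},
    "i": {"F1": 300.0, "jaw": 0.2},
    "m": {"F1": 250.0, "jaw": 0.1},
}

def synthesize_tracks(segments, frame_ms=10):
    """Expand rule targets into frame-by-frame tracks for both modalities.

    Each frame interpolates linearly toward the next segment's target,
    a crude stand-in for the coarticulation handling a real rule
    synthesizer would apply. Both modalities share one timeline, so
    audio and face stay in sync by construction.
    """
    track = []
    for idx, seg in enumerate(segments):
        cur = RULES[seg.phone]
        nxt = RULES[segments[idx + 1].phone] if idx + 1 < len(segments) else cur
        n = max(1, seg.dur_ms // frame_ms)
        for f in range(n):
            t = f / n  # position within the segment, 0..1
            track.append({
                "F1": cur["F1"] + t * (nxt["F1"] - cur["F1"]),
                "jaw": cur["jaw"] + t * (nxt["jaw"] - cur["jaw"]),
            })
    return track

frames = synthesize_tracks([Segment("m", 50), Segment("a", 100), Segment("i", 80)])
print(len(frames), frames[0], frames[-1])
```

Because both parameter streams come out of the same rule pass, non-phonetic control of the face (e.g. eyebrow raising in a dialogue system) would be layered on as an independent parameter channel rather than derived from the phonetic rules.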
Copyright information
© 1999 Springer-Verlag London Limited
Cite this paper
Granström, B. (1999). Multi-modal Speech Synthesis with Applications. In: Chollet, G., Di Benedetto, M.G., Esposito, A., Marinaro, M. (eds) Speech Processing, Recognition and Artificial Neural Networks. Springer, London. https://doi.org/10.1007/978-1-4471-0845-0_18
DOI: https://doi.org/10.1007/978-1-4471-0845-0_18
Publisher Name: Springer, London
Print ISBN: 978-1-85233-094-1
Online ISBN: 978-1-4471-0845-0
eBook Packages: Springer Book Archive