Abstract
This chapter builds on research projects at KTH concerned with the development of multi-modal speech synthesis. The chosen synthesis strategy is model-based parametric speech synthesis for both the auditory and the visual modality. Both modalities are controlled from the same rule-synthesis framework, and the visual model can also be controlled directly for aspects that are not phonetic in nature. This flexible set-up has made it possible to exploit the technology in several applications, such as spoken dialogue systems. In the Teleface project, the synthetic face is evaluated as lip-reading support for hard-of-hearing persons; within that project, several studies of multi-modal speech intelligibility have been carried out using different combinations of natural and synthetic, auditory and visual speech.
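The central architectural idea in the abstract, a single phonetic rule pass emitting synchronized parameter tracks for both the auditory and the visual channel, can be illustrated with a minimal sketch. All names, parameters, and values below are illustrative inventions for this sketch, not the actual KTH rule system; a real rule synthesizer would handle many more parameters and a proper coarticulation model.

```python
# Minimal sketch of the shared-rule-framework idea: one phonetic rule pass
# produces frame-synchronized parameter tracks for the auditory channel
# (here: one formant target) and the visual channel (here: jaw opening).
# The rule table and parameter values are hypothetical.

from dataclasses import dataclass

@dataclass
class Segment:
    phone: str
    dur_ms: int

# Hypothetical per-phone targets: first formant (Hz) and jaw opening (0..1).
RULES = {
    "a": {"F1": 700.0, "jaw": 0.8},
    "i": {"F1": 300.0, "jaw": 0.2},
    "m": {"F1": 250.0, "jaw": 0.1},
}

def synthesize_tracks(segments, frame_ms=10):
    """Expand rule targets into frame-by-frame tracks for both modalities.

    Each frame interpolates linearly toward the next segment's target,
    a crude stand-in for the coarticulation handling a real rule
    synthesizer would apply. Both modalities share one timeline, so
    audio and face stay in sync by construction.
    """
    track = []
    for idx, seg in enumerate(segments):
        cur = RULES[seg.phone]
        nxt = RULES[segments[idx + 1].phone] if idx + 1 < len(segments) else cur
        n = max(1, seg.dur_ms // frame_ms)
        for f in range(n):
            t = f / n  # position within the segment, 0..1
            track.append({
                "F1": cur["F1"] + t * (nxt["F1"] - cur["F1"]),
                "jaw": cur["jaw"] + t * (nxt["jaw"] - cur["jaw"]),
            })
    return track

frames = synthesize_tracks([Segment("m", 50), Segment("a", 100), Segment("i", 80)])
print(len(frames), frames[0], frames[-1])
```

Because both parameter streams come out of the same rule pass, non-phonetic control of the face (e.g. eyebrow raising in a dialogue system) would be layered on as an independent parameter channel rather than derived from the phonetic rules.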
Copyright information
© 1999 Springer-Verlag London Limited
Cite this paper
Granström, B. (1999). Multi-modal Speech Synthesis with Applications. In: Chollet, G., Di Benedetto, M.G., Esposito, A., Marinaro, M. (eds) Speech Processing, Recognition and Artificial Neural Networks. Springer, London. https://doi.org/10.1007/978-1-4471-0845-0_18
DOI: https://doi.org/10.1007/978-1-4471-0845-0_18
Publisher Name: Springer, London
Print ISBN: 978-1-85233-094-1
Online ISBN: 978-1-4471-0845-0
eBook Packages: Springer Book Archive