Abstract

This chapter builds on research projects at KTH concerned with the development of multi-modal speech synthesis. The synthesis strategy chosen is model-based parametric speech synthesis for both the auditory and the visual modality. Both modalities are controlled from the same rule synthesis framework, and the visual model can also be controlled directly, for aspects that are not phonetic in nature. This flexible set-up has made it possible to exploit the technology in several different applications, such as spoken dialogue systems. In the Teleface project the synthetic face is evaluated as a lip-reading support for hard-of-hearing persons. Within this project, several studies of multi-modal speech intelligibility have been carried out using different combinations of natural/synthetic and auditory/visual speech.
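
The dual-channel control architecture sketched in the abstract, where one phonetic rule pass drives both the auditory and visual parameter tracks while a separate direct channel handles non-phonetic facial gestures, can be illustrated roughly as follows. This is a minimal Python sketch, not the KTH implementation: the class names, rule tables, and parameters (F1, jaw_opening, eyebrow_raise) are hypothetical placeholders chosen for the example.

```python
# Minimal sketch of a dual-channel audio-visual synthesis architecture.
# All names and parameter values are illustrative assumptions, not the
# actual KTH rule framework or parameter inventory.

from dataclasses import dataclass, field


@dataclass
class ParameterTrack:
    """A time-stamped trajectory for one synthesis parameter."""
    name: str
    points: list = field(default_factory=list)  # (time_s, value) pairs

    def set(self, time_s: float, value: float) -> None:
        self.points.append((time_s, value))


class RuleSynthesisFramework:
    """One phonetic rule pass produces parameter tracks for BOTH modalities."""

    def synthesize(self, phones):
        audio, visual = {}, {}
        t = 0.0
        for phone, dur in phones:
            # Auditory parameters (e.g. a formant target) per phone.
            audio.setdefault("F1", ParameterTrack("F1")).set(
                t, self.f1_target(phone))
            # Visual parameters (e.g. jaw opening) from the SAME rule pass,
            # so lips and audio stay synchronized by construction.
            visual.setdefault("jaw_opening", ParameterTrack("jaw_opening")).set(
                t, self.jaw_target(phone))
            t += dur
        return audio, visual

    def f1_target(self, phone):  # placeholder rule table
        return {"a": 700.0, "i": 300.0}.get(phone, 500.0)

    def jaw_target(self, phone):  # placeholder rule table
        return {"a": 0.8, "i": 0.2}.get(phone, 0.4)


class FaceModel:
    """Parametric face: accepts rule-driven tracks plus direct control."""

    def __init__(self):
        self.tracks = {}

    def apply(self, visual_tracks):
        self.tracks.update(visual_tracks)

    def direct_control(self, name, time_s, value):
        # Non-phonetic gestures (eyebrow raise, head nod) bypass the rules.
        self.tracks.setdefault(name, ParameterTrack(name)).set(time_s, value)


# Usage: one rule pass yields both modalities; a conversational eyebrow
# raise is then layered on through the direct channel.
rules = RuleSynthesisFramework()
audio, visual = rules.synthesize([("a", 0.12), ("i", 0.10)])
face = FaceModel()
face.apply(visual)
face.direct_control("eyebrow_raise", 0.05, 1.0)
```

Deriving both parameter sets from a single rule pass keeps the face synchronized with the audio by construction, while the direct channel lets an application such as a dialogue system add conversational gestures without touching the phonetic rules.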




Copyright information

© 1999 Springer-Verlag London Limited

Cite this paper

Granström, B. (1999). Multi-modal Speech Synthesis with Applications. In: Chollet, G., Di Benedetto, M.G., Esposito, A., Marinaro, M. (eds) Speech Processing, Recognition and Artificial Neural Networks. Springer, London. https://doi.org/10.1007/978-1-4471-0845-0_18

  • DOI: https://doi.org/10.1007/978-1-4471-0845-0_18

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-85233-094-1

  • Online ISBN: 978-1-4471-0845-0
