
Human-Machine Communication by Audio-Visual Integration

Chapter in Intelligent Multimedia Processing with Soft Computing

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 168)


Abstract

Audio-visual information is integral to human communication. Using audio and visual cues in a complementary way enables more accurate, robust, natural, and friendly communication in real environments. Computers likewise require both kinds of information to realize natural and friendly interfaces; current interfaces remain unreliable and unfriendly.

In this chapter, we focus on synchronous multi-modalities, specifically the audio information of speech and the image information of the face, for audio-visual speech recognition, synthesis, and translation. Audio speech and visual speech both originate from movements of the speech organs triggered by motor commands from the brain, so the two signals represent the information of an utterance in different ways. These audio and visual speech modalities therefore have strong correlations and complementary relationships. There is a strong demand to improve current speech recognition performance, which degrades drastically in real environments when speech is exposed to acoustic noise, reverberation, and differences in speaking style. Integrating audio and visual information is expected to make recognition systems more robust and reliable and to improve their performance. There is also a demand to improve the intelligibility of speech synthesis: multi-modal synthesis of audio speech together with lip-synchronized talking-face images can improve both intelligibility and naturalness. Section 1 describes audio-visual speech detection and recognition, which aim to improve the robustness of speech recognition in real noisy environments. Section 2 introduces a talking-face synthesis system based on a 3-D mesh model and an audio-visual speech translation system that recognizes input speech in a source language, translates it into a target language, and synthesizes output speech in the target language.
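To make the stream-integration idea concrete, here is a minimal late-fusion sketch in Python. It is an illustration only, not the chapter's actual method: the stream weight lam, the functions fused_score and recognize, and all scores are hypothetical stand-ins for per-hypothesis log-likelihoods that a real system would obtain from separate audio and visual models.

    # Hedged illustration: late fusion of audio and visual stream scores.
    # All names and numbers are hypothetical; a real recognizer would derive
    # these scores from audio-only and visual-only models (e.g., HMMs)
    # evaluated on each word hypothesis.

    def fused_score(log_p_audio, log_p_visual, lam):
        """Weighted combination of stream log-likelihoods.

        lam near 1.0 trusts the audio stream (clean conditions);
        lam near 0.0 trusts the visual stream (heavy acoustic noise).
        """
        return lam * log_p_audio + (1.0 - lam) * log_p_visual

    def recognize(hypotheses, audio_scores, visual_scores, lam=0.7):
        """Return the hypothesis with the highest fused score."""
        scored = zip(hypotheses, audio_scores, visual_scores)
        return max(scored, key=lambda h: fused_score(h[1], h[2], lam))[0]

    # Toy example: acoustic noise makes the audio scores unreliable,
    # and the visual stream tips the decision toward the correct word.
    words = ["hello", "yellow"]
    audio_scores = [-42.0, -41.5]   # audio slightly prefers the wrong word
    visual_scores = [-10.0, -18.0]  # lip shapes clearly favour "hello"
    print(recognize(words, audio_scores, visual_scores, lam=0.3))  # hello

In practice the stream weight is adapted to the estimated acoustic conditions (for example, the signal-to-noise ratio) rather than fixed, which is the intuition behind adaptive and state-synchronous integration approaches to bi-modal recognition.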




Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Nakamura, S., Yotsukura, T., Morishima, S. (2005). Human-Machine Communication by Audio-Visual Integration. In: Tan, YP., Yap, K.H., Wang, L. (eds) Intelligent Multimedia Processing with Soft Computing. Studies in Fuzziness and Soft Computing, vol 168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-32367-8_16


  • DOI: https://doi.org/10.1007/3-540-32367-8_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23053-3

  • Online ISBN: 978-3-540-32367-9

  • eBook Packages: Engineering, Engineering (R0)
