Abstract
Audio-visual information plays an indispensable role in human communication. Using audio and visual information in a complementary way enables more accurate, robust, natural, and friendly communication in real environments. Computers likewise need both kinds of information to realize natural and friendly interfaces; without them, current interfaces remain unreliable and unfriendly.
In this chapter, we focus on synchronous multi-modalities, specifically the audio information of speech and the image information of the face, for audio-visual speech recognition, synthesis, and translation. Audible speech and visible speech both originate from movements of the speech organs triggered by motor commands from the brain, so the two signals encode the same utterance in different ways. The audio and visual speech modalities therefore have strong correlations and complementary relationships.

There is a strong demand to improve current speech recognition performance, which degrades drastically in real environments when speech is exposed to acoustic noise, reverberation, and differences in speaking style. Integrating audio and visual information is expected to make recognition robust and reliable and to improve its performance. There is an equally strong demand to improve the intelligibility of speech synthesis: multi-modal synthesis of audio speech together with lip-synchronized talking-face images can improve both intelligibility and naturalness.

Section 1 describes audio-visual speech detection and recognition, which aim to make speech recognition robust in real noisy environments. Section 2 introduces a talking-face synthesis system based on a 3-D mesh model, together with an audio-visual speech translation system that recognizes input speech in the source language, translates it into the target language, and synthesizes output speech in the target language.
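The chapter details its own fusion models in Section 1. As one widely used illustration of how the two streams can be combined, a stream-weighted model scores each state by a weighted sum of audio and visual log-likelihoods, lowering the audio weight when acoustic noise is high. The function name, weight value, and scores below are an illustrative sketch, not the chapter's implementation:

```python
import numpy as np

def fuse_av_log_likelihoods(ll_audio, ll_video, audio_weight):
    """Combine per-state log-likelihoods from the audio and visual
    streams with a stream-exponent weight w in [0, 1]:

        log P(o | s) = w * log P(o_audio | s) + (1 - w) * log P(o_video | s)

    Lowering w in acoustic noise lets the visual stream dominate.
    (Illustrative sketch; not the chapter's actual fusion model.)
    """
    w = np.clip(audio_weight, 0.0, 1.0)
    return w * ll_audio + (1.0 - w) * ll_video

# Example: three HMM states scored on one audio-visual frame,
# with the audio stream down-weighted as if in a noisy environment.
ll_a = np.array([-12.3, -9.8, -15.1])   # audio-stream log-likelihoods
ll_v = np.array([-8.7, -11.2, -9.9])    # visual-stream log-likelihoods
print(fuse_av_log_likelihoods(ll_a, ll_v, audio_weight=0.3))
```

In practice the weight can be fixed, tuned to the noise level, or adapted per utterance; the chapter's Section 1 discusses integration strategies of this kind in depth.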
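The translation system of Section 2 is a recognize-translate-synthesize cascade. The following minimal sketch shows the shape of that pipeline; all component names and interfaces here are hypothetical placeholders, not the chapter's actual modules or APIs:

```python
def translate_utterance(audio, video, recognizer, translator, synthesizer):
    """Hypothetical sketch of an audio-visual speech translation cascade.

    `recognizer`, `translator`, and `synthesizer` stand in for the
    system's actual components, whose interfaces are not given here.
    """
    # 1. Recognize source-language speech, using lip images for robustness.
    source_text = recognizer.recognize(audio, video)
    # 2. Translate the recognized text into the target language.
    target_text = translator.translate(source_text)
    # 3. Synthesize target-language speech plus lip-synchronized
    #    face-animation parameters for the talking-face model.
    target_speech, lip_track = synthesizer.synthesize(target_text)
    return target_text, target_speech, lip_track
```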
© 2005 Springer-Verlag Berlin Heidelberg
Cite this chapter
Nakamura, S., Yotsukura, T., Morishima, S. (2005). Human-Machine Communication by Audio-Visual Integration. In: Tan, Y.P., Yap, K.H., Wang, L. (eds) Intelligent Multimedia Processing with Soft Computing. Studies in Fuzziness and Soft Computing, vol 168. Springer, Berlin, Heidelberg.
DOI: https://doi.org/10.1007/3-540-32367-8_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23053-3
Online ISBN: 978-3-540-32367-9