Abstract
Audio-visual information plays an indispensable role in human communication. Using audio and visual information in a complementary way enables more accurate, robust, natural, and friendly communication in real environments. Computers likewise need both kinds of information to realize natural and friendly interfaces; without them, current interfaces remain unreliable and unfriendly.
In this chapter, we focus on synchronous multi-modalities, specifically the audio information of speech and the image information of the face, for audio-visual speech recognition, synthesis, and translation. Audible speech and visible speech both originate from movements of the speech organs triggered by motor commands from the brain, so the two signals encode the same utterance in different ways. The audio and visual speech modalities therefore have strong correlations and complementary relationships.

There is a strong demand to improve current speech recognition performance, which degrades drastically in real environments when speech is exposed to acoustic noise, reverberation, and differences in speaking style. Integrating audio and visual information is expected to make recognition robust and reliable and to improve its performance. There is an equally strong demand to improve the intelligibility of speech synthesis: multi-modal synthesis of audio speech together with lip-synchronized talking-face images can improve both intelligibility and naturalness.

Section 1 describes audio-visual speech detection and recognition, which aim to make speech recognition robust in real noisy environments. Section 2 introduces a talking-face synthesis system based on a 3-D mesh model, together with an audio-visual speech translation system that recognizes input speech in the source language, translates it into the target language, and synthesizes output speech in the target language.
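The chapter details its own fusion models in Section 1. As one widely used illustration of how the two streams can be combined, a stream-weighted model scores each state by a weighted sum of audio and visual log-likelihoods, lowering the audio weight when acoustic noise is high. The function name, weight value, and scores below are an illustrative sketch, not the chapter's implementation:

```python
import numpy as np

def fuse_av_log_likelihoods(ll_audio, ll_video, audio_weight):
    """Combine per-state log-likelihoods from the audio and visual
    streams with a stream-exponent weight w in [0, 1]:

        log P(o | s) = w * log P(o_audio | s) + (1 - w) * log P(o_video | s)

    Lowering w in acoustic noise lets the visual stream dominate.
    (Illustrative sketch; not the chapter's actual fusion model.)
    """
    w = np.clip(audio_weight, 0.0, 1.0)
    return w * ll_audio + (1.0 - w) * ll_video

# Example: three HMM states scored on one audio-visual frame,
# with the audio stream down-weighted as if in a noisy environment.
ll_a = np.array([-12.3, -9.8, -15.1])   # audio-stream log-likelihoods
ll_v = np.array([-8.7, -11.2, -9.9])    # visual-stream log-likelihoods
print(fuse_av_log_likelihoods(ll_a, ll_v, audio_weight=0.3))
```

In practice the weight can be fixed, tuned to the noise level, or adapted per utterance; the chapter's Section 1 discusses integration strategies of this kind in depth.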
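The translation system of Section 2 is a recognize-translate-synthesize cascade. The following minimal sketch shows the shape of that pipeline; all component names and interfaces here are hypothetical placeholders, not the chapter's actual modules or APIs:

```python
def translate_utterance(audio, video, recognizer, translator, synthesizer):
    """Hypothetical sketch of an audio-visual speech translation cascade.

    `recognizer`, `translator`, and `synthesizer` stand in for the
    system's actual components, whose interfaces are not given here.
    """
    # 1. Recognize source-language speech, using lip images for robustness.
    source_text = recognizer.recognize(audio, video)
    # 2. Translate the recognized text into the target language.
    target_text = translator.translate(source_text)
    # 3. Synthesize target-language speech plus lip-synchronized
    #    face-animation parameters for the talking-face model.
    target_speech, lip_track = synthesizer.synthesize(target_text)
    return target_text, target_speech, lip_track
```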
© 2005 Springer-Verlag Berlin Heidelberg
Cite this chapter
Nakamura, S., Yotsukura, T., Morishima, S. (2005). Human-Machine Communication by Audio-Visual Integration. In: Tan, Y.P., Yap, K.H., Wang, L. (eds) Intelligent Multimedia Processing with Soft Computing. Studies in Fuzziness and Soft Computing, vol 168. Springer, Berlin, Heidelberg.
DOI: https://doi.org/10.1007/3-540-32367-8_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23053-3
Online ISBN: 978-3-540-32367-9