Abstract
Speech recognition still lacks robustness when faced with changing noise characteristics. Automatic lip reading, on the other hand, is unaffected by acoustic noise and can therefore provide a speech recognizer with valuable additional information, especially since the visual modality carries information complementary to that in the audio channel. In this paper we present a novel way of processing the video signal for lip reading, together with a post-processing data transformation that can be used alongside it. The proposed Lip Geometry Estimation (LGE) is compared with other geometry- and image-intensity-based techniques typically deployed for this task. Using this method, we implemented a large-vocabulary continuous audio-visual speech recognizer for Dutch, and we show that the combined system improves upon audio-only recognition in the presence of noise.
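To illustrate the kind of audio-visual combination the abstract describes, the sketch below uses one common fusion scheme, the weighted product rule, in which per-stream log-likelihoods are mixed with a weight reflecting how much the acoustic channel can be trusted. This is a minimal illustration, not the paper's actual fusion method; the function name, the word labels, and all probability values are hypothetical.

```python
import math

def fuse_stream_scores(log_p_audio, log_p_video, audio_weight):
    """Weighted product rule in the log domain: a higher audio_weight
    trusts the acoustic stream more; (1 - audio_weight) goes to video."""
    return {w: audio_weight * log_p_audio[w] + (1.0 - audio_weight) * log_p_video[w]
            for w in log_p_audio}

# Hypothetical per-word log-likelihoods for the two streams.
audio = {"ja": math.log(0.2), "nee": math.log(0.8)}   # noisy audio favours "nee"
video = {"ja": math.log(0.9), "nee": math.log(0.1)}   # lip geometry favours "ja"

# In heavy noise the audio weight would be lowered, letting video dominate.
fused = fuse_stream_scores(audio, video, audio_weight=0.3)
best = max(fused, key=fused.get)  # → "ja"
```

With a clean-speech weight such as 0.9 the same scores yield "nee" instead, which is the point of making the stream weight depend on the estimated noise level.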
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Rothkrantz, L.J.M., Wojdeł, J.C., Wiggers, P. (2005). Fusing Data Streams in Continuous Audio-Visual Speech Recognition. In: Matoušek, V., Mautner, P., Pavelka, T. (eds) Text, Speech and Dialogue. TSD 2005. Lecture Notes in Computer Science, vol 3658. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551874_5
DOI: https://doi.org/10.1007/11551874_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28789-6
Online ISBN: 978-3-540-31817-0