Two-Level Bimodal Association for Audio-Visual Speech Recognition

Lee, Jong-Seok; Ebrahimi, Touradj

doi:10.1007/978-3-642-04697-1_13

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 5807)

Abstract

This paper proposes a new method for bimodal information fusion in audio-visual speech recognition, in which cross-modal association is considered at two levels. First, the acoustic and visual data streams are combined at the feature level using canonical correlation analysis, which addresses audio-visual synchronization and exploits the cross-modal correlation. Second, the information streams are integrated at the decision level, so that fusion adapts to the noise condition of the given speech datum. Experimental results demonstrate that the proposed method yields noise-robust recognition performance without a priori knowledge of the noise conditions of the speech data.
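
The two fusion levels named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the feature dimensions, the classifier, or the rule that sets the fusion weight. The function names fit_cca and fuse_decisions, the covariance regularizer reg, and the fixed weight gamma are all assumptions made for illustration. The first function computes standard canonical correlation analysis by whitening each stream and taking a singular value decomposition of the cross-covariance. The second combines per-class log-likelihoods from an acoustic and a visual classifier with a reliability weight, which is one common form of decision-level fusion.

```python
import numpy as np


def fit_cca(X, Y, reg=1e-6):
    """Standard CCA for two feature streams.

    X: (n_frames, dx) acoustic features, Y: (n_frames, dy) visual features.
    Returns projections Wx, Wy and the canonical correlations (descending).
    reg is a small ridge term (an assumption) that keeps the covariances invertible.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    # Cholesky factors: Cxx = Lx Lx^T and Cyy = Ly Ly^T.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)

    # Whitened cross-covariance T = Lx^{-1} Cxy Ly^{-T}.
    A = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, A.T).T

    # Singular values of T are the canonical correlations.
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U)
    Wy = np.linalg.solve(Ly.T, Vt.T)
    return Wx, Wy, s


def fuse_decisions(loglik_audio, loglik_visual, gamma):
    """Weighted late fusion of per-class log-likelihoods.

    gamma in [0, 1] is the acoustic reliability weight (hypothetical input here);
    an adaptive system would lower it as the acoustic noise level rises.
    Returns the index of the winning class.
    """
    combined = gamma * np.asarray(loglik_audio) + (1.0 - gamma) * np.asarray(loglik_visual)
    return int(np.argmax(combined))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.standard_normal((500, 1))  # shared latent "speech" factor
    X = z @ np.array([[1.0, -0.5, 0.3]]) + 0.1 * rng.standard_normal((500, 3))
    Y = z @ np.array([[0.8, 0.2]]) + 0.1 * rng.standard_normal((500, 2))
    Wx, Wy, corr = fit_cca(X, Y)
    print("canonical correlations:", corr)
    print("clean audio decision:", fuse_decisions([-10, -2, -8], [-3, -9, -7], 0.9))
    print("noisy audio decision:", fuse_decisions([-10, -2, -8], [-3, -9, -7], 0.1))
```

In this sketch the weight is passed in by hand. The abstract states only that the fusion adapts to the noise condition without a priori knowledge of it, so a real system would have to estimate the weight from the observed data.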

Author information

Authors and Affiliations

Multimedia Signal Processing Group, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
Jong-Seok Lee & Touradj Ebrahimi

Editor information

Editors and Affiliations

DGA/D4S/MRIS, CEP/GIP, 16 bis avenue Prieur de la côte d’or., 94114, Arcueil, France
Jacques Blanc-Talon
Department of Telecommunication and Information Processing, Ghent University, St.-Pietersnieuwstraat 41, 9000, Gent, Belgium
Wilfried Philips
CSIRO ICT Centre, Epping, PO Box 76, 1710, Sydney, NSW, Australia
Dan Popescu
University of Antwerp, Universiteitsplein 1; Building N., 2610, Wilrijk, Belgium
Paul Scheunders

Cite this paper

Lee, J.-S., Ebrahimi, T. (2009). Two-Level Bimodal Association for Audio-Visual Speech Recognition. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P. (eds) Advanced Concepts for Intelligent Vision Systems. ACIVS 2009. Lecture Notes in Computer Science, vol 5807. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04697-1_13

DOI: https://doi.org/10.1007/978-3-642-04697-1_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04696-4
Online ISBN: 978-3-642-04697-1
eBook Packages: Computer Science, Computer Science (R0)
