Two-Level Bimodal Association for Audio-Visual Speech Recognition

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 5807)

Abstract

This paper proposes a new method for bimodal information fusion in audio-visual speech recognition in which cross-modal association is considered at two levels. First, the acoustic and visual data streams are combined at the feature level using canonical correlation analysis, which addresses audio-visual synchronization and exploits the cross-modal correlation. Second, the information streams are integrated at the decision level, so that fusion adapts to the noise condition of the given speech datum. Experimental results demonstrate that the proposed method yields noise-robust recognition performance without a priori knowledge of the noise conditions of the speech data.
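
The two levels described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it shows textbook linear canonical correlation analysis for the feature level and a plain weighted sum of log-likelihoods for the decision level. The function names (fit_cca, fuse_scores), the ridge term reg, the synthetic data, and the example weights are all assumptions made for illustration. How the paper actually chooses the fusion weight for a given noise condition is not stated in the abstract, so the weight is simply passed in by the caller.

```python
import numpy as np


def fit_cca(audio, video, n_components, reg=1e-6):
    """Fit linear CCA on time-aligned frames (rows are frames)."""
    a_mean, v_mean = audio.mean(axis=0), video.mean(axis=0)
    a, v = audio - a_mean, video - v_mean
    n = a.shape[0]
    # Sample covariances; the small ridge term (an assumption) keeps them invertible.
    c_aa = a.T @ a / (n - 1) + reg * np.eye(a.shape[1])
    c_vv = v.T @ v / (n - 1) + reg * np.eye(v.shape[1])
    c_av = a.T @ v / (n - 1)
    # Whitening matrices from Cholesky factors, so that w.T @ C @ w = I.
    w_a = np.linalg.inv(np.linalg.cholesky(c_aa)).T
    w_v = np.linalg.inv(np.linalg.cholesky(c_vv)).T
    # Singular values of the whitened cross-covariance are the canonical correlations.
    u, s, vt = np.linalg.svd(w_a.T @ c_av @ w_v)
    proj_a = w_a @ u[:, :n_components]
    proj_v = w_v @ vt.T[:, :n_components]
    return a_mean, v_mean, proj_a, proj_v, s[:n_components]


def fuse_scores(audio_loglik, video_loglik, audio_weight):
    """Decision-level fusion: weighted sum of per-class log-likelihoods."""
    return audio_weight * audio_loglik + (1.0 - audio_weight) * video_loglik


# Synthetic stand-in for paired acoustic and visual features (hypothetical data):
# both streams are noisy linear views of one shared latent signal.
rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 1))
audio = latent @ np.array([[1.0, -0.5, 0.3, 0.8]]) + 0.1 * rng.standard_normal((500, 4))
video = latent @ np.array([[0.7, 1.2, -0.4]]) + 0.1 * rng.standard_normal((500, 3))

# Feature level: project both streams onto their canonical directions and concatenate.
a_mean, v_mean, proj_a, proj_v, corrs = fit_cca(audio, video, n_components=2)
joint = np.hstack([(audio - a_mean) @ proj_a, (video - v_mean) @ proj_v])

# Decision level: made-up per-class scores from two hypothetical recognizers.
audio_loglik = np.array([-10.0, -12.0, -11.0])
video_loglik = np.array([-9.0, -5.0, -8.0])
trust_audio = int(np.argmax(fuse_scores(audio_loglik, video_loglik, 0.9)))
trust_video = int(np.argmax(fuse_scores(audio_loglik, video_loglik, 0.1)))
```

In this sketch the first canonical correlation is close to 1 because both synthetic streams share one latent signal. Lowering audio_weight, as one might under heavy acoustic noise, moves the fused decision from the class favoured by the audio scores to the class favoured by the video scores.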

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lee, J.-S., Ebrahimi, T. (2009). Two-Level Bimodal Association for Audio-Visual Speech Recognition. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P. (eds.) Advanced Concepts for Intelligent Vision Systems. ACIVS 2009. Lecture Notes in Computer Science, vol. 5807. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04697-1_13

  • DOI: https://doi.org/10.1007/978-3-642-04697-1_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04696-4

  • Online ISBN: 978-3-642-04697-1

  • eBook Packages: Computer Science, Computer Science (R0)
