Artificial Intelligence Review

Volume 9, Issue 4–5, pp 323–346

A comparison of models for fusion of the auditory and visual sensors in speech perception

  • Jordi Robert-Ribes
  • Jean-Luc Schwartz
  • Pierre Escudier


Abstract

Although a large amount of psychological and physiological evidence of audio-visual integration in speech has been collected over the last 20 years, there is no agreement about the nature of the fusion process. We present the main experimental data and describe the various models proposed in the literature, together with a number of studies in the field of automatic audiovisual speech recognition. We discuss these models in relation to general proposals arising from psychology on intersensory interaction, and from vision and robotics on sensor fusion. We then examine the characteristics of four main models in the light of psychological data and formal properties, and present the results of a modelling study on audio-visual recognition of French vowels in noise. We conclude in favor of the relative superiority of a model in which the auditory and visual inputs are projected and fused in a common representation space related to the motor properties of speech objects, the fused representation being further classified for lexical access.
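To make the architectural distinction concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) contrasting two fusion strategies discussed in this literature: "early" fusion, where audio and visual features are projected into one common representation before classification, and "late" fusion, where each modality is classified separately and the per-class scores are then combined (an independent-combination rule in the spirit of Massaro's fuzzy logical model). The feature dimensions and the nearest-mean classifier are assumptions chosen only to keep the example self-contained.

```python
import math

def euclid(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_mean(x, means):
    """Index of the class prototype closest to feature vector x."""
    return min(range(len(means)), key=lambda k: euclid(x, means[k]))

def early_fusion(audio, visual, audio_means, visual_means):
    # Concatenate audio and visual features into one common vector,
    # then classify once in the fused representation space.
    fused = list(audio) + list(visual)
    fused_means = [list(a) + list(v) for a, v in zip(audio_means, visual_means)]
    return nearest_mean(fused, fused_means)

def soft_scores(x, means):
    # Turn distances into normalized similarity scores (closer -> higher).
    sims = [math.exp(-euclid(x, m)) for m in means]
    total = sum(sims)
    return [s / total for s in sims]

def late_fusion(audio, visual, audio_means, visual_means):
    # Score each modality separately, then multiply per-class scores:
    # a simple independent-combination rule applied after unimodal
    # classification.
    sa = soft_scores(audio, audio_means)
    sv = soft_scores(visual, visual_means)
    return max(range(len(sa)), key=lambda k: sa[k] * sv[k])
```

With two vowel classes represented by prototype feature vectors, `early_fusion` and `late_fusion` both map a noisy audio vector plus a clean visual vector to a class label; the difference lies in where the fusion happens, which is the axis along which the four models compared in the paper differ.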

Key words

audiovisual speech perception, sensor fusion, noisy speech recognition, intersensory interactions, vowel processing





Copyright information

© Kluwer Academic Publishers 1995

Authors and Affiliations

  • Jordi Robert-Ribes (1)
  • Jean-Luc Schwartz (1)
  • Pierre Escudier (1)

  1. Institut de la Communication Parlée, CNRS UA 368, INPG/ENSERG, Université Stendhal / INPG, Grenoble Cedex 1, France
