Abstract
We examined how speakers of different languages perceive speech in face-to-face communication. Participants identified unimodal and bimodal speech syllables constructed from synthetic auditory and visual five-step /ba/-/da/ continua. In the first experiment, Dutch speakers identified the test syllables as either /ba/ or /da/. To test the robustness of the results, Dutch and English speakers were then given a completely open-ended response task, whereas tasks in previous studies had always specified a set of alternatives. Similar results were found in the two-alternative and open-ended tasks: identification of the speech segments was influenced by both the auditory and the visual sources of information. The results falsified an auditory dominance model (ADM), which assumes that visible speech contributes only when the audible speech is of poor quality. The results also falsified an additive model of perception (AMP), in which the auditory and visual sources are combined linearly. The fuzzy logical model of perception (FLMP) provided a good description of performance, supporting the claim that multiple sources of continuous information are evaluated and integrated in speech perception. These results replicate previous findings with English, Spanish, and Japanese speakers. Although there were significant performance differences across language groups, the model analyses indicated no differences in the nature of information processing. The performance differences were instead due to information differences arising from the different phonologies of Dutch and English. These results suggest that the underlying mechanisms of speech perception are similar across languages.
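The abstract contrasts an additive model (AMP) with the multiplicative FLMP. As a minimal sketch of the two decision rules (the function names and all parameter values below are illustrative assumptions, not the paper's fitted model), the contrast can be written as:

```python
# Illustrative sketch of the two integration models contrasted in the
# abstract (FLMP vs. AMP). Parameter values are made up for illustration;
# they are not the paper's fitted values.

def flmp(a, v):
    """Fuzzy logical model of perception: the auditory (a) and visual (v)
    degrees of support for /da/ (each in 0..1) are combined
    multiplicatively, then normalized by a relative-goodness rule."""
    return (a * v) / (a * v + (1 - a) * (1 - v))

def amp(a, v):
    """Additive model of perception: the two sources are combined
    linearly (here with equal weights)."""
    return (a + v) / 2

if __name__ == "__main__":
    # Agreeing sources: FLMP amplifies the shared evidence; AMP merely
    # averages. Conflicting sources: FLMP lets the less ambiguous source
    # dominate, while a neutral source (0.5) leaves the other unchanged.
    for a, v in [(0.7, 0.7), (0.9, 0.3), (0.5, 0.8)]:
        print(f"a={a}, v={v}: FLMP={flmp(a, v):.3f}, AMP={amp(a, v):.3f}")
```

With agreeing sources (a = v = 0.7), the FLMP predicts a stronger /da/ response (about 0.84) than the additive average (0.70); such diverging predictions are one way identification data can discriminate the two models.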
Additional information
The research reported in this paper and the writing of the paper were supported, in part, by grants from the Public Health Service (PHS R01 NS 20314), the National Science Foundation (BNS 8812728), and the graduate division of the University of California, Santa Cruz.
Cite this article
Massaro, D.W., Cohen, M.M. & Smeele, P.M.T. Cross-linguistic comparisons in the integration of visual and auditory speech. Memory & Cognition 23, 113–131 (1995). https://doi.org/10.3758/BF03210561