Abstract
A primary objective of a theory of audio-visual speech perception is to describe the process of audio-visual integration and the form of the auditory and visual streams of information. An experiment was conducted in which listeners were presented with audio-visual sentences in a transcription task. The visual components of the stimuli consisted of the face of a male talker. The acoustic components of the audio-visual stimuli consisted of: (1) natural speech; (2) envelope-shaped noise, which preserved the duration and amplitude of the original speech waveform; and (3) various types of sinewave speech signals, which preserved different aspects of the time-varying spectrum of the original speech signals. Sinewave speech is a skeletonized version of a natural utterance that retains the frequency and amplitude variation of the formants but lacks the fine-grained acoustic structure of speech. When all three formants are represented in this form (T1+T2+T3) and listeners are told they are listening to speech, the intelligibility of sentences is relatively high (above 75%) (Remez, Rubin, Pisoni, & Carrell, 1981). However, when listeners are presented with only single tones (T1, T2, or T3), performance falls to almost zero. Preliminary results reported here indicate that the intelligibility of sinewave sentences is greatly increased when visual information is combined with the auditory signal. We predicted that the increase in intelligibility for sinewave speech with an added video display would be greater than the gain observed with the envelope-shaped noise. This prediction is based on the assumption that the phonetic properties of spoken utterances are retained in the audio-visual stream of the sinewave condition. The results demonstrate that visual information significantly increases the intelligibility of the tonal analog of the second formant, but not that of the tonal analog of the first formant or the envelope-shaped (bit-flipped) noise, suggesting that the information contained in the tone 2 (T2) analog is useful for audio-visual integration. Thus, the dynamic time-varying properties of the vocal tract transfer function that are encoded in both the optical and acoustic signals play an important role in speech intelligibility and therefore need to be incorporated in theoretical accounts of audio-visual speech perception.
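The single-tone conditions described above can be sketched in code. The following is a minimal, illustrative rendering of one "tone analog": a sinusoid whose frequency and amplitude follow a formant track, which is the essence of sinewave speech. The formant trajectory below is a made-up linear glide, not data or synthesis code from this study.

```python
# Minimal sketch of a sinewave-speech tone analog: a single sinusoid
# whose instantaneous frequency tracks a formant and whose amplitude
# follows the formant's amplitude envelope. Hypothetical values only.
import numpy as np

def tone_analog(freqs_hz, amps, sample_rate=16000):
    """Render a tone with per-sample frequency (Hz) and amplitude (0-1).

    The instantaneous phase is the running integral of the frequency
    track, so the tone follows the formant trajectory smoothly, with
    no phase discontinuities between samples.
    """
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    amps = np.asarray(amps, dtype=float)
    phase = 2.0 * np.pi * np.cumsum(freqs_hz) / sample_rate
    return amps * np.sin(phase)

# Hypothetical F2-like glide: 1500 Hz falling to 1100 Hz over 200 ms,
# with a raised-cosine amplitude envelope (silence at onset and offset).
sr = 16000
n = int(0.2 * sr)
f2_track = np.linspace(1500.0, 1100.0, n)
envelope = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(n) / n))
tone = tone_analog(f2_track, envelope, sr)
```

A full T1+T2+T3 stimulus would simply sum three such tones, one per formant track; the envelope-shaped noise condition, by contrast, discards the frequency tracks and keeps only the amplitude envelope of the original waveform.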
© 1996 Springer-Verlag Berlin Heidelberg
Cite this chapter
Saldaña, H.M., Pisoni, D.B., Fellowes, J.M., Remez, R.E. (1996). Audio-Visual Speech Perception Without Speech Cues: A First Report. In: Stork, D.G., Hennecke, M.E. (eds) Speechreading by Humans and Machines. NATO ASI Series, vol 150. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-13015-5_10
Print ISBN: 978-3-642-08252-8
Online ISBN: 978-3-662-13015-5