Abstract
A primary objective of a theory of audio-visual speech perception is to describe the process of audio-visual integration and the form of the auditory and visual streams of information. An experiment was conducted in which listeners were presented with audio-visual sentences in a transcription task. The visual components of the stimuli consisted of the face of a male talker. The acoustic components of the audio-visual stimuli consisted of: (1) natural speech; (2) envelope-shaped noise, which preserved the duration and amplitude of the original speech waveform; and (3) various types of sinewave speech signals, which preserved different aspects of the time-varying spectrum of the original speech signals. Sinewave speech is a skeletonized version of a natural utterance that retains the frequency and amplitude variation of the formants but lacks the fine-grained acoustic structure of speech. When all three formants are represented in this form (T1+T2+T3) and listeners are told they are listening to speech, the intelligibility of sentences is relatively high (above 75%) (Remez, Rubin, Pisoni, & Carrell, 1981). However, when listeners are presented with only single tones (T1, T2, or T3), performance falls to almost zero. Preliminary results reported here indicate that the intelligibility of sinewave sentences is greatly increased when visual information is combined with the auditory signal. We predicted that the increase in intelligibility for sinewave speech with an added video display would be greater than the gain observed with the envelope-shaped noise. This prediction is based on the assumption that the phonetic properties of spoken utterances are retained in the audio-visual stream of the sinewave condition. The results demonstrate that visual information significantly increases the intelligibility of the tonal analog of the second formant, but not that of the tonal analog of the first formant or the envelope-shaped (bit-flipped) noise, suggesting that the information contained in the tone 2 (T2) analog is useful for audio-visual integration. Thus, the dynamic time-varying properties of the vocal tract transfer function that are encoded in both the optical and acoustic signals play an important role in speech intelligibility and therefore need to be incorporated in theoretical accounts of audio-visual speech perception.
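The single-tone conditions described above can be sketched in code. The following is a minimal, illustrative rendering of one "tone analog": a sinusoid whose frequency and amplitude follow a formant track, which is the essence of sinewave speech. The formant trajectory below is a made-up linear glide, not data or synthesis code from this study.

```python
# Minimal sketch of a sinewave-speech tone analog: a single sinusoid
# whose instantaneous frequency tracks a formant and whose amplitude
# follows the formant's amplitude envelope. Hypothetical values only.
import numpy as np

def tone_analog(freqs_hz, amps, sample_rate=16000):
    """Render a tone with per-sample frequency (Hz) and amplitude (0-1).

    The instantaneous phase is the running integral of the frequency
    track, so the tone follows the formant trajectory smoothly, with
    no phase discontinuities between samples.
    """
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    amps = np.asarray(amps, dtype=float)
    phase = 2.0 * np.pi * np.cumsum(freqs_hz) / sample_rate
    return amps * np.sin(phase)

# Hypothetical F2-like glide: 1500 Hz falling to 1100 Hz over 200 ms,
# with a raised-cosine amplitude envelope (silence at onset and offset).
sr = 16000
n = int(0.2 * sr)
f2_track = np.linspace(1500.0, 1100.0, n)
envelope = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(n) / n))
tone = tone_analog(f2_track, envelope, sr)
```

A full T1+T2+T3 stimulus would simply sum three such tones, one per formant track; the envelope-shaped noise condition, by contrast, discards the frequency tracks and keeps only the amplitude envelope of the original waveform.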
© 1996 Springer-Verlag Berlin Heidelberg
Cite this chapter
Saldaña, H.M., Pisoni, D.B., Fellowes, J.M., Remez, R.E. (1996). Audio-Visual Speech Perception Without Speech Cues: A First Report. In: Stork, D.G., Hennecke, M.E. (eds) Speechreading by Humans and Machines. NATO ASI Series, vol 150. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-13015-5_10
Print ISBN: 978-3-642-08252-8
Online ISBN: 978-3-662-13015-5