Audiovisual Integration in Speaker Identification

Abstract

Audiovisual integration (AVI) is well-known during speech perception, but evidence for AVI in speaker identification has been less clear. This chapter reviews evidence for face–voice integration in speaker identification. Links between perceptual representations mediating face and voice identification, tentatively suggested by behavioral evidence more than a decade ago, have been recently supported by neuroimaging data indicating tight functional connectivity between the fusiform face and temporal voice areas. Research that recombined dynamic facial and vocal identities with precise synchrony provided strong evidence for AVI in identifying personally familiar (but not unfamiliar) speakers. Electrophysiological data demonstrate AVI at multiple neuronal levels and suggest that perceiving time-synchronized speaking faces triggers early (∼50–80 ms) audiovisual processing, although audiovisual speaker identity is only computed ∼200 ms later.

Notes

  1.

    It should be noted that this experiment, like many others, used static faces. On the one hand, the study is therefore subject to the limitations mentioned earlier; on the other hand, its results might be further evidence that even static faces can elicit some crossmodal effects (Joassin, Maurage, Bruyer, Crommelinck, & Campanella, 2004; Joassin et al., 2011).

  2.

    One might speculate that the differences in timing were a consequence of the temporally extended sentence stimuli used in Schweinberger, Kloth, and Robertson (2011) and Schweinberger, Walther, Zäske, and Kovacs (2011). However, in as yet unpublished research we have repeated the same experiment using brief syllabic stimuli similar to those used in the McGurk paradigm and replicated the crucial results: an early frontocentral negativity around 50–80 ms to bimodal stimuli, and an onset of speaker identity correspondence effects around 250 ms.

References

  • Andics, A., McQueen, J. M., Petersson, K. M., Gal, V., Rudas, G., & Vidnyanszky, Z. (2010). Neural mechanisms for voice recognition. NeuroImage, 52, 1528–1540.

  • Belin, P., Bestelmeyer, P. E. G., Latinus, M., & Watson, R. (2011). Understanding voice perception. British Journal of Psychology, 102, 711–725.

  • Belin, P., Fecteau, S., & Bedard, C. (2004). Thinking the voice: Neural correlates of voice perception. Trends in Cognitive Sciences, 8, 129–135.

  • Belin, P., & Zatorre, R. J. (2003). Adaptation to speaker’s voice in right anterior temporal lobe. NeuroReport, 14, 2105–2109.

  • Belin, P., Zatorre, R. J., Lafaille, P., Ahad, P., & Pike, B. (2000). Voice-selective areas in human auditory cortex. Nature, 403, 309–312.

  • Benson, P. J., & Perrett, D. I. (1991). Perception and recognition of photographic quality facial caricatures: Implications for the recognition of natural images. European Journal of Cognitive Psychology, 3, 105–135.

  • Bricker, P. D., & Pruzansky, S. (1966). Effects of stimulus content and duration on talker identification. Journal of the Acoustical Society of America, 40, 1441–1449.

  • Bruce, V., & Young, A. (1986). Understanding face recognition. British Journal of Psychology, 77, 305–327.

  • Bruce, V., & Young, A. (2011). Face perception. Hove, UK: Psychology Press.

  • Burton, A. M., Bruce, V., & Johnston, R. A. (1990). Understanding face recognition with an interactive activation model. British Journal of Psychology, 81, 361–380.

  • Calvert, G. A., Brammer, M. J., & Iversen, S. D. (1998). Crossmodal identification. Trends in Cognitive Sciences, 2, 247–253.

  • Campanella, S., & Belin, P. (2007). Integrating face and voice in person perception. Trends in Cognitive Sciences, 11, 535–543.

  • Charest, I., Pernet, C. R., Rousselet, G. A., Quinones, I., Latinus, M., Fillion-Bilodeau, S., et al. (2009). Electrophysiological evidence for an early processing of human voices. BMC Neuroscience, 10(127), 1–11.

  • Colonius, H., Diederich, A., & Steenken, R. (2009). Time-Window-of-Integration (TWIN) model for saccadic reaction time: Effect of auditory masker level on visual-auditory spatial interaction in elevation. Brain Topography, 21, 177–184.

  • de Gelder, B., & Vroomen, J. (2000). The perception of emotions by ear and by eye. Cognition & Emotion, 14, 289–311.

  • Ellis, H. D., Jones, D. M., & Mosdell, N. (1997). Intra- and inter-modal repetition priming of familiar faces and voices. British Journal of Psychology, 88, 143–156.

  • Formisano, E., De Martino, F., Bonte, M., & Goebel, R. (2008). “Who” is saying “what”? Brain-based decoding of human voice and speech. Science, 322, 970–973.

  • Fox, C. J., & Barton, J. J. S. (2007). What is adapted in face adaptation? The neural representations of expression in the human visual system. Brain Research, 1127, 80–89.

  • Garrido, L., Eisner, F., McGettigan, C., Stewart, L., Sauter, D., Hanley, J. R., et al. (2009). Developmental phonagnosia: A selective deficit of vocal identity recognition. Neuropsychologia, 47, 123–131.

  • Ghazanfar, A. A., & Schroeder, C. E. (2006). Is neocortex essentially multisensory? Trends in Cognitive Sciences, 10, 278–285.

  • Green, K. P., Kuhl, P. K., Meltzoff, A. N., & Stevens, E. B. (1991). Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524–536.

  • Hagan, C. C., Woods, W., Johnson, S., Calder, A. J., Green, G. G. R., & Young, A. W. (2009). MEG demonstrates a supra-additive response to facial and vocal emotion in the right superior temporal sulcus. Proceedings of the National Academy of Sciences of the United States of America, 106, 20010–20015.

  • Hanley, J. R., Smith, S. T., & Hadfield, J. (1998). I recognise you but I can’t place you: An investigation of familiar-only experiences during tests of voice and face recognition. Quarterly Journal of Experimental Psychology, 51A, 179–195.

  • Haxby, J. V., Hoffman, E. A., & Gobbini, M. I. (2000). The distributed human neural system for face perception. Trends in Cognitive Sciences, 4, 223–233.

  • Joassin, F., Maurage, P., Bruyer, R., Crommelinck, M., & Campanella, S. (2004). When audition alters vision: An event-related potential study of the cross-modal interactions between faces and voices. Neuroscience Letters, 369, 132–137.

  • Joassin, F., Pesenti, M., Maurage, P., Verreckt, E., Bruyer, R., & Campanella, S. (2011). Cross-modal interactions between human faces and voices involved in person recognition. Cortex, 47, 367–376.

  • Kawahara, H., & Matsui, H. (2003). Auditory morphing based on an elastic perceptual distance metric in an interference-free time-frequency representation. IEEE Proceedings of ICASSP, 1, 256–259.

  • Kovács, G., Zimmer, M., Banko, E., Harza, I., Antal, A., & Vidnyanszky, Z. (2006). Electrophysiological correlates of visual adaptation to faces and body parts in humans. Cerebral Cortex, 16, 742–753.

  • Lander, K., & Chuang, L. (2005). Why are moving faces easier to recognize? Visual Cognition, 12, 429–442.

  • Legge, G. E., Grossmann, C., & Pieper, C. M. (1984). Learning unfamiliar voices. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10, 298–303.

  • Leopold, D. A., O’Toole, A. J., Vetter, T., & Blanz, V. (2001). Prototype-referenced shape encoding revealed by high-level aftereffects. Nature Neuroscience, 4, 89–94.

  • McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.

  • Munhall, K. G., Gribble, P., Sacco, L., & Ward, M. (1996). Temporal constraints on the McGurk effect. Perception & Psychophysics, 58, 351–362.

  • Natu, V., & O’Toole, A. J. (2011). The neural processing of familiar and unfamiliar faces: A review and synopsis. British Journal of Psychology, 102, 726–747.

  • Navarra, J., Vatakis, A., Zampini, M., Soto-Faraco, S., Humphreys, W., & Spence, C. (2005). Exposure to asynchronous audiovisual speech extends the temporal window for audiovisual integration. Cognitive Brain Research, 25, 499–507.

  • Neuner, F., & Schweinberger, S. R. (2000). Neuropsychological impairments in the recognition of faces, voices, and personal names. Brain and Cognition, 44, 342–366.

  • Pollack, I., Pickett, J. M., & Sumby, W. H. (1954). On the identification of speakers by voice. Journal of the Acoustical Society of America, 26, 403–406.

  • Robertson, D. M. C., & Schweinberger, S. R. (2010). The role of audiovisual asynchrony in person recognition. Quarterly Journal of Experimental Psychology, 63, 23–30.

  • Saint-Amour, D., De Sanctis, P., Molholm, S., Ritter, W., & Foxe, J. J. (2007). Seeing voices: High-density electrical mapping and source-analysis of the multisensory mismatch negativity evoked during the McGurk illusion. Neuropsychologia, 45, 587–597.

  • Sams, M., Aulanko, R., Hämalainen, M., Hari, R., Lounasmaa, O. V., Lu, S.-T., et al. (1991). Seeing speech: Visual information from lip movements modifies activity in the human auditory cortex. Neuroscience Letters, 127, 141–145.

  • Schweinberger, S. R. (1996). Recognizing people by faces, names, and voices: Psychophysiological and neuropsychological investigations. Habilitation thesis, University of Konstanz.

  • Schweinberger, S. R. (2011). Neurophysiological correlates of face recognition. In A. J. Calder, G. Rhodes, M. H. Johnson, & J. V. Haxby (Eds.), The handbook of face perception (pp. 345–366). Oxford: Oxford University Press.

  • Schweinberger, S. R., Casper, C., Hauthal, N., Kaufmann, J. M., Kawahara, H., Kloth, N., et al. (2008). Auditory adaptation in voice perception. Current Biology, 18, 684–688.

  • Schweinberger, S. R., Herholz, A., & Sommer, W. (1997). Recognizing famous voices: Influence of stimulus duration and different types of retrieval cues. Journal of Speech, Language, and Hearing Research, 40, 453–463.

  • Schweinberger, S. R., Herholz, A., & Stief, V. (1997). Auditory long-term memory: Repetition priming of voice recognition. Quarterly Journal of Experimental Psychology, 50A, 498–517.

  • Schweinberger, S. R., Kloth, N., & Robertson, D. M. C. (2011). Hearing facial identities: Brain correlates of face-voice integration in person identification. Cortex, 47, 1026–1037.

  • Schweinberger, S. R., Pickering, E. C., Jentzsch, I., Burton, A. M., & Kaufmann, J. M. (2002). Event-related brain potential evidence for a response of inferior temporal cortex to familiar face repetitions. Cognitive Brain Research, 14, 398–409.

  • Schweinberger, S. R., Robertson, D., & Kaufmann, J. M. (2007). Hearing facial identities. Quarterly Journal of Experimental Psychology, 60, 1446–1456.

  • Schweinberger, S. R., Walther, C., Zäske, R., & Kovacs, G. (2011). Neural correlates of adaptation to voice identity. British Journal of Psychology, 102, 748–764.

  • Shah, N. J., Marshall, J. C., Zafiris, O., Schwab, A., Zilles, K., Markowitsch, H. J., et al. (2001). The neural correlates of person familiarity. A functional magnetic resonance imaging study with clinical implications. Brain, 124, 804–815.

  • Sheffert, S. M., & Olson, E. (2004). Audiovisual speech facilitates voice learning. Perception & Psychophysics, 66, 352–362.

  • Soto-Faraco, S., & Alsius, A. (2009). Deconstructing the McGurk–MacDonald illusion. Journal of Experimental Psychology: Human Perception and Performance, 35, 580–587.

  • Stein, B. E., & Stanford, T. R. (2008). Multisensory integration: Current issues from the perspective of the single neuron. Nature Reviews Neuroscience, 9, 255–266.

  • Stekelenburg, J. J., & Vroomen, J. (2007). Neural correlates of multisensory integration of ecologically valid audiovisual events. Journal of Cognitive Neuroscience, 19, 1964–1973.

  • Sugiura, M., Shah, N. J., Zilles, K., & Fink, G. R. (2005). Cortical representations of personally familiar objects and places: Functional organization of the human posterior cingulate cortex. Journal of Cognitive Neuroscience, 17, 183–198.

  • Summerfield, Q., MacLeod, A., McGrath, M., & Brooke, M. (1989). Lips, teeth, and the benefits of lipreading. In A. W. Young & H. D. Ellis (Eds.), Handbook of research on face processing (pp. 223–233). Amsterdam: North-Holland.

  • van Wassenhove, V., Grant, K. W., & Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences of the United States of America, 102, 1181–1186.

  • van Wassenhove, V., Grant, K. W., & Poeppel, D. (2007). Temporal window of integration in auditory-visual speech perception. Neuropsychologia, 45, 598–607.

  • VanLancker, D., & Kreiman, J. (1987). Voice discrimination and recognition are separate abilities. Neuropsychologia, 25, 829–834.

  • VanLancker, D., Kreiman, J., & Wickens, T. D. (1985). Familiar voice recognition: Patterns and parameters. Part II: Recognition of rate-altered voices. Journal of Phonetics, 13, 39–52.

  • von Kriegstein, K., Kleinschmidt, A., Sterzer, P., & Giraud, A. L. (2005). Interaction of face and voice areas during speaker recognition. Journal of Cognitive Neuroscience, 17, 367–376.

  • Walker, S., Bruce, V., & O’Malley, C. (1995). Facial identity and facial speech processing: Familiar faces and voices in the McGurk effect. Perception & Psychophysics, 57, 1124–1133.

  • Welch, R. B., & Warren, D. H. (1980). Immediate perceptual response to intersensory discrepancy. Psychological Bulletin, 88, 638–667.

  • Zäske, R., Schweinberger, S. R., & Kawahara, H. (2010). Voice aftereffects of adaptation to speaker identity. Hearing Research, 268, 38–45.

Acknowledgments

The author’s research is supported by grants from the Deutsche Forschungsgemeinschaft (Grants Schw 511/6-2 and Schw511/10-1) in the context of the DFG Research Unit Person Perception (FOR1097). I am very grateful to Romi Zäske for helpful comments on an earlier draft of this chapter.

Author information

Correspondence to Stefan R. Schweinberger.

Copyright information

© 2013 Springer Science+Business Media New York

About this chapter

Cite this chapter

Schweinberger, S.R. (2013). Audiovisual Integration in Speaker Identification. In: Belin, P., Campanella, S., Ethofer, T. (eds) Integrating Face and Voice in Person Perception. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-3585-3_6

Download citation

Publish with us

Policies and ethics