On Learning Associations of Faces and Voices

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11365)

Abstract

In this paper, we study the associations between human faces and voices. Audiovisual integration, specifically the integration of facial and vocal information, is a well-researched area in neuroscience. It has been shown that the overlapping information between the two modalities plays a significant role in perceptual tasks such as speaker identification. Through an online study on a new dataset we created, we confirm previous findings that people can associate unseen faces with corresponding voices, and vice versa, with greater-than-chance accuracy. We computationally model the overlapping information between faces and voices and show that the learned cross-modal representation contains enough information to identify matching faces and voices with performance similar to that of humans. Our representation exhibits correlations with certain demographic attributes and with features obtained from either the visual or the aural modality alone. We release the dataset used in our studies, consisting of audiovisual recordings of people reading short texts, together with demographic annotations.
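The model details are not reproduced on this page. Purely as an illustration of the kind of cross-modal metric learning the abstract describes, the sketch below embeds face and voice features into a shared space and trains with a triplet loss [12]; the encoder architectures, feature dimensions, and margin are all assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only -- not the authors' implementation.
# Two encoders project face and voice features into a shared embedding
# space; a triplet loss pulls matching face/voice pairs together and
# pushes mismatched pairs apart.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEmbedding(nn.Module):
    def __init__(self, face_dim=4096, voice_dim=1024, embed_dim=128):
        super().__init__()
        # Assumed inputs: pre-extracted face features (e.g., from a face
        # CNN [30]) and voice features (e.g., from an audio CNN [2]).
        self.face_proj = nn.Linear(face_dim, embed_dim)
        self.voice_proj = nn.Linear(voice_dim, embed_dim)

    def forward(self, face_feat, voice_feat):
        # L2-normalize so that squared Euclidean distance in the shared
        # space is a monotone function of cosine similarity.
        f = F.normalize(self.face_proj(face_feat), dim=-1)
        v = F.normalize(self.voice_proj(voice_feat), dim=-1)
        return f, v

def triplet_loss(anchor, positive, negative, margin=0.2):
    # The matching pair must be closer than the mismatched pair
    # by at least the margin.
    d_pos = (anchor - positive).pow(2).sum(dim=-1)
    d_neg = (anchor - negative).pow(2).sum(dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()
```

At test time, matching faces to voices (or vice versa) then reduces to nearest-neighbor search in the shared embedding space.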


Notes

  1. In machine learning terminology, this could be seen as natural supervision [28] or self-supervision [9] with unlabeled data; a sketch of this pairing scheme follows these notes.

  2. Gender is such a strong cue that we use it for control questions in our user study. See Sect. 3.
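As footnote 1 notes, audiovisual data supplies its own labels: a face and a voice taken from the same recording form a positive pair, while pairings across recordings serve as negatives. A hypothetical sketch of that sampling scheme (the clip structure and field names are assumptions):

```python
# Hypothetical sketch of "natural supervision" from unlabeled clips:
# same-clip face/voice pairs are positives; cross-clip pairs are negatives.
import random

def sample_triplet(clips):
    """clips: list of dicts holding 'face' and 'voice' feature tensors."""
    a, b = random.sample(clips, 2)  # two distinct clips
    anchor = a['face']     # face from clip a
    positive = a['voice']  # voice from the same clip -> free positive label
    negative = b['voice']  # voice from a different clip -> negative
    return anchor, positive, negative
```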

References

  1. Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: ICCV, pp. 609–617 (2017)


  2. Aytar, Y., Vondrick, C., Torralba, A.: SoundNet: learning sound representations from unlabeled video. In: NIPS, pp. 892–900 (2016)


  3. Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: quantifying interpretability of deep visual representations. In: CVPR, pp. 3319–3327 (2017)


  4. Brookes, H., Slater, A., Quinn, P.C., Lewkowicz, D.J., Hayes, R., Brown, E.: Three-month-old infants learn arbitrary auditory-visual pairings between voices and faces. Infant Child Dev. 10(1–2), 75–82 (2001)


  5. Campanella, S., Belin, P.: Integrating face and voice in person perception. Trends Cogn. Sci. 11(12), 535–543 (2007)


  6. Chen, W., Chen, X., Zhang, J., Huang, K.: Beyond triplet loss: a deep quadruplet network for person re-identification. In: CVPR, pp. 1320–1329 (2017)


  7. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: CVPR, pp. 539–546 (2005)


  8. Chung, J.S., Senior, A.W., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: CVPR, pp. 3444–3453 (2017)


  9. Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: ICCV, pp. 1422–1430 (2015)


  10. Gaver, W.W.: What in the world do we hear? An ecological approach to auditory event perception. Ecol. Psychol. 5(1), 1–29 (1993)


  11. Gebru, I.D., Ba, S., Evangelidis, G., Horaud, R.: Tracking the active speaker based on a joint audio-visual observation model. In: ICCV Workshop, pp. 15–21 (2015)


  12. Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: Feragen, A., Pelillo, M., Loog, M. (eds.) SIMBAD 2015. LNCS, vol. 9370, pp. 84–92. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24261-3_7


  13. Hoover, K., Chaudhuri, S., Pantofaru, C., Slaney, M., Sturdy, I.: Putting a face to the voice: fusing audio and visual signals across a video to determine speakers. arXiv preprint arXiv:1706.00079 (2017)

  14. Joassin, F., Pesenti, M., Maurage, P., Verreckt, E., Bruyer, R., Campanella, S.: Cross-modal interactions between human faces and voices involved in person recognition. Cortex 47(3), 367–376 (2011)


  15. Jones, B., Kabanoff, B.: Eye movements in auditory space perception. Atten. Percept. Psychophys. 17(3), 241–245 (1975)


  16. Kamachi, M., Hill, H., Lander, K., Vatikiotis-Bateson, E.: “Putting the face to the voice”: matching identity across modality. Curr. Biol. 13(19), 1709–1714 (2003)


  17. Karras, T., Aila, T., Laine, S., Herva, A., Lehtinen, J.: Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans. Graph. 36(4), 94:1–94:12 (2017)


  18. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  19. von Kriegstein, K., Kleinschmidt, A., Sterzer, P., Giraud, A.L.: Interaction of face and voice areas during speaker recognition. J. Cogn. Neurosci. 17(3), 367–376 (2005)


  20. Lachs, L., Pisoni, D.B.: Crossmodal source identification in speech perception. Ecol. Psychol. 16(3), 159–187 (2004)


  21. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: ICCV, pp. 3730–3738 (2015)


  22. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. JMLR 9, 2579–2605 (2008)


  23. Mavica, L.W., Barenholtz, E.: Matching voice and face identity from static images. J. Exp. Psychol. Hum. Percept. Perform. 39(2), 307–312 (2013)


  24. McGurk, H., MacDonald, J.: Hearing lips and seeing voices. Nature 264(5588), 746–748 (1976)


  25. Nagrani, A., Albanie, S., Zisserman, A.: Seeing voices and hearing faces: cross-modal biometric matching. In: CVPR, pp. 8427–8436 (2018)


  26. Nagrani, A., Chung, J.S., Zisserman, A.: VoxCeleb: a large-scale speaker identification dataset. In: INTERSPEECH, pp. 2616–2620 (2017)


  27. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: ICML, pp. 689–696 (2011)


  28. Owens, A., Isola, P., McDermott, J.H., Torralba, A., Adelson, E.H., Freeman, W.T.: Visually indicated sounds. In: CVPR, pp. 2405–2413 (2016)


  29. Owens, A., Wu, J., McDermott, J.H., Freeman, W.T., Torralba, A.: Ambient sound provides supervision for visual learning. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 801–816. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_48


  30. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: BMVC, pp. 41.1–41.12 (2015)


  31. Senocak, A., Oh, T., Kim, J., Yang, M., Kweon, I.S.: Learning to localize sound source in visual scenes. arXiv preprint arXiv:1803.03849 (2018)

  32. Shelton, B.R., Searle, C.L.: The influence of vision on the absolute identification of sound-source position. Percept. Psychophys. 28(6), 589–596 (1980)


  33. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  34. Sliwa, J., Duhamel, J.R., Pascalis, O., Wirth, S.: Spontaneous voice-face identity matching by rhesus monkeys for familiar conspecifics and humans. PNAS 108(4), 1735–1740 (2011)


  35. Smith, H.M., Dunn, A.K., Baguley, T., Stacey, P.C.: Concordant cues in faces and voices: testing the backup signal hypothesis. Evol. Psychol. 14(1), 1–10 (2016)


  36. Smith, H.M., Dunn, A.K., Baguley, T., Stacey, P.C.: Matching novel face and voice identity using static and dynamic facial images. Atten. Percept. Psychophys. 78(3), 868–879 (2016)


  37. Solèr, M., Bazin, J.-C., Wang, O., Krause, A., Sorkine-Hornung, A.: Suggesting sounds for images from video collections. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 900–917. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48881-3_59


  38. Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing Obama: learning lip sync from audio. ACM Trans. Graph. 36(4), 95:1–95:13 (2017)


  39. Taylor, S.L., et al.: A deep learning approach for generalized speech animation. ACM Trans. Graph. 36(4), 93:1–93:11 (2017)


  40. Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: CVPR, pp. 1521–1528 (2011)


  41. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: CVPR, pp. 2962–2971 (2017)


  42. Vendrov, I., Kiros, R., Fidler, S., Urtasun, R.: Order-embeddings of images and language. arXiv preprint arXiv:1511.06361 (2015)

  43. Wu, Z., Singh, B., Davis, L.S., Subrahmanian, V.S.: Deception detection in videos. In: AAAI (2018)


  44. Zweig, L.J., Suzuki, S., Grabowecky, M.: Learned face-voice pairings facilitate visual search. Psychon. Bull. Rev. 22(2), 429–436 (2015)



Acknowledgments

This work was funded in part by the QCRI–CSAIL computer science research program. Changil Kim was supported by a Swiss National Science Foundation fellowship P2EZP2 168785. We thank Sung-Ho Bae for his help.

Author information

Correspondence to Changil Kim.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 13,821 KB)


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Kim, C., Shin, H.V., Oh, T.-H., Kaspar, A., Elgharib, M., Matusik, W. (2019). On Learning Associations of Faces and Voices. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds.) Computer Vision – ACCV 2018. Lecture Notes in Computer Science, vol. 11365. Springer, Cham. https://doi.org/10.1007/978-3-030-20873-8_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-20873-8_18


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20872-1

  • Online ISBN: 978-3-030-20873-8

  • eBook Packages: Computer Science (R0)
