Abstract
In this paper, we study the associations between human faces and voices. Audiovisual integration, specifically the integration of facial and vocal information, is a well-researched area in neuroscience. It has been shown that the information shared between the two modalities plays a significant role in perceptual tasks such as speaker identification. Through an online study on a new dataset we created, we confirm previous findings that people can associate unseen faces with corresponding voices, and vice versa, with greater-than-chance accuracy. We computationally model the overlapping information between faces and voices and show that the learned cross-modal representation contains enough information to identify matching faces and voices with performance similar to that of humans. Our representation exhibits correlations with certain demographic attributes and with features obtained from either the visual or the aural modality alone. We release the dataset used in our studies, consisting of audiovisual recordings of people reading out short texts, together with demographic annotations.
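As a rough illustration of the kind of cross-modal matching described above (not the authors' actual architecture), the sketch below projects pre-extracted face and voice features into a shared embedding space and trains with a triplet-style hinge loss that pulls a matching face–voice pair closer than a mismatched one. All names, dimensions, and the random "features" are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-extracted unimodal features (e.g. from a face CNN
# and a voice network); the dimensions are illustrative only.
FACE_DIM, VOICE_DIM, EMB_DIM = 128, 64, 32

# Untrained linear projections into a shared embedding space.
W_face = rng.standard_normal((FACE_DIM, EMB_DIM)) * 0.1
W_voice = rng.standard_normal((VOICE_DIM, EMB_DIM)) * 0.1

def embed(x, W):
    """Project a unimodal feature into the shared space, L2-normalized."""
    z = x @ W
    return z / np.linalg.norm(z)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: the matching pair must be closer than the mismatched
    pair by at least `margin` in squared Euclidean distance."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# One illustrative triplet: a face anchor, its matching voice, and a
# distractor voice from a different (hypothetical) speaker.
face = rng.standard_normal(FACE_DIM)
voice_match = rng.standard_normal(VOICE_DIM)
voice_other = rng.standard_normal(VOICE_DIM)

f = embed(face, W_face)
v_pos = embed(voice_match, W_voice)
v_neg = embed(voice_other, W_voice)
loss = triplet_loss(f, v_pos, v_neg)
```

In a real system the projections would be deep networks optimized over many such triplets; at test time, a face is matched to whichever candidate voice embedding lies closest in the shared space.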
Acknowledgments
This work was funded in part by the QCRI–CSAIL computer science research program. Changil Kim was supported by a Swiss National Science Foundation fellowship P2EZP2 168785. We thank Sung-Ho Bae for his help.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Kim, C., Shin, H.V., Oh, T.H., Kaspar, A., Elgharib, M., Matusik, W. (2019). On Learning Associations of Faces and Voices. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds.) Computer Vision – ACCV 2018. Lecture Notes in Computer Science, vol. 11365. Springer, Cham. https://doi.org/10.1007/978-3-030-20873-8_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20872-1
Online ISBN: 978-3-030-20873-8
eBook Packages: Computer Science (R0)