Objects that Sound

  • Relja Arandjelović
  • Andrew Zisserman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11205)


This paper has two objectives: first, to design networks that embed audio and visual inputs into a common space suitable for cross-modal retrieval; and second, to design a network that localizes the object that sounds in an image, given the audio signal. We achieve both objectives by training from unlabelled video using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video.
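To make the AVC objective concrete, the following is a minimal sketch of one way such a loss could be written, and not the authors' implementation. It assumes that each training pair carries a binary label (1 when the image and audio come from the same moment of the same video, 0 otherwise) and that correspondence is scored from the Euclidean distance between L2-normalized embeddings. The names `avc_loss`, `scale`, and `bias` are hypothetical; in a real model the calibration parameters would be learned.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def avc_loss(visual_emb, audio_emb, labels, scale=1.0, bias=0.0):
    """Binary cross-entropy on a correspondence score derived from embedding distance.

    visual_emb, audio_emb: (N, D) arrays, one row per (image, audio) pair.
    labels: (N,) array, 1 if the pair corresponds, else 0 (assumed labelling).
    scale, bias: hypothetical calibration parameters.
    """
    v = l2_normalize(np.asarray(visual_emb, dtype=float))
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    labels = np.asarray(labels, dtype=float)
    dist = np.linalg.norm(v - a, axis=1)   # small distance -> likely corresponds
    logits = bias - scale * dist
    # Numerically stable BCE with logits: log(1 + exp(z)) - y * z
    loss = np.logaddexp(0.0, logits) - labels * logits
    return float(loss.mean())
```

Because the only supervision is whether a pair corresponds, no manual labels are needed: the pairs can be drawn directly from unlabelled video.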

To this end, we design new network architectures that can be trained for cross-modal retrieval and localizing the sound source in an image, by using the AVC task. We make the following contributions: (i) show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) explore various architectures for the AVC task, including those for the visual stream that ingest a single image, or multiple images, or a single image and multi-frame optical flow; (iii) show that the semantic object that sounds within an image can be localized (using only the sound, no motion or flow information); and (iv) give a cautionary tale on how to avoid undesirable shortcuts in the data preparation.
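As an illustration of contributions (i) and (iii), the sketch below shows how embeddings in a common space might be used for retrieval and for localization. It is a simplified sketch under stated assumptions, not the architecture described in the paper: retrieval is assumed to be nearest-neighbour search by Euclidean distance, and localization is assumed to score each cell of a spatial grid of visual embeddings against a single audio embedding with a sigmoid of the dot product. The function names `retrieve` and `localization_map` are hypothetical.

```python
import numpy as np

def retrieve(query_emb, gallery_emb, k=3):
    """Return indices of the k gallery rows closest to the query (Euclidean distance).

    Works for within-mode or between-mode retrieval, provided both
    inputs live in the same embedding space.
    """
    dists = np.linalg.norm(gallery_emb - query_emb, axis=1)
    return np.argsort(dists)[:k]

def localization_map(audio_emb, visual_grid):
    """Score every spatial cell of a visual feature grid against one audio embedding.

    audio_emb: (D,) vector; visual_grid: (H, W, D) array of per-location embeddings.
    Returns an (H, W) map of sigmoid-squashed dot products and the best-scoring cell.
    """
    scores = visual_grid @ audio_emb               # shape (H, W)
    heat = 1.0 / (1.0 + np.exp(-scores))           # assumed scoring function
    best = tuple(int(i) for i in np.unravel_index(np.argmax(heat), heat.shape))
    return heat, best
```

For example, calling `retrieve` with an audio embedding as the query and image embeddings as the gallery performs audio-to-image retrieval, while `localization_map` uses only the sound and a single image, with no motion or flow input.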



We thank Carl Doersch for useful insights regarding preventing shortcuts.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. DeepMind, London, UK
  2. VGG, Department of Engineering Science, University of Oxford, Oxford, UK
