
Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning

  • Andrew Owens
  • Jiajun Wu
  • Josh H. McDermott
  • William T. Freeman
  • Antonio Torralba

Abstract

The sound of crashing waves, the roar of fast-moving cars—sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. This paper extends an earlier conference paper, Owens et al. (in: European conference on computer vision, 2016b), with additional experiments and discussion.
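To make the training recipe described above concrete, here is a minimal sketch of the general idea under simplifying assumptions: each clip's audio is reduced to a fixed-length statistical summary (here, MFCC means and standard deviations rather than the paper's full sound-texture statistics), the summaries are clustered into pseudo-labels, and a small CNN (not the AlexNet-style network used in the paper) is trained to predict a frame's audio cluster. All function and class names are illustrative, not the authors' code.

```python
# Sketch of sound-based supervision for visual learning (illustrative, not the
# authors' exact pipeline):
#   1) summarize each clip's ambient audio with simple texture-like statistics,
#   2) cluster the summaries into pseudo-labels,
#   3) train a CNN to predict the pseudo-label from a single video frame.
import numpy as np
import librosa
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def audio_summary(waveform: np.ndarray, sr: int) -> np.ndarray:
    """Fixed-length statistical summary of a clip's audio (per-band MFCC means/stds)."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=20)          # (20, T)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])      # (40,)

def make_pseudo_labels(summaries: np.ndarray, n_clusters: int = 30) -> np.ndarray:
    """Cluster the audio summaries; cluster ids become the CNN's training targets."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(summaries)

class FrameToSoundCNN(nn.Module):
    """Small CNN that predicts a sound cluster from a single RGB frame."""
    def __init__(self, n_clusters: int = 30):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, n_clusters)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        h = self.features(frames).flatten(1)
        return self.classifier(h)          # logits over sound clusters

def train_step(model, optimizer, frames, labels):
    """One supervised step; 'labels' are the audio cluster ids of the frames' clips."""
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(frames), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Framing the problem as classification over clusters of audio statistics, rather than regressing the raw statistics, keeps the visual network's target low-dimensional and discrete; the learned convolutional features can then be transferred to recognition tasks.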

Keywords

Sound · Convolutional networks · Unsupervised learning

Notes

Acknowledgements

This work was supported by NSF Grant #1524817 to A.T.; NSF Grants #1447476 and #1212849 to W.F.; a McDonnell Scholar Award to J.H.M.; and a Microsoft Ph.D. Fellowship to A.O. It was also supported by Shell Research, and by a donation of GPUs from NVIDIA. We thank Phillip Isola for helpful discussions, and Carl Vondrick for sharing the data that we used in our experiments. We also thank the anonymous reviewers for their comments, which significantly improved the paper (in particular, for suggesting the comparison with texton features in Sect. 5).

References

  1. Agrawal, P., Carreira, J., & Malik, J. (2015). Learning to see by moving. In IEEE international conference on computer vision.
  2. Andrew, G., Arora, R., Bilmes, J. A., & Livescu, K. (2013). Deep canonical correlation analysis. In International conference on machine learning.
  3. Arandjelović, R., & Zisserman, A. (2017). Look, listen and learn. In IEEE international conference on computer vision.
  4. Aytar, Y., Vondrick, C., & Torralba, A. (2016). SoundNet: Learning sound representations from unlabeled video. In Advances in neural information processing systems.
  5. Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network dissection: Quantifying interpretability of deep visual representations. In IEEE conference on computer vision and pattern recognition.
  6. de Sa, V. R. (1994a). Learning classification with unlabeled data. In Advances in neural information processing systems (pp. 112).
  7. de Sa, V. R. (1994b). Minimizing disagreement for self-supervised classification. In Proceedings of the 1993 Connectionist Models Summer School (pp. 300). Psychology Press.
  8. Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition.
  9. Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. In IEEE international conference on computer vision.
  10. Doersch, C., & Zisserman, A. (2017). Multi-task self-supervised visual learning. In IEEE international conference on computer vision (pp. 2051–2060).
  11. Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., & Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems.
  12. Ellis, D. P., Zeng, X., & McDermott, J. H. (2011). Classifying soundtracks with audio texture features. In IEEE international conference on acoustics, speech, and signal processing.
  13. Eronen, A. J., Peltonen, V. T., Tuomi, J. T., Klapuri, A. P., Fagerlund, S., Sorsa, T., et al. (2006). Audio-based context recognition. IEEE Transactions on Audio, Speech, and Language Processing, 14(1), 321–329.
  14. Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
  15. Fisher III, J. W., Darrell, T., Freeman, W. T., & Viola, P. A. (2000). Learning joint statistical models for audio–visual fusion and segregation. In Advances in neural information processing systems.
  16. Gaver, W. W. (1993). What in the world do we hear?: An ecological approach to auditory event perception. Ecological Psychology, 5(1), 1–29.
  17. Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., & Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. In IEEE international conference on acoustics, speech, and signal processing.
  18. Girshick, R. (2015). Fast R-CNN. In IEEE international conference on computer vision.
  19. Goroshin, R., Bruna, J., Tompson, J., Eigen, D., & LeCun, Y. (2015). Unsupervised feature learning from temporal data. arXiv preprint arXiv:1504.02518.
  20. Gupta, S., Hoffman, J., & Malik, J. (2016). Cross modal distillation for supervision transfer. In IEEE conference on computer vision and pattern recognition.
  21. Hershey, J. R., & Movellan, J. R. (1999). Audio vision: Using audio–visual synchrony to locate sounds. In Advances in neural information processing systems.
  22. Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. In ACM symposium on theory of computing.
  23. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning.
  24. Isola, P. (2015). The discovery of perceptual structure from visual co-occurrences in space and time. PhD thesis.
  25. Isola, P., Zoran, D., Krishnan, D., & Adelson, E. H. (2016). Learning visual groups from co-occurrences in space and time. In International conference on learning representations, workshop.
  26. Jayaraman, D., & Grauman, K. (2015). Learning image representations tied to ego-motion. In IEEE international conference on computer vision.
  27. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM multimedia conference.
  28. Kidron, E., Schechner, Y. Y., & Elad, M. (2005). Pixels that sound. In IEEE conference on computer vision and pattern recognition.
  29. Krähenbühl, P., Doersch, C., Donahue, J., & Darrell, T. (2016). Data-dependent initializations of convolutional neural networks. In International conference on learning representations.
  30. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems.
  31. Le, Q. V., Ranzato, M. A., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., & Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning. In International conference on machine learning.
  32. Lee, K., Ellis, D. P., & Loui, A. C. (2010). Detecting local semantic concepts in environmental sounds using Markov model based clustering. In IEEE international conference on acoustics, speech, and signal processing.
  33. Leung, T., & Malik, J. (2001). Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision, 43(1), 29–44.
  34. Lin, M., Chen, Q., & Yan, S. (2014). Network in network. In International conference on learning representations.
  35. McDermott, J. H., & Simoncelli, E. P. (2011). Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis. Neuron, 71(5), 926–940.
  36. Mishkin, D., & Matas, J. (2015). All you need is a good init. arXiv preprint arXiv:1511.06422.
  37. Mobahi, H., Collobert, R., & Weston, J. (2009). Deep learning from temporal coherence in video. In International conference on machine learning.
  38. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. In International conference on machine learning.
  39. Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2015). Is object localization for free? Weakly-supervised learning with convolutional neural networks. In IEEE conference on computer vision and pattern recognition.
  40. Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., & Freeman, W. T. (2016a). Visually indicated sounds. In IEEE conference on computer vision and pattern recognition.
  41. Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (2016b). Ambient sound provides supervision for visual learning. In European conference on computer vision.
  42. Pathak, D., Girshick, R., Dollár, P., Darrell, T., & Hariharan, B. (2017). Learning features by watching objects move. In IEEE conference on computer vision and pattern recognition.
  43. Salakhutdinov, R., & Hinton, G. (2009). Semantic hashing. International Journal of Approximate Reasoning, 50(7), 969–978.
  44. Slaney, M., & Covell, M. (2000). FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks. In Advances in neural information processing systems.
  45. Smith, L., & Gasser, M. (2005). The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1–2), 13–29.
  46. Srivastava, N., & Salakhutdinov, R. R. (2012). Multimodal learning with deep Boltzmann machines. In Advances in neural information processing systems.
  47. Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., & Li, L. J. (2015). The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817.
  48. Wang, X., & Gupta, A. (2015). Unsupervised learning of visual representations using videos. In IEEE international conference on computer vision.
  49. Weiss, Y., Torralba, A., & Fergus, R. (2009). Spectral hashing. In Advances in neural information processing systems.
  50. Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In IEEE conference on computer vision and pattern recognition.
  51. Zhang, R., Isola, P., & Efros, A. A. (2016). Colorful image colorization. In European conference on computer vision (pp. 649–666). Springer.
  52. Zhang, R., Isola, P., & Efros, A. A. (2017). Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In IEEE conference on computer vision and pattern recognition.
  53. Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In Advances in neural information processing systems.
  54. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Object detectors emerge in deep scene CNNs. In International conference on learning representations.
  55. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In IEEE conference on computer vision and pattern recognition.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Andrew Owens (1, 2)
  • Jiajun Wu (2)
  • Josh H. McDermott (2)
  • William T. Freeman (2)
  • Antonio Torralba (2)

  1. University of California, Berkeley, USA
  2. Massachusetts Institute of Technology, Cambridge, USA
