The Sound of Pixels

  • Hang Zhao
  • Chuang Gan
  • Andrew Rouditchenko
  • Carl Vondrick
  • Josh McDermott
  • Antonio Torralba
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11205)


We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel. Our approach capitalizes on the natural synchronization of the visual and audio modalities to learn models that jointly parse sounds and images, without requiring additional manual supervision. Experimental results on a newly collected MUSIC dataset show that our proposed Mix-and-Separate framework outperforms several baselines on source separation. Qualitative results suggest our model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources.
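The Mix-and-Separate idea described above can be sketched as follows. This is a minimal, hypothetical numpy illustration of the self-supervised target construction only (not the paper's network): the audio tracks of two videos are summed to form a synthetic mixture, and the original tracks provide free ground truth, here expressed as ideal ratio masks over magnitude spectrograms. All names, shapes, and the mask formulation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Magnitude spectrograms of two solo videos (freq bins x time frames).
spec_a = rng.random((256, 64))
spec_b = rng.random((256, 64))

# The synthetic mixture is the training input; no manual labels needed.
mixture = spec_a + spec_b

# Ideal ratio masks: the per-source targets a separation network
# would be trained to predict, conditioned on visual features.
eps = 1e-8
mask_a = spec_a / (mixture + eps)
mask_b = spec_b / (mixture + eps)

# Applying a (here: ideal) mask to the mixture recovers the source,
# which is what enables per-pixel volume adjustment at test time.
recovered_a = mask_a * mixture
```

Because the mixture is constructed by the training pipeline itself, the separation objective needs no annotated sources, which is what lets the system scale to large amounts of unlabeled video.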


Keywords: Cross-modal learning · Sound separation and localization



This work was supported by NSF grant IIS-1524817. We thank Adria Recasens, Yu Zhang and Xue Feng for insightful discussions.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Hang Zhao¹ (corresponding author)
  • Chuang Gan¹,²
  • Andrew Rouditchenko¹
  • Carl Vondrick¹,³
  • Josh McDermott¹
  • Antonio Torralba¹
  1. Massachusetts Institute of Technology, Cambridge, USA
  2. MIT-IBM Watson AI Lab, Cambridge, USA
  3. Columbia University, New York City, USA
