Visually Indicated Sound Generation by Perceptually Optimized Classification

  • Kan Chen
  • Chuanxi Zhang
  • Chen Fang
  • Zhaowen Wang
  • Trung Bui
  • Ram Nevatia
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11134)


Visually indicated sound generation aims to predict visually consistent sound from video content. Previous methods addressed this problem with a single generative model that ignores the distinctive characteristics of different sound categories. Meanwhile, state-of-the-art sound classification networks are available to capture semantic-level information in the audio modality, which can also serve visually indicated sound generation. In this paper, we explore generating fine-grained sound from a variety of sound classes, and leverage pre-trained sound classification networks to improve audio generation quality. We propose a novel Perceptually Optimized Classification based Audio generation Network (POCAN), which generates sound conditioned on the sound class predicted from visual information. Additionally, a perceptual loss is computed via a pre-trained sound classification network to align the semantic information between the generated sound and its ground truth during training. Experiments show that POCAN achieves significantly better results on the visually indicated sound generation task on two datasets.


Keywords: Visually indicated sound generation · Perceptual loss
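The perceptual loss described in the abstract compares generated and ground-truth audio in the feature space of a fixed, pre-trained sound classifier rather than in the raw waveform space. The sketch below illustrates that idea in minimal form; the fixed random linear layer is a hypothetical stand-in for the pre-trained classification network, not the network used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-trained sound classifier's feature
# extractor: a fixed (untrained) linear layer followed by a ReLU.
W = rng.standard_normal((128, 64))

def features(waveform):
    """Map a 128-sample waveform to a 64-d semantic feature vector."""
    return np.maximum(waveform @ W, 0.0)  # weights are frozen

def perceptual_loss(generated, ground_truth):
    """Mean squared distance between classifier features of two sounds."""
    diff = features(generated) - features(ground_truth)
    return float(np.mean(diff ** 2))

gen = rng.standard_normal(128)  # toy "generated" waveform
gt = rng.standard_normal(128)   # toy "ground-truth" waveform

assert perceptual_loss(gt, gt) == 0.0  # identical sounds: zero loss
assert perceptual_loss(gen, gt) > 0.0  # different sounds: positive loss
```

During training this loss term would be added to the regression loss, so the generator is penalized when its output diverges semantically (as judged by the classifier's features) from the ground-truth sound.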



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Kan Chen (1)
  • Chuanxi Zhang (1)
  • Chen Fang (2)
  • Zhaowen Wang (2)
  • Trung Bui (2)
  • Ram Nevatia (1)
  1. University of Southern California, Los Angeles, USA
  2. Adobe Research, San Jose, USA