Applying visual domain style transfer and texture synthesis techniques to audio: insights and challenges
Abstract
Style transfer is a technique for combining two images based on the activations and feature statistics in a deep neural network architecture. This paper studies the analogous task in the audio domain and takes a critical look at the problems that arise when adapting the original vision-based framework to handle spectrogram representations. We conclude that CNN architectures with features based on 2D representations and convolutions are better suited for visual images than for time–frequency representations of audio. Despite the awkward fit, experiments show that the Gram-matrix-defined "style" for audio is more closely aligned with timbral signatures lacking temporal structure, whereas the network-layer activations that determine audio "content" capture more of the pitch and rhythmic structure. We shed light on several reasons for these domain differences with illustrative examples. We motivate the use of several types of one-dimensional CNNs that generate results better aligned with intuitive notions of audio texture than those produced by existing architectures built for images. These ideas also prompt an exploration of audio texture synthesis with architectural variants for extensions to infinite textures, multi-textures, parametric control of receptive fields, and the constant-Q transform as an alternative frequency scaling for the spectrogram.
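The "style" statistics referred to above are Gram matrices of CNN feature maps: channel-by-channel correlations summed over the spatial (or, for audio, temporal) axis, which discard positional structure. A minimal NumPy sketch (ours, not the paper's implementation) illustrates why such statistics capture timbre-like signatures without temporal order:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of CNN feature maps.

    features: array of shape (channels, positions), i.e. activations
    flattened over the spatial/time axis. Entry (i, j) is the mean
    product of channels i and j, so all positional (temporal)
    structure is discarded.
    """
    c, n = features.shape
    return features @ features.T / n

# Toy illustration: two activation patterns that differ in temporal
# order but share identical Gram matrices -- the "style" statistics
# cannot distinguish them.
a = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
b = a[:, ::-1]  # time-reversed activations
assert np.allclose(gram_matrix(a), gram_matrix(b))
```

Any permutation of the time axis leaves the Gram matrix unchanged, which is consistent with the observation that Gram-matrix "style" aligns with timbral character rather than rhythmic or melodic structure.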
Keywords
Style transfer · Texture synthesis · Sound modelling · Convolutional neural networks
Acknowledgements
This research was supported by an NVIDIA Corporation Academic Programs GPU Grant.