An effective analysis of deep learning based approaches for audio based feature extraction and its visualization

  • Dhiraj
  • Rohit Biswas
  • Nischay Ghattamaraju


Visualizations help decipher latent patterns in music and build a deeper understanding of a song’s characteristics. This paper critically analyzes how effective various state-of-the-art deep neural networks are at visualizing music. Several implementations of autoencoders and genre classifiers are explored for extracting meaningful features from audio tracks. Novel techniques are devised to map these audio features to parameters that drive visualizations. These methods are designed so that the visualizations respond to the music and provide a unique visual experience for each song.
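As a rough illustration of what mapping audio features to visualization parameters can look like, the sketch below scales two feature dimensions per audio frame to [0, 1] and turns them into a color and a circle radius. It is not the authors' method: the two-dimensional feature vectors, the function names, and the hue and radius mapping are all assumptions made for this example. In practice the features might come from a network's bottleneck layer after a dimensionality reduction step.

```python
import colorsys


def min_max_scale(values):
    """Scale a sequence of numbers to [0, 1]; a constant sequence maps to 0."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant input
    return [(v - lo) / span for v in values]


def features_to_visual_params(frames):
    """Map per-frame 2-D feature vectors to (rgb, radius) drawing parameters.

    `frames` is a hypothetical list of (feature_a, feature_b) pairs, one per
    audio frame. The hue/radius mapping is an illustrative choice only.
    """
    hues = min_max_scale([f[0] for f in frames])
    sizes = min_max_scale([f[1] for f in frames])
    params = []
    for hue, size in zip(hues, sizes):
        # Use only 80% of the hue wheel so the extremes stay distinguishable.
        rgb = colorsys.hsv_to_rgb(hue * 0.8, 1.0, 1.0)
        radius = 5.0 + 20.0 * size  # radius in arbitrary drawing units
        params.append((rgb, radius))
    return params


# Placeholder feature values standing in for real extracted audio features.
frames = [(0.1, 2.0), (0.5, 4.0), (0.9, 3.0)]
params = features_to_visual_params(frames)
```

Because the scaling is computed per track, two songs with different feature ranges each use the full color and size range, which is one simple way to keep visuals responsive within a song while differing across songs.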


Deep neural networks, Convolutional autoencoder, VGG, AlexNet, Audio feature extraction, Genre classifiers, Audio visualization, PCA, K-means




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. CSIR-Central Electronics Engineering Research Institute (CEERI), Pilani, India
  2. Birla Institute of Technology and Science, Pilani, India
