An evaluation of deep neural network models for music classification using spectrograms


Deep neural network (DNN) models have recently received considerable attention because their network structure can extract deep features that improve classification accuracy, and they have achieved excellent results on image tasks. However, because music and images differ in content form, transferring deep learning to music classification remains an open problem. To address this issue, in this paper we transfer state-of-the-art DNN models to music classification and evaluate their performance using spectrograms. First, we convert music audio files into spectrograms by modal transformation and then classify the music through deep learning. To alleviate overfitting during training, we propose a balanced trusted loss function and build the balanced trusted model ResNet50_trust. Finally, we compare the performance of different DNN models in music classification. In addition, this work adds music sentiment analysis based on a newly constructed music emotion dataset. Extensive experimental evaluations on three music datasets show that our proposed model ResNet50_trust consistently outperforms the other DNN models.
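The audio-to-spectrogram conversion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the frame length, hop size, Hann window, naive DFT, and synthetic 1 kHz test tone are all assumptions chosen so that the example runs with the Python standard library alone. A real system would more likely read audio files and use an FFT-based library.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def spectrogram(signal, frame_len=64, hop=32):
    """Log-magnitude spectrogram from a naive windowed DFT.

    Returns a list of frames; each frame holds frame_len // 2 + 1
    magnitudes in dB (one per non-negative frequency bin).
    frame_len and hop are illustrative values, not taken from the paper.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    # Precompute the DFT basis once so every frame reuses it.
    basis = [
        [cmath.exp(-2j * math.pi * k * n / frame_len) for n in range(frame_len)]
        for k in range(n_bins)
    ]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        mags = []
        for k in range(n_bins):
            coeff = sum(c * b for c, b in zip(chunk, basis[k]))
            # Small offset avoids log10(0) on silent bins.
            mags.append(20 * math.log10(abs(coeff) + 1e-10))
        frames.append(mags)
    return frames


# Hypothetical input: a pure 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
tone_hz = 1000.0
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(1024)]

spec = spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64  # bin spacing = sample_rate / frame_len
print(len(spec), len(spec[0]), peak_hz)
```

Each frame becomes one column of the resulting time-frequency image, which is the kind of input an image-oriented DNN can then classify.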





This work was supported in part by the Natural Science Foundation of the Colleges and Universities in Anhui Province of China under Grant No. KJ2020A0035, and in part by the Scientific Research Project of Hebei Education Department of China under Grant No. QN2020198.

Author information



Corresponding author

Correspondence to Lixin Han.


Cite this article

Li, J., Han, L., Li, X. et al. An evaluation of deep neural network models for music classification using spectrograms. Multimed Tools Appl (2021).

Keywords


  • DNN models
  • Deep learning
  • Transfer learning
  • Music classification
  • Spectrograms