Deep Neural Network (DNN) models have recently received considerable attention because their deep network structures can extract rich features, improving classification accuracy and achieving excellent results in the image domain. However, because music and images differ in content and form, transferring deep learning to music classification remains challenging. To address this issue, we transfer state-of-the-art DNN models to music classification and evaluate their performance using spectrograms. First, we convert music audio files into spectrograms via modal transformation, and then classify the music with deep learning. To alleviate overfitting during training, we propose a balanced trusted loss function and build the balanced trusted model ResNet50_trust. Finally, we compare the performance of different DNN models on music classification. In addition, this work includes music sentiment analysis based on a newly constructed music emotion dataset. Extensive experimental evaluations on three music datasets show that our proposed model, ResNet50_trust, consistently outperforms the other DNN models.
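The modal transformation described above (audio waveform to spectrogram image) can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: a synthetic 440 Hz tone stands in for a real music clip (in practice the waveform would be loaded from an audio file, e.g. with librosa or soundfile), and the window parameters are assumptions.

```python
# Hedged sketch of the audio-to-spectrogram conversion step.
# A 2-second synthetic tone replaces a real music clip; window
# length (nperseg) and hop size (noverlap) are illustrative choices.
import numpy as np
from scipy.signal import spectrogram

sr = 22050                                   # sample rate in Hz
t = np.linspace(0, 2.0, int(sr * 2.0), endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 440 * t)    # 2 s of an A4 tone

# Short-time Fourier analysis: rows are frequency bins, columns are time frames
freqs, times, power = spectrogram(audio, fs=sr, nperseg=1024, noverlap=512)
log_spec = 10 * np.log10(power + 1e-10)      # log (dB) scale, as typically fed to a CNN

print(log_spec.shape)  # (frequency bins, time frames) = (513, 85) here
```

The resulting 2-D log-power array can be saved as an image and treated like any other input to an image-classification DNN, which is what makes transferring image models to music classification possible.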
This work was supported in part by the Natural Science Foundation of the Colleges and Universities in Anhui Province of China under Grant No.KJ2020A0035; and in part by the Scientific Research Project of Hebei Education Department of China under Grant No.QN2020198.
Cite this article
Li, J., Han, L., Li, X. et al. An evaluation of deep neural network models for music classification using spectrograms. Multimed Tools Appl (2021). https://doi.org/10.1007/s11042-020-10465-9
- DNN models
- Deep learning
- Transfer learning
- Music classification