Recognition of emotion in music based on deep convolutional neural network

  • Rajib Sarkar
  • Sombuddha Choudhury
  • Saikat Dutta
  • Aneek Roy
  • Sanjoy Kumar Saha


In music information retrieval, emotion-based classification is an active area of research. Because emotion is a perceptual and subjective concept, the task is challenging, and it is very difficult to design signal-based descriptors that represent emotions. In this work, a deep learning network is proposed, and experiments are carried out on the benchmark datasets Soundtracks, Bi-Modal and MER_taffc. Experiments have also been carried out with a hand-crafted descriptor consisting of different time-domain and spectral features, linear predictive coding and MFCC-based features. Different classifiers, namely neural network, support vector machine and random forest, are tried. Although the combined feature set with a neural network provides an optimal result for the datasets, the performance of such approaches is in general limited. It is difficult to obtain a consistent feature set that works across classifiers and datasets. To avoid the issue of feature design, a deep learning based approach is followed. A convolutional neural network built around VGGNet and a novel post-processing technique are proposed. The proposed methodology provides a substantial improvement in performance on the datasets. Comparison with other reported works on three different datasets also establishes the superiority of the proposed methodology. The improvement in performance has been substantiated by a Z test.
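The abstract does not say which form of Z test was used. As an illustration only, the sketch below assumes a two-proportion Z test on classification accuracy, with the two classifiers evaluated on independent test samples. The function name `two_proportion_z_test` and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-proportion Z test for H0: both classifiers have the same accuracy.

    Assumption (not from the paper): the two accuracies come from
    independent samples of sizes n_a and n_b.
    Returns (z, two_sided_p_value).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Pooled accuracy under the null hypothesis of equal accuracy.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p = two_proportion_z_test(correct_a=90, n_a=100, correct_b=70, n_b=100)
print(f"z = {z:.3f}, p = {p:.5f}")
```

A small p-value would indicate that the difference in accuracy is unlikely under the null hypothesis of equal accuracy. If both classifiers are scored on the same test clips, a paired test such as McNemar's test may be the more suitable choice.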


Music emotion recognition; Convolutional neural network; Deep learning; Audio features


Compliance with Ethical Standards

Conflict of interests

The authors declare that they have no conflict of interest.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Computer Science and Engg., Jadavpur University, Kolkata, India
  2. Computer Science Department, Derozio Memorial College, Kolkata, India
