A Segment Level Approach to Speech Emotion Recognition Using Transfer Learning

  • Sourav Sahoo
  • Puneet Kumar
  • Balasubramanian Raman
  • Partha Pratim Roy
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12047)


Speech emotion recognition (SER) is a non-trivial task, considering that the very definition of emotion is ambiguous. In this paper, we propose a speech emotion recognition system that predicts emotions for multiple segments of a single audio clip, unlike conventional emotion recognition models that predict the emotion of an entire audio clip directly. The proposed system consists of a pre-trained deep convolutional neural network (CNN) followed by a single-layer neural network that predicts the emotion classes of the audio segments. The predictions for the individual segments are finally combined to predict the emotion of the whole clip. We define several new types of accuracies for evaluating the performance of the proposed model. The proposed model attains an accuracy of 68.7%, surpassing the current state-of-the-art models in classifying the data into one of four emotional classes (angry, happy, sad and neutral) when trained and evaluated on the audio-only portion of the IEMOCAP dataset.
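The segment-level pipeline described in the abstract (per-segment classification followed by clip-level combination) can be sketched as follows. The four emotion labels come from the abstract, but the function name and the averaging rule are illustrative assumptions: the abstract only states that segment predictions are "combined", without specifying the combination rule.

```python
# Sketch of clip-level emotion prediction from per-segment outputs.
# Assumption: each segment classifier emits a probability vector over
# the four classes; averaging segment posteriors is one plausible
# combination rule (the paper's exact rule is not given in the abstract).

EMOTIONS = ["angry", "happy", "sad", "neutral"]

def clip_emotion_from_segments(segment_probs):
    """Combine per-segment class probabilities into one clip-level label.

    segment_probs: list of probability vectors, one per segment,
    each of length len(EMOTIONS).
    """
    n_classes = len(EMOTIONS)
    n_segments = len(segment_probs)
    # Average the posterior for each class across all segments.
    avg = [sum(p[c] for p in segment_probs) / n_segments
           for c in range(n_classes)]
    # Predict the class with the highest averaged posterior.
    return EMOTIONS[max(range(n_classes), key=avg.__getitem__)]
```

Majority voting over per-segment hard labels would be an equally plausible alternative; averaging posteriors simply keeps more information from low-confidence segments.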


Emotion recognition · Affective computing · Deep learning · Mel spectrograms · Computational paralinguistics

Supplementary material

Supplementary material 1: 488101_1_En_34_MOESM1_ESM.pdf (149 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai, India
  2. Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, India
