Multimodal speech emotion recognition and classification using convolutional neural network techniques


Emotion recognition plays a vital role in day-to-day interpersonal human interactions. Understanding a person's feelings from his or her speech can reveal much about how social interactions take shape, and a person's emotion can be identified from the tone and pitch of the voice. The acoustic speech signal is split into short frames, a fast Fourier transform is applied, and relevant features are extracted as mel-frequency cepstral coefficients (MFCC) and modulation spectral (MS) features. In this paper, algorithms such as linear regression, decision tree, random forest, support vector machine (SVM) and convolutional neural network (CNN) are used for classification and prediction once relevant features have been selected from the speech signals. Human emotions such as neutral, calm, happy, sad, fearful, disgust and surprise are classified. We tested our model on the RAVDESS dataset, and the CNN achieved 78.20% accuracy in recognizing emotions, outperforming the decision tree, random forest and SVM.
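The feature pipeline described above (short frames, fast Fourier transform, mel-frequency cepstral coefficients) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frame length (25 ms), hop (10 ms), sample rate (16 kHz), FFT size, and filter counts are assumed standard values that the abstract does not specify.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_coeffs=13):
    """MFCC sketch: frame -> window -> FFT -> mel filterbank -> log -> DCT.
    All parameter defaults are illustrative assumptions, not from the paper."""
    # Split the acoustic signal into short overlapping frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)           # taper frame edges

    # Fast Fourier transform -> power spectrum per frame
    n_fft = 512
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2    # shape (n_frames, n_fft//2 + 1)

    # Triangular mel filterbank spanning 0 .. Nyquist
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising slope
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling slope

    # Log mel energies, then DCT-II to decorrelate into cepstral coefficients
    logmel = np.log(spec @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T                             # shape (n_frames, n_coeffs)
```

The resulting per-frame coefficient matrix is the kind of feature representation that would then be fed to the classifiers compared in the paper (decision tree, random forest, SVM, CNN).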





Author information



Corresponding author

Correspondence to A. Christy.



About this article


Cite this article

Christy, A., Vaithyasubramanian, S., Jesudoss, A. et al. Multimodal speech emotion recognition and classification using convolutional neural network techniques. Int J Speech Technol 23, 381–388 (2020).



Keywords

  • Speech emotion recognition
  • Feature extraction
  • Classification
  • SVM
  • CNN
  • Accuracy