Multimodal speech emotion recognition and classification using convolutional neural network techniques

Christy, A.; Vaithyasubramanian, S.; Jesudoss, A.; Praveena, M. D. Anto

doi:10.1007/s10772-020-09713-y

Published: 04 June 2020

International Journal of Speech Technology, Volume 23, pages 381–388 (2020)

A. Christy ORCID: orcid.org/0000-0003-4363-9069¹,
S. Vaithyasubramanian²,
A. Jesudoss¹ &
…
M. D. Anto Praveena¹

1535 Accesses
42 Citations

Abstract

Emotion recognition plays a vital role in day-to-day interpersonal interactions. Understanding a person's feelings from their speech can help shape social interactions. A person's emotion can be identified from the tone and pitch of their voice. The acoustic speech signal is split into short frames, a fast Fourier transform is applied, and relevant features are extracted using mel-frequency cepstral coefficients (MFCC) and modulation spectral (MS) features. In this paper, algorithms such as linear regression, decision tree, random forest, support vector machine (SVM) and convolutional neural networks (CNN) are used for classification and prediction once relevant features are selected from the speech signals. Human emotions such as neutral, calm, happy, sad, fearful, disgust and surprise are classified using decision tree, random forest, SVM and CNN. We tested our model on the RAVDESS dataset, and the CNN showed 78.20% accuracy in recognizing emotions, compared with decision tree, random forest and SVM.
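
The feature-extraction steps named in the abstract (framing, fast Fourier transform, MFCC) can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: the frame length, hop size, Hamming window, number of mel filters and number of coefficients are all assumed values, the function names are hypothetical, and the modulation spectral features and the classifiers are not covered.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames and apply a Hamming window."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    return [[signal[start + n] * window[n] for n in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(i * top / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges_hz]
    bank = []
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        filt = [0.0] * n_bins
        for k in range(left, center):      # rising edge of the triangle
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank


def mfcc(signal, sample_rate, frame_len=256, hop=128, n_filters=20, n_coeffs=13):
    """Return one list of n_coeffs cepstral coefficients per frame.

    All default parameter values are illustrative assumptions.
    """
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for frame in frame_signal(signal, frame_len, hop):
        spectrum = fft(frame)[: frame_len // 2 + 1]
        power = [abs(c) ** 2 / frame_len for c in spectrum]
        # Log mel-band energies, floored to avoid log(0)
        log_energy = [math.log(max(sum(w * p for w, p in zip(filt, power)), 1e-10))
                      for filt in bank]
        # DCT-II decorrelates the log filterbank energies
        features.append([
            sum(e * math.cos(math.pi * q * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energy))
            for q in range(n_coeffs)
        ])
    return features


# Synthetic 440 Hz tone standing in for a speech recording
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(1024)]
feats = mfcc(tone, sample_rate=8000)
print(len(feats), len(feats[0]))
```

Each row of `feats` is a per-frame feature vector of the kind that could then be passed to a classifier. In practice an audio library would normally replace the hand-written FFT and filterbank; the pure-Python version is used here only to keep the sketch self-contained.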

Author information

Authors and Affiliations

Faculty of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai, 600 119, India
A. Christy, A. Jesudoss & M. D. Anto Praveena
Department of Mathematics, Sathyabama Institute of Science and Technology, Chennai, 600 119, India
S. Vaithyasubramanian

Corresponding author

Correspondence to A. Christy.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Christy, A., Vaithyasubramanian, S., Jesudoss, A. et al. Multimodal speech emotion recognition and classification using convolutional neural network techniques. Int J Speech Technol 23, 381–388 (2020). https://doi.org/10.1007/s10772-020-09713-y

Received: 27 December 2019
Accepted: 29 April 2020
Published: 04 June 2020
Issue Date: June 2020
DOI: https://doi.org/10.1007/s10772-020-09713-y
