Emotion recognition plays a vital role in day-to-day interpersonal human interactions. Understanding a person's feelings from their speech can do much to shape social interactions, since a person's emotion can be identified from the tone and pitch of their voice. The acoustic speech signal is split into short frames, a fast Fourier transform is applied to each frame, and relevant features are extracted as mel-frequency cepstral coefficients (MFCC) and modulation spectral (MS) features. In this paper, once relevant features have been selected from the speech signals, algorithms such as linear regression, decision tree, random forest, support vector machine (SVM), and convolutional neural networks (CNN) are used for classification and prediction. Human emotions such as neutral, calm, happy, sad, fearful, disgust, and surprise are classified using decision tree, random forest, SVM, and CNN classifiers. We tested our models on the RAVDESS dataset; the CNN achieved 78.20% accuracy in recognizing emotions, outperforming the decision tree, random forest, and SVM.
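The paper itself does not include code, but the pipeline the abstract describes (frame the signal, take per-frame FFTs, extract MFCCs, then train a classifier) can be sketched briefly. The sketch below assumes librosa for feature extraction and scikit-learn for a classical classifier; the libraries, the synthetic data, and all parameter values (sample rate, n_mfcc=13, kernel choice) are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch of the pipeline described above: frame the signal,
# take per-frame FFTs, derive MFCC features, and train a classifier.
# Library choices (librosa, scikit-learn) and every parameter value
# here are illustrative assumptions, not the authors' actual setup.
import numpy as np
import librosa
from sklearn.svm import SVC

def utterance_features(signal, sr=22050, n_mfcc=13):
    # librosa.feature.mfcc internally windows the signal into short
    # frames, applies an FFT per frame, maps the power spectrum onto
    # the mel scale, and returns an (n_mfcc, n_frames) matrix.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Average over frames to get one fixed-length vector per clip.
    return mfcc.mean(axis=1)

# Synthetic stand-in data; in practice each signal would be one
# RAVDESS utterance and each label one of the seven emotion classes.
rng = np.random.default_rng(0)
signals = [rng.standard_normal(22050) for _ in range(40)]  # 1 s clips
labels = rng.integers(0, 7, size=40)

X = np.stack([utterance_features(s) for s in signals])
clf = SVC(kernel="rbf").fit(X, labels)  # a random forest would also fit here
print(clf.predict(X[:5]))
```

For the CNN variant, the full (n_mfcc, n_frames) matrix would typically be kept as a 2-D input rather than averaged over frames; the architecture of the network used in the paper is not reproduced here.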
Christy, A., Vaithyasubramanian, S., Jesudoss, A. et al. Multimodal speech emotion recognition and classification using convolutional neural network techniques. Int J Speech Technol 23, 381–388 (2020). https://doi.org/10.1007/s10772-020-09713-y
Keywords: Speech emotion recognition, Feature extraction