Analysis of Emotions Through Speech Using the Combination of Multiple Input Sources with Deep Convolutional and LSTM Networks

  • Cristyan R. Gil Morales
  • Suraj Shinde
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11289)


Understanding the emotions a person expresses in speech is fundamental to better interaction between humans and machines. Many algorithms have been developed to address this problem, and they have been tested on different datasets: some were recorded by actors under ideal recording conditions, while others were collected from people sharing their opinions on video streaming platforms. Deep learning has shown very positive results in recent years, and the model presented here follows this approach. We propose using Fourier transforms as the input to a convolutional neural network and Mel-frequency cepstral coefficients (MFCCs) as the input to an LSTM neural network. We then concatenate the outputs of both networks to obtain a final classification over five emotions. The model is trained on the MOSEI dataset, with data augmentation through time variations and pitch changes. Our model shows significant improvements over state-of-the-art algorithms.
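
The two-branch design described in the abstract can be illustrated with a small NumPy forward pass. This is only a sketch under stated assumptions, not the authors' implementation: the layer sizes, the single 3x3 convolution layer, the random untrained weights, the random array standing in for real MFCC frames, and all function names are hypothetical choices made for illustration. It shows a Fourier-based spectrogram feeding a convolutional branch, a frame sequence feeding an LSTM branch, and the two outputs being concatenated before a five-way softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

def stft_magnitude(signal, n_fft=256, hop=128):
    """Magnitude STFT: Hann-windowed frames followed by a real FFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)

def conv_branch(spec, kernels):
    """One 'valid' 2-D convolution, ReLU, then global max pooling per kernel."""
    kh, kw = kernels.shape[1:]
    patches = np.lib.stride_tricks.sliding_window_view(spec, (kh, kw))
    maps = np.einsum("ijkl,fkl->fij", patches, kernels)
    return np.maximum(maps, 0.0).max(axis=(1, 2))  # (n_kernels,)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_branch(seq, W, U, b):
    """Single-layer LSTM forward pass; returns the final hidden state."""
    hidden = U.shape[1]
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in seq:
        i, f, g, o = np.split(W @ x + U @ h + b, 4)  # gate pre-activations
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(spec, mfcc, p):
    """Concatenate both branch outputs and map them to class probabilities."""
    fused = np.concatenate([conv_branch(spec, p["kernels"]),
                            lstm_branch(mfcc, p["W"], p["U"], p["b"])])
    return softmax(p["W_out"] @ fused + p["b_out"])

# Hypothetical sizes: 8 kernels, 13 coefficients per frame, 16 LSTM units.
N_KERNELS, N_MFCC, HIDDEN, N_EMOTIONS = 8, 13, 16, 5
params = {
    "kernels": 0.1 * rng.standard_normal((N_KERNELS, 3, 3)),
    "W": 0.1 * rng.standard_normal((4 * HIDDEN, N_MFCC)),
    "U": 0.1 * rng.standard_normal((4 * HIDDEN, HIDDEN)),
    "b": np.zeros(4 * HIDDEN),
    "W_out": 0.1 * rng.standard_normal((N_EMOTIONS, N_KERNELS + HIDDEN)),
    "b_out": np.zeros(N_EMOTIONS),
}

signal = rng.standard_normal(4000)               # placeholder audio samples
spec = np.log1p(stft_magnitude(signal))          # log-compressed spectrogram
mfcc = rng.standard_normal((len(spec), N_MFCC))  # stand-in for real MFCC frames
probs = classify(spec, mfcc, params)             # one probability per emotion
```

With untrained weights the probabilities carry no meaning; the point is only the data flow, in which each branch reduces its input to a fixed-length vector so that the two can be concatenated regardless of utterance length.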


Keywords: Speech emotion recognition, CNN, LSTM, Deep learning, MFCC, STFT



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. everis AI Digital Lab, Mexico City, Mexico
