Convolutional Neural Network with Spectrogram and Perceptual Features for Speech Emotion Recognition

  • Linjuan Zhang
  • Longbiao Wang
  • Jianwu Dang
  • Lili Guo
  • Haotian Guan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11304)


Convolutional neural networks (CNNs) have demonstrated great power at mining deep information from spectrograms for speech emotion recognition. However, perceptual features such as low-level descriptors (LLDs) and their statistical values have not been utilized sufficiently in CNN-based emotion recognition. To address this problem, we propose novel features that combine the spectrogram and perceptual features at different levels. First, frame-level LLDs are arranged as time-sequence LLDs. Then, the spectrogram and time-sequence LLDs are fused into compositional spectrographic features (CSF). To fully utilize perceptual features and global information, statistical values of the LLDs are added to CSF to generate rich-compositional spectrographic features (RSF). Finally, the proposed features are individually fed to a CNN to extract deep features for emotion recognition. Bi-directional long short-term memory was employed to identify emotions, and experiments were conducted on EmoDB. Compared with the spectrogram alone, CSF and RSF improve the unweighted accuracy by relative error reductions of 32.04% and 36.91%, respectively.
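The abstract describes fusing a per-frame spectrogram with time-sequence LLDs (CSF), then appending utterance-level LLD statistics (RSF). The paper's exact LLD set, statistics, and fusion layout are not specified here; the following is only a rough sketch of the idea under simple assumptions: features are frame-aligned matrices of shape (frames, dims), fusion is channel-wise concatenation per frame, and the global statistics are the LLD mean and standard deviation broadcast to every frame.

```python
import numpy as np

def build_csf(spectrogram, llds):
    """Compositional spectrographic features (sketch):
    concatenate the spectrogram (T x F) with frame-aligned
    time-sequence LLDs (T x D) along the feature axis."""
    assert spectrogram.shape[0] == llds.shape[0], "frames must align"
    return np.concatenate([spectrogram, llds], axis=1)  # T x (F + D)

def build_rsf(spectrogram, llds):
    """Rich-compositional spectrographic features (sketch):
    CSF plus per-utterance LLD statistics (here mean and std,
    an assumption) tiled onto every frame for global information."""
    csf = build_csf(spectrogram, llds)
    stats = np.concatenate([llds.mean(axis=0), llds.std(axis=0)])  # length 2D
    stats_tiled = np.tile(stats, (csf.shape[0], 1))                # T x 2D
    return np.concatenate([csf, stats_tiled], axis=1)              # T x (F + 3D)
```

The resulting 2-D feature maps can then be fed to a CNN in place of the plain spectrogram, as the abstract describes.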


Speech emotion recognition · Spectrogram · Perceptual features · Convolutional neural network · Bi-directional long short-term memory



The research was supported by the National Natural Science Foundation of China (No. 61771333 and No. U1736219) and JSPS KAKENHI Grant (16K00297).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China
  2. Japan Advanced Institute of Science and Technology, Ishikawa, Japan
  3. Intelligent Spoken Language Technology (Tianjin) Co., Ltd., Tianjin, China
