Convolutional Neural Networks in Speech Emotion Recognition – Time-Domain and Spectrogram-Based Approach

  • Conference paper
  • Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1011)
  • Included in the conference series: Information Technology in Biomedicine (ITIB 2019)


Abstract

In this work a convolutional neural network is applied to the classification of emotional speech. Two significantly different approaches to speech-signal pre-processing are compared: a traditional one, based on the frequency spectrum, and a time-domain one. In the first case, a mel-scale spectrogram of the sound signal is computed and used as a 2-dimensional input to the network, as in image recognition tasks. In the second approach, the raw sound signal in the time domain is fed to the network. Despite the radically different form and content of the input data, the neural architectures are similar, with 2D convolutional layers in the first approach and 1D convolutional layers in the second, and identical fully-connected output layers in both. We put emphasis on using practically the same number of trainable parameters in both networks, as well as the same size of input-signal snippets for training. The obtained results show that, under this setting, the frequency-based approach offers very little advantage over direct application of the raw sound signal. In both cases, the total accuracy of whole-file classification exceeded 93% for a dataset with three emotion types.
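
To make the comparison concrete, the sketch below contrasts the two pipelines described in the abstract: a 2D CNN fed with a mel-scale spectrogram and a 1D CNN fed with the raw waveform, both ending in an identical fully-connected head. It is a minimal PyTorch/librosa sketch; the layer sizes, kernel widths, sampling rate, mel-band count and input file name are assumptions for illustration, not the configuration reported in the paper.

    # A minimal PyTorch/librosa sketch of the two compared pipelines.
    # Layer sizes, kernel widths, the sampling rate, the mel-band count and the
    # input file name are illustrative assumptions, not the paper's configuration.
    import librosa
    import torch
    import torch.nn as nn


    def make_head(n_classes: int = 3) -> nn.Sequential:
        """Identical fully-connected output layers, shared by both approaches."""
        return nn.Sequential(nn.Flatten(),
                             nn.Linear(32, 64), nn.ReLU(),
                             nn.Linear(64, n_classes))


    class SpectrogramCNN(nn.Module):
        """Approach 1: 2D convolutions over a mel-scale spectrogram (1 x n_mels x n_frames)."""

        def __init__(self, n_classes: int = 3):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),   # global pooling -> 32 features per snippet
            )
            self.head = make_head(n_classes)

        def forward(self, x):
            return self.head(self.features(x))


    class WaveformCNN(nn.Module):
        """Approach 2: 1D convolutions over the raw time-domain signal (1 x n_samples)."""

        def __init__(self, n_classes: int = 3):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(4),
                nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),   # global pooling -> 32 features per snippet
            )
            self.head = make_head(n_classes)

        def forward(self, x):
            return self.head(self.features(x))


    if __name__ == "__main__":
        # Hypothetical one-second snippet at 16 kHz; "speech_sample.wav" is a placeholder.
        y, sr = librosa.load("speech_sample.wav", sr=16000, duration=1.0)
        waveform = torch.from_numpy(y).float().view(1, 1, -1)       # (batch, channel, samples)

        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
        log_mel = librosa.power_to_db(mel)
        spectrogram = torch.from_numpy(log_mel).float().view(1, 1, *log_mel.shape)

        print(SpectrogramCNN()(spectrogram).shape)  # torch.Size([1, 3]) - logits for 3 emotions
        print(WaveformCNN()(waveform).shape)        # torch.Size([1, 3])

Global average pooling before the shared head keeps the fully-connected layers identical in both networks; with the kernel sizes chosen here, the two convolutional feature extractors also have exactly the same number of trainable parameters, mirroring the paper's stated design constraint of matched parameter budgets and snippet sizes.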


Notes

  1. Defined as the proportion of correctly classified segments to the number of all segments, as illustrated by the sketch below.
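
For concreteness, the short sketch below computes this segment-level accuracy from per-segment labels; the function name and label arrays are hypothetical.

    # Segment accuracy: correctly classified segments divided by the number of all segments.
    import numpy as np

    def segment_accuracy(predicted_labels, true_labels):
        predicted_labels = np.asarray(predicted_labels)
        true_labels = np.asarray(true_labels)
        return float(np.mean(predicted_labels == true_labels))

    # Hypothetical per-segment labels for one utterance: 3 of 4 segments classified correctly.
    print(segment_accuracy([0, 1, 2, 1], [0, 1, 1, 1]))  # 0.75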

Author information

Correspondence to Bartłomiej Stasiak.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Stasiak, B., Opałka, S., Szajerman, D., Wojciechowski, A. (2019). Convolutional Neural Networks in Speech Emotion Recognition – Time-Domain and Spectrogram-Based Approach. In: Pietka, E., Badura, P., Kawa, J., Wieclawek, W. (eds) Information Technology in Biomedicine. ITIB 2019. Advances in Intelligent Systems and Computing, vol 1011. Springer, Cham. https://doi.org/10.1007/978-3-030-23762-2_15
