Convolutional Neural Networks in Speech Emotion Recognition – Time-Domain and Spectrogram-Based Approach

  • Conference paper
  • Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1011)
  • Included in the conference series: Information Technology in Biomedicine (ITIB 2019)


Abstract

In this work a convolutional neural network is applied to the classification of emotional speech. Two significantly different approaches to speech-signal pre-processing are compared: a traditional one, based on the frequency spectrum, and a time-domain one. In the first case, a mel-scale spectrogram of the sound signal is computed and used as a 2-dimensional input to the network, as in image recognition tasks. In the second approach, the raw sound signal in the time domain is fed to the network. Despite the radically different form and content of the input data, the neural architectures are similar, with 2D convolutional layers in the first approach and 1D convolutional layers in the second, and identical fully-connected output layers in both. We put emphasis on using practically the same number of trainable parameters in both networks, as well as the same size of input-signal snippets for training. The obtained results show that, under this setting, the frequency-based approach offers very little advantage over direct application of the raw sound signal. In both cases, the total accuracy of whole-file classification exceeded 93% for a dataset with three emotion types.
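
To make the comparison concrete, the sketch below contrasts the two pipelines described in the abstract: a 2D CNN fed with a mel-scale spectrogram and a 1D CNN fed with the raw waveform, both ending in an identical fully-connected head. It is a minimal PyTorch/librosa sketch; the layer sizes, kernel widths, sampling rate, mel-band count and input file name are assumptions for illustration, not the configuration reported in the paper.

    # A minimal PyTorch/librosa sketch of the two compared pipelines.
    # Layer sizes, kernel widths, the sampling rate, the mel-band count and the
    # input file name are illustrative assumptions, not the paper's configuration.
    import librosa
    import torch
    import torch.nn as nn


    def make_head(n_classes: int = 3) -> nn.Sequential:
        """Identical fully-connected output layers, shared by both approaches."""
        return nn.Sequential(nn.Flatten(),
                             nn.Linear(32, 64), nn.ReLU(),
                             nn.Linear(64, n_classes))


    class SpectrogramCNN(nn.Module):
        """Approach 1: 2D convolutions over a mel-scale spectrogram (1 x n_mels x n_frames)."""

        def __init__(self, n_classes: int = 3):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),   # global pooling -> 32 features per snippet
            )
            self.head = make_head(n_classes)

        def forward(self, x):
            return self.head(self.features(x))


    class WaveformCNN(nn.Module):
        """Approach 2: 1D convolutions over the raw time-domain signal (1 x n_samples)."""

        def __init__(self, n_classes: int = 3):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(4),
                nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),   # global pooling -> 32 features per snippet
            )
            self.head = make_head(n_classes)

        def forward(self, x):
            return self.head(self.features(x))


    if __name__ == "__main__":
        # Hypothetical one-second snippet at 16 kHz; "speech_sample.wav" is a placeholder.
        y, sr = librosa.load("speech_sample.wav", sr=16000, duration=1.0)
        waveform = torch.from_numpy(y).float().view(1, 1, -1)       # (batch, channel, samples)

        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
        log_mel = librosa.power_to_db(mel)
        spectrogram = torch.from_numpy(log_mel).float().view(1, 1, *log_mel.shape)

        print(SpectrogramCNN()(spectrogram).shape)  # torch.Size([1, 3]) - logits for 3 emotions
        print(WaveformCNN()(waveform).shape)        # torch.Size([1, 3])

Global average pooling before the shared head keeps the fully-connected layers identical in both networks; with the kernel sizes chosen here, the two convolutional feature extractors also have exactly the same number of trainable parameters, mirroring the paper's stated design constraint of matched parameter budgets and snippet sizes.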


Notes

  1. Defined as the proportion of correctly classified segments to the number of all segments, as illustrated by the sketch below.
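
For concreteness, the short sketch below computes this segment-level accuracy from per-segment labels; the function name and label arrays are hypothetical.

    # Segment accuracy: correctly classified segments divided by the number of all segments.
    import numpy as np

    def segment_accuracy(predicted_labels, true_labels):
        predicted_labels = np.asarray(predicted_labels)
        true_labels = np.asarray(true_labels)
        return float(np.mean(predicted_labels == true_labels))

    # Hypothetical per-segment labels for one utterance: 3 of 4 segments classified correctly.
    print(segment_accuracy([0, 1, 2, 1], [0, 1, 1, 1]))  # 0.75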

Author information

Correspondence to Bartłomiej Stasiak.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Stasiak, B., Opałka, S., Szajerman, D., Wojciechowski, A. (2019). Convolutional Neural Networks in Speech Emotion Recognition – Time-Domain and Spectrogram-Based Approach. In: Pietka, E., Badura, P., Kawa, J., Wieclawek, W. (eds) Information Technology in Biomedicine. ITIB 2019. Advances in Intelligent Systems and Computing, vol 1011. Springer, Cham. https://doi.org/10.1007/978-3-030-23762-2_15
