Abstract
Speech emotion recognition (SER) is an important field of human–computer interaction. Although humans express emotions in many ways, speech is one of the most direct. Extracting as much emotional information as possible from the speech signal is therefore a key technical challenge. To address it, we propose a local frame-level global dynamic attention network (LF-GANet) for extracting emotional information from speech signals. The network consists of two main parts: a local frame-level module (LFM) and a global dynamic attention module (GAM). The LFM extracts rich frame-level emotional features by processing the forward and reversed time series separately, while the GAM extracts global correlations from the speech signal in real time. We conducted experiments on the EMODB and SAVEE datasets; the results show that our method outperforms existing state-of-the-art (SOTA) models in unweighted average recall (UAR) on both datasets, verifying the effectiveness of the model.
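The abstract does not include code, but the two-module design it describes can be sketched concisely. The following is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the LFM is rendered as two separate recurrent encoders over the forward and the reversed frame sequence, and the GAM as self-attention over all frames to capture global correlations. All layer sizes, the GRU/multi-head-attention primitives, the pooling step, and the input feature dimension are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LFGANetSketch(nn.Module):
    """Illustrative sketch of LF-GANet's two-module structure (all sizes hypothetical)."""

    def __init__(self, n_feats=39, hidden=128, n_classes=7):
        super().__init__()
        # LFM: separate encoders for the forward and the reversed frame sequence.
        self.fwd_rnn = nn.GRU(n_feats, hidden, batch_first=True)
        self.rev_rnn = nn.GRU(n_feats, hidden, batch_first=True)
        # GAM: self-attention over frames to model global correlations.
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                                 # x: (batch, frames, n_feats)
        h_fwd, _ = self.fwd_rnn(x)                        # forward time series
        h_rev, _ = self.rev_rnn(torch.flip(x, dims=[1]))  # reversed time series
        h_rev = torch.flip(h_rev, dims=[1])               # re-align with forward time
        h = torch.cat([h_fwd, h_rev], dim=-1)             # frame-level local features
        g, _ = self.attn(h, h, h)                         # global attention over frames
        utt = g.mean(dim=1)                               # pool frames to utterance level
        return self.classifier(utt)

# Example: a batch of 4 utterances, 300 frames of 39-dim features (e.g. MFCCs).
logits = LFGANetSketch()(torch.randn(4, 300, 39))
print(logits.shape)  # torch.Size([4, 7])
```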
Acknowledgements
This work was supported by the Tianjin Science and Technology Planning Project under Grant No. 20JCYBJC00300 and the National Natural Science Foundation of China under Grant No. 62001328.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Dou, S., Han, T., Liu, R., Xia, W., Zhong, H. (2024). LF-GANet: Local Frame-Level Global Dynamic Attention Network for Speech Emotion Recognition. In: Wang, W., Liu, X., Na, Z., Zhang, B. (eds) Communications, Signal Processing, and Systems. CSPS 2023. Lecture Notes in Electrical Engineering, vol 1032. Springer, Singapore. https://doi.org/10.1007/978-981-99-7505-1_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7539-6
Online ISBN: 978-981-99-7505-1
eBook Packages: Engineering, Engineering (R0)