Abstract
Spoken emotion recognition is a multidisciplinary research area that has received increasing attention in recent years. In this paper, restricted Boltzmann machines and deep belief networks are used to classify emotions in speech. The motivation lies in the recent success reported for these techniques in speech processing and speech recognition. The deep classifier is compared with a multilayer perceptron, using spectral and prosodic features. A well-known German emotional speech database is used in the experiments, and two cross-validation methodologies are proposed. Our experimental results show that the deep method achieves an improvement of 8.67% over the baseline in a speaker-independent scheme.
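The building block of the deep belief networks mentioned above is the restricted Boltzmann machine, typically pre-trained with contrastive divergence (CD-1). As a minimal sketch of that idea (not the paper's actual implementation; the layer sizes, learning rate, and toy data are illustrative assumptions), a Bernoulli-Bernoulli RBM trained with one-step contrastive divergence can be written as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli-Bernoulli restricted Boltzmann machine trained with CD-1."""

    def __init__(self, n_visible, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights; zero biases, as is customary.
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.rng = rng

    def train_step(self, v0, lr=0.05):
        # Positive phase: hidden probabilities given the data.
        h0_prob = sigmoid(v0 @ self.W + self.b_h)
        h0 = (self.rng.random(h0_prob.shape) < h0_prob).astype(float)
        # Negative phase: one Gibbs step (this is the "1" in CD-1).
        v1_prob = sigmoid(h0 @ self.W.T + self.b_v)
        h1_prob = sigmoid(v1_prob @ self.W + self.b_h)
        # Update from the difference of data and model correlations.
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n
        self.b_v += lr * (v0 - v1_prob).mean(axis=0)
        self.b_h += lr * (h0_prob - h1_prob).mean(axis=0)
        # Reconstruction error, a common (if rough) progress monitor.
        return float(np.mean((v0 - v1_prob) ** 2))

# Toy binary "feature" data standing in for binarized acoustic features.
rng = np.random.default_rng(1)
X = (rng.random((64, 20)) < 0.3).astype(float)
rbm = RBM(n_visible=20, n_hidden=8)
errors = [rbm.train_step(X) for _ in range(200)]
```

In a DBN, several such layers would be stacked, each trained on the hidden activations of the one below, before supervised fine-tuning of the whole network for the emotion classes.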
© 2014 Springer International Publishing Switzerland
Cite this paper
Albornoz, E.M., Sánchez-Gutiérrez, M., Martinez-Licona, F., Rufiner, H.L., Goddard, J. (2014). Spoken Emotion Recognition Using Deep Learning. In: Bayro-Corrochano, E., Hancock, E. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2014. Lecture Notes in Computer Science, vol 8827. Springer, Cham. https://doi.org/10.1007/978-3-319-12568-8_13
Print ISBN: 978-3-319-12567-1
Online ISBN: 978-3-319-12568-8