Maximum Echo-State-Likelihood Networks for Emotion Recognition

  • Edmondo Trentin
  • Stefan Scherer
  • Friedhelm Schwenker
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5998)


Emotion recognition is a relevant task in human-computer interaction. Several pattern recognition and machine learning techniques have been applied so far to assign input audio and/or video sequences to specific emotional classes. This paper introduces a novel approach to the problem, which is also suitable for more generic sequence recognition tasks. The approach combines the recurrent reservoir of an echo state network with a connectionist density estimation module. The reservoir encodes each input sequence into a fixed-dimensionality pattern of neuron activations. The density estimator, a constrained radial basis function network, evaluates the likelihood of the echo state given the input. Unsupervised training is carried out within a maximum-likelihood framework. The architecture can then be used to estimate class-conditional probabilities, so that emotion classification is performed within a Bayesian setup. Preliminary experiments on emotion recognition from speech signals in the WaSeP© dataset show that the proposed approach is effective and may outperform state-of-the-art classifiers.
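The pipeline described in the abstract (reservoir encoding, density estimation on the echo state, Bayesian decision) can be illustrated with a minimal sketch. It is not the authors' implementation: the constrained radial basis function network is replaced here by a single diagonal Gaussian per class, whose maximum-likelihood parameters have a closed form, and the reservoir size, spectral radius, function names, and toy data are all assumptions made for illustration.

```python
import numpy as np

def make_reservoir(n_in, n_res, spectral_radius=0.9, seed=0):
    """Random, untrained reservoir; sizes and scaling are illustrative."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
    w = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
    # Rescale so the largest eigenvalue magnitude equals spectral_radius.
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    return w_in, w

def encode(seq, w_in, w):
    """Map a (length, n_in) sequence to the final reservoir state."""
    x = np.zeros(w.shape[0])
    for u in seq:
        x = np.tanh(w_in @ u + w @ x)
    return x

def fit_gaussian(states, floor=1e-3):
    """ML mean and per-dimension variance (stand-in for the RBF estimator)."""
    return states.mean(axis=0), states.var(axis=0) + floor

def log_likelihood(x, mean, var):
    """Log density of a diagonal Gaussian evaluated at state x."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def classify(seq, w_in, w, models, log_priors):
    """Bayes decision: argmax of log p(state | class) + log P(class)."""
    x = encode(seq, w_in, w)
    scores = {c: log_likelihood(x, *models[c]) + log_priors[c] for c in models}
    return max(scores, key=scores.get)

def toy_sequences(offset, n_seq, rng, length=20, n_in=2):
    """Hypothetical stand-in for speech features: noisy constant signals."""
    return [offset + 0.1 * rng.standard_normal((length, n_in))
            for _ in range(n_seq)]

rng = np.random.default_rng(1)
w_in, w = make_reservoir(n_in=2, n_res=20)
train = {"a": toy_sequences(+1.0, 30, rng), "b": toy_sequences(-1.0, 30, rng)}
# One density model per class, fitted on the encoded training sequences.
models = {c: fit_gaussian(np.array([encode(s, w_in, w) for s in seqs]))
          for c, seqs in train.items()}
log_priors = {c: np.log(0.5) for c in train}
```

Calling `classify(seq, w_in, w, models, log_priors)` on a new sequence returns the label with the highest posterior score. Only the density models are fitted; the reservoir weights stay fixed after random initialization, which is the usual echo state network convention.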


Keywords: emotion recognition, echo state network, radial basis functions, maximum likelihood, density estimation



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Edmondo Trentin (1)
  • Stefan Scherer (2)
  • Friedhelm Schwenker (2)

  1. Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Siena, Siena, Italy
  2. Institute of Neural Information Processing, Ulm University, Ulm, Germany
