Voice and Speech Analysis in Search of States and Traits

  • Björn Schuller


This chapter gives a general overview of the principles of voice and speech analysis for the automatic assessment of human communicative behaviour. A short motivation and introduction of the tasks, ranging from speaker states such as emotions to personality traits, is followed by the two parts ‘voice’ and ‘speech’—each consisting of a taxonomy of feature extraction and frequently encountered methods for classification and regression, with excursions on chunking of the audio stream, the selection and extraction of relevant features, and the assessment of (non-)linguistic information. Finally, popular databases with corresponding benchmarks are introduced, and recent application examples are given, focusing on human-to-human and human-to-agent conversation.
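The pipeline outlined above—extracting frame-level acoustic descriptors, pooling them into a fixed-length vector with statistical functionals, and feeding that vector to a classifier—can be sketched in a few lines. The following is an illustrative toy example, not the chapter's actual method: it uses only log-energy and zero-crossing rate as descriptors and a nearest-centroid classifier standing in for the support vector machines typically used in this field; all function names are hypothetical.

```python
# Toy sketch of a paralinguistic analysis pipeline:
# frames -> low-level descriptors -> functionals -> classification.
import math

def frame_features(samples, frame_len=160, hop=80):
    """Compute per-frame log-energy and zero-crossing rate (two simple
    low-level descriptors) over a list of audio samples."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / frame_len
        feats.append((math.log(energy + 1e-10), zcr))
    return feats

def functionals(feats):
    """Pool a variable-length frame sequence into a fixed-length vector
    by taking the mean and standard deviation of each descriptor."""
    vec = []
    for col in zip(*feats):
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        vec.extend((mean, math.sqrt(var)))
    return vec

def nearest_centroid(train, query):
    """Classify `query` by the nearest class centroid.
    `train` maps a label to a list of feature vectors."""
    best, best_d = None, float("inf")
    for label, vecs in train.items():
        centroid = [sum(c) / len(c) for c in zip(*vecs)]
        d = math.dist(centroid, query)
        if d < best_d:
            best, best_d = label, d
    return best
```

In practice the descriptor set is far larger (pitch, MFCCs, voice-quality measures) and the functional pooling spans dozens of statistics, but the chunk-describe-pool-classify structure stays the same.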





Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  1. Institute for Human–Machine Communication, Technische Universität München, Munich, Germany
