Part of the book series: Springer Theses

Abstract

A central aim of this thesis was to define standard acoustic feature sets for both speech and music that contain a large and comprehensive collection of acoustic descriptors. Based on previous efforts to combine features and on the author's experience from evaluations across several databases and tasks, 12 standard acoustic parameter sets have been proposed and thoroughly evaluated for this thesis. These sets include the acoustic baseline feature sets of the INTERSPEECH challenges on Emotion and Paralinguistics from 2009 to 2013 (ComParE) as well as of the Audio-Visual Emotion Challenges (2011–2013). Further, two sets for music processing and two minimalistic speech parameter sets (GeMAPS and eGeMAPS) are proposed.
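Most of these parameter sets are distributed as openSMILE configuration files (cf. the notes below). As a minimal sketch of what extraction looks like in practice, the following snippet uses the opensmile Python wrapper to compute eGeMAPS functionals for a single file; the wrapper package, its FeatureSet/FeatureLevel enum values, and the input file name are assumptions, not details taken from this chapter.

    # Minimal sketch, assuming the `opensmile` Python wrapper around the
    # openSMILE toolkit; enum values and the file name are assumptions.
    import opensmile

    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,       # minimalistic set
        feature_level=opensmile.FeatureLevel.Functionals,  # one vector per file
    )

    features = smile.process_file("speech_sample.wav")
    print(features.shape)  # e.g. (1, 88): the 88 eGeMAPS functionals

The same call pattern covers the larger brute-force sets by selecting a different feature-set constant (e.g. a ComParE set), where the installed version provides one.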


Notes

  1.

    An article is to appear in IEEE Transactions on Affective Computing (Eyben et al. 2015).

  2.

    According to Schuller et al. (2011a)—and the openSMILE configuration file—the IS11 set contains 4,368 features in total. This is also the size of the baseline feature vectors provided for the challenge. However, the duration of the segment is counted twice there, due to the way it was implemented in the openSMILE configuration file. Thus, the correct number of unique features in IS11 is 4,367 (see the first sketch after these notes).

  3.

    The RTF for the rhythmic features was not evaluated, as they are not implemented in the openSMILE framework (see the second sketch after these notes for how the RTF is computed).
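One way to sanity-check a feature count such as the IS11 figure in Note 2 is to parse the @attribute header that openSMILE writes to its ARFF output. The sketch below does so for a hypothetical file name; note that a feature duplicated under two different attribute names, as with the IS11 segment duration, is invisible to a pure name count and needs a semantic check.

    # Minimal sketch: count feature names in the @attribute header of an
    # openSMILE ARFF output. The file name is hypothetical; a duplicate
    # emitted under two distinct names will not be flagged here.
    from collections import Counter

    def feature_name_counts(arff_path):
        names = []
        with open(arff_path) as f:
            for line in f:
                if line.lower().startswith("@attribute"):
                    names.append(line.split()[1])  # second token is the name
        return Counter(names)

    counts = feature_name_counts("is11_baseline.arff")
    print(len(counts))                                 # number of distinct names
    print({n: c for n, c in counts.items() if c > 1})  # exact duplicates, if any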
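The real-time factor (RTF) referred to in Note 3 is wall-clock extraction time divided by the duration of the processed audio. A minimal sketch of such a measurement, with a hypothetical extractor callable and WAV input, follows.

    # Minimal sketch of a real-time-factor (RTF) measurement: wall-clock
    # extraction time divided by the duration of the processed audio.
    # `extract` and the file name are hypothetical placeholders.
    import time
    import wave

    def real_time_factor(extract, wav_path):
        with wave.open(wav_path) as w:
            audio_seconds = w.getnframes() / float(w.getframerate())
        start = time.perf_counter()
        extract(wav_path)  # run the feature extractor under test
        return (time.perf_counter() - start) / audio_seconds  # < 1: faster than real time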

References

  • R. Banse, K.R. Scherer, Acoustic profiles in vocal emotion expression. J. Personal. Soc. Psychol. 70(3), 614–636 (1996)


  • A. Batliner, J. Buckow, R. Huber, V. Warnke, E. Nöth, H. Niemann, Prosodic Feature Evaluation: Brute Force or Well Designed? In Proceedings of the 14th ICPhS, vol 3, San Francisco, CA, USA, pp. 2315–2318 (1999)


  • A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, N. Amir, Whodunnit—Searching for the most important feature types signalling emotional user states in speech. Comput. Speech Lang. 25(1), 4–28 (2011)


  • A. Batliner, B. Möbius, Prosodic models, automatic speech understanding, and speech synthesis: towards the common ground?, in The Integration of Phonetic Knowledge in Speech Technology, ed. by W. Barry, W. Dommelen (Springer, Dordrecht, 2005), pp. 21–44


  • F. Eyben, K. Scherer, B. Schuller, J. Sundberg, E. André, C. Busso, L. Devillers, J. Epps, P. Laukka, S. Narayanan, K. Truong, The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Trans. Affect. Comput. doi:10.1109/TAFFC.2015.2457417


  • F. Eyben, B. Schuller, Music Classification with the Munich openSMILE Toolkit. In Proceedings of the Annual Meeting of the MIREX 2010 community as part of the 11th International Conference on Music Information Retrieval (ISMIR), Utrecht, The Netherlands, August 2010. ISMIR. http://www.music-ir.org/mirex/abstracts/2010/FE1.pdf

  • P.N. Juslin, P. Laukka, Communication of emotions in vocal expression and music performance: Different channels, same code? Psychol. Bull. 129(5), 770–814 (2003)


  • E. Marchi, A. Batliner, B. Schuller, S. Fridenzon, S. Tal, O. Golan, Speech, Emotion, Age, Language, Task, and Typicality: Trying to Disentangle Performance and Feature Relevance. In Proceedings of the First International Workshop on Wide Spectrum Social Signal Processing (WS³P 2012), held in conjunction with the ASE/IEEE International Conference on Social Computing (SocialCom 2012), IEEE Computer Society, pp. 961–968, Amsterdam, The Netherlands, September 2012


  • M. Müller, F. Kurth, M. Clausen, Audio matching via chroma-based statistical features. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR), pp. 288–295, London, UK (2005)


  • S. Patel, K.R. Scherer, Vocal behaviour, in Handbook of Nonverbal Communication, ed. by J.A. Hall, M.L. Knapp (Mouton-DeGruyter, Berlin, 2013), pp. 167–204


  • A. Sadeghi Naini, M. Homayounpour, Speaker age interval and sex identification based on jitters, shimmers and mean MFCC using supervised and unsupervised discriminative classification methods. In Proceedings of the 8th International Conference on Signal Processing (ICSP), vol. 1, Beijing, China, 2006. doi:10.1109/ICOSP.2006.345516

  • K.R. Scherer, Vocal affect expression: A review and a model for future research. Psychol. Bull. 99, 143–165 (1986)


  • M. Schröder, Speech and Emotion Research: An Overview of Research Frameworks and a Dimensional Approach to Emotional Speech Synthesis, volume PHONUS 7 of Research Report of the Institute of Phonetics, Saarland University. Ph.D. thesis, Institute for Phonetics, University of Saarbrücken, 2004


  • M. Schröder, F. Burkhardt, S. Krstulovic, Synthesis of emotional speech, in Blueprint for Affective Computing, ed. by K.R. Scherer, T. Bänziger, E. Roesch (Oxford University Press, Oxford, 2010), pp. 222–231


  • B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, V. Aharonson, The Relevance of Feature Type for the Automatic Classification of Emotional User States: Low Level Descriptors and Functionals. In Proceedings of INTERSPEECH 2007, ISCA, pp. 2253–2256, Antwerp, Belgium, August 2007a


  • B. Schuller, F. Eyben, G. Rigoll, Fast and Robust Meter and Tempo Recognition for the Automatic Discrimination of Ballroom Dance Styles. In Proceedings of the ICASSP 2007, IEEE, vol. I, pp. 217–220, Honolulu, HI, USA, April 2007b


  • B. Schuller, A. Batliner, S. Steidl, F. Schiel, J. Krajewski, The INTERSPEECH 2011 Speaker State Challenge. In Proceedings of INTERSPEECH 2011, ISCA, Florence, Italy, pp. 3201–3204, August 2011a


  • B. Schuller, M. Valstar, F. Eyben, G. McKeown, R. Cowie, M. Pantic, AVEC 2011—The First International Audio/Visual Emotion Challenge, in Proceedings of the First International Audio/Visual Emotion Challenge and Workshop, AVEC 2011, held in conjunction with the International HUMAINE Association Conference on Affective Computing and Intelligent Interaction (ACII) 2011, vol. II, ed. by B. Schuller, M. Valstar, R. Cowie, M. Pantic (Springer, Memphis, TN, USA, October 2011b), pp. 415–424


  • B. Schuller, G. Rigoll, Recognising Interest in Conversational Speech—Comparing Bag of Frames and Supra-segmental Features. In Proceedings of INTERSPEECH 2009, ISCA, pp. 1999–2002, Brighton, UK, September 2009


  • B. Schuller, G. Rigoll, M. Lang, Hidden Markov Model-based Speech Emotion Recognition. In Proceedings of the ICASSP 2003, IEEE, vol. 2, pp. II 1–4, Hong Kong, China, April 2003


  • B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, S. Narayanan, The INTERSPEECH 2010 Paralinguistic Challenge. In Proceedings of INTERSPEECH 2010, ISCA, Makuhari, Japan, pp. 2794–2797, September 2010


  • B. Schuller, S. Steidl, A. Batliner, J. Epps, F. Eyben, F. Ringeval, E. Marchi, Y. Zhang, The INTERSPEECH 2014 Computational Paralinguistics Challenge: Cognitive and Physical Load. In Proceedings of INTERSPEECH 2014, ISCA, Singapore, 2014 (to appear)


  • B. Schuller, S. Steidl, A. Batliner, F. Jurcicek, The INTERSPEECH 2009 Emotion Challenge. In Proceedings of INTERSPEECH 2009, ISCA, Brighton, UK, pp. 312–315, September 2009


  • B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. van Son, F. Weninger, F. Eyben, T. Bocklet, G. Mohammadi, B. Weiss, The INTERSPEECH 2012 Speaker Trait Challenge. In Proceedings of INTERSPEECH 2012, ISCA, Portland, OR, USA, September 2012a


  • B. Schuller, M. Valstar, R. Cowie, M. Pantic, AVEC 2012: the continuous audio/visual emotion challenge—an introduction, in Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI) 2012, ed. by L.-P. Morency, D. Bohus, H.K. Aghajan, J. Cassell, A. Nijholt, J. Epps (ACM, Santa Monica, CA, USA, October 2012b), pp. 361–362


  • B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, et al., The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism. In Proceedings of INTERSPEECH 2013, ISCA, Lyon, France, pp. 148–152, 2013


  • B. Schuller, A. Batliner, Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing (Wiley, Hoboken, 2013), p. 344. ISBN 978-1119971368


  • J. Sundberg, S. Patel, E. Bjorkner, K.R. Scherer, Interdependencies among voice source parameters in emotional speech. IEEE Trans. Affect. Comput. 2(3), 162–174 (2011). doi:10.1109/T-AFFC.2011.14. ISSN 1949-3045


  • M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, M. Pantic, AVEC 2013—The Continuous Audio/Visual Emotion and Depression Recognition Challenge. In Proceedings of the ACM Multimedia 2013, ACM, Barcelona, Spain, October 2013


  • D. Ververidis, C. Kotropoulos, Emotional speech recognition: Resources, features, and methods. Speech Commun. 48(9), 1162–1181 (2006)


  • F. Weninger, F. Eyben, B.W. Schuller, M. Mortillaro, K.R. Scherer, On the Acoustics of Emotion in Audio: What Speech, Music and Sound have in Common. Front. Psychol. 4(Article ID 292), 1–12 (2013b). doi:10.3389/fpsyg.2013.00292


Author information

Correspondence to Florian Eyben.


Copyright information

© 2016 Springer International Publishing Switzerland

Cite this chapter

Eyben, F. (2016). Standard Baseline Feature Sets. In: Real-time Speech and Music Classification by Large Audio Feature Space Extraction. Springer Theses. Springer, Cham. https://doi.org/10.1007/978-3-319-27299-3_3

  • DOI: https://doi.org/10.1007/978-3-319-27299-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27298-6

  • Online ISBN: 978-3-319-27299-3
