Abstract
A central aim of this thesis was to define standard acoustic feature sets for both speech and music that contain a large and comprehensive set of acoustic descriptors. Based on previous efforts to combine features and on the author's experience from evaluations across several databases and tasks, 12 standard acoustic parameter sets have been proposed and thoroughly evaluated for this thesis. These include the acoustic baseline feature sets of the INTERSPEECH challenges on Emotion and Paralinguistics from 2009–2013 (ComParE) as well as those of the Audio-Visual Emotion Challenges (2011–2013). Further, two sets for music processing and two minimalistic speech parameter sets (GeMAPS and eGeMAPS) are proposed.
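All of the parameter sets named above follow the same two-stage paradigm: frame-wise low-level descriptors (LLDs) are computed first, and segment-level statistics (functionals) are then applied over each LLD contour. The following is a minimal, self-contained sketch of that paradigm using a single illustrative LLD (frame RMS energy) and a handful of functionals; the actual sets contain many more LLDs and functionals, and the names and frame parameters here are illustrative, not taken from the openSMILE configurations.

```python
import math

def frame_rms(signal, frame_len=400, hop=160):
    """Frame-wise RMS energy: one example of a low-level descriptor (LLD).

    With a 16 kHz sampling rate, frame_len=400 and hop=160 correspond to
    25 ms frames with a 10 ms shift (common, but illustrative here).
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return frames

def functionals(lld):
    """Segment-level statistics (functionals) applied over an LLD contour."""
    n = len(lld)
    mean = sum(lld) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in lld) / n)
    return {"mean": mean, "stddev": std, "min": min(lld), "max": max(lld)}

# Toy segment: 1 s of a 100 Hz sine tone sampled at 16 kHz.
sig = [math.sin(2 * math.pi * 100 * t / 16000) for t in range(16000)]
feats = functionals(frame_rms(sig))
# For a pure sine, every frame RMS is ~1/sqrt(2), so the stddev is ~0.
```

Each functional applied to each LLD yields one element of the final fixed-length feature vector, which is what makes the large brute-forced sets (thousands of features) possible from a few dozen LLDs.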
Notes
1. An article is to appear in IEEE Transactions on Affective Computing (Eyben et al. 2015).
2. According to Schuller et al. (2011a)—and the openSMILE configuration file—the IS11 set contains 4,368 features in total. This is also the size of the baseline feature vectors provided for the challenge. However, the duration of the segment is counted twice there, due to the way it was implemented in the openSMILE configuration file. Thus, the correct number of unique features in IS11 is 4,367.
3. The RTF for the rhythmic features was not evaluated, as they are not implemented in the openSMILE framework.
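Note 2 above describes how one duplicated descriptor (the segment duration, counted twice) inflated the IS11 feature count from 4,367 unique features to 4,368. A duplicate of this kind can be caught by comparing the length of the raw feature-name list against the size of its unique set; the feature names below are illustrative placeholders, not the actual IS11 names.

```python
# Hypothetical excerpt of a baseline feature-name list in which one
# descriptor ("duration") appears twice, mirroring the IS11 situation.
feature_names = [
    "pcm_loudness_sma_mean",
    "pcm_loudness_sma_stddev",
    "mfcc_sma1_mean",
    "duration",
    "duration",  # emitted a second time by the configuration
]

total = len(feature_names)          # size of the provided feature vector
unique = len(set(feature_names))    # number of distinct descriptors
duplicates = total - unique         # 1 here, as with duration in IS11
```

For a real configuration the same check would be run over the full header of the extracted feature file rather than a hand-written list.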
References
R. Banse, K.R. Scherer, Acoustic profiles in vocal emotion expression. J. Personal. Soc. Psychol. 70(3), 614–636 (1996)
A. Batliner, J. Buckow, R. Huber, V. Warnke, E. Nöth, H. Niemann, Prosodic Feature Evaluation: Brute Force or Well Designed? In Proceedings of the 14th ICPhS, vol 3, San Francisco, CA, USA, pp. 2315–2318 (1999)
A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, N. Amir, Whodunnit—Searching for the most important feature types signalling emotional user states in speech. Comput. Speech Lang. 25(1), 4–28 (2011)
A. Batliner, B. Möbius, Prosodic models, automatic speech understanding, and speech synthesis: towards the common ground?, in The Integration of Phonetic Knowledge in Speech Technology, ed. by W. Barry, W. Dommelen (Springer, Dordrecht, 2005), pp. 21–44
F. Eyben, K. Scherer, B. Schuller, J. Sundberg, E. André, C. Busso, L. Devillers, J. Epps, P. Laukka, S. Narayanan, K. Truong, The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Trans. Affect. Comput. doi:10.1109/TAFFC.2015.2457417
F. Eyben, B. Schuller, Music Classification with the Munich openSMILE Toolkit. In Proceedings of the Annual Meeting of the MIREX 2010 community as part of the 11th International Conference on Music Information Retrieval (ISMIR), Utrecht, The Netherlands, August 2010. ISMIR. http://www.music-ir.org/mirex/abstracts/2010/FE1.pdf
P.N. Juslin, P. Laukka, Communication of emotions in vocal expression and music performance: Different channels, same code? Psychol. Bull. 129(5), 770–814 (2003)
E. Marchi, A. Batliner, B. Schuller, S. Fridenzon, S. Tal, O. Golan, Speech, Emotion, Age, Language, Task, and Typicality: Trying to Disentangle Performance and Feature Relevance. In Proceedings of the First International Workshop on Wide Spectrum Social Signal Processing (WS³P 2012), held in conjunction with the ASE/IEEE International Conference on Social Computing (SocialCom 2012), IEEE Computer Society, pp. 961–968, Amsterdam, The Netherlands, September 2012
M. Müller, F. Kurth, M. Clausen. Audio matching via chroma-based statistical features. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR), pp. 288–295, London, UK, (2005)
S. Patel, K.R. Scherer, Vocal behaviour, in Handbook of Nonverbal Communication, ed. by J.A. Hall, M.L. Knapp (Mouton-DeGruyter, Berlin, 2013), pp. 167–204
A. Sadeghi Naini, M. Homayounpour, Speaker age interval and sex identification based on jitters, shimmers and mean MFCC using supervised and unsupervised discriminative classification methods. In Proceedings of the 8th International Conference on Signal Processing (ICSP), vol 1, Beijing, China, 2006. doi:10.1109/ICOSP.2006.345516
K.R. Scherer, Vocal affect expression: A review and a model for future research. Psychol. Bull. 99, 143–165 (1986)
M. Schröder, Speech and Emotion Research: An Overview of Research Frameworks and a Dimensional Approach to Emotional Speech Synthesis, volume PHONUS 7 of Research Report of the Institute of Phonetics, Saarland University. Ph.D. thesis, Institute for Phonetics, University of Saarbrücken, 2004
M. Schröder, F. Burkhardt, S. Krstulovic, Synthesis of emotional speech, in Blueprint for Affective Computing, ed. by K.R. Scherer, T. Bänziger, E. Roesch (Oxford University Press, Oxford, 2010), pp. 222–231
B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, V. Aharonson, The Relevance of Feature Type for the Automatic Classification of Emotional User States: Low Level Descriptors and Functionals. In Proceedings of INTERSPEECH 2007, ISCA, pp. 2253–2256, Antwerp, Belgium, August 2007a
B. Schuller, F. Eyben, G. Rigoll, Fast and Robust Meter and Tempo Recognition for the Automatic Discrimination of Ballroom Dance Styles. In Proceedings of the ICASSP 2007, IEEE. vol I, pp 217–220, Honolulu, HI, USA, April 2007b
B. Schuller, A. Batliner, S. Steidl, F. Schiel, J. Krajewski, The INTERSPEECH 2011 Speaker State Challenge. In Proceedings of INTERSPEECH 2011, ISCA, Florence, Italy, pp. 3201–3204, August 2011a
B. Schuller, M. Valstar, F. Eyben, G. McKeown, R. Cowie, M. Pantic, AVEC 2011—The First International Audio/Visual Emotion Challenge, in Proceedings of the First International Audio/Visual Emotion Challenge and Workshop, AVEC 2011, held in conjunction with the International HUMAINE Association Conference on Affective Computing and Intelligent Interaction (ACII) 2011, vol. II, ed. by B. Schuller, M. Valstar, R. Cowie, M. Pantic (Springer, Memphis, TN, USA, October 2011b), pp. 415–424
B. Schuller, G. Rigoll, Recognising Interest in Conversational Speech—Comparing Bag of Frames and Supra-segmental Features. In Proceedings of INTERSPEECH 2009, ISCA pp. 1999–2002, Brighton, UK, September 2009
B. Schuller, G. Rigoll, M. Lang, Hidden Markov Model-based Speech Emotion Recognition. In Proceedings of the ICASSP 2003, IEEE. vol 2, pp. II 1–4, Hong Kong, China, April 2003
B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, S. Narayanan, The INTERSPEECH 2010 Paralinguistic Challenge. In Proceedings of INTERSPEECH 2010, ISCA, Makuhari, Japan, pp. 2794–2797, September 2010
B. Schuller, S. Steidl, A. Batliner, J. Epps, F. Eyben, F. Ringeval, E. Marchi, Y. Zhang, The INTERSPEECH 2014 computational paralinguistics challenge: Cognitive and physical load. In Proceedings of the INTERSPEECH 2014, ISCA. Singapore, 2014. (to appear)
B. Schuller, S. Steidl, A. Batliner, F. Jurcicek, The INTERSPEECH 2009 Emotion Challenge. In Proceedings of INTERSPEECH 2009, ISCA, Brighton, UK, pp. 312–315, September 2009
B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. van Son, F. Weninger, F. Eyben, T. Bocklet, G. Mohammadi, B. Weiss, The INTERSPEECH 2012 Speaker Trait Challenge. In Proceedings of INTERSPEECH 2012, ISCA. Portland, OR, USA, September 2012a
B. Schuller, M. Valstar, R. Cowie, M. Pantic, AVEC 2012: the continuous audio/visual emotion challenge—an introduction, in Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI) 2012, ed. by L.-P. Morency, D. Bohus, H.K. Aghajan, J. Cassell, A. Nijholt, J. Epps (ACM, Santa Monica, CA, USA, October 2012b), pp. 361–362
B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, et al., The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism. In Proceedings of the INTERSPEECH 2013, ISCA, Lyon, France, pp. 148–152, 2013
B. Schuller, A. Batliner, Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing (Wiley, Hoboken, 2013), p. 344. ISBN 978-1119971368
J. Sundberg, S. Patel, E. Bjorkner, K .R. Scherer, Interdependencies among voice source parameters in emotional speech. IEEE Trans. Affect. Comput. 2(3), 162–174 (2011). doi:10.1109/T-AFFC.2011.14. ISSN 1949-3045
M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, M. Pantic, AVEC 2013—The Continuous Audio/Visual Emotion and Depression Recognition Challenge. In Proceedings of the ACM Multimedia 2013, ACM, Barcelona, Spain, October 2013
D. Ververidis, C. Kotropoulos, Emotional speech recognition: Resources, features, and methods. Speech Commun. 48(9), 1162–1181 (2006)
F. Weninger, F. Eyben, B. W. Schuller, M. Mortillaro, K. R. Scherer, On the Acoustics of Emotion in Audio: What Speech, Music and Sound have in Common. Frontiers in Psychology, 4(Article ID 292): 1–12, May 2013b. doi:10.3389/fpsyg.2013.00292
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this chapter
Eyben, F. (2016). Standard Baseline Feature Sets. In: Real-time Speech and Music Classification by Large Audio Feature Space Extraction. Springer Theses. Springer, Cham. https://doi.org/10.1007/978-3-319-27299-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27298-6
Online ISBN: 978-3-319-27299-3