Abstract
The rapidly growing interest in, and market value of, social signal and media analysis has created a large demand for robust technology that works in adverse situations, i.e., with poor audio quality and high levels of background noise and reverberation. Application areas include interactive speech systems on mobile devices, multi-modal user profiling for better user adaptation of smart agents, call-centre speech analytics, voice analytics for marketing research, and health monitoring for stress and depression. This chapter discusses methods for robust speech pre-processing and for enhancing audio classification algorithms in degraded acoustic conditions.
Notes
- 1. In openSMILE, the energy-based VAD can be implemented with a cEnergy component and a cTurnDetector component.
- 2. This type of smoothing is performed by the cTurnDetector component in openSMILE.
- 3. In openSMILE, both MVN and MRN are implemented in the cVectorMVN component.
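The notes above describe three building blocks: an energy-based VAD, temporal smoothing of the raw VAD decision, and mean-variance normalisation (MVN) of feature vectors. The following is a minimal illustrative sketch of these ideas in plain NumPy, not a reproduction of the openSMILE components themselves; the threshold and hangover values are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def energy_vad(frames, threshold=0.01, hang=5):
    """Energy-based VAD with hangover smoothing.

    A frame is marked as speech if its RMS energy exceeds a fixed
    threshold; short energy dips are bridged ("hangover" smoothing,
    in the spirit of cTurnDetector) so that brief pauses do not
    chop a speech turn into fragments.
    """
    energy = np.sqrt(np.mean(frames ** 2, axis=1))  # per-frame RMS
    raw = energy > threshold
    smoothed = raw.copy()
    frames_since_speech = hang + 1  # start outside the hangover window
    for i, active in enumerate(raw):
        if active:
            frames_since_speech = 0
        else:
            frames_since_speech += 1
            if frames_since_speech <= hang:
                smoothed[i] = True  # bridge a short pause
    return smoothed

def mvn(features):
    """Mean-variance normalisation: zero mean, unit variance per
    feature dimension (cf. the cVectorMVN component)."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / np.maximum(sigma, 1e-12)  # guard /0
```

In a real deployment the threshold would be adapted to the noise floor, and the normalisation statistics would be estimated online or per speaker rather than over a fixed batch.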
© 2016 Springer International Publishing Switzerland
Cite this chapter
Eyben, F. (2016). Real-Life Robustness. In: Real-time Speech and Music Classification by Large Audio Feature Space Extraction. Springer Theses. Springer, Cham. https://doi.org/10.1007/978-3-319-27299-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27298-6
Online ISBN: 978-3-319-27299-3