Abstract
The rapidly growing interest in, and market value of, social signal and media analysis has created a large demand for robust technology that works in adverse situations, i.e., with poor audio quality and high levels of background noise and reverberation. Application areas include interactive speech systems on mobile devices, multi-modal user profiling for better user adaptation of smart agents, call-centre speech analytics, voice analytics for marketing research, and health monitoring for stress and depression. This chapter discusses methods for robust speech pre-processing and for enhancing audio classification algorithms in degraded acoustic conditions.
Notes
- 1. In openSMILE, the energy-based VAD can be implemented with a cEnergy component and a cTurnDetector component.
- 2. This type of smoothing is performed by the cTurnDetector component in openSMILE.
- 3. In openSMILE, both MVN and MRN are implemented in the cVectorMVN component.
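The notes above describe three building blocks: an energy-based VAD, temporal smoothing of the raw VAD decision, and mean-variance normalisation (MVN) of feature vectors. The following is a minimal illustrative sketch of these ideas in plain NumPy, not a reproduction of the openSMILE components themselves; the threshold and hangover values are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def energy_vad(frames, threshold=0.01, hang=5):
    """Energy-based VAD with hangover smoothing.

    A frame is marked as speech if its RMS energy exceeds a fixed
    threshold; short energy dips are bridged ("hangover" smoothing,
    in the spirit of cTurnDetector) so that brief pauses do not
    chop a speech turn into fragments.
    """
    energy = np.sqrt(np.mean(frames ** 2, axis=1))  # per-frame RMS
    raw = energy > threshold
    smoothed = raw.copy()
    frames_since_speech = hang + 1  # start outside the hangover window
    for i, active in enumerate(raw):
        if active:
            frames_since_speech = 0
        else:
            frames_since_speech += 1
            if frames_since_speech <= hang:
                smoothed[i] = True  # bridge a short pause
    return smoothed

def mvn(features):
    """Mean-variance normalisation: zero mean, unit variance per
    feature dimension (cf. the cVectorMVN component)."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / np.maximum(sigma, 1e-12)  # guard /0
```

In a real deployment the threshold would be adapted to the noise floor, and the normalisation statistics would be estimated online or per speaker rather than over a fixed batch.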
© 2016 Springer International Publishing Switzerland
Cite this chapter
Eyben, F. (2016). Real-Life Robustness. In: Real-time Speech and Music Classification by Large Audio Feature Space Extraction. Springer Theses. Springer, Cham. https://doi.org/10.1007/978-3-319-27299-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27298-6
Online ISBN: 978-3-319-27299-3