
Part of the book series: Springer Theses


Abstract

Rapidly growing interest in, and the market value of, social signal and media analysis have created a large demand for robust technology that works in adverse situations, i.e., with poor audio quality and high levels of background noise and reverberation. Application areas include, e.g., interactive speech systems on mobile devices, multi-modal user profiling for better user adaptation of smart agents, call centre speech analytics, marketing research voice analytics, and health monitoring for stress and depression. This chapter discusses methods for robust speech pre-processing and for enhancing audio classification algorithms in degraded acoustic conditions.


Notes

  1. In openSMILE, the energy-based VAD can be implemented with a cEnergy component and a cTurnDetector component.

  2. This type of smoothing is done by the cTurnDetector component in openSMILE.

  3. In openSMILE, both MVN and MRN are implemented in the cVectorMVN component.

  4. http://www.shazam.com/.
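The energy-based voice activity detection and decision smoothing mentioned in notes 1 and 2 can be sketched as follows. This is an illustrative NumPy sketch, not the openSMILE cEnergy/cTurnDetector implementation; the frame length, threshold, and hangover length are assumed example values:

```python
import numpy as np

def energy_vad(signal, frame_len=400, threshold_db=-30.0):
    """Frame-wise energy-based VAD.

    Returns a boolean array that is True where the frame energy,
    expressed in dB relative to the loudest frame, exceeds the threshold.
    """
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames.astype(float) ** 2, axis=1)
    energy_db = 10.0 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    return energy_db > threshold_db

def smooth_decisions(vad, hangover=3):
    """Hangover smoothing: keep the 'speech' decision active for a few
    frames after the raw detector drops out, bridging short pauses."""
    smoothed = vad.copy()
    counter = 0
    for i, active in enumerate(vad):
        if active:
            counter = hangover
        elif counter > 0:
            smoothed[i] = True
            counter -= 1
    return smoothed
```

A turn detector would then report contiguous runs of smoothed True frames as speech segments.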
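The two normalisation schemes referenced in note 3 amount to the following, assuming MVN denotes mean-variance normalisation and MRN mean-range normalisation. A minimal NumPy sketch, not the cVectorMVN implementation (which additionally supports on-line and incremental operation):

```python
import numpy as np

def mvn(features, eps=1e-12):
    """Mean-variance normalisation: per feature dimension, subtract the
    mean and divide by the standard deviation (zero mean, unit variance)."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)

def mrn(features, eps=1e-12):
    """Mean-range normalisation: per feature dimension, subtract the
    mean and divide by the range (max minus min)."""
    mu = features.mean(axis=0)
    span = features.max(axis=0) - features.min(axis=0)
    return (features - mu) / (span + eps)
```

Both operate per feature dimension over a matrix of shape (frames, features); in practice the statistics are estimated per speaker or per recording.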



Author information

Corresponding author

Correspondence to Florian Eyben.


Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Eyben, F. (2016). Real-Life Robustness. In: Real-time Speech and Music Classification by Large Audio Feature Space Extraction. Springer Theses. Springer, Cham. https://doi.org/10.1007/978-3-319-27299-3_5


  • DOI: https://doi.org/10.1007/978-3-319-27299-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27298-6

  • Online ISBN: 978-3-319-27299-3

  • eBook Packages: Engineering, Engineering (R0)
