Part of the book series: Springer Handbooks (SHB)

Abstract

When a speech recognition system is deployed outside the laboratory setting, it needs to handle a variety of signal variabilities. These may be due to many factors, including additive noise, acoustic echo, and speaker accent. If the speech recognition accuracy does not degrade very much under these conditions, the system is called robust. Even though there are several reasons why real-world speech may differ from clean speech, in this chapter we focus on the influence of the acoustical environment, defined as the transformations that affect the speech signal from the time it leaves the mouth until it is in digital format.

Specifically, we discuss strategies for dealing with additive noise. Some of the techniques, like feature normalization, are general enough to provide robustness against several forms of signal degradation. Others, such as feature enhancement, provide superior noise robustness at the expense of being less general. A good system will implement several techniques to provide a strong defense against acoustical variabilities.
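
As a concrete illustration of the feature-normalization family mentioned above, the following is a minimal sketch of per-utterance cepstral mean and variance normalization (CMVN, listed in the abbreviations below). The function name, the NumPy dependency, and the toy data are illustrative assumptions and are not taken from the chapter.

    import numpy as np

    def cmvn(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
        """Per-utterance cepstral mean and variance normalization (illustrative sketch).

        features: array of shape (num_frames, num_coeffs), e.g. MFCC vectors for
        one utterance. Returns the features shifted to zero mean and scaled to
        unit variance per coefficient, which removes stationary channel offsets
        and reduces the spread introduced by additive noise.
        """
        mean = features.mean(axis=0)             # per-coefficient mean over the utterance
        std = features.std(axis=0)               # per-coefficient standard deviation
        return (features - mean) / (std + eps)   # eps avoids division by zero on silence

    if __name__ == "__main__":
        # Toy example: 300 frames of 13 coefficients, with an arbitrary offset
        # and scale standing in for channel and noise effects.
        rng = np.random.default_rng(0)
        utterance = 2.0 + 3.0 * rng.standard_normal((300, 13))
        normalized = cmvn(utterance)
        print(normalized.mean(axis=0).round(3))  # approximately 0 for every coefficient
        print(normalized.std(axis=0).round(3))   # approximately 1 for every coefficient

By contrast, feature-enhancement techniques estimate the noise explicitly and remove it from the spectral or log mel-filterbank representation before the cepstral features are computed, which is why they can achieve stronger noise suppression while being specific to additive noise.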


Abbreviations

AFE: advanced front-end

AGN: automatic gain normalization

ARMA: autoregressive moving-average

ASR: automatic speech recognition

CDCN: codeword-dependent cepstral normalization

CDF: cumulative distribution function

CHN: cepstral histogram normalization

CMVN: cepstral mean and variance normalization

CVN: cepstral variance normalization

DCT: discrete cosine transform

DFT: discrete Fourier transform

DPMC: data-driven parallel model combination

FFT: fast Fourier transform

GMM: Gaussian mixture model

HMM: hidden Markov model

IIR: infinite impulse response

LMFB: log mel-frequency filterbank

LTI: linear time-invariant

MAP: maximum a posteriori

MCE: minimum classification error

MFCC: mel-frequency cepstral coefficient

ML: maximum likelihood

MLLR: maximum-likelihood linear regression

MMI: maximum mutual information

MMSE: minimum mean-square error

MPE: minimum phone error

PDF: probability density function

PMC: parallel model combination

RASTA: relative spectra

RATZ: multivariate Gaussian-based cepstral normalization

SNR: signal-to-noise ratio

SPINE: speech in noisy environments

SPLICE: stereo piecewise linear compensation for environment

SVM: support vector machine

VAD: voice activity detector

VQ: vector quantization

VTS: vector Taylor series

WSJ: Wall Street Journal


Author information

Correspondence to Jasha Droppo or Alex Acero.


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Droppo, J., Acero, A. (2008). Environmental Robustness. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds) Springer Handbook of Speech Processing. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49127-9_33

  • DOI: https://doi.org/10.1007/978-3-540-49127-9_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-49125-5

  • Online ISBN: 978-3-540-49127-9

  • eBook Packages: Engineering (R0)
