Abstract
When a speech recognition system is deployed outside the laboratory setting, it needs to handle a variety of signal variabilities. These may be due to many factors, including additive noise, acoustic echo, and speaker accent. If the speech recognition accuracy does not degrade very much under these conditions, the system is called robust. Even though there are several reasons why real-world speech may differ from clean speech, in this chapter we focus on the influence of the acoustical environment, defined as the transformations that affect the speech signal from the time it leaves the mouth until it is in digital format.
Specifically, we discuss strategies for dealing with additive noise. Some of the techniques, like feature normalization, are general enough to provide robustness against several forms of signal degradation. Others, such as feature enhancement, provide superior noise robustness at the expense of being less general. A good system will implement several techniques to provide a strong defense against acoustical variabilities.
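As an illustration of the feature-normalization family mentioned above, the following is a minimal sketch of cepstral mean and variance normalization (CMVN, one of the techniques listed in the abbreviations). The function name, array shapes, and variance floor are illustrative assumptions, not the chapter's reference implementation:

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization (CMVN).

    Normalizes each cepstral dimension to zero mean and unit variance
    over the utterance, which removes stationary channel effects and
    reduces sensitivity to additive noise.

    features: (num_frames, num_ceps) array, e.g., MFCCs.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Floor the standard deviation to avoid division by zero
    return (features - mean) / np.maximum(std, 1e-10)

# Example: a synthetic "utterance" of 100 frames x 13 cepstra
feats = np.random.randn(100, 13) * 5.0 + 3.0
norm = cmvn(feats)
# Each dimension of `norm` now has zero mean and unit variance
```

Because the statistics are computed per utterance, CMVN needs no stereo (clean/noisy) training data, which is why it generalizes across degradation types better than model-specific enhancement.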
Abbreviations
- AFE: advanced front-end
- AGN: automatic gain normalization
- ARMA: autoregressive moving-average
- ASR: automatic speech recognition
- CDCN: codeword-dependent cepstral normalization
- CDF: cumulative distribution function
- CHN: cepstral histogram normalization
- CMVN: cepstral mean and variance normalization
- CVN: cepstral variance normalization
- DCT: discrete cosine transform
- DFT: discrete Fourier transform
- DPMC: data-driven parallel model combination
- FFT: fast Fourier transform
- GMM: Gaussian mixture model
- HMM: hidden Markov model
- IIR: infinite impulse response
- LMFB: log mel-frequency filterbank
- LTI: linear time-invariant
- MAP: maximum a posteriori
- MCE: minimum classification error
- MFCC: mel-frequency cepstral coefficient
- ML: maximum likelihood
- MLLR: maximum-likelihood linear regression
- MMI: maximum mutual information
- MMSE: minimum mean-square error
- MPE: minimum phone error
- PDF: probability density function
- PMC: parallel model combination
- RASTA: relative spectra
- RATZ: multivariate Gaussian-based cepstral normalization
- SNR: signal-to-noise ratio
- SPINE: speech in noisy environments
- SPLICE: stereo piecewise linear compensation for environment
- SVM: support vector machine
- VAD: voice activity detector
- VQ: vector quantization
- VTS: vector Taylor series
- WSJ: Wall Street Journal
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Droppo, J., Acero, A. (2008). Environmental Robustness. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds) Springer Handbook of Speech Processing. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49127-9_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49125-5
Online ISBN: 978-3-540-49127-9