Abstract
When a speech recognition system is deployed outside the laboratory setting, it needs to handle a variety of signal variabilities. These may be due to many factors, including additive noise, acoustic echo, and speaker accent. If the speech recognition accuracy does not degrade very much under these conditions, the system is called robust. Even though there are several reasons why real-world speech may differ from clean speech, in this chapter we focus on the influence of the acoustical environment, defined as the transformations that affect the speech signal from the time it leaves the mouth until it is in digital format.
Specifically, we discuss strategies for dealing with additive noise. Some of the techniques, like feature normalization, are general enough to provide robustness against several forms of signal degradation. Others, such as feature enhancement, provide superior noise robustness at the expense of being less general. A good system will implement several techniques to provide a strong defense against acoustical variabilities.
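As an illustration of the feature-normalization family mentioned above, the following is a minimal sketch of cepstral mean and variance normalization (CMVN, one of the techniques listed in the abbreviations). The function name, array shapes, and variance floor are illustrative assumptions, not the chapter's reference implementation:

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization (CMVN).

    Normalizes each cepstral dimension to zero mean and unit variance
    over the utterance, which removes stationary channel effects and
    reduces sensitivity to additive noise.

    features: (num_frames, num_ceps) array, e.g., MFCCs.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Floor the standard deviation to avoid division by zero
    return (features - mean) / np.maximum(std, 1e-10)

# Example: a synthetic "utterance" of 100 frames x 13 cepstra
feats = np.random.randn(100, 13) * 5.0 + 3.0
norm = cmvn(feats)
# Each dimension of `norm` now has zero mean and unit variance
```

Because the statistics are computed per utterance, CMVN needs no stereo (clean/noisy) training data, which is why it generalizes across degradation types better than model-specific enhancement.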
Abbreviations
- AFE: advanced front-end
- AGN: automatic gain normalization
- ARMA: autoregressive moving-average
- ASR: automatic speech recognition
- CDCN: codeword-dependent cepstral normalization
- CDF: cumulative distribution function
- CHN: cepstral histogram normalization
- CMVN: cepstral mean and variance normalization
- CVN: cepstral variance normalization
- DCT: discrete cosine transform
- DFT: discrete Fourier transform
- DPMC: data-driven parallel model combination
- FFT: fast Fourier transform
- GMM: Gaussian mixture model
- HMM: hidden Markov model
- IIR: infinite impulse response
- LMFB: log mel-frequency filterbank
- LTI: linear time-invariant
- MAP: maximum a posteriori
- MCE: minimum classification error
- MFCC: mel-frequency cepstral coefficient
- ML: maximum likelihood
- MLLR: maximum-likelihood linear regression
- MMI: maximum mutual information
- MMSE: minimum mean-square error
- MPE: minimum phone error
- PDF: probability density function
- PMC: parallel model combination
- RASTA: relative spectra
- RATZ: multivariate Gaussian-based cepstral normalization
- SNR: signal-to-noise ratio
- SPINE: speech in noisy environments
- SPLICE: stereo piecewise linear compensation for environment
- SVM: support vector machine
- VAD: voice activity detector
- VQ: vector quantization
- VTS: vector Taylor series
- WSJ: Wall Street Journal
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Droppo, J., Acero, A. (2008). Environmental Robustness. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds) Springer Handbook of Speech Processing. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49127-9_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49125-5
Online ISBN: 978-3-540-49127-9