Summary
While most research in the last decade focused on developing environment-independent automatic speech recognition (ASR), environment-adaptive systems have recently attracted much interest. Adaptive systems have been studied to compensate for a wide variety of problems: speaking style, speaking rate, non-native speakers, transducers and transmission channels, noise, language, task, etc. In this chapter, we contrast environment-independent and environment-adaptive systems. We study environment-independent approaches from two perspectives: 1) speech analysis and feature extraction, and 2) acoustic modeling. We also present a front-end, called the phoneme similarity front-end, which is relatively insensitive to speaker variations. Then, we review adaptation/compensation techniques that have been successful at improving ASR robustness and that apply to a wide range of problems. Finally, we focus on the mainstream methods applicable to noise-robust speech recognition, with an emphasis on noise-adaptive techniques. While this chapter emphasizes state-of-the-art technology in the domain of robust and adaptive speech recognition, exciting new developments in this area are also discussed in Chapter 5.
Copyright information
© 2002 Kluwer Academic Publishers
About this chapter
Cite this chapter
(2002). Environment-Independent Adaptive Speech Recognition: A Review of the State of the Art. In: Robust Speech Recognition in Embedded Systems and PC Applications. The International Series in Engineering and Computer Science, vol 563. Springer, Boston, MA. https://doi.org/10.1007/0-306-47027-6_2
DOI: https://doi.org/10.1007/0-306-47027-6_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-7873-0
Online ISBN: 978-0-306-47027-1
eBook Packages: Springer Book Archive