Environment-Independent Adaptive Speech Recognition: A Review of the State of the Art

doi:10.1007/0-306-47027-6_2

Part of the book series: The International Series in Engineering and Computer Science (SECS, volume 563)

Summary

While most research in the last decade focused on developing environment-independent automatic speech recognition (ASR), environment-adaptive systems have recently attracted much interest. Adaptive systems have been studied to compensate for a wide variety of problems: speaking style, speaking rate, non-native speakers, transducers and transmission channels, noise, language, task, etc. In this chapter, we contrast environment-independent and environment-adaptive systems. We study environment-independent approaches from two perspectives: 1) speech analysis and feature extraction, and 2) acoustic modeling. We also present a front-end, called the phoneme similarity front-end, which is relatively insensitive to speaker variations. Then, we review adaptation/compensation techniques that have successfully improved ASR robustness and are applicable to a wide range of problems. Finally, we focus on the mainstream methods for noise-robust speech recognition, with an emphasis on noise-adaptive techniques. While this chapter emphasizes state-of-the-art technology in the domain of robust and adaptive speech recognition, exciting new developments in this area are also discussed in Chapter 5.

Cite this chapter

(2002). Environment-Independent Adaptive Speech Recognition: A Review of the State of the Art. In: Robust Speech Recognition in Embedded Systems and PC Applications. The International Series in Engineering and Computer Science, vol 563. Springer, Boston, MA. https://doi.org/10.1007/0-306-47027-6_2

DOI: https://doi.org/10.1007/0-306-47027-6_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-7873-0
Online ISBN: 978-0-306-47027-1
eBook Packages: Springer Book Archive
