Abstract
In this paper, we discuss a family of new Automatic Speech Recognition (ASR) approaches that depart from conventional ASR but have recently been shown to be more robust to nonstationary noise, without requiring specific adaptation or “multi-style” training. More specifically, we motivate and briefly describe new approaches based on multi-stream and subband ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) streams representing the speech signal are processed by different (independent) “experts”, each focusing on a different characteristic of the signal, and that the resulting stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension to multi-stream ASR, we finally introduce a new approach, referred to as HMM2, in which the HMM emission probabilities are estimated by state-specific, feature-based HMMs responsible for merging the stream information and modeling the possible correlation between streams.
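As a rough illustration of the combination step mentioned above, the following sketch merges per-stream class posteriors with a weighted log-linear (geometric) rule. The abstract does not state which combination rule is used, so the rule itself, the function name `combine_stream_posteriors`, the reliability weights, and the example numbers are all assumptions made for illustration only.

```python
import math


def combine_stream_posteriors(stream_posteriors, weights):
    """Weighted log-linear combination of per-stream class posteriors.

    stream_posteriors[k][q] is the posterior of class q given stream k.
    weights[k] is a hypothetical non-negative reliability weight for stream k.
    Returns the combined scores, renormalized to sum to 1.
    """
    if len(stream_posteriors) != len(weights):
        raise ValueError("need exactly one weight per stream")

    n_classes = len(stream_posteriors[0])
    eps = 1e-12  # floor that avoids log(0) for zero posteriors

    # Sum of weighted log posteriors per class (product of experts in log domain).
    log_scores = [
        sum(w * math.log(max(p[q], eps)) for p, w in zip(stream_posteriors, weights))
        for q in range(n_classes)
    ]

    # Renormalize with the log-sum-exp trick for numerical stability.
    peak = max(log_scores)
    exp_scores = [math.exp(s - peak) for s in log_scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]


# Illustrative numbers only: one fairly confident subband, one noisy subband.
clean_band = [0.7, 0.2, 0.1]
noisy_band = [0.3, 0.4, 0.3]
combined = combine_stream_posteriors([clean_band, noisy_band], weights=[0.8, 0.2])
```

With the weight concentrated on the cleaner stream, the combined decision follows that stream. Setting a stream's weight to zero removes its influence entirely, which is one simple way a corrupted band could be discounted under this assumed rule.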
Copyright information
© 2004 Springer Science+Business Media New York
About this paper
Cite this paper
Bourlard, H., Bengio, S., Weber, K. (2004). Towards Robust and Adaptive Speech Recognition Models. In: Johnson, M., Khudanpur, S.P., Ostendorf, M., Rosenfeld, R. (eds) Mathematical Foundations of Speech and Language Processing. The IMA Volumes in Mathematics and its Applications, vol 138. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-9017-4_9
DOI: https://doi.org/10.1007/978-1-4419-9017-4_9
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4612-6484-2
Online ISBN: 978-1-4419-9017-4
eBook Packages: Springer Book Archive