Abstract
Significant improvements in automatic speech recognition performance have been obtained through front-end feature representations that exploit the time-varying properties of speech spectra. Various techniques have been developed to incorporate “spectral dynamics” into the speech representation, including temporal derivative features, spectral mean normalization and, more generally, spectral parameter filtering. This chapter describes the implementation and interrelationships of these techniques and illustrates their use in automatic speech recognition under different types of adverse conditions.
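As a rough illustration of two of the techniques named above, the following sketch applies mean normalization and a regression-based temporal derivative to a sequence of feature frames. It is a minimal sketch under stated assumptions, not the chapter's implementation: the function names, the window half-width of 2 frames, the edge handling by frame repetition, and the toy data are all choices made here for illustration. Both operations act on the time trajectory of each coefficient, which is why they can be viewed as special cases of filtering the spectral parameters.

```python
def mean_normalize(frames):
    """Subtract each coefficient's mean over the utterance from every frame."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[i] for f in frames) / n for i in range(dim)]
    return [[f[i] - means[i] for i in range(dim)] for f in frames]


def delta_features(frames, half_width=2):
    """Estimate the time derivative of each coefficient as a regression slope.

    The slope is fitted over 2 * half_width + 1 frames centred on each frame.
    Edges are handled by repeating the first or last frame (an assumption
    made for this sketch).
    """
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, half_width + 1))
    deltas = []
    for t in range(n):
        row = []
        for i in range(dim):
            acc = 0.0
            for k in range(1, half_width + 1):
                ahead = frames[min(t + k, n - 1)][i]
                behind = frames[max(t - k, 0)][i]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


# Toy trajectory (hypothetical data): the first coefficient rises linearly,
# the second is a constant offset such as a fixed channel might introduce.
frames = [[float(t), 5.0] for t in range(6)]
normalized = mean_normalize(frames)  # the constant offset is removed
deltas = delta_features(frames)      # the constant coefficient has zero slope
```

In this toy case the constant second coefficient is mapped to zero by both operations, which hints at why such features are less sensitive to a fixed offset in the spectral parameters.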
Copyright information
© 1996 Kluwer Academic Publishers
About this chapter
Cite this chapter
Hanson, B.A., Applebaum, T.H., Junqua, JC. (1996). Spectral Dynamics for Speech Recognition Under Adverse Conditions. In: Lee, CH., Soong, F.K., Paliwal, K.K. (eds) Automatic Speech and Speaker Recognition. The Kluwer International Series in Engineering and Computer Science, vol 355. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-1367-0_14
DOI: https://doi.org/10.1007/978-1-4613-1367-0_14
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4612-8590-8
Online ISBN: 978-1-4613-1367-0
eBook Packages: Springer Book Archive