Spectral Dynamics for Speech Recognition Under Adverse Conditions

Hanson, Brian A.; Applebaum, Ted H.; Junqua, Jean-Claude

doi:10.1007/978-1-4613-1367-0_14

Part of the book series: The Kluwer International Series in Engineering and Computer Science (SECS, volume 355)

Abstract

Significant improvements in automatic speech recognition performance have been obtained through front-end feature representations which exploit the time-varying properties of speech spectra. Various techniques have been developed to incorporate “spectral dynamics” into the speech representation, including temporal derivative features, spectral mean normalization and, more generally, spectral parameter filtering. This chapter describes the implementation and interrelationships of these techniques and illustrates their use in automatic speech recognition under different types of adverse conditions.
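
As a rough illustration of two of the techniques named in the abstract, the sketch below computes a temporal derivative feature as a least-squares regression slope over neighboring frames, and applies mean normalization by subtracting each coefficient's utterance-level average. It is not taken from the chapter: the function names, the window half-width of 2 frames, and the edge handling (repeating the first and last frame) are all assumptions chosen for illustration. Both operations are linear filters applied along the time trajectory of each coefficient, which is one way to read the abstract's phrase "spectral parameter filtering".

```python
from typing import List

# frames[t][k] is coefficient k (e.g. a cepstral coefficient) at frame t.
Frames = List[List[float]]


def mean_normalize(frames: Frames) -> Frames:
    """Subtract each coefficient's mean over the whole utterance."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[k] for f in frames) / n for k in range(dims)]
    return [[f[k] - means[k] for k in range(dims)] for f in frames]


def delta(frames: Frames, half_width: int = 2) -> Frames:
    """Regression slope of each coefficient over 2 * half_width + 1 frames.

    Assumption for this sketch: frames beyond either end of the
    utterance are replaced by the nearest edge frame.
    """
    n = len(frames)
    dims = len(frames[0])
    norm = 2 * sum(j * j for j in range(1, half_width + 1))
    out = []
    for t in range(n):
        row = []
        for k in range(dims):
            acc = 0.0
            for j in range(1, half_width + 1):
                ahead = frames[min(t + j, n - 1)][k]
                behind = frames[max(t - j, 0)][k]
                acc += j * (ahead - behind)
            row.append(acc / norm)
        out.append(row)
    return out
```

In this sketch a constant offset added to every frame, which is how a fixed channel appears in a log-spectral or cepstral representation, leaves both outputs unchanged. Applying `delta` to its own output would give an acceleration-like second-order feature.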

Author information

Authors and Affiliations

Speech Technology Laboratory, Panasonic Technologies, Inc., Santa Barbara, California, USA
Brian A. Hanson, Ted H. Applebaum & Jean-Claude Junqua

Editor information

Editors and Affiliations

AT&T Bell Laboratories, Murray Hill, NJ, 07974, USA
Chin-Hui Lee & Frank K. Soong
School of Microelectronic Engineering, Griffith University, Australia
Kuldip K. Paliwal

About this chapter

Cite this chapter

Hanson, B.A., Applebaum, T.H., Junqua, J.-C. (1996). Spectral Dynamics for Speech Recognition Under Adverse Conditions. In: Lee, C.-H., Soong, F.K., Paliwal, K.K. (eds) Automatic Speech and Speaker Recognition. The Kluwer International Series in Engineering and Computer Science, vol 355. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-1367-0_14

DOI: https://doi.org/10.1007/978-1-4613-1367-0_14
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4612-8590-8
Online ISBN: 978-1-4613-1367-0
eBook Packages: Springer Book Archive
