
Front-End, Back-End, and Hybrid Techniques for Noise-Robust Speech Recognition

Chapter in: Robust Speech Recognition of Uncertain or Missing Data

Abstract

Noise robustness has long been an active area of research, attracting significant interest from speech recognition researchers and developers. In this chapter, focusing on the problem of uncertainty handling in robust speech recognition, we use the Bayesian framework as a common thread to connect, analyze, and categorize a number of popular approaches pursued in the recent past. The topics covered include: 1) Bayesian decision rules with unreliable features and unreliable model parameters; 2) principled ways of computing feature uncertainty using structured speech distortion models; 3) the use of a phase factor in an advanced speech distortion model for feature compensation; 4) a novel perspective on model compensation as a special implementation of the general Bayesian predictive classification rule, capitalizing on model parameter uncertainty; 5) a taxonomy of noise compensation techniques along two distinct axes, feature vs. model domain and structured vs. unstructured transformation; and 6) noise-adaptive training as a hybrid feature-model compensation framework, together with its various extensions.
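Topic 3 refers to a phase factor in the speech distortion model. The sketch below rests on three simplifying assumptions:

- a single mel channel;
- no convolutive (channel) distortion;
- a fixed scalar phase factor alpha in [-1, 1].

The function name `phase_sensitive_log_mel` is hypothetical. Under these assumptions, the power-domain relation |Y|^2 = |X|^2 + |N|^2 + 2*alpha*|X||N| can be rewritten in the log domain as y = x + log(1 + exp(n - x) + 2*alpha*exp((n - x)/2)). Here x, n and y are the log powers of clean speech, noise and noisy speech. Setting alpha = 0 recovers the conventional phase-insensitive model y = log(exp(x) + exp(n)).

```python
import math


def phase_sensitive_log_mel(x: float, n: float, alpha: float) -> float:
    """Noisy log-mel power y for one channel (illustrative sketch).

    Assumptions (not taken from the chapter): one mel channel, no channel
    distortion, and a fixed scalar phase factor alpha in [-1, 1].
    Derived from |Y|^2 = |X|^2 + |N|^2 + 2*alpha*|X||N| by taking logs:
        y = x + log(1 + exp(n - x) + 2*alpha*exp((n - x) / 2))
    """
    if not -1.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [-1, 1]")
    # Noise-to-speech magnitude ratio |N|/|X|, so r * r == exp(n - x).
    r = math.exp((n - x) / 2.0)
    arg = 1.0 + r * r + 2.0 * alpha * r
    if arg <= 0.0:
        # Only possible when alpha == -1 and |N| == |X|: exact cancellation.
        raise ValueError("speech and noise cancel exactly; log power undefined")
    return x + math.log(arg)
```

For example, with x = n = 0, alpha = 0 gives y = log 2, while alpha = 1 (speech and noise in phase) gives y = log 4. Treating alpha as one fixed number is a simplification made here for illustration. In practice the phase factor varies across frames and channels, so a compensation method would model it statistically rather than plug in a single value.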



Author information

Corresponding author: Li Deng


Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Deng, L. (2011). Front-End, Back-End, and Hybrid Techniques for Noise-Robust Speech Recognition. In: Kolossa, D., Häb-Umbach, R. (eds) Robust Speech Recognition of Uncertain or Missing Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21317-5_4


  • DOI: https://doi.org/10.1007/978-3-642-21317-5_4


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21316-8

  • Online ISBN: 978-3-642-21317-5

  • eBook Packages: Engineering, Engineering (R0)
