Towards Robust Distant-Talking Automatic Speech Recognition in Reverberant Environments

Speech and Audio Processing in Adverse Environments

Part of the book series: Signals and Communication Technology (SCT)

In distant-talking scenarios, automatic speech recognition (ASR) is hampered by background noise, competing speakers, and room reverberation. Unlike background noise and competing speakers, reverberation cannot be captured by an additive or multiplicative term in the feature domain, because it has a dispersive effect on the speech feature sequences. Therefore, traditional acoustic modeling techniques and conventional methods for increasing robustness to additive distortions achieve only limited performance in reverberant environments.
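The dispersive effect can be illustrated with a small numerical experiment. The following is only a toy sketch under stated assumptions and is not taken from the chapter: amplitude-modulated white noise stands in for speech, raw log power spectra stand in for the recognizer's features, and the room impulse response is synthetic exponentially decaying noise. The function names, the 16 kHz sampling rate, the 25 ms frame length, and the 0.5 s reverberation time are all hypothetical choices. The script measures how much the frame-wise difference between distorted and clean log spectra fluctuates over time. A channel much shorter than the analysis frame should give a nearly constant offset, whereas a long impulse response should not.

```python
import numpy as np

FS = 16000  # assumed sampling rate in Hz (illustrative choice)


def log_power_spectra(signal, frame_len=400, hop=160):
    """Return log power spectra (frames x bins) of Hann-windowed frames."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)  # small floor avoids log(0)


def offset_variability(clean, impulse_response):
    """Std over time of the log-spectral difference, averaged over bins.

    A value near zero means the channel behaves like a constant additive
    offset in the log-spectral domain.
    """
    distorted = np.convolve(clean, impulse_response)[:len(clean)]
    diff = log_power_spectra(distorted) - log_power_spectra(clean)
    return float(np.mean(np.std(diff, axis=0)))


def compare_channels(seed=0, duration_s=2.0, t60_s=0.5):
    """Compare a short channel with a long synthetic room impulse response."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(duration_s * FS)) / FS
    # Hypothetical stand-in for speech: noise with a 4 Hz amplitude envelope.
    envelope = 0.55 + 0.45 * np.sin(2 * np.pi * 4.0 * t)
    clean = envelope * rng.standard_normal(len(t))

    # Short channel: 3 taps, far shorter than the 25 ms analysis frame.
    short_channel = np.array([1.0, 0.5, 0.25])

    # Long channel: decaying noise whose amplitude drops by 60 dB after t60_s.
    t_rir = np.arange(int(t60_s * FS)) / FS
    long_channel = rng.standard_normal(len(t_rir)) * np.exp(
        -np.log(1000.0) * t_rir / t60_s)

    return (offset_variability(clean, short_channel),
            offset_variability(clean, long_channel))


if __name__ == "__main__":
    short_var, long_var = compare_channels()
    print(f"short channel: {short_var:.3f}   long channel: {long_var:.3f}")
```

In this sketch the short channel leaves the log-spectral difference almost constant over time, which is the situation that offset-based compensation methods assume. The long impulse response spreads energy over many consecutive frames, so the difference depends on the preceding frames and no single offset describes it.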

Based on a thorough analysis of the effect of room reverberation on speech feature sequences, this contribution gives a concise overview of the state of the art in reverberant speech recognition. The methods for achieving robustness are classified into three groups: signal dereverberation and beamforming as preprocessing, robust feature extraction, and adjustment of the acoustic models to reverberation. Finally, a novel concept called reverberation modeling for speech recognition, which combines the advantages of all three classes, is described.


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Sehr, A., Kellermann, W. (2008). Towards Robust Distant-Talking Automatic Speech Recognition in Reverberant Environments. In: Hänsler, E., Schmidt, G. (eds) Speech and Audio Processing in Adverse Environments. Signals and Communication Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70602-1_18

  • DOI: https://doi.org/10.1007/978-3-540-70602-1_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-70601-4

  • Online ISBN: 978-3-540-70602-1

  • eBook Packages: Engineering, Engineering (R0)
