In distant-talking scenarios, automatic speech recognition (ASR) is hampered by background noise, competing speakers and room reverberation. Unlike background noise and competing speakers, reverberation cannot be captured by an additive or multiplicative term in the feature domain because reverberation has a dispersive effect on the speech feature sequences. Therefore, traditional acoustic modeling techniques and conventional methods to increase robustness to additive distortions provide only limited performance in reverberant environments.
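The dispersive effect described above can be illustrated with a small numerical sketch. It is not taken from the chapter: the sampling rate, the 10 ms frame length, the reverberation time of 0.5 s, and the synthetic impulse response (exponentially decaying white noise, a common simplification) are all assumptions chosen for illustration, and the helper names `frame_energies` and `synthetic_rir` are hypothetical. A short burst confined to a single analysis frame is convolved with the impulse response, and the per-frame energies are compared before and after.

```python
import numpy as np


def frame_energies(x, frame_len):
    """Energy of consecutive non-overlapping frames of x."""
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).sum(axis=1)


def synthetic_rir(fs, t60, rng):
    """Assumed toy room impulse response: white noise whose energy
    envelope has decayed by 60 dB after t60 seconds."""
    n = int(fs * t60)
    t = np.arange(n) / fs
    # 60 dB energy decay corresponds to an amplitude factor of 10**(-3) at t60.
    envelope = np.exp(-3.0 * np.log(10.0) * t / t60)
    h = rng.standard_normal(n) * envelope
    h[0] = 1.0  # direct path
    return h


fs = 16000          # sampling rate in Hz (assumption)
frame_len = 160     # 10 ms frames (assumption)
t60 = 0.5           # reverberation time in seconds (assumption)
rng = np.random.default_rng(0)

# Clean signal: one noise burst that occupies exactly frame 5, silence elsewhere.
clean = np.zeros(100 * frame_len)
clean[5 * frame_len : 6 * frame_len] = rng.standard_normal(frame_len)

# Reverberant signal: convolution with the impulse response, cut to the same length.
h = synthetic_rir(fs, t60, rng)
reverberant = np.convolve(clean, h)[: len(clean)]

clean_e = frame_energies(clean, frame_len)
rev_e = frame_energies(reverberant, frame_len)

# Count frames whose energy lies within 40 dB of the strongest frame.
clean_active = int((clean_e > 1e-4 * clean_e.max()).sum())
rev_active = int((rev_e > 1e-4 * rev_e.max()).sum())

print("active frames, clean:      ", clean_active)
print("active frames, reverberant:", rev_active)
```

In this sketch the clean energy is exactly zero in every frame after the burst, while the reverberant energy is not. No per-frame gain can map zero to a nonzero value, and no offset that is independent of the preceding speech can account for a tail whose shape depends on what was said earlier, which is one way to see why frame-wise additive or multiplicative corrections fall short here.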
Based on a thorough analysis of the effect of room reverberation on speech feature sequences, this contribution gives a concise overview of the state of the art in reverberant speech recognition. The methods for achieving robustness are classified into three groups: signal dereverberation and beamforming as preprocessing, robust feature extraction, and adjustment of the acoustic models to reverberation. Finally, a novel concept called reverberation modeling for speech recognition, which combines the advantages of all three classes, is described.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Sehr, A., Kellermann, W. (2008). Towards Robust Distant-Talking Automatic Speech Recognition in Reverberant Environments. In: Hänsler, E., Schmidt, G. (eds) Speech and Audio Processing in Adverse Environments. Signals and Communication Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70602-1_18
DOI: https://doi.org/10.1007/978-3-540-70602-1_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70601-4
Online ISBN: 978-3-540-70602-1
eBook Packages: Engineering, Engineering (R0)