Towards Robust Distant-Talking Automatic Speech Recognition in Reverberant Environments

Sehr, Armin; Kellermann, Walter

doi:10.1007/978-3-540-70602-1_18

Part of the book series: Signals and Communication Technology (SCT)

In distant-talking scenarios, automatic speech recognition (ASR) is hampered by background noise, competing speakers and room reverberation. Unlike background noise and competing speakers, reverberation cannot be captured by an additive or multiplicative term in the feature domain because reverberation has a dispersive effect on the speech feature sequences. Therefore, traditional acoustic modeling techniques and conventional methods to increase robustness to additive distortions provide only limited performance in reverberant environments.
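The dispersive effect described above can be illustrated with a small toy sketch. It is not the authors' analysis: the room impulse response is a synthetic, exponentially decaying noise tail, the "feature" is plain frame energy, and the names `FRAME_LEN`, `synthetic_rir`, and `frame_energies` are assumptions made for this example only. A burst confined to one frame is convolved with the impulse response, and the frame energies before and after are compared.

```python
import math
import random

FRAME_LEN = 160  # samples per frame (assumed: 10 ms at 16 kHz)


def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue  # skip silent samples, they contribute nothing
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def frame_energies(x, frame_len=FRAME_LEN):
    """Energy per non-overlapping frame, a crude stand-in for a speech feature."""
    n_frames = math.ceil(len(x) / frame_len)
    return [
        sum(s * s for s in x[k * frame_len:(k + 1) * frame_len])
        for k in range(n_frames)
    ]


def synthetic_rir(length, decay_samples, seed=0):
    """Toy impulse response: unit direct path plus a decaying noise tail."""
    rng = random.Random(seed)
    h = [rng.uniform(-1.0, 1.0) * math.exp(-n / decay_samples)
         for n in range(length)]
    h[0] = 1.0
    return h


# Clean signal: a noise burst in frame 0, followed by four silent frames.
rng = random.Random(1)
clean = [rng.uniform(-1.0, 1.0) for _ in range(FRAME_LEN)] + [0.0] * (4 * FRAME_LEN)

# Impulse response spanning four frames, so it is much longer than one frame.
rir = synthetic_rir(length=4 * FRAME_LEN, decay_samples=FRAME_LEN)
reverberant = convolve(clean, rir)

clean_e = frame_energies(clean)
reverb_e = frame_energies(reverberant)

for k, e in enumerate(reverb_e):
    c = clean_e[k] if k < len(clean_e) else 0.0
    print(f"frame {k}: clean energy {c:10.3f}   reverberant energy {e:10.3f}")
```

In this sketch the clean energy is zero in every frame after the first, while the reverberant energy is not. No per-frame gain applied to the clean energy of a silent frame can produce that nonzero value; it depends on the content of earlier frames, which is the sense in which the distortion is spread over the feature sequence.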

Based on a thorough analysis of the effect of room reverberation on speech feature sequences, this contribution gives a concise overview of the state of the art in reverberant speech recognition. The methods for achieving robustness are classified into three groups: signal dereverberation and beamforming as preprocessing, robust feature extraction, and adjustment of the acoustic models to reverberation. Finally, a novel concept called reverberation modeling for speech recognition, which combines advantages of all three classes, is described.

Author information

Authors and Affiliations

Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg, Germany
Armin Sehr & Walter Kellermann

Editor information

Editors and Affiliations

Technische Universität, Darmstadt, Germany
Eberhard Hänsler
Harman/Becker Automotive Systems, Ulm, Germany
Gerhard Schmidt

About this chapter

Cite this chapter

Sehr, A., Kellermann, W. (2008). Towards Robust Distant-Talking Automatic Speech Recognition in Reverberant Environments. In: Hänsler, E., Schmidt, G. (eds) Speech and Audio Processing in Adverse Environments. Signals and Communication Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70602-1_18

DOI: https://doi.org/10.1007/978-3-540-70602-1_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70601-4
Online ISBN: 978-3-540-70602-1
eBook Packages: Engineering, Engineering (R0)
