Robust Distant Speech Recognition by Combining Multiple Microphone-Array Processing with Position-Dependent CMN
We propose a robust distant speech recognition method that combines multiple microphone-array processing with position-dependent cepstral mean normalization (CMN). In the recognition stage, the system estimates the speaker position and selects the compensation parameters that were estimated a priori for that position. The system then applies CMN to the speech (i.e., position-dependent CMN) and performs speech recognition for each channel. The features obtained from the multiple channels are integrated in one of two ways. The first, called multiple-decoder processing, obtains the final result from the recognition results of the individual channels by taking either the maximum vote or the maximum summation likelihood. The second, called single-decoder processing, calculates the output probability of each input at the frame level and passes these output probabilities to a single decoder, which performs the recognition at lower computational cost. We combine delay-and-sum beamforming with either multiple-decoder or single-decoder processing and term the result multiple microphone-array processing. We evaluated the proposed method on a limited-vocabulary (100 words) distant isolated-word recognition task in a real environment. The proposed multiple microphone-array processing using multiple decoders with position-dependent CMN achieved a 3.2% improvement (a 50% relative error reduction) over delay-and-sum beamforming with conventional CMN (i.e., the conventional method). Multiple microphone-array processing using a single decoder requires about one-third of the computational time of the multiple-decoder version without degrading speech recognition performance.
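The position-dependent CMN step and the two multiple-decoder combination rules described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the position label "area_3", the dictionary layout of the per-channel scores, and all numeric values are hypothetical, and the sketch assumes that each channel's decoder returns one log-likelihood per vocabulary word.

```python
from collections import Counter


def position_dependent_cmn(cepstra, position, compensation):
    """Subtract the cepstral mean stored a priori for the estimated position.

    cepstra: list of frames, each a list of cepstral coefficients.
    compensation: hypothetical lookup table {position label: mean vector}.
    """
    mean = compensation[position]
    return [[c - m for c, m in zip(frame, mean)] for frame in cepstra]


def combine_by_vote(channel_scores):
    """Maximum vote: every channel votes for its own best-scoring word."""
    votes = Counter(max(scores, key=scores.get) for scores in channel_scores)
    return votes.most_common(1)[0][0]


def combine_by_summed_likelihood(channel_scores):
    """Maximum summation likelihood: sum log-likelihoods over channels per word."""
    totals = {}
    for scores in channel_scores:
        for word, log_likelihood in scores.items():
            totals[word] = totals.get(word, 0.0) + log_likelihood
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Made-up compensation table and two frames of two coefficients each.
    compensation = {"area_3": [1.0, 1.0]}
    frames = [[1.0, 2.0], [3.0, 4.0]]
    print(position_dependent_cmn(frames, "area_3", compensation))

    # Made-up per-channel log-likelihoods for a two-word vocabulary.
    scores = [
        {"yes": -10.0, "no": -12.0},
        {"yes": -11.0, "no": -11.5},
        {"yes": -20.0, "no": -9.0},
    ]
    print(combine_by_vote(scores))
    print(combine_by_summed_likelihood(scores))
```

With these made-up scores the two rules disagree: two of the three channels prefer "yes", so the vote selects "yes", while the summed log-likelihood favors "no" because the third channel scores it far higher. The sketch leaves out speaker localization, delay-and-sum beamforming, and the frame-level single-decoder variant.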
Keywords: Word Recognition; Speech Recognition; Recognition Performance; Multiple Channel; Lower Computational Cost
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.