
Microphone Array Driven Speech Recognition: Influence of Localization on the Word Error Rate

  • Matthias Wölfel
  • Kai Nickel
  • John McDonough
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3869)

Abstract

Interest within the automatic speech recognition (ASR) research community has recently focused on recognizing speech captured with one or more microphones located in the far field, rather than with a microphone mounted on a headset and positioned next to the speaker’s mouth. Far-field ASR is a natural application for beamforming techniques using an array of microphones. A prerequisite for applying such techniques, however, is a reliable means of speaker localization. In this work, we compare the accuracy of source localization systems based on audio features only, on video features only, and on a combination of audio and video features, using speech data collected during seminars held by actual speakers. We also investigate the influence of source localization accuracy on the word error rate (WER) of a far-field ASR system, comparing the WERs obtained with position estimates from several automatic source localizers with those obtained from true speaker positions. Our results reveal that accurate speaker localization is crucial for minimizing the error rate of a far-field ASR system.
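
For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition using a word-level edit distance. It is a generic illustration, not the scoring tool used by the authors, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Illustrative sketch only; whitespace tokenization is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.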

Keywords

  • Automatic Speech Recognition
  • Acoustic Model
  • Word Error Rate
  • Microphone Array
  • Audio Feature
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Matthias Wölfel
    • 1
  • Kai Nickel
    • 1
  • John McDonough
    • 1
  1. Institut für Theoretische Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany
