Further Progress in Meeting Recognition: The ICSI-SRI Spring 2005 Speech-to-Text Evaluation System

Stolcke, Andreas; Anguera, Xavier; Boakye, Kofi; Çetin, Özgür; Grézl, František; Janin, Adam; Mandal, Arindam; Peskin, Barbara; Wooters, Chuck; Zheng, Jing

doi:10.1007/11677482_39

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3869)

Included in the following conference series:

International Workshop on Machine Learning for Multimodal Interaction

Abstract

We describe the development of our speech recognition system for the National Institute of Standards and Technology (NIST) Spring 2005 Meeting Rich Transcription (RT-05S) evaluation, highlighting improvements made since last year [1]. The system is based on the SRI-ICSI-UW RT-04F conversational telephone speech (CTS) recognition system, with meeting-adapted models and various audio preprocessing steps. This year’s system features better delay-sum processing of distant microphone channels and energy-based crosstalk suppression for close-talking microphones. Acoustic modeling is improved by virtue of various enhancements to the background (CTS) models, including added training data, decision-tree based state tying, and the inclusion of discriminatively trained phone posterior features estimated by multilayer perceptrons. In particular, we make use of adaptation of both acoustic models and MLP features to the meeting domain. For distant microphone recognition we obtained considerable gains by combining and cross-adapting narrow-band (telephone) acoustic models with broadband (broadcast news) models. Language models (LMs) were improved with the inclusion of new meeting and web data. In spite of a lack of training data, we created effective LMs for the CHIL lecture domain. Results are reported on RT-04S and RT-05S meeting data. Measured on RT-04S conference data, we achieved an overall improvement of 17% relative in both MDM and IHM conditions compared to last year’s evaluation system. Results on lecture data are comparable to the best reported results for that task.
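The abstract names delay-sum processing of distant microphone channels but does not describe how it is implemented. As a rough illustration only, the generic textbook form of delay-and-sum can be sketched as follows: estimate each channel's time offset against a reference channel, undo the offset, and average. Everything here is an assumption for illustration and not the authors' method: the function names `estimate_delay` and `delay_and_sum`, the brute-force cross-correlation search, the restriction to whole-sample delays, and the choice of channel 0 as the reference.

```python
from typing import List, Sequence


def estimate_delay(ref: Sequence[float], sig: Sequence[float], max_lag: int) -> int:
    """Return the lag d (in samples) that maximizes sum_t ref[t] * sig[t + d].

    Hypothetical helper: a brute-force cross-correlation search over
    integer lags in [-max_lag, max_lag].
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t in range(len(ref)):
            u = t + lag
            if 0 <= u < len(sig):
                score += ref[t] * sig[u]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_and_sum(channels: List[Sequence[float]], max_lag: int = 8) -> List[float]:
    """Align every channel to channel 0 and average the aligned samples."""
    ref = channels[0]
    # The reference channel is aligned with itself by definition.
    delays = [0] + [estimate_delay(ref, ch, max_lag) for ch in channels[1:]]
    out = []
    for t in range(len(ref)):
        total, count = 0.0, 0
        for ch, d in zip(channels, delays):
            u = t + d
            if 0 <= u < len(ch):  # skip samples shifted past the channel's end
                total += ch[u]
                count += 1
        out.append(total / count)
    return out


if __name__ == "__main__":
    near = [0, 0, 0, 1, 2, 1, 0, 0, 0, 0]
    far = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]  # same pulse arriving 2 samples later
    print(estimate_delay(near, far, 4))
    print(delay_and_sum([near, far], max_lag=4))
```

Averaging time-aligned channels reinforces the common speech signal while uncorrelated noise partially cancels, which is the usual motivation for this kind of preprocessing. A practical system would typically work on short windows, use sub-sample delays, and weight the channels.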

References

Stolcke, A., Wooters, C., Mirghafori, N., Pirinen, T., Bulyko, I., Gelbart, D., Graciarena, M., Otterson, S., Peskin, B., Ostendorf, M.: Progress in meeting recognition: The ICSI-SRI-UW Spring 2004 evaluation system. In: Proceedings NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, National Institute of Standards and Technology (2004)
Adami, A., Burget, L., Dupont, S., Garudadri, H., Grezl, F., Hermansky, H., Jain, P., Kajarekar, S., Morgan, N., Sivadas, S.: Qualcomm-ICSI-OGI features for ASR. In: Hansen, J.H.L., Pellom, B. (eds.) Proc. ICSLP, Denver, vol. 1, pp. 4–7 (2002)
Anguera, X., Wooters, C., Peskin, B., Aguiló, M.: Robust speaker segmentation for meetings: The ICSI-SRI Spring 2005 diarization system. In: Proceedings of the Rich Transcription 2005 Spring Meeting Recognition Evaluation, Edinburgh, National Institute of Standards and Technology, pp. 26–38 (2005)
Flanagan, J.L., Johnston, J.D., Zahn, R., Elko, G.W.: Computer-steered microphone arrays for sound transduction in large rooms. J. Acoust. Soc. Am. 78, 1508–1518 (1985)
Vergyri, D., Stolcke, A., Gadde, V.R.R., Ferrer, L., Shriberg, E.: Prosodic knowledge sources for automatic speech recognition. In: Proc. ICASSP, Hong Kong, vol. 1, pp. 208–211 (2003)
Povey, D., Woodland, P.C.: Minimum phone error and I-smoothing for improved discriminative training. In: Proc. ICASSP, Orlando, FL, vol. 1, pp. 105–108 (2002)
Graciarena, M., Franco, H., Zheng, J., Vergyri, D., Stolcke, A.: Voicing feature integration in SRI’s Decipher LVCSR system. In: Proc. ICASSP, Montreal, vol. 1, pp. 921–924 (2004)
Kumar, N.: Investigation of Silicon-Auditory Models and Generalization of Linear Discriminant Analysis for Improved Speech Recognition. PhD thesis, Johns Hopkins University, Baltimore (1997)
Morgan, N., Chen, B.Y., Zhu, Q., Stolcke, A.: TRAPping conversational speech: Extending TRAP/Tandem approaches to conversational telephone speech recognition. In: Proc. ICASSP, Montreal, vol. 1, pp. 536–539 (2004)
Zhu, Q., Stolcke, A., Chen, B.Y., Morgan, N.: Using MLP features in SRI’s conversational speech recognition system. In: Proc. Interspeech, Lisbon, pp. 2141–2144 (2005)
Jin, H., Matsoukas, S., Schwartz, R., Kubala, F.: Fast robust inverse transform SAT and multistage adaptation. In: Proceedings DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, pp. 105–109. Morgan Kaufmann, San Francisco (1998)
Metze, F., Fügen, C., Pan, Y., Waibel, A.: Automatically transcribing meetings using distant microphones. In: Proc. ICASSP, Philadelphia, vol. 1, pp. 989–992 (2005)
Povey, D., Gales, M.J.F., Kim, D.Y., Woodland, P.C.: MMI-MAP and MPE-MAP for acoustic model adaptation. In: Proc. EUROSPEECH, Geneva, pp. 1981–1984 (2003)
Bulyko, I., Ostendorf, M., Stolcke, A.: Getting more mileage from web text sources for conversational speech language modeling using class-dependent mixtures. In: Hearst, M., Ostendorf, M. (eds.) Proc. HLT-NAACL, Edmonton, Alberta, Canada. Association for Computational Linguistics, vol. 2, pp. 7–9 (2003)
Lamel, L., Adda, G., Bilinski, E., Gauvain, J.L.: Transcribing lectures and seminars. In: Proc. Interspeech, Lisbon (2005)
Çetin, Ö., Stolcke, A.: Language modeling in the ICSI-SRI Spring 2005 meeting speech recognition evaluation system. Technical Report TR-05-06, International Computer Science Institute (2005)


Author information

Authors and Affiliations

International Computer Science Institute, Berkeley, CA, USA
Andreas Stolcke, Xavier Anguera, Kofi Boakye, Özgür Çetin, František Grézl, Adam Janin, Barbara Peskin & Chuck Wooters
SRI International, Menlo Park, CA, USA
Andreas Stolcke & Jing Zheng
Technical University of Catalonia, Barcelona, Spain
Xavier Anguera
Brno University of Technology, Czech Republic
František Grézl
University of Washington, Seattle, WA, USA
Arindam Mandal

Editor information

Editors and Affiliations

University of Edinburgh, Edinburgh, Scotland
Steve Renals
IDIAP Research Institute, Martigny, Switzerland
Samy Bengio

About this paper

Cite this paper

Stolcke, A. et al. (2006). Further Progress in Meeting Recognition: The ICSI-SRI Spring 2005 Speech-to-Text Evaluation System. In: Renals, S., Bengio, S. (eds) Machine Learning for Multimodal Interaction. MLMI 2005. Lecture Notes in Computer Science, vol 3869. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11677482_39

DOI: https://doi.org/10.1007/11677482_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32549-9
Online ISBN: 978-3-540-32550-5
eBook Packages: Computer Science, Computer Science (R0)
