The IBM Rich Transcription Spring 2006 Speech-to-Text System for Lecture Meetings

Huang, Jing; Westphal, Martin; Chen, Stanley; Siohan, Olivier; Povey, Daniel; Libal, Vit; Soneiro, Alvaro; Schulz, Henrik; Ross, Thomas; Potamianos, Gerasimos

doi:10.1007/11965152_38

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4299)

Included in the following conference series:

International Workshop on Machine Learning for Multimodal Interaction

Abstract

We describe the IBM systems submitted to the NIST RT06s Speech-to-Text (STT) evaluation campaign on the CHIL lecture meeting data for three conditions: multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone (IHM). The system building process is similar to that of the IBM conversational telephone speech recognition system. However, the best models for the far-field conditions (SDM and MDM) proved to be the ones that use neither variance normalization nor vocal tract length normalization. Instead, feature-space minimum phone error discriminative training yielded the best results. Due to the relatively small amount of CHIL-domain data, the acoustic models of our systems are built on publicly available meeting corpora, with maximum a posteriori adaptation applied twice on CHIL data during training: first to the initial speaker-independent model, and subsequently to the minimum phone error model. For language modeling, we utilized meeting transcripts, text from scientific conference proceedings, and spontaneous telephone conversations. On development data, chosen in our work to be the 2005 CHIL-internal STT evaluation test set, the resulting language model provided a 4% absolute improvement in word error rate (WER) compared to the model used in last year's CHIL evaluation. Furthermore, the developed STT system significantly outperformed last year's results, reducing close-talking microphone WER from 36.9% to 25.4% on our development set. In the NIST RT06s evaluation campaign, both the MDM and SDM systems scored well; however, the IHM system did poorly due to unsuccessful cross-talk removal.
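
The abstract reports results as word error rate and quotes absolute differences between systems. As a reading aid, the following is a minimal Python sketch of how WER is commonly defined (word-level edit distance divided by the number of reference words) and of how an absolute reduction relates to a relative one. The function names `word_error_rate` and `wer_reduction` are hypothetical, and the sketch omits the text normalization and alignment rules of the official NIST scoring tools, so it is not the scoring procedure used in the evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Simplified illustration: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(baseline_wer: float, improved_wer: float) -> tuple[float, float]:
    """Return (absolute, relative) WER reduction; inputs in percent."""
    absolute = baseline_wer - improved_wer
    relative = absolute / baseline_wer
    return absolute, relative


if __name__ == "__main__":
    # Close-talking development-set figures quoted in the abstract.
    absolute, relative = wer_reduction(36.9, 25.4)
    print(f"absolute: {absolute:.1f} points, relative: {relative:.1%}")
```

Applied to the figures in the abstract, the drop from 36.9% to 25.4% is 11.5 points absolute, which is roughly 31% relative to the earlier result.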

Author information

Authors and Affiliations

IBM Thomas J. Watson Research Center, Yorktown Heights, NY, 10598, U.S.A.
Jing Huang, Martin Westphal, Stanley Chen, Olivier Siohan, Daniel Povey, Vit Libal, Alvaro Soneiro, Henrik Schulz, Thomas Ross & Gerasimos Potamianos

Editor information

Editors and Affiliations

University of Edinburgh, Edinburgh, Scotland
Steve Renals
IDIAP Research Institute, Martigny, Switzerland
Samy Bengio
National Institute Of Standards and Technology, 100 Bureau Drive Stop 8940, Gaithersburg, MD, 20899
Jonathan G. Fiscus

About this paper

Cite this paper

Huang, J. et al. (2006). The IBM Rich Transcription Spring 2006 Speech-to-Text System for Lecture Meetings. In: Renals, S., Bengio, S., Fiscus, J.G. (eds) Machine Learning for Multimodal Interaction. MLMI 2006. Lecture Notes in Computer Science, vol 4299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965152_38

DOI: https://doi.org/10.1007/11965152_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69267-6
Online ISBN: 978-3-540-69268-3
eBook Packages: Computer Science, Computer Science (R0)
