The 2005 AMI System for the Transcription of Speech in Meetings

Hain, Thomas; Burget, Lukas; Dines, John; Garau, Giulia; Karafiat, Martin; Lincoln, Mike; McCowan, Iain; Moore, Darren; Wan, Vincent; Ordelman, Roeland; Renals, Steve

doi:10.1007/11677482_38

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3869)

Included in the following conference series:

International Workshop on Machine Learning for Multimodal Interaction

Abstract

This paper describes the 2005 AMI system for the transcription of speech in meetings, as used in the 2005 NIST RT evaluations. The system was designed for participation in the speech-to-text part of the evaluations, in particular for the transcription of speech recorded with multiple distant microphones and with independent headset microphones. System performance was tested on both conference room and lecture-style meetings. Although the input sources are processed using different front-ends, the recognition process is based on a unified system architecture. The system operates in multiple passes and makes use of state-of-the-art technologies such as discriminative training, vocal tract length normalisation, heteroscedastic linear discriminant analysis, speaker adaptation with maximum likelihood linear regression, and minimum word error rate decoding. We report system performance on the official development and test sets for the NIST RT05s evaluations. The system was jointly developed in less than 10 months by a multi-site team and was shown to achieve competitive performance.
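
The abstract refers to minimum word error rate decoding. As a point of reference only, word error rate is conventionally computed as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below illustrates that metric under simple assumptions (whitespace tokenisation, no text normalisation); the function name `word_error_rate` is hypothetical, and this is neither the NIST scoring procedure nor code from the AMI system.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes whitespace tokenisation and exact string matching of words;
    real evaluations apply text normalisation that is not modelled here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.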

Author information

Authors and Affiliations

Department of Computer Science, University of Sheffield, Sheffield, S1 4DP, UK
Thomas Hain & Vincent Wan
Faculty of Information Engineering, Brno University of Technology, Brno, 612 66, Czech Republic
Lukas Burget & Martin Karafiat
IDIAP Research Institute, CH-1920, Martigny, Switzerland
John Dines, Iain McCowan & Darren Moore
Centre for Speech Technology Research, University of Edinburgh, Edinburgh, EH8 9LW, UK
Giulia Garau, Mike Lincoln & Steve Renals
Department of Electrical Engineering, University of Twente, 7500AE, Enschede, The Netherlands
Roeland Ordelman

Editor information

Editors and Affiliations

University of Edinburgh, Edinburgh, Scotland
Steve Renals
IDIAP Research Institute, Martigny, Switzerland
Samy Bengio

Cite this paper

Hain, T. et al. (2006). The 2005 AMI System for the Transcription of Speech in Meetings. In: Renals, S., Bengio, S. (eds) Machine Learning for Multimodal Interaction. MLMI 2005. Lecture Notes in Computer Science, vol 3869. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11677482_38

DOI: https://doi.org/10.1007/11677482_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32549-9
Online ISBN: 978-3-540-32550-5
eBook Packages: Computer Science, Computer Science (R0)
