
The 2005 AMI System for the Transcription of Speech in Meetings

  • Thomas Hain
  • Lukas Burget
  • John Dines
  • Giulia Garau
  • Martin Karafiat
  • Mike Lincoln
  • Iain McCowan
  • Darren Moore
  • Vincent Wan
  • Roeland Ordelman
  • Steve Renals
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3869)

Abstract

In this paper we describe the 2005 AMI system for the transcription of speech in meetings, as used in the 2005 NIST RT evaluations. The system was designed for participation in the speech-to-text part of the evaluations, in particular for transcription of speech recorded with multiple distant microphones and independent headset microphones. System performance was tested on both conference-room and lecture-style meetings. Although input sources are processed using different front-ends, the recognition process is based on a unified system architecture. The system operates in multiple passes and makes use of state-of-the-art technologies such as discriminative training, vocal tract length normalisation, heteroscedastic linear discriminant analysis, speaker adaptation with maximum likelihood linear regression, and minimum word error rate decoding. We report system performance on the official development and test sets for the NIST RT05s evaluations. The system was jointly developed in less than 10 months by a multi-site team and was shown to achieve competitive performance.
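
For readers unfamiliar with the metric behind "minimum word error rate decoding", the following is a minimal sketch of how word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recogniser hypothesis, divided by the number of reference words. The function name and the whitespace tokenisation are assumptions made for illustration; this is not the scoring tool used in the NIST evaluations, which also applies text normalisation that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokens are split on whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.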

Keywords

Warp Factor; Acoustic Model; Word Error Rate; Discriminative Training; Maximum Likelihood Linear Regression

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Thomas Hain (1)
  • Lukas Burget (2)
  • John Dines (3)
  • Giulia Garau (4)
  • Martin Karafiat (2)
  • Mike Lincoln (4)
  • Iain McCowan (3)
  • Darren Moore (3)
  • Vincent Wan (1)
  • Roeland Ordelman (5)
  • Steve Renals (4)

  1. Department of Computer Science, University of Sheffield, Sheffield, UK
  2. Faculty of Information Engineering, Brno University of Technology, Brno, Czech Republic
  3. IDIAP Research Institute, Martigny, Switzerland
  4. Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK
  5. Department of Electrical Engineering, University of Twente, Enschede, The Netherlands
