
The Development of the AMI System for the Transcription of Speech in Meetings

  • Thomas Hain
  • Lukas Burget
  • John Dines
  • Iain McCowan
  • Giulia Garau
  • Martin Karafiat
  • Mike Lincoln
  • Darren Moore
  • Vincent Wan
  • Roeland Ordelman
  • Steve Renals
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3869)

Abstract

The automatic processing of speech collected in conference-style meetings has attracted considerable interest, with several large-scale projects devoted to this area. This paper describes the development of a baseline automatic speech transcription system for meetings in the context of the AMI (Augmented Multiparty Interaction) project. We present several techniques important to the processing of this data and report performance in terms of word error rates (WERs). An important aspect of transcribing this data is the flexibility required in audio pre-processing: real-world systems have to deal with varied input, for example from microphone arrays or from microphones placed randomly in a room. Automatic segmentation and microphone array processing techniques are described and their effect on WERs is discussed. The system and its components presented in this paper yield competitive performance and form a baseline for future research in this domain.
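Since the abstract reports results as word error rates, a minimal sketch of how a WER can be computed may help readers unfamiliar with the metric. It is an illustration only, not the scoring procedure used in the paper: the function name `word_error_rate` is hypothetical, words are assumed to be separated by whitespace, and no text normalisation is applied. WER is taken here as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are whitespace-separated and compared exactly,
    with no case folding or other normalisation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted as errors, a WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.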

Keywords

Meeting Room; Word Error Rate; Microphone Array; Broadcast News; Maximum Likelihood Linear Regression
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Thomas Hain (1)
  • Lukas Burget (2)
  • John Dines (3)
  • Iain McCowan (3)
  • Giulia Garau (4)
  • Martin Karafiat (2)
  • Mike Lincoln (4)
  • Darren Moore (3)
  • Vincent Wan (1)
  • Roeland Ordelman (5)
  • Steve Renals (4)

  1. Department of Computer Science, University of Sheffield, Sheffield, UK
  2. Faculty of Information Engineering, Brno University of Technology, Brno, Czech Republic
  3. IDIAP, Martigny, Switzerland
  4. Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK
  5. Department of Electrical Engineering, University of Twente, Enschede, The Netherlands
