Skip to main content

Evalita 2011: Automatic Speech Recognition Large Vocabulary Transcription

  • Conference paper
  • 669 Accesses

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 7689))

Abstract

In this paper we describe design, setup and results of the speech recognition task in the framework of the Evalita campaign for the Italian language, giving details on the released corpora and tools used for the challenge. A general discussion about approaches to large vocabulary speech recognition introduces the recognition tasks. Systems are compared for recognition accuracy on audio sequences of Italian parliament. Although only a few systems have participated to the tasks, the contest provides an overview of the state-of-the-art of speech-to-text transcription technologies; the document reports systems performance, computed as Word Error Rate (WER), showing that the current approaches provide effective results. The best system achieves a WER as low as 5.4% on the released testset.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Davis, K.H., Biddulph, R., Balashek, S.: Automatic recognition of spoken digits. J. Acoust. Soc. Amer. 24(6), 627–642 (1952)

    Article  Google Scholar 

  2. Baker, J., Deng, L., Glass, J., Khudanpur, S., Lee, C.-H., Morgan, N., O’Shaughnessy, D.: Developments and directions in speech recognition and understanding, Part 1 [DSP Education]. IEEE Signal Processing Magazine 26(3), 75–80 (2009)

    Article  Google Scholar 

  3. Povey, D.: Discriminative training for large vocabulary speech recognition. Ph.D. thesis, Cambridge University, Cambridge (2004)

    Google Scholar 

  4. Sha, F.: Large margin training of acoustic models for speech recognition. Ph.D. thesis, University of Pennsylvania, Philadelphia (2007)

    Google Scholar 

  5. Schwenk, H.: Continuous space language models. Computer Speech and Language 21(3), 492–518 (2007)

    Article  Google Scholar 

  6. Mohamed, A.R., Dahl, G.E., Hinton, G.E.: Deep belief networks for phone recognition. In: NIPS 22 Workshop on Deep Learning for Speech Recognition (2009)

    Google Scholar 

  7. Davis, S.B., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust., Speech, and Signal Processing 28(4), 357–366 (1980)

    Article  Google Scholar 

  8. Chiu, Y.-H. , Raj, B. , Stern, R.: Learning based auditory encoding for robust speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 428–431 (2010)

    Google Scholar 

  9. Cohen, J., Kamm, T., Andreou, A.: Vocal tract normalization in speech recognition: compensation for system systematic speaker variability. J. Acoust. Soc. Amer. 97(5), pt. 2, 3246–3247 (1995)

    Article  Google Scholar 

  10. Kumar, N., Andreou, A.G.: Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. In: Speech Communication, pp. 283–297 (1998)

    Google Scholar 

  11. Bilmes, J.: A Gentle Tutorial of the EM algorithm and its application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report TR-97-021, International Computer Science Institute (1997)

    Google Scholar 

  12. Yu, D., Deng, L.: Large-Margin Discriminative Training of Hidden Markov Models for Speech Recognition. In: Proceedings of the International Conference on Semantic Computing, pp. 429–438. IEEE Computer Society, Washington, DC (2007)

    Google Scholar 

  13. Gauvain, J.-L., Lee, C.-H.: Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing 2(2), 291–298 (1994)

    Article  Google Scholar 

  14. Leggetter, C.J., Woodland, P.C.: Maximum likelihood linear regression for speaker adaptation of continuous density HMMs. Speech Communication 9, 171–186 (1995)

    Google Scholar 

  15. Fiscus, J.G.: A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER). In: 1997 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 347–354 (1997)

    Google Scholar 

  16. Hoffmeister, B., Hillard, D., Hahn, S., Schluter, R., Ostendorf, M., Ney, H.: Cross-Site and Intra-Site ASR System Combination: Comparisons on Lattice and 1-Best Methods.XS. In: IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4, pp. 1145–1148 (2007)

    Google Scholar 

  17. Hermansky, H., Ellis, D.P.W., Sharma, S.: Tandem connectionist feature extraction for conventional HMM systems. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1635–1638 (2000)

    Google Scholar 

  18. Pinto, J.P.: Multilayer Perceptron Based Hierarchical Acoustic Modeling for Automatic Speech Recognition. PhD thesis, EPFL Switzerland (2010)

    Google Scholar 

  19. Schwarz, P., Matejka, P., Cernocky, J.: Hierarchical Structures of Neural Networks for Phoneme Recognition. In: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1(I), pp. 14–19 (2006)

    Google Scholar 

  20. Zweig, G., Nguyen, P.: A segmental CRF approach to large vocabulary continuous speech recognition. In: IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 152–157 (2009)

    Google Scholar 

  21. Dahl, G.E., Yu, D., Deng, L., Acero, A.: Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing 20(1), 30–42 (2012)

    Article  Google Scholar 

  22. Katz, S.: Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing 35(3), 400–401 (1987)

    Article  Google Scholar 

  23. Rosenfeld, R.: Two decades of statistical language modeling: where do we go from here? Proceedings of the IEEE 88(8), 1270–1278 (2000)

    Article  Google Scholar 

  24. Schwenk, H.: Trends and challenges in language modeling for speech recognition and machine translation. In: IEEE Workshop on Automatic Speech Recognition and Understanding, Merano (2009)

    Google Scholar 

  25. The History of Automatic Speech Recognition Evaluations at NIST, http://www.itl.nist.gov/iad/mig/publications/ASRhistory/index.html

  26. Lamel, L., Gauvain, J.L., Adda, G., Barras, C., Bilinksi, E., Galibert, O., Pujol, A., Schwenk, H., Xuan, Z.: The LIMSI 2006 TC-STAR EPPS Transcription Systems. In: IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4, pp. 997–1000 (2007)

    Google Scholar 

  27. SAMPA - computer readable phonetic alphabet, http://www.phon.ucl.ac.uk/home/sampa/

  28. Gretter, R., Peirone, G.: A Morphological Analyzer for the Italian Language. Istituto per la Ricerca Scientifica e Tecnologica, Tech. Rep. - Ref. No. 9108-01, Italy (December 12, 1991)

    Google Scholar 

  29. NIST: Speech recognition scoring toolkit, http://www.itl.nist.gov/iad/mig/tools/

  30. Ronny, R., Shakoor, A., Brugnara, F., Gretter, R.: The FBK ASR system for Evalita 2011. In: Working Notes of EVALITA 2011, Rome, Italy (January 24-25, 2012)

    Google Scholar 

  31. Despres, J., Lamel, L., Gauvain, J.-L., Vieru, B., Woehrling, C., Bac Le, V., Oparin, I.: The Vocapia Research ASR Systems for Evalita 2011. In: Working Notes of EVALITA 2011, Rome, Italy (January 24-25, 2012)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Matassoni, M., Brugnara, F., Gretter, R. (2013). Evalita 2011: Automatic Speech Recognition Large Vocabulary Transcription. In: Magnini, B., Cutugno, F., Falcone, M., Pianta, E. (eds) Evaluation of Natural Language and Speech Tools for Italian. EVALITA 2012. Lecture Notes in Computer Science(), vol 7689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35828-9_30

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-35828-9_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35827-2

  • Online ISBN: 978-3-642-35828-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics