Evalita 2011: Automatic Speech Recognition Large Vocabulary Transcription

Matassoni, Marco; Brugnara, Fabio; Gretter, Roberto

doi:10.1007/978-3-642-35828-9_30

Marco Matassoni, Fabio Brugnara & Roberto Gretter

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7689)

Abstract

In this paper we describe the design, setup and results of the speech recognition task in the framework of the Evalita campaign for the Italian language, giving details on the released corpora and the tools used for the challenge. A general discussion of approaches to large vocabulary speech recognition introduces the recognition tasks. Systems are compared for recognition accuracy on audio sequences from the Italian parliament. Although only a few systems participated in the tasks, the contest provides an overview of the state of the art in speech-to-text transcription technologies; the document reports system performance, computed as Word Error Rate (WER), showing that the current approaches provide effective results. The best system achieves a WER as low as 5.4% on the released test set.
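
For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the system output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that definition only; it is not the scoring tool used in the campaign, and the function name `word_error_rate`, the whitespace tokenization and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Illustrative only: tokens are obtained by splitting on whitespace, with no
    text normalization (case, punctuation, numbers) of the kind a real
    evaluation would apply.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the word-level edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing in output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sentences: one of five reference words is dropped.
    print(word_error_rate("il senato approva la legge", "il senato approva legge"))
```

Under this definition, the 5.4% reported for the best system corresponds to roughly 54 word errors per 1000 reference words. Because insertions are counted, WER can exceed 100% when the output is much longer than the reference.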

Author information

Authors and Affiliations

FBK-irst, via Sommarive 18, Povo, TN, 38123, Italy
Marco Matassoni, Fabio Brugnara & Roberto Gretter

Editor information

Editors and Affiliations

Fondazione Bruno Kessler, Via Sommarive 18, 38123, Povo, TN, Italy
Bernardo Magnini
University of Naples, Via Cinthia, 80126, Napoli, NA, Italy
Francesco Cutugno
Fondazione Ugo Bordoni, Viale del Policlinico, 161, Roma, Italy
Mauro Falcone
CELCT, Via alla Cascata, 38123, Povo, TN, Italy
Emanuele Pianta

About this paper

Cite this paper

Matassoni, M., Brugnara, F., Gretter, R. (2013). Evalita 2011: Automatic Speech Recognition Large Vocabulary Transcription. In: Magnini, B., Cutugno, F., Falcone, M., Pianta, E. (eds) Evaluation of Natural Language and Speech Tools for Italian. EVALITA 2012. Lecture Notes in Computer Science, vol 7689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35828-9_30

DOI: https://doi.org/10.1007/978-3-642-35828-9_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35827-2
Online ISBN: 978-3-642-35828-9
eBook Packages: Computer Science, Computer Science (R0)
