Slovak Language Model from Internet Text Data

Staš, Ján; Hládek, Daniel; Pleva, Matúš; Juhár, Jozef

doi:10.1007/978-3-642-18184-9_29

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6456)

Abstract

An automatic speech recognition system is one component of a multimodal dialogue system. It requires a correct vocabulary and a suitable language model. This article describes the process of building large-vocabulary statistical models of the Slovak language, trained on text data gathered mainly from Internet sources. Several smoothing techniques were applied at different vocabulary sizes to obtain an optimal model of the Slovak language. We also employed a pruning technique based on relative entropy to reduce the size of the language model and to find the largest pruning threshold that causes minimal degradation in recognition accuracy. Tests were performed with a decoder based on the HTK Toolkit.
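
As a rough illustration of what a smoothed statistical language model involves, the following minimal sketch trains a bigram model with add-k smoothing and computes perplexity on a toy corpus. Everything in it is an assumption made for illustration: the abstract does not name the smoothing techniques that were compared, add-k is chosen here only for brevity, and the function names, the constant k = 0.5, and the two-sentence corpus are hypothetical. It is not the SRILM or HTK pipeline.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences with boundary markers."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = {"</s>"}  # words that can be predicted; <s> is only ever a context
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(tokens)
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab


def prob(word, prev, context_counts, bigram_counts, vocab, k=0.5):
    """Add-k smoothed estimate of P(word | prev) for in-vocabulary words."""
    numerator = bigram_counts[(prev, word)] + k
    denominator = context_counts[prev] + k * len(vocab)
    return numerator / denominator


def perplexity(sentences, context_counts, bigram_counts, vocab, k=0.5):
    """Per-token perplexity of the model on tokenized sentences."""
    log_sum = 0.0
    n_tokens = 0
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            log_sum += math.log(
                prob(word, prev, context_counts, bigram_counts, vocab, k)
            )
            n_tokens += 1
    return math.exp(-log_sum / n_tokens)


# Hypothetical toy corpus; a real model would be trained on a large text collection.
corpus = [["dobrý", "deň"], ["dobrý", "večer"]]
context_counts, bigram_counts, vocab = train_bigram(corpus)
ppl = perplexity(corpus, context_counts, bigram_counts, vocab)
```

Smoothing reserves probability mass for bigrams that never occurred in training, so unseen word pairs receive a small nonzero probability. Words outside the training vocabulary are not handled by this sketch; a practical system would map them to an unknown-word token first.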

Author information

Authors and Affiliations

Laboratory of Advanced Speech Technologies, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Letná 9/A, Košice, Slovakia
Ján Staš, Daniel Hládek, Matúš Pleva & Jozef Juhár

Editor information

Editors and Affiliations

Institute for Advanced Scientific Studies, Second University of Naples, and IIASS, Via Pellegrino 19, 84019, Vietri sul Mare (SA), Italy
Anna Esposito
Istituto Nazionale di Geofisica e Vulcanologia, Osservatorio Vesuviano, Via Diocleziano 328, 80124, Napoli, Italy
Antonietta M. Esposito
Dipartimento di Ingegneria dell'Informazione, Seconda Università di Napoli, Via Roma 29, 81031, Aversa (CE), Italy
Raffaele Martone
Department of Humanities and Social Sciences, Anatolia College/ACT, Kennedy Street, 55510, Pylaia, Greece
Vincent C. Müller
Department of Physics "E.R. Caianiello", University of Salerno and IIASS, International Institute for Advanced Scientific Studies, 84081, Baronissi (SA), Italy
Gaetano Scarpetta

About this chapter

Cite this chapter

Staš, J., Hládek, D., Pleva, M., Juhár, J. (2011). Slovak Language Model from Internet Text Data. In: Esposito, A., Esposito, A.M., Martone, R., Müller, V.C., Scarpetta, G. (eds) Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces. Theoretical and Practical Issues. Lecture Notes in Computer Science, vol 6456. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-18184-9_29

DOI: https://doi.org/10.1007/978-3-642-18184-9_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-18183-2
Online ISBN: 978-3-642-18184-9
eBook Packages: Computer Science, Computer Science (R0)
