Slovak Language Model from Internet Text Data

  • Ján Staš
  • Daniel Hládek
  • Matúš Pleva
  • Jozef Juhár
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6456)


An automatic speech recognition system is one component of a multimodal dialogue system. To serve this purpose, it needs a correct vocabulary and a suitable language model. This article describes the process of building large-vocabulary statistical models of the Slovak language, trained on text data gathered mainly from Internet sources. Several smoothing techniques were applied across different vocabulary sizes in order to obtain an optimal model of the Slovak language. We also employed a pruning technique based on relative entropy to reduce the size of the language model, looking for the maximum pruning threshold that causes minimal degradation in recognition accuracy. Tests were performed with a decoder based on the HTK Toolkit.
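To make the idea of a smoothed n-gram model concrete, the following is a minimal sketch of a bigram model with additive (add-k) smoothing and a perplexity measure. It is an illustration only: the abstract does not name the smoothing methods that were compared, so add-k is swapped in here as the simplest possible example, and the toy corpus and all function names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their histories; sentences are padded with <s> and </s>."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])  # words that can be predicted, including </s>
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts, vocab

def bigram_prob(prev, word, history_counts, bigram_counts, vocab, k=1.0):
    """Add-k estimate of P(word | prev); unseen bigrams receive non-zero mass."""
    numerator = bigram_counts[(prev, word)] + k
    denominator = history_counts[prev] + k * len(vocab)
    return numerator / denominator

def perplexity(sentences, history_counts, bigram_counts, vocab, k=1.0):
    """Perplexity of the model on the given sentences (lower is better)."""
    log_sum = 0.0
    n_predictions = 0
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            p = bigram_prob(prev, word, history_counts, bigram_counts, vocab, k)
            log_sum += math.log(p)
            n_predictions += 1
    return math.exp(-log_sum / n_predictions)

# Hypothetical toy corpus, assumed only for this illustration.
corpus = ["dobrý deň", "dobrý večer", "pekný deň"]
history_counts, bigram_counts, vocab = train_bigram(corpus)

print(bigram_prob("dobrý", "deň", history_counts, bigram_counts, vocab))
print(perplexity(["pekný večer"], history_counts, bigram_counts, vocab))
```

The test sentence contains a bigram that never occurs in the toy corpus, yet it still gets a finite perplexity, which is the point of smoothing. Words outside the training vocabulary are not handled properly in this sketch; a real system would map them to an unknown-word token.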


Language model, n-grams, speech recognition, spellchecking, text normalization, vocabulary





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Ján Staš
    • 1
  • Daniel Hládek
    • 1
  • Matúš Pleva
    • 1
  • Jozef Juhár
    • 1
  1. Laboratory of Advanced Speech Technologies, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Košice, Slovakia
