Diacritics Restoration in the Slovak Texts Using Hidden Markov Model

  • Daniel Hládek
  • Ján Staš
  • Jozef Juhár
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9561)

Abstract

This paper presents a fast and accurate method for recovering diacritical marks and guessing the original meaning of a word from its context, based on a hidden Markov model and the Viterbi algorithm. The proposed algorithm could be used in any area where erroneous text may appear, such as web search engines, e-mail messages, office suites, optical character recognition, or typing assistance on small mobile device keyboards.
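
As a rough illustration of the general idea only, and not the authors' implementation, the following minimal sketch treats the words with diacritics as hidden states and the stripped words as observations, then uses the Viterbi algorithm to pick the most probable sequence. The function names, the tiny vocabulary, and the bigram probabilities are all made up for the example; an actual system would estimate them from a large training corpus.

```python
import math
import unicodedata


def strip_diacritics(word):
    """Remove combining marks, e.g. 'súd' -> 'sud'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def viterbi_restore(tokens, candidates, bigram_logp, floor=-10.0):
    """Return the most probable sequence of words with diacritics.

    tokens      -- observed words without diacritics
    candidates  -- dict: stripped form -> list of possible original forms
    bigram_logp -- dict: (previous word, word) -> log probability,
                   with "<s>" marking the sentence start
    floor       -- log probability assumed for unseen bigrams (an assumption)
    """
    # best[w] = (score of the best path ending in w, that path)
    best = {"<s>": (0.0, [])}
    for tok in tokens:
        # Unknown tokens pass through unchanged.
        options = candidates.get(tok, [tok])
        new_best = {}
        for w in options:
            # Emission probability is 1 for every listed candidate,
            # so only the transition (bigram) score matters here.
            score, path = max(
                ((s + bigram_logp.get((prev, w), floor), p)
                 for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            new_best[w] = (score, path + [w])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]


# Hypothetical toy data: 'sud' (barrel) vs. 'súd' (court),
# 'vina' (guilt) vs. 'vína' (of wine).
VOCAB = ["krajský", "súd", "sud", "plný", "vína", "vina"]
CANDIDATES = {}
for form in VOCAB:
    CANDIDATES.setdefault(strip_diacritics(form), []).append(form)

BIGRAM_LOGP = {
    ("<s>", "krajský"): math.log(0.5),
    ("<s>", "plný"): math.log(0.5),
    ("krajský", "súd"): math.log(0.9),
    ("krajský", "sud"): math.log(0.1),
    ("plný", "sud"): math.log(0.8),
    ("plný", "súd"): math.log(0.2),
    ("sud", "vína"): math.log(0.9),
    ("sud", "vina"): math.log(0.1),
    ("súd", "vina"): math.log(0.5),
    ("súd", "vína"): math.log(0.1),
}

print(viterbi_restore(["plny", "sud", "vina"], CANDIDATES, BIGRAM_LOGP))
print(viterbi_restore(["krajsky", "sud"], CANDIDATES, BIGRAM_LOGP))
```

In this toy setup the same stripped token "sud" is resolved differently depending on the preceding word, which is the kind of context-based disambiguation the abstract describes. A bigram model is used here purely for brevity.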

Keywords

Hidden Markov Model, Language Model, Viterbi Algorithm, Training Corpus, Automatic Speech Recognition System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgement

The research presented in this paper was supported by the Ministry of Education, Science, Research and Sport of the Slovak Republic under the research project VEGA 1/0386/12 (50 %) and by the Research and Development Operational Program funded by the ERDF under the project ITMS-26220220141 (50 %).

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Electronics and Multimedia Communications, FEI, Technical University of Košice, Košice, Slovakia
