A Comparative Evaluation of Statistical Part-of-Speech Taggers for Russian

  • Rinat Gareev
  • Vladimir Ivanov
Chapter
Part of the Communications in Computer and Information Science book series (CCIS, volume 505)

Abstract

Part-of-speech (POS) tagging is an essential step in many text processing applications. Quite a few works address this task for Russian, but their results are not directly comparable because they do not share datasets and tools. We propose a POS tagging evaluation framework for Russian that is built from existing third-party resources available to researchers. We applied the framework to compare several implementations of statistical classifiers: HunPos, the Stanford POS tagger, the OpenNLP implementation of the MaxEnt Markov Model, and our own re-implementation of Tiered Conditional Random Fields. The best tagger, trained on a corpus of fewer than one million words, achieved an accuracy above 93%. We expect that the evaluation framework will facilitate future studies and improvements in POS tagging for Russian.

Notes

Acknowledgments

This work was financially supported by the Russian Science Foundation (grant 15-11-10019).

Copyright information

© Springer International Publishing Switzerland 2015

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  1. Kazan Federal University, Kazan, Russia
