Morphosyntactic Disambiguation and Segmentation for Historical Polish with Graph-Based Conditional Random Fields

  • Jakub Waszczuk
  • Witold Kieraś
  • Marcin Woliński
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11107)


The paper presents a system for joint morphosyntactic disambiguation and segmentation of Polish based on conditional random fields (CRFs). The system is coupled with Morfeusz, a morphosyntactic analyzer for Polish, which represents both morphosyntactic and segmentation ambiguities in the form of a directed acyclic graph (DAG). We rely on constrained linear-chain CRFs generalized to work directly on DAGs, which allows us to perform segmentation as a by-product of morphosyntactic disambiguation. This is in contrast with other existing taggers for Polish, which either neglect the problem of segmentation or rely on heuristics to perform it in a pre-processing stage. We evaluate our system on historical corpora of Polish, where segmentation ambiguities are more prominent than in contemporary Polish, and show that our system significantly outperforms several baseline segmentation methods.
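The idea that segmentation falls out of disambiguation on a DAG can be illustrated with a small sketch. This is not the authors' implementation: it only shows a Viterbi-style search for the highest-scoring path in a DAG whose edges are candidate segments with tags, so that choosing a path fixes the segmentation and the tags at the same time. The function `best_path`, the hand-set scores, the simplified tags (`subst`, `praet`, `aglt`) and the example word are all illustrative assumptions; a real CRF would learn the scores from annotated data and use full morphosyntactic tags.

```python
from collections import defaultdict

def best_path(edges, start, end, edge_score, trans_score):
    """Return the highest-scoring start-to-end path as a list of (form, tag).

    edges: list of (src, dst, form, tag); nodes are integers with src < dst,
    so sorting edges by src yields a valid processing order.
    """
    incoming = defaultdict(list)              # node -> indices of edges ending there
    for i, (_, dst, _, _) in enumerate(edges):
        incoming[dst].append(i)

    best = {}                                 # edge index -> (score, previous edge index)
    for i in sorted(range(len(edges)), key=lambda k: edges[k][0]):
        src = edges[i][0]
        own = edge_score(edges[i])
        if src == start:
            best[i] = (own, None)
            continue
        # Extend every scored path that ends at this edge's source node.
        cands = [(best[j][0] + trans_score(edges[j], edges[i]) + own, j)
                 for j in incoming[src] if j in best]
        if cands:
            best[i] = max(cands, key=lambda c: c[0])

    finals = [i for i in incoming[end] if i in best]
    if not finals:
        return []
    i = max(finals, key=lambda k: best[k][0])
    path = []
    while i is not None:                      # follow back-pointers to the start
        path.append((edges[i][2], edges[i][3]))
        i = best[i][1]
    return path[::-1]

# Hypothetical analysis DAG for "miałem": one noun reading spanning the whole
# word, or a verb form followed by an agglutinate (two segments).
edges = [
    (0, 2, "miałem", "subst"),
    (0, 1, "miał", "praet"),
    (1, 2, "em", "aglt"),
]
unary = {"subst": 0.5, "praet": 1.0, "aglt": 0.2}   # hand-set, not learned
pairwise = {("praet", "aglt"): 1.5}                 # hand-set, not learned

result = best_path(
    edges, 0, 2,
    edge_score=lambda e: unary[e[3]],
    trans_score=lambda a, b: pairwise.get((a[3], b[3]), 0.0),
)
print(result)
```

With these scores the two-segment path wins, so the returned tag sequence also determines where the word is split. No separate segmentation step is involved.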


Keywords: Word segmentation · Morphosyntactic tagging · Historical Polish · Conditional random fields



This work was partially supported by the National Science Centre, Poland, under grant DEC-2014/15/B/HS2/03119.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Jakub Waszczuk (1)
  • Witold Kieraś (2)
  • Marcin Woliński (2)
  1. Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  2. Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
