Morphosyntactic Disambiguation and Segmentation for Historical Polish with Graph-Based Conditional Random Fields

  • Conference paper
Text, Speech, and Dialogue (TSD 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11107)

Abstract

The paper presents a system for joint morphosyntactic disambiguation and segmentation of Polish based on conditional random fields (CRFs). The system is coupled with Morfeusz, a morphosyntactic analyzer for Polish, which represents both morphosyntactic and segmentation ambiguities in the form of a directed acyclic graph (DAG). We rely on constrained linear-chain CRFs generalized to work directly on DAGs, which allows us to perform segmentation as a by-product of morphosyntactic disambiguation. This is in contrast with other existing taggers for Polish, which either neglect the problem of segmentation or rely on heuristics to perform it in a pre-processing stage. We evaluate our system on historical corpora of Polish, where segmentation ambiguities are more prominent than in contemporary Polish, and show that our system significantly outperforms several baseline segmentation methods.
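The idea of obtaining segmentation as a by-product of disambiguation can be illustrated with a minimal decoding sketch. It assumes a toy analysis graph in which each edge is one candidate segment with one tag, and it picks the highest-scoring path through the graph, so choosing tags and choosing segment boundaries happen in a single step. The `Edge` layout, the `best_path` function, the simplified tags, and all weights are hypothetical and chosen for illustration only; they are not the paper's model, its features, or the Morfeusz API, and training and constraint handling are left out entirely.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    """One candidate segment in the analysis graph (hypothetical layout)."""
    start: int  # node where the segment begins
    end: int    # node where the segment ends (start < end)
    orth: str   # surface form of the segment
    tag: str    # simplified morphosyntactic tag


def best_path(edges, edge_score, trans_score, first, last):
    """Return (score, edges) of the best path from node `first` to node `last`."""
    incoming = defaultdict(list)  # node -> reachable edges ending at that node
    best = {}                     # edge -> (best path score, previous edge)
    # Sorting by start node gives a topological order because start < end.
    for e in sorted(edges, key=lambda e: (e.start, e.end)):
        if e.start == first:
            best[e] = (edge_score(e), None)
        else:
            cands = [(best[p][0] + trans_score(p, e), p)
                     for p in incoming[e.start]]
            if not cands:
                continue  # this edge cannot be reached from the first node
            score, prev = max(cands, key=lambda c: c[0])
            best[e] = (score + edge_score(e), prev)
        incoming[e.end].append(e)
    finals = [e for e in best if e.end == last]
    if not finals:
        raise ValueError("no path between the given nodes")
    cur = max(finals, key=lambda e: best[e][0])
    total, path = best[cur][0], []
    while cur is not None:  # follow back-pointers to recover the path
        path.append(cur)
        cur = best[cur][1]
    return total, path[::-1]


# Toy graph for "gdzieś był": either one segment "gdzieś" or "gdzie" + "ś".
edges = [
    Edge(0, 2, "gdzieś", "adv"),
    Edge(0, 1, "gdzie", "adv"),
    Edge(1, 2, "ś", "aglt"),
    Edge(2, 3, "był", "praet"),
]
# Made-up weights standing in for learned parameters.
EDGE_W = {("gdzieś", "adv"): 1.0, ("gdzie", "adv"): 0.5,
          ("ś", "aglt"): 0.2, ("był", "praet"): 0.5}
TRANS_W = {("adv", "aglt"): 0.3, ("aglt", "praet"): 1.0, ("adv", "praet"): 0.1}

score, path = best_path(
    edges,
    lambda e: EDGE_W.get((e.orth, e.tag), 0.0),
    lambda p, e: TRANS_W.get((p.tag, e.tag), 0.0),
    first=0,
    last=3,
)
print([(e.orth, e.tag) for e in path])
```

Under these toy weights the three-segment reading wins, so the segmentation of "gdzieś" into "gdzie" and "ś" falls out of the same search that selects the tags, with no separate pre-processing step.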

Notes

  1. Based on algorithms involving automatic extraction of rules.

  2. See: http://poleval.pl/index.php/results/.

  3. By extension, this also holds true for ensemble taggers, e.g. PoliTa [8].

  4. Intuitively, \(f_k\) has a positive influence on the modeled probability if \(\theta_k > 0\), a negative influence if \(\theta_k < 0\), and no influence whatsoever if \(\theta_k = 0\).

  5. With \(r_i = Y\) for out-of-vocabulary words.

  6. Note that these results abstract from potential morphosyntactic analysis errors.

  7. Increasing all counts by 1 makes the probability of unseen segments equal to 1/2.
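The arithmetic behind note 7 can be checked with a small sketch. It assumes that the estimate for a segment is a ratio of two counts, here called `chosen` and `rejected`, and that add-one smoothing increases both; the function name and this two-count layout are assumptions made for illustration, since the preview does not show how the baseline is defined.

```python
from fractions import Fraction


def segment_probability(chosen: int, rejected: int, smoothing: int = 1) -> Fraction:
    """Smoothed probability that a candidate segment is selected.

    `chosen` and `rejected` are hypothetical counts of how often the segment
    was, or was not, part of the reference segmentation.
    """
    return Fraction(chosen + smoothing, chosen + rejected + 2 * smoothing)


# A segment never seen in training has both counts equal to zero.
unseen = segment_probability(0, 0)
seen = segment_probability(3, 1)
print(unseen, seen)
```

With both counts at zero the smoothed estimate is (0 + 1) / (0 + 0 + 2), which is where the value 1/2 comes from under this assumption.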

References

  1. Acedański, S.: A morphosyntactic Brill tagger for inflectional languages. In: Loftsson, H., Rögnvaldsson, E., Helgadóttir, S. (eds.) NLP 2010. LNCS (LNAI), vol. 6233, pp. 3–14. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14770-8_3

  2. Calzolari, N., et al. (eds.): Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014. ELRA, Reykjavík, Iceland (2014). http://www.lrec-conf.org/proceedings/lrec2014/index.html

  3. Chen, X., Qiu, X., Zhu, C., Liu, P., Huang, X.: Long short-term memory neural networks for Chinese word segmentation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1197–1206. ACL (2015). http://www.aclweb.org/anthology/D15-1141

  4. Dębowski, L.: Trigram morphosyntactic tagger for Polish. In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds.) Intelligent Information Processing and Web Mining, pp. 409–413. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-39985-8_43

  5. Kieraś, W., Komosińska, D., Modrzejewski, E., Woliński, M.: Morphosyntactic annotation of historical texts. The making of the baroque corpus of Polish. In: Ekštein, K., Matoušek, V. (eds.) TSD 2017. LNCS (LNAI), vol. 10415, pp. 308–316. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64206-2_35

  6. Kieraś, W., Woliński, M.: Manually annotated corpus of Polish texts published between 1830 and 1918. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2018. ELRA, Miyazaki, Japan (2018)

  7. Kobyliński, Ł., Ogrodniczuk, M.: Results of the PolEval 2017 competition: part-of-speech tagging shared task. In: Vetulani and Paroubek [17], pp. 362–366

  8. Kobyliński, Ł.: PoliTa: A multitagger for Polish. In: Calzolari et al. [2], pp. 2949–2954. http://www.lrec-conf.org/proceedings/lrec2014/index.html

  9. Krasnowska-Kieraś, K.: Morphosyntactic disambiguation for Polish with bi-LSTM neural networks. In: Vetulani and Paroubek [17], pp. 367–371

  10. Kudo, T., Yamamoto, K., Matsumoto, Y.: Applying conditional random fields to Japanese morphological analysis. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (2004). http://www.aclweb.org/anthology/W04-3230

  11. Peng, F., Feng, F., McCallum, A.: Chinese segmentation and new word detection using conditional random fields. In: COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics (2004). http://www.aclweb.org/anthology/C04-1081

  12. Piasecki, M., Wardyński, A.: Multiclassifier approach to tagging of Polish. In: Proceedings of the International Multiconference on ISSN, vol. 1896, p. 7094

  13. Radziszewski, A.: A tiered CRF tagger for Polish. In: Bembenik, R., Skonieczny, L., Rybinski, H., Kryszkiewicz, M., Niezgodka, M. (eds.) Intelligent Tools for Building a Scientific Information Platform, pp. 215–230. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35647-6_16

  14. Radziszewski, A., Acedański, S.: Taggers gonna tag: an argument against evaluating disambiguation capacities of morphosyntactic taggers. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012. LNCS (LNAI), vol. 7499, pp. 81–87. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32790-2_9

  15. Radziszewski, A., Śniatowski, T.: Maca - a configurable tool to integrate Polish morphological data. In: Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation (2011)

  16. Sutton, C., McCallum, A.: An introduction to conditional random fields. Found. Trends® Mach. Learn. 4(4), 267–373 (2012)

  17. Vetulani, Z., Paroubek, P. (eds.): Proceedings of the 8th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics. Fundacja Uniwersytetu im. Adama Mickiewicza w Poznaniu, Poznań, Poland (2017)

  18. Wainwright, M.J., Jordan, M.I., et al.: Graphical models, exponential families, and variational inference. Found. Trends® Mach. Learn. 1(1–2), 1–305 (2008)

  19. Walentynowicz, W.: MorphoDiTa-based tagger for Polish language (2017), CLARIN-PL digital repository. http://hdl.handle.net/11321/425

  20. Waszczuk, J.: Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected language. In: Proceedings of COLING 2012, pp. 2789–2804 (2012). http://www.aclweb.org/anthology/C12-1170

  21. Woliński, M.: Morfeusz reloaded. In: Calzolari et al. [2], pp. 1106–1111. http://www.lrec-conf.org/proceedings/lrec2014/index.html

  22. Wróbel, K.: KRNNT: Polish recurrent neural network tagger. In: Vetulani and Paroubek [17], pp. 386–391

Acknowledgements

The work reported here was partially supported by the National Science Centre, Poland, grant DEC-2014/15/B/HS2/03119.

Author information

Corresponding author

Correspondence to Jakub Waszczuk.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Waszczuk, J., Kieraś, W., Woliński, M. (2018). Morphosyntactic Disambiguation and Segmentation for Historical Polish with Graph-Based Conditional Random Fields. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science, vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_20

  • DOI: https://doi.org/10.1007/978-3-030-00794-2_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00793-5

  • Online ISBN: 978-3-030-00794-2

  • eBook Packages: Computer Science, Computer Science (R0)
