Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER

Straka, Milan; Straková, Jana; Hajič, Jan

doi:10.1007/978-3-030-27947-9_12

Milan Straka⁹,
Jana Straková⁹ &
Jan Hajič⁹

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 11697))

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

980 Accesses
11 Citations

Abstract

Contextualized embeddings, which capture appropriate word meaning depending on context, have recently been proposed. We evaluate two methods for precomputing such embeddings, BERT and Flair, on four Czech text processing tasks: part-of-speech (POS) tagging, lemmatization, dependency parsing and named entity recognition (NER). The first three tasks, POS tagging, lemmatization and dependency parsing, are evaluated on two corpora: the Prague Dependency Treebank 3.5 and the Universal Dependencies 2.3. The named entity recognition (NER) is evaluated on the Czech Named Entity Corpus 1.1 and 2.0. We report state-of-the-art results for the above mentioned tasks and corpora.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
With options -size 300 -window 5 -negative 5 -iter 1 -cbow 0.
2.
The concatenated corpus has approximately 4G words, two thirds of them from SYN v3 [14].
3.
https://lindat.cz.
4.
We use -minCount 5 -epoch 10 -neg 10 options to generate the embeddings.
5.
We use the BERT-Base Multilingual Uncased model from https://github.com/google-research/bert.
6.
tf.contrib.opt.lazyadamoptimizer from www.tensorflow.org.
7.
https://fasttext.cc/docs/en/crawl-vectors.html.
8.
POS tagging and lemmatization done with MorphoDiTa [34], http://ufal.mff.cuni.cz/morphodita.

References

Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649. Association for Computational Linguistics (2018)
Google Scholar
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Article Google Scholar
Che, W., Liu, Y., Wang, Y., Zheng, B., Liu, T.: Towards better UD parsing: deep contextualized word embeddings, ensemble, and treebank concatenation. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 55–64. Association for Computational Linguistics (2018)
Google Scholar
Cho, K., van Merrienboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: encoder-decoder approaches. CoRR (2014)
Google Scholar
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018)
Google Scholar
Dozat, T., Manning, C.D.: Deep biaffine attention for neural dependency parsing. CoRR abs/1611.01734 (2016)
Google Scholar
Fares, M., Oepen, S., Øvrelid, L., Björne, J., Johansson, R.: The 2018 shared task on extrinsic parser evaluation: on the downstream utility of English Universal Dependency Parsers. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 22–33. Association for Computational Linguistics (2018)
Google Scholar
Gesmundo, A., Henderson, J., Merlo, P., Titov, I.: A latent variable model of synchronous syntactic-semantic parsing for multiple languages. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, Boulder, pp. 37–42. Association for Computational Linguistics, June 2009
Google Scholar
Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005)
Article Google Scholar
Hajič, J.: Building a syntactically annotated corpus: the Prague dependency treebank. In: Hajičová, E. (ed.) Issues of Valency and Meaning. Studies in Honour of Jarmila Panevová, pp. 106–132. Karolinum, Charles University Press, Prague (1998)
Google Scholar
Hajič, J.: Disambiguation of Rich Inflection: Computational Morphology of Czech. Karolinum Press, Prague (2004)
Google Scholar
Hajič, J., Hlaváčová, J.: MorfFlex CZ 161115 (2016). LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), aculty of Mathematics and Physics, Charles University. http://hdl.handle.net/11234/1-1834
Hajič, J., et al.: Prague dependency treebank 3.5 (2018). LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. http://hdl.handle.net/11234/1-2621
Hnátková, M., Křen, M., Procházka, P., Skoumalová, H.: The SYN-series corpora of written Czech. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland, pp. 160–164. European Language Resources Association (ELRA), May 2014
Google Scholar
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Article Google Scholar
Holan, T., Žabokrtský, Z.: Combining Czech dependency parsers. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2006. LNCS (LNAI), vol. 4188, pp. 95–102. Springer, Heidelberg (2006). https://doi.org/10.1007/11846406_12
Chapter Google Scholar
Kanerva, J., Ginter, F., Miekka, N., Leino, A., Salakoski, T.: Turku neural parser pipeline: an end-to-end system for the CoNLL 2018 shared task. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium, pp. 133–142. Association for Computational Linguistics, October 2018
Google Scholar
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations, December 2014
Google Scholar
Kondratyuk, D., Gavenčiak, T., Straka, M., Hajič, J.: LemmaTag: jointly tagging and lemmatizing for morphologically rich languages with BRNNs. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4921–4928. Association for Computational Linguistics (2018)
Google Scholar
Konkol, M., Konopík, M.: CRF-based Czech named entity recognizer and consolidation of Czech NER Research. In: Habernal, I., Matoušek, V. (eds.) TSD 2013. LNCS (LNAI), vol. 8082, pp. 153–160. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40585-3_20
Chapter Google Scholar
Koo, T., Rush, A.M., Collins, M., Jaakkola, T., Sontag, D.: Dual decomposition for parsing with non-projective head automata. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, pp. 1288–1298. Association for Computational Linguistics, October 2010
Google Scholar
Ling, W., et al.: Finding function in form: compositional character models for open vocabulary word representation. CoRR (2015)
Google Scholar
Nakagawa, T.: Multilingual dependency parsing using global features. In: Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, Prague, Czech Republic, pp. 952–956. Association for Computational Linguistics, June 2007
Google Scholar
Nivre, J., et al.: Universal dependencies v1: a multilingual treebank collection. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, pp. 1659–1666. European Language Resources Association (2016)
Google Scholar
Nivre, J., et al.: Universal dependencies 2.3 (2018). LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. http://hdl.handle.net/11234/1-2895
Novák, V., Žabokrtský, Z.: Feature engineering in maximum spanning tree dependency parser. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 92–98. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74628-7_14
Chapter Google Scholar
Peters, M., et al.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. Association for Computational Linguistics (2018)
Google Scholar
Ševčíková, M., Žabokrtský, Z., Krůza, O.: Named entities in Czech: annotating data and developing NE tagger. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 188–195. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74628-7_26
Chapter Google Scholar
Ševčíková, M., Žabokrtský, Z., Straková, J., Straka, M.: Czech named entity corpus 1.1 (2014). LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C
Ševčíková, M., Žabokrtský, Z., Straková, J., Straka, M.: Czech named entity corpus 2.0 (2014). LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. http://hdl.handle.net/11858/00-097C-0000-0023-1B22-8
Spoustová, D.J., Hajič, J., Raab, J., Spousta, M.: Semi-supervised training for the averaged perceptron POS tagger. In: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pp. 763–771. Association for Computational Linguistics, March 2009
Google Scholar
Straka, M.: UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In: Proceedings of CoNLL 2018: The SIGNLL Conference on Computational Natural Language Learning, Stroudsburg, PA, USA, pp. 197–207. Association for Computational Linguistics (2018)
Google Scholar
Straková, J., Straka, M., Hajič, J.: A new state-of-the-art Czech named entity recognizer. In: Habernal, I., Matoušek, V. (eds.) TSD 2013. LNCS (LNAI), vol. 8082, pp. 68–75. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40585-3_10
Chapter Google Scholar
Straková, J., Straka, M., Hajič, J.: Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Stroudsburg, PA, USA, pp. 13–18. Johns Hopkins University, USA, Association for Computational Linguistics (2014)
Google Scholar
Straková, J., Straka, M., Hajič, J.: Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, pp. 13–18. Johns Hopkins University, Association for Computational Linguistics (2014)
Google Scholar
Straková, J., Straka, M., Hajič, J.: Neural networks for featureless named entity recognition in Czech. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2016. LNCS (LNAI), vol. 9924, pp. 173–181. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45510-5_20
Chapter Google Scholar
Straková, J., Straka, M., Hajič, J.: Neural architectures for nested NER through linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics (2019)
Google Scholar
Vaswani, A., et al.: Attention is all you need. CoRR abs/1706.03762 (2017)
Google Scholar
Žabokrtský, Z.: Treex - an open-source framework for natural language processing. In: Lopatková, M. (ed.) Information Technologies - Applications and Theory, vol. 788, pp. 7–14. Univerzita Pavla Jozefa Šafárika v Košiciach, Slovakia (2011)
Google Scholar
Zeman, D., Ginter, F., Hajič, J., Nivre, J., Popel, M., Straka, M.: CoNLL 2018 shared task: multilingual parsing from raw text to universal dependencies. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium. Association for Computational Linguistics (2018)
Google Scholar

Download references

Acknowledgements

The work described herein has been supported by OP VVV VI LINDAT/CLARIN project (CZ.02.1.01/0.0/0.0/16_013/0001781) and it has been supported and has been using language resources developed by the LINDAT/CLARIN project (LM2015071) of the Ministry of Education, Youth and Sports of the Czech Republic.

Author information

Authors and Affiliations

Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Charles University, Prague, Czechia
Milan Straka, Jana Straková & Jan Hajič

Authors

Milan Straka
View author publications
You can also search for this author in PubMed Google Scholar
Jana Straková
View author publications
You can also search for this author in PubMed Google Scholar
Jan Hajič
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Milan Straka .

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Straka, M., Straková, J., Hajič, J. (2019). Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In: Ekštein, K. (eds) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science(), vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_12

Download citation

DOI: https://doi.org/10.1007/978-3-030-27947-9_12
Published: 06 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics