Multilingual Tokenization and Part-of-speech Tagging. Lightweight Versus Heavyweight Algorithms

  • Tiberiu Boros
  • Stefan Daniel Dumitrescu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10930)

Abstract

This work focuses on morphological analysis of raw text and provides a recipe for tokenization, sentence splitting and part-of-speech tagging for all languages included in the Universal Dependencies Corpus. Scalability is an important issue when dealing with large multilingual corpora. The experiments include both lightweight classifiers (linear models and decision trees) and heavyweight LSTM-based architectures which are able to attain state-of-the-art results. All the experiments are carried out using the provided data “as-is”. We apply lightweight and heavyweight classifiers to 5 distinct tasks on multiple languages; we present some lessons learned during the training process; we look at per-language results as well as task averages; we present model footprints; and finally we draw a few conclusions regarding trade-offs between the classifiers’ characteristics.

Keywords

Linear models; Neural networks; Long-Short-Term-Memory (LSTM) networks; Decision trees; Sequence labeling; Part-of-speech tagging; Morphological attributes; Tokenization; Sentence splitting

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Research Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania
