MULTEXT-East

Handbook of Linguistic Annotation

Abstract

The chapter presents the MULTEXT-East language resources, a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description. The MULTEXT-East dataset includes the EAGLES-based morphosyntactic specifications, morphosyntactic lexicons, and annotated multilingual corpora. The parallel corpus, the novel “1984” by George Orwell, is sentence-aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, using the Text Encoding Initiative Guidelines, TEI P5, and cover 16 languages: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. The dataset is extensively documented and freely available for research purposes. This case study gives a history of the development of the MULTEXT-East resources, presents their encoding and components, discusses related work, and offers some conclusions.

Author information

Corresponding author

Correspondence to Tomaž Erjavec.

Appendix 1. Examples of Annotated Text from Orwell’s “1984”

figure c

Copyright information

© 2017 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Erjavec, T. (2017). MULTEXT-East. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_17

  • DOI: https://doi.org/10.1007/978-94-024-0881-2_17

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-024-0879-9

  • Online ISBN: 978-94-024-0881-2

  • eBook Packages: Social Sciences, Social Sciences (R0)
