Abstract
In this paper we present an alternative version of the morphosyntactically annotated Serbian translation of 1984. This version follows the basic principles of the MULTEXT-East version, with one addition: the text is also annotated with multi-word units. We present the resources used for this annotation and explain how they were enriched with multi-word units extracted from the processed text. Finally, we present the format of the alternative version and the benefits gained both from preparing the new resource and from the resource itself.
Notes
- 1. All examples in this paper (in English with a Serbian translation) are taken from the novel 1984 whenever a suitable example occurs in the text.
Acknowledgments
This research was supported by the Serbian Ministry of Education and Science (grant No. 178003).
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Krstev, C., Vitas, D., Trtovac, A. (2014). Orwell’s 1984—From Simple to Multi-word Units. In: Vetulani, Z., Mariani, J. (eds) Human Language Technology Challenges for Computer Science and Linguistics. LTC 2011. Lecture Notes in Computer Science, vol 8387. Springer, Cham. https://doi.org/10.1007/978-3-319-08958-4_23
DOI: https://doi.org/10.1007/978-3-319-08958-4_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08957-7
Online ISBN: 978-3-319-08958-4
eBook Packages: Computer Science (R0)