Abstract
Multiword Expressions (MWEs) are important linguistic units that require special treatment in many NLP applications. It is thus desirable to be able to recognize them automatically. Semantically annotated corpora should mark MWEs in a clear way that facilitates the development of automatic recognition tools. In the present paper we discuss various corpus design decisions from this perspective. We propose guidelines that should lead to MWE-friendly annotation and evaluate them on numerous example sentences. Our experience of identifying MWEs in the Prague Dependency Treebank provides the basis for the discussion, and examples from other languages are added where appropriate.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bejček, E., Straňák, P., Zeman, D. (2011). Influence of Treebank Design on Representation of Multiword Expressions. In: Gelbukh, A.F. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6608. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19400-9_1
DOI: https://doi.org/10.1007/978-3-642-19400-9_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19399-6
Online ISBN: 978-3-642-19400-9
eBook Packages: Computer Science, Computer Science (R0)