Influence of Treebank Design on Representation of Multiword Expressions

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2011)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6608)

Abstract

Multiword Expressions (MWEs) are important linguistic units that require special treatment in many NLP applications. It is thus desirable to be able to recognize them automatically. Semantically annotated corpora should mark MWEs in a clear way that facilitates the development of automatic recognition tools. In the present paper we discuss various corpus design decisions from this perspective. We propose guidelines that should lead to MWE-friendly annotation and evaluate them on numerous example sentences. Our experience of identifying MWEs in the Prague Dependency Treebank provides the basis for the discussion, and examples from other languages are added where appropriate.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bejček, E., Straňák, P., Zeman, D. (2011). Influence of Treebank Design on Representation of Multiword Expressions. In: Gelbukh, A.F. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6608. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19400-9_1

  • DOI: https://doi.org/10.1007/978-3-642-19400-9_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19399-6

  • Online ISBN: 978-3-642-19400-9

  • eBook Packages: Computer Science, Computer Science (R0)
