Identification and Lexical Representation of Multiword Expressions

  • Jan Odijk
Chapter
Part of the Theory and Applications of Natural Language Processing book series (NLP)

Abstract

The central problems that this paper addresses are (i) the lack of large and rich formalised lexicons for multi-word expressions for use in Natural Language Processing (NLP), and (ii) the lack of proper methods and tools to extend the lexicon of an NLP system with multi-word expressions, given a text corpus, in a maximally automated manner. The paper describes innovative methods and tools for the automatic identification and lexical representation of multi-word expressions. In addition, it describes a 5,000-entry corpus-based multi-word expression (MWE) lexical database for Dutch developed using these methods. The database has been externally validated, and its usability has been evaluated in NLP systems for Dutch. The MWE database fills a gap in existing lexical resources for Dutch. The generic methods and tools for MWE identification and lexical representation focus on Dutch, but they are largely language-independent and can also be used for other languages and new domains, beyond this project. The research results and data described in this paper contribute directly to strengthening the digital infrastructure for Dutch.

Notes

Acknowledgements

The paper describes joint work by the IRME project team, and especially work carried out by Nicole Grégoire and Begoña Villada Moirón. I have liberally used material from reports, articles, PhD theses, etc. written by them, for which I am very grateful. I would also like to thank them and two anonymous reviewers for useful suggestions to improve this paper.

Copyright information

© The Author(s) 2013

Open Access. This chapter is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Authors and Affiliations

  1. UiL-OTS, Utrecht, The Netherlands
