State of the Art in MWE Processing

Multiword Expressions Acquisition

Abstract

In the previous chapter, we provided the historical and theoretical foundations for the study of multiword expressions (MWEs). The definitions, characteristics and types described there give an idea of the difficulty of computational tasks involving MWEs. The goal of the present chapter is to give an overview of the state of the art in computational methods for MWE treatment, focusing on acquisition. State-of-the-art techniques for dealing with MWEs are the starting point of the methodology proposed in Chap. 5. The information in the present chapter allows better comparison and contextualisation of the present work within the computational linguistics panorama.

Notes

  1.

    The goal of this section is not to provide a substantial introduction to empirical methods in computational linguistics. Instead, we recall and try to disambiguate as much as possible the definitions of concepts that are already familiar to the reader to some extent. If this is not the case, we recommend Jurafsky and Martin (2008) as a consolidated and wide introduction to NLP and Manning and Schütze (1999) for a more specific introduction to empirical methods. Our text is inspired by these two standard reference textbooks.

  2.

    Contraction identification usually requires context-aware analysis. For instance, in French, the contraction \(\mathit{des} = \mathit{de} + \mathit{les}\) is homonymous with the partitive/indefinite article des.

  3.

    We use the character ˽ only to emphasise the spaces between words.

  4.

    However, it is not enough to lowercase the whole text, as case information may be important, for instance, in domain-specific texts (chemical formula NaCl), in acronyms (NASA, CIA) and to distinguish named entities (Bill Gates, March) from common words (pay the bill, open the gates, the soldiers march).

  5.

    The tagset used by the TreeTagger in English is available at ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz and reproduced in Appendix D.4.

  6.

    Actually, RASP does not generate dependency relations directly; it infers grammatical relations using equivalence rules applied to a traditional constituency parse tree. Relations are mostly acyclic, and exceptions can be dealt with on a case-by-case basis.

  7.

    Documentation about RASP’s tagset and grammatical relations is available at http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-662.pdf and in Appendix C of Jurafsky and Martin (2008). Moreover, the tags used by RASP for POS and syntax are reproduced in Appendices D.2 and D.3.

  8.

    This is a simplification, as described by Briscoe et al. (2006).

  9.

    The type/token ratio, that is, the number of types divided by the number of tokens in a text, has been used as a measure of vocabulary richness. This measure depends on the corpus size (Baayen 2001). In BNC-frg, the type/token ratio is 0.091.

  10.

    A word occurring once in the corpus is called a hapax, from the Greek hapax legomenon.

  11.

    Discontiguous sequences are sometimes referred to as flexigrams, that is, n-grams with gaps.

  12.

    The term association measure is standard in MWE acquisition, but it would be more appropriate to talk about association scores instead, since not all the scores discussed here are proper measures.

  13.

    The test statistic is a random variable with a known distribution, from which we can obtain the p-value. If the p-value is below a certain significance level, we can reject the null hypothesis.

  14.

    http://multiword.sourceforge.net/mwe2009

  15.

    Recommended by the author of the algorithm in personal communication.

  16.

    Although this can be simulated by concatenating words and POS tags together in order to form a token.

  17.

    http://olst.ling.umontreal.ca/~drouinp/termostat_web/

  18.

    http://www.antlab.sci.waseda.ac.jp/software.html

  19.

    http://www.nactem.ac.uk/software/termine/

  20.

    http://en.wikipedia.org/wiki/Terminology_extraction

  21.

    http://mwetoolkit.sourceforge.net

  22.

    http://www.temis.com/

  23.

    http://www.temis.com/index.php?id=201&selt=1

  24.

    http://developer.yahoo.com/search/content/V1/termExtraction.html

  25.

    http://129.194.38.128:81/FipsCoView

  26.

    http://similis.org/

  27.

    A nominalisation is a noun derived from a verb; for example, replacement is a nominalisation of the verb replace.

  28.

    The context unit used for annotation was the sentence. However, due to anaphora, it was sometimes impossible to know the intended meaning without looking at neighbouring sentences.
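
The type/token ratio and the notion of hapax described in notes 9 and 10 can be illustrated with a minimal sketch. It assumes naive whitespace tokenisation and a toy sentence; the function names `type_token_ratio` and `hapaxes` are hypothetical and are not taken from any tool discussed in the chapter, and the figures it prints are unrelated to the BNC-frg value quoted in note 9.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Number of distinct types divided by the number of tokens."""
    if not tokens:
        raise ValueError("cannot compute a ratio on an empty token list")
    return len(set(tokens)) / len(tokens)


def hapaxes(tokens):
    """Alphabetically sorted list of the types that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)


# Toy corpus (an assumption for illustration): 11 tokens, 8 types.
tokens = "the cat sat on the mat and the dog sat too".split()

print(round(type_token_ratio(tokens), 3))  # 8 / 11, rounded
print(hapaxes(tokens))
```

Because the ratio depends on corpus size, as note 9 points out, values computed on texts of different lengths should not be compared directly.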

References

  • Acosta O, Villavicencio A, Moreira V (2011) Identification and treatment of multiword expressions applied to information retrieval. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 101–109. http://www.aclweb.org/anthology/W/W11/W11-0815

  • Anastasiou D, Hashimoto C, Nakov P, Kim SN (eds) (2009) Proceedings of the ACL workshop on multiword expressions: identification, interpretation, disambiguation, applications (MWE 2009), Suntec, Singapore. Association for Computational Linguistics, 70p. http://aclweb.org/anthology-new/W/W09/W09-29

  • Apresian J, Boguslavsky I, Iomdin L, Tsinman L (2003) Lexical functions as a tool of ETAP-3. In: Proceedings of the first international conference on meaning-text theory (MTT 2003), Paris

  • Attia M, Toral A, Tounsi L, Pecina P, van Genabith J (2010) Automatic extraction of Arabic multiword expressions. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 18–26

  • Baayen RH (2001) Word frequency distributions. Text, speech and language technology, vol 18. Springer, Berlin/New York

  • Bai MH, You JM, Chen KJ, Chang JS (2009) Acquiring translation equivalences of multiword expressions by normalized correlation frequencies. In: Proceedings of the 2009 conference on empirical methods in natural language processing (EMNLP 2009), Suntec, Singapore. Association for Computational Linguistics, pp 478–486

  • Baldwin T (2005) Deep lexical acquisition of verb-particle constructions. Comput Speech Lang Spec Issue MWEs 19(4):398–414

  • Baldwin T (2011) MWEs and topic modelling: enhancing machine learning with linguistics. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, p 1. http://www.aclweb.org/anthology/W/W11/W11-0801

  • Baldwin T, Tanaka T (2004) Translation by machine of complex nominals: getting it right. In: Tanaka T, Villavicencio A, Bond F, Korhonen A (eds) Proceedings of the ACL workshop on multiword expressions: integrating processing (MWE 2004), Barcelona. Association for Computational Linguistics, pp 24–31

  • Baldwin T, Bannard C, Tanaka T, Widdows D (2003) An empirical model of multiword expression decomposability. In: Bond F, Korhonen A, McCarthy D, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: analysis, acquisition and treatment (MWE 2003), Sapporo. Association for Computational Linguistics, pp 89–96. doi:10.3115/1119282.1119294, http://www.aclweb.org/anthology/W03-1812

  • Banerjee S, Pedersen T (2003) The design, implementation, and use of the Ngram Statistic Package. In: Proceedings of the fourth international conference on intelligent text processing and computational linguistics, Mexico City, pp 370–381

  • Bannard C (2005) Learning about the meaning of verb-particle constructions from corpora. Comput Speech Lang Spec Issue MWEs 19(4):467–478

  • Bejček E, Stranak P, Pecina P (2013) Syntactic identification of occurrences of multiword expressions in text using a lexicon with dependency structures. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the 9th workshop on multiword expressions (MWE 2013), Atlanta. Association for Computational Linguistics, pp 106–115. http://www.aclweb.org/anthology/W13-1016

  • Bonin F, Dell’Orletta F, Montemagni S, Venturi G (2010a) A contrastive approach to multi-word extraction from domain-specific corpora. In: Proceedings of the seventh international conference on language resources and evaluation (LREC 2010), Valletta. European Language Resources Association

  • Bonin F, Dell’Orletta F, Venturi G, Montemagni S (2010b) Contrastive filtering of domain-specific multi-word terms from different types of corpora. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 76–79

  • Bouamor D, Semmar N, Zweigenbaum P (2012) Identifying bilingual multi-word expressions for statistical machine translation. In: Proceedings of the eighth international conference on language resources and evaluation (LREC 2012), Istanbul. European Language Resources Association

  • Briscoe T, Carroll J, Watson R (2006) The second release of the RASP system. In: Curran J (ed) Proceedings of the COLING/ACL 2006 interactive presentation sessions, Sydney. Association for Computational Linguistics, pp 77–80. http://www.aclweb.org/anthology/P/P06/P06-4020

  • Bungum L, Gambäck B, Lynum A, Marsi E (2013) Improving word translation disambiguation by capturing multiword expressions with dictionaries. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the 9th workshop on multiword expressions (MWE 2013), Atlanta. Association for Computational Linguistics, pp 21–30. http://www.aclweb.org/anthology/W13-1003

  • Burnard L (2007) User reference guide for the British National Corpus. Technical report, Oxford University Computing Services

  • Butnariu C, Kim SN, Nakov P, Séaghdha DO, Szpakowicz S, Veale T (2010) Semeval-2 task 9: the interpretation of noun compounds using paraphrasing verbs and prepositions. In: Erk K, Strapparava C (eds) Proceedings of the 5th international workshop on semantic evaluation (SemEval 2010), Uppsala. Association for Computational Linguistics, pp 39–44. http://www.aclweb.org/anthology/S10-1007

  • Carpuat M, Diab M (2010) Task-based evaluation of multiword expressions: a pilot study in statistical machine translation. In: Proceedings of human language technology: the 2010 annual conference of the North American chapter of the Association for Computational Linguistics (NAACL 2010), Los Angeles. Association for Computational Linguistics, pp 242–245. http://www.aclweb.org/anthology/N10-1029

  • Chen SF, Goodman J (1999) An empirical study of smoothing techniques for language modeling. Comput Speech Lang 13(4):359–394

  • Church K, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29

  • Constant M, Sigogne A (2011) MWU-aware part-of-speech tagging with a CRF model and lexical resources. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 49–56. http://www.aclweb.org/anthology/W/W11/W11-0809

  • Constant M, Roux JL, Sigogne A (2013) Combining compound recognition and PCFG-LA parsing with word lattices and conditional random fields. ACM Trans Speech Lang Process Spec Issue Multiword Expr Theory Pract Use Part 2 (TSLP) 10(3):1–24

  • Cook P, Stevenson S (2006) Classifying particle semantics in English verb-particle constructions. In: Moirón BV, Villavicencio A, McCarthy D, Evert S, Stevenson S (eds) Proceedings of the COLING/ACL workshop on multiword expressions: identifying and exploiting underlying properties (MWE 2006), Sydney. Association for Computational Linguistics, pp 45–53. http://www.aclweb.org/anthology/W/W06/W06-1207

  • Cook P, Fazly A, Stevenson S (2007) Pulling their weight: exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In: Grégoire N, Evert S, Kim SN (eds) Proceedings of the ACL workshop on a broader perspective on multiword expressions (MWE 2007), Prague. Association for Computational Linguistics, pp 41–48. http://www.aclweb.org/anthology/W/W07/W07-1106

  • Cook P, Fazly A, Stevenson S (2008) The VNC-tokens dataset. In: Grégoire N, Evert S, Krenn B (eds) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, pp 19–22

  • Daille B (2003) Conceptual structuring through term variations. In: Bond F, Korhonen A, McCarthy D, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: analysis, acquisition and treatment (MWE 2003), Sapporo. Association for Computational Linguistics, pp 9–16. doi:10.3115/1119282.1119284. http://www.aclweb.org/anthology/W03-1802

  • Daille B, Dufour-Kowalski S, Morin E (2004) French-English multi-word term alignment based on lexical context analysis. In: Proceedings of the fourth international conference on language resources and evaluation (LREC 2004), Lisbon. European Language Resources Association, pp 919–922

  • Déjean H, Gaussier É, Sadat F (2002) An approach based on multilingual thesauri and model combination for bilingual lexicon extraction. In: Proceedings of the 19th international conference on computational linguistics (COLING 2002), Taipei. http://aclweb.org/anthology-new/C/C02/C02-1166.pdf

  • de Medeiros Caseli H, Villavicencio A, Machado A, Finatto MJ (2009) Statistically-driven alignment-based multiword expression identification for technical domains. In: Anastasiou D, Hashimoto C, Nakov P, Kim SN (eds) Proceedings of the ACL workshop on multiword expressions: identification, interpretation, disambiguation, applications (MWE 2009), Suntec, Singapore. Association for Computational Linguistics, pp 1–8

  • de Medeiros Caseli H, Ramisch C, das Graças Volpe Nunes M, Villavicencio A (2010) Alignment-based extraction of multiword expressions. Lang Resour Eval Spec Issue Multiword Express Hard Going Plain Sail 44(1–2):59–77. doi:10.1007/s10579-009-9097-9, http://www.springerlink.com/content/H7313427H78865MG

  • Dias G (2003) Multiword unit hybrid extraction. In: Bond F, Korhonen A, McCarthy D, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: analysis, acquisition and treatment (MWE 2003), Sapporo. Association for Computational Linguistics, pp 41–48. doi:10.3115/1119282.1119288. http://www.aclweb.org/anthology/W03-1806

  • Duan J, Lu R, Wu W, Hu Y, Tian Y (2006) A bio-inspired approach for multi-word expression extraction. In: Curran J (ed) Proceedings of the COLING/ACL 2006 main conference poster sessions, Sydney. Association for Computational Linguistics, pp 176–182. http://www.aclweb.org/anthology/P/P06/P06-2023

  • Dunning T (1993) Accurate methods for the statistics of surprise and coincidence. Comput Linguist 19(1):61–74

  • Duran MS, Ramisch C, Aluísio SM, Villavicencio A (2011) Identifying and analyzing Brazilian Portuguese complex predicates. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 74–82. http://www.aclweb.org/anthology/W/W11/W11-0812

  • Evert S (2004) The statistics of word cooccurrences: word pairs and collocations. PhD thesis, Institut für maschinelle Sprachverarbeitung, University of Stuttgart, Stuttgart, 353p

  • Evert S, Krenn B (2005) Using small random samples for the manual evaluation of statistical association measures. Comput Speech Lang Spec Issue MWEs 19(4):450–466

  • Fazly A, Stevenson S (2007) Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In: Grégoire N, Evert S, Kim SN (eds) Proceedings of the ACL workshop on a broader perspective on multiword expressions (MWE 2007), Prague. Association for Computational Linguistics, pp 9–16. http://www.aclweb.org/anthology/W/W07/W07-1102

  • Finlayson M, Kulkarni N (2011) Detecting multi-word expressions improves word sense disambiguation. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 20–24. http://www.aclweb.org/anthology/W/W11/W11-0805

  • Frantzi K, Ananiadou S, Mima H (2000) Automatic recognition of multiword terms: the C-value/NC-value method. Int J Digit Libr 3(2):115–130

  • Fritzinger F, Weller M, Heid U (2010) A survey of idiomatic preposition-noun-verb triples on token level. In: Proceedings of the seventh international conference on language resources and evaluation (LREC 2010), Valletta. European Language Resources Association, pp 2908–2914

  • Gil A, Dias G (2003) Using masks, suffix array-based data structures and multidimensional arrays to compute positional n-gram statistics from corpora. In: Bond F, Korhonen A, McCarthy D, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: analysis, acquisition and treatment (MWE 2003), Sapporo. Association for Computational Linguistics, pp 25–32. doi:10.3115/1119282.1119286, http://www.aclweb.org/anthology/W03-1804

  • Girju R, Moldovan D, Tatu M, Antohe D (2005) On the semantics of noun compounds. Comput Speech Lang Spec Issue MWEs 19(4):479–496

  • Good IJ (1953) The population frequencies of species and the estimation of population parameters. Biometrika 40(3–4):237–264. doi:10.1093/biomet/40.3-4.237

  • Graliński F, Savary A, Czerepowicka M, Makowiecki F (2010) Computational lexicography of multi-word units: how efficient can it be? In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 1–9

  • Green S, de Marneffe MC, Bauer J, Manning CD (2011) Multiword expression identification with tree substitution grammars: a parsing tour de force with French. In: Barzilay R, Johnson M (eds) Proceedings of the 2011 conference on empirical methods in natural language processing (EMNLP 2011), Edinburgh. Association for Computational Linguistics, pp 725–735. http://www.aclweb.org/anthology/D11-1067

  • Grefenstette G (1999) The world wide web as a resource for example-based machine translation tasks. In: Proceedings of the twenty-first international conference on translating and the computer, ASLIB, London

  • Grégoire N (2007) Design and implementation of a lexicon of Dutch multiword expressions. In: Grégoire N, Evert S, Kim SN (eds) Proceedings of the ACL workshop on a broader perspective on multiword expressions (MWE 2007), Prague. Association for Computational Linguistics, pp 17–24. http://www.aclweb.org/anthology/W/W07/W07-1103

  • Grégoire N (2010) DuELME: a Dutch electronic lexicon of multiword expressions. Lang Resour Eval Spec Issue Multiword Expr Hard Going Plain Sail 44(1–2):23–39. doi:10.1007/s10579-009-9094-z. http://www.springerlink.com/content/7308605442W17698

  • Grégoire N, Evert S, Krenn B (eds) (2008) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, 57p. http://www.lrec-conf.org/proceedings/lrec2008/workshops/W20_Proceedings.pdf

  • Gurrutxaga A, Alegria I (2011) Automatic extraction of NV expressions in Basque: basic issues on cooccurrence techniques. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 2–7. http://www.aclweb.org/anthology/W/W11/W11-0802

  • Haugereid P, Bond F (2011) Extracting transfer rules for multiword expressions from parallel corpora. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 92–100. http://www.aclweb.org/anthology/W/W11/W11-0814

  • Hendrickx I, Kim SN, Kozareva Z, Nakov P, Séaghdha DO, Padó S, Pennacchiotti M, Romano L, Szpakowicz S (2010) Semeval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. In: Erk K, Strapparava C (eds) Proceedings of the 5th international workshop on semantic evaluation (SemEval 2010), Uppsala. Association for Computational Linguistics, pp 33–38. http://www.aclweb.org/anthology/S10-1006

  • Hoang HH, Kim SN, Kan MY (2009) A re-examination of lexical association measures. In: Anastasiou D, Hashimoto C, Nakov P, Kim SN (eds) Proceedings of the ACL workshop on multiword expressions: identification, interpretation, disambiguation, applications (MWE 2009), Suntec, Singapore. Association for Computational Linguistics, pp 31–39

  • Hogan D, Foster J, van Genabith J (2011) Decreasing lexical data sparsity in statistical syntactic parsing – experiments with named entities. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 14–19. http://www.aclweb.org/anthology/W/W11/W11-0804

  • Izumi T, Imamura K, Kikui G, Sato S (2010) Standardizing complex functional expressions in Japanese predicates: applying theoretically-based paraphrasing rules. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 63–71

  • Jurafsky D, Martin JH (2008) Speech and language processing, 2nd edn. Prentice Hall, Upper Saddle River, 1024p

  • Justeson JS, Katz SM (1995) Technical terminology: some linguistic properties and an algorithm for identification in text. Nat Lang Eng 1(1):9–27

  • Keller F, Lapata M (2003) Using the web to obtain frequencies for unseen bigrams. Comput Linguist Spec Issue Web Corpus 29(3):459–484

  • Kim SN, Baldwin T (2013) A lexical semantic approach to interpreting and bracketing English noun compounds. Nat Lang Eng Spec Issue Noun Compd 19(3):385–407. doi:10.1017/S1351324913000107, http://journals.cambridge.org/article_S1351324913000107

  • Kim SN, Nakov P (2011) Large-scale noun compound interpretation using bootstrapping and the web as a corpus. In: Barzilay R, Johnson M (eds) Proceedings of the 2011 conference on empirical methods in natural language processing (EMNLP 2011), Edinburgh. Association for Computational Linguistics, pp 648–658. http://www.aclweb.org/anthology/D11-1060

  • Kneser R, Ney H (1995) Improved backing-off for M-gram language modeling. In: Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP 1995), Detroit, vol 1, pp 181–184. doi:10.1109/ICASSP.1995.479394, http://dx.doi.org/10.1109/ICASSP.1995.479394

  • Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the tenth machine translation summit (MT Summit 2005), Phuket. Asian-Pacific Association for Machine Translation, pp 79–86

  • Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the Association for Computational Linguistics (ACL 2007), Prague. Association for Computational Linguistics, pp 177–180

  • Korkontzelos I, Manandhar S (2010) Can recognising multiword expressions improve shallow parsing? In: Proceedings of human language technology: the 2010 annual conference of the North American chapter of the Association for Computational Linguistics (NAACL 2010), Los Angeles. Association for Computational Linguistics, pp 636–644. http://www.aclweb.org/anthology/N10-1089

  • Kulkarni N, Finlayson M (2011) jMWE: a java toolkit for detecting multi-word expressions. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 122–124. http://www.aclweb.org/anthology/W/W11/W11-0818

  • Lapata M (2002) The disambiguation of nominalizations. Comput Linguist 28(3):357–388

  • Laporte É, Voyatzi S (2008) An electronic dictionary of French multiword adverbs. In: Grégoire N, Evert S, Krenn B (eds) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, pp 31–34

  • Laporte É, Nakamura T, Voyatzi S (2008) A French corpus annotated for multiword nouns. In: Grégoire N, Evert S, Krenn B (eds) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, pp 27–30

  • Li Z, Callison-Burch C, Dyer C, Ganitkevitch J, Khudanpur S, Schwartz L, Thornton WNG, Weese J, Zaidan OF (2009) Joshua: an open source toolkit for parsing-based machine translation. In: Proceedings of the fourth workshop on statistical machine translation (WMT 2009), Athens. Association for Computational Linguistics, pp 135–139

  • Manber U, Myers G (1990) Suffix arrays: a new method for on-line string searches. In: SODA ’90: proceedings of the first annual ACM-SIAM symposium on discrete algorithms, San Francisco. Society for Industrial and Applied Mathematics, Philadelphia, pp 319–327

  • Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT, Cambridge, 620p

  • Martens S (2010) Varro: an algorithm and toolkit for regular structure discovery in treebanks. In: Huang CR, Jurafsky D (eds) Proceedings of the 23rd international conference on computational linguistics (COLING 2010)—posters, Beijing. The Coling 2010 Organizing Committee, pp 810–818. http://www.aclweb.org/anthology/C10-2093

  • Martens S, Vandeghinste V (2010) An efficient, generic approach to extracting multi-word expressions from dependency trees. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 84–87

  • McCarthy D, Keller B, Carroll J (2003) Detecting a continuum of compositionality in phrasal verbs. In: Bond F, Korhonen A, McCarthy D, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: analysis, acquisition and treatment (MWE 2003), Sapporo. Association for Computational Linguistics, pp 73–80. doi:10.3115/1119282.1119292, http://www.aclweb.org/anthology/W03-1810

  • McCarthy D, Venkatapathy S, Joshi A (2007) Detecting compositionality of verb-object combinations using selectional preferences. In: Eisner J (ed) Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL 2007), Prague. Association for Computational Linguistics, pp 369–379. http://www.aclweb.org/anthology/D/D07/D07-1039

  • Melamed ID (1997) Automatic discovery of non-compositional compounds in parallel data. In: Proceedings of the 2nd conference on empirical methods in natural language processing (EMNLP-2), Brown University, Providence. Association for Computational Linguistics, pp 97–108

  • Michou A, Seretan V (2009) A tool for multi-word expression extraction in modern Greek using syntactic parsing. In: Proceedings of the demonstrations session at EACL 2009, Athens. Association for Computational Linguistics, pp 45–48

  • Mikheev A (2002) Periods, capitalized words, etc. Comput Linguist 28(3):289–318

  • Mirroshandel SA, Nasr A, Roux JL (2012) Semi-supervised dependency parsing using lexical affinities. In: Proceedings of the 50th annual meeting of the Association for Computational Linguistics (vol 1: long papers), Jeju Island. Association for Computational Linguistics, pp 777–785. http://www.aclweb.org/anthology/P12-1082

  • Mitkov R, Monti J, Pastor GC, Seretan V (eds) (2013) Proceedings of the MT summit 2013 workshop on multi-word units in machine translation and translation technology (MUMTTT 2013), Nice. European Association for Machine Translation, 71p. http://www.mtsummit2013.info/workshop4.asp

  • Monti J, Barreiro A, Elia A, Marano F, Napoli A (2011) Taking on new challenges in multi-word unit processing for machine translation. In: Proceedings of the second international workshop on free/open-source rule-based machine translation, Barcelona

  • Morin E, Daille B (2010) Compositionality and lexical alignment of multi-word terms. Lang Resour Eval Spec Issue Multiword Express Hard Going Plain Sail 44(1–2):79–95. doi:10.1007/s10579-009-9098-8, http://www.springerlink.com/content/30264870R1K04744

  • Nakov P (2007) Using the web as an implicit training set: application to noun compound syntax and semantics. PhD thesis, EECS Department, University of California, Berkeley, 392p

  • Nakov P (2008a) Improved statistical machine translation using monolingual paraphrases. In: Ghallab M, Spyropoulos CD, Fakotakis N, Avouris NM (eds) Proceedings of the 18th European conference on artificial intelligence (ECAI 2008), Patras. Frontiers in Artificial Intelligence and Applications, vol 178. IOS Press, pp 338–342

  • Nakov P (2008b) Paraphrasing verbs for noun compound interpretation. In: Grégoire N, Evert S, Krenn B (eds) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, pp 46–49

  • Nakov P (2013) On the interpretation of noun compounds: syntax, semantics, and entailment. Nat Lang Eng Spec Issue Noun Compd 19(3):291–330. doi:10.1017/S1351324913000065, http://journals.cambridge.org/article_S1351324913000065

  • Nakov P, Hearst MA (2005) Search engine statistics beyond the n-gram: application to noun compound bracketing. In: Dagan I, Gildea D (eds) Proceedings of the ninth conference on natural language learning (CoNLL-2005), University of Michigan, Ann Arbor. Association for Computational Linguistics, pp 17–24. http://www.aclweb.org/anthology/W/W05/W05-0603

  • Nakov P, Hearst MA (2008) Solving relational similarity problems using the web as a corpus. In: Proceedings of the 46th annual meeting of the Association for Computational Linguistics: human language technology (ACL-08: HLT), Columbus. Association for Computational Linguistics, pp 452–460

  • Nasr A, Bechet F, Rey JF, Favre B, Roux JL (2011) MACAON: an NLP tool suite for processing word lattices. In: Proceedings of the ACL 2011 system demonstrations, Portland. Association for Computational Linguistics, pp 86–91. http://www.aclweb.org/anthology/P11-4015

  • Newman MEJ (2005) Power laws, Pareto distributions and Zipf’s law. Contemp Phys 46:323–351

  • Nicholson J, Baldwin T (2006) Interpretation of compound nominalisations using corpus and web statistics. In: Moirón BV, Villavicencio A, McCarthy D, Evert S, Stevenson S (eds) Proceedings of the COLING/ACL workshop on multiword expressions: identifying and exploiting underlying properties (MWE 2006), Sydney. Association for Computational Linguistics, pp 54–61. http://www.aclweb.org/anthology/W/W06/W06-1208

  • Nicholson J, Baldwin T (2008) Interpreting compound nominalisations. In: Grégoire N, Evert S, Krenn B (eds) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, pp 43–45

  • Nulty P, Costello F (2010) UCD-PN: Selecting general paraphrases using conditional probability. In: Erk K, Strapparava C (eds) Proceedings of the 5th international workshop on semantic evaluation (SemEval 2010), Uppsala. Association for Computational Linguistics, pp 234–237. http://www.aclweb.org/anthology/S10-1052

  • Nulty P, Costello F (2013) General and specific paraphrases of semantic relations between nouns. Nat Lang Eng Spec Issue Noun Compd 19(3):357–384. doi:10.1017/S1351324913000089, http://journals.cambridge.org/article_S1351324913000089

  • Pal S, Naskar SK, Pecina P, Bandyopadhyay S, Way A (2010) Handling named entities and compound verbs in phrase-based statistical machine translation. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 45–53

  • Pearce D (2002) A comparative evaluation of collocation extraction techniques. In: Proceedings of the third international conference on language resources and evaluation (LREC 2002), Las Palmas. European Language Resources Association, pp 1530–1536

  • Pecina P (2005) An extensive empirical study of collocation extraction methods. In: Proceedings of the ACL 2005 student research workshop, Ann Arbor. Association for Computational Linguistics, pp 13–18. http://www.aclweb.org/anthology/P/P05/P05-2003

  • Pecina P (2008) Reference data for Czech collocation extraction. In: Grégoire N, Evert S, Krenn B (eds) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, pp 11–14

  • Pedersen T, Banerjee S, McInnes B, Kohli S, Joshi M, Liu Y (2011) The n-gram statistics package (Text::NSP): a flexible tool for identifying n-grams, collocations, and word associations. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 131–133. http://www.aclweb.org/anthology/W/W11/W11-0821

  • Planas E, Furuse O (2000) Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In: Proceedings of the 18th international conference on computational linguistics (COLING 2000), Saarbrücken. http://aclweb.org/anthology-new/C/C00/C00-2090.pdf

  • Ramisch C (2009) Multiword terminology extraction for domain-specific documents. Master’s thesis, École Nationale Supérieure d’Informatique et de Mathématiques Appliquées, Grenoble, 79p

  • Ramisch C, Villavicencio A, Moura L, Idiart M (2008) Picking them up and figuring them out: verb-particle constructions, noise and idiomaticity. In: Clark A, Toutanova K (eds) Proceedings of the twelfth conference on natural language learning (CoNLL 2008), Manchester. The Coling 2008 Organizing Committee, pp 49–56. http://www.aclweb.org/anthology/W08-2107

  • Ramisch C, de Medeiros Caseli H, Villavicencio A, Machado A, Finatto MJ (2010) A hybrid approach for multiword expression identification. In: Proceedings of the 9th international conference on computational processing of Portuguese language (PROPOR 2010), Porto Alegre. Lecture notes in computer science (Lecture notes in artificial intelligence), vol 6001. Springer, pp 65–74. doi:10.1007/978-3-642-12320-7_9, http://www.springerlink.com/content/978-3-642-12319-1

  • Ren Z, Lü Y, Cao J, Liu Q, Huang Y (2009) Improving statistical machine translation using domain bilingual multiword expressions. In: Anastasiou D, Hashimoto C, Nakov P, Kim SN (eds) Proceedings of the ACL workshop on multiword expressions: identification, interpretation, disambiguation, applications (MWE 2009), Singapore. Association for Computational Linguistics/Suntec, pp 47–54

  • Roller S, im Walde SS, Scheible S (2013) The (un)expected effects of applying standard cleansing models to human ratings on compositionality. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the 9th workshop on multiword expressions (MWE 2013), Atlanta. Association for Computational Linguistics, pp 32–41. http://www.aclweb.org/anthology/W13-1005

  • Sag I, Baldwin T, Bond F, Copestake A, Flickinger D (2002) Multiword expressions: a pain in the neck for NLP. In: Proceedings of the 3rd international conference on intelligent text processing and computational linguistics (CICLing-2002), Mexico City. Lecture notes in computer science, vol 2276/2010. Springer, pp 1–15

  • SanJuan E, Dowdall J, Ibekwe-SanJuan F, Rinaldi F (2005) A symbolic approach to automatic multiword term structuring. Comput Speech Lang Spec Issue MWEs 19(4):524–542

  • Schmid H (1994) Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the international conference on new methods in language processing, Manchester, pp 44–49. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.1139

  • Schone P, Jurafsky D (2001) Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In: Lee L, Harman D (eds) Proceedings of the 2001 conference on empirical methods in natural language processing (EMNLP 2001), Pittsburgh. Association for Computational Linguistics, pp 100–108

  • Schuler W, Joshi A (2011) Tree-rewriting models of multi-word expressions. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 25–30. http://www.aclweb.org/anthology/W/W11/W11-0806

  • Séaghdha DÓ, Copestake A (2013) Interpreting compound nouns with kernel methods. Nat Lang Eng Spec Issue Noun Compd 19(3):331–356. doi:10.1017/S1351324912000368, http://journals.cambridge.org/article_S1351324912000368

  • Seretan V (2008) Collocation extraction based on syntactic parsing. PhD thesis, University of Geneva, Geneva, 249p

  • Seretan V (2011) Syntax-based collocation extraction. Text, speech and language technology, vol 44, 1st edn. Springer, Dordrecht, 212p

  • Seretan V, Wehrli E (2006) Multilingual collocation extraction: issues and solutions. In: Witt A, Sérasset G, Armstrong S, Breen J, Heid U, Sasaki F (eds) Proceedings of the ACL workshop on multilingual language resources and interoperability, Sydney. Association for Computational Linguistics, pp 40–49. http://www.aclweb.org/anthology/W/W06/W06-1006

  • Seretan V, Wehrli E (2009) Multilingual collocation extraction with a syntactic parser. Lang Resour Eval Spec Issue Multiling Lang Resour Interoper 43(1):71–85. doi:10.1007/s10579-008-9075-7, http://www.springerlink.com/content/341877K50497682X

  • Seretan V, Wehrli E (2011) Fipscoview: on-line visualisation of collocations extracted from multilingual parallel corpora. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 125–127. http://www.aclweb.org/anthology/W/W11/W11-0819

  • Silva J, Lopes G (1999) A local maxima method and a fair dispersion normalization for extracting multi-word units from corpora. In: Proceedings of the sixth meeting on mathematics of language (MOL6), Orlando, pp 369–381

  • Silva J, Lopes G (2010) Towards automatic building of document keywords. In: Huang CR, Jurafsky D (eds) Proceedings of the 23rd international conference on computational linguistics (COLING 2010)—posters, Beijing. The Coling 2010 Organizing Committee, pp 1149–1157. http://www.aclweb.org/anthology/C10-2132

  • da Silva JF, Dias G, Guilloré S, Lopes JGP (1999) Using localmaxs algorithm for the extraction of contiguous and non-contiguous multiword lexical units. In: Proceedings of the 9th Portuguese conference on artificial intelligence: progress in artificial intelligence, London. EPIA 1999, pp 113–132. Springer. http://dl.acm.org/citation.cfm?id=645377.651205

  • Smadja FA (1993) Retrieving collocations from text: Xtract. Comput Linguist 19(1):143–177

  • Stymne S (2009) A comparison of merging strategies for translation of German compounds. In: Proceedings of the student research workshop at EACL 2009, Athens, pp 61–69

  • Stymne S (2011) Pre- and postprocessing for statistical machine translation into Germanic languages. In: Proceedings of the ACL 2011 student research workshop, Portland. Association for Computational Linguistics, pp 12–17. http://www.aclweb.org/anthology/P11-3003

  • Szpakowicz S, Bond F, Nakov P, Kim SN (2013) On the semantics of noun compounds. In: Nat Lang Eng Spec Issue Noun Compd 19(3):289–290. Cambridge University Press, Cambridge

  • Tanaka T, Baldwin T (2003) Noun-noun compound machine translation: a feasibility study on shallow processing. In: Bond F, Korhonen A, McCarthy D, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: analysis, acquisition and treatment (MWE 2003), Sapporo. Association for Computational Linguistics, pp 17–24. doi:10.3115/1119282.1119285. http://www.aclweb.org/anthology/W03-1803

  • Tsvetkov Y, Wintner S (2010) Extraction of multi-word expressions from small parallel corpora. In: Huang CR, Jurafsky D (eds) Proceedings of the 23rd international conference on computational linguistics (COLING 2010)—posters, Beijing. The Coling 2010 Organizing Committee, pp 1256–1264. http://www.aclweb.org/anthology/C10-2144

  • Tsvetkov Y, Wintner S (2011) Identification of multi-word expressions by combining multiple linguistic information sources. In: Barzilay R, Johnson M (eds) Proceedings of the 2011 conference on empirical methods in natural language processing (EMNLP 2011), Edinburgh. Association for Computational Linguistics, pp 836–845. http://www.aclweb.org/anthology/D11-1077

  • Uchiyama K, Baldwin T, Ishizaki S (2005) Disambiguating Japanese compound verbs. Comput Speech Lang Spec Issue MWEs 19(4):497–512

  • Uresova Z, Hajic J, Fucikova E, Sindlerova J (2013) An analysis of annotation of verb-noun idiomatic combinations in a parallel dependency corpus. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the 9th workshop on multiword expressions (MWE 2013), Atlanta. Association for Computational Linguistics, pp 58–63. http://www.aclweb.org/anthology/W13-1009

  • Venkatapathy S, Joshi AK (2006) Using information about multi-word expressions for the word-alignment task. In: Moirón BV, Villavicencio A, McCarthy D, Evert S, Stevenson S (eds) Proceedings of the COLING/ACL workshop on multiword expressions: identifying and exploiting underlying properties (MWE 2006), Sydney. Association for Computational Linguistics, pp 20–27. http://www.aclweb.org/anthology/W/W06/W06-1204

  • Villavicencio A, Bond F, Korhonen A, McCarthy D (2005) Introduction to the special issue on multiword expressions: having a crack at a hard nut. Comput Speech Lang Spec Issue MWEs 19(4):365–377

  • Villavicencio A, Kordoni V, Zhang Y, Idiart M, Ramisch C (2007) Validation and evaluation of automatically acquired multiword expressions for grammar engineering. In: Eisner J (ed) Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL 2007), Prague. Association for Computational Linguistics, pp 1034–1043. http://www.aclweb.org/anthology/D/D07/D07-1110

  • Vincze V, Nagy TI, Berend G (2011) Detecting noun compounds and light verb constructions: a contrastive study. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 116–121. http://www.aclweb.org/anthology/W/W11/W11-0817

  • Wehrli E (1998) Translating idioms. In: Proceedings of the 36th annual meeting of the Association for Computational Linguistics and 17th international conference on computational linguistics, Montreal, vol 2. Association for Computational Linguistics, pp 1388–1392. doi:10.3115/980691.980795. http://www.aclweb.org/anthology/P98-2226

  • Wehrli E, Seretan V, Nerima L (2010) Sentence analysis and collocation identification. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 27–35

  • Wermter J, Hahn U (2006) You can’t beat frequency (unless you use linguistic knowledge) – a qualitative evaluation of association measures for collocation and term extraction. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (COLING/ACL 2006), Sydney. Association for Computational Linguistics, pp 785–792

  • Xu Y, Goebel R, Ringlstetter C, Kondrak G (2010) Application of the tightness continuum measure to Chinese information retrieval. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 54–62

  • Yamamoto M, Church K (2001) Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Comput Linguist 27(1):1–30

  • Zarrieß S, Kuhn J (2009) Exploiting translational correspondences for pattern-independent MWE identification. In: Anastasiou D, Hashimoto C, Nakov P, Kim SN (eds) Proceedings of the ACL workshop on multiword expressions: identification, interpretation, disambiguation, applications (MWE 2009), Singapore. Association for Computational Linguistics/Suntec, pp 23–30

  • Zhang Y, Kordoni V (2006) Automated deep lexical acquisition for robust open texts processing. In: Proceedings of the sixth international conference on language resources and evaluation (LREC 2006), Genoa. European Language Resources Association, pp 275–280

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Ramisch, C. (2015). State of the Art in MWE Processing. In: Multiword Expressions Acquisition. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-09207-2_3

  • DOI: https://doi.org/10.1007/978-3-319-09207-2_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09206-5

  • Online ISBN: 978-3-319-09207-2

  • eBook Packages: Computer Science, Computer Science (R0)
