State of the Art in MWE Processing

Ramisch, Carlos

doi:10.1007/978-3-319-09207-2_3

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

In the previous chapter, we provided the historical and theoretical foundations for the study of multiword expressions. The set of definitions, characteristics and types described there gives an idea of the difficulty of the computational tasks involving MWEs. The goal of the present chapter is to provide an overview of the state of the art in computational methods for MWE treatment, focusing on acquisition. State-of-the-art techniques for dealing with MWEs are the starting point of the methodology proposed in Chap. 5. The information in the present chapter allows better comparison and contextualisation of the present work within the computational linguistics panorama.

Notes

1. The goal of this section is not to provide a substantial introduction to empirical methods in computational linguistics. Instead, we recall and try to disambiguate as much as possible the definitions of concepts that are already familiar to the reader to some extent. If this is not the case, we recommend Jurafsky and Martin (2008) as a consolidated and wide introduction to NLP and Manning and Schütze (1999) for a more specific introduction to empirical methods. Our text is inspired by these two standard reference textbooks.
2. Contraction identification usually requires context-aware analysis. For instance, in French, the contraction \(\mathit{des} = \mathit{de} + \mathit{les}\) is homonymous with the partitive/indefinite article des.
3. We use the character ˽ only to emphasise the spaces between words.
4. However, it is not enough to lowercase the whole text, as case information may be important, for instance, in domain-specific texts (chemical compound NaCl), in acronyms (NASA, CIA) and to distinguish named entities (Bill Gates, March) from common words (pay the bill, open the gates, the soldiers march).
5. The tagset used by the TreeTagger in English is available at ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz and reproduced in Appendix D.4.
6. Actually, RASP does not generate dependency relations directly, but it infers grammatical relations using equivalence rules applied to a traditional constituent parsing tree. Relations are mostly acyclic and exceptions can be dealt with on a case-by-case basis.
7. Documentation about RASP’s tagset and grammatical relations is available at http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-662.pdf and in Appendix C of Jurafsky and Martin (2008). Moreover, the tags used by RASP for POS and syntax are reproduced in Appendices D.2 and D.3.
8. This is a simplification, as described by Briscoe et al. (2006).
9. The type/token ratio, that is, the number of types divided by the number of tokens in a text, has been used as a measure of the richness of the vocabulary. This measure depends on the corpus size (Baayen 2001). In BNC-frg, the type/token ratio is 0.091.
10. A word occurring once in the corpus is called a hapax, from the Greek hapax legomenon.
11. Discontiguous sequences are sometimes referred to as flexigrams, that is, n-grams with gaps.
12. The term association measure is standard in MWE acquisition, but it would be more appropriate to talk about association scores instead, since not all the scores discussed here are proper measures.
13. The test statistic is a random variable with a known distribution, from which we can obtain the p-value. If the p-value is below a certain significance level, we can reject the null hypothesis.
14. http://multiword.sourceforge.net/mwe2009
15. Recommended by the author of the algorithm in personal communication.
16. This can, however, be simulated by concatenating words and POS tags to form a single token.
17. http://olst.ling.umontreal.ca/~drouinp/termostat_web/
18. http://www.antlab.sci.waseda.ac.jp/software.html
19. http://www.nactem.ac.uk/software/termine/
20. http://en.wikipedia.org/wiki/Terminology_extraction
21. http://mwetoolkit.sourceforge.net
22. http://www.temis.com/
23. http://www.temis.com/index.php?id=201&selt=1
24. http://developer.yahoo.com/search/content/V1/termExtraction.html
25. http://129.194.38.128:81/FipsCoView
26. http://similis.org/
27. A nominalisation is a noun derived from a verb; for instance, replacement is a nominalisation of the verb replace.
28. The context unit used for annotation was the sentence. However, due to anaphora, sometimes it was impossible to know the intended meaning without looking at neighbouring sentences.
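
The definitions of type/token ratio and hapax given in notes 9 and 10 can be illustrated with a minimal Python sketch. The function name `type_token_stats`, the toy sentence, and the whitespace tokenisation with lowercasing are assumptions made for illustration only; they are not the preprocessing used in the chapter.

```python
from collections import Counter


def type_token_stats(tokens):
    """Return (type/token ratio, sorted list of hapaxes) for a list of tokens.

    Hypothetical helper for illustration: a type is a distinct token,
    and a hapax is a type that occurs exactly once.
    """
    if not tokens:
        return 0.0, []
    counts = Counter(tokens)
    ratio = len(counts) / len(tokens)  # number of types / number of tokens
    hapaxes = sorted(word for word, freq in counts.items() if freq == 1)
    return ratio, hapaxes


# Toy corpus; naive whitespace tokenisation is a simplifying assumption.
tokens = "the cat sat on the mat and the dog sat too".lower().split()
ratio, hapaxes = type_token_stats(tokens)
print(f"type/token ratio: {ratio:.3f}")
print("hapaxes:", hapaxes)
```

Because the ratio depends on corpus size, as note 9 points out, values computed this way are only comparable across samples of similar length.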

References

Acosta O, Villavicencio A, Moreira V (2011) Identification and treatment of multiword expressions applied to information retrieval. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 101–109. http://www.aclweb.org/anthology/W/W11/W11-0815
Anastasiou D, Hashimoto C, Nakov P, Kim SN (eds) (2009) Proceedings of the ACL workshop on multiword expressions: identification, interpretation, disambiguation, applications (MWE 2009), Singapore. Association for Computational Linguistics/Suntec. http://aclweb.org/anthology-new/W/W09/W09-29, 70 p.
Apresian J, Boguslavsky I, Iomdin L, Tsinman L (2003) Lexical functions as a tool of ETAP-3. In: Proceedings of the first international conference on meaning-text theory (MTT 2003), Paris
Attia M, Toral A, Tounsi L, Pecina P, van Genabith J (2010) Automatic extraction of Arabic multiword expressions. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 18–26
Baayen RH (2001) Word frequency distributions, text, speech and language technology, vol 18. Springer, Berlin/New York
Bai MH, You JM, Chen KJ, Chang JS (2009) Acquiring translation equivalences of multiword expressions by normalized correlation frequencies. In: Proceedings of the 2009 conference on empirical methods in natural language processing (EMNLP 2009), Singapore. Association for Computational Linguistics/Suntec, pp 478–486
Baldwin T (2005) Deep lexical acquisition of verb-particle constructions. Comput Speech Lang Spec Issue MWEs 19(4):398–414
Baldwin T (2011) MWEs and topic modelling: enhancing machine learning with linguistics. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, p 1. http://www.aclweb.org/anthology/W/W11/W11-0801
Baldwin T, Tanaka T (2004) Translation by machine of complex nominals: getting it right. In: Tanaka T, Villavicencio A, Bond F, Korhonen A (eds) Proceedings of the ACL workshop on multiword expressions: integrating processing (MWE 2004), Barcelona. Association for Computational Linguistics, pp 24–31
Baldwin T, Bannard C, Tanaka T, Widdows D (2003) An empirical model of multiword expression decomposability. In: Bond F, Korhonen A, McCarthy D, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: analysis, acquisition and treatment (MWE 2003), Sapporo. Association for Computational Linguistics, pp 89–96. doi:10.3115/1119282.1119294, http://www.aclweb.org/anthology/W03-1812
Banerjee S, Pedersen T (2003) The design, implementation, and use of the Ngram Statistic Package. In: Proceedings of the fourth international conference on intelligent text processing and computational linguistics, Mexico City, pp 370–381
Bannard C (2005) Learning about the meaning of verb-particle constructions from corpora. Comput Speech Lang Spec Issue MWEs 19(4):467–478
Bejček E, Stranak P, Pecina P (2013) Syntactic identification of occurrences of multiword expressions in text using a lexicon with dependency structures. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the 9th workshop on multiword expressions (MWE 2013), Atlanta. Association for Computational Linguistics, pp 106–115. http://www.aclweb.org/anthology/W13-1016
Bonin F, Dell’Orletta F, Montemagni S, Venturi G (2010a) A contrastive approach to multi-word extraction from domain-specific corpora. In: Proceedings of the seventh international conference on language resources and evaluation (LREC 2010), Valletta. European Language Resources Association
Bonin F, Dell’Orletta F, Venturi G, Montemagni S (2010b) Contrastive filtering of domain-specific multi-word terms from different types of corpora. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 76–79
Bouamor D, Semmar N, Zweigenbaum P (2012) Identifying bilingual multi-word expressions for statistical machine translation. In: Proceedings of the eighth international conference on language resources and evaluation (LREC 2012), Istanbul. European Language Resources Association
Briscoe T, Carroll J, Watson R (2006) The second release of the RASP system. In: Curran J (ed) Proceedings of the COLING/ACL 2006 interactive presentation sessions, Sydney. Association for Computational Linguistics, pp 77–80. http://www.aclweb.org/anthology/P/P06/P06-4020
Bungum L, Gambäck B, Lynum A, Marsi E (2013) Improving word translation disambiguation by capturing multiword expressions with dictionaries. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the 9th workshop on multiword expressions (MWE 2013), Atlanta. Association for Computational Linguistics, pp 21–30. http://www.aclweb.org/anthology/W13-1003
Burnard L (2007) User reference guide for the British National Corpus. Technical report, Oxford University Computing Services
Butnariu C, Kim SN, Nakov P, Séaghdha DO, Szpakowicz S, Veale T (2010) Semeval-2 task 9: the interpretation of noun compounds using paraphrasing verbs and prepositions. In: Erk K, Strapparava C (eds) Proceedings of the 5th international workshop on semantic evaluation (SemEval 2010), Uppsala. Association for Computational Linguistics, pp 39–44. http://www.aclweb.org/anthology/S10-1007
Carpuat M, Diab M (2010) Task-based evaluation of multiword expressions: a pilot study in statistical machine translation. In: Proceedings of human language technology: the 2010 annual conference of the North American chapter of the Association for Computational Linguistics (NAACL 2010), Los Angeles. Association for Computational Linguistics, pp 242–245. http://www.aclweb.org/anthology/N10-1029
Chen SF, Goodman J (1999) An empirical study of smoothing techniques for language modeling. Comput Speech Lang 13(4):359–394
Church K, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29
Constant M, Sigogne A (2011) MWU-aware part-of-speech tagging with a CRF model and lexical resources. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 49–56. http://www.aclweb.org/anthology/W/W11/W11-0809
Constant M, Roux JL, Sigogne A (2013) Combining compound recognition and PCFG-LA parsing with word lattices and conditional random fields. ACM Trans Speech Lang Process Spec Issue Multiword Expr Theory Pract Use Part 2 (TSLP) 10(3):1–24
Cook P, Stevenson S (2006) Classifying particle semantics in English verb-particle constructions. In: Moirón BV, Villavicencio A, McCarthy D, Evert S, Stevenson S (eds) Proceedings of the COLING/ACL workshop on multiword expressions: identifying and exploiting underlying properties (MWE 2006), Sydney. Association for Computational Linguistics, pp 45–53. http://www.aclweb.org/anthology/W/W06/W06-1207
Cook P, Fazly A, Stevenson S (2007) Pulling their weight: exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In: Grégoire N, Evert S, Kim SN (eds) Proceedings of the ACL workshop on a broader perspective on multiword expressions (MWE 2007), Prague. Association for Computational Linguistics, pp 41–48. http://www.aclweb.org/anthology/W/W07/W07-1106
Cook P, Fazly A, Stevenson S (2008) The VNC-tokens dataset. In: Grégoire N, Evert S, Krenn B (eds) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, pp 19–22
Daille B (2003) Conceptual structuring through term variations. In: Bond F, Korhonen A, McCarthy D, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: analysis, acquisition and treatment (MWE 2003), Sapporo. Association for Computational Linguistics, pp 9–16. doi:10.3115/1119282.1119284. http://www.aclweb.org/anthology/W03-1802
Daille B, Dufour-Kowalski S, Morin E (2004) French-English multi-word term alignment based on lexical context analysis. In: Proceedings of the fourth international conference on language resources and evaluation (LREC 2004), Lisbon. European Language Resources Association, pp 919–922
Déjean H, Gaussier É, Sadat F (2002) An approach based on multilingual thesauri and model combination for bilingual lexicon extraction. In: Proceedings of the 19th international conference on computational linguistics (COLING 2002), Taipei. http://aclweb.org/anthology-new/C/C02/C02-1166.pdf
de Medeiros Caseli H, Villavicencio A, Machado A, Finatto MJ (2009) Statistically-driven alignment-based multiword expression identification for technical domains. In: Anastasiou D, Hashimoto C, Nakov P, Kim SN (eds) Proceedings of the ACL workshop on multiword expressions: identification, interpretation, disambiguation, applications (MWE 2009), Singapore. Association for Computational Linguistics/Suntec, pp 1–8
de Medeiros Caseli H, Ramisch C, das Graças Volpe Nunes M, Villavicencio A (2010) Alignment-based extraction of multiword expressions. Lang Resour Eval Spec Issue Multiword Express Hard Going Plain Sail 44(1–2):59–77. doi:10.1007/s10579-009-9097-9, http://www.springerlink.com/content/H7313427H78865MG
Dias G (2003) Multiword unit hybrid extraction. In: Bond F, Korhonen A, McCarthy D, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: analysis, acquisition and treatment (MWE 2003), Sapporo. Association for Computational Linguistics, pp 41–48. doi:10.3115/1119282.1119288. http://www.aclweb.org/anthology/W03-1806
Duan J, Lu R, Wu W, Hu Y, Tian Y (2006) A bio-inspired approach for multi-word expression extraction. In: Curran J (ed) Proceedings of the COLING/ACL 2006 main conference poster sessions, Sydney. Association for Computational Linguistics, pp 176–182. http://www.aclweb.org/anthology/P/P06/P06-2023
Dunning T (1993) Accurate methods for the statistics of surprise and coincidence. Comput Linguist 19(1):61–74
Duran MS, Ramisch C, Aluísio SM, Villavicencio A (2011) Identifying and analyzing Brazilian Portuguese complex predicates. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 74–82. http://www.aclweb.org/anthology/W/W11/W11-0812
Evert S (2004) The statistics of word cooccurrences: word pairs and collocations. PhD thesis, Institut für maschinelle Sprachverarbeitung, University of Stuttgart, Stuttgart, 353p
Evert S, Krenn B (2005) Using small random samples for the manual evaluation of statistical association measures. Comput Speech Lang Spec Issue MWEs 19(4):450–466
Fazly A, Stevenson S (2007) Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In: Grégoire N, Evert S, Kim SN (eds) Proceedings of the ACL workshop on a broader perspective on multiword expressions (MWE 2007), Prague. Association for Computational Linguistics, pp 9–16. http://www.aclweb.org/anthology/W/W07/W07-1102
Finlayson M, Kulkarni N (2011) Detecting multi-word expressions improves word sense disambiguation. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 20–24. http://www.aclweb.org/anthology/W/W11/W11-0805
Frantzi K, Ananiadou S, Mima H (2000) Automatic recognition of multiword terms: the C-value/NC-value method. Int J Digit Libr 3(2):115–130
Fritzinger F, Weller M, Heid U (2010) A survey of idiomatic preposition-noun-verb triples on token level. In: Proceedings of the seventh international conference on language resources and evaluation (LREC 2010), Valletta. European Language Resources Association, pp 2908–2914
Gil A, Dias G (2003) Using masks, suffix array-based data structures and multidimensional arrays to compute positional n-gram statistics from corpora. In: Bond F, Korhonen A, McCarthy D, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: analysis, acquisition and treatment (MWE 2003), Sapporo. Association for Computational Linguistics, pp 25–32. doi:10.3115/1119282.1119286, http://www.aclweb.org/anthology/W03-1804
Girju R, Moldovan D, Tatu M, Antohe D (2005) On the semantics of noun compounds. Comput Speech Lang Spec Issue MWEs 19(4):479–496
Good IJ (1953) The population frequencies of species and the estimation of population parameters. Biometrika 40(3–4):237–264. doi:10.1093/biomet/40.3-4.237
Graliński F, Savary A, Czerepowicka M, Makowiecki F (2010) Computational lexicography of multi-word units: how efficient can it be? In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 1–9
Green S, de Marneffe MC, Bauer J, Manning CD (2011) Multiword expression identification with tree substitution grammars: a parsing tour de force with French. In: Barzilay R, Johnson M (eds) Proceedings of the 2011 conference on empirical methods in natural language processing (EMNLP 2011), Edinburgh. Association for Computational Linguistics, pp 725–735. http://www.aclweb.org/anthology/D11-1067
Grefenstette G (1999) The world wide web as a resource for example-based machine translation tasks. In: Proceedings of the twenty-first international conference on translating and the computer, ASLIB, London
Grégoire N (2007) Design and implementation of a lexicon of Dutch multiword expressions. In: Grégoire N, Evert S, Kim SN (eds) Proceedings of the ACL workshop on a broader perspective on multiword expressions (MWE 2007), Prague. Association for Computational Linguistics, pp 17–24. http://www.aclweb.org/anthology/W/W07/W07-1103
Grégoire N (2010) DuELME: a Dutch electronic lexicon of multiword expressions. Lang Resour Eval Spec Issue Multiword Expr Hard Going Plain Sail 44(1–2):23–39. doi:10.1007/s10579-009-9094-z. http://www.springerlink.com/content/7308605442W17698
Grégoire N, Evert S, Krenn B (eds) (2008) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, 57p. http://www.lrec-conf.org/proceedings/lrec2008/workshops/W20_Proceedings.pdf
Gurrutxaga A, Alegria I (2011) Automatic extraction of NV expressions in Basque: basic issues on cooccurrence techniques. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 2–7. http://www.aclweb.org/anthology/W/W11/W11-0802
Haugereid P, Bond F (2011) Extracting transfer rules for multiword expressions from parallel corpora. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 92–100. http://www.aclweb.org/anthology/W/W11/W11-0814
Hendrickx I, Kim SN, Kozareva Z, Nakov P, Séaghdha DO, Padó S, Pennacchiotti M, Romano L, Szpakowicz S (2010) Semeval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. In: Erk K, Strapparava C (eds) Proceedings of the 5th international workshop on semantic evaluation (SemEval 2010), Uppsala. Association for Computational Linguistics, pp 33–38. http://www.aclweb.org/anthology/S10-1006
Hoang HH, Kim SN, Kan MY (2009) A re-examination of lexical association measures. In: Anastasiou D, Hashimoto C, Nakov P, Kim SN (eds) Proceedings of the ACL workshop on multiword expressions: identification, interpretation, disambiguation, applications (MWE 2009), Singapore. Association for Computational Linguistics/Suntec, pp 31–39
Hogan D, Foster J, van Genabith J (2011) Decreasing lexical data sparsity in statistical syntactic parsing – experiments with named entities. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 14–19. http://www.aclweb.org/anthology/W/W11/W11-0804
Izumi T, Imamura K, Kikui G, Sato S (2010) Standardizing complex functional expressions in Japanese predicates: applying theoretically-based paraphrasing rules. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 63–71
Jurafsky D, Martin JH (2008) Speech and language processing, 2nd edn. Prentice Hall, Upper Saddle River, 1024p
Justeson JS, Katz SM (1995) Technical terminology: some linguistic properties and an algorithm for identification in text. Nat Lang Eng 1(1):9–27
Keller F, Lapata M (2003) Using the web to obtain frequencies for unseen bigrams. Comput Linguist Spec Issue Web Corpus 29(3):459–484
Kim SN, Baldwin T (2013) A lexical semantic approach to interpreting and bracketing English noun compounds. Nat Lang Eng Spec Issue Noun Compd 19(3):385–407. doi:10.1017/S1351324913000107, http://journals.cambridge.org/article_S1351324913000107
Kim SN, Nakov P (2011) Large-scale noun compound interpretation using bootstrapping and the web as a corpus. In: Barzilay R, Johnson M (eds) Proceedings of the 2011 conference on empirical methods in natural language processing (EMNLP 2011), Edinburgh. Association for Computational Linguistics, pp 648–658. http://www.aclweb.org/anthology/D11-1060
Kneser R, Ney H (1995) Improved backing-off for M-gram language modeling. In: Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP 1995), Detroit, vol 1, pp 181–184. doi:10.1109/ICASSP.1995.479394, http://dx.doi.org/10.1109/ICASSP.1995.479394
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the tenth machine translation summit (MT Summit 2005), Phuket. Asian-Pacific Association for Machine Translation, pp 79–86
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the Association for Computational Linguistics (ACL 2007), Prague. Association for Computational Linguistics, pp 177–180
Korkontzelos I, Manandhar S (2010) Can recognising multiword expressions improve shallow parsing? In: Proceedings of human language technology: the 2010 annual conference of the North American chapter of the Association for Computational Linguistics (NAACL 2010), Los Angeles. Association for Computational Linguistics, pp 636–644. http://www.aclweb.org/anthology/N10-1089
Kulkarni N, Finlayson M (2011) jMWE: a java toolkit for detecting multi-word expressions. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 122–124. http://www.aclweb.org/anthology/W/W11/W11-0818
Lapata M (2002) The disambiguation of nominalizations. Comput Linguist 28(3):357–388
Laporte É, Voyatzi S (2008) An electronic dictionary of French multiword adverbs. In: Grégoire N, Evert S, Krenn B (eds) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, pp 31–34
Laporte É, Nakamura T, Voyatzi S (2008) A French corpus annotated for multiword nouns. In: Grégoire N, Evert S, Krenn B (eds) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, pp 27–30
Li Z, Callison-Burch C, Dyer C, Ganitkevitch J, Khudanpur S, Schwartz L, Thornton WNG, Weese J, Zaidan OF (2009) Joshua: an open source toolkit for parsing-based machine translation. In: Proceedings of the fourth workshop on statistical machine translation (WMT 2009), Athens. Association for Computational Linguistics, pp 135–139
Manber U, Myers G (1990) Suffix arrays: a new method for on-line string searches. In: SODA ’90: proceedings of the first annual ACM-SIAM symposium on discrete algorithms, San Francisco. Society for Industrial and Applied Mathematics, Philadelphia, pp 319–327
Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT, Cambridge, 620p
Martens S (2010) Varro: an algorithm and toolkit for regular structure discovery in treebanks. In: Huang CR, Jurafsky D (eds) Proceedings of the 23rd international conference on computational linguistics (COLING 2010)—posters, Beijing. The Coling 2010 Organizing Committee, pp 810–818. http://www.aclweb.org/anthology/C10-2093
Martens S, Vandeghinste V (2010) An efficient, generic approach to extracting multi-word expressions from dependency trees. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 84–87
McCarthy D, Keller B, Carroll J (2003) Detecting a continuum of compositionality in phrasal verbs. In: Bond F, Korhonen A, McCarthy D, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: analysis, acquisition and treatment (MWE 2003), Sapporo. Association for Computational Linguistics, pp 73–80. doi:10.3115/1119282.1119292, http://www.aclweb.org/anthology/W03-1810
McCarthy D, Venkatapathy S, Joshi A (2007) Detecting compositionality of verb-object combinations using selectional preferences. In: Eisner J (ed) Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL 2007), Prague. Association for Computational Linguistics, pp 369–379. http://www.aclweb.org/anthology/D/D07/D07-1039
Melamed ID (1997) Automatic discovery of non-compositional compounds in parallel data. In: Proceedings of the 2nd conference on empirical methods in natural language processing (EMNLP-2), Brown University, Providence. Association for Computational Linguistics, pp 97–108
Michou A, Seretan V (2009) A tool for multi-word expression extraction in modern Greek using syntactic parsing. In: Proceedings of the demonstrations session at EACL 2009, Athens. Association for Computational Linguistics, pp 45–48
Mikheev A (2002) Periods, capitalized words, etc. Comput Linguist 28(3):289–318
Mirroshandel SA, Nasr A, Roux JL (2012) Semi-supervised dependency parsing using lexical affinities. In: Proceedings of the 50th annual meeting of the Association for Computational Linguistics (vol 1: long papers), Jeju Island. Association for Computational Linguistics, pp 777–785. http://www.aclweb.org/anthology/P12-1082
Mitkov R, Monti J, Pastor GC, Seretan V (eds) (2013) Proceedings of the MT summit 2013 workshop on multi-word units in machine translation and translation technology (MUMTTT 2013), Nice. European Association for Machine Translation, 71p. http://www.mtsummit2013.info/workshop4.asp
Monti J, Barreiro A, Elia A, Marano F, Napoli A (2011) Taking on new challenges in multi-word unit processing for machine translation. In: Proceedings of the second international workshop on free/open-source rule-based machine translation, Barcelona
Morin E, Daille B (2010) Compositionality and lexical alignment of multi-word terms. Lang Resour Eval Spec Issue Multiword Express Hard Going Plain Sail 44(1–2):79–95. doi:10.1007/s10579-009-9098-8, http://www.springerlink.com/content/30264870R1K04744
Nakov P (2007) Using the web as an implicit training set: application to noun compound syntax and semantics. PhD thesis, EECS Department, University of California, Berkeley, 392p
Nakov P (2008a) Improved statistical machine translation using monolingual paraphrases. In: Ghallab M, Spyropoulos CD, Fakotakis N, Avouris NM (eds) Proceedings of the 18th European conference on artificial intelligence (ECAI 2008), Patras. Frontiers in Artificial Intelligence and Applications, vol 178. IOS Press, pp 338–342
Nakov P (2008b) Paraphrasing verbs for noun compound interpretation. In: Grégoire N, Evert S, Krenn B (eds) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, pp 46–49
Nakov P (2013) On the interpretation of noun compounds: syntax, semantics, and entailment. Nat Lang Eng Spec Issue Noun Compd 19(3):291–330. doi:10.1017/S1351324913000065, http://journals.cambridge.org/article_S1351324913000065
Nakov P, Hearst MA (2005) Search engine statistics beyond the n-gram: application to noun compound bracketing. In: Dagan I, Gildea D (eds) Proceedings of the ninth conference on natural language learning (CoNLL-2005), University of Michigan, Ann Arbor. Association for Computational Linguistics, pp 17–24. http://www.aclweb.org/anthology/W/W05/W05-0603
Nakov P, Hearst MA (2008) Solving relational similarity problems using the web as a corpus. In: Proceedings of the 46th annual meeting of the Association for Computational Linguistics: human language technology (ACL-08: HLT), Columbus. Association for Computational Linguistics, pp 452–460
Nasr A, Bechet F, Rey JF, Favre B, Roux JL (2011) MACAON an NLP tool suite for processing word lattices. In: Proceedings of the ACL 2011 system demonstrations, Portland. Association for Computational Linguistics, pp 86–91. http://www.aclweb.org/anthology/P11-4015
Newman MEJ (2005) Power laws, Pareto distributions and Zipf’s law. Contemp Phys 46:323–351
Nicholson J, Baldwin T (2006) Interpretation of compound nominalisations using corpus and web statistics. In: Moirón BV, Villavicencio A, McCarthy D, Evert S, Stevenson S (eds) Proceedings of the COLING/ACL workshop on multiword expressions: identifying and exploiting underlying properties (MWE 2006), Sydney. Association for Computational Linguistics, pp 54–61. http://www.aclweb.org/anthology/W/W06/W06-1208
Nicholson J, Baldwin T (2008) Interpreting compound nominalisations. In: Grégoire N, Evert S, Krenn B (eds) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, pp 43–45
Nulty P, Costello F (2010) UCD-PN: Selecting general paraphrases using conditional probability. In: Erk K, Strapparava C (eds) Proceedings of the 5th international workshop on semantic evaluation (SemEval 2010), Uppsala. Association for Computational Linguistics, pp 234–237. http://www.aclweb.org/anthology/S10-1052
Nulty P, Costello F (2013) General and specific paraphrases of semantic relations between nouns. Nat Lang Eng Spec Issue Noun Compd 19(3):357–384. doi:10.1017/S1351324913000089, http://journals.cambridge.org/article_S1351324913000089
Pal S, Naskar SK, Pecina P, Bandyopadhyay S, Way A (2010) Handling named entities and compound verbs in phrase-based statistical machine translation. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 45–53
Pearce D (2002) A comparative evaluation of collocation extraction techniques. In: Proceedings of the third international conference on language resources and evaluation (LREC 2002), Las Palmas. European Language Resources Association, pp 1530–1536
Pecina P (2005) An extensive empirical study of collocation extraction methods. In: Proceedings of the ACL 2005 student research workshop, Ann Arbor. Association for Computational Linguistics, pp 13–18. http://www.aclweb.org/anthology/P/P05/P05-2003
Pecina P (2008) Reference data for Czech collocation extraction. In: Grégoire N, Evert S, Krenn B (eds) Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008), Marrakech, pp 11–14
Pedersen T, Banerjee S, McInnes B, Kohli S, Joshi M, Liu Y (2011) The n-gram statistics package (Text::NSP): a flexible tool for identifying n-grams, collocations, and word associations. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 131–133. http://www.aclweb.org/anthology/W/W11/W11-0821
Planas E, Furuse O (2000) Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In: Proceedings of the 18th international conference on computational linguistics (COLING 2000), Saarbrücken. http://aclweb.org/anthology-new/C/C00/C00-2090.pdf
Ramisch C (2009) Multiword terminology extraction for domain-specific documents. Master’s thesis, École Nationale Supérieure d’Informatique et de Mathématiques Appliquées, Grenoble, 79p
Ramisch C, Villavicencio A, Moura L, Idiart M (2008) Picking them up and figuring them out: verb-particle constructions, noise and idiomaticity. In: Clark A, Toutanova K (eds) Proceedings of the twelfth conference on natural language learning (CoNLL 2008), Manchester. The Coling 2008 Organizing Committee, pp 49–56. http://www.aclweb.org/anthology/W08-2107
Ramisch C, de Medeiros Caseli H, Villavicencio A, Machado A, Finatto MJ (2010) A hybrid approach for multiword expression identification. In: Proceedings of the 9th international conference on computational processing of Portuguese language (PROPOR 2010), Porto Alegre. Lecture notes in computer science (Lecture notes in artificial intelligence), vol 6001. Springer, pp 65–74. doi:10.1007/978-3-642-12320-7_9, http://www.springerlink.com/content/978-3-642-12319-1
Ren Z, Lü Y, Cao J, Liu Q, Huang Y (2009) Improving statistical machine translation using domain bilingual multiword expressions. In: Anastasiou D, Hashimoto C, Nakov P, Kim SN (eds) Proceedings of the ACL workshop on multiword expressions: identification, interpretation, disambiguation, applications (MWE 2009), Singapore. Association for Computational Linguistics/Suntec, pp 47–54
Roller S, Schulte im Walde S, Scheible S (2013) The (un)expected effects of applying standard cleansing models to human ratings on compositionality. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the 9th workshop on multiword expressions (MWE 2013), Atlanta. Association for Computational Linguistics, pp 32–41. http://www.aclweb.org/anthology/W13-1005
Sag I, Baldwin T, Bond F, Copestake A, Flickinger D (2002) Multiword expressions: a pain in the neck for NLP. In: Proceedings of the 3rd international conference on intelligent text processing and computational linguistics (CICLing-2002), Mexico City. Lecture notes in computer science, vol 2276/2010. Springer, pp 1–15
SanJuan E, Dowdall J, Ibekwe-SanJuan F, Rinaldi F (2005) A symbolic approach to automatic multiword term structuring. Comput Speech Lang Spec Issue MWEs 19(4):524–542
Schmid H (1994) Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the international conference on new methods in language processing, Manchester, pp 44–49. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.1139
Schone P, Jurafsky D (2001) Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In: Lee L, Harman D (eds) Proceedings of the 2001 conference on empirical methods in natural language processing (EMNLP 2001), Pittsburgh. Association for Computational Linguistics, pp 100–108
Schuler W, Joshi A (2011) Tree-rewriting models of multi-word expressions. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 25–30. http://www.aclweb.org/anthology/W/W11/W11-0806
Ó Séaghdha D, Copestake A (2013) Interpreting compound nouns with kernel methods. Nat Lang Eng Spec Issue Noun Compd 19(3):331–356. doi:10.1017/S1351324912000368, http://journals.cambridge.org/article_S1351324912000368
Seretan V (2008) Collocation extraction based on syntactic parsing. PhD thesis, University of Geneva, Geneva, 249p
Seretan V (2011) Syntax-based collocation extraction. Text, speech and language technology, vol 44, 1st edn. Springer, Dordrecht, 212p
Seretan V, Wehrli E (2006) Multilingual collocation extraction: issues and solutions. In: Witt A, Sérasset G, Armstrong S, Breen J, Heid U, Sasaki F (eds) Proceedings of the ACL workshop on multilingual language resources and interoperability, Sydney. Association for Computational Linguistics, pp 40–49. http://www.aclweb.org/anthology/W/W06/W06-1006
Seretan V, Wehrli E (2009) Multilingual collocation extraction with a syntactic parser. Lang Resour Eval Spec Issue Multiling Lang Resour Interoper 43(1):71–85. doi:10.1007/s10579-008-9075-7, http://www.springerlink.com/content/341877K50497682X
Seretan V, Wehrli E (2011) Fipscoview: on-line visualisation of collocations extracted from multilingual parallel corpora. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 125–127. http://www.aclweb.org/anthology/W/W11/W11-0819
Silva J, Lopes G (1999) A local maxima method and a fair dispersion normalization for extracting multi-word units from corpora. In: Proceedings of the sixth meeting on mathematics of language (MOL6), Orlando, pp 369–381
Silva J, Lopes G (2010) Towards automatic building of document keywords. In: Huang CR, Jurafsky D (eds) Proceedings of the 23rd international conference on computational linguistics (COLING 2010)—posters, Beijing. The Coling 2010 Organizing Committee, pp 1149–1157. http://www.aclweb.org/anthology/C10-2132
da Silva JF, Dias G, Guilloré S, Lopes JGP (1999) Using localmaxs algorithm for the extraction of contiguous and non-contiguous multiword lexical units. In: Proceedings of the 9th Portuguese conference on artificial intelligence: progress in artificial intelligence, London. EPIA 1999, pp 113–132. Springer. http://dl.acm.org/citation.cfm?id=645377.651205
Smadja FA (1993) Retrieving collocations from text: Xtract. Comput Linguist 19(1):143–177
Stymne S (2009) A comparison of merging strategies for translation of German compounds. In: Proceedings of the student research workshop at EACL 2009, Athens, pp 61–69
Stymne S (2011) Pre- and postprocessing for statistical machine translation into Germanic languages. In: Proceedings of the ACL 2011 student research workshop, Portland. Association for Computational Linguistics, pp 12–17. http://www.aclweb.org/anthology/P11-3003
Szpakowicz S, Bond F, Nakov P, Kim SN (2013) On the semantics of noun compounds. In: Nat Lang Eng Spec Issue Noun Compd 19(3):289–290. Cambridge University Press, Cambridge
Tanaka T, Baldwin T (2003) Noun-noun compound machine translation: a feasibility study on shallow processing. In: Bond F, Korhonen A, McCarthy D, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: analysis, acquisition and treatment (MWE 2003), Sapporo. Association for Computational Linguistics, pp 17–24. doi:10.3115/1119282.1119285. http://www.aclweb.org/anthology/W03-1803
Tsvetkov Y, Wintner S (2010) Extraction of multi-word expressions from small parallel corpora. In: Huang CR, Jurafsky D (eds) Proceedings of the 23rd international conference on computational linguistics (COLING 2010)—posters, Beijing. The Coling 2010 Organizing Committee, pp 1256–1264. http://www.aclweb.org/anthology/C10-2144
Tsvetkov Y, Wintner S (2011) Identification of multi-word expressions by combining multiple linguistic information sources. In: Barzilay R, Johnson M (eds) Proceedings of the 2011 conference on empirical methods in natural language processing (EMNLP 2011), Edinburgh. Association for Computational Linguistics, pp 836–845. http://www.aclweb.org/anthology/D11-1077
Uchiyama K, Baldwin T, Ishizaki S (2005) Disambiguating Japanese compound verbs. Comput Speech Lang Spec Issue MWEs 19(4):497–512
Uresova Z, Hajic J, Fucikova E, Sindlerova J (2013) An analysis of annotation of verb-noun idiomatic combinations in a parallel dependency corpus. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the 9th workshop on multiword expressions (MWE 2013), Atlanta. Association for Computational Linguistics, pp 58–63. http://www.aclweb.org/anthology/W13-1009
Venkatapathy S, Joshi AK (2006) Using information about multi-word expressions for the word-alignment task. In: Moirón BV, Villavicencio A, McCarthy D, Evert S, Stevenson S (eds) Proceedings of the COLING/ACL workshop on multiword expressions: identifying and exploiting underlying properties (MWE 2006), Sydney. Association for Computational Linguistics, pp 20–27. http://www.aclweb.org/anthology/W/W06/W06-1204
Villavicencio A, Bond F, Korhonen A, McCarthy D (2005) Introduction to the special issue on multiword expressions: having a crack at a hard nut. Comput Speech Lang Spec Issue MWEs 19(4):365–377
Villavicencio A, Kordoni V, Zhang Y, Idiart M, Ramisch C (2007) Validation and evaluation of automatically acquired multiword expressions for grammar engineering. In: Eisner J (ed) Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL 2007), Prague. Association for Computational Linguistics, pp 1034–1043. http://www.aclweb.org/anthology/D/D07/D07-1110
Vincze V, Nagy TI, Berend G (2011) Detecting noun compounds and light verb constructions: a contrastive study. In: Kordoni V, Ramisch C, Villavicencio A (eds) Proceedings of the ACL workshop on multiword expressions: from parsing and generation to the real world (MWE 2011), Portland. Association for Computational Linguistics, pp 116–121. http://www.aclweb.org/anthology/W/W11/W11-0817
Wehrli E (1998) Translating idioms. In: Proceedings of the 36th annual meeting of the Association for Computational Linguistics and 17th international conference on computational linguistics, Montreal, vol 2. Association for Computational Linguistics, pp 1388–1392. doi:10.3115/980691.980795. http://www.aclweb.org/anthology/P98-2226
Wehrli E, Seretan V, Nerima L (2010) Sentence analysis and collocation identification. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 27–35
Wermter J, Hahn U (2006) You can’t beat frequency (unless you use linguistic knowledge) – a qualitative evaluation of association measures for collocation and term extraction. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (COLING/ACL 2006), Sydney. Association for Computational Linguistics, pp 785–792
Xu Y, Goebel R, Ringlstetter C, Kondrak G (2010) Application of the tightness continuum measure to Chinese information retrieval. In: Laporte É, Nakov P, Ramisch C, Villavicencio A (eds) Proceedings of the COLING workshop on multiword expressions: from theory to applications (MWE 2010), Beijing. Association for Computational Linguistics, pp 54–62
Yamamoto M, Church K (2001) Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Comput Linguist 27(1):1–30
Zarrieß S, Kuhn J (2009) Exploiting translational correspondences for pattern-independent MWE identification. In: Anastasiou D, Hashimoto C, Nakov P, Kim SN (eds) Proceedings of the ACL workshop on multiword expressions: identification, interpretation, disambiguation, applications (MWE 2009), Singapore. Association for Computational Linguistics/Suntec, pp 23–30
Zhang Y, Kordoni V (2006) Automated deep lexical acquisition for robust open texts processing. In: Proceedings of the sixth international conference on language resources and evaluation (LREC 2006), Genoa. European Language Resources Association, pp 275–280

Author information

Authors and Affiliations

Aix Marseille University, Marseille, France
Carlos Ramisch

About this chapter

Cite this chapter

Ramisch, C. (2015). State of the Art in MWE Processing. In: Multiword Expressions Acquisition. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-09207-2_3

DOI: https://doi.org/10.1007/978-3-319-09207-2_3
Published: 05 August 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09206-5
Online ISBN: 978-3-319-09207-2
eBook Packages: Computer Science (R0)