Abstract
Building on the syntax-based extraction method presented in the chapter on our method, we extend our practical investigations of collocations in directions less explored in previous work. We begin by enlarging the scope of the extraction method to cover a broader spectrum of collocational phenomena. More precisely, we go beyond binary collocations and propose a tractable method for acquiring complex collocations (i.e., collocations that contain embedded collocations). We then investigate how to automatically find all the syntactic configurations (patterns) that apply to collocations in a given language. These patterns are often chosen arbitrarily, yet they play a crucial role in the quality of extraction results. To overcome this shortcoming, we adapt our original extraction method to allow for data-driven induction of relevant syntactic configurations. Finally, to support the compilation of bilingual collocational resources for machine translation, an NLP application for which collocations are of great importance, we present a method for acquiring translation equivalents for collocations by matching collocations extracted from the source and target language versions of parallel texts.
Notes
- 1.
Villada Moirón (2005) extends MI and \(\chi^2\) in order to deal with candidates of length 3.
- 2.
For the sake of simplicity, we will use the terms bigram, trigram, and, more generally, n-gram to indicate the arity of collocations, even if the component items are not adjacent.
- 3.
An in-depth discussion on collocation chains can be found in Ramos and Chains (2007).
- 4.
If more alternatives are possible, multiple types are generated accordingly.
- 5.
This filter is not implemented in the current version of our system.
- 6.
As illustrated by the trigram ((allgemeine, Gültigkeit), haben), composed of (allgemeine, Gültigkeit) and (Gültigkeit, haben), lit., “general validity have” (Heid, 1994, 232).
- 7.
Note that some items constitute complex units in turn, e.g., premier plan in the fifth 4-gram shown. Also, as suggested by the last 4-grams in the list, our strategy of systematically displaying lemmas rather than word forms led to an unusual presentation of some expressions (such as trouver bon solution possible instead of trouver meilleur solution possible).
- 8.
For Benson et al. (1986a), this table displays only the lexical collocations listed in the preface of the BBI dictionary. Also, the system of Smadja (1993) deals with the following additional types: V-V, N-P, N-D. Similarly, Basili et al. (1994) state that about 20 patterns were used, while our table displays only those that were explicitly mentioned in their publication.
- 9.
Some of the most representative patterns that are currently supported by our extraction system are shown in Table 4.1. As mentioned in Section 4.3.1, the complete list of patterns is actually longer, and is evolving as more and more data is inspected.
- 10.
URL: http://www.ldc.upenn.edu/, accessed June 2010.
- 11.
Since the experiments are not comparative and were conducted independently, the two corpora are of different sizes.
- 12.
More precisely, multi-word units are then identified in Dias (2003) by: (a) applying this measure to both the sequences of words and the corresponding sequences of POS tags; (b) combining the scores obtained; and (c) retaining only the local maxima candidates as valid, according to the method briefly explained in Section 5.1.3.
- 13.
In Seretan and Wehrli (2006) we discuss in detail the problems a (syntax-based or syntax-free) collocation extraction system faces when ported from English to a new language with a richer morphology and more flexible word order.
- 14.
Term introduced by Pearce (2001a).
- 15.
The F-measure is the harmonic mean of precision and recall: \(F = 2PR/(P+R)\). If we consider that the task to solve is to find exactly one translation for each source collocation, then the recall represents, in our case, the number of correct translations returned divided by the number of correct translations expected.
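The F-measure defined in note 15 can be illustrated with a minimal sketch. It assumes exactly one expected translation per source collocation, as stated in that note, and it additionally assumes that precision is the number of correct translations returned divided by the number of translations returned (the note itself spells out only recall). The function names and the placeholder data are hypothetical and are not results from the chapter.

```python
from typing import Dict, Tuple


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, F = 2PR / (P + R)."""
    if precision + recall == 0.0:
        return 0.0  # convention chosen here to avoid division by zero
    return 2.0 * precision * recall / (precision + recall)


def evaluate(returned: Dict[str, str],
             expected: Dict[str, str]) -> Tuple[float, float, float]:
    """Score returned translations against a gold standard.

    Assumes exactly one expected translation per source collocation.
    Precision = correct returned / returned (assumption of this sketch).
    Recall    = correct returned / expected (as in note 15).
    """
    correct = sum(1 for src, tgt in returned.items()
                  if expected.get(src) == tgt)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical placeholder data, not taken from the chapter's experiments:
# four source collocations are expected; three translations are returned,
# two of which match the gold standard.
expected = {"src1": "tgtA", "src2": "tgtB", "src3": "tgtC", "src4": "tgtD"}
returned = {"src1": "tgtA", "src2": "tgtB", "src3": "tgtX"}

precision, recall, f = evaluate(returned, expected)
print(f"P={precision:.3f} R={recall:.3f} F={f:.3f}")
```

With these placeholder values, precision is 2/3, recall is 2/4, and the F-measure is 4/7, which shows how a system that returns no translation for some source collocations is penalised through recall rather than precision.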
References
Basili R, Pazienza MT, Velardi P (1994) A “not-so-shallow” parser for collocational analysis. In: Proceedings of the 15th Conference on Computational Linguistics, Kyoto, Japan, pp 447–453
Benson M, Benson E, Ilson R (1986a) The BBI Dictionary of English Word Combinations. John Benjamins, Amsterdam/Philadelphia
Blaheta D, Johnson M (2001) Unsupervised learning of multi-word verbs. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 54–60
Choueka Y, Klein S, Neuwitz E (1983) Automatic retrieval of frequent idiomatic and collocational expressions in a large corpus. Journal of the Association for Literary and Linguistic Computing 4(1):34–38
Dagan I, Church K (1994) Termight: Identifying and translating technical terminology. In: Proceedings of the 4th Conference on Applied Natural Language Processing (ANLP), Stuttgart, Germany, pp 34–40
Daille B (1994) Approche mixte pour l’extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7
Dias G (2003) Multiword unit hybrid extraction. In: Proceedings of the ACL Workshop on Multiword Expressions, Sapporo, Japan, pp 41–48
van der Eijk P (1993) Automating the acquisition of bilingual terminology. In: Proceedings of the 6th Conference on European Chapter of the Association for Computational Linguistics, Utrecht, The Netherlands, pp 113–119
Fontenelle T (1992) Collocation acquisition from a corpus or from a dictionary: A comparison. In: Proceedings I-II: Papers submitted to the 5th EURALEX International Congress on Lexicography in Tampere, Tampere, Finland, pp 221–228
Frantzi KT, Ananiadou S, Mima H (2000) Automatic recognition of multi-word terms: The C-value/NC-value method. International Journal on Digital Libraries 2(3):115–130
Goldman JP, Nerima L, Wehrli E (2001) Collocation extraction using a syntactic parser. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 61–66
Hausmann FJ (1989) Le dictionnaire de collocations. In: Hausmann F, Reichmann O, Wiegand H, Zgusta L (eds) Wörterbücher: Ein internationales Handbuch zur Lexicographie. Dictionaries, Dictionnaires, de Gruyter, Berlin, pp 1010–1019
Heid U (1994) On ways words work together – research topics in lexical combinatorics. In: Proceedings of the 6th Euralex International Congress on Lexicography (EURALEX ’94), Amsterdam, The Netherlands, pp 226–257
Kilgarriff A, Tugwell D (2001) WORD SKETCH: Extraction and display of significant collocations for lexicography. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 32–38
Kilgarriff A, Kovář V, Krek S, Srdanović I, Tiberius C (2010) A quantitative evaluation of word sketches. In: Proceedings of the 14th EURALEX International Congress, Leeuwarden, The Netherlands
Kim S, Yang Z, Song M, Ahn JH (1999) Retrieving collocations from Korean text. In: Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, College Park, MD, USA, pp 71–81
Kim S, Yoon J, Song M (2001) Automatic extraction of collocations from Korean text. Computers and the Humanities 35(3):273–297
Kupiec J (1993) An algorithm for finding noun phrase correspondences in bilingual corpora. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, OH, USA, pp 17–22
Lin D (1998) Extracting collocations from text corpora. In: First Workshop on Computational Terminology, Montreal, Canada, pp 57–63
Lin D (1999) Automatic identification of non-compositional phrases. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, Morristown, NJ, USA, pp 317–324
Lü Y, Zhou M (2004) Collocation translation acquisition using monolingual corpora. In: Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Barcelona, Spain, pp 167–174
Nerima L, Seretan V, Wehrli E (2003) Creating a multilingual collocation dictionary from large text corpora. In: Companion Volume to the Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL’03), Budapest, Hungary, pp 131–134
Orliac B, Dillinger M (2003) Collocation extraction for machine translation. In: Proceedings of Machine Translation Summit IX, New Orleans, LA, USA, pp 292–298
Pearce D (2001a) Synonymy in collocation extraction. In: Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburgh, PA, USA, pp 41–46
Rögnvaldsson E (2010) Collocations in the minimalist framework. Lambda (18):107–118
Seretan V, Wehrli E (2006) Multilingual collocation extraction: Issues and solutions. In: Proceedings of COLING/ACL Workshop on Multilingual Language Resources and Interoperability, Sydney, Australia, pp 40–49
Seretan V, Wehrli E (2007) Collocation translation based on sentence alignment and parsing. In: Actes de la 14e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2007), Toulouse, France, pp 401–410
Smadja F (1993) Retrieving collocations from text: Xtract. Computational Linguistics 19(1):143–177
Smadja F, McKeown K, Hatzivassiloglou V (1996) Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics 22(1):1–38
Villada Moirón MB (2005) Data-driven identification of fixed expressions and their modifiability. PhD thesis, University of Groningen
Wehrli E (2007) Fips, a “deep” linguistic multilingual parser. In: ACL 2007 Workshop on Deep Linguistic Processing, Prague, Czech Republic, pp 120–127
van der Wouden T (2001) Collocational behaviour in non content words. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 16–23
Zinsmeister H, Heid U (2003) Significant triples: Adjective+Noun+Verb combinations. In: Proceedings of the 7th Conference on Computational Lexicography and Text Research (Complex 2003), Budapest, Hungary
© 2011 Springer Science+Business Media B.V.
Cite this chapter
Seretan, V. (2011). Extensions. In: Syntax-Based Collocation Extraction. Text, Speech and Language Technology, vol 44. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-0134-2_5
DOI: https://doi.org/10.1007/978-94-007-0134-2_5
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-0133-5
Online ISBN: 978-94-007-0134-2
eBook Packages: Computer Science (R0)