Extensions

Seretan, Violeta

doi:10.1007/978-94-007-0134-2_5


Part of the book series: Text, Speech and Language Technology (TLTB, volume 44)


Abstract

Building on the syntax-based extraction method presented in Chapter 4, we extend our practical investigations on collocations in directions less explored in previous work. We begin by enlarging the scope of the extraction method to cover a broader spectrum of collocational phenomena. More precisely, we go beyond binary collocations and propose a tractable method for acquiring complex collocations (i.e., collocations that contain embedded collocations). We then investigate how to automatically find all syntactic configurations (patterns) that apply to collocations in a given language. These patterns play a crucial role in the quality of extraction results, yet they are often chosen arbitrarily. To overcome this shortcoming, our original extraction method has been adapted to allow for data-driven induction of relevant syntactic configurations. Finally, in order to support the compilation of bilingual collocational resources for machine translation (an NLP application for which collocations are of great importance), we present a method for acquiring translation equivalents for collocations by matching collocations extracted from the source and target language versions of parallel texts.


Notes

1. Villada Moirón (2005) extends MI and \(\chi^2\) in order to deal with candidates of length 3.
2. For the sake of simplicity, we use the terms bigram, trigram, and, in general, n-gram to indicate the arity of collocations, even if the component items are not adjacent.
3. An in-depth discussion of collocation chains can be found in Ramos and Chains (2007).
4. If more alternatives are possible, multiple types are generated accordingly.
5. This filter is not implemented in the current version of our system.
6. As illustrated by the trigram ((allgemeine, Gültigkeit), haben), composed of (allgemeine, Gültigkeit) and (Gültigkeit, haben), lit., “general validity have” (Heid, 1994, 232).
7. Note that some items constitute complex units in turn, e.g., premier plan in the fifth 4-gram shown. Also, as suggested by the last 4-grams in the list, our strategy of systematically displaying lemmas rather than word forms led to an unusual presentation of some expressions (such as trouver bon solution possible instead of trouver meilleur solution possible).
8. This table displays for Benson et al. (1986a) only the lexical collocations listed in the preface of the BBI dictionary. Also, the system of Smadja (1993) deals with the following additional types: V-V, N-P, N-D. Similarly, Basili et al. (1994) state that about 20 patterns were used, while our table displays only those that were explicitly mentioned in their publication.
9. Some of the most representative patterns that are currently supported by our extraction system are shown in Table 4.1. As mentioned in Section 4.3.1, the complete list of patterns is actually longer, and is evolving as more and more data is inspected.
10. URL: http://www.ldc.upenn.edu/, accessed June 2010.
11. Since the experiments are not comparative and were conducted independently, the two corpora are of different sizes.
12. More precisely, multi-word units are then identified in Dias (2003) by: (a) applying this measure both to sequences of words and to the corresponding sequences of POS tags; (b) combining the scores obtained; and (c) retaining only the local maxima candidates as valid, according to the method briefly explained in Section 5.1.3.
13. In Seretan and Wehrli (2006) we discuss in detail the problems a (syntax-based or syntax-free) collocation extraction system faces when ported from English to a new language with a richer morphology and more flexible word order.
14. Term introduced by Pearce (2001a).
15. The F-measure is the harmonic mean of precision and recall: \(F = 2PR/(P+R)\). If we consider that the task to solve is to find exactly one translation for each source collocation, then the recall represents, in our case, the number of correct translations returned divided by the number of correct translations expected.
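The F-measure in note 15 can be illustrated with a minimal Python sketch. The function names and the counts in the usage example are hypothetical and do not come from the chapter. Recall follows the definition given in the note; precision is assumed here to be the standard ratio of correct translations returned to all translations returned, which the note does not spell out.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # convention for the degenerate case P = R = 0
    return 2 * precision * recall / (precision + recall)


def evaluate(correct_returned: int, total_returned: int, total_expected: int):
    """Return (precision, recall, F) for a translation-finding run.

    recall    = correct translations returned / correct translations expected
    precision = correct translations returned / translations returned (assumed)
    """
    precision = correct_returned / total_returned if total_returned else 0.0
    recall = correct_returned / total_expected if total_expected else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical counts, for illustration only.
p, r, f = evaluate(correct_returned=80, total_returned=100, total_expected=200)
print(f"P={p:.2f} R={r:.2f} F={f:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a system that returns few but accurate translations is still penalised for its low recall.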

References

Basili R, Pazienza MT, Velardi P (1994) A “not-so-shallow” parser for collocational analysis. In: Proceedings of the 15th Conference on Computational Linguistics, Kyoto, Japan, pp 447–453
Benson M, Benson E, Ilson R (1986a) The BBI Dictionary of English Word Combinations. John Benjamins, Amsterdam/Philadelphia
Blaheta D, Johnson M (2001) Unsupervised learning of multi-word verbs. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 54–60
Choueka Y, Klein S, Neuwitz E (1983) Automatic retrieval of frequent idiomatic and collocational expressions in a large corpus. Journal of the Association for Literary and Linguistic Computing 4(1):34–38
Dagan I, Church K (1994) Termight: Identifying and translating technical terminology. In: Proceedings of the 4th Conference on Applied Natural Language Processing (ANLP), Stuttgart, Germany, pp 34–40
Daille B (1994) Approche mixte pour l’extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7
Dias G (2003) Multiword unit hybrid extraction. In: Proceedings of the ACL Workshop on Multiword Expressions, Sapporo, Japan, pp 41–48
van der Eijk P (1993) Automating the acquisition of bilingual terminology. In: Proceedings of the 6th Conference on European Chapter of the Association for Computational Linguistics, Utrecht, The Netherlands, pp 113–119
Fontenelle T (1992) Collocation acquisition from a corpus or from a dictionary: A comparison. Proceedings I-II Papers submitted to the 5th EURALEX International Congress on Lexicography in Tampere, Tampere, Finland, pp 221–228
Frantzi KT, Ananiadou S, Mima H (2000) Automatic recognition of multi-word terms: The C-value/NC-value method. International Journal on Digital Libraries 2(3):115–130
Goldman JP, Nerima L, Wehrli E (2001) Collocation extraction using a syntactic parser. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 61–66
Hausmann FJ (1989) Le dictionnaire de collocations. In: Hausmann F, Reichmann O, Wiegand H, Zgusta L (eds) Wörterbücher: Ein internationales Handbuch zur Lexicographie. Dictionaries, Dictionnaires, de Gruyter, Berlin, pp 1010–1019
Heid U (1994) On ways words work together – research topics in lexical combinatorics. In: Proceedings of the 6th Euralex International Congress on Lexicography (EURALEX ’94), Amsterdam, The Netherlands, pp 226–257
Kilgarriff A, Tugwell D (2001) WORD SKETCH: Extraction and display of significant collocations for lexicography. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 32–38
Kilgarriff A, Kovář V, Krek S, Srdanović I, Tiberius C (2010) A quantitative evaluation of word sketches. In: Proceedings of the 14th EURALEX International Congress, Leeuwarden, The Netherlands
Kim S, Yang Z, Song M, Ahn JH (1999) Retrieving collocations from Korean text. In: Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, College Park, MD, USA, pp 71–81
Kim S, Yoon J, Song M (2001) Automatic extraction of collocations from Korean text. Computers and the Humanities 35(3):273–297
Kupiec J (1993) An algorithm for finding noun phrase correspondences in bilingual corpora. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, OH, USA, pp 17–22
Lin D (1998) Extracting collocations from text corpora. In: First Workshop on Computational Terminology, Montreal, Canada, pp 57–63
Lin D (1999) Automatic identification of non-compositional phrases. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, Morristown, NJ, USA, pp 317–324
Lü Y, Zhou M (2004) Collocation translation acquisition using monolingual corpora. In: Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Barcelona, Spain, pp 167–174
Nerima L, Seretan V, Wehrli E (2003) Creating a multilingual collocation dictionary from large text corpora. In: Companion Volume to the Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL’03), Budapest, Hungary, pp 131–134
Orliac B, Dillinger M (2003) Collocation extraction for machine translation. In: Proceedings of Machine Translation Summit IX, New Orleans, LA, USA, pp 292–298
Pearce D (2001a) Synonymy in collocation extraction. In: Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburgh, PA, USA, pp 41–46
Rögnvaldsson E (2010) Collocations in the minimalist framework. Lambda (18):107–118
Seretan V, Wehrli E (2006) Multilingual collocation extraction: Issues and solutions. In: Proceedings of COLING/ACL Workshop on Multilingual Language Resources and Interoperability, Sydney, Australia, pp 40–49
Seretan V, Wehrli E (2007) Collocation translation based on sentence alignment and parsing. In: Actes de la 14e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2007), Toulouse, France, pp 401–410
Smadja F (1993) Retrieving collocations from text: Xtract. Computational Linguistics 19(1):143–177
Smadja F, McKeown K, Hatzivassiloglou V (1996) Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics 22(1):1–38
Villada Moirón MB (2005) Data-driven identification of fixed expressions and their modifiability. PhD thesis, University of Groningen
Wehrli E (2007) Fips, a “deep” linguistic multilingual parser. In: ACL 2007 Workshop on Deep Linguistic Processing, Prague, Czech Republic, pp 120–127
van der Wouden T (2001) Collocational behaviour in non content words. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 16–23
Zinsmeister H, Heid U (2003) Significant triples: Adjective+Noun+Verb combinations. In: Proceedings of the 7th Conference on Computational Lexicography and Text Research (Complex 2003), Budapest, Hungary


Author information

Authors and Affiliations

Department of Linguistics (Office L706), University of Geneva, Rue de Candolle 2, 1211, Geneva, Switzerland
Violeta Seretan


Corresponding author

Correspondence to Violeta Seretan.


About this chapter

Cite this chapter

Seretan, V. (2011). Extensions. In: Syntax-Based Collocation Extraction. Text, Speech and Language Technology, vol 44. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-0134-2_5


DOI: https://doi.org/10.1007/978-94-007-0134-2_5
Published: 20 November 2010
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-0133-5
Online ISBN: 978-94-007-0134-2
eBook Packages: Computer Science, Computer Science (R0)
