Extensions

  • Chapter
Syntax-Based Collocation Extraction

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 44))

Abstract

Building on the syntax-based extraction method presented in the chapter on our method, we extend our practical investigations of collocations in directions less explored in previous work. We begin by enlarging the scope of the extraction method to cover a broader spectrum of collocational phenomena. More precisely, we go beyond binary collocations and propose a tractable method for acquiring complex collocations (i.e., collocations that contain embedded collocations). We then investigate how to find automatically all the syntactic configurations (patterns) that apply to collocations in a given language. These patterns play a crucial role in the quality of extraction results, yet they are often chosen arbitrarily. To overcome this shortcoming, our original extraction method has been adapted to allow for data-driven induction of relevant syntactic configurations. Finally, to support the compilation of bilingual collocational resources for machine translation, an NLP application for which collocations are of great importance, we present a method for acquiring translation equivalents for collocations by matching collocations extracted from the source and target language versions of parallel texts.

Notes

  1.

    Villada Moirón (2005) extends MI and \(\chi^2\) in order to deal with candidates of length 3.

  2.

    For the sake of simplicity, we will use the terms of bigram, trigram, and in general that of n-gram in order to indicate the arity of collocations, even if the component items are not adjacent.

  3.

    An in-depth discussion on collocation chains can be found in Ramos and Chains (2007).

  4.

    If more alternatives are possible, multiple types are generated accordingly.

  5.

    This filter is not implemented in the current version of our system.

  6.

    As illustrated by the trigram ((allgemeine, Gültigkeit), haben), composed of (allgemeine, Gültigkeit) and (Gültigkeit, haben), lit., “general validity have” (Heid, 1994, 232).

  7.

    Note that some items are themselves complex units, e.g., premier plan in the fifth 4-gram shown. Also, as the last 4-grams in the list suggest, our strategy of systematically displaying lemmas rather than word forms leads to an unusual presentation of some expressions (such as trouver bon solution possible instead of trouver meilleur solution possible).

  8.

    This table displays for Benson et al. (1986a) only the lexical collocations listed in the preface of the BBI dictionary. Also, the system of Smadja (1993) deals with the following additional types: V-V, N-P, N-D. Similarly, Basili et al. (1994) state that about 20 patterns were used, while our table displays only those that were explicitly mentioned in their publication.

  9.

    Some of the most representative patterns that are currently supported by our extraction system are shown in Table 4.1. As mentioned in Section 4.3.1, the complete list of patterns is actually longer, and is evolving as more and more data is inspected.

  10.

    URL: http://www.ldc.upenn.edu/, accessed June 2010.

  11.

    Since the experiments are not comparative and were conducted independently, the two corpora are of different sizes.

  12.

    More precisely, multi-word units are then identified in Dias (2003) by: (a) applying this measure on both sequences of words and on the corresponding sequences of POS tags; (b) combining the scores obtained, and (c) retaining only the local maxima candidates as valid, according to the method briefly explained in Section 5.1.3.

  13.

    In Seretan and Wehrli (2006), we discuss in detail the problems that a collocation extraction system (syntax-based or syntax-free) faces when ported from English to a new language with a richer morphology and more flexible word order.

  14.

    Term introduced by Pearce (2001a).

  15.

    The F-measure is the harmonic mean of precision and recall: \(F = 2PR/(P+R)\). If we consider that the task to solve is to find exactly one translation for each source collocation, then the recall represents, in our case, the number of correct translations returned divided by the number of correct translations expected.
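The F-measure definition in the last note can be illustrated with a minimal sketch. Recall follows the note (correct translations returned divided by correct translations expected); precision is assumed here to be the usual ratio of correct translations to all translations returned, which the note does not spell out. The function names and the counts in the usage lines are hypothetical and are not taken from the chapter.

```python
from typing import Tuple


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    if precision + recall == 0:
        # Convention chosen for this sketch: F is 0 when both P and R are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate_translations(returned: int, correct: int, expected: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F) for a collocation translation run.

    returned: number of translations proposed by the system
    correct:  number of proposed translations judged correct
    expected: number of correct translations expected
              (one per source collocation, as in the note)
    """
    # Assumption: precision = correct / returned (standard definition).
    precision = correct / returned if returned else 0.0
    # As in the note: recall = correct returned / correct expected.
    recall = correct / expected if expected else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical counts, for illustration only.
p, r, f = evaluate_translations(returned=80, correct=60, expected=100)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

With these made-up counts, precision is 0.75, recall is 0.60, and the F-measure is about 0.667. Because it is a harmonic mean, it falls below the arithmetic mean of the two (0.675) and moves toward the lower of the two values.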

References

  • Basili R, Pazienza MT, Velardi P (1994) A “not-so-shallow” parser for collocational analysis. In: Proceedings of the 15th Conference on Computational Linguistics, Kyoto, Japan, pp 447–453

  • Benson M, Benson E, Ilson R (1986a) The BBI Dictionary of English Word Combinations. John Benjamins, Amsterdam/Philadelphia

  • Blaheta D, Johnson M (2001) Unsupervised learning of multi-word verbs. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 54–60

  • Choueka Y, Klein S, Neuwitz E (1983) Automatic retrieval of frequent idiomatic and collocational expressions in a large corpus. Journal of the Association for Literary and Linguistic Computing 4(1):34–38

  • Dagan I, Church K (1994) Termight: Identifying and translating technical terminology. In: Proceedings of the 4th Conference on Applied Natural Language Processing (ANLP), Stuttgart, Germany, pp 34–40

  • Daille B (1994) Approche mixte pour l’extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7

  • Dias G (2003) Multiword unit hybrid extraction. In: Proceedings of the ACL Workshop on Multiword Expressions, Sapporo, Japan, pp 41–48

  • van der Eijk P (1993) Automating the acquisition of bilingual terminology. In: Proceedings of the 6th Conference on European Chapter of the Association for Computational Linguistics, Utrecht, The Netherlands, pp 113–119

  • Fontenelle T (1992) Collocation acquisition from a corpus or from a dictionary: A comparison. Proceedings I-II Papers submitted to the 5th EURALEX International Congress on Lexicography in Tampere, Tampere, Finland, pp 221–228

  • Frantzi KT, Ananiadou S, Mima H (2000) Automatic recognition of multi-word terms: The C-value/NC-value method. International Journal on Digital Libraries 2(3):115–130

  • Goldman JP, Nerima L, Wehrli E (2001) Collocation extraction using a syntactic parser. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 61–66

  • Hausmann FJ (1989) Le dictionnaire de collocations. In: Hausmann F, Reichmann O, Wiegand H, Zgusta L (eds) Wörterbücher: Ein internationales Handbuch zur Lexicographie. Dictionaries, Dictionnaires, de Gruyter, Berlin, pp 1010–1019

  • Heid U (1994) On ways words work together – research topics in lexical combinatorics. In: Proceedings of the 6th Euralex International Congress on Lexicography (EURALEX ’94), Amsterdam, The Netherlands, pp 226–257

  • Kilgarriff A, Tugwell D (2001) WORD SKETCH: Extraction and display of significant collocations for lexicography. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 32–38

  • Kilgarriff A, Kovář V, Krek S, Srdanović I, Tiberius C (2010) A quantitative evaluation of word sketches. In: Proceedings of the 14th EURALEX International Congress, Leeuwarden, The Netherlands

  • Kim S, Yang Z, Song M, Ahn JH (1999) Retrieving collocations from Korean text. In: Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, College Park, MD, USA, pp 71–81

  • Kim S, Yoon J, Song M (2001) Automatic extraction of collocations from Korean text. Computers and the Humanities 35(3):273–297

  • Kupiec J (1993) An algorithm for finding noun phrase correspondences in bilingual corpora. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, OH, USA, pp 17–22

  • Lin D (1998) Extracting collocations from text corpora. In: First Workshop on Computational Terminology, Montreal, Canada, pp 57–63

  • Lin D (1999) Automatic identification of non-compositional phrases. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, Morristown, NJ, USA, pp 317–324

  • Lü Y, Zhou M (2004) Collocation translation acquisition using monolingual corpora. In: Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Barcelona, Spain, pp 167–174

  • Nerima L, Seretan V, Wehrli E (2003) Creating a multilingual collocation dictionary from large text corpora. In: Companion Volume to the Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL’03), Budapest, Hungary, pp 131–134

  • Orliac B, Dillinger M (2003) Collocation extraction for machine translation. In: Proceedings of Machine Translation Summit IX, New Orleans, LA, USA, pp 292–298

  • Pearce D (2001a) Synonymy in collocation extraction. In: Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburgh, PA, USA, pp 41–46

  • Rögnvaldsson E (2010) Collocations in the minimalist framework. Lambda (18):107–118

  • Seretan V, Wehrli E (2006) Multilingual collocation extraction: Issues and solutions. In: Proceedings of COLING/ACL Workshop on Multilingual Language Resources and Interoperability, Sydney, Australia, pp 40–49

  • Seretan V, Wehrli E (2007) Collocation translation based on sentence alignment and parsing. In: Actes de la 14e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2007), Toulouse, France, pp 401–410

  • Smadja F (1993) Retrieving collocations from text: Xtract. Computational Linguistics 19(1):143–177

  • Smadja F, McKeown K, Hatzivassiloglou V (1996) Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics 22(1):1–38

  • Villada Moirón MBn (2005) Data-driven identification of fixed expressions and their modifiability. PhD thesis, University of Groningen

  • Wehrli E (2007) Fips, a “deep” linguistic multilingual parser. In: ACL 2007 Workshop on Deep Linguistic Processing, Prague, Czech Republic, pp 120–127

  • van der Wouden T (2001) Collocational behaviour in non content words. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 16–23

  • Zinsmeister H, Heid U (2003) Significant triples: Adjective+Noun+Verb combinations. In: Proceedings of the 7th Conference on Computational Lexicography and Text Research (Complex 2003), Budapest, Hungary

Author information

Corresponding author

Correspondence to Violeta Seretan.

Copyright information

© 2011 Springer Science+Business Media B.V.

About this chapter

Cite this chapter

Seretan, V. (2011). Extensions. In: Syntax-Based Collocation Extraction. Text, Speech and Language Technology, vol 44. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-0134-2_5

  • DOI: https://doi.org/10.1007/978-94-007-0134-2_5

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-007-0133-5

  • Online ISBN: 978-94-007-0134-2

  • eBook Packages: Computer Science (R0)
