
Syntax-Based Extraction

Syntax-Based Collocation Extraction

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 44))


Abstract

In this chapter, the core of the book, we present and evaluate our methodology for collocation extraction based on deep syntactic parsing. First, a closer look at previous work that made use of parsed text for collocation extraction reveals that the aim of fully-fledged syntax-based extraction was far from realized in these efforts, primarily because of the insufficient robustness, precision, or coverage of the parsers used, as well as the small number of syntactic configurations taken into account. Our work addresses these deficiencies with a generic extraction procedure that relies on a large-scale multilingual parsing system. After describing the system and the extraction method, we focus on a contrastive evaluation of the method against the sliding-window method, a standard syntax-free method based on the linear proximity of words. Cross-language evaluation shows that, despite the inherent errors and the challenges posed by the analysis of large amounts of unrestricted text, deep parsing contributes to a significant increase in performance. A detailed qualitative analysis of the results, including a case-study comparison, enables an assessment of the relative strengths and weaknesses of the two methods. Finally, we briefly compare the current system with systems based on shallow parsing.


Notes

  1. All the examples provided in this book are sentences actually occurring in our corpora.

  2. The author notes that newer versions of the parser are able to process these sentences as well.

  3. See the examples provided later, in Section 4.2.

  4. Note that a relative reading is also possible for this example.

  5. The subordinate clause in this example is the relative introduced by where.

  6. URL: http://www.elda.org/easy/, accessed June 2010.

  7. URL: http://atoll.inria.fr/passage/, accessed June 2010.

  8. Prepositions are also included along with noun lexemes, for readability reasons.

  9. Far from being exhaustive, this list is continuously evolving, since many new combinations emerge as collocationally relevant as more data is processed (see also the considerations in Section 3.3.2 on the syntactic configuration of collocations).

  10. Thanks to parsing, the readings of a lexical item are syntactically disambiguated. It might therefore happen that two pairs that are identical in form (i.e., whose key fields are the same) actually contain different lexical items.

  11. Section 3.2.5 discusses in detail the issue of choosing an appropriate AM.

  12. The selection of higher-scored pairs can be made a posteriori, according to the desired degree of confidence (see Section 3.2.4).

  13. As noted in the previous section, the system can recognise those pairs of lexemes that make up known collocations, i.e., collocations that are stored in the parser’s lexicon.

  14. For example, in Smadja (1993), the lexicographers classified each item as N (not a good collocation), Y (good collocation), or YY (good collocation, but of lesser quality than a Y collocation).

  15. The precision computed on the top n results is referred to as the n-best precision (Evert, 2004a).
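The n-best precision mentioned in note 15 is simple to compute; the sketch below is illustrative only (the function name, the candidate representation, and the gold set are assumptions, not the book's implementation):

```python
def n_best_precision(ranked_candidates, is_true_collocation, n):
    """Precision over the top-n entries of a ranked candidate list
    (the n-best precision of Evert, 2004a)."""
    top = ranked_candidates[:n]
    return sum(1 for c in top if is_true_collocation(c)) / len(top)

# Hypothetical ranked output and gold standard, for illustration only.
gold = {("break", "record"), ("heavy", "rain")}
ranked = [("break", "record"), ("red", "car"), ("heavy", "rain"), ("eat", "soup")]
print(n_best_precision(ranked, lambda c: c in gold, 2))  # 0.5
```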

  16. The inter-annotator agreement achieved by non-specialised students is lower (Krenn, 2008).

  17. A number of testbeds have been released after the Shared Task for Multiword Expressions (Grégoire et al., 2008).

  18. For instance, Daille (1994, 145) reported that only 300 out of the 2,200 terms tested (13.6%) were found in a reference list containing about 6,000 terms from the same domain as the source corpus, namely, that of satellite telecommunications. A similar coverage (13.4%) is reported in Justeson and Katz (1995): only 13 of the 97 identified terms are found in a dictionary containing more than 20,000 terms of the same domain. We can speculate that when the domains differ, the intersection is virtually insignificant.

  19. See Section 4.6 for a discussion of the effect that ignoring such long-distance pairs has on the extraction results.

  20. As explained in Chapter 2, the collocation phenomenon is more acutely perceived by near-native than by native speakers.

  21. In Pecina (2010), for instance, the judges decided upon the status of a pair without referring to context.

  22. We used this strict policy because one of our objectives was to measure the quality of the candidate identification step. Yet, the correctness of grammatical information may have less relevance in practice: the mere presence of the component words in a collocation may be sufficient for lexicographers to spot it and consider it for inclusion in a lexicon. In fact, Kilgarriff et al. (2010) categorise such wrongly analysed pairs as true positives.

  23. This choice can be seen as biasing the candidate identification process, since parsing errors are reflected in the POS tags assigned. We argue, however, that in cases of ambiguity tag assignment is more precise with Fips than without parsing information, and that, if anything, our choice makes the two methods more comparable: rather than introducing errors with another POS tagger, we retain the same errors and can more easily highlight the differences between the two extraction approaches.

  24. According to a study by Hajič (2000) cited in Section 3.3.3, about 40% of the tokens in an English text are POS-ambiguous.

  25. For instance, combinations involving an adverb have not been considered, since they are ignored by most window-based extraction systems.

  26. LLR is not defined for pairs whose contingency table contains a 0 value.
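To make this concrete, here is a minimal sketch of the G² log-likelihood ratio of Dunning (1993) over a 2×2 contingency table. The naive formula involves log 0 when a cell is empty, which is why such pairs must be excluded (or the 0·log 0 = 0 convention adopted); the function name and the error-raising policy are illustrative assumptions, not the book's implementation:

```python
import math

def llr(o11, o12, o21, o22):
    """Log-likelihood ratio (G^2, Dunning 1993) for a 2x2 contingency table.

    o11: joint frequency of the pair; o12/o21: one word without the other;
    o22: neither word. Raises ValueError on zero cells, mirroring the note
    that LLR is undefined when the table contains a 0 (log 0 is undefined
    unless the 0 * log 0 = 0 convention is adopted).
    """
    observed = [o11, o12, o21, o22]
    if any(o == 0 for o in observed):
        raise ValueError("LLR undefined: contingency table contains a 0 cell")
    n = sum(observed)
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))
```

A table whose observed counts equal the expected counts under independence yields a score of 0; strongly associated pairs receive large positive scores.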

  27. This pair was erroneously extracted from the phrase petite entreprise (see the error analysis in Section 4.5).

  28. The same strategy is used, for instance, in Daille (1994).

  29. Note that the pairs contain lemmas rather than word forms.

  30. Recall from Section 4.4.2 that the mark represents the dominant label of an annotated pair.

  31. The κ values diverge slightly from those reported, on the same annotation data, in our previous publications (Seretan, 2008; Seretan and Wehrli, 2009), because we previously used a κ calculator that implemented a weighted version of Cohen’s κ.
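For reference, the unweighted Cohen's κ (Cohen, 1960) can be sketched as follows; this is an illustrative reimplementation, not the calculator mentioned in the note:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: proportion of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(freq_a) | set(freq_b))
    return (observed - expected) / (1 - expected)

print(cohens_kappa(["y", "y", "n", "n"], ["y", "n", "y", "n"]))  # 0.0
```

A weighted variant additionally assigns partial credit to disagreements between "close" categories, which explains why the two calculators yield different values on the same data.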

  32. This corpus is on average 3.1 times bigger than the corpus used in Experiment 1.

  33. The numbers in Experiment 1 were quite similar, i.e., 76.4% vs. 99.0% for the top 500 pairs.

  34. Disagreements involving erroneous pairs are not discussed here, since they are not linguistically relevant.

  35. Entreprise can be either a noun (“company”) or the past participle of the verb entreprendre (“to undertake”).

  36. Faible can be an adjective (“weak”) or a noun (“weak person”).

  37. The parser distinguishes between Monsieur (title) and monsieur (common noun) and therefore treats them as two different lexemes.

  38. Note that the window method cannot be subject to such errors, as long as no syntactic type is associated with the output pairs, only POS labels.

  39. This is mainly a consequence of the manner in which the test sets were constructed, by considering non-adjacent sets at various levels in the output list (Section 4.4.5).

  40. For instance, Example 7 contains a false instance of the pair président de élection, while Example 8 contains a true instance.

  41. This pair is actually part of the longer collocation vote – take place, which would have been extracted if take place had been included in the parser’s lexicon (Chapter 5 presents a method for obtaining longer collocations by taking previously extracted pairs into account).

  42. Fontenelle (1999) discusses the problem of transparent nouns, showing that they may involve a wide range of partitives and quantifiers, as in shot clouds of arrows, melt a bar of chocolate, suffer from an outbreak of fever, a warm round of applause. He proposes a lexical-function account of these nouns, in which a transparent noun is treated as the value of the lexical function Mult (e.g., Mult(arrow) = cloud).

  43. In the current version of the system, the extraction method was adapted to perform this computation. In the lexicon of Fips, a specific flag is used for nouns to signal that they are semantically transparent.

  44. This term was introduced in Section 3.2.3.

  45. In this case, however, the argument of language independence and ease of implementation no longer holds, as shallow parsers are also relatively difficult to develop.

References

  • Artstein R, Poesio M (2008) Inter-coder agreement for computational linguistics. Computational Linguistics 34(4):555–596

  • Blaheta D, Johnson M (2001) Unsupervised learning of multi-word verbs. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 54–60

  • Breidt E (1993) Extraction of V-N-collocations from text corpora: A feasibility study for German. In: Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus, OH, USA, pp 74–83

  • Bresnan J (2001) Lexical Functional Syntax. Blackwell, Oxford

  • Chomsky N (1995) The Minimalist Program. MIT Press, Cambridge, MA

  • Choueka Y (1988) Looking for needles in a haystack, or locating interesting collocational expressions in large textual databases. In: Proceedings of the International Conference on User-Oriented Content-Based Text and Image Handling, Cambridge, MA, USA, pp 609–623

  • Cohen J (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20:37–46

  • Cook P, Fazly A, Stevenson S (2008) The VNC-tokens dataset. In: Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp 19–22

  • Culicover P, Jackendoff R (2005) Simpler Syntax. Oxford University Press, Oxford

  • Daille B (1994) Approche mixte pour l’extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7

  • Diab MT, Bhutada P (2009) Verb noun construction MWE token supervised classification. In: 2009 Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation, Applications, Suntec, Singapore, pp 17–22

  • Dunning T (1993) Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1):61–74

  • Evert S (2004a) Significance tests for the evaluation of ranking methods. In: Proceedings of Coling 2004, Geneva, Switzerland, pp 945–951

  • Evert S (2004b) The statistics of word cooccurrences: Word pairs and collocations. PhD thesis, University of Stuttgart

  • Evert S (2008b) A lexicographic evaluation of German adjective-noun collocations. In: Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco

  • Evert S, Kermes H (2003) Experiments on candidate data for collocation extraction. In: Companion Volume to the Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL’03), Budapest, Hungary, pp 83–86

  • Evert S, Krenn B (2001) Methods for the qualitative evaluation of lexical association measures. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, pp 188–195

  • Evert S, Krenn B (2005) Using small random samples for the manual evaluation of statistical association measures. Computer Speech & Language 19(4):450–466

  • Evert S, Heid U, Spranger K (2004) Identifying morphosyntactic preferences in collocations. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, pp 907–910

  • Fleiss JL (1981) Measuring nominal scale agreement among many raters. Psychological Bulletin 76:378–382

  • Fontenelle T (1999) Semantic resources for word sense disambiguation: A sine qua non? Linguistica e Filologia (9):25–43. Dipartimento di Linguistica e Letterature Comparate, Università degli Studi di Bergamo

  • Fritzinger F, Weller M, Heid U (2010) A survey of idiomatic Preposition-Noun-Verb triples on token level. In: Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta

  • Grégoire N, Evert S, Krenn B (eds) (2008) Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008). European Language Resources Association (ELRA), Marrakech, Morocco

  • Hajič J (2000) Morphological tagging: Data vs. dictionaries. In: Proceedings of the 6th Applied Natural Language Processing and the 1st NAACL Conference, Seattle, WA, USA, pp 94–101

  • Heid U, Weller M (2008) Tools for collocation extraction: Preferences for active vs. passive. In: Proceedings of the 6th International Language Resources and Evaluation (LREC’08), Marrakech, Morocco

  • Justeson JS, Katz SM (1995) Technical terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1(1):9–27

  • Kilgarriff A, Rychly P, Smrz P, Tugwell D (2004) The Sketch Engine. In: Proceedings of the 11th EURALEX International Congress, Lorient, France, pp 105–116

  • Kilgarriff A, Kovář V, Krek S, Srdanović I, Tiberius C (2010) A quantitative evaluation of word sketches. In: Proceedings of the 14th EURALEX International Congress, Leeuwarden, The Netherlands

  • Koehn P (2005) Europarl: A parallel corpus for statistical machine translation. In: Proceedings of the 10th Machine Translation Summit (MT Summit X), Phuket, Thailand, pp 79–86

  • Krenn B (2000a) Collocation mining: Exploiting corpora for collocation identification and representation. In: Proceedings of KONVENS 2000, Ilmenau, Germany, pp 209–214

  • Krenn B (2000b) The Usual Suspects: Data-Oriented Models for Identification and Representation of Lexical Collocations, vol 7. German Research Center for Artificial Intelligence and Saarland University Dissertations in Computational Linguistics and Language Technology, Saarbrücken, Germany

  • Krenn B (2008) Description of evaluation resource – German PP-verb data. In: Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco

  • Krenn B, Evert S (2001) Can we do better than frequency? A case study on extracting PP-verb collocations. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 39–46

  • Krenn B, Evert S, Zinsmeister H (2004) Determining intercoder agreement for a collocation identification task. In: Proceedings of KONVENS 2004, Vienna, Austria

  • Landis J, Koch G (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174

  • Lin D (1998) Extracting collocations from text corpora. In: First Workshop on Computational Terminology, Montreal, Canada, pp 57–63

  • Lin D (1999) Automatic identification of non-compositional phrases. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, Morristown, NJ, USA, pp 317–324

  • Lü Y, Zhou M (2004) Collocation translation acquisition using monolingual corpora. In: Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Barcelona, Spain, pp 167–174

  • Manning CD, Schütze H (1999) Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA

  • McKeown KR, Radev DR (2000) Collocations. In: Dale R, Moisl H, Somers H (eds) A Handbook of Natural Language Processing, Marcel Dekker, New York, NY, pp 507–523

  • Orliac B, Dillinger M (2003) Collocation extraction for machine translation. In: Proceedings of Machine Translation Summit IX, New Orleans, LA, USA, pp 292–298

  • Pearce D (2001a) Synonymy in collocation extraction. In: Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburgh, PA, USA, pp 41–46

  • Pearce D (2002) A comparative evaluation of collocation extraction techniques. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, Las Palmas, Spain, pp 1530–1536

  • Pecina P (2008a) Lexical association measures: Collocation extraction. PhD thesis, Charles University in Prague

  • Pecina P (2008b) A machine learning approach to multiword expression extraction. In: Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp 54–57

  • Pecina P (2010) Lexical association measures and collocation extraction. Language Resources and Evaluation 44(1):137–158

  • Ramisch C, Schreiner P, Idiart M, Villavicencio A (2008) An evaluation of methods for the extraction of multiword expressions. In: Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco

  • Ritz J (2006) Collocation extraction: Needs, feeds and results of an extraction system for German. In: Proceedings of the Workshop on Multi-Word-Expressions in a Multilingual Context at the 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, pp 41–48

  • Schulte im Walde S (2003) A collocation database for German verbs and nouns. In: Kiefer F, Pajzs J (eds) Proceedings of the 7th Conference on Computational Lexicography and Corpus Research, Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary

  • Seretan V (2008) Collocation extraction based on syntactic parsing. PhD thesis, University of Geneva

  • Seretan V (2009) An integrated environment for extracting and translating collocations. In: Mahlberg M, González-Díaz V, Smith C (eds) Proceedings of the Corpus Linguistics Conference CL2009, Liverpool, UK

  • Seretan V, Wehrli E (2009) Multilingual collocation extraction with a syntactic parser. Language Resources and Evaluation 43(1):71–85

  • Seretan V, Nerima L, Wehrli E (2004) A tool for multi-word collocation extraction and visualization in multilingual corpora. In: Proceedings of the 11th EURALEX International Congress, EURALEX 2004, Lorient, France, pp 755–766

  • Smadja F (1993) Retrieving collocations from text: Xtract. Computational Linguistics 19(1):143–177

  • Thanopoulos A, Fakotakis N, Kokkinakis G (2002) Comparative evaluation of collocation extraction metrics. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, Las Palmas, Spain, pp 620–625

  • Villada Moirón MB (2005) Data-driven identification of fixed expressions and their modifiability. PhD thesis, University of Groningen

  • Wehrli E (1997) L’analyse syntaxique des langues naturelles: Problèmes et méthodes. Masson, Paris

  • Wehrli E (2004) Un modèle multilingue d’analyse syntaxique. In: Auchlin A, Burger M, Filliettaz L, Grobet A, Moeschler J, Perrin L, Rossari C, de Saussure L (eds) Structures et discours - Mélanges offerts à Eddy Roulet, Éditions Nota bene, Québec, pp 311–329

  • Wehrli E (2007) Fips, a “deep” linguistic multilingual parser. In: ACL 2007 Workshop on Deep Linguistic Processing, Prague, Czech Republic, pp 120–127

  • Weller M, Heid U (2010) Extraction of German multiword expressions from parsed corpora using context features. In: Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta

  • Wermter J, Hahn U (2006) You can’t beat frequency (unless you use linguistic knowledge) – a qualitative evaluation of association measures for collocation and term extraction. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, pp 785–792

  • Wu H, Zhou M (2003) Synonymous collocation extraction using translation information. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan, pp 120–127

  • Zajac R, Lange E, Yang J (2003) Customizing complex lexical entries for high-quality MT. In: Proceedings of the 9th Machine Translation Summit, New Orleans, LA, USA, pp 433–438

  • Zinsmeister H, Heid U (2003) Significant triples: Adjective+Noun+Verb combinations. In: Proceedings of the 7th Conference on Computational Lexicography and Text Research (Complex 2003), Budapest, Hungary


Author information

Correspondence to Violeta Seretan.


Copyright information

© 2011 Springer Science+Business Media B.V.

About this chapter

Cite this chapter

Seretan, V. (2011). Syntax-Based Extraction. In: Syntax-Based Collocation Extraction. Text, Speech and Language Technology, vol 44. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-0134-2_4

  • DOI: https://doi.org/10.1007/978-94-007-0134-2_4

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-007-0133-5

  • Online ISBN: 978-94-007-0134-2
