
Syntax-Based Extraction

Syntax-Based Collocation Extraction

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 44))


Abstract

In this chapter, the core of the book, we present and evaluate our methodology for collocation extraction based on deep syntactic parsing. First, a closer look at previous work that made use of parsed text for collocation extraction reveals that the aim of fully-fledged syntax-based extraction was far from realized in these efforts, primarily because of the insufficient robustness, precision, or coverage of the parsers used, as well as the small number of syntactic configurations taken into account. Our work addresses these deficiencies with a generic extraction procedure that relies on a large-scale multilingual parsing system. After describing the system and the extraction method, we focus on a contrastive evaluation of the method against the sliding-window method, a standard syntax-free method based on the linear proximity of words. Cross-language evaluation shows that, despite the inherent errors and the challenges posed by the analysis of large amounts of unrestricted text, deep parsing contributes to a significant increase in performance. A detailed qualitative analysis of the results, including a case-study comparison, enables an assessment of the relative strengths and weaknesses of the two methods. Finally, we briefly compare the current system with systems based on shallow parsing.


Notes

  1. All the examples provided in this book are sentences actually occurring in our corpora.

  2. The author notes that newer versions of the parser are able to process these sentences as well.

  3. See the examples provided later, in Section 4.2.

  4. Note that a relative reading is also possible for this example.

  5. The subordinate clause in this example is the relative introduced by where.

  6. URL: http://www.elda.org/easy/, accessed June 2010.

  7. URL: http://atoll.inria.fr/passage/, accessed June 2010.

  8. Prepositions are also included along with noun lexemes, for readability reasons.

  9. Far from being exhaustive, this list is continuously evolving, since many new combinations emerge as collocationally relevant as more data is processed (see also the considerations in Section 3.3.2 on the syntactic configuration of collocations).

  10. Thanks to parsing, the readings of a lexical item are syntactically disambiguated. It might therefore happen that two pairs that are identical in form (i.e., whose key fields are the same) actually contain different lexical items.

  11. Section 3.2.5 discusses in detail the issue of choosing an appropriate AM.

  12. The selection of higher-scored pairs can be made a posteriori, according to the desired degree of confidence (see Section 3.2.4).

  13. As noted in the previous section, the system can recognise those pairs of lexemes that make up known collocations, i.e., collocations that are stored in the parser’s lexicon.

  14. For example, in Smadja (1993), the lexicographers classified each item as N (not a good collocation), Y (good collocation), or YY (good collocation, but of lesser quality than a Y collocation).

  15. The precision computed on the top n results is referred to as the n-best precision (Evert, 2004a).
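The n-best precision mentioned in note 15 is simple to compute; the sketch below is illustrative only (the function name, the candidate representation, and the gold set are assumptions, not the book's implementation):

```python
def n_best_precision(ranked_candidates, is_true_collocation, n):
    """Precision over the top-n entries of a ranked candidate list
    (the n-best precision of Evert, 2004a)."""
    top = ranked_candidates[:n]
    return sum(1 for c in top if is_true_collocation(c)) / len(top)

# Hypothetical ranked output and gold standard, for illustration only.
gold = {("break", "record"), ("heavy", "rain")}
ranked = [("break", "record"), ("red", "car"), ("heavy", "rain"), ("eat", "soup")]
print(n_best_precision(ranked, lambda c: c in gold, 2))  # 0.5
```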

  16. The inter-annotator agreement achieved by non-specialised students is lower (Krenn, 2008).

  17. A number of testbeds have been released after the Shared Task for Multiword Expressions (Grégoire et al., 2008).

  18. For instance, Daille (1994, 145) reported that only 300 out of the 2,200 terms tested (13.6%) were found in a reference list containing about 6,000 terms from the same domain as the source corpus, namely, that of satellite telecommunications. A similar coverage (13.4%) is reported in Justeson and Katz (1995): only 13 of the 97 identified terms are found in a dictionary containing more than 20,000 terms of the same domain. We can speculate that when the domains differ, the intersection is virtually insignificant.

  19. See Section 4.6 for a discussion of the effect that ignoring such long-distance pairs has on the extraction results.

  20. As explained in Chapter 2, the collocation phenomenon is more acutely perceived by near-native than by native speakers.

  21. In Pecina (2010), for instance, the judges decided upon the status of a pair without referring to context.

  22. We used this strict policy because one of our objectives was to measure the quality of the candidate identification step. Yet, the correctness of grammatical information may have less relevance in practice: the mere presence of the component words in a collocation may be sufficient for lexicographers to spot it and consider it for inclusion in a lexicon. In fact, Kilgarriff et al. (2010) categorise such wrongly analysed pairs as true positives.

  23. This choice can be seen as biasing the candidate identification process, since parsing errors are reflected in the POS tags assigned. We argue, however, that in cases of ambiguity tag assignment is more precise with Fips than without parsing information, and that, if anything, our choice makes the two methods more comparable: rather than introducing errors with another POS tagger, we retain the same errors and can more easily highlight the differences between the two extraction approaches.

  24. According to a study by Hajič (2000) cited in Section 3.3.3, about 40% of the tokens in an English text are POS-ambiguous.

  25. For instance, combinations involving an adverb have not been considered, since they are ignored by most window-based extraction systems.

  26. LLR is not defined for pairs whose contingency table contains a 0 value.
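To make this concrete, here is a minimal sketch of the G² log-likelihood ratio of Dunning (1993) over a 2×2 contingency table. The naive formula involves log 0 when a cell is empty, which is why such pairs must be excluded (or the 0·log 0 = 0 convention adopted); the function name and the error-raising policy are illustrative assumptions, not the book's implementation:

```python
import math

def llr(o11, o12, o21, o22):
    """Log-likelihood ratio (G^2, Dunning 1993) for a 2x2 contingency table.

    o11: joint frequency of the pair; o12/o21: one word without the other;
    o22: neither word. Raises ValueError on zero cells, mirroring the note
    that LLR is undefined when the table contains a 0 (log 0 is undefined
    unless the 0 * log 0 = 0 convention is adopted).
    """
    observed = [o11, o12, o21, o22]
    if any(o == 0 for o in observed):
        raise ValueError("LLR undefined: contingency table contains a 0 cell")
    n = sum(observed)
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))
```

A table whose observed counts equal the expected counts under independence yields a score of 0; strongly associated pairs receive large positive scores.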

  27. This pair was erroneously extracted from the phrase petite entreprise (see the error analysis in Section 4.5).

  28. The same strategy is used, for instance, in Daille (1994).

  29. Note that the pairs contain lemmas rather than word forms.

  30. Recall from Section 4.4.2 that the mark represents the dominant label of an annotated pair.

  31. The κ values diverge slightly from those reported, on the same annotation data, in our previous publications (Seretan, 2008; Seretan and Wehrli, 2009), because we previously used a κ calculator that implemented a weighted version of Cohen’s κ.
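For reference, the unweighted Cohen's κ (Cohen, 1960) can be sketched as follows; this is an illustrative reimplementation, not the calculator mentioned in the note:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: proportion of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(freq_a) | set(freq_b))
    return (observed - expected) / (1 - expected)

print(cohens_kappa(["y", "y", "n", "n"], ["y", "n", "y", "n"]))  # 0.0
```

A weighted variant additionally assigns partial credit to disagreements between "close" categories, which explains why the two calculators yield different values on the same data.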

  32. This corpus is on average 3.1 times bigger than the corpus used in Experiment 1.

  33. The numbers in Experiment 1 were quite similar, i.e., 76.4% vs. 99.0% for the top 500 pairs.

  34. Disagreements involving erroneous pairs are not discussed here, since they are not linguistically relevant.

  35. Entreprise can be either a noun (“company”) or the past participle of the verb entreprendre (“to undertake”).

  36. Faible can be an adjective (“weak”) or a noun (“weak person”).

  37. The parser distinguishes between Monsieur (title) and monsieur (common noun) and therefore treats them as two different lexemes.

  38. Note that the window method cannot be subject to such errors, as long as no syntactic type is associated with the output pairs, only POS labels.

  39. This is mainly a consequence of the manner in which the test sets were constructed, by considering non-adjacent sets at various levels in the output list (Section 4.4.5).

  40. For instance, Example 7 contains a false instance of the pair président de élection, while Example 8 contains a true instance.

  41. This pair is actually part of the longer collocation vote – take place, which would have been extracted if take place had been included in the parser’s lexicon (Chapter 5 presents a method for obtaining longer collocations by taking previously extracted pairs into account).

  42. Fontenelle (1999) discusses the problem of transparent nouns, showing that they may involve a wide range of partitives and quantifiers, as in shot clouds of arrows, melt a bar of chocolate, suffer from an outbreak of fever, a warm round of applause. He proposes a lexical-function account of these nouns, in which a transparent noun is treated as the value of the lexical function Mult (e.g., Mult(arrow) = cloud).

  43. In the current version of the system, the extraction method was adapted to perform this computation. In the lexicon of Fips, a specific flag is used for nouns to signal that they are semantically transparent.

  44. This term was introduced in Section 3.2.3.

  45. In this case, however, the argument of language independence and ease of implementation no longer holds, as shallow parsers are also relatively difficult to develop.

References

  • Artstein R, Poesio M (2008) Inter-coder agreement for computational linguistics. Computational Linguistics 34(4):555–596

  • Blaheta D, Johnson M (2001) Unsupervised learning of multi-word verbs. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 54–60

  • Breidt E (1993) Extraction of V-N-collocations from text corpora: A feasibility study for German. In: Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus, OH, USA, pp 74–83

  • Bresnan J (2001) Lexical Functional Syntax. Blackwell, Oxford

  • Chomsky N (1995) The Minimalist Program. MIT Press, Cambridge, MA

  • Choueka Y (1988) Looking for needles in a haystack, or locating interesting collocational expressions in large textual databases. In: Proceedings of the International Conference on User-Oriented Content-Based Text and Image Handling, Cambridge, MA, USA, pp 609–623

  • Cohen J (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20:37–46

  • Cook P, Fazly A, Stevenson S (2008) The VNC-tokens dataset. In: Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp 19–22

  • Culicover P, Jackendoff R (2005) Simpler Syntax. Oxford University Press, Oxford

  • Daille B (1994) Approche mixte pour l’extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7

  • Diab MT, Bhutada P (2009) Verb noun construction MWE token supervised classification. In: 2009 Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation, Applications, Suntec, Singapore, pp 17–22

  • Dunning T (1993) Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1):61–74

  • Evert S (2004a) Significance tests for the evaluation of ranking methods. In: Proceedings of Coling 2004, Geneva, Switzerland, pp 945–951

  • Evert S (2004b) The statistics of word cooccurrences: Word pairs and collocations. PhD thesis, University of Stuttgart

  • Evert S (2008b) A lexicographic evaluation of German adjective-noun collocations. In: Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco

  • Evert S, Kermes H (2003) Experiments on candidate data for collocation extraction. In: Companion Volume to the Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL’03), Budapest, Hungary, pp 83–86

  • Evert S, Krenn B (2001) Methods for the qualitative evaluation of lexical association measures. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, pp 188–195

  • Evert S, Krenn B (2005) Using small random samples for the manual evaluation of statistical association measures. Computer Speech & Language 19(4):450–466

  • Evert S, Heid U, Spranger K (2004) Identifying morphosyntactic preferences in collocations. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, pp 907–910

  • Fleiss JL (1981) Measuring nominal scale agreement among many raters. Psychological Bulletin 76:378–382

  • Fontenelle T (1999) Semantic resources for word sense disambiguation: A sine qua non? Linguistica e Filologia (9):25–43. Dipartimento di Linguistica e Letterature Comparate, Università degli Studi di Bergamo

  • Fritzinger F, Weller M, Heid U (2010) A survey of idiomatic Preposition-Noun-Verb triples on token level. In: Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta

  • Grégoire N, Evert S, Krenn B (eds) (2008) Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008). European Language Resources Association (ELRA), Marrakech, Morocco

  • Hajič J (2000) Morphological tagging: Data vs. dictionaries. In: Proceedings of the 6th Applied Natural Language Processing and the 1st NAACL Conference, Seattle, WA, USA, pp 94–101

  • Heid U, Weller M (2008) Tools for collocation extraction: Preferences for active vs. passive. In: Proceedings of the 6th International Language Resources and Evaluation (LREC’08), Marrakech, Morocco

  • Justeson JS, Katz SM (1995) Technical terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1(1):9–27

  • Kilgarriff A, Rychly P, Smrz P, Tugwell D (2004) The Sketch Engine. In: Proceedings of the 11th EURALEX International Congress, Lorient, France, pp 105–116

  • Kilgarriff A, Kovář V, Krek S, Srdanović I, Tiberius C (2010) A quantitative evaluation of word sketches. In: Proceedings of the 14th EURALEX International Congress, Leeuwarden, The Netherlands

  • Koehn P (2005) Europarl: A parallel corpus for statistical machine translation. In: Proceedings of the 10th Machine Translation Summit (MT Summit X), Phuket, Thailand, pp 79–86

  • Krenn B (2000a) Collocation mining: Exploiting corpora for collocation identification and representation. In: Proceedings of KONVENS 2000, Ilmenau, Germany, pp 209–214

  • Krenn B (2000b) The Usual Suspects: Data-Oriented Models for Identification and Representation of Lexical Collocations, vol 7. German Research Center for Artificial Intelligence and Saarland University Dissertations in Computational Linguistics and Language Technology, Saarbrücken, Germany

  • Krenn B (2008) Description of evaluation resource – German PP-verb data. In: Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco

  • Krenn B, Evert S (2001) Can we do better than frequency? A case study on extracting PP-verb collocations. In: Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, Toulouse, France, pp 39–46

  • Krenn B, Evert S, Zinsmeister H (2004) Determining intercoder agreement for a collocation identification task. In: Proceedings of KONVENS 2004, Vienna, Austria

  • Landis J, Koch G (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174

  • Lin D (1998) Extracting collocations from text corpora. In: First Workshop on Computational Terminology, Montreal, Canada, pp 57–63

  • Lin D (1999) Automatic identification of non-compositional phrases. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, Morristown, NJ, USA, pp 317–324

  • Lü Y, Zhou M (2004) Collocation translation acquisition using monolingual corpora. In: Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Barcelona, Spain, pp 167–174

  • Manning CD, Schütze H (1999) Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA

  • McKeown KR, Radev DR (2000) Collocations. In: Dale R, Moisl H, Somers H (eds) A Handbook of Natural Language Processing, Marcel Dekker, New York, NY, pp 507–523

  • Orliac B, Dillinger M (2003) Collocation extraction for machine translation. In: Proceedings of Machine Translation Summit IX, New Orleans, LA, USA, pp 292–298

  • Pearce D (2001a) Synonymy in collocation extraction. In: Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburgh, PA, USA, pp 41–46

  • Pearce D (2002) A comparative evaluation of collocation extraction techniques. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, Las Palmas, Spain, pp 1530–1536

  • Pecina P (2008a) Lexical association measures: Collocation extraction. PhD thesis, Charles University in Prague

  • Pecina P (2008b) A machine learning approach to multiword expression extraction. In: Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp 54–57

  • Pecina P (2010) Lexical association measures and collocation extraction. Language Resources and Evaluation 44(1):137–158

  • Ramisch C, Schreiner P, Idiart M, Villavicencio A (2008) An evaluation of methods for the extraction of multiword expressions. In: Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco

  • Ritz J (2006) Collocation extraction: Needs, feeds and results of an extraction system for German. In: Proceedings of the Workshop on Multi-Word-Expressions in a Multilingual Context at the 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, pp 41–48

  • Schulte im Walde S (2003) A collocation database for German verbs and nouns. In: Kiefer F, Pajzs J (eds) Proceedings of the 7th Conference on Computational Lexicography and Corpus Research, Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary

  • Seretan V (2008) Collocation extraction based on syntactic parsing. PhD thesis, University of Geneva

  • Seretan V (2009) An integrated environment for extracting and translating collocations. In: Mahlberg M, González-Díaz V, Smith C (eds) Proceedings of the Corpus Linguistics Conference CL2009, Liverpool, UK

  • Seretan V, Wehrli E (2009) Multilingual collocation extraction with a syntactic parser. Language Resources and Evaluation 43(1):71–85

  • Seretan V, Nerima L, Wehrli E (2004) A tool for multi-word collocation extraction and visualization in multilingual corpora. In: Proceedings of the 11th EURALEX International Congress, EURALEX 2004, Lorient, France, pp 755–766

  • Smadja F (1993) Retrieving collocations from text: Xtract. Computational Linguistics 19(1):143–177

  • Thanopoulos A, Fakotakis N, Kokkinakis G (2002) Comparative evaluation of collocation extraction metrics. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, Las Palmas, Spain, pp 620–625

  • Villada Moirón MB (2005) Data-driven identification of fixed expressions and their modifiability. PhD thesis, University of Groningen

  • Wehrli E (1997) L’analyse syntaxique des langues naturelles: Problèmes et méthodes. Masson, Paris

  • Wehrli E (2004) Un modèle multilingue d’analyse syntaxique. In: Auchlin A, Burger M, Filliettaz L, Grobet A, Moeschler J, Perrin L, Rossari C, de Saussure L (eds) Structures et discours - Mélanges offerts à Eddy Roulet, Éditions Nota bene, Québec, pp 311–329

  • Wehrli E (2007) Fips, a “deep” linguistic multilingual parser. In: ACL 2007 Workshop on Deep Linguistic Processing, Prague, Czech Republic, pp 120–127

  • Weller M, Heid U (2010) Extraction of German multiword expressions from parsed corpora using context features. In: Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta

  • Wermter J, Hahn U (2006) You can’t beat frequency (unless you use linguistic knowledge) – a qualitative evaluation of association measures for collocation and term extraction. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, pp 785–792

  • Wu H, Zhou M (2003) Synonymous collocation extraction using translation information. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan, pp 120–127

  • Zajac R, Lange E, Yang J (2003) Customizing complex lexical entries for high-quality MT. In: Proceedings of the 9th Machine Translation Summit, New Orleans, LA, USA, pp 433–438

  • Zinsmeister H, Heid U (2003) Significant triples: Adjective+Noun+Verb combinations. In: Proceedings of the 7th Conference on Computational Lexicography and Text Research (Complex 2003), Budapest, Hungary


Author information

Correspondence to Violeta Seretan.


Copyright information

© 2011 Springer Science+Business Media B.V.

About this chapter

Cite this chapter

Seretan, V. (2011). Syntax-Based Extraction. In: Syntax-Based Collocation Extraction. Text, Speech and Language Technology, vol 44. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-0134-2_4

  • DOI: https://doi.org/10.1007/978-94-007-0134-2_4

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-007-0133-5

  • Online ISBN: 978-94-007-0134-2
