1 Introduction

The construction of a 500-million-word reference corpus of written Dutch has been identified as one of the priorities in the STEVIN programme. The focus is on written language in order to complement the Spoken Dutch Corpus (CGN) [13], completed in 2003. In D-COI (a pilot project funded by STEVIN), a 50-million-word pilot corpus has been compiled, parts of which were enriched with verified syntactic annotations. In particular, syntactic annotation of a sub-corpus of 200,000 words has been completed. Further details of the D-COI project can be found in Chap. 13, p. 219. In Lassy, the sub-corpus with verified syntactic annotations has been extended to one million words. We refer to this sub-corpus as Lassy Small. In addition, a much larger corpus has been syntactically annotated automatically. This larger corpus is called Lassy Large. Lassy Small contains corpora compiled in STEVIN D-COI, some corpora from STEVIN DPC (cf. Chap. 11, p. 185), and some excerpts from the Dutch Wikipedia. Lassy Large includes the corpora compiled in the STEVIN SONAR project [7] (cf. Chap. 13, p. 219).

The Lassy project has extended the available syntactically annotated corpora for Dutch both in size and with respect to the various text genres and topical domains. In order to judge the quality of the resources, the annotated corpora have been externally validated by the Center for Sprogteknologi of the University of Copenhagen. In Sect. 9.5 we present the results of this external validation.

In addition, various browse and search tools for syntactically annotated corpora have been developed and made freely available. Their potential for applications in corpus linguistics and information extraction has been illustrated and evaluated through a series of three case studies.

In this article, we illustrate the potential of the Lassy treebanks by providing a short introduction to the annotations and the available tools, and by describing a number of typical research cases which employ the Lassy treebanks in various ways.

2 Annotation and Representation

In this section we describe the annotations in terms of part-of-speech, lemma and syntactic dependency. Furthermore, we illustrate how the annotations are stored in a straightforward XML file format.

Annotations for Lassy Small have been assigned in a semi-automatic manner. Annotations were first assigned automatically, and then our annotators manually inspected these annotations and corrected the mistakes. For part-of-speech and lemma annotation, Tadpole [12] has been used. For the syntactic dependency annotation, we used the Alpino parser [15]. For the correction of syntactic dependency annotations, we employed the TrEd tree editor [8]. In addition, a large number of heuristics have been applied to find annotation mistakes semi-automatically.

The annotations for Lassy Large have been assigned in the same way, except that there has been no manual inspection and correction phase.

2.1 Part-of-Speech and Lemma Annotation

The annotations include part-of-speech annotation and lemma annotation for words, and syntactic dependency annotation between words and word groups. Part-of-speech and lemma annotation closely follow the guidelines developed in D-COI [14]. These guidelines extend the guidelines developed in CGN, in order to take into account typical constructs for written language. As an example of the annotations, consider the sentence

(1) In 2005 moest hij stoppen na een meningsverschil met de studio.

In 2005 must he stop after a dispute with the studio.

In 2005, he had to stop after a dispute with the studio

The annotation of this sentence is given here as follows where each line contains the word, followed by the lemma, followed by the part-of-speech tag.

As the example indicates, the annotation not only provides the main parts of speech such as VZ (preposition), WW (verb), VNW (pronoun), but also various features to indicate tense, aspect, person, number, gender etc. Below, we describe how the part-of-speech and lemma annotations are included in the XML representation, together with the syntactic dependency annotations.

2.2 Syntactic Dependency Annotation

The guidelines for the syntactic dependency annotation are given in detail in [18]. This manual is a descendant of the CGN and D-COI syntactic annotation manuals. The CGN syntactic annotation manual [5] has been extended with many more examples and adapted by reducing the amount of linguistic discussion. Some further changes were applied based on the feedback in the validation report of the D-COI project. In a number of documented cases, the annotation guidelines themselves have changed in order to ensure consistency of the annotations, and to facilitate semi-automatic annotation.

Syntactic dependency annotations express three types of syntactic information:

  • hierarchical information: which words belong together

  • relational information: what is the grammatical function of the various words and word groups (such functions include head, subject, direct object, modifier etc.)

  • categorial information: what is the categorial status of the various word groups (categories include NP, PP, SMAIN, SSUB, etc.)

As an example of a syntactic dependency annotation, consider the graph in Fig. 9.1 for the example sentence given above. This example illustrates a number of properties of the representation. Firstly, the left-to-right surface order of the words is not always directly represented in the tree structure, but rather each of the leaf nodes is associated with the position in the string that it occurred at (the subscript). The tree structure does represent hierarchical information, in that words that belong together are represented under one node. Secondly, some word groups are analysed to belong to more than a single word group. We use a co-indexing mechanism to represent such secondary edges. In this example, the word hij functions both as the subject of the modal verb moeten and as the subject of the main verb stoppen; this is indicated by the index 1. Thirdly, we inherit from CGN the practice that punctuation (including sentence internal punctuation) is not analysed syntactically, but simply attached to the top node, to ensure that all tokens of the input are part of the dependency structure. The precise location of punctuation tokens is represented because all tokens are associated with an integer indicating the position in the sentence.

Fig. 9.1 Syntactic dependency graph for sentence (1)

2.3 Representation in XML

Both the dependency structures and the part-of-speech and lemma annotations are stored in a single XML format. Advantages of the use of XML include the availability of general purpose search and visualisation software. For instance, we exploit XPath (the standard XML query language) to search in large sets of dependency structures, and XQuery to extract information from such large sets of dependency structures.

In the XML structure, every node is represented by a node element. The other information is presented as values of various XML attributes of those nodes. The important attributes are cat (syntactic category), rel (grammatical function), postag (part-of-speech tag), word, lemma, index, begin (starting position in the surface string) and end (end position in the surface string).

Ignoring some attributes for expository purposes, part of the annotation of our running example is given in XML as follows:

[Figure 3: XML listing]

Leaf nodes have further attributes to represent the part-of-speech tag. The attribute postag will be the part-of-speech tag including the various sub-features. The abbreviated part-of-speech tag (the part without the attributes) is available as the value of the attribute pt. In addition, each of the sub-features itself is available as the value of further XML attributes. The precise mapping of part-of-speech tags and attributes with values is given in [18]. The actual node for the finite verb moest in the example, including the attributes to represent the part-of-speech, is:

[Figure 4: XML listing]

This somewhat redundant specification of the information encoded in the part-of-speech labels facilitates the construction of queries, since it is possible to refer directly to particular sub-features, and therefore to generalise more easily over part-of-speech labels.
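As a rough illustration of this attribute-based representation, the following Python sketch parses a small, hypothetical fragment and reads off the attributes discussed above. The fragment, its attribute values and the sub-feature attribute name `wvorm` are assumptions made for the example; the authoritative mapping is the one given in [18].

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment in the style described above.
# All attribute values are illustrative, not copied from the treebank;
# the sub-feature attribute "wvorm" is an assumed name (see [18]).
TOY_XML = """
<node cat="top" rel="top" begin="0" end="4">
  <node cat="smain" rel="--" begin="0" end="3">
    <node rel="hd" pt="ww" postag="WW(pv,verl,ev)" wvorm="pv"
          word="moest" lemma="moeten" begin="1" end="2"/>
    <node rel="su" pt="vnw" word="hij" lemma="hij" begin="2" end="3"/>
    <node rel="mod" pt="bw" word="toen" lemma="toen" begin="0" end="1"/>
  </node>
  <node rel="--" pt="let" word="." lemma="." begin="3" end="4"/>
</node>
"""

root = ET.fromstring(TOY_XML)

# Leaf nodes carry a word attribute; phrasal nodes carry cat instead.
leaves = [n for n in root.iter("node") if n.get("word") is not None]

# Surface order is recovered from begin, not from the order of elements.
words = [n.get("word") for n in sorted(leaves, key=lambda n: int(n.get("begin")))]

# A sub-feature can be addressed directly, without parsing the postag string.
finite_verbs = [n.get("lemma") for n in root.findall(".//node[@wvorm='pv']")]

print(words)
print(finite_verbs)
```

Note that the leaf nodes are deliberately listed out of surface order in the fragment, so that sorting on `begin` is what restores the sentence.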

3 Querying the Treebanks

As the annotations are represented in XML, there is a variety of tools available to work with the annotations. Such tools include XSLT, XPath and XQuery, as well as a number of special purpose tools, some of which were developed in the course of the Lassy project. Most of these tools have in common that particular parts of the tree can be identified using the XPath query language. XPath (XML Path Language) is an official W3C standard which provides a language for addressing parts of an XML document. In this section we provide a number of simple examples of the use of XPath to search in the Lassy corpora. We then continue to argue against some perceived limitations of XPath.

3.1 Search with XPath

We start by providing a number of simple XPath queries that can be used to search in the Lassy treebanks. We do not give a full introduction to the XPath language; for this purpose there are various resources available on the web.

3.1.1 Some Examples

With XPath, we can refer to hierarchical information (encoded by the hierarchical embedding of node elements), grammatical categories and functions (encoded by the cat and rel attributes), and surface order (encoded by the attributes begin and end).

As a simple introductory example, the following query:

[Figure 5: XPath query]

identifies all nodes anywhere in a given document for which the value of the cat attribute equals pp. In practice, if we use such a query against our Lassy Small corpus using the Dact tool (introduced below), we will get all sentences which contain a prepositional phrase. In addition, these prepositional phrases will be highlighted. In the query we use the double slash notation to indicate that this node can appear anywhere in the dependency structure. Conditions about this node can be given between square brackets. Such conditions often refer to particular values of particular attributes. Conditions can be combined using the boolean operators and, or and not. For instance, we can extend the previous query by requiring that the PP node should start at the beginning of the sentence:

[Figure 6: XPath query]
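The two kinds of query just described can be tried out on a toy document. The sketch below uses Python's standard `xml.etree.ElementTree` module, which supports only a subset of XPath; the XPath strings in the comments are one possible rendering of the description in the text, not a copy of the original listings, and the toy structure is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy dependency structure; values are illustrative only.
TOY_XML = """
<node cat="top" rel="top" begin="0" end="6">
  <node cat="smain" rel="--" begin="0" end="6">
    <node cat="pp" rel="mod" begin="0" end="2">
      <node rel="hd" pt="vz" word="In" begin="0" end="1"/>
      <node rel="obj1" pt="tw" word="2005" begin="1" end="2"/>
    </node>
    <node rel="hd" pt="ww" word="stopte" begin="2" end="3"/>
    <node rel="su" pt="vnw" word="hij" begin="3" end="4"/>
    <node cat="pp" rel="mod" begin="4" end="6">
      <node rel="hd" pt="vz" word="na" begin="4" end="5"/>
      <node rel="obj1" pt="n" word="ruzie" begin="5" end="6"/>
    </node>
  </node>
</node>
"""

root = ET.fromstring(TOY_XML)

# XPath: //node[@cat="pp"]
all_pps = root.findall(".//node[@cat='pp']")

# XPath: //node[@cat="pp" and @begin="0"]
# ElementTree has no "and"; chained predicates express the same conjunction.
initial_pps = root.findall(".//node[@cat='pp'][@begin='0']")

all_spans = [(n.get("begin"), n.get("end")) for n in all_pps]
initial_spans = [(n.get("begin"), n.get("end")) for n in initial_pps]

print(all_spans)
print(initial_spans)
```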

Brackets can be used to indicate the intended structure of the conditions, as in:

[Figure 7: XPath query]

Conditions can also refer to the context of the node. In the following query, we pose further restrictions on a daughter node of the PP category.

[Figure 8: XPath query]

This query will find all sentences in which a PP occurs with a head node whose part-of-speech label is not of the form VZ(..). Such a query will return quite a few hits, in most cases for prepositional phrases which are headed by multi-word units such as in tegenstelling tot (in contrast with), met betrekking tot (with respect to), etc. If we want to exclude such multi-word units, the query could be extended as follows, where we require that there is a word attribute, irrespective of its value.

[Figure 9: XPath query]

We can look further down inside a node using the single slash notation. For instance, the expression node[@rel="obj1"]/node[@rel="hd"] will refer to the head of the direct object. We can also access the value of an attribute of a sub-node as in node[@rel="hd"]/@postag.
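A minimal sketch of the single slash notation, again on a hypothetical toy structure: the path below selects the head of the direct object. Since `ElementTree` cannot return attribute values through a path step such as `/@postag`, the attribute is read with `get()` instead; the postag value shown is illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy structure for "hij las een boek"; values are illustrative.
TOY_XML = """
<node cat="top" rel="top" begin="0" end="4">
  <node cat="smain" rel="--" begin="0" end="4">
    <node rel="su" pt="vnw" word="hij" lemma="hij" begin="0" end="1"/>
    <node rel="hd" pt="ww" word="las" lemma="lezen" begin="1" end="2"/>
    <node cat="np" rel="obj1" begin="2" end="4">
      <node rel="det" pt="lid" word="een" lemma="een" begin="2" end="3"/>
      <node rel="hd" pt="n" postag="N(soort,ev,basis,onz,stan)"
            word="boek" lemma="boek" begin="3" end="4"/>
    </node>
  </node>
</node>
"""

root = ET.fromstring(TOY_XML)

# XPath: //node[@rel="obj1"]/node[@rel="hd"]
object_heads = root.findall(".//node[@rel='obj1']/node[@rel='hd']")

head_words = [n.get("word") for n in object_heads]
# Counterpart of .../node[@rel="hd"]/@postag
head_postags = [n.get("postag") for n in object_heads]

print(head_words, head_postags)
```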

It is also possible to refer to the mother node of a given node, using the double dot notation. The following query identifies prepositional phrases which are a dependent in a main sentence:

[Figure 10: XPath query]

Combining the two possibilities we can also refer to sister nodes. In this query, we find prepositional phrases as long as there is a sister which functions as a secondary object:

[Figure 11: XPath query]
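The mother and sister conditions can be mimicked in plain Python when a full XPath engine is not at hand. The sketch below builds a child-to-parent map, because `ElementTree` elements do not store a reference to their mother node; the toy structure and the helper names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy structure for "hij gaf haar boeken over taal in Utrecht".
TOY_XML = """
<node cat="top" rel="top" begin="0" end="8">
  <node cat="smain" rel="--" begin="0" end="8">
    <node rel="su" word="hij" begin="0" end="1"/>
    <node rel="hd" word="gaf" begin="1" end="2"/>
    <node rel="obj2" word="haar" begin="2" end="3"/>
    <node cat="np" rel="obj1" begin="3" end="6">
      <node rel="hd" word="boeken" begin="3" end="4"/>
      <node cat="pp" rel="mod" begin="4" end="6">
        <node rel="hd" word="over" begin="4" end="5"/>
        <node rel="obj1" word="taal" begin="5" end="6"/>
      </node>
    </node>
    <node cat="pp" rel="mod" begin="6" end="8">
      <node rel="hd" word="in" begin="6" end="7"/>
      <node rel="obj1" word="Utrecht" begin="7" end="8"/>
    </node>
  </node>
</node>
"""

root = ET.fromstring(TOY_XML)

# Map every node to its mother node (the counterpart of the ".." step).
mother = {child: parent for parent in root.iter("node") for child in parent}

pps = root.findall(".//node[@cat='pp']")

# PPs which are a dependent in a main sentence.
pps_in_smain = [pp for pp in pps if mother[pp].get("cat") == "smain"]

# PPs with a sister node that functions as a secondary object.
pps_with_obj2_sister = [
    pp for pp in pps
    if any(s is not pp and s.get("rel") == "obj2" for s in mother[pp])
]

smain_begins = [pp.get("begin") for pp in pps_in_smain]
sister_begins = [pp.get("begin") for pp in pps_with_obj2_sister]
print(smain_begins, sister_begins)
```

Only the PP in Utrecht qualifies in both cases; the PP over taal is embedded in the NP and therefore has neither an SMAIN mother nor an obj2 sister.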

Finally, the special notation .// identifies any node which is embedded anywhere in the current node. The next query finds embedded sentences which include the word van anywhere.

[Figure 12: XPath query]

3.1.2 Left to Right Ordering

Consider the following example, in which we identify prepositional phrases in which the preposition (the head) is preceded by the NP (which is assigned the obj1 function). Here we use the operator < to implement precedence.

[Figure 13: XPath query]

Note that in these examples we use the number() function to map the string value explicitly to a number. This is required in some implementations of XPath.

The operator = can be used to implement direct precedence. As another example, consider the problem of finding a prepositional phrase which follows a finite verb directly in a subordinate finite sentence. Initially, we arrive at the following query:

[Figure 14: XPath query]

This does identify subordinate finite sentences in which the finite verb is directly followed by a PP. But note that the query also requires that this PP is a dependent of the same node. If we want to find a PP anywhere, then the query becomes:

[Figure 15: XPath query]
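The ordering conditions of this section reduce to integer comparisons on the begin and end attributes. The following sketch, on a hypothetical toy structure, shows precedence (begin of one node smaller than begin of the other) and direct precedence (end of one node equal to begin of the other). In Python the attribute values are strings, so `int()` plays the role that `number()` plays in the queries above.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy structure with a prepositional PP ("op de berg")
# and a postpositional PP ("de berg op"); values are illustrative.
TOY_XML = """
<node cat="top" rel="top" begin="0" end="6">
  <node cat="pp" rel="mod" begin="0" end="3">
    <node rel="hd" pt="vz" word="op" begin="0" end="1"/>
    <node cat="np" rel="obj1" begin="1" end="3">
      <node rel="det" word="de" begin="1" end="2"/>
      <node rel="hd" word="berg" begin="2" end="3"/>
    </node>
  </node>
  <node cat="pp" rel="mod" begin="3" end="6">
    <node cat="np" rel="obj1" begin="3" end="5">
      <node rel="det" word="de" begin="3" end="4"/>
      <node rel="hd" word="berg" begin="4" end="5"/>
    </node>
    <node rel="hd" pt="vz" word="op" begin="5" end="6"/>
  </node>
</node>
"""

def begin(node):
    return int(node.get("begin"))

def end(node):
    return int(node.get("end"))

def precedes(x, y):
    """x starts before y starts (the '<' condition)."""
    return begin(x) < begin(y)

def directly_precedes(x, y):
    """x ends exactly where y starts (the '=' condition)."""
    return end(x) == begin(y)

root = ET.fromstring(TOY_XML)

postpositional = []
for pp in root.findall(".//node[@cat='pp']"):
    head = pp.find("node[@rel='hd']")
    obj1 = pp.find("node[@rel='obj1']")
    if head is not None and obj1 is not None and precedes(obj1, head):
        postpositional.append(pp)

spans = [(begin(pp), end(pp)) for pp in postpositional]
adjacent = [
    directly_precedes(pp.find("node[@rel='obj1']"), pp.find("node[@rel='hd']"))
    for pp in postpositional
]
print(spans, adjacent)
```

Comparing the raw strings would go wrong as soon as positions reach two digits, since the string "10" sorts before the string "9".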

3.1.3 Pitfalls

The content and sub-structure of coindexed nodes (to represent secondary edges) is present in the XML structure only once. The index attribute is used to indicate equivalence of the nodes. This may have some unexpected effects. For instance, the following query will not match with the dependency structure given in Fig. 9.1.

[Figure 16: XPath query]

The reason is that the node headed by stoppen does not itself have a subject with lemma=hij. Instead, it has a subject which is co-indexed with a node for which this requirement is true. In order to match this case as well, the query has to be made more complex, for instance as follows:

[Figure 17: XPath query]

The example illustrates that the use of co-indexing is not problematic for XPath, but it does complicate the queries in some cases. Some tools (for instance the Dact tool described in Sect. 9.3.3) provide the capacity to define macro substitutions in queries, which simplifies matters considerably.
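The co-indexing pitfall can be reproduced on a small scale. In the hypothetical structure below, the subject of stoppen is an empty node that only carries an index; a naive check on its lemma fails, while a check that first resolves the index to the node carrying the content succeeds. The helper functions are assumptions made for this sketch and do not correspond to any existing tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy structure for "hij moest stoppen"; the embedded subject
# is an empty node that shares index 1 with the node for "hij".
TOY_XML = """
<node cat="top" rel="top" begin="0" end="3">
  <node cat="smain" rel="--" begin="0" end="3">
    <node rel="su" index="1" pt="vnw" word="hij" lemma="hij" begin="0" end="1"/>
    <node rel="hd" pt="ww" word="moest" lemma="moeten" begin="1" end="2"/>
    <node cat="inf" rel="vc" begin="0" end="3">
      <node rel="su" index="1" begin="0" end="1"/>
      <node rel="hd" pt="ww" word="stoppen" lemma="stoppen" begin="2" end="3"/>
    </node>
  </node>
</node>
"""

root = ET.fromstring(TOY_XML)

# For every index, remember the node that actually carries the content
# (a word attribute or daughter nodes).
antecedents = {
    n.get("index"): n
    for n in root.iter("node")
    if n.get("index") is not None and (n.get("word") is not None or len(n) > 0)
}

def resolve(node):
    """Return the content-bearing node for a possibly empty co-indexed node."""
    idx = node.get("index")
    if idx is not None and node.get("word") is None and len(node) == 0:
        return antecedents.get(idx, node)
    return node

def headed_by(lemma):
    """All nodes with a head daughter that has the given lemma."""
    return [
        n for n in root.iter("node")
        if any(c.get("rel") == "hd" and c.get("lemma") == lemma for c in n)
    ]

def has_subject(clause, lemma, follow_index):
    for child in clause:
        if child.get("rel") != "su":
            continue
        target = resolve(child) if follow_index else child
        if target.get("lemma") == lemma:
            return True
    return False

naive = [c for c in headed_by("stoppen") if has_subject(c, "hij", False)]
resolved = [c for c in headed_by("stoppen") if has_subject(c, "hij", True)]
print(len(naive), [c.get("cat") for c in resolved])
```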

3.2 Comparison with Lai and Bird 2004

In [6] a comparison of a number of existing query languages is presented, by focussing on seven example queries. Here we show that each of the seven queries can be formulated in XPath for the Lassy treebank. In order to do this, we first adapted the queries in a non-essential way. For one thing, some queries refer to English words which we mapped to Dutch words. Another difference is that there is no (finite) VP in the Lassy treebank. The adapted queries with their implementation in XPath are given as follows:

  1. Find sentences that include the word zag.

     [Figure 18: XPath query]

  2. Find sentences that do not include the word zag.

     [Figure 19: XPath query]

  3. Find noun phrases whose rightmost child is a noun.

     [Figure 20: XPath query]

  4. Find root sentences that contain a verb immediately followed by a noun phrase that is immediately followed by a prepositional phrase.

     [Figure 21: XPath query]

  5. Find the first common ancestor of sequences of a noun phrase followed by a prepositional phrase.

     [Figure 22: XPath query]

  6. Find a noun phrase which dominates a word donker (dark) that is dominated by an intermediate phrase that is a prepositional phrase.

     [Figure 23: XPath query]

  7. Find a noun phrase dominated by a root sentence. Return the subtree dominated by that noun phrase only.

     [Figure 24: XPath query]

The ease with which the queries can be defined may be surprising to readers familiar with Lai and Bird [6]. In that paper, the authors conclude that XPath is not expressive enough for some queries. As an alternative, the special query language LPATH is introduced, which extends XPath in three ways:

  • the additional axis immediately following

  • the scope operator {...}

  • the node alignment operators ^ and $

However, we note here that these extensions are unnecessary. As long as the surface order of nodes is explicitly encoded by the XML attributes begin and end, as in the Lassy treebank, the additional power is redundant. An LPATH query which requires that a node x immediately follows a node y can be encoded in XPath by requiring that the begin-attribute of x equals the end-attribute of y. The examples which motivate the introduction of the other two extensions can likewise be encoded in XPath by means of the begin- and end-attributes. For instance, the LPATH query

[Figure 25: LPATH query]

where an SMAIN node is selected which contains a right-aligned NP can be defined in XPath as:

[Figure 26: XPath query]

Based on these examples we conclude that there is no motivation for an ad-hoc special purpose extension of XPath, but that instead we can safely continue to use the XPath standard.
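The argument rests on the observation that immediate adjacency and alignment are simple equalities over begin and end. The sketch below spells these conditions out in Python for a hypothetical toy structure; the function names are chosen for this example and are not part of LPATH or XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy structure for "hij las een boek"; values are illustrative.
TOY_XML = """
<node cat="top" rel="top" begin="0" end="4">
  <node cat="smain" rel="--" begin="0" end="4">
    <node rel="su" pt="vnw" word="hij" begin="0" end="1"/>
    <node rel="hd" pt="ww" word="las" begin="1" end="2"/>
    <node cat="np" rel="obj1" begin="2" end="4">
      <node rel="det" pt="lid" word="een" begin="2" end="3"/>
      <node rel="hd" pt="n" word="boek" begin="3" end="4"/>
    </node>
  </node>
</node>
"""

def span(node):
    return int(node.get("begin")), int(node.get("end"))

def immediately_follows(x, y):
    """x starts exactly where y ends."""
    return span(x)[0] == span(y)[1]

def left_aligned(inner, outer):
    """inner starts where outer starts (the role of '^')."""
    return span(inner)[0] == span(outer)[0]

def right_aligned(inner, outer):
    """inner ends where outer ends (the role of '$')."""
    return span(inner)[1] == span(outer)[1]

root = ET.fromstring(TOY_XML)
smain = root.find(".//node[@cat='smain']")
np = root.find(".//node[@cat='np']")
verb = root.find(".//node[@rel='hd'][@pt='ww']")

print(immediately_follows(np, verb), right_aligned(np, smain), left_aligned(np, smain))
```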

3.3 A Graphical User Interface for Lassy

Dact is a recent easy-to-use open-source tool, available for multiple platforms, to browse and search through Lassy treebanks. It provides graphical tree visualizations of the dependency structures of the treebank, full XPath search to select relevant dependency structures in a given corpus and to highlight the selected nodes of dependency structures, simple statistical operations to generate frequency lists for any attributes of selected nodes, and sentence-based outputs in several formats to display selected nodes, e.g. by bracketing the selected nodes, or by a keyword-in-context presentation. Dact can be downloaded from http://rug-compling.github.com/dact/.

For the XML processing, Dact supports both the libxml2 (http://xmlsoft.org) and the Oracle Berkeley DB XML (http://www.oracle.com) libraries. In the latter case, database technology is used to preprocess the corpus for faster query evaluation. In addition, the use of XPath 2.0 is supported. Furthermore, Dact provides macro expansion in XPath queries.

The availability of XPath 2.0 is useful in order to specify quantified queries (argued for in the context of the Lassy treebanks in [1]). As an example, consider the query in which we want to identify an NP which contains a VC complement (infinite VP complement), in such a way that there is a noun which is preceded by the head of that NP, and which precedes the VC complement. In other words, in such a case there is an (extraposed) VC complement of a noun for which there is another noun which appears in between the noun and the VC complement. The query can be formulated as:

[Figure 27: XPath 2.0 query]

The availability of a macro facility is useful to build up more complicated queries in a transparent way. The following example illustrates this point. Macros are defined using the format name = string. A macro is used by putting the name between % %. The following set of macros defines the solution to the fifth problem posed in [6] in a more transparent manner. In order to define the minimal node which dominates an NP PP sequence, we first define the notion dominates an NP PP sequence, and then use it to state that the first common ancestor of a sequence of NP PP is a node which is an ancestor of an NP PP sequence, but which does not contain a node which is an ancestor of an NP PP sequence.

[Figure 28: macro definitions]
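To make the idea of macro substitution concrete, the following Python sketch expands macros written in the name = string format and referenced as %name%. It is an illustration of the general technique only and makes no claim about how Dact implements it; the macro names and bodies are hypothetical and much simpler than the ones needed for the fifth problem of [6].

```python
import re

# Hypothetical macro definitions, one per line, in the format name = string.
MACRO_TEXT = """
pp = @cat="pp"
initial = number(@begin) = 0
initial_pp = //node[%pp% and %initial%]
"""

MACRO_REF = re.compile(r"%(\w+)%")

def parse_macros(text):
    """Read 'name = string' lines; only the first '=' separates name and body."""
    macros = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        name, _, body = line.partition("=")
        macros[name.strip()] = body.strip()
    return macros

def expand(query, macros, max_depth=10):
    """Replace %name% references repeatedly until nothing changes."""
    for _ in range(max_depth):
        expanded = MACRO_REF.sub(lambda m: macros[m.group(1)], query)
        if expanded == query:
            return expanded
        query = expanded
    raise ValueError("macro expansion did not terminate (cyclic definition?)")

macros = parse_macros(MACRO_TEXT)
query = expand("%initial_pp%", macros)
print(query)
```

Because macros may refer to other macros, the expansion is repeated until a fixed point is reached; the depth limit guards against accidentally cyclic definitions.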

4 Using the Lassy Treebanks

4.1 Introduction

The availability of manually constructed treebanks for research and development in natural language processing is crucial, in particular for training statistical syntactic analysers or statistical components of syntactic analysers of a more hybrid nature. In addition such high quality treebanks are important for evaluation purposes for any kind of automatic syntactic analysers.

Syntactically annotated corpora of the size of Lassy Small are also very useful resources for corpus linguists. Note that the size of Lassy Small (one million words) is the same as the subset of the Corpus of Spoken Dutch (CGN) which has been syntactically annotated. Furthermore, the syntactic annotations of the CGN are also available in a format which is understood by Dact. This implies that it is now straightforward to perform corpus linguistic research both on spoken and written Dutch. Below, we provide a simple case study where we compare the frequency of WH-questions formed with wie (who) as opposed to welk(e) (which).

It is less obvious whether large quantity, lower quality treebanks are a useful resource. As one case in point, we note that a preliminary version of the Lassy Large treebank was used as gold standard training data to train a memory-based parser for Dutch [12]. In this article, we illustrate the expected quality of the automatic annotations, and we discuss an example study which illustrates the promise of large quantity, lower quality treebanks. In this section, we therefore focus on the use of the Lassy Large treebank.

4.2 Estimating the Frequency of Question Types

As an example of the use of Lassy Small, we report on a question of a researcher in psycholinguistics who focuses on the linguistic processing of WH-questions from a behavioral (e.g. self-paced reading studies) and neurological (event-related potentials) viewpoint. She studies the effect of information load: the difference between wie and welk(e) in for example:

(2) Wie bakt het lekkerste brood?

Who bakes the nicest bread?

(3) Welke bakker bakt het lekkerste brood?

Which baker bakes the nicest bread?

To be sure that the results she finds are psycholinguistic or neurolinguistic in nature, she wants to be able to compare them to a frequency count in corpora.

Such questions can now be answered using the Lassy Small treebank or the CGN treebank by posing two simple queries. The following query finds WH-questions formed with wie:

[Figure 29: XPath query]

The numbers of hits of the queries are given in Table 9.1:

Table 9.1 Number of hits per query

4.3 Estimation of the Quality of Lassy Large

In order to judge the quality of the Lassy Large corpus, we evaluate the automatic parser that was used to construct Lassy Large on the manually verified annotations of Lassy Small. The Lassy Small corpus is composed of a number of sub-corpora. Each sub-corpus is composed of a number of documents. In the experiment, Alpino (version of October 1, 2010) was applied to a single document, using the same options which have been used for the construction of the Lassy Large corpus. With these options, the parser delivers a single parse, which it believes is the best parse according to a variety of heuristics. These include the disambiguation model and various optimizations of the parser presented in [9, 16, 17]. Furthermore, a time-out is enforced so that the parser cannot spend more than 190 s on a single sentence. If no result is obtained within this time, the parser is assumed to have returned an empty set of dependencies, and hence such cases have a very bad impact on accuracy.

In the presentation of the results, we aggregate over sub-corpora. The various dpc- sub-corpora are taken from the Dutch Parallel Corpus, and meta-information should be obtained from that corpus. The various WR- and WS- corpora are inherited from D-COI. The wiki- sub-corpus contains Wikipedia articles, in many cases about topics related to Flanders.

Parsing results are listed in Table 9.2. Mean accuracy is given in terms of the f-score of named dependency relations. As can be observed from this table, parsing accuracies are fairly stable across the various sub-corpora. An outlier is the result of the parser on the WR-P-P-G sub-corpus (legal texts), both in terms of accuracy and in terms of parsing times. We note that the parser performs best on the dpc-bal- sub-corpus, a series of speeches by former prime minister Balkenende.

Table 9.2 Parsing results (f-score of named dependencies) on the Lassy Small sub-corpora
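As a hedged illustration of what an f-score over named dependencies measures, the sketch below compares a gold set and a predicted set of (head, relation, dependent) triples. The triple representation and the example data are assumptions made for this sketch; the exact evaluation procedure used for the reported figures may differ in detail.

```python
def f_score(gold, predicted):
    """F-score of named dependencies, given as (head, relation, dependent) triples.

    An empty prediction (for instance after a time-out) yields 0.0.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold and predicted dependencies for one sentence.
gold = {
    ("moest", "su", "hij"),
    ("moest", "vc", "stoppen"),
    ("stoppen", "su", "hij"),
    ("stoppen", "mod", "na"),
}
predicted = {
    ("moest", "su", "hij"),
    ("moest", "vc", "stoppen"),
    ("stoppen", "su", "hij"),
    ("moest", "mod", "na"),  # modifier attached to the wrong head
}

score = f_score(gold, predicted)
timeout_score = f_score(gold, set())
print(score, timeout_score)
```

The second call shows why time-outs weigh so heavily: a sentence for which no dependencies are returned contributes a score of zero.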

4.4 The Distribution of zelf and zichzelf

As a further example of the use of parsed corpora to further linguistic insights, we consider a recent study [2] of the distribution of weak and strong reflexive objects in Dutch.

If a verb is used reflexively in Dutch, two forms of the reflexive pronoun are available. This is illustrated for the third person form in the examples below.

(4) Brouwers schaamt zich/*zichzelf voor zijn schrijverschap.

Brouwers shames self1/self2 for his writing

Brouwers is ashamed of his writing

(5) Duitsland volgt *zich/zichzelf niet op als Europees kampioen.

Germany follows self1/self2 not PART as European Champion

Germany does not succeed itself as European champion

(6) Wie zich/zichzelf niet juist introduceert, valt af.

Who self1/self2 not properly introduces, is out

Everyone who does not introduce himself properly, is out.

The choice between zich and zichzelf depends on the verb. Generally three groups of verbs are distinguished. Inherent reflexives never occur with a non-reflexive argument and occur only with zich (4). Non-reflexive verbs seldom, if ever occur with a reflexive argument. If they do, however, they can only take zichzelf as a reflexive argument (5). Accidental reflexives can be used with both zich and zichzelf, (6). Accidental reflexive verbs vary widely as to the frequency with which they occur with both arguments. [2] set out to explain this distribution.

The influential theory of [10] explains the distribution as the surface realization of two different ways of reflexive coding. An accidental reflexive that can be realized with both zich and zichzelf is actually ambiguous between an inherent reflexive and an accidental reflexive (which always is realized with zichzelf ). An alternative approach is that of [3, 4, 11], who have claimed that the distribution of weak vs. strong reflexive object pronouns correlates with the proportion of events described by the verb that are self-directed vs. other-directed.

In the course of this investigation, a first interesting observation is that many inherently reflexive verbs, which are claimed not to occur with zichzelf, actually often do combine with this pronoun. Two typical examples are:

(7) Nederland moet stoppen zichzelf op de borst te slaan

Netherlands must stop self2 on the chest to beat

The Netherlands must stop beating itself on the chest

(8) Hunze wil zichzelf niet al te zeer op de borst kloppen

Hunze want self2 not all too much on the chest knock

Hunze doesn't want to knock itself on the chest too much

With regard to the main hypothesis of their study, Bouma and Spenader [2] use linear regression to determine the correlation between reflexive use of a (non-inherently reflexive) verb and the relative preference for a weak or strong reflexive pronoun. Frequency counts are collected from the parsed TwNC corpus (almost 500 million words). They limit the analysis to verbs that occur at least 10 times with a reflexive meaning and at least 50 times in total, distinguishing uses by subcategorization frames. The statistical analysis shows a significant correlation, which accounts for 30 % of the variance of the ratio of nonreflexive over reflexive uses.

5 Validation

The Lassy Small and Lassy Large treebanks have been validated by a project-external body, the Center for Sprogteknologi, University of Copenhagen. The validation report gives a detailed account of the validation of the linguistic annotations of syntax, PoS and lemma in the Lassy treebanks. The validation comprises extraction of validation samples, manual checks of the content, and computation of named dependency accuracy figures of the syntax validation results.

The content validation falls in two parts: validation of the linguistic annotations (PoS-tagging, lemmatization) and the validation of the syntactic annotations. The validation of the syntactic annotation was carried out on 250 sentences from Lassy Large and 500 sentences from Lassy Small, all randomly selected. The validation of the lemma and PoS-tag annotations was carried out on the same sample from Lassy Small as for syntax, i.e. 500 sentences.

Formal validation, i.e. the checking of formal information such as file structure, size of files and directories, names of files etc., is not included in this validation task, but no problems were encountered in accessing the data and understanding the structure. For the syntax, the validators computed a sentence-based accuracy (number of sentences without errors divided by the total number of sentences). For Lassy Large, the validators found that the syntactic analysis was correct for a proportion of 78.4 % of the sentences. For Lassy Small, the proportion of correct syntactic analyses was 97.8 %. Out of the 500 validated sentences with a total of 8,494 words, the validators found 31 words with a wrong lemma (the accuracy of the lemma annotation therefore is 99.6 %). For this same set of sentences, validators found 116 words with a wrong part-of-speech tag (accuracy 98.63 %).
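The word-level accuracy figures follow directly from the reported error counts. The short sketch below redoes the arithmetic, using only the numbers given above (8,494 words, 31 lemma errors, 116 part-of-speech errors); the function name is chosen for this example.

```python
def accuracy_percent(errors, total):
    """Percentage of items without an error."""
    return 100.0 * (total - errors) / total

TOTAL_WORDS = 8494

lemma_accuracy = accuracy_percent(31, TOTAL_WORDS)
pos_accuracy = accuracy_percent(116, TOTAL_WORDS)

print(round(lemma_accuracy, 1))  # lemma accuracy, one decimal
print(round(pos_accuracy, 2))    # part-of-speech accuracy, two decimals
```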

In conclusion, the validation states that the Lassy corpora comprise a well elaborated resource of high quality. Lassy Small, the manually verified corpus, has really fine results for both syntax, part-of-speech and lemma, and the overall impression is very good. Lassy Large also has fine results for the syntax. The overall impression of the Lassy Large annotations is that the parser succeeds in building up acceptable trees for most of the sentences. Often the errors are merely a question of the correct labeling of the nodes.

6 Conclusion

In this article we have introduced the Lassy treebanks, and we illustrated the lemma, part-of-speech and dependency annotations. The quality of the annotations has been confirmed by an external validation. We provided evidence that the use of the standard XPath language suffices for the identification of relevant nodes in the treebanks, countering some evidence to the contrary by Lai and Bird [6] and Bouma [1]. We illustrated the potential usefulness of Lassy Small by estimating the frequency of question types in the context of a psycholinguistic study. We furthermore illustrated the use of the Lassy Large treebank in a study of the distribution of the two Dutch reflexive pronouns zich and zichzelf.