The PROIEL treebank family: a standard for early attestations of Indo-European languages

Eckhoff, Hanne; Bech, Kristin; Bouma, Gerlof; Eide, Kristine; Haug, Dag; Haugen, Odd Einar; Jøhndal, Marius

doi:10.1007/s10579-017-9388-5

Original Paper
Published: 09 May 2017

Volume 52, pages 29–65, (2018)
Language Resources and Evaluation

Hanne Eckhoff ORCID: orcid.org/0000-0001-8096-6515¹,
Kristin Bech²,
Gerlof Bouma³,
Kristine Eide²,
Dag Haug²,
Odd Einar Haugen⁴ &
Marius Jøhndal²

Abstract

This article describes a family of dependency treebanks of early attestations of Indo-European languages originating in the parallel treebank built by the members of the project Pragmatic Resources in Old Indo-European Languages (PROIEL). The treebanks all share a set of open-source software tools, including a web annotation interface, and a set of annotation schemes and guidelines developed especially for the project languages. The treebanks use an enriched dependency grammar scheme complemented by detailed morphological tags, which have proved sufficient to give detailed descriptions of these richly inflected languages, and which have been easy to adapt to new languages. We describe the tools and annotation schemes and discuss some challenges posed by the various languages that have been annotated. We also discuss problems with tokenisation, sentence division and lemmatisation, commonly encountered in ancient and mediaeval texts, and challenges associated with low levels of standardisation and ongoing morphological and syntactic change.

Notes

Work by Gerlof Bouma was supported by the Marcus and Amalia Wallenberg foundation (MAW 2012.0146: MAÞiR).
http://www.hf.uio.no/ifikk/english/research/projects/proiel/, http://proiel.github.io/.
http://foni.uio.no:3000
Information Structure and Word Order Change in Germanic and Romance Languages, http://www.hf.uio.no/ilos/english/research/projects/iswoc/.
http://iswoc.github.io/, also hosted with the PROIEL treebank at http://foni.uio.no:3000.
The project has also made use of Menotec’s Old Norwegian treebank and PROIEL’s Gothic treebank.
http://torottreebank.github.io/, https://nestor.uit.no.
http://www.menota.org/menotec.xml.
Hosted with the PROIEL treebank at http://foni.uio.no:3000 and also accessible through the INESS portal at http://clarino.uib.no/iness (select the treebanks for Old Norse).
http://www.menota.org.
http://bragi.info/greinir/.
https://spraakbanken.gu.se/mathir.
http://proiel.github.io/framework.
The reviewers have generally been senior project members or very experienced annotators. The number of corrections made by reviewers varies considerably depending on the accuracy and experience of the annotator as well as on the complexity of the text. For instance, the PROIEL New Testament texts generally have few corrections, because this text is extremely well supported by translations and exegesis and because analyses could be compared across languages during annotation (0.1–3.5% of the tokens were corrected for morphology or lemmatisation errors, 1.5–11.8% of the sentences were corrected for syntactic attachment or label errors). More complicated and less well-supported texts had considerably more corrections. For instance, Herodotus’ Histories (Ancient Greek, PROIEL) had 9% of its tokens corrected for morphology or lemmatisation, while 65.5% of the sentences were syntactically corrected. Very similarly, the Russkaja pravda (Old East Slavic, TOROT) had 10.8% of its tokens corrected for morphology or lemmatisation, and 65.1% of its sentences were syntactically corrected.
Currently, the PROIEL, Menotec, ISWOC and TOROT treebanks are also available for syntactic query in the INESS treebank facility, http://clarino.uib.no/iness/page.
All examples are given with a text reference and a sentence ID in the relevant treebank, if they are publicly available.
https://github.com/morphgnt.
http://www.wulfila.be/gothic/.
A further 7.5% of the tokens had off-by-one errors, i.e. only one of the ten morphological fields had the wrong value.
For a fuller description of the scheme, see Haug et al. 2009. For an exhaustive documentation of the scheme, see the PROIEL guidelines for syntactic annotation (http://folk.uio.no/daghaug/syntactic_guidelines.pdf). For documentation of the application of the scheme to Slavic, see the TOROT guidelines (http://folk.uio.no/hanneme/torot.pdf). For documentation of the application of the scheme to Old Norwegian, see Haugen and Øverland 2014. For documentation of the application of the scheme to Old English, see http://folk.uio.no/krisbec/OE_guidelines.pdf.
https://ufal.mff.cuni.cz/pdt2.0/.
https://perseusdl.github.io/treebank_data/.
http://itreebank.marginalia.it/.
http://ufal.mff.cuni.cz/project/pdt2.0/doc/manuals/en/t-layer/html/index.html.
For a discussion of dependency parsing with empty nodes, see Seeker et al. (2012). For an experiment using MaltParser on OCS data from TOROT, see Berdičevskis (2015); for a pre-parsing experiment on TOROT data, see Eckhoff and Berdičevskis (2016).
For further discussion and motivation of the differences between the two schemes, see Haug and Jøhndal (2008) and Haug et al. (2009).
In PDT-style treebanks, the dependent infinitive would be analysed as a subject. In the PROIEL scheme, argument infinitives are never analysed as SUBs unless they are nominalised by way of definite articles, since such structures are often ambiguous. Instead, they are analysed as COMP or XOBJ depending on whether they have an external subject.
Note that the secondary dependency from on ‘in’ to synt ‘are’ indicates that the dependent shares its subject with its head verb, but that this subject is not overtly expressed. Moreover, the dependent’s subject may be any (non-overt) argument of the verb.
Note that the APOS relation label is used not only for the usual type of nominal appositions, but also to mark non-restrictive relative clauses, as seen in the tree for example (11). Restrictive relative clauses are ATR dependents of their antecedents. In this the PROIEL scheme differs from the annotation in PDT-style treebanks, where restrictive and non-restrictive relative clauses are not distinguished.
As seen in the tree for example 12, the relation XOBJ is also used for nominal predicates in copular constructions. The reasoning is thus that nominal predicates are arguments of the copula and have external subjects which are identical with the copula’s subject. In this the PROIEL scheme deviates from the PDT scheme, which has a separate label PNOM for nominal predicates. PNOMs are also deemed to be dependents of the copula, but there is no direct indication of the external subject.
Note that, unlike in the PDT-based schemes, subjunctions are given the dependency label of the whole subordinate clause (in this case COMP). This is a general principle in the PROIEL scheme: the head of a subtree should always carry the relation label of the whole subtree, regardless of its form. In the PDT-based schemes, the subjunction is also the head of the subordinate clause, but it is labeled AuxC, and its dependent verb carries the relation label of the whole subordinate clause.
An additional confounding factor is that conditional clauses, a common environment for the indefinite pronoun čьto, often lack subordinators in Middle Russian texts, which can cause even more ambiguities.
Example (20a) also illustrates how predicate identity (PID) and shared dependents can be indicated by way of secondary dependencies.
The Penn Parsed Corpus of Historical Greek (PPCHiG) used a constituency-based annotation, but is no longer in active development.
https://github.com/biblicalhumanities/greek-new-testament/tree/master/syntax-trees/sblgnt.
http://www.dh.uni-leipzig.de/wo/projects/ancient-greek-and-latin-dependency-treebank-2-0/.
http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/index-thomisticus-treebank.html.
Note that AGDT also annotates punctuation.
http://universaldependencies.org/.
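
The note on off-by-one errors counts a tag as off by one when exactly one of its ten morphological fields has the wrong value. A minimal Python sketch of that comparison follows. It assumes that each tag is a ten-character positional string with `-` for an unset field; the function names and the two example tags are hypothetical and are not taken from the treebank tools.

```python
from typing import List

FIELD_COUNT = 10  # the note speaks of ten morphological fields


def differing_fields(predicted: str, gold: str) -> List[int]:
    """Return the zero-based positions at which two positional tags differ."""
    if len(predicted) != FIELD_COUNT or len(gold) != FIELD_COUNT:
        raise ValueError("each tag must have exactly ten fields")
    return [i for i, (p, g) in enumerate(zip(predicted, gold)) if p != g]


def is_off_by_one(predicted: str, gold: str) -> bool:
    """True when exactly one field has the wrong value."""
    return len(differing_fields(predicted, gold)) == 1


# Hypothetical tags, invented for illustration only.
gold = "3spia----i"
guess = "3ppia----i"
print(differing_fields(guess, gold))  # prints [1]
print(is_off_by_one(guess, gold))     # prints True
```

Counting differing positions rather than comparing whole tags is what allows a near miss to be told apart from a completely wrong analysis.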
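
The note on subjunctions contrasts two labelling conventions: in the PROIEL scheme the subjunction carries the relation label of the whole subordinate clause, whereas in PDT-based schemes the subjunction is labelled AuxC and its dependent verb carries the clause label. The sketch below converts the first convention into the second. The `Token` class, the part-of-speech values, the English example sentence and the label `PRED` on the verb below the subjunction are assumptions made for illustration; the note itself does not state which label that verb carries.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    id: int
    form: str
    pos: str              # hypothetical part-of-speech value
    head: Optional[int]   # None marks the root
    relation: str


def to_pdt_style(tokens: List[Token]) -> List[Token]:
    """Relabel subjunctions as AuxC and move the clause label to their verb.

    Assumption: a subjunction with exactly one verbal dependent heads a
    subordinate clause; anything else is left untouched.
    """
    result = {t.id: t for t in tokens}
    for sub in tokens:
        if sub.pos != "subjunction":
            continue
        verbs = [t for t in tokens if t.head == sub.id and t.pos == "verb"]
        if len(verbs) != 1:
            continue  # unclear structure, keep the original labels
        verb = verbs[0]
        result[verb.id] = replace(verb, relation=sub.relation)
        result[sub.id] = replace(sub, relation="AuxC")
    return [result[t.id] for t in tokens]


# Hypothetical sentence "I know that he comes", labelled PROIEL-style.
sentence = [
    Token(1, "I", "pronoun", 2, "SUB"),
    Token(2, "know", "verb", None, "PRED"),
    Token(3, "that", "subjunction", 2, "COMP"),
    Token(4, "he", "pronoun", 5, "SUB"),
    Token(5, "comes", "verb", 3, "PRED"),  # assumed label
]
converted = to_pdt_style(sentence)
print([(t.form, t.relation) for t in converted])
# prints [('I', 'SUB'), ('know', 'PRED'), ('that', 'AuxC'), ('he', 'SUB'), ('comes', 'COMP')]
```

The head links are the same in both conventions, so only the labels change; this is why the conversion never touches the `head` field.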

References

Adesam, Y., & Bouma, G. (2016). Part-of-speech tagging Old Swedish. In Proc of language technology for cultural heritage, social sciences, and humanities. Berlin.
Andrews, A. D. (1971). Case agreement of predicate modifiers in Ancient Greek. Linguistic Inquiry, 2(2), 127–151.
Andrews, A. D. (1982). Long distance agreement in Modern Icelandic. In P. Jacobson & G. K. Pullum (Eds.), The nature of syntactic representation (pp. 1–33). Dordrecht: D. Reidel.
Bamman, D., Crane, G., Passarotti, M., & Raynaud, S. (2007). Guidelines for the syntactic annotation of Latin treebanks. Technical report. Boston: Tufts Digital Library.
Berdičevskis, A. (2015). Estimating grammeme redundancy by measuring their importance for syntactic parser performance. In Proceedings of the 6th workshop on cognitive aspects of computational language learning (pp. 65–73). Association for Computational Linguistics.
Berdičevskis, A., Eckhoff, H., & Gavrilova, T. (2016). The beginning of a beautiful friendship: Rule-based and statistical analysis of Middle Russian. In Computational linguistics and intellectual technologies. Proceedings of Dialogue 16. Moscow.
Birnbaum, D., & Eckhoff, H. Machine-assisted multilingual alignment of the Codex Suprasliensis (manuscript).
Bouma, G., & Adesam, Y. (2013). Experiments on sentence segmentation in Old Swedish editions. In Þ. Eyþórsson, L. Borin, D. Haug & E. Rögnvaldsson (Eds.), Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, (pp. 11–26). Oslo. http://www.ep.liu.se/ecp/article.asp?issue=87&article=2.
Brants, T. (2000). TnT—A statistical part-of-speech tagger. In Proceedings of the 6th Applied Natural Language Processing Conference ANLP-2000. Seattle, WA.
Delsing, L.-O. (2002). Fornsvenska textbanken. In S. Lagman, S. Ö. Ohlsson & V. Voodla (Eds.), Svenska språkets historia i Östersjöområdet (pp. 149–156). Tartu.
Eckhoff, H. M. (2011). Old Russian possessive constructions: A construction grammar approach. Berlin: Mouton de Gruyter.
Eckhoff, H. M. (2015). Animacy and differential object marking in Old Church Slavonic. Russian Linguistics, 39(2), 233–254.
Eckhoff, H., & Berdičevskis, A. (2015). Linguistics vs. digital editions: The Tromsø Old Russian and OCS Treebank. Scripta and e-Scripta, 14–15, 9–25.
Eckhoff, H., & Berdičevskis, A. (2016). Automatic parsing as an efficient pre-annotation tool for historical texts. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH, held in conjunction with COLING). http://de.clarin.eu/en/current-issues/lt4dh/lt4dh-proceedings.
Eckhoff, H., & Haug, D. (2015). Aspect and prefixation in Old Church Slavonic. Diachronica, 32(2), 186–230.
Fort, K., & Sagot, B. (2010). Influence of pre-annotation on POS-tagged corpus development. In Proceedings of the 4th linguistic annotation workshop, (pp. 56–63). ACL: Uppsala.
Haug, D. (2011). From dependency structures to LFG representations. In M. Butt & T. Holloway King (Eds.), Proceedings of LFG12 (pp. 271–291). Stanford: CSLI Publications.
Haug, D., Eckhoff, H., & Welo, E. (2014). The theoretical foundations of givenness annotation. In K. Bech & K. Eide (Eds.), Information structure and syntactic change in Germanic and Romance languages. Amsterdam: John Benjamins.
Haug, D. T. T., & Jøhndal, M. L. (2008). Creating a parallel treebank of the old Indo-European Bible translations. In Proceedings of the 6th international language resources and evaluation (LREC’08). European Language Resources Association (ELRA).
Haug, D. T. T., Jøhndal, M., Eckhoff, H. M., Welo, E., Hertzenberg, M. J. B., & Müth, A. (2009). Computational and linguistic issues in designing a syntactically annotated parallel corpus of Indo-European languages. Traitement Automatique des Langues, 50, 17–45.
Haugen, O. E., & Øverland, F. Th. (2014). Guidelines for morphological and syntactic annotation of Old Norwegian texts. Bergen Language and Linguistics Studies 4(2). https://bells.uib.no/bells/issue/view/158.
Haugland, K. E. (2007). Old English impersonal constructions and the use and non-use of nonreferential pronouns. Ph.D. dissertation, University of Bergen.
Hertzenberg, M. J. B. (2011). Classical and Romance usages of ipse in the Vulgate. Oslo Studies in Language (OSLa), 3(3), 173–188.
Hertzenberg, M. J. B. (2014). “The valley” or “that valley”? Ille and ipse in the Itinerarium Egeriae. In P. Molinelli, P. Cuzzolin, & C. Fedriani (Eds.) Latin vulgaire—Latin tardif X. Actes du Xe colloque international sur le latin vulgaire et tardif. Bergamo, 5–9 septembre 2012. Bergamo: Sestante edizioni.
Jøhndal, M. (2012). Non-finiteness in Latin. Ph.D. dissertation, University of Cambridge.
König, E., & Lezius, W. (2003). The TIGER language—A description language for syntax graphs, formal definition. Technical report. IMS, University of Stuttgart.
Lee, J., & Haug, D. (2010). Porting an Ancient Greek and Latin treebank. In Proc. conference on language resources and evaluation (LREC).
Lindberg, R. (2013). Definiteness in Old Church Slavonic: A study of how long and short form in adjectives reflect information status. Master’s thesis, University of Oslo.
Mitchell, B. (1985). Old English syntax (Vol. 2). Oxford: Clarendon.
Müth, A. (2015). Indefiniteness, animacy and object marking: A quantitative study based on the Classical Armenian Gospel translation. Ph.D. thesis, University of Oslo.
Seeker, W., Farkas, R., Bohnet, B., Schmid, H., & Kuhn, J. (2012). Data-driven dependency parsing with empty heads. In M. Kay & C. Boitet (Eds.), Proceedings of COLING 2012: Posters (pp. 1081–1090). Mumbai. http://www.aclweb.org/anthology/C12-2105.
Skjærholt, A. (2011). More, faster: Accelerated corpus annotation with statistical taggers. Journal for Language Technology and Computational Linguistics, 26(2), 151–163.
Söderwall, K. F. (1884–1918). Ordbok över svenska medeltids-språket. Samlingar utgivna av Svenska fornskriftsällskapet, Serie 1, Svenska skrifter 27.
Traugott, E. C. (1992). Syntax. In R. M. Hogg (Ed.), The Cambridge history of the English language, vol. 1: The beginnings to 1066 (pp. 168–289). Cambridge: Cambridge University Press.

Author information

Authors and Affiliations

UiT The Arctic University of Norway, Tromsø, Norway
Hanne Eckhoff
University of Oslo, Oslo, Norway
Kristin Bech, Kristine Eide, Dag Haug & Marius Jøhndal
University of Gothenburg, Gothenburg, Sweden
Gerlof Bouma
University of Bergen, Bergen, Norway
Odd Einar Haugen

Corresponding author

Correspondence to Hanne Eckhoff.

About this article

Cite this article

Eckhoff, H., Bech, K., Bouma, G. et al. The PROIEL treebank family: a standard for early attestations of Indo-European languages. Lang Resources & Evaluation 52, 29–65 (2018). https://doi.org/10.1007/s10579-017-9388-5

Published: 09 May 2017
Issue Date: March 2018
DOI: https://doi.org/10.1007/s10579-017-9388-5
