ParTUT: The Turin University Parallel Treebank

Sanguinetti, Manuela; Bosco, Cristina

doi:10.1007/978-3-319-14206-7_3

Part of the book series: Studies in Computational Intelligence (SCI, volume 589)

Abstract

In this paper, we introduce an ongoing project for the development of a parallel treebank for Italian, English and French. The treebank is annotated in a dependency format, namely the one designed for the Turin University Treebank (TUT), hence the name of the new resource, Par(allel)TUT. The project aims to create a resource that is useful in particular for translation research. Therefore, besides continually enriching the treebank with new and heterogeneous data, so as to build a dynamic and balanced multilingual treebank, the current stage of the project is devoted to the design of a data alignment tool that takes into account the syntactic knowledge annotated in this kind of resource. The paper focuses in particular on the study of translational divergences and their implications for the development of the alignment tool. It provides an overview of the treebank, with its current content and the peculiarities of its annotation format, and a description of the classes of translational divergences that may be encountered in the treebank, together with a proposal for their alignment.

Notes

1. http://ufal.mff.cuni.cz/pcedt2.0/en/index.html.
2. http://www.cl.uzh.ch/research/paralleltreebanks/smultron_en.html.
3. Unlike in work on statistical machine translation, phrase alignment in this work means alignment between linguistically motivated phrases.
4. http://kitt.cl.uzh.ch/kitt/treealigner.
5. http://code.google.com/p/copenhagen-dependency-treebank/.
6. http://fedora.clarin-d.uni-saarland.de/grug/.
7. http://www.cis.upenn.edu/~chinese/.
8. http://www.ircs.upenn.edu/arabic/.
9. http://ufal.mff.cuni.cz/pcedt2.0/.
10. http://www.di.unito.it/~tutreeb.
11. http://www.evalita.it/.
12. http://atoll.inria.fr/passage/eval2.en.html.
13. http://creativecommons.org/licenses/by-nc-sa/2.0.
14. http://www.statmt.org/europarl/; the section used is ep_00_01_17.
15. Namely the “Help” section, at https://www.facebook.com/help/345121355559712/.
16. The section used is jrc52006DC243.
17. http://www.ohchr.org/EN/UDHR/Pages/SearchByLang.aspx.
18. https://wit3.fbk.eu/; we retrieved the texts used for training of MT systems, downloaded from https://wit3.fbk.eu/mt.php?release=2012-02.
19. As for the sentence count, we would like to clarify that some sub-corpora, especially the UDHR, contain short headings (e.g. ‘Article 1’) that we did not consider when calculating the average sentence length, even though they were treated as separate sentences according to the parser segmentation criteria.
20. In general, considering the sources from which the texts of ParTUT have been retrieved, it can be assumed that they are not all original, but drafted in one or more languages and then translated into the others.
21. In this paper, we report examples of sentences (or fragments of sentences) in all the languages involved. Glosses are provided for the non-English examples; they are intended as literal and do not necessarily correspond to the correct English expression.
22. In the Italian TUT there is a third component (omitted here and in the current ParTUT annotation) concerning the semantic role of the dependent with respect to its governor.
23. The TUTtoPenn converter can be downloaded at http://www.di.unito.it/~tutreeb/TUTtoPENNconverter/.
24. http://nlp.stanford.edu/software/stanford-dependencies.shtml.
25. A semi-automatic alignment has also been performed with LF Aligner (http://sourceforge.net/projects/aligner/).
26. These labels are used to identify the treebank fragment we refer to in the examples: they indicate section_language#sentencenumber.
27. Since the translation direction is unknown in the ParTUT texts, we consider the two transformation strategies as counterparts of each other and put them in the same subclass, whereas other works consider them separate categories [8]. We applied the same principle to the cases of addition/deletion, mentioned below.
28. In this example, in particular, we observe both additions and deletions when comparing the English sentence with the French version.
29. http://www.cse.unt.edu/~rada/wpt; http://www.cse.unt.edu/~rada/wpt05.
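
Note 26 describes the labels that identify treebank fragments as section_language#sentencenumber. The following is a minimal Python sketch of how such a label could be split into its parts. It assumes that the underscore and the hash sign are literal separators, which the note does not state explicitly; the names FragmentLabel and parse_label and the sample label are hypothetical and are not part of ParTUT.

```python
import re
from typing import NamedTuple


class FragmentLabel(NamedTuple):
    """Parts of a fragment label (hypothetical helper type)."""
    section: str   # sub-corpus name
    language: str  # language tag
    sentence: int  # sentence number within the section


# Assumption: '_' separates section from language, '#' precedes the number.
# The greedy first group lets a section name contain underscores itself.
_LABEL = re.compile(r"^(?P<section>.+)_(?P<language>[^_#]+)#(?P<sentence>\d+)$")


def parse_label(label: str) -> FragmentLabel:
    """Split a 'section_language#sentencenumber' label into its three parts."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"not a section_language#sentencenumber label: {label!r}")
    return FragmentLabel(match["section"], match["language"], int(match["sentence"]))


if __name__ == "__main__":
    # Made-up label, used only to exercise the function.
    print(parse_label("UDHR_en#12"))
```

If the actual labels use different separators, only the regular expression needs to change.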

References

1. Bosco, C., Mazzei, A.: The EVALITA dependency parsing task: from 2007 to 2011. In: Proceedings of Evalita 2011, Evaluation of Natural Language and Speech Tools for Italian. LNCS/LNAI, Springer (2012)
2. Bosco, C., Mazzei, A., Lavelli, A.: Looking back to the EVALITA constituency parsing task: 2007–2011. In: Proceedings of Evalita 2011, Evaluation of Natural Language and Speech Tools for Italian. LNCS/LNAI, Springer (2012)
3. Bosco, C., Simi, M., Montemagni, S.: Converting Italian Treebanks: towards an Italian Stanford dependency treebank. In: Proceedings of the ACL’13 7th Linguistic Annotation Workshop and Interoperability with Discourse (LAW&ID), Sofia, Bulgaria (2013)
4. Buchholz, S., Marsi, E.: CoNLL-X shared task on multilingual dependency parsing. In: Proceedings of CoNLL (2006)
5. Catford, J.C.: A Linguistic Theory of Translation: An Essay on Applied Linguistics. Oxford University Press, Oxford (1965)
6. Cettolo, M., Girardi, C., Federico, M.: WIT3: a web inventory of transcribed talks. In: Proceedings of the 16th EAMT Conference, Trento, Italy (2012)
7. Copestake, A., Flickinger, D., Pollard, C., Sag, I.A.: Minimal recursion semantics: an introduction. Res. Lang. Comput. 3(4), 281–332 (2005)
8. Cyrus, L.: Building a resource for studying translation shifts. In: Proceedings of Language Resources and Evaluation Conference (LREC’06), Genova, Italy (2006)
9. de Marneffe, M.-C., Manning, C.D.: The Stanford typed dependencies representation. In: Proceedings of the COLING’08 Workshop on Cross-Framework and Cross-Domain Parser Evaluation (CrossParser’08), Manchester, United Kingdom (2008)
10. Ding, Y., Palmer, M.: Automatic learning of parallel dependency treelet pairs. In: Proceedings of the First International Joint Conference on Natural Language Processing (IJCNLP-04) (2004)
11. Ding, Y., Gildea, D., Palmer, M.: An algorithm for word-level alignment of parallel dependency trees. In: The 9th Machine Translation Summit of the International Association for Machine Translation (2003)
12. Dyvik, H., Meurer, P., Rosén, V., De Smedt, K.: Linguistically motivated parallel parsebanks. In: Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories (TLT8) (2009)
13. Flickinger, D., Kordoni, V., Zhang, Y., Branco, A., Simov, K., Osenova, P., Carvalheiro, C., Costa, F., Castro, S.: ParDeepBank: multiple parallel deep treebanking. In: Proceedings of the Eleventh International Workshop on Treebanks and Linguistic Theories (2012)
14. Fox, H.J.: Phrasal cohesion and statistical machine translation. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP’02) (2002)
15. Hajič, J., Zemánek, P.: Prague Arabic dependency treebank: development in data and tools. In: Proceedings of the NEMLAR Conference on Arabic Language Resources and Tools (2003)
16. Hearne, M., Tinsley, J., Zhechev, V., Way, A.: Capturing translational divergences with a statistical tree-to-tree aligner. In: Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation (TMI-07) (2007)
17. Hudson, R.: Word Grammar. Blackwell, Oxford (1984)
18. Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In: Machine Translation Summit X, Phuket, Thailand (2005)
19. Lavie, A., Parlikar, A., Ambati, V.: Syntax-driven learning of sub-sentential translation equivalents and translation rules from parsed parallel corpora. In: Proceedings of the Second Workshop on Syntax and Structure in Statistical Translation (SSST’08) (2008)
20. Lesmo, L.: The Turin University Parser at Evalita 2009. In: Proceedings of Evalita’09, Reggio Emilia, Italy (2009)
21. Ma, Y., Ozdowska, S., Sun, Y., Way, A.: Improving word alignment using syntactic dependencies. In: Proceedings of the Second ACL Workshop on Syntax and Structure in Statistical Translation (SSST-2) (2008)
22. Mareček, D., Žabokrtský, Z., Novák, V.: Automatic alignment of Czech and English deep syntactic dependency trees. In: Proceedings of the 12th EAMT Conference (2008)
23. Menezes, A., Richardson, S.D.: A best-first alignment algorithm for automatic extraction of transfer mappings from bilingual corpora. In: Proceedings of the Workshop on Data-driven Methods in Machine Translation at ACL-2001 (2001)
24. Moore, R.C.: Fast and accurate sentence alignment of bilingual corpora. In: Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: From Research to Real Users, Tiburon, California (2002)
25. Nakazawa, T., Kurohashi, S.: Bayesian subtree alignment model based on dependency trees. In: Proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand (2011)
26. Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., Yuret, D.: The CoNLL 2007 shared task on dependency parsing. In: Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007 (2007)
27. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. In: Computational Linguistics, vol. 29(1). MIT Press, Cambridge (2003)
28. Osborne, T., Putnam, M., Gross, T.: Catenae: introducing a novel unit of syntactic analysis. In: Syntax, 15(4) (2012)
29. Ozdowska, S.: Using bilingual dependencies to align words in English/French parallel corpora. In: Proceedings of the ACL Student Research Workshop (2005)
30. Sanguinetti, M., Bosco, C., Cupi, L.: Exploiting catenae in a parallel treebank alignment. In: Proceedings of the 9th Language Resources and Evaluation Conference (LREC’14), Reykjavik, Iceland (2014)
31. Simov, K., Osenova, P., Laskova, L., Savkov, A., Kancheva, S.: Bulgarian-English parallel treebank: word and semantic level alignment. In: Proceedings of Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria (2011)
32. Simov, K., Osenova, P.: Bulgarian-English treebank: design and implementation. In: Linguist. Issues Lang. Technol. - LiLT 7(14) (2012)
33. Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., Varga, D.: The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In: Proceedings of Language Resources and Evaluation Conference (LREC’06), Genova (2006)
34. Tiedemann, J., Kotzé, G.: Building a large machine-aligned parallel treebank. In: Proceedings of the 8th International Workshop on Treebanks and Linguistic Theories (TLT8) (2009)
35. Vinay, J.P., Darbelnet, J.: Comparative Stylistics of French and English. John Benjamins, Amsterdam and Philadelphia (1958)
36. Zhechev, V., Way, A.: Automatic generation of parallel treebanks. In: 22nd International Conference on Computational Linguistics (COLING 2008) (2008)

Author information

Authors and Affiliations

Dipartimento di Informatica, Università di Torino, Corso Svizzera 195, 10149, Torino, Italy
Manuela Sanguinetti & Cristina Bosco

Corresponding author

Correspondence to Manuela Sanguinetti.

Editor information

Editors and Affiliations

Department of Computer Science, Systems and Production, University of Rome Tor Vergata, Rome, Italy
Roberto Basili
Department of Computer Science, University of Turin, Turin, Italy
Cristina Bosco
Department of Language and Cultural Studies, Department of Computer Science, Ca’ Foscari University of Venice, Venezia, Italy
Rodolfo Delmonte
Department of Computer Science and Information Engineering, University of Trento, Trento, Italy
Alessandro Moschitti
Department of Computer Science, University of Pisa, Pisa, Italy
Maria Simi

About this chapter

Cite this chapter

Sanguinetti, M., Bosco, C. (2015). ParTUT: The Turin University Parallel Treebank. In: Basili, R., Bosco, C., Delmonte, R., Moschitti, A., Simi, M. (eds) Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project. Studies in Computational Intelligence, vol 589. Springer, Cham. https://doi.org/10.1007/978-3-319-14206-7_3

DOI: https://doi.org/10.1007/978-3-319-14206-7_3
Published: 15 January 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14205-0
Online ISBN: 978-3-319-14206-7
eBook Packages: Engineering, Engineering (R0)
