Part of the book series: Studies in Computational Intelligence (SCI, volume 589)

Abstract

In this paper, we introduce an ongoing project for the development of a parallel treebank for Italian, English and French. The treebank is annotated in the dependency format designed for the Turin University Treebank (TUT), hence the name of the new resource, Par(allel)TUT. The project aims to create a resource that is useful in particular for translation research. Therefore, besides continually enriching the treebank with new and heterogeneous data, so as to build a dynamic and balanced multilingual treebank, the current stage of the project is devoted to the design of a tool for aligning the data that takes into account the syntactic knowledge annotated in this kind of resource. The paper focuses in particular on the study of translational divergences and their implications for the development of the alignment tool. It provides an overview of the treebank, with its current content and the peculiarities of the annotation format, and a description of the classes of translational divergences that may be encountered in the treebank, together with a proposal for their alignment.

Notes

  1. http://ufal.mff.cuni.cz/pcedt2.0/en/index.html.

  2. http://www.cl.uzh.ch/research/paralleltreebanks/smultron_en.html.

  3. Unlike in work on statistical machine translation, phrase alignment in this work is intended as an alignment between linguistically motivated phrases.

  4. http://kitt.cl.uzh.ch/kitt/treealigner.

  5. http://code.google.com/p/copenhagen-dependency-treebank/.

  6. http://fedora.clarin-d.uni-saarland.de/grug/.

  7. http://www.cis.upenn.edu/~chinese/.

  8. http://www.ircs.upenn.edu/arabic/.

  9. http://ufal.mff.cuni.cz/pcedt2.0/.

  10. http://www.di.unito.it/~tutreeb.

  11. http://www.evalita.it/.

  12. http://atoll.inria.fr/passage/eval2.en.html.

  13. http://creativecommons.org/licenses/by-nc-sa/2.0.

  14. http://www.statmt.org/europarl/; the section used is ep_00_01_17.

  15. Namely the “Help” section, at https://www.facebook.com/help/345121355559712/.

  16. The section used is jrc52006DC243.

  17. http://www.ohchr.org/EN/UDHR/Pages/SearchByLang.aspx.

  18. https://wit3.fbk.eu/; we retrieved the texts used for training MT systems, downloaded from https://wit3.fbk.eu/mt.php?release=2012-02.

  19. As for the sentence count, we would like to clarify that some sub-corpora, especially the UDHR, contain short headings (e.g. ‘Article 1’) that we did not consider when calculating the average sentence length, even though they were treated as separate sentences according to the parser segmentation criteria.

  20. In general, considering the sources from which the texts of ParTUT have been retrieved, it can be assumed that they are not all original, but were drafted in one or more languages and then translated into the others.

  21. In this paper, we report examples of sentences (or fragments of sentences) in all the languages involved. Glosses are provided for the non-English examples; they are intended as literal and do not necessarily correspond to the correct English expression.

  22. In the Italian TUT there is a third component (omitted here and in the current ParTUT annotation) concerning the semantic role of the dependent with respect to its governor.

  23. The TUTtoPenn converter can be downloaded at http://www.di.unito.it/~tutreeb/TUTtoPENNconverter/.

  24. http://nlp.stanford.edu/software/stanford-dependencies.shtml.

  25. A semi-automatic alignment has also been performed with LF Aligner (http://sourceforge.net/projects/aligner/).

  26. These labels are used to identify the treebank fragment we refer to in the examples: they indicate section_language#sentencenumber.

  27. Since the translation direction of the ParTUT texts is unknown, we consider the two transformation strategies as counterparts of each other and put them in the same subclass, while other works instead considered them as separate categories [8]. We applied the same principle to the cases of addition/deletion, mentioned below.

  28. In this example, in particular, we observe both additions and deletions when comparing the English sentence to the French version.

  29. http://www.cse.unt.edu/~rada/wpt; http://www.cse.unt.edu/~rada/wpt05.

References

  1. Bosco, C., Mazzei, A.: The EVALITA dependency parsing task: from 2007 to 2011. In: Proceedings of Evalita 2011, Evaluation of Natural Language and Speech Tools for Italian. LNCS/LNAI, Springer (2012)

  2. Bosco, C., Mazzei, A., Lavelli, A.: Looking back to the EVALITA constituency parsing task: 2007–2011. In: Proceedings of Evalita 2011, Evaluation of Natural Language and Speech Tools for Italian. LNCS/LNAI, Springer (2012)

  3. Bosco, C., Simi, M., Montemagni, S.: Converting Italian Treebanks: towards an Italian Stanford dependency treebank. In: Proceedings of the ACL’13 7th Linguistic Annotation Workshop and Interoperability with Discourse (LAW&ID), Sofia, Bulgaria (2013)

  4. Buchholz, S., Marsi, E.: CoNLL-X shared task on multilingual dependency parsing. In: Proceedings of CoNLL (2006)

  5. Catford, J.C.: A Linguistic Theory of Translation: An Essay on Applied Linguistics. Oxford University Press, Oxford (1965)

  6. Cettolo, M., Ghirardi, F., Federico, M.: WIT3: a web inventory of transcribed talks. In: Proceedings of the 16th EAMT Conference, Trento, Italy (2012)

  7. Copestake, A., Flickinger, D., Pollard, C., Sag, I.A.: Minimal recursion semantics: an introduction. Res. Lang. Comput. 3(4), 281–332 (2005)

  8. Cyrus, L.: Building a resource for studying translation shifts. In: Proceedings of Language Resources and Evaluation Conference (LREC’06), Genova, Italy (2006)

  9. de Marneffe, M.-C., Manning, C.D.: The Stanford typed dependencies representation. In: Proceedings of the COLING’08 Workshop on Cross-Framework and Cross-Domain Parser Evaluation (CrossParser’08), Manchester, United Kingdom (2008)

  10. Ding, Y., Palmer, M.: Automatic learning of parallel dependency treelet pairs. In: Proceedings of the First International Joint Conference on Natural Language Processing (IJCNLP-04) (2004)

  11. Ding, Y., Gildea, D., Palmer, M.: An algorithm for word-level alignment of parallel dependency trees. In: The 9th Machine Translation Summit of the International Association for Machine Translation (2003)

  12. Dyvik, H., Meurer, P., Rosén, V., De Smedt, K.: Linguistically motivated parallel parsebanks. In: Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories (TLT8) (2009)

  13. Flickinger, D., Kordoni, V., Zhang, Y., Branco, A., Simov, K., Osenova, P., Carvalheiro, C., Costa, F., Castro, S.: ParDeepBank: multiple parallel deep treebanking. In: Proceedings of the Eleventh International Workshop on Treebanks and Linguistic Theories (2012)

  14. Fox, H.J.: Phrasal cohesion and statistical machine translation. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP’02) (2002)

  15. Hajič, J., Zemánek, P.: Prague Arabic dependency treebank: development in data and tools. In: Proceedings of the NEMLAR Conference on Arabic Language Resources and Tools (2003)

  16. Hearne, M., Tinsley, J., Zhechev, V., Way, A.: Capturing translational divergences with a statistical tree-to-tree aligner. In: Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation (TMI-07) (2007)

  17. Hudson, R.: Word Grammar. Blackwell, Oxford (1984)

  18. Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In: Machine Translation Summit X, Phuket, Thailand (2005)

  19. Lavie, A., Parlikar, A., Ambati, V.: Syntax-driven learning of sub-sentential translation equivalents and translation rules from parsed parallel corpora. In: Proceedings of the Second Workshop on Syntax and Structure in Statistical Translation (SSST’08) (2008)

  20. Lesmo, L.: The Turin University Parser at Evalita 2009. In: Proceedings of Evalita’09, Reggio Emilia, Italy (2009)

  21. Ma, Y., Ozdowska, S., Sun, Y., Way, A.: Improving word alignment using syntactic dependencies. In: Proceedings of the Second ACL Workshop on Syntax and Structure in Statistical Translation (SSST-2) (2008)

  22. Mareček, D., Žabokrtský, Z., Novák, V.: Automatic alignment of Czech and English deep syntactic dependency tree. In: Proceedings of the 12th EAMT Conference (2008)

  23. Menezes, A., Richardson, S.D.: A best-first alignment algorithm for automatic extraction of transfer mappings from bilingual corpora. In: Proceedings of the Workshop on Data-driven Methods in Machine Translation at ACL-2001 (2001)

  24. Moore, R.C.: Fast and accurate sentence alignment of bilingual corpora. In: Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: From Research to Real Users, Tiburon, California (2002)

  25. Nakazawa, T., Kurohashi, S.: Bayesian subtree alignment model based on dependency trees. In: Proceedings of 5th Joint Conference on Natural Language Processing, Chiang Mai, Thailand (2011)

  26. Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., Yuret, D.: The CoNLL 2007 shared task on dependency parsing. In: Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007 (2007)

  27. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. In: Computational Linguistics, vol. 29(1). MIT Press, Cambridge (2003)

  28. Osborne, T., Putnam, M., Gross, T.: Catenae: introducing a novel unit of syntactic analysis. In: Syntax, 15(4) (2012)

  29. Ozdowska, S.: Using bilingual dependencies to align words in English/French parallel corpora. In: Proceedings of the ACL Student Research Workshop (2005)

  30. Sanguinetti, M., Bosco, C., Cupi, L.: Exploiting catenae in a parallel treebank alignment. In: Proceedings of the 9th Language Resources and Evaluation Conference (LREC’14), Reykjavik, Iceland (2014)

  31. Simov, K., Osenova, P., Laskova, L., Savkov, A., Kancheva, S.: Bulgarian-English parallel treebank: word and semantic level alignment. In: Proceedings of Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria (2011)

  32. Simov, K., Osenova, P.: Bulgarian-English treebank: design and implementation. In: Linguist. Issues Lang. Technol. - LiLT 7(14) (2012)

  33. Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., Varga, D.: The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In: Proceedings of Language Resources and Evaluation Conference (LREC’06), Genova (2006)

  34. Tiedemann, J., Kotzé, G.: Building a large machine-aligned parallel treebank. In: Proceedings of the 8th International Workshop on Treebanks and Linguistic Theories (TLT’08) (2009)

  35. Vinay, J.P., Darbelnet, J.: Comparative Stylistics of French and English. John Benjamins, Amsterdam and Philadelphia (1958)

  36. Zhechev, V., Way, A.: Automatic generation of parallel treebanks. In: 22nd International Conference on Computational Linguistics (COLING 2008) (2008)

Author information

Corresponding author

Correspondence to Manuela Sanguinetti.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Sanguinetti, M., Bosco, C. (2015). PartTUT: The Turin University Parallel Treebank. In: Basili, R., Bosco, C., Delmonte, R., Moschitti, A., Simi, M. (eds) Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project. Studies in Computational Intelligence, vol 589. Springer, Cham. https://doi.org/10.1007/978-3-319-14206-7_3

  • DOI: https://doi.org/10.1007/978-3-319-14206-7_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-14205-0

  • Online ISBN: 978-3-319-14206-7

  • eBook Packages: Engineering, Engineering (R0)
