Abstract
In this paper we propose a corpus structure for representing and managing an aligned parallel corpus. The corpus structure is based on a stand-off annotation model composed of several XML documents. A bilingual parallel corpus represented in the proposed structure contains: (1) the entire corpus together with its corresponding linguistic information, and (2) translation units and alignment relations between units of the two languages: paragraphs, sentences and named entities. The proposed structure makes it possible to work with the corpus both as a linguistically annotated corpus and as a translation memory.
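As a rough illustration of the stand-off idea, the sketch below keeps each language's text in its own XML document and stores sentence alignments in a separate document that only refers to sentence identifiers. The element names (`text`, `s`, `alignments`, `link`), the attribute names, the placeholder sentences and the helper functions are all assumptions made for this example; they are not the schema or tooling described in the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-off documents. Element and attribute names are
# illustrative assumptions, not the schema used in the paper.
SPANISH_TEXT = """<text lang="es">
  <s id="es-s1">Spanish sentence 1</s>
  <s id="es-s2">Spanish sentence 2</s>
</text>"""

BASQUE_TEXT = """<text lang="eu">
  <s id="eu-s1">Basque sentence 1</s>
  <s id="eu-s2">Basque sentence 2</s>
</text>"""

# The alignment document holds no text of its own: it only points at
# sentence ids defined in the two documents above.
SENTENCE_ALIGNMENT = """<alignments level="sentence">
  <link source="es-s1" target="eu-s1"/>
  <link source="es-s2" target="eu-s2"/>
</alignments>"""


def index_sentences(xml_string):
    """Map each sentence id to its text content."""
    root = ET.fromstring(xml_string)
    return {s.get("id"): s.text for s in root.iter("s")}


def translation_units(source_xml, target_xml, alignment_xml):
    """Resolve alignment links into (source, target) sentence pairs."""
    source = index_sentences(source_xml)
    target = index_sentences(target_xml)
    links = ET.fromstring(alignment_xml)
    return [
        (source[link.get("source")], target[link.get("target")])
        for link in links.iter("link")
    ]


units = translation_units(SPANISH_TEXT, BASQUE_TEXT, SENTENCE_ALIGNMENT)
for es, eu in units:
    print(es, "<->", eu)
```

Because the alignment document only references identifiers, the same text documents could also be targeted by separate linguistic-annotation documents, which is what lets one corpus serve both as an annotated corpus and as a translation memory.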
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Casillas, A., de Illarraza, A.D., Igartua, J., Martínez, R., Sarasola, K., Sologaistoa, A. (2007). Spanish-Basque Parallel Corpus Structure: Linguistic Annotations and Translation Units. In: Matoušek, V., Mautner, P. (eds) Text, Speech and Dialogue. TSD 2007. Lecture Notes in Computer Science, vol 4629. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74628-7_31
DOI: https://doi.org/10.1007/978-3-540-74628-7_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74627-0
Online ISBN: 978-3-540-74628-7
eBook Packages: Computer Science, Computer Science (R0)