ParCoLab: A Parallel Corpus for Serbian, French and English

Miletic, Aleksandra; Stosic, Dejan; Marjanović, Saša

doi:10.1007/978-3-319-64206-2_18


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10415)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue


Abstract

ParCoLab is a trilingual parallel corpus containing texts in Serbian, French and English. It is developed at the CLLE-ERSS research unit (UMR 5263 CNRS) at the University of Toulouse, France, in collaboration with the Department of Romance Studies at the University of Belgrade, Serbia. Since Serbian is one of the less-resourced European languages, this is an important step towards the creation of freely accessible corpora and NLP tools for this language. Our main goal is to provide the scientific community with a high-quality resource that can be used in a wide range of applications, such as contrastive linguistic studies, NLP research, machine and computer-assisted translation, translation studies, second language learning and teaching, and applied lexicography. The corpus currently contains 7.1M tokens, mainly from literary works, but corpus extension and diversification efforts are ongoing. ParCoLab can be queried online, and a part of it is available for download.


Notes

1. The only two modifications we make are the @langOri attribute, used in the <teiHeader> to encode the language of the original text in the XML files containing translations, and the @id attribute, used on the root <TEI> element to indicate the unique ID of the file inside the collection.
2. TED is a platform for short talks on various subjects. See http://www.ted.com/.
3. See, e.g., [23] for POS-tagging and [4] for parsing of English; [21] for POS-tagging, [3] for parsing, and [22] for lemmatization of French.
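
As an illustration of Note 1, the following is a minimal sketch of how the two added attributes might be read from a corpus file. Only the attribute names @langOri and @id and the element names <teiHeader> and <TEI> come from the note. The sample document, the element that carries @langOri inside the header, the ID value, the use of the TEI namespace, and the function name read_metadata are all assumptions made for the example and may not match the actual ParCoLab files.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like document. The placement of @langOri on <fileDesc>,
# the ID value and the text content are assumptions, not ParCoLab data.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0" id="doc-0001">
  <teiHeader>
    <fileDesc langOri="sr">
      <titleStmt><title>Sample translation</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>Example paragraph.</p></body></text>
</TEI>"""

# Assumed namespace: the standard TEI P5 namespace.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def read_metadata(xml_string):
    """Return (file ID, original language) from a TEI-like document.

    Either value is None when the corresponding attribute is absent.
    """
    root = ET.fromstring(xml_string)
    # @id sits on the root <TEI> element, as described in Note 1.
    file_id = root.get("id")

    lang_ori = None
    header = root.find(f"{TEI_NS}teiHeader")
    if header is not None:
        # The note does not say which header element carries @langOri,
        # so scan the header and all of its descendants for it.
        for elem in header.iter():
            if "langOri" in elem.attrib:
                lang_ori = elem.get("langOri")
                break
    return file_id, lang_ori


if __name__ == "__main__":
    print(read_metadata(SAMPLE))
```

Scanning the whole header rather than a fixed path keeps the sketch independent of where the attribute is actually placed, at the cost of returning only the first match.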

References

1. Agić, Ž., Ljubešić, N., Berović, D.: Lemmatization and morphosyntactic tagging of Croatian and Serbian. In: 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, BSNLP 2013 (2013)
2. Agić, Ž., Merkler, D., Berović, D.: Parsing Croatian and Serbian by using Croatian dependency treebanks. In: Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages (2013)
3. Candito, M., Nivre, J., Denis, P., Anguiano, E.H.: Benchmarking of statistical dependency parsers for French. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 108–116. Association for Computational Linguistics (2010)
4. Carreras, X.: Experiments with a higher-order projective dependency parser. In: EMNLP-CoNLL, pp. 957–961 (2007)
5. Čermák, F., Rosen, A.: The case of InterCorp, a multilingual parallel corpus. Int. J. Corpus Linguist. 17(3), 411–427 (2012)
6. Text Encoding Initiative Consortium (eds.): TEI P5: Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative Consortium (2008)
7. Esplá-Gomis, M., Forcada, M.: Bitextor, a free/open-source software to harvest translation memories from multilingual websites. In: Proceedings of MT Summit XII, Ottawa, Canada. Association for Machine Translation in the Americas (2009)
8. Gesmundo, A., Samardžić, T.: Lemmatising Serbian as category tagging with bidirectional sequence classification. In: LREC, pp. 2103–2106 (2012)
9. Halácsy, P., Kornai, A., Oravecz, C.: Hunpos: an open source trigram tagger. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 209–212. Association for Computational Linguistics (2007)
10. Jakovljević, B., Kovačević, A., Sečujski, M., Marković, M.: A dependency treebank for Serbian: initial experiments. In: Ronzhin, A., Potapova, R., Delic, V. (eds.) SPECOM 2014. LNCS, vol. 8773, pp. 42–49. Springer, Cham (2014). doi:10.1007/978-3-319-11581-8_5
11. Jongejan, B., Dalianis, H.: Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 145–153 (2009)
12. Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: MT Summit, vol. 5, pp. 79–86 (2005)
13. Krstev, C., Vitas, D.: An aligned English-Serbian corpus. In: ELLSIIR Proceedings (English Language and Literature Studies: Image, Identity, Reality), vol. 1, pp. 495–508 (2011)
14. Krstev, C., Vitas, D., Erjavec, T.: MULTEXT-East resources for Serbian. In: Zbornik 7. mednarodne multikonference Informacijska druzba IS 2004 Jezikovne tehnologije 9–15 Oktober 2004, Ljubljana, Slovenija, 2004. Erjavec, Tomaž and Zganec Gros, Jerneja (2004)
15. Ljubešić, N., Klubička, F., Agić, Ž., Jazbec, I.P.: New inflectional lexicons and training corpora for improved morphosyntactic annotation of Croatian and Serbian. In: Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016. European Language Resources Association (ELRA), Paris, May 2016
16. Ljubešić, N., Klubička, F.: {bs,hr,sr}WaC - web corpora of Bosnian, Croatian and Serbian. In: Proceedings of the 9th Web as Corpus Workshop (WaC-9), pp. 29–35 (2014)
17. Marjanović, S.: « Entrez, s’il vous plaît ! » : De la sélection lexicographique des phrasèmes. In: Repenser le figement: enjeux et perspectives en phraséo-didactique des langues. Université Paris 3 - Sorbonne Nouvelle (2016, forthcoming)
18. McDonald, R., Lerman, K., Pereira, F.: Multilingual dependency analysis with a two-stage discriminative parser. In: Proceedings of the Tenth Conference on Computational Natural Language Learning, pp. 216–220. Association for Computational Linguistics (2006)
19. Miletic, A.: Annotation morphosyntaxique semi-automatique d’un corpus littéraire serbe. Master’s thesis, Université Charles de Gaulle - Lille 3 (2013)
20. Miletic, A.: Building a morphosyntactic lexicon for Serbian using Wiktionary. In: 6th Journées d’études Toulousaines, JéTou 2017 (2017, forthcoming)
21. Sagot, B.: Etiquetage multilingue en parties du discours avec MELT. In: Actes de la conférence conjointe JEP-TALN-RECITAL 2016 (2016)
22. Seddah, D., Chrupała, G., Çetinoğlu, Ö., Van Genabith, J., Candito, M.: Lemmatization and lexicalized statistical parsing of morphologically rich languages: the case of French. In: Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pp. 85–93. Association for Computational Linguistics (2010)
23. Shen, L., Satta, G., Joshi, A.: Guided learning for bidirectional sequence classification. ACL 7, 760–767 (2007)
24. Stanojević, V., Durić, L.: Sur les indéfinis singuliers génériques en français et en serbe. Travaux de linguistique 1, 121–133 (2016)
25. Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., Varga, D.: The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In: 5th International Conference on Language Resources and Evaluation, LREC 2006 (2006)
26. Stosic, D., Fagard, B., Sarda, L., Colin, C.: Does the road go up the mountain? Fictive motion between linguistic conventions and cognitive motivations. Cogn. Process. 16(1), 221–225 (2015)
27. Tiedemann, J.: News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In: Nicolov, N., Bontchev, K., Angelova, G., Mitkov, R. (eds.) Recent Advances in Natural Language Processing, vol. 5, pp. 237–248 (2009)
28. Tyers, F.M., Alperen, M.S.: South-East European Times: a parallel corpus of Balkan languages. In: Proceedings of the LREC Workshop on Exploitation of Multilingual Resources and Tools for Central and (South-) Eastern European Languages, pp. 49–53 (2010)
29. Urieli, A.: Robust French syntax analysis: reconciling statistical methods and linguistic knowledge in the Talismane toolkit. Ph.D. thesis, Université Toulouse le Mirail-Toulouse II (2013)
30. Utvić, M.: Annotating the corpus of contemporary Serbian. In: Proceedings of the INFOtheca 2012 Conference (2011)
31. Vitas, D., Krstev, C.: Literature and aligned texts. Readings in Multilinguality, pp. 148–155 (2006)
32. von Waldenfels, R.: Compiling a parallel corpus of Slavic languages. Text strategies, tools and the question of lemmatization in alignment. In: Brehmer, B., Zdanova, V., Zimny, R. (eds.) Beiträge der Europäischen Slavistischen Linguistik, vol. 9, pp. 123–138 (2006)


Author information

Authors and Affiliations

CLLE, CNRS & University of Toulouse, 5, Allées Antonio Machado, 31058, Toulouse, France
Aleksandra Miletic & Dejan Stosic
Faculty of Philology, University of Belgrade, Studentski Trg 3, 11000, Belgrade, Serbia
Saša Marjanović


Corresponding author

Correspondence to Aleksandra Miletic.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein
University of West Bohemia, Pilsen, Czech Republic
Václav Matoušek


Cite this paper

Miletic, A., Stosic, D., Marjanović, S. (2017). ParCoLab: A Parallel Corpus for Serbian, French and English. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_18


DOI: https://doi.org/10.1007/978-3-319-64206-2_18
Published: 29 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64205-5
Online ISBN: 978-3-319-64206-2
eBook Packages: Computer Science, Computer Science (R0)
