ParCoLab: A Parallel Corpus for Serbian, French and English

  • Aleksandra Miletic
  • Dejan Stosic
  • Saša Marjanović
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10415)

Abstract

ParCoLab is a trilingual parallel corpus containing texts in Serbian, French and English. It is developed at the CLLE-ERSS research unit (UMR 5263 CNRS) at the University of Toulouse, France, in collaboration with the Department of Romance Studies at the University of Belgrade, Serbia. Since Serbian is one of the less-resourced European languages, this is an important step towards the creation of freely accessible corpora and NLP tools for this language. Our main goal is to provide the scientific community with a high-quality resource that can be used in a wide range of applications, such as contrastive linguistic studies, NLP research, machine translation and computer-assisted translation, translation studies, second language learning and teaching, and applied lexicography. The corpus currently contains 7.1M tokens, mainly from literary works, but corpus extension and diversification efforts are ongoing. ParCoLab can be queried online, and a part of it is available for download.

Keywords

Parallel corpus, Serbian, French, English, NLP resources

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Aleksandra Miletic (1)
  • Dejan Stosic (1)
  • Saša Marjanović (2)

  1. CLLE, CNRS & University of Toulouse, Toulouse, France
  2. Faculty of Philology, University of Belgrade, Belgrade, Serbia
