Introduction

  • Inguna Skadiņa
  • Robert Gaizauskas
  • Andrejs Vasiļjevs
  • Monica Lestari Paramita
Chapter
Part of the Theory and Applications of Natural Language Processing book series (NLP)

Abstract

This book addresses the full set of questions that arise when attempting to exploit comparable corpora to overcome the bottleneck of insufficient parallel corpora that affects any data-driven machine translation approach, particularly in relation to under-resourced languages and narrow domains. It describes methods and tools for identifying and assessing comparability, for gathering comparable corpora from the Web, and for extracting translation equivalents from comparable texts. It also discusses the evaluation of this pipeline of methods and tools, in which their outputs are incorporated into a machine translation system whose performance is assessed in real application settings.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Inguna Skadiņa (1)
  • Robert Gaizauskas (2)
  • Andrejs Vasiļjevs (1)
  • Monica Lestari Paramita (2)

  1. Tilde, Riga, Latvia
  2. University of Sheffield, Sheffield, UK