Training, Enhancing, Evaluating and Using MT Systems with Comparable Data

Using Comparable Corpora for Under-Resourced Areas of Machine Translation

Abstract

This chapter describes how semi-parallel and parallel data extracted from comparable corpora can be used to enhance machine translation (MT) systems: which methods are used for this task in statistical and rule-based MT systems, and which showcases illustrate the use of such enhanced systems. The impact of data extracted from comparable corpora on MT quality is evaluated for 17 language pairs, and detailed studies involving human evaluation are carried out for 11 language pairs. First, baseline statistical machine translation (SMT) systems were built using traditional SMT techniques. These systems were then improved by integrating additional data extracted from the comparable corpora, and a comparative evaluation was performed to measure the improvements. Comparable corpora were also used to enrich the linguistic knowledge of rule-based machine translation (RBMT) systems by applying terminology extraction technology. Finally, SMT systems were adapted to a narrow domain by adding domain-specific knowledge such as terminology, named entities (NEs), and domain-specific language models (LMs).

Author information

Corresponding author

Correspondence to Inguna Skadiņa.

Additional information

Chapter editor: Inguna Skadiņa

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Babych, B. et al. (2019). Training, Enhancing, Evaluating and Using MT Systems with Comparable Data. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_6

  • DOI: https://doi.org/10.1007/978-3-319-99004-0_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99003-3

  • Online ISBN: 978-3-319-99004-0

  • eBook Packages: Computer Science, Computer Science (R0)
