Cross-Language Comparability and Its Applications for MT

  • Bogdan Babych
  • Fangzhong Su
  • Anthony Hartley
  • Ahmet Aker
  • Monica Lestari Paramita
  • Paul Clough
  • Robert Gaizauskas
Part of the Theory and Applications of Natural Language Processing book series (NLP)


The concept of comparability (the linguistic relatedness, or closeness, of textual units or corpora) has many possible applications in computational linguistics. Consequently, measuring comparability has increasingly become a core technological challenge in the field, and metrics for it need to be developed and evaluated systematically. Many practical applications require corpora with controlled levels of comparability, which are established by comparability metrics. It is therefore important to understand the linguistic and technological mechanisms and implications of comparability, and to establish a systematic methodology for developing, evaluating and using comparability metrics. This chapter presents our approach to developing and using such metrics for machine translation (MT), especially for under-resourced languages. We address three core areas: (1) systematic meta-evaluation (or calibration) of the metrics on the basis of parallel corpora; (2) the development of feature-selection techniques for the metrics on the basis of aligned comparable texts, such as Wikipedia articles; and (3) the application of the developed metrics to MT tasks for under-resourced languages, and the measurement of their effectiveness on corpora with unknown degrees of comparability. This has led to redefining the vague linguistic concept of comparability in terms of the task-specific performance of tools that extract phrase-level translation equivalents from comparable texts.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Bogdan Babych (1), email author
  • Fangzhong Su (1)
  • Anthony Hartley (1)
  • Ahmet Aker (2)
  • Monica Lestari Paramita (2)
  • Paul Clough (2)
  • Robert Gaizauskas (2)

  1. University of Leeds, Leeds, UK
  2. University of Sheffield, Sheffield, UK
