Cross-Language Comparability and Its Applications for MT
Abstract
The concept of comparability (the linguistic relatedness, or closeness, of textual units or corpora) has many possible applications in computational linguistics. Consequently, measuring comparability has increasingly become a core technological challenge in the field, and metrics for it need to be developed and evaluated systematically. Many practical applications require corpora with controlled levels of comparability, which are established by comparability metrics. It is therefore important to understand the linguistic and technological mechanisms and implications of comparability, and to establish a systematic methodology for developing, evaluating and using comparability metrics. This chapter presents our approach to developing and using such metrics for machine translation (MT), especially for under-resourced languages. We address three core areas: (1) systematic meta-evaluation (or calibration) of the metrics on the basis of parallel corpora; (2) the development of feature-selection techniques for the metrics on the basis of aligned comparable texts, such as Wikipedia articles; and (3) the application of the developed metrics to MT for under-resourced languages, and the measurement of their effectiveness on corpora with unknown degrees of comparability. This work has led us to redefine the vague linguistic concept of comparability in terms of the task-specific performance of tools that extract phrase-level translation equivalents from comparable texts.
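To make the notion of a comparability metric concrete, the following is a minimal illustrative sketch and not the metrics developed in the chapter. It assumes a hypothetical toy bilingual lexicon (`lexicon`), maps source-language tokens into the target language, and scores a document pair by the cosine similarity of the resulting bag-of-words vectors. The function names `cosine` and `comparability` are chosen here for illustration only.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def comparability(src_tokens, tgt_tokens, lexicon):
    """Score a document pair between 0.0 and 1.0 (illustrative assumption).

    Source tokens are translated through `lexicon` (a hypothetical
    source-to-target dictionary); tokens missing from it are dropped.
    """
    mapped = Counter(lexicon[t] for t in src_tokens if t in lexicon)
    return cosine(mapped, Counter(tgt_tokens))


# Toy German-to-English lexicon, invented for this example.
toy_lexicon = {"hund": "dog", "katze": "cat", "haus": "house"}
score = comparability(["hund", "katze", "hund"], ["dog", "cat", "tree"], toy_lexicon)
```

A score of this kind could in principle be checked against parallel text, where a high value is expected, and against unrelated text, where a low value is expected; that is the general idea behind calibrating a metric, though the chapter's own procedure may differ.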