New Areas of Application of Comparable Corpora

  • Reinhard Rapp
  • Vivian Xu
  • Michael Zock
  • Serge Sharoff
  • Richard Forsyth
  • Bogdan Babych
  • Chenhui Chu
  • Toshiaki Nakazawa
  • Sadao Kurohashi
Chapter
Part of the Theory and Applications of Natural Language Processing book series (NLP)

Abstract

This chapter describes several approaches to using comparable corpora beyond machine translation (MT) for under-resourced languages, which is the primary focus of the ACCURAT project. Section 7.1, which is based on Rapp and Zock (2010), addresses the task of creating resources for bilingual dictionaries using a seed lexicon. Section 7.2, which is based on Rapp et al. (2012), develops and evaluates a novel methodology for creating bilingual dictionaries without an initial lexicon. Section 7.3 proposes a novel system that extracts Chinese–Japanese parallel sentences from quasi-comparable and comparable corpora.

References

  1. Abdul-Rauf, S., & Schwenk, H. (2011). Parallel sentence generation from comparable corpora for improved SMT. Machine Translation, 25(4), 341–375.
  2. Adafre, S. F., & de Rijke, M. (2006). Finding similar sentences across multiple languages in Wikipedia. In Proceedings of EACL (pp. 62–69).
  3. Armstrong, S., Kempen, M., McKelvie, D., Petitpierre, D., Rapp, R., & Thompson, H. (1998). Multilingual corpora for cooperation. In Proceedings of the 1st International Conference on Linguistic Resources and Evaluation (LREC) (Vol. 2, pp. 975–980), Granada.
  4. Brants, T. (2000). TnT – A statistical part-of-speech tagger. In Proceedings of the 6th Applied Natural Language Processing Conference (pp. 224–231).
  5. Chiao, Y.-C., Sta, J.-D., & Zweigenbaum, P. (2004). A novel approach to improve word translations extraction from non-parallel, comparable corpora. In Proceedings of the International Joint Conference on Natural Language Processing, Hainan, China, AFNLP, 2004.
  6. Chu, C., Nakazawa, T., & Kurohashi, S. (2011). Japanese-Chinese phrase alignment using common Chinese characters information. In Proceedings of MT Summit XIII (pp. 475–482), Xiamen, China, September.
  7. Chu, C., Nakazawa, T., Kawahara, D., & Kurohashi, S. (2012a, May). Exploiting shared Chinese characters in Chinese word segmentation optimization for Chinese-Japanese machine translation. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT2012) (pp. 35–42), Trento, Italy.
  8. Chu, C., Nakazawa, T., Kawahara, D., & Kurohashi, S. (2012b, May). Chinese characters mapping table of Japanese, Traditional Chinese and Simplified Chinese. In Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC2012) (pp. 2149–2152), Istanbul, Turkey.
  9. Chu, C., Nakazawa, T., Kawahara, D., & Kurohashi, S. (2013, August). Chinese–Japanese parallel sentence extraction from quasi-comparable corpora. In Proceedings of the Sixth Workshop on Building and Using Comparable Corpora (pp. 34–42). Association for Computational Linguistics, Sofia, Bulgaria.
  10. Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
  11. Fung, P., & Cheung, P. (2004). Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In Proceedings of Coling 2004 (pp. 1051–1057), Geneva, Switzerland, Aug 23–Aug 27. COLING.
  12. Fung, P., & McKeown, K. (1997). Finding terminology translations from non-parallel corpora. In Proceedings of the 5th Annual Workshop on Very Large Corpora (pp. 192–202), Hong Kong.
  13. Fung, P., & Yee, L. Y. (1998). An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of COLING-ACL 1998 (Vol. 1, pp. 414–420), Montreal.
  14. Goh, C. L., Asahara, M., & Matsumoto, Y. (2005). Building a Japanese-Chinese dictionary using kanji/hanzi conversion. In Proceedings of the International Joint Conference on Natural Language Processing (pp. 670–681).
  15. Jongejan, B., & Dalianis, H. (2009). Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (pp. 145–153).
  16. Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In D. Lin, & D. Wu (Eds.), Proceedings of EMNLP 2004 (pp. 388–395). Association for Computational Linguistics, Barcelona, Spain.
  17. Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit (pp. 79–86), Phuket, Thailand.
  18. Koehn, P., Hoang, H., Birch, A., et al. (2007, June). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions (pp. 177–180), Association for Computational Linguistics, Prague, Czech Republic.
  19. Kondrak, G., Marcu, D., & Knight, K. (2003). Cognates can improve statistical translation models. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 46–48).
  20. Kurohashi, S., Nakamura, T., Matsumoto, Y., & Nagao, M. (1994). Improvements of Japanese morphological analyzer JUMAN. In Proceedings of the International Workshop on Sharable Natural Language (pp. 22–28).
  21. Laws, F., Michelbacher, L., Dorow, B., Scheible, C., Heid, U., & Schütze, H. (2010). A linguistically grounded graph model for bilingual lexicon extraction. In Proceedings of Coling, Poster Volume (pp. 614–622).
  22. Munteanu, D. S., & Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504.
  23. Munteanu, D. S., & Marcu, D. (2006, July). Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (pp. 81–88). Association for Computational Linguistics, Sydney, Australia.
  24. Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (pp. 160–167). Association for Computational Linguistics, Sapporo, Japan.
  25. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). Association for Computational Linguistics, Philadelphia, PA.
  26. Rapp, R. (1995). Identifying word translations in non-parallel texts. In Proceedings of the 33rd Meeting of the Association for Computational Linguistics (pp. 320–322), Cambridge, MA.
  27. Rapp, R. (1996). Die Berechnung von Assoziationen. Hildesheim: Olms.
  28. Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (pp. 519–526), College Park, MD.
  29. Rapp, R., & Martin Vide, C. (2007). Statistical machine translation without parallel corpora. In G. Rehm, A. Witt, & L. Lemnitzer (Eds.), Datenstrukturen für linguistische Ressourcen und ihre Anwendungen/Data Structures for Linguistic Resources and Applications. Proceedings of the Biennial GLDV Conference 2007 (pp. 231–240). Gunter Narr Verlag, Tübingen.
  30. Rapp, R., & Zock, M. (2010). Automatic dictionary expansion using non-parallel corpora. In A. Fink, B. Lausen, W. Seidel, & A. Ultsch (Eds.) Advances in Data Analysis, Data Handling and Business Intelligence. Proceedings of the 32nd Annual Meeting of the GfKl, 2008. Springer, Heidelberg.
  31. Rapp, R., Sharoff, S., & Babych, B. (2012). Identifying word translations from comparable documents without a seed lexicon. In Proceedings of LREC 2012, Istanbul.
  32. Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the Workshop on Comparing Corpora (WCC00) (Vol. 9, pp. 1–6).
  33. Rumelhart, D. E., & McClelland, J. L. (1987). Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Vol. 1: Foundations. MIT Press.
  34. Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. International Conference on New Methods in Language Processing (pp. 44–49).
  35. Sharoff, S., Kopotev, M., Erjavec, T., Feldman, A., & Divjak, D. (2008). Designing and evaluating a Russian tagset. In Proceedings of the Sixth Language Resources and Evaluation Conference, LREC 2008 (pp. 279–285), Marrakech.
  36. Smith, J. R., Quirk, Ch., & Toutanova, K. (2010, June). Extracting parallel sentences from comparable corpora using document level alignment. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 403–411), Association for Computational Linguistics, Los Angeles, CA.
  37. Stefanescu, D., Ion, R., & Hunsicker, S. (2012, May). Hybrid parallel sentence mining from comparable corpora. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT2012) (pp. 117–128), Trento, Italy.
  38. Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., et al. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy.
  39. Tan, Ch. L., & Nagao, M. (1995). Automatic alignment of Japanese-Chinese bilingual texts. IEICE Transactions on Information and Systems, E78-D(1), 68–76.
  40. Tillmann, Ch. (2009, August). A beam-search extraction algorithm for comparable data. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers (pp. 225–228), Association for Computational Linguistics, Suntec, Singapore.
  41. Utiyama, M., & Isahara, H. (2003, July). Reliable measures for aligning Japanese-English news articles and sentences. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (pp. 72–79), Association for Computational Linguistics, Sapporo, Japan.
  42. Wu, D., & Fung, P. (2005). Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. In Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP-2005), Jeju, Korea.
  43. Zhao, B., & Vogel, S. (2002). Adaptive parallel sentences mining from web bilingual news collection. In Proceedings of the 2002 IEEE International Conference on Data Mining (pp. 745–748), IEEE Computer Society, Maebashi City, Japan.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Reinhard Rapp (1)
  • Vivian Xu (2)
  • Michael Zock (3)
  • Serge Sharoff (4)
  • Richard Forsyth (4)
  • Bogdan Babych (4, email author)
  • Chenhui Chu (5)
  • Toshiaki Nakazawa (5)
  • Sadao Kurohashi (5)

  1. University of Mainz, Mainz, Germany
  2. Beijing Foreign Studies University, Beijing, China
  3. CNRS, Marseille, France
  4. University of Leeds, Leeds, UK
  5. Graduate School of Informatics, Kyoto University, Kyoto, Japan