Iterative Bilingual Lexicon Extraction from Comparable Corpora with Topical and Contextual Knowledge

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8404)

Abstract

In the literature, two main categories of methods have been proposed for bilingual lexicon extraction from comparable corpora: topic-model-based methods and context-based methods. In this paper, we present a bilingual lexicon extraction system based on a novel combination of these two approaches in an iterative process. Our system does not rely on any prior knowledge, and its performance can be improved iteratively. To the best of our knowledge, this is the first study that iteratively exploits both topical and contextual knowledge for bilingual lexicon extraction. Experiments conducted on Chinese–English and Japanese–English Wikipedia data show that our proposed method performs significantly better than a state-of-the-art method that uses only topical knowledge.
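
The abstract does not spell out how the two knowledge sources are combined, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that topical similarity scores are already available (here a hand-written toy table), that contextual similarity is the cosine between co-occurrence vectors projected through the current lexicon, and that the two scores are mixed by linear interpolation. All function names, the `weight` parameter, and the toy words and numbers are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_match(scores):
    """Pick the highest-scoring target word for every source word."""
    return {src: max(row, key=row.get) for src, row in scores.items()}


def project(context, lexicon):
    """Map a source-language context vector into the target language."""
    projected = {}
    for word, count in context.items():
        if word in lexicon:
            target = lexicon[word]
            projected[target] = projected.get(target, 0.0) + count
    return projected


def extract_lexicon(topic_sim, src_ctx, tgt_ctx, weight=0.5, iterations=3):
    """Alternate between scoring candidates and rebuilding the lexicon."""
    # Assumption: the first lexicon comes from topical similarity alone.
    lexicon = best_match(topic_sim)
    for _ in range(iterations):
        combined = {}
        for src, row in topic_sim.items():
            projected = project(src_ctx[src], lexicon)
            # Assumption: linear interpolation of the two similarity scores.
            combined[src] = {
                tgt: weight * topical
                + (1 - weight) * cosine(projected, tgt_ctx[tgt])
                for tgt, topical in row.items()
            }
        lexicon = best_match(combined)
    return lexicon


# Toy data (hypothetical): topical similarity wrongly prefers miruku -> cat.
topic_sim = {
    "neko": {"cat": 0.9, "dog": 0.1, "milk": 0.1},
    "inu": {"cat": 0.1, "dog": 0.9, "milk": 0.1},
    "miruku": {"cat": 0.5, "dog": 0.1, "milk": 0.45},
}
# Co-occurrence counts within each language.
src_ctx = {
    "neko": {"inu": 2, "miruku": 3},
    "inu": {"neko": 2, "miruku": 1},
    "miruku": {"neko": 3, "inu": 1},
}
tgt_ctx = {
    "cat": {"dog": 2, "milk": 3},
    "dog": {"cat": 2, "milk": 1},
    "milk": {"cat": 3, "dog": 1},
}

lexicon = extract_lexicon(topic_sim, src_ctx, tgt_ctx)
print(lexicon)
```

On this toy data the topic-only lexicon maps `miruku` to `cat`, while the combined score moves it to `milk` after the first pass, and later passes leave the lexicon unchanged. The sketch keeps every lexicon entry at each pass; a real system would more likely filter low-confidence pairs before reusing them.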


Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chu, C., Nakazawa, T., Kurohashi, S. (2014). Iterative Bilingual Lexicon Extraction from Comparable Corpora with Topical and Contextual Knowledge. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_25

  • DOI: https://doi.org/10.1007/978-3-642-54903-8_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54902-1

  • Online ISBN: 978-3-642-54903-8

  • eBook Packages: Computer Science, Computer Science (R0)
