Iterative Bilingual Lexicon Extraction from Comparable Corpora with Topical and Contextual Knowledge

Chu, Chenhui; Nakazawa, Toshiaki; Kurohashi, Sadao

doi:10.1007/978-3-642-54903-8_25

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8404)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Two main categories of methods have been proposed in the literature for bilingual lexicon extraction from comparable corpora: topic model based methods and context based methods. In this paper, we present a bilingual lexicon extraction system based on a novel combination of these two methods in an iterative process. Our system does not rely on any prior knowledge, and its performance can be improved iteratively. To the best of our knowledge, this is the first study that iteratively exploits both topical and contextual knowledge for bilingual lexicon extraction. Experiments conducted on Chinese–English and Japanese–English Wikipedia data show that our proposed method performs significantly better than a state-of-the-art method that uses only topical knowledge.

Author information

Authors and Affiliations

Graduate School of Informatics, Kyoto University, Kyoto, Japan
Chenhui Chu, Toshiaki Nakazawa & Sadao Kurohashi

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Av. Juan Dios Bátiz, Col. Nueva Industrial Vallejo, 07738, Mexico D.F, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Chu, C., Nakazawa, T., Kurohashi, S. (2014). Iterative Bilingual Lexicon Extraction from Comparable Corpora with Topical and Contextual Knowledge. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_25

DOI: https://doi.org/10.1007/978-3-642-54903-8_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54902-1
Online ISBN: 978-3-642-54903-8
eBook Packages: Computer Science, Computer Science (R0)
