Extracting Translation Knowledge from Parallel Corpora

Yamamoto, Kaoru; Matsumoto, Yuji

doi:10.1007/978-94-010-0181-6_13

Part of the book series: Text, Speech and Language Technology (TLTB, volume 21)

Abstract

This chapter describes two studies concerning automatic extraction of translation knowledge from parallel corpora. In the first study, we use statistically probable dependency relations to acquire word and phrasal correspondences. We obtained 90% precision using an English-Japanese parallel corpus of 9268 sentences in the business domain. The result showed that statistically probable dependency relations are effective in translation knowledge acquisition even for language pairs with different word ordering.

The second study compares three models of translation units, each of which uses different linguistic information: word segmentation, chunk boundaries, and word dependencies. The study investigates the relationship between the linguistic clues applied and the translation knowledge extracted. We found that chunk boundaries are useful linguistic clues for extracting compound NPs, which will be effective for extracting bilingual lexicons in a new domain. Word dependencies are also useful for longer translation pairs such as idiomatic expressions. We demonstrate that using statistical NLP tools, in particular statistical dependency parsers, offers robustness to the approach as well as valuable linguistic clues that work effectively in extracting translation knowledge from parallel corpora of languages from different families.
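
As a rough illustration of what extracting translation pairs from candidate translation units can look like, the sketch below scores co-occurring units across aligned sentence pairs. It is not the chapter's method: the Dice coefficient, the `dice_scores` function, the `min_count` threshold, and the toy corpus (hand-made chunks with romanized Japanese) are all assumptions chosen for illustration. The units here stand in for chunks; word-level or dependency-based units could be passed in the same way.

```python
from collections import Counter
from itertools import product


def dice_scores(pairs, min_count=2):
    """Score candidate translation pairs by the Dice coefficient.

    pairs: list of (source_units, target_units), one entry per aligned
    sentence pair. Each unit is any hashable value (here, a chunk string).
    Pairs that co-occur fewer than min_count times are discarded.
    """
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src_units, tgt_units in pairs:
        # Count each unit at most once per sentence pair.
        src_set, tgt_set = set(src_units), set(tgt_units)
        src_freq.update(src_set)
        tgt_freq.update(tgt_set)
        joint.update(product(src_set, tgt_set))

    scores = {}
    for (s, t), c in joint.items():
        if c >= min_count:
            # Dice = 2 * joint frequency / (source freq + target freq)
            scores[(s, t)] = 2 * c / (src_freq[s] + tgt_freq[t])
    return scores


# Toy, hand-made corpus (an assumption, not data from the chapter).
corpus = [
    (["annual report", "enclosed"], ["nenji houkokusho", "dofu"]),
    (["annual report", "delayed"], ["nenji houkokusho", "okure"]),
    (["invoice", "enclosed"], ["seikyusho", "dofu"]),
]

scores = dice_scores(corpus)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

On this toy input only the two pairs that co-occur twice survive the threshold. Precision, as reported in the abstract, would then be the fraction of extracted pairs that a human judges to be correct translations.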

Editor information

Editors and Affiliations

School of Computer Applications, Dublin City University, Dublin, Ireland
Andy Way
Institut der Gesellschaft zur Förderung der Angewandten Informationsforschung e. V. an der Universität des Saarlandes, Saarbrücken, Germany
Michael Carl

About this chapter

Cite this chapter

Yamamoto, K., Matsumoto, Y. (2003). Extracting Translation Knowledge from Parallel Corpora. In: Carl, M., Way, A. (eds) Recent Advances in Example-Based Machine Translation. Text, Speech and Language Technology, vol 21. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0181-6_13

DOI: https://doi.org/10.1007/978-94-010-0181-6_13
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-1401-7
Online ISBN: 978-94-010-0181-6
eBook Packages: Springer Book Archive