Abstract
This chapter describes two studies on the automatic extraction of translation knowledge from parallel corpora. In the first study, we use statistically probable dependency relations to acquire word and phrasal correspondences. We obtained 90% precision on an English-Japanese parallel corpus of 9268 sentences in the business domain. This result shows that statistically probable dependency relations are effective for translation knowledge acquisition even for language pairs with different word order.
The second study compares three models of translation units, each drawing on different linguistic information: word segmentation, chunk boundaries, and word dependencies. It investigates the relationship between the linguistic clues applied and the translation knowledge extracted. We found that chunk boundaries are useful clues for extracting compound NPs, which in turn help in extracting bilingual lexicons for a new domain. Word dependencies are also useful for longer translation pairs such as idiomatic expressions. We demonstrate that statistical NLP tools, in particular statistical dependency parsers, make the approach robust and supply valuable linguistic clues that work effectively in extracting translation knowledge from parallel corpora of languages from different families.
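To make the contrast between the three kinds of translation unit concrete, the following is a minimal sketch of how candidate units might be enumerated on one side of a sentence pair. It is an illustration only: the abstract does not give the chapter's actual unit definitions, so the function names, the `max_len` limit, and the toy chunking and head indices below are all hypothetical.

```python
from typing import List, Set, Tuple

Unit = Tuple[str, ...]

def word_ngrams(tokens: List[str], max_len: int = 3) -> Set[Unit]:
    """Word segmentation only: every contiguous n-gram up to max_len."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    }

def chunk_bounded_ngrams(chunks: List[List[str]], max_len: int = 3) -> Set[Unit]:
    """Chunk boundaries: n-grams that never cross a chunk boundary."""
    units: Set[Unit] = set()
    for chunk in chunks:
        units |= word_ngrams(chunk, max_len)
    return units

def dependency_units(tokens: List[str], heads: List[int], max_len: int = 3) -> Set[Unit]:
    """Word dependencies: words linked by a chain of head links.

    heads[i] is the index of the head of token i, or -1 for the root.
    Units are emitted in surface order and may be non-contiguous.
    """
    units: Set[Unit] = set()
    for start in range(len(tokens)):
        path = [start]
        units.add((tokens[start],))
        while heads[path[-1]] != -1 and len(path) < max_len:
            path.append(heads[path[-1]])
            units.add(tuple(tokens[j] for j in sorted(path)))
    return units

# Toy, hand-made analysis (hypothetical; not output of any real parser).
tokens = ["we", "look", "forward", "to", "your", "reply"]
chunks = [["we"], ["look", "forward"], ["to", "your", "reply"]]
heads = [1, -1, 1, 1, 5, 3]

print(sorted(chunk_bounded_ngrams(chunks)))
print(sorted(dependency_units(tokens, heads)))
```

In this toy case the chunk-bounded model discards n-grams such as "forward to" that straddle a chunk boundary, while the dependency model can propose non-contiguous units such as "look ... to ... reply", which is the kind of longer, idiom-like candidate the abstract attributes to word dependencies.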
Copyright information
© 2003 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Yamamoto, K., Matsumoto, Y. (2003). Extracting Translation Knowledge from Parallel Corpora. In: Carl, M., Way, A. (eds) Recent Advances in Example-Based Machine Translation. Text, Speech and Language Technology, vol 21. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0181-6_13
DOI: https://doi.org/10.1007/978-94-010-0181-6_13
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-1401-7
Online ISBN: 978-94-010-0181-6
eBook Packages: Springer Book Archive