Extracting Translation Knowledge from Parallel Corpora

  • Chapter
Recent Advances in Example-Based Machine Translation

Part of the book series: Text, Speech and Language Technology (TLTB, volume 21)

Abstract

This chapter describes two studies on the automatic extraction of translation knowledge from parallel corpora. In the first study, we use statistically probable dependency relations to acquire word and phrasal correspondences. We obtained 90% precision on an English-Japanese parallel corpus of 9268 sentences in the business domain. The result shows that statistically probable dependency relations are effective for translation knowledge acquisition, even for language pairs with different word order.

The second study compares three models of translation units, each of which uses different linguistic information: word segmentation, chunk boundaries, and word dependencies. The study investigates the relationship between the linguistic clues applied and the translation knowledge extracted. We found that chunk boundaries are useful clues for extracting compound NPs, which will be effective for extracting bilingual lexicons in a new domain. Word dependencies, in turn, are useful for longer translation pairs such as idiomatic expressions. We demonstrate that statistical NLP tools, in particular statistical dependency parsers, make the approach robust and supply valuable linguistic clues that work effectively in extracting translation knowledge from parallel corpora of languages from different families.

Copyright information

© 2003 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Yamamoto, K., Matsumoto, Y. (2003). Extracting Translation Knowledge from Parallel Corpora. In: Carl, M., Way, A. (eds) Recent Advances in Example-Based Machine Translation. Text, Speech and Language Technology, vol 21. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0181-6_13

  • DOI: https://doi.org/10.1007/978-94-010-0181-6_13

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-1-4020-1401-7

  • Online ISBN: 978-94-010-0181-6

  • eBook Packages: Springer Book Archive
