Robust Bilingual Word Alignment for Machine Aided Translation

  • I. Dagan
  • K. Church
  • W. Gale
Part of the Text, Speech and Language Technology book series (TLTB, volume 11)

Abstract

We have developed a new program called word_align for aligning parallel text, that is, text such as the Canadian Hansards that is available in two or more languages. The program takes the output of char_align (Church, 1993), a robust alternative to sentence-based alignment programs, and applies word-level constraints using a version of Brown et al.’s Model 2 (Brown et al., 1993), modified and extended to deal with robustness issues. Word_align was tested on a subset of the Canadian Hansards supplied by Simard (Simard et al., 1992). The combination of word_align plus char_align reduces the variance (average squared error) by a factor of 5 compared with char_align alone. More importantly, because word_align and char_align were designed to work robustly on texts that are smaller and noisier than the Hansards, it has been possible to deploy the programs successfully at AT&T Language Line Services, a commercial translation service, to help them with difficult terminology.
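To make the idea of refining a coarse alignment with word-level constraints more concrete, here is a minimal sketch of a Model 2 style EM loop. It is an illustration under stated assumptions, not the word_align program itself: the names `train`, `align`, `init`, and `window` are hypothetical, `init` stands in for the positions a coarse aligner such as char_align might supply, and the toy sentences are invented for the example.

```python
from collections import defaultdict

def train(pairs, window=1, iterations=5):
    """pairs: list of (src_tokens, tgt_tokens, init), where init[j] is the
    source position a coarse aligner guesses for target position j
    (assumed input format, standing in for char_align output)."""
    offsets = range(-window, window + 1)
    t = defaultdict(lambda: 1.0)                  # t[(f, e)]: translation weights, uniform start
    o = {k: 1.0 / len(offsets) for k in offsets}  # o[k]: probability of offset k from the guess
    for _ in range(iterations):
        t_count = defaultdict(float)
        e_total = defaultdict(float)
        o_count = {k: 0.0 for k in offsets}
        # E-step: spread each target word over source words near its guess.
        for src, tgt, init in pairs:
            for j, f in enumerate(tgt):
                cands = [(i, i - init[j]) for i in range(len(src))
                         if abs(i - init[j]) <= window]
                scores = [t[(f, src[i])] * o[k] for i, k in cands]
                z = sum(scores)
                if z == 0.0:
                    continue
                for (i, k), s in zip(cands, scores):
                    p = s / z
                    t_count[(f, src[i])] += p
                    e_total[src[i]] += p
                    o_count[k] += p
        # M-step: renormalize the expected counts.
        t = defaultdict(float, {(f, e): c / e_total[e]
                                for (f, e), c in t_count.items()})
        total = sum(o_count.values())
        if total > 0.0:
            o = {k: c / total for k, c in o_count.items()}
    return t, o

def align(src, tgt, init, t, o, window=1):
    """Pick, for each target word, the best source position inside the window."""
    result = []
    for j, f in enumerate(tgt):
        cands = [i for i in range(len(src)) if abs(i - init[j]) <= window]
        result.append(max(cands, key=lambda i: t[(f, src[i])] * o[i - init[j]]))
    return result

# Invented toy corpus; init holds the assumed coarse position guesses.
toy = [
    (["the", "house"], ["la", "maison"], [0, 1]),
    (["the", "book"], ["le", "livre"], [0, 1]),
    (["a", "house"], ["une", "maison"], [0, 1]),
]
t, o = train(toy)
links = align(["the", "house"], ["la", "maison"], [0, 1], t, o)
```

Restricting candidates to a window around the coarse guess is the design choice that matters here: it keeps the search small and lets the word-level model correct local errors without having to trust sentence boundaries.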

References

  1. Baum, L. E. 1972. An inequality and an associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3: 1–8.
  2. Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Mercer, R. L. and Roossin, P. S. 1990. A statistical approach to language translation. Computational Linguistics, 16 (2): 79–85.
  3. Brown, P., Lai, J. and Mercer, R. 1991a. Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the ACL, pp. 169–176.
  4. Brown, P., Della Pietra, S., Della Pietra, V. and Mercer, R. 1991b. Word sense disambiguation using statistical methods. In Proceedings of the 29th Annual Meeting of the ACL, pp. 264–270.
  5. Brown, P., Della Pietra, S., Della Pietra, V. and Mercer, R. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19 (2): 263–311.
  6. Church, K. W. 1993. Char_align: A program for aligning parallel texts at the character level. In Proceedings of the 31st Annual Meeting of the ACL, pp. 1–8.
  7. Dempster, A. P., Laird, N. M. and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39 (B): 1–38.
  8. Gale, W. and Church, K. 1991a. Identifying word correspondence in parallel text. In Proceedings of the DARPA Workshop on Speech and Natural Language.
  9. Gale, W. and Church, K. 1991b. A program for aligning sentences in bilingual corpora. In Proceedings of the 29th Annual Meeting of the ACL, pp. 177–184.
  10. Gale, W., Church, K. and Yarowsky, D. 1992. Using bilingual materials to develop word sense disambiguation methods. In Proceedings of the International Conference on Theoretical and Methodological Issues in Machine Translation, pp. 101–112.
  11. Isabelle, P. 1992. Bi-textual aids for translators. In Proceedings of the Annual Conference of the UW Center for the New OED and Text Research.
  12. Kay, M. and Roscheisen, M. 1993. Text-translation alignment. Computational Linguistics, 19 (1): 121–142.
  13. Klavans, J. and Tzoukermann, E. 1990. The BICORD system. In Proceedings of COLING 1990, Helsinki, Finland, pp. 174–178.
  14. Kupiec, J. 1993. An algorithm for finding noun phrase correspondences in bilingual corpora. In Proceedings of the 31st Annual Meeting of the ACL, pp. 17–22.
  15. Landauer, T. K. and Littman, M. L. 1990. Fully automatic cross-language document retrieval using latent semantic indexing. In Proceedings of the Annual Conference of the UW Center for the New OED and Text Research, pp. 31–38.
  16. Matsumoto, Y., Ishimoto, H., Utsuro, T. and Nagao, M. 1993. Structural matching of parallel texts. In Proceedings of the 31st Annual Meeting of the ACL, pp. 23–30.
  17. Ogden, W. and Gonzales, M. 1993. Norm: a system for translators. Demonstration at ARPA Workshop on Human Language Technology.
  18. Sadler, V. 1989. Working with analogical semantics: Disambiguation techniques in DLT. Foris Publications.
  19. Simard, M., Foster, G. and Isabelle, P. 1992. Using cognates to align sentences in bilingual corpora. In Proceedings of the International Conference on Theoretical and Methodological Issues in Machine Translation, pp. 67–82.
  20. Smadja, F. 1992. How to compile a bilingual collocational lexicon automatically. In AAAI Workshop on Statistically-based Natural Language Processing Techniques, July.
  21. Warwick, S., Hajic, J. and Russell, G. 1990. Searching on tagged corpora: linguistically motivated concordance analysis. In Proceedings of the Annual Conference of the UW Center for the New OED and Text Research, pp. 10–18.

Copyright information

© Springer Science+Business Media Dordrecht 1999
