A comprehensive bilingual word alignment system

Application to disparate languages: Hebrew and English

  • Chapter
Parallel Text Processing

Part of the book series: Text, Speech and Language Technology (TLTB, volume 13)

Abstract

This chapter describes a general, comprehensive and robust word-alignment system and its application to the Hebrew-English language pair. A major goal of the system architecture is to assume as little as possible about its input and about the relative nature of the two languages, while allowing the use of (minimal) specific monolingual pre-processing resources when required. The system thus receives as input a pair of raw parallel texts and requires only a tokeniser (and possibly a lemmatiser) for each language. After tokenisation (and lemmatisation if necessary), a rough initial alignment of the texts is obtained using a version of Fung and McKeown’s DK-vec algorithm (Fung and McKeown, 1997; Fung, this volume). This initial alignment is given as input to a version of the word_align algorithm (Dagan, Church and Gale, 1993), an extension of Model 2 in the IBM statistical translation model. Word_align produces a word-level alignment of the texts and a probabilistic bilingual dictionary. The chapter describes the details of the system architecture, the algorithms implemented (emphasising implementation details), the issues regarding their application to Hebrew and similar Semitic languages, and some experimental results.
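To make the notion of a "probabilistic bilingual dictionary" concrete, the following is a minimal sketch of EM estimation of word translation probabilities in the style of IBM Model 1. It is a deliberate simplification and not the chapter's method: Model 1 is simpler than Model 2, it is not word_align, and it trains on small aligned segments rather than on a rough DK-vec alignment of whole texts. The function names `train_model1` and `best_translation` and the transliterated toy corpus are assumptions made up for illustration.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t(target | source) by EM, IBM Model 1 style (illustrative only).

    pairs: list of (source_tokens, target_tokens) tuples.
    Returns a dict mapping (target_word, source_word) to a probability.
    """
    target_vocab = {w for _, tgt in pairs for w in tgt}
    # Uniform initialisation over the target vocabulary.
    t = defaultdict(lambda: 1.0 / len(target_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) link counts
        total = defaultdict(float)  # expected link counts per source word
        for src, tgt in pairs:
            for f in tgt:
                # E-step: split one unit of mass for f among the source words.
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise expected counts into probabilities.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


def best_translation(t, source_word):
    """Return the target word with the highest t(target | source_word)."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy corpus: English source, transliterated Hebrew-like target.
pairs = [
    ("the house".split(), "ha bayit".split()),
    ("the book".split(), "ha sefer".split()),
    ("a book".split(), "sefer".split()),
]
t = train_model1(pairs)
```

The resulting table `t` plays the role of a probabilistic dictionary: for each source word it holds a distribution over candidate target words, and `best_translation(t, "book")` picks the most probable one. A system like the one described would additionally model where in the text a translation is likely to occur, which this sketch omits.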

References

  • Attar, R., Choueka, Y., Dershowitz, N. and Fraenkel, A. S. (1978). KEDMA–Linguistic tools in retrieval systems. Journal of the Association for Computing Machinery, 25, 52–66.

  • Baum, L. E. (1972). An inequality and an associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3, 1–8.

  • Brill, E. (1992). A simple rule-based part of speech tagger. Proceedings of the Third Conference on Applied Natural Language Processing (ANLP’92), Trento, 152–155.

  • Brown, P. F., Della Pietra, S., Della Pietra, V. J. and Mercer, R. L. (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2), 263–311.

  • Brown, P. F., Lai, J. C. and Mercer, R. L. (1991). Aligning sentences in parallel corpora. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, 169–176.

  • Choueka, Y. (1983). Linguistic and word-manipulation components in textual information systems. In Keren, C. and Perlmutter, L. (Eds.). The applications of mini- and micro-computers in information, documentation and libraries (pp. 405–417). Amsterdam: North-Holland.

  • Choueka, Y. (1990). RESPONSA: A full-text retrieval system with linguistic components for large corpora. In Zampolli, A. (Ed.). Computational Lexicology and Lexicography, a volume in honor of B. Quemada (pp. 51–92). Pisa: Giardini Editions.

  • Choueka, Y. (1997). Rav-Milim: the Complete Dictionary of Modern Hebrew in 6 Vols. Tel-Aviv: Steimatzki, Miskal and C.E.T.

  • Church, K. W. (1993). Char_align: a program for aligning parallel texts at the character level. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, 1–8.

  • Church, K. W., Dagan, I., Gale, W. A., Fung, P., Helfman, J. and Satish, B. (1993). Aligning parallel texts: Do methods developed for English-French generalize to Asian languages? Proceedings of the Pacific Asia Conference on Formal and Computational Linguistics, Taipei, 112.

  • Cormen, T. H., Leiserson, C. E. and Rivest, R. L. (1989). Dynamic programming. Introduction to algorithms. Cambridge, MA: The MIT Press.

  • Dagan, I. and Church, K. W. (1994). Termight: identifying and translating technical terminology. Proceedings of the 4th Conference on Applied Natural Language Processing (ANLP’94), University of Stuttgart, Germany, 34–40.

  • Dagan, I. and Church, K. W. (1997). Termight: Coordinating humans and machines in bilingual terminology acquisition. Machine Translation, 12 (1/2), 89–107.

  • Dagan, I., Church, K. W. and Gale, W. (1993). Robust Bilingual Word Alignment for Machine-Aided Translation. Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus, Ohio, 1–8.

  • Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39 (B), 1–38.

  • Fung, P. (1995). A pattern matching method for finding noun and proper noun translations from noisy parallel corpora. Proceedings of the 33rd Annual Conference of the Association for Computational Linguistics, Boston, MA, 236–243.

  • Fung, P. (this volume). A statistical view on bilingual lexicon extraction. From parallel corpora to non-parallel corpora. In Véronis, J. (Ed.). Parallel Text Processing. Dordrecht: Kluwer Academic Publishers.

  • Fung, P. and Church, K. W. (1994). Kvec: A new approach for aligning parallel texts. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 94), Kyoto, Japan, 1096–1102.

  • Fung, P. and McKeown, K. R. (1997). A technical word and term translation aid using noisy parallel corpora across language groups. Machine Translation, 12 (1/2), 53–87.

  • Gale, W. A. and Church, K. W. (1991a). Identifying word correspondences in parallel text. Proceedings of the Fourth DARPA Workshop on Speech and Natural Language, Asilomar, 152–157.

  • Gale, W. A. and Church, K. W. (1991). A program for aligning sentences in bilingual corpora. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL), Berkeley, 177–184.

  • Isabelle, P. (1992). Bitextual Aids for Translators. Screening Words: User Interfaces for Text, Proceedings of the Eighth Annual Conference of the UW Centre for the New OED and Text Research (Waterloo, October 18–20, 1992), 76–89.

  • Kay, M. (1997). The proper place of men and machines in language translation. Machine Translation, 12(1/2), 3–23.

  • Kay, M. and Röscheisen, M. (1993). Text-translation alignment. Computational Linguistics, 19(1), 121–142.

  • Klavans, J. and Tzoukermann, E. (1990). The BICORD system: combining lexical information from bilingual corpora and machine-readable dictionaries. Proceedings of the 12th International Conference on Computational Linguistics (COLING’90), Helsinki, 174–179.

  • Kupiec, J. (1993). An algorithm for finding noun phrase correspondences in bilingual corpora. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, 17–22.

  • Melamed, I. D. (1997a). A Portable Algorithm for Mapping Bitext Correspondence. Proceedings of the 35th Conference of the Association for Computational Linguistics. Madrid, 305–312.

  • Melamed, I. D. (1997b). A word-to-word model of translational equivalence. Proceedings of the 35th Conference of the Association for Computational Linguistics (ACL’97), Madrid, 490–497.

  • Miller, G. A. (1990). WordNet: An on-line lexical database. International Journal of Lexicography, 3 (4), 235–312.

  • Picchi, E., Peters, C. and Marinai, E. (1992). A translator’s workstation. Proceedings of the 14th International Conference on Computational Linguistics (COLING’92), Nantes, France, 972–976.

  • Shemtov, H. (1993). Text alignment in a tool for translating revised documents. Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL’93), Utrecht, 449–453.

  • Simard, M., Foster, G. F. and Isabelle, P. (1992). Using cognates to align sentences in bilingual corpora. Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), Montréal, Canada, 67–81.

  • Smadja, F. A. (1992). How to Compile a Bilingual Collocation Lexicon Automatically. Proceedings of the AAAI Workshop on Statistically-Based NLP Techniques, San Jose, CA, 65–71.

  • Ukkonen, E. (1983). On approximate string matching. Proceedings of the International Foundations of Computation Theory Conference, Borgholm, Sweden (August 1983). Lecture Notes in Computer Science 158, Berlin: Springer-Verlag, 487–495.

  • Wu, D. (1994). Aligning a Parallel English-Chinese Corpus Statistically with Lexical Criteria. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. Las Cruces, 80–87.

  • Wu, D. and Xia, X. (1995). Large-scale automatic extraction of an English-Chinese lexicon. Machine Translation, 9 (3/4), 285–313.

Copyright information

© 2000 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Choueka, Y., Conley, E.S., Dagan, I. (2000). A comprehensive bilingual word alignment system. In: Véronis, J. (ed.) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_4

  • DOI: https://doi.org/10.1007/978-94-017-2535-4_4

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-5555-2

  • Online ISBN: 978-94-017-2535-4

  • eBook Packages: Springer Book Archive
