A comprehensive bilingual word alignment system

Choueka, Yaacov; Conley, Ehud S.; Dagan, Ido

doi:10.1007/978-94-017-2535-4_4

Part of the book series: Text, Speech and Language Technology (TLTB, volume 13)

Abstract

This chapter describes a general, comprehensive and robust word-alignment system and its application to the Hebrew-English language pair. A major goal of the system architecture is to assume as little as possible about its input and about the relative nature of the two languages, while allowing the use of (minimal) specific monolingual pre-processing resources when required. The system thus receives as input a pair of raw parallel texts and requires only a tokeniser (and possibly a lemmatiser) for each language. After tokenisation (and lemmatisation if necessary), a rough initial alignment is obtained for the texts using a version of Fung and McKeown’s DK-vec algorithm (Fung and McKeown, 1997; Fung, this volume). The initial alignment is given as input to a version of the word_align algorithm (Dagan, Church and Gale, 1993), an extension of Model 2 in the IBM statistical translation model. Word_align produces a word-level alignment for the texts and a probabilistic bilingual dictionary. The chapter describes the details of the system architecture, the algorithms implemented (emphasising implementation details), the issues regarding their application to Hebrew and similar Semitic languages, and some experimental results.
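As a rough illustration of how a probabilistic bilingual dictionary can be estimated from parallel text, the sketch below runs EM in the style of IBM Model 1, which is simpler than the Model 2 extension the abstract attributes to word_align: it ignores word positions and the initial DK-vec alignment entirely. The function names (`train_lexical_probs`, `align`) and the pre-segmented, transliterated toy sentences are assumptions made for this example and are not taken from the chapter.

```python
from collections import defaultdict


def train_lexical_probs(pairs, iterations=10):
    """Estimate t(target | source) by IBM Model 1-style EM.

    `pairs` is a list of (source_tokens, target_tokens) tuples.
    Illustrative only: no position model, no NULL word, no smoothing.
    """
    # Any positive constant works as a start: the E-step normalises it away.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of links per source word
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalise so that probabilities sum to 1 per source word.
        t = defaultdict(float, {k: c / total[k[1]] for k, c in count.items()})
    return t


def align(src, tgt, t):
    """Link each target position to its most probable source position."""
    return [
        (j, max(range(len(src)), key=lambda i: t[(tw, src[i])]))
        for j, tw in enumerate(tgt)
    ]


if __name__ == "__main__":
    # Hypothetical toy corpus: transliterated, pre-segmented Hebrew tokens.
    toy = [
        (["ha", "bayit"], ["the", "house"]),
        (["ha", "sefer"], ["the", "book"]),
        (["sefer", "ehad"], ["a", "book"]),
    ]
    probs = train_lexical_probs(toy)
    print(align(["ha", "sefer"], ["the", "book"], probs))
```

The table returned by `train_lexical_probs` plays the role of a (very small) probabilistic dictionary, and `align` reads a word-level alignment off it. A Model 2-style system would additionally weight each candidate link by a position-dependent probability.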

References

Attar, R., Choueka, Y., Dershowitz, N. and Fraenkel, A. S. (1978). KEDMA–Linguistic tools in retrieval systems. Journal of the Association for Computing Machinery, 25, 52–66.
Baum, L. E. (1972). An inequality and an associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3, 1–8.
Brill, E. (1992). A simple rule-based part of speech tagger. Proceedings of the Third Conference on Applied Natural Language Processing (ANLP’92), Trento, 152–155.
Brown, P. F., Della Pietra, S., Della Pietra, V. J. and Mercer, R. L. (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2), 263–311.
Brown, P. F., Lai, J. C. and Mercer, R. L. (1991). Aligning sentences in parallel corpora. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, 169–176.
Choueka, Y. (1983). Linguistic and word-manipulation components in textual information systems. In Keren, C. and Perlmutter, L. (Eds.). The applications of mini- and micro-computers in information, documentation and libraries (pp. 405–417). Amsterdam: North-Holland.
Choueka, Y. (1990). RESPONSA: A full-text retrieval system with linguistic components for large corpora. In Zampolli, A. (Ed.). Computational Lexicology and Lexicography, a volume in honor of B. Quemada (pp. 51–92). Pisa: Giardini Editions.
Choueka, Y. (1997). Rav-Milim: the Complete Dictionary of Modern Hebrew in 6 Vols. Tel-Aviv: Steimatzki, Miskal and C.E.T.
Church, K. W. (1993). Char_align: a program for aligning parallel texts at the character level. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, 1–8.
Church, K. W., Dagan, I., Gale, W. A., Fung, P., Helfman, J. and Satish, B. (1993). Aligning parallel texts: Do methods developed for English-French generalize to Asian languages? Proceedings of the Pacific Asia Conference on Formal and Computational Linguistics, Taipei, 1–12.
Cormen, T. H., Leiserson, C. E. and Rivest, R. L. (1989). Dynamic programming. Introduction to algorithms. Cambridge, MA: The MIT Press.
Dagan, I. and Church, K. W. (1994). Termight: identifying and translating technical terminology. Proceedings of the 4th Conference on Applied Natural Language Processing (ANLP’94), University of Stuttgart, Germany, 34–40.
Dagan, I. and Church, K. W. (1997). Termight: Coordinating humans and machines in bilingual terminology acquisition. Machine Translation, 12(1/2), 89–107.
Dagan, I., Church, K. W. and Gale, W. (1993). Robust bilingual word alignment for machine-aided translation. Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus, Ohio, 1–8.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B), 1–38.
Fung, P. (1995). A pattern matching method for finding noun and proper noun translations from noisy parallel corpora. Proceedings of the 33rd Annual Conference of the Association for Computational Linguistics, Boston, MA, 236–243.
Fung, P. (this volume). A statistical view on bilingual lexicon extraction. From parallel corpora to non-parallel corpora. In Véronis, J. (Ed.). Parallel Text Processing. Dordrecht: Kluwer Academic Publishers.
Fung, P. and Church, K. W. (1994). K-vec: A new approach for aligning parallel texts. Proceedings of the 15th International Conference on Computational Linguistics (COLING 94), Kyoto, Japan, 1096–1102.
Fung, P. and McKeown, K. R. (1997). A technical word and term translation aid using noisy parallel corpora across language groups. Machine Translation, 12(1/2), 53–87.
Gale, W. A. and Church, K. W. (1991a). Identifying word correspondences in parallel text. Proceedings of the Fourth DARPA Workshop on Speech and Natural Language, Asilomar, 152–157.
Gale, W. A. and Church, K. W. (1991). A program for aligning sentences in bilingual corpora. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL), Berkeley, 177–184.
Isabelle, P. (1992). Bitextual Aids for Translators. Screening Words: User Interfaces for Text, Proceedings of the Eighth Annual Conference of the UW Centre for the New OED and Text Research (Waterloo, October 18–20, 1992), 76–89.
Kay, M. (1997). The proper place of men and machines in language translation. Machine Translation, 12(1/2), 3–23.
Kay, M. and Röscheisen, M. (1993). Text-translation alignment. Computational Linguistics, 19(1), 121–142.
Klavans, J. and Tzoukermann, E. (1990). The BICORD system: combining lexical information from bilingual corpora and machine-readable dictionaries. Proceedings of the 12th International Conference on Computational Linguistics (COLING’90), Helsinki, 174–179.
Kupiec, J. (1993). An algorithm for finding noun phrase correspondences in bilingual corpora. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, 17–22.
Melamed, I. D. (1997a). A Portable Algorithm for Mapping Bitext Correspondence. Proceedings of the 35th Conference of the Association for Computational Linguistics. Madrid, 305–312.
Melamed, I. D. (1997b). A word-to-word model of translational equivalence. Proceedings of the 35th Conference of the Association for Computational Linguistics (ACL’97), Madrid, 490–497.
Miller, G. A. (1990). WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235–312.
Picchi, E., Peters, C. and Marinai, E. (1992). A translator’s workstation. Proceedings of the 14th International Conference on Computational Linguistics (COLING’92), Nantes, France, 972–976.
Shemtov, H. (1993). Text alignment in a tool for translating revised documents. Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL’93), Utrecht, 449–453.
Simard, M., Foster, G. F. and Isabelle, P. (1992). Using cognates to align sentences in bilingual corpora. Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), Montréal, Canada, 67–81.
Smadja, F. A. (1992). How to compile a bilingual collocation lexicon automatically. Proceedings of the AAAI Workshop on Statistically-Based NLP Techniques, San Jose, CA, 65–71.
Ukkonen, E. (1983). On approximate string matching. Proceedings of the International Foundations of Computation Theory Conference, Borgholm, Sweden (August 1983). Lecture Notes in Computer Science 158, Berlin: Springer-Verlag, 487–495.
Wu, D. (1994). Aligning a Parallel English-Chinese Corpus Statistically with Lexical Criteria. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. Las Cruces, 80–87.
Wu, D. and Xia, X. (1995). Large-scale automatic extraction of an English-Chinese lexicon. Machine Translation, 9(3/4), 285–313.

Author information

Authors and Affiliations

Bar-Ilan University, Israel
Yaacov Choueka, Ehud S. Conley & Ido Dagan

Editor information

Editors and Affiliations

Université de Provence and CNRS, 29, Avenue Robert Schuman, 13100, Aix-en-Provence, France
Jean Véronis

About this chapter

Cite this chapter

Choueka, Y., Conley, E. S., Dagan, I. (2000). A comprehensive bilingual word alignment system. In: Véronis, J. (ed.) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_4

DOI: https://doi.org/10.1007/978-94-017-2535-4_4
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5555-2
Online ISBN: 978-94-017-2535-4
eBook Packages: Springer Book Archive
