Abstract
This chapter describes a general, comprehensive and robust word-alignment system and its application to the Hebrew-English language pair. A major goal of the system architecture is to assume as little as possible about its input and about the relative nature of the two languages, while allowing the use of (minimal) language-specific monolingual pre-processing resources when required. The system thus receives as input a pair of raw parallel texts and requires only a tokeniser (and possibly a lemmatiser) for each language. After tokenisation (and lemmatisation if necessary), a rough initial alignment of the texts is obtained using a version of Fung and McKeown’s DK-vec algorithm (Fung and McKeown, 1997; Fung, this volume). The initial alignment is given as input to a version of the word_align algorithm (Dagan, Church and Gale, 1993), an extension of Model 2 in the IBM statistical translation model. Word_align produces a word-level alignment of the texts and a probabilistic bilingual dictionary. The chapter describes the details of the system architecture, the algorithms implemented (emphasising implementation details), the issues regarding their application to Hebrew and similar Semitic languages, and some experimental results.
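To make the idea of a probabilistic bilingual dictionary concrete, the following is a minimal sketch of EM estimation for IBM Model 1, a simpler relative of the Model 2 extension named above. It is not the word_align algorithm: it omits the positional component and does not use the initial DK-vec alignment. The function name `train_model1`, the toy transliterated tokens and the iteration count are all assumptions made for illustration.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate P(target_word | source_word) with IBM Model 1 EM.

    pairs: list of (source_tokens, target_tokens) aligned text segments.
    Returns a dict-like table keyed by (source_word, target_word).
    """
    target_vocab = {t for _, tgt in pairs for t in tgt}
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: uniform)  # uniform start for every word pair

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (source, target) links
        total = defaultdict(float)  # expected count of links per source word
        for src, tgt in pairs:
            for t in tgt:
                # E-step: share one unit of mass for t among the source words
                norm = sum(prob[(s, t)] for s in src)
                for s in src:
                    frac = prob[(s, t)] / norm
                    count[(s, t)] += frac
                    total[s] += frac
        # M-step: renormalise expected counts per source word
        prob = defaultdict(
            float, {(s, t): c / total[s] for (s, t), c in count.items()}
        )
    return prob


# Toy, invented data: transliterated source tokens paired with English tokens.
toy_pairs = [
    (["bayit", "gadol"], ["big", "house"]),
    (["bayit"], ["house"]),
    (["sefer", "gadol"], ["big", "book"]),
]
table = train_model1(toy_pairs)
```

On this toy input the table concentrates probability on the intuitively matching pairs, for example `("bayit", "house")`. A Model 2 style system would additionally model where in the target text each source position tends to align.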
References
Attar, R., Choueka, Y., Dershowitz, N. and Fraenkel, A. S. (1978). KEDMA–Linguistic tools in retrieval systems. Journal of the Association for Computing Machinery, 25, 52–66.
Baum, L. E. (1972). An inequality and an associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3, 1–8.
Brill, E. (1992). A simple rule-based part of speech tagger. Proceedings of the Third Conference on Applied Natural Language Processing (ANLP’92), Trento, 152–155.
Brown, P. F., Della Pietra, S., Della Pietra, V. J. and Mercer, R. L. (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2), 263–311.
Brown, P. F., Lai, J. C. and Mercer, R. L. (1991). Aligning sentences in parallel corpora. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, 169–176.
Choueka, Y. (1983). Linguistic and word-manipulation components in textual information systems. In Keren, C. and Perlmutter, L. (Eds.). The applications of mini- and micro-computers in information, documentation and libraries (pp. 405–417). Amsterdam: North-Holland.
Choueka, Y. (1990). RESPONSA: A full-text retrieval system with linguistic components for large corpora. In Zampolli, A. (Ed.). Computational Lexicology and Lexicography, a volume in honor of B. Quemada (pp. 51–92). Pisa: Giardini Editions.
Choueka, Y. (1997). Rav-Milim: the Complete Dictionary of Modern Hebrew in 6 Vols. Tel-Aviv: Steimatzki, Miskal and C.E.T.
Church, K. W. (1993). Char_align: a program for aligning parallel texts at the character level. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, 1–8.
Church, K. W., Dagan, I., Gale, W. A., Fung, P., Helfman, J. and Satish, B. (1993). Aligning parallel texts: Do methods developed for English-French generalize to Asian languages? Proceedings of the Pacific Asia Conference on Formal and Computational Linguistics, Taipei, 1–12.
Cormen, T. H., Leiserson, C. E. and Rivest, R. L. (1989). Dynamic programming. Introduction to algorithms. Cambridge, MA: The MIT Press.
Dagan, I. and Church, K. W. (1994). Termight: identifying and translating technical terminology. Proceedings of the 4th Conference on Applied Natural Language Processing (ANLP’94), University of Stuttgart, Germany, 34–40.
Dagan, I. and Church, K. W. (1997). Termight: Coordinating humans and machines in bilingual terminology acquisition. Machine Translation, 12(1/2), 89–107.
Dagan, I., Church, K. W. and Gale, W. (1993). Robust Bilingual Word Alignment for Machine-Aided Translation. Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus, Ohio, 1–8.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B), 1–38.
Fung, P. (1995). A pattern matching method for finding noun and proper noun translations from noisy parallel corpora. Proceedings of the 33rd Annual Conference of the Association for Computational Linguistics, Boston, MA, 236–243.
Fung, P. (this volume). A statistical view on bilingual lexicon extraction. From parallel corpora to non-parallel corpora. In Véronis, J. (Ed.). Parallel Text Processing. Dordrecht: Kluwer Academic Publishers.
Fung, P. and Church, K. W. (1994). Kvec: A new approach for aligning parallel texts. In Proceedings of the 15th International Conference on Computational Linguistics (COLING’94), Kyoto, Japan, 1096–1102.
Fung, P. and McKeown, K. R. (1997). A technical word and term translation aid using noisy parallel corpora across language groups. Machine Translation, 12(1/2), 53–87.
Gale, W. A. and Church, K. W. (1991a). Identifying word correspondences in parallel text. Proceedings of the Fourth Darpa Workshop on Speech and Natural Language, Asilomar, 152–157.
Gale, W. A. and Church, K. W. (1991). A program for aligning sentences in bilingual corpora. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL), Berkeley, 177–184.
Isabelle, P. (1992). Bitextual Aids for Translators. Screening Words: User Interfaces for Text, Proceedings of the Eighth Annual Conference of the UW Centre for the New OED and Text Research (Waterloo, October 18–20, 1992), 76–89.
Kay, M. (1997). The proper place of men and machines in language translation. Machine Translation, 12(1/2), 3–23.
Kay, M. and Röscheisen, M. (1993). Text-translation alignment. Computational Linguistics, 19(1), 121–142.
Klavans, J. and Tzoukermann, E. (1990). The BICORD system: combining lexical information from bilingual corpora and machine-readable dictionaries. Proceedings of the 12th International Conference on Computational Linguistics (COLING’90), Helsinki, 174–179.
Kupiec, J. (1993). An algorithm for finding noun phrase correspondences in bilingual corpora. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, 17–22.
Melamed, I. D. (1997a). A Portable Algorithm for Mapping Bitext Correspondence. Proceedings of the 35th Conference of the Association for Computational Linguistics. Madrid, 305–312.
Melamed, I. D. (1997b). A word-to-word model of translational equivalence. Proceedings of the 35th Conference of the Association for Computational Linguistics (ACL’97), Madrid, 490–497.
Miller, G. A. (1990). WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235–312.
Picchi, E., Peters, C. and Marinai, E. (1992). A translator’s workstation. Proceedings of the 14th International Conference on Computational Linguistics (COLING’92), Nantes, France, 972–976.
Shemtov, H. (1993). Text alignment in a tool for translating revised documents. Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL’93), Utrecht, 449–453.
Simard, M., Foster, G. F. and Isabelle, P. (1992). Using cognates to align sentences in bilingual corpora. Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), Montréal, Canada, 67–81.
Smadja, F. A. (1992). How to Compile a Bilingual Collocation Lexicon Automatically. Proceedings of the AAAI Workshop on Statistically-Based NLP Techniques, San Jose, CA, 65–71.
Ukkonen, E. (1983). On approximate string matching. Proceedings of the International Foundations of Computation Theory Conference, Borgholm, Sweden (August 1983). Lecture Notes in Computer Science 158, Berlin: Springer-Verlag, 487–495.
Wu, D. (1994). Aligning a Parallel English-Chinese Corpus Statistically with Lexical Criteria. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. Las Cruces, 80–87.
Wu, D. and Xia, X. (1995). Large-scale automatic extraction of an English-Chinese lexicon. Machine Translation, 9(3/4), 285–313.
Copyright information
© 2000 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Choueka, Y., Conley, E.S., Dagan, I. (2000). A comprehensive bilingual word alignment system. In: Véronis, J. (eds) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_4
DOI: https://doi.org/10.1007/978-94-017-2535-4_4
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5555-2
Online ISBN: 978-94-017-2535-4
eBook Packages: Springer Book Archive