Abstract
We address two problems in the statistical extraction of bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contributions in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora, based on the arrival distances of words in those corpora. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is the extraction of bilingual lexicons from non-parallel corpora. We present a first result in this area, obtained with a new method, Convec, which is based on the context information of the word to be translated. Although the accuracy of the top translation candidate is only about 30% for 3 months of English and Chinese newspaper material, we show a dramatic increase in accuracy when we use a larger evaluation corpus in English and French. We find 75% precision for the top three candidate translations of 75 content words, on the English Wall Street Journal and French European News from different years.
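To make two notions from the abstract concrete, the following is a minimal Python sketch, not the chapter's implementation. It assumes that the "arrival distances" of a word are the gaps between its successive positions in a tokenized text, and that top-n precision is the fraction of test words with at least one correct translation among their first n ranked candidates. The function names `arrival_distances` and `top_n_precision` and the toy data are hypothetical, introduced only for illustration.

```python
from typing import Dict, List, Set


def arrival_distances(tokens: List[str]) -> Dict[str, List[int]]:
    """Map each repeated word to the gaps between its successive positions.

    Assumption: this is one plausible reading of "arrival distances";
    words that occur only once have no gap and are omitted.
    """
    last_seen: Dict[str, int] = {}
    gaps: Dict[str, List[int]] = {}
    for position, word in enumerate(tokens):
        if word in last_seen:
            gaps.setdefault(word, []).append(position - last_seen[word])
        last_seen[word] = position
    return gaps


def top_n_precision(
    ranked: Dict[str, List[str]], gold: Dict[str, Set[str]], n: int
) -> float:
    """Fraction of source words whose first n candidates contain a gold translation."""
    if not ranked:
        return 0.0
    hits = sum(
        1
        for word, candidates in ranked.items()
        if set(candidates[:n]) & gold.get(word, set())
    )
    return hits / len(ranked)


if __name__ == "__main__":
    # Toy text: "a" occurs at positions 0, 2, 4 and "b" at positions 1, 5.
    print(arrival_distances("a b a c a b".split()))

    # Toy candidate lists (hypothetical data, not results from the chapter).
    ranked = {"x": ["p", "q", "r"], "y": ["s", "t", "u"]}
    gold = {"x": {"q"}, "y": {"z"}}
    print(top_n_precision(ranked, gold, n=3))
```

In this sketch a word pair would be compared through the similarity of its two gap sequences, one per language; that comparison step is not shown because the abstract does not specify it.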
© 2000 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Fung, P. (2000). A statistical view on bilingual lexicon extraction. In: Véronis, J. (eds) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_11
DOI: https://doi.org/10.1007/978-94-017-2535-4_11
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5555-2
Online ISBN: 978-94-017-2535-4
eBook Packages: Springer Book Archive