A statistical view on bilingual lexicon extraction

Fung, Pascale

doi:10.1007/978-94-017-2535-4_11

From parallel corpora to non-parallel corpora

Pascale Fung

Chapter

Part of the book series: Text, Speech and Language Technology (TLTB, volume 13)

Abstract

We address two problems in the statistical extraction of bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contributions in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora, based on the arrival distances of words in those corpora. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is the extraction of bilingual lexicons from non-parallel corpora. We present a first result in this area, obtained with a new method, Convec, which is based on the context information of the word to be translated. Although the accuracy for the top translation candidate is only about 30% for 3 months of English and Chinese newspaper material, we show a dramatic increase in accuracy when we use a larger evaluation corpus in English and French. We find 75% precision for the top three candidate translations of 75 content words, on the English Wall Street Journal and French European News from different years.
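As a rough illustration of the "arrival distances" mentioned in the abstract, the sketch below assumes that an arrival distance is the gap, in token positions, between successive occurrences of the same word. The function name `arrival_distances` and the toy token sequence are hypothetical, and this is not the chapter's DKvec implementation, which the abstract does not specify in detail.

```python
from typing import Dict, List


def arrival_distances(tokens: List[str]) -> Dict[str, List[int]]:
    """Map each repeated word to the gaps between its successive occurrences.

    Assumption: an "arrival distance" is the difference between the token
    positions of two consecutive occurrences of the same word.
    """
    last_seen: Dict[str, int] = {}
    gaps: Dict[str, List[int]] = {}
    for position, word in enumerate(tokens):
        if word in last_seen:
            # Record how far this occurrence is from the previous one.
            gaps.setdefault(word, []).append(position - last_seen[word])
        last_seen[word] = position
    return gaps


# Toy example: "a" occurs at positions 0, 2, 4 and "b" at positions 1, 5.
tokens = ["a", "b", "a", "c", "a", "b"]
distances = arrival_distances(tokens)
print(distances)
```

Words that occur only once, such as "c" here, have no arrival distance and so do not appear in the result. Under this assumption, a word and its translation in a roughly parallel text would be expected to show similar gap sequences, which is the intuition behind comparing such sequences across languages.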

Author information

Authors and Affiliations

Hong Kong University of Science and Technology, Hong Kong
Pascale Fung

Editor information

Editors and Affiliations

Université de Provence and CNRS, 29, Avenue Robert Schuman, 13100, Aix-en-Provence, France
Jean Véronis

About this chapter

Cite this chapter

Fung, P. (2000). A statistical view on bilingual lexicon extraction. In: Véronis, J. (ed.) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_11

DOI: https://doi.org/10.1007/978-94-017-2535-4_11
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5555-2
Online ISBN: 978-94-017-2535-4
eBook Packages: Springer Book Archive
