A statistical view on bilingual lexicon extraction

From parallel corpora to non-parallel corpora

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 13))

Abstract

We address two problems in the statistical extraction of bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work on relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora, based on the arrival distances of words in those corpora. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is the extraction of bilingual lexicons from non-parallel corpora. We present a first such result in this area, obtained with a new method, Convec, which is based on the context information of the word to be translated. Although the accuracy of the top translation candidate is about 30% on 3 months of English and Chinese newspaper material, we show a dramatic increase in accuracy when we use a larger evaluation corpus in English and French. We find 75% precision for the top three candidate translations of 75 content words, on the English Wall Street Journal and French European News from different years.
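The two kinds of word feature named in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the function names, the window size, the toy sentence, and the choice of cosine similarity are all assumptions made for illustration. An arrival distance is taken here to be the gap, in tokens, between successive occurrences of a word, and context information is taken to be a count of neighbouring words.

```python
from collections import Counter
from math import sqrt


def arrival_distances(tokens, word):
    """Gaps (in tokens) between successive occurrences of `word`."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def context_vector(tokens, word, window=2):
    """Counts of tokens seen within `window` positions of each occurrence of `word`.

    The window size is a hypothetical parameter, not taken from the chapter.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Count every neighbour in the window except the word occurrence itself.
        counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy monolingual example; a real corpus would be far larger.
tokens = "the bank raised rates and the bank cut staff while the market fell".split()
print(arrival_distances(tokens, "the"))
print(context_vector(tokens, "bank", window=1))
```

Comparing such features across two languages would additionally need a bridge between the vocabularies, for example a seed dictionary for the context words; that step is not shown here and is likewise an assumption rather than a description of DKvec or Convec.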

Copyright information

© 2000 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Fung, P. (2000). A statistical view on bilingual lexicon extraction. In: Véronis, J. (ed.) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_11

  • DOI: https://doi.org/10.1007/978-94-017-2535-4_11

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-5555-2

  • Online ISBN: 978-94-017-2535-4

  • eBook Packages: Springer Book Archive
