A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora

  • Conference paper
Machine Translation and the Information Soup (AMTA 1998)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1529)

Abstract

We address two problems in the statistical extraction of bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contributions in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora, based on the arrival distances of words in those corpora. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is the extraction of bilingual lexicons from non-parallel corpora. We present a first result in this area, obtained with a new method, Convec. Convec is based on the context information of the word to be translated. We show 30% to 76% precision when the top-one to top-20 translation candidates are considered. Most of the top-20 candidates are either collocations or words related to the correct translation. Since non-parallel corpora contain many more polysemous words, many-to-many translations, and different lexical items in the two languages, we conclude that the output from Convec is reasonable and useful.
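
The two signals named in the abstract, arrival distances and word context, can be illustrated with a minimal Python sketch. It is not the DKvec or Convec implementation: the function names, the fixed context window, and the cosine comparison are assumptions made for illustration, and the cross-language step (matching vectors between two languages) is left out because the abstract does not specify it.

```python
from collections import Counter
from math import sqrt


def arrival_distances(tokens, word):
    """Gaps between successive positions of `word` in `tokens` (illustrative)."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def context_vector(tokens, word, window=2):
    """Count the words within `window` tokens of each occurrence of `word`.

    The window size is a hypothetical parameter, not taken from the paper.
    """
    counts = Counter()
    for i, t in enumerate(tokens):
        if t == word:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            counts.update(tokens[j] for j in range(lo, hi) if j != i)
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

For a token list such as `"a b a c c a".split()`, `arrival_distances(tokens, "a")` gives the gaps between the three occurrences of `a`. Comparing such gap sequences, or context vectors, across two languages would need further machinery that this sketch does not attempt.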

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fung, P. (1998). A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora. In: Farwell, D., Gerber, L., Hovy, E. (eds) Machine Translation and the Information Soup. AMTA 1998. Lecture Notes in Computer Science, vol 1529. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49478-2_1

  • DOI: https://doi.org/10.1007/3-540-49478-2_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-65259-5

  • Online ISBN: 978-3-540-49478-2

  • eBook Packages: Springer Book Archive
