Abstract
In experiments comparing several methods for cross-language information retrieval using a bilingual training corpus, including methods based on machine translation and on “traditional” information-retrieval techniques, a fairly simple statistical technique for automatically extracting a bilingual dictionary from parallel text performed best. Surprisingly, an enhancement to the dictionary extraction method that significantly increases the accuracy of the dictionary proved slightly detrimental to overall retrieval performance, even though it is highly beneficial for other applications. This chapter describes the extraction method and its enhancement in detail, and compares the performance of a retrieval system using the automatically generated dictionaries with that of other retrieval methods.
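The abstract does not spell out the extraction technique, so the following is only a rough illustration of what a simple statistical dictionary extraction from parallel text can look like. It assumes sentence-aligned text, whitespace tokenization, and a Dice-coefficient threshold on co-occurrence counts; the function name `extract_dictionary`, the threshold value, and the toy corpus are all hypothetical and are not taken from the chapter.

```python
from collections import Counter
from itertools import product


def extract_dictionary(sentence_pairs, threshold=0.5):
    """Map each source word to candidate translations scored by Dice coefficient.

    Assumption: sentence_pairs is an iterable of (source, target) strings that
    are already sentence-aligned. This is an illustrative co-occurrence method,
    not the method described in the chapter.
    """
    src_count = Counter()   # sentence pairs containing each source word
    tgt_count = Counter()   # sentence pairs containing each target word
    pair_count = Counter()  # sentence pairs containing both words

    for src, tgt in sentence_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))

    dictionary = {}
    for (s, t), both in pair_count.items():
        # Dice coefficient: 2 * joint count / (sum of individual counts)
        dice = 2 * both / (src_count[s] + tgt_count[t])
        if dice >= threshold:
            dictionary.setdefault(s, []).append((t, dice))

    # Best-scoring candidates first
    for s in dictionary:
        dictionary[s].sort(key=lambda item: -item[1])
    return dictionary


# Toy English-French corpus (hypothetical data)
corpus = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
    ("a car", "une voiture"),
]
result = extract_dictionary(corpus, threshold=0.9)
```

A higher threshold keeps fewer, more reliable candidate translations per word, while a lower one admits more and noisier candidates. The abstract's finding that a more accurate dictionary slightly hurt retrieval suggests that this kind of trade-off is worth measuring rather than assuming.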
Copyright information
© 2000 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Brown, R.D., Carbonell, J.G., Yang, Y. (2000). Automatic dictionary extraction for cross-language information retrieval. In: Véronis, J. (ed.) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_14
DOI: https://doi.org/10.1007/978-94-017-2535-4_14
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5555-2
Online ISBN: 978-94-017-2535-4
eBook Packages: Springer Book Archive