Abstract
With the explosive growth of the World Wide Web, large-scale comparable corpora have become more abundant and accessible than parallel corpora. In Cross-Language Information Retrieval, limited translation resources and the ambiguity that arises when query terms fail to translate are largely responsible for large drops in effectiveness below monolingual performance. Strategies for bilingual terminology extraction from comparable texts therefore deserve more attention, both to enrich existing bilingual lexicons and thesauri and to enhance Cross-Language Information Retrieval. In this paper, we focus on enhancing Cross-Language Information Retrieval with a two-stage corpus-based translation model that includes bi-directional extraction of bilingual terminology from comparable corpora and selection of the best translation alternatives on the basis of morphological knowledge. We evaluate the impact of comparable corpora on the performance of the Cross-Language Information Retrieval process; the results indicate that the effect is clearly positive, especially when comparable corpora are linearly combined with bilingual dictionaries and for the Japanese-English language pair.
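The abstract does not give the scoring details, so the following is only a minimal sketch of how bi-directional filtering and a linear combination with a dictionary might fit together. All function names, the toy scores, the product used to merge forward and backward scores, and the interpolation weight are assumptions made for illustration, not the paper's actual model; the morphological selection step is not modelled here.

```python
from typing import Dict, List

Scores = Dict[str, float]


def normalize(scores: Scores) -> Scores:
    """Scale scores so they sum to 1 (all zeros if the total is not positive)."""
    total = sum(scores.values())
    if total <= 0:
        return {term: 0.0 for term in scores}
    return {term: s / total for term, s in scores.items()}


def bidirectional_filter(src_to_tgt: Dict[str, Scores],
                         tgt_to_src: Dict[str, Scores],
                         source_term: str) -> Scores:
    """Keep only target candidates that also propose the source term in the
    reverse direction. Merging the two scores by product is an assumption."""
    forward = src_to_tgt.get(source_term, {})
    kept: Scores = {}
    for candidate, fwd_score in forward.items():
        back_score = tgt_to_src.get(candidate, {}).get(source_term)
        if back_score is not None:
            kept[candidate] = fwd_score * back_score
    return kept


def linear_combination(corpus_scores: Scores,
                       dict_scores: Scores,
                       weight: float = 0.5) -> Scores:
    """Interpolate normalized corpus-based and dictionary-based scores.
    `weight` is a hypothetical tuning parameter for the corpus side."""
    corpus = normalize(corpus_scores)
    dictionary = normalize(dict_scores)
    terms = set(corpus) | set(dictionary)
    return {t: weight * corpus.get(t, 0.0) + (1.0 - weight) * dictionary.get(t, 0.0)
            for t in terms}


def best_translations(scores: Scores, k: int = 2) -> List[str]:
    """Return the k highest-scoring candidates, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data (invented for illustration): similarity scores in both directions
# for a romanized Japanese query term, plus unweighted dictionary entries.
src_to_tgt = {"keizai": {"economy": 0.6, "finance": 0.3, "market": 0.1}}
tgt_to_src = {
    "economy": {"keizai": 0.7},
    "finance": {"keizai": 0.2},
    "market": {"shijou": 0.9},
}
dictionary_entries = {"economy": 1.0, "economics": 1.0}

corpus_candidates = bidirectional_filter(src_to_tgt, tgt_to_src, "keizai")
combined = linear_combination(corpus_candidates, dictionary_entries, weight=0.5)
selected = best_translations(combined, k=2)
```

In this toy setup, "market" is dropped because the reverse direction never proposes the source term, and the dictionary can contribute a candidate ("economics") that the corpus alone did not produce.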
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sadat, F. (2010). Using Comparable Corpora to Improve the Effectiveness of Cross-Language Information Retrieval. In: Loftsson, H., Rögnvaldsson, E., Helgadóttir, S. (eds) Advances in Natural Language Processing. NLP 2010. Lecture Notes in Computer Science(), vol 6233. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14770-8_36
DOI: https://doi.org/10.1007/978-3-642-14770-8_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14769-2
Online ISBN: 978-3-642-14770-8
eBook Packages: Computer Science, Computer Science (R0)