Abstract
The Tampere University CLEF research group participated in CLEF 2001 with four automated bilingual runs. Our cross-lingual software, UTACLIR, uses an automated method of query construction for cross-language information retrieval (CLIR). The method extracts topical information from request sentences written in one of the source languages and builds a target-language query from the translations given by a translation dictionary. The new features for the CLIR process from Finnish, Swedish and German to English are improved translation and matching of compound words, and a new n-gram-based technique for translating and matching proper names and other non-translatable words. Non-translatable words can also occur as components of compounds. The n-gram-based method is clearly effective at matching inflected proper names and spelling variants. However, applying it to all non-identified and non-translatable words adds noise to the query. For German to English we tested two types of dictionaries (two runs). The first included all translations from the standard dictionary. The second contained the same data, except that all direct translations of compounds were excluded. The test with two dictionaries for the German runs indicates that the new features for compound processing work well even with a limited dictionary.
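The n-gram matching idea can be illustrated with a minimal sketch. The abstract does not specify the n-gram length or the similarity measure used in UTACLIR, so the character digrams, the Dice coefficient, the function names and the toy index terms below are all assumptions made for illustration only.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a lowercased word."""
    w = word.lower()
    if len(w) < n:
        # Words shorter than n are kept whole so they can still match.
        return {w} if w else set()
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient over the n-gram sets of two words (assumed measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def best_matches(source_word, index_terms, k=3, n=2):
    """Rank target index terms by n-gram similarity to an untranslatable word."""
    scored = [(dice_similarity(source_word, term, n), term) for term in index_terms]
    # Drop terms sharing no n-gram; sort by score, then alphabetically for stability.
    scored = [(score, term) for score, term in scored if score > 0.0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


# Hypothetical example: a Finnish genitive form of a proper name
# matched against a toy English index vocabulary.
index_terms = ["clinton", "clinic", "london", "canton"]
print(best_matches("Clintonin", index_terms, k=2))
```

In this toy vocabulary the inflected form ranks the base name first, but an unrelated word that shares several digrams ("clinic") is also returned. This is consistent with the observation above that applying the technique to every unidentified word adds noise to the query.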
© 2002 Springer-Verlag Berlin Heidelberg
Cite this paper
Hedlund, T., Keskustalo, H., Pirkola, A., Airio, E., Järvelin, K. (2002). Utaclir @ CLEF 2001 — Effects of Compound Splitting and N-Gram Techniques. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds) Evaluation of Cross-Language Information Retrieval Systems. CLEF 2001. Lecture Notes in Computer Science, vol 2406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45691-0_10
DOI: https://doi.org/10.1007/3-540-45691-0_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44042-0
Online ISBN: 978-3-540-45691-9
eBook Packages: Springer Book Archive