Utaclir @ CLEF 2001 — Effects of Compound Splitting and N-Gram Techniques

  • Conference paper
Evaluation of Cross-Language Information Retrieval Systems (CLEF 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2406)

Abstract

The Tampere University CLEF research group participated in CLEF 2001 with four automated bilingual runs. Our cross-lingual software, UTACLIR, uses an automated method of query construction for cross-language information retrieval (CLIR). The method automatically extracts topical information from request sentences written in one of the source languages and creates a target-language query based on the translations given by a translation dictionary. The new features for the CLIR process from Finnish, Swedish and German to English focus on translating and matching compound words, and on a new n-gram based technique for translating and matching proper names and other non-translatable words. Non-translatable words can also occur as components of compounds. The n-gram based method is clearly effective in matching inflected proper names and spelling variants. However, applying it to all non-identified and non-translatable words adds noise to the query. For German-English we tested two types of dictionaries (two runs). The first included all translations from the standard dictionary. The second contained the same data, except that all direct translations of compounds were excluded. The test with two dictionaries for the German runs indicates that the new features for compound processing work well even with a limited dictionary.
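The n-gram matching idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the n-gram length, the padding scheme, or the similarity measure used in UTACLIR, so everything below is an assumption chosen for illustration: padded character bigrams, the Dice coefficient, and the hypothetical helper names `char_ngrams`, `dice_similarity` and `best_matches`. It is not the authors' implementation.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word.

    The word is lowercased and padded with '_' on both sides so that
    word-initial and word-final characters also form n-grams
    (the padding scheme is an assumption made for this sketch).
    """
    padded = "_" * (n - 1) + word.lower() + "_" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def best_matches(source_word, index_words, k=3, threshold=0.5):
    """Return up to k target index words most similar to source_word.

    Words scoring below the (illustrative) threshold are dropped.
    """
    scored = [(dice_similarity(source_word, w), w) for w in index_words]
    scored = [(s, w) for s, w in scored if s >= threshold]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:k]]


if __name__ == "__main__":
    # "Clintonin" is a Finnish genitive form of the proper name "Clinton".
    index = ["clinton", "clinic", "london", "climate"]
    print(best_matches("Clintonin", index))
```

With this toy index, the inflected form ranks "clinton" first, but "clinic" also clears the 0.5 threshold. That is a small instance of the trade-off the abstract describes: approximate matching recovers inflected names and spelling variants, yet applying it indiscriminately lets unrelated words into the query.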

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hedlund, T., Keskustalo, H., Pirkola, A., Airio, E., Järvelin, K. (2002). Utaclir @ CLEF 2001 — Effects of Compound Splitting and N-Gram Techniques. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds) Evaluation of Cross-Language Information Retrieval Systems. CLEF 2001. Lecture Notes in Computer Science, vol 2406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45691-0_10

  • DOI: https://doi.org/10.1007/3-540-45691-0_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44042-0

  • Online ISBN: 978-3-540-45691-9

  • eBook Packages: Springer Book Archive
