JHU/APL Experiments in Tokenization and Non-word Translation

  • Paul McNamee
  • James Mayfield
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3237)

Abstract

In the past we have conducted experiments that investigate the benefits and peculiarities attendant to alternative methods for tokenization, particularly overlapping character n-grams. This year we continued this line of work and report new findings reaffirming that the judicious use of n-grams can lead to performance surpassing that of word-based tokenization. In particular we examined: the relative performance of n-grams and a popular suffix stemmer; a novel form of n-gram indexing that approximates stemming and achieves fast run-time performance; various lengths of n-grams; and the use of n-grams for robust translation of queries using an aligned parallel text. For the CLEF 2003 evaluation we submitted monolingual and bilingual runs for all languages and language pairs and multilingual runs using English as a source language. Our key findings are that shorter n-grams (n=4 and n=5) outperform a popular stemmer in non-Romance languages, that direct translation of n-grams is feasible using an aligned corpus, that translated 5-grams yield superior performance to words, stems, or 4-grams, and that a combination of indexing methods is best of all.
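As an illustration of the overlapping character n-gram tokenization the abstract refers to, here is a minimal Python sketch. It is not the authors' implementation. The function name `char_ngrams` and the normalization choices (lowercasing, collapsing whitespace, letting n-grams span word boundaries) are assumptions made for this example.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of `text`, in order."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Assumption: lowercase the text and collapse runs of whitespace to a
    # single space, so that n-grams may span word boundaries.
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Text shorter than n: keep it as a single token (or nothing if empty).
        return [normalized] if normalized else []
    # Slide a window of width n one character at a time.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("juggling", 5))
    # ['juggl', 'uggli', 'gglin', 'gling']
```

Because adjacent n-grams share n-1 characters, morphological variants of a word (for example "juggling" and "juggler") have several tokens in common. This overlap is what allows n-grams to stand in for a stemmer.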

Keywords

Information Retrieval, Target Language, Relevance Feedback, Query Expansion, Source Language
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Paul McNamee (1)
  • James Mayfield (1)
  1. Applied Physics Laboratory, The Johns Hopkins University, Laurel, USA
