Non-adjacent Digrams Improve Matching of Cross-Lingual Spelling Variants

  • Heikki Keskustalo
  • Ari Pirkola
  • Kari Visala
  • Erkka Leppänen
  • Kalervo Järvelin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2857)


Untranslatable query keys pose a problem in dictionary-based cross-language information retrieval (CLIR). One solution consists of using approximate string matching methods for finding the spelling variants of the source key among the target database index. In such a setting, it is important to select a matching method suited especially for CLIR. This paper focuses on comparing the effectiveness of several matching methods in a cross-lingual setting. Search words from five domains were expressed in six languages (French, Spanish, Italian, German, Swedish, and Finnish). The target data consisted of the index of an English full-text database. In this setting, we first established the best method among six baseline matching methods for each language pair. Secondly, we tested novel matching methods based on binary digrams formed of both adjacent and non-adjacent characters of words. The latter methods consistently outperformed all baseline methods.
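The idea of matching with digrams built from both adjacent and non-adjacent characters can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which skip distances or similarity coefficient were used, so the `max_skip` parameter, the Jaccard-style overlap score, the function names, and the sample words are all assumptions made for illustration.

```python
def skip_digrams(word, max_skip=1):
    """Collect the set of character pairs whose members are separated
    by 0..max_skip intervening characters (0 = adjacent digrams).
    A set is used, so each digram is recorded as present/absent (binary)."""
    word = word.lower()
    pairs = set()
    for skip in range(max_skip + 1):
        step = skip + 1
        for i in range(len(word) - step):
            pairs.add(word[i] + word[i + step])
    return pairs


def similarity(a, b, max_skip=1):
    """Assumed scoring: shared digrams divided by all distinct digrams."""
    grams_a = skip_digrams(a, max_skip)
    grams_b = skip_digrams(b, max_skip)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def rank_candidates(source_key, index_terms, max_skip=1, top_n=5):
    """Rank target index terms by similarity to the untranslatable source key."""
    scored = [(similarity(source_key, term, max_skip), term) for term in index_terms]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_n]


if __name__ == "__main__":
    # Hypothetical source key and a toy stand-in for the English index.
    for score, term in rank_candidates("kalsium", ["calcium", "calculus", "table"]):
        print(f"{term}: {score:.3f}")
```

Setting `max_skip=0` reduces the sketch to conventional adjacent digrams, which makes it easy to compare the two variants on the same word list.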


Keywords: Target Word, Average Precision, Exact Match, Edit Distance, Baseline Method





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Heikki Keskustalo (1)
  • Ari Pirkola (1)
  • Kari Visala (1)
  • Erkka Leppänen (1)
  • Kalervo Järvelin (1)

  1. Department of Information Studies, University of Tampere, Finland
