Information Retrieval, Volume 10, Issue 4–5, pp 415–444

Restricted inflectional form generation in management of morphological keyword variation



Word form normalization through lemmatization or stemming is a standard procedure in information retrieval, because morphological variation needs to be accounted for and several languages are morphologically non-trivial. Lemmatization is effective but often requires expensive resources. Stemming is also effective in most contexts, generally almost as good as lemmatization and typically much less expensive; it also has a query expansion effect. In both approaches, however, the idea is to turn many inflectional word forms into a single lemma or stem, both in the database index and in queries. This means extra effort in creating database indexes. In this paper we take the opposite approach: we leave the database index un-normalized and enrich the queries to cover the surface form variation of keywords. A potential penalty of the approach would be long queries and slow processing. However, we show that it suffices to cover a small number of the possible surface forms, even in morphologically complex languages, to arrive at a performance that is almost as good as that delivered by stemming or lemmatization. Moreover, we show that, at least for typical test collections, it suffices to cover nouns and adjectives in queries. Furthermore, we show that our findings are particularly good for short queries that resemble normal searches of web users. Our approach is called FCG (for Frequent Case (form) Generation). It can be implemented relatively easily for languages written in the Latin, Greek, or Cyrillic alphabets by examining their (typically very skewed) nominal form statistics in a small text sample and by creating surface form generators for the 3–9 most frequent forms. We demonstrate the potential of our FCG approach for several languages of varying morphological complexity (Swedish, German, Russian, and Finnish) in well-known test collections. Applications include, in particular, Web IR in languages poor in morphological resources.
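The FCG idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the tagged sample, the form labels, the `SUFFIXES` table, and all function names are hypothetical, and the suffix table only works for one regular Finnish noun type (such as "talo") that needs no stem alternation. The sketch counts how often each nominal form occurs in a small tagged sample, keeps the k most frequent forms, and generates those surface forms for each query keyword.

```python
from collections import Counter

# Hypothetical suffix table for one regular Finnish noun type (e.g. "talo").
# Real generators must also handle stem alternation such as consonant gradation.
SUFFIXES = {
    "NOM_SG": "",
    "GEN_SG": "n",
    "PART_SG": "a",
    "INE_SG": "ssa",
}

# Hypothetical tagged text sample: (surface form, form label) pairs.
SAMPLE = [
    ("talo", "NOM_SG"), ("auto", "NOM_SG"), ("pallo", "NOM_SG"), ("talo", "NOM_SG"),
    ("talon", "GEN_SG"), ("auton", "GEN_SG"), ("pallon", "GEN_SG"),
    ("taloa", "PART_SG"), ("autoa", "PART_SG"),
    ("talossa", "INE_SG"),
]


def most_frequent_forms(tagged_sample, k):
    """Return the labels of the k most frequent forms in the sample."""
    counts = Counter(label for _, label in tagged_sample)
    return [label for label, _ in counts.most_common(k)]


def coverage(tagged_sample, labels):
    """Share of sample tokens whose form is among the given labels."""
    counts = Counter(label for _, label in tagged_sample)
    return sum(counts[label] for label in labels) / sum(counts.values())


def expand_keyword(lemma, labels):
    """Generate the selected surface forms of one keyword."""
    return [lemma + SUFFIXES[label] for label in labels]


def expand_query(keywords, labels):
    """Map every query keyword to its generated surface forms."""
    return {keyword: expand_keyword(keyword, labels) for keyword in keywords}


top_forms = most_frequent_forms(SAMPLE, 3)
print(top_forms)                          # labels of the three most frequent forms
print(coverage(SAMPLE, top_forms))        # share of sample tokens they cover
print(expand_query(["talo", "auto"], top_forms))
```

In this toy sample, three of the four forms already cover most of the tokens, which mirrors the skewed distribution the abstract relies on. The expanded forms would then be matched against the un-normalized index in place of the single original keyword.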


Best-match IR; Inflected indexes; Frequent case form generation for keywords; Generative methods in management of keyword variation



Ph.D. Mihail Mihailov (Department of Translation Studies, University of Tampere) helped with details of Russian word formation. Ph.D. Grigori Sidorov (Center for Computing Research, Mexico) provided a Russian inflectional generator. Ph.D. Harald Lüngen (Justus-Liebig Universität, Giessen, FB 05, Applied and Computational Linguistics) gave helpful comments on German inflection. We are also grateful to the FIRE research group for helpful comments. FINTWOL (morphological description of Finnish): Copyright © Kimmo Koskenniemi and Lingsoft plc. 1983–1993. GERTWOL (Morphological Transducer Lexicon Description of German): Copyright © Kimmo Koskenniemi and Lingsoft plc. 1997. SWETWOL (Morphological Transducer Lexicon Description of Swedish): Copyright © 1998 Fred Karlsson and Lingsoft, Inc. The InQuery search engine was provided by the Center for Intelligent Information Retrieval at the University of Massachusetts, Amherst. The Lemur query system is available from the Lemur project web site; it is “a collaboration between the Computer Science Department at the University of Massachusetts and the School of Computer Science at Carnegie Mellon University”. The Snowball stemmers for Finnish, German, Russian and Swedish are available from the Snowball web site. This research was supported, in part, by the Academy of Finland Grant No. 204978.



Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Department of Information Studies, University of Tampere, Tampere, Finland
