Restricted inflectional form generation in management of morphological keyword variation

Kettunen, Kimmo; Airio, Eija; Järvelin, Kalervo

doi:10.1007/s10791-007-9030-z

Published: 10 August 2007

Information Retrieval, volume 10, pages 415–444 (2007)

Abstract

Word form normalization through lemmatization or stemming is a standard procedure in information retrieval because morphological variation needs to be accounted for and several languages are morphologically non-trivial. Lemmatization is effective but often requires expensive resources. Stemming is also effective in most contexts, generally almost as good as lemmatization and typically much less expensive; it also has a query expansion effect. However, in both approaches the idea is to turn many inflectional word forms into a single lemma or stem, both in the database index and in queries. This means extra effort in creating database indexes. In this paper we take the opposite approach: we leave the database index un-normalized and enrich the queries to cover the surface form variation of keywords. A potential penalty of the approach would be long queries and slow processing. However, we show that it suffices to cover a very small number of the possible surface forms, even in morphologically complex languages, to arrive at a performance that is almost as good as that delivered by stemming or lemmatization. Moreover, we show that, at least for typical test collections, it suffices to cover nouns and adjectives in queries. Furthermore, we show that our findings are particularly good for short queries that resemble normal searches of web users. Our approach is called FCG (for Frequent Case (form) Generation). It can be implemented relatively easily for languages written in the Latin, Greek or Cyrillic alphabets by examining their (typically very skewed) nominal form statistics in a small text sample and by creating surface form generators for the 3–9 most frequent forms. We demonstrate the potential of our FCG approach for several languages of varying morphological complexity (Swedish, German, Russian, and Finnish) in well-known test collections. Applications include, in particular, Web IR in languages poor in morphological resources.

Introduction

Various methods for handling the morphological variation of keywords in information retrieval (IR) have already been used for decades. Some of them are more complex than others, while some are amazingly simple but still produce quite good results in IR. So far it has been shown, among other things, that even a quite simple rule-based non-lexical stemmer can improve precision and recall of textual searches for languages that are morphologically more complex than English, or sometimes even very complex, such as Finnish and Slovene (cf. Popovič and Willett 1992; Hollink et al. 2004; Airio 2006). Use of stemming has been a de facto standard in information retrieval, but in language technology the use of full coverage lemmatization has been thought a necessity for languages that are morphologically complex, even in monolingual single term IR (Koskenniemi 1996). This belief has also been shared by some IR researchers (Galvez et al. 2005; Galvez and de Moya-Anegon 2006; Jacquemin and Tzoukerman 1999).

While simple conflation methods have been used in IR, not much attention has been given to heuristic, linguistically motivated aids that do not even aim to cover all the inflection of the keywords but are based, for example, on the statistically most frequent word forms of the language in question. In Kettunen and Airio (2006) we showed that case form frequency based keyword generation competes quite well against the gold standard, the FINTWOL lemmatizer, in best-match IR for Finnish, a highly inflectional and compound-rich language. A similar but converse approach, stemming based on the statistical distribution of Hungarian noun suffixes, is reported in Tordai and de Rijke (2005). Two other types of approaches can be seen as more remotely related to ours: Xu and Croft’s (1998) idea of using corpus-based word variant statistics in stemmer creation or modification, and the use of a probabilistic (and thus language independent) model for stemmer generation (Bacchin et al. 2004; Di Nunzio et al. 2004). Our method is called FCG (for Frequent Case (form) Generation).

In this paper we shall further examine our method in monolingual IR of morphologically complex languages by testing three more languages, German, Russian and Swedish, with the methods developed in Kettunen and Airio (2006). For Finnish we shall also show some new results with very short queries.

On a general level, our background motivations can be stated as follows:

The average precision and recall (P/R) of retrieval needs to be kept as high as possible without using excessively complex language technology tools; we believe that the need of large lexicon-based lemmatizers in basic monolingual IR is not as high as often thought even for a morphologically complex language.

Our research questions are the following:

(1) Is the FCG approach viable across languages of varying morphological complexity?
  (1a) In order of increasing complexity, what is the performance of FCG in Swedish, German, Russian and Finnish as observed in generally available test collections?
  (1b) How many morphological surface forms are needed to achieve reasonable performance?
  (1c) How does this performance compare to doing nothing at all, stemming and lemmatization?
(2) What is the effect of topic length on the performance of FCG as compared to doing nothing at all, stemming or lemmatization?

The main research question of the paper is whether our FCG method can be shown to work with other languages that have non-trivial morphology. As the idea of the method is based on the skewed distributions of word form frequencies, it is expected to work regardless of the language in question, but verification for more than one language (Finnish) is also needed.

The performance of our new methods is compared to the state of the art, the use of a lemmatizer, which is a more demanding baseline than the raw words that have become all too common as a baseline in IR (e.g., Hollink et al. 2004; Braschler and Ripplinger 2004; Mayfield and McNamee 2003; Tomlinson 2004a, b). We have argued in Kettunen et al. (2005) that the performance gained with raw words is quite meager and variable for a morphologically rich language like Finnish, and thus the performance gains attributed to different morphological processing methods are not as great as they are thought to be. If comparisons are made, they should be made with respect to the state of the art or gold standard, not with respect to the worst possible result, as is now often done in IR. With morphologically complex languages the best retrieval result is usually attained through a lemmatizer, such as TWOL for different languages (Koskenniemi 1996). This line of argumentation is taken in the present study.

The structure of our paper is as follows. First we discuss distributions of word forms in the light of linguistic corpus statistics and introduce our word form frequency based method and the IR results of Kettunen and Airio (2006). After this, our frequency based keyword generation method is introduced, tested and discussed using three European languages of increasing morphological complexity: Swedish, German, and Russian.

Distributions of word forms

It is well known that words and word forms are not evenly distributed in texts. Some word forms occur often, some are rare. Even the different morphological categories have rates of their own, and both semantic and morphological factors play a role in the distribution of word form frequencies (Baayen 1993, 2001; Manning and Schütze 1999). Karlsson (1986, 2000), for example, shows with some semantically distinctive word types how the case distributions of words differ in Finnish. A word denoting a place, like Helsinki, has, besides the dominating nominative and genitive singular forms, mainly occurrences of locative cases. A person’s name like Martti occurs mostly in the nominative singular. The same sort of analysis is given by Kostić et al. (2003) for Serbian, although they seem to be hesitant about the semantic origins of the phenomenon. We shall not explore the semantic factors of case distribution any further, but analyze the distribution of cases on the morphological level only.

In Kettunen and Airio (2006) we first sought corpus statistics of Finnish nominal word forms. Then we verified these statistics with two independent automatic analyses of larger corpora. Our analysis and earlier corpus statistics showed that six cases (out of 14) constituted about 84–88% of the token level occurrences of case forms for nouns, thus covering 84–88% of the possible variation of about 2,000 distinct inflectional forms of nouns. Our analysis also showed that the huge number of grammatical forms is mainly due to clitics and possessive endings that are almost nonexistent even in a reasonably large textual corpus (10.3 M nouns). This analysis demonstrated that, while a language may in principle be morphologically complex, in practice it is much less so.
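The coverage argument above can be illustrated with a minimal sketch. It assumes that token counts per case are already available from some corpus analysis; the function name `cases_needed` and the counts in `toy_counts` are invented for illustration and are not the Finnish figures reported in Kettunen and Airio (2006).

```python
from collections import Counter


def cases_needed(case_counts, target):
    """Return (cases, coverage): the most frequent cases, taken in
    descending order of token count, until their cumulative share of
    all tokens reaches `target` (a fraction between 0 and 1)."""
    total = sum(case_counts.values())
    if total == 0:
        return [], 0.0
    chosen, covered = [], 0
    for case, count in Counter(case_counts).most_common():
        chosen.append(case)
        covered += count
        if covered / total >= target:
            break
    return chosen, covered / total


# Hypothetical token counts per case (NOT real corpus data).
toy_counts = {
    "nominative": 400,
    "genitive": 250,
    "partitive": 150,
    "inessive": 60,
    "elative": 50,
    "illative": 40,
    "adessive": 20,
    "other": 30,
}

cases, coverage = cases_needed(toy_counts, target=0.85)
print(cases, round(coverage, 2))
```

With a skewed distribution of this kind, a handful of cases accounts for most tokens, which is the property the FCG approach relies on.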

Distribution based handling of keyword variation for IR

Our FCG (Frequent Case (Form) Generation) method and its language specific testing are simply as follows:

For a sufficiently morphologically complex language, the distribution of different nominal case forms (or other word forms) is first studied through corpus analysis, if such results are not already available for the language. The corpus used can be quite small, because variation at this level of language can be detected even from smaller corpora. Variation in textual styles may slightly affect the results, so a style neutral corpus is best. If style specific results are sought, then an appropriate corpus needs to be used in the word form occurrence analysis.
After the most frequent (case) forms for the language have been found with corpus statistics, the IR results of using only these forms for noun and adjective keywords are tested. For comparison, the best available normalization method (lemmatization or stemming) is used. The number of tested FCG processes depends on the morphological complexity of the language: more processes can be tested for a complex language, only a few for a simpler one.
After testing, the FCG process that performs best relative to normalization can usually be identified. The testing process will probably also show that more than one FCG process gives quite good results, and thus a varying number of keyword forms can be used for different retrieval purposes, if necessary.
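The query-side part of these steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the experiments: `fcg_expand`, the paradigm dictionary and the list of "frequent" slots are all hypothetical, and the slots chosen here are not the German FCG forms selected in the paper. A real system would obtain the paradigm from a word form generator.

```python
def fcg_expand(paradigm, frequent_slots):
    """Return the distinct surface forms of one keyword that fall in the
    given (case, number) slots, in the order the slots are listed."""
    variants = []
    for slot in frequent_slots:
        form = paradigm.get(slot)
        if form is not None and form not in variants:
            variants.append(form)
    return variants


# Full paradigm of the German noun "Mann" (optional dative -e omitted).
MANN = {
    ("nominative", "singular"): "Mann",
    ("accusative", "singular"): "Mann",
    ("dative", "singular"): "Mann",
    ("genitive", "singular"): "Mannes",
    ("nominative", "plural"): "Männer",
    ("accusative", "plural"): "Männer",
    ("dative", "plural"): "Männern",
    ("genitive", "plural"): "Männer",
}

# Hypothetical choice of "most frequent" slots, for illustration only.
FREQUENT = [
    ("nominative", "singular"),
    ("genitive", "singular"),
    ("nominative", "plural"),
    ("dative", "plural"),
]

variants = fcg_expand(MANN, FREQUENT)
print(variants)
```

The resulting variants would then be searched together against the un-normalized index in place of the single topic word.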

We simulated the process of keyword generation in our tests, but as word form generation programs are available for many languages, their output could be modified accordingly for real use, i.e., only the most frequent of the generated forms would be used in search.

Based on this method, we tested four different FCGs in two different full-text collections of Finnish, TUTK (with multi-valued relevance; Sormunen 2000) and CLEF 2003 (with binary relevance; Peters 2003). The results of Kettunen and Airio (2006) showed that frequent case form generation works in full-text retrieval of inflected indexes in a best-match query system and, at its best, competes well with the gold standard, lemmatization, for Finnish. Our best FCG procedures, FCG_9 and FCG_12 (with 9 and 12 variant keyword forms), achieved about 86% of the best average precisions of the FINTWOL lemmatizer in TUTK and about 90% in CLEF 2003. We thus performed successful information retrieval of Finnish with nine and twelve variant keyword forms, which is 0.48% and 0.64% of the possible grammatical forms of Finnish nouns (∑ = 1872) and about 34.6% and 46.2% of the productive forms (∑ = 26).
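The percentages quoted above follow directly from the form counts given in the text (9 and 12 generated forms, 1872 grammatical forms, 26 productive forms). A small check, with `share_percent` as a helper name introduced here only for illustration:

```python
def share_percent(n_forms, n_total):
    """Percentage of n_total covered by n_forms."""
    return 100 * n_forms / n_total


GRAMMATICAL_FORMS = 1872  # possible grammatical forms of Finnish nouns
PRODUCTIVE_FORMS = 26     # productive forms

for n in (9, 12):
    print(
        f"FCG_{n}: "
        f"{share_percent(n, GRAMMATICAL_FORMS):.2f}% of grammatical forms, "
        f"{share_percent(n, PRODUCTIVE_FORMS):.1f}% of productive forms"
    )
```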

One possible bottleneck of the method, too slow index search with many key forms, was also analyzed in Kettunen and Airio (2006): runtimes of the FCG queries were shown to be comparable to those of the other methods with 60 queries of the CLEF 2003 collection. Thus a hitherto unused method, frequent case form generation for morphologically complex languages, appears as a simple and effective alternative to more traditional methods like lemmatization or stemming in IR.

In Kettunen and Airio (2006) we used typical long queries built from the title and description fields of the CLEF 2003 topics. These results are replicated in Table 1 (Footnote 1). For comparison, we now also ran very short queries built from the title fields only (mean length 2.55 words after stop words were omitted), for the five best methods of our earlier study plus the topic words as such. Results of these runs are in Table 2.

Table 1 Finnish CLEF 2003 results, 45 title-description queries

Case	Singular	Plural
Nominative	der Mann	die Männer
Accusative	den Mann	die Männer
Dative	dem Mann(e)	den Männern
Genitive	des Mann(e)s	der Männer
