Abstract
This paper discusses Finnish information retrieval and the management of keyword variation by generating inflected variant forms of keywords. Finnish is a highly inflectional language, so managing keyword variation in queries and indexes is of utmost importance for successful Finnish full-text retrieval. We show that generating a fairly small number of variant keyword forms leads to good retrieval performance with a probabilistic best-match retrieval system (Lemur). Generating almost the full paradigm of inflected nominal forms improves the results slightly. We also report interesting results regarding different index types: our evaluation shows that generated inflected queries perform extremely well in a lemmatized index, which is supposedly not suitable for this query type. We also show that, in a research environment, even inexact generation that produces many incorrect inflected forms achieves high precision-recall performance without considerable loss in query throughput. We use two different word form generators and their variants, and compare the results to two commonly used reductive methods of word form variation management, stemming and lemmatization. The paper also includes a short discussion of using the variant keyword method with Web search engines.
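The idea of expanding a query with generated inflected forms can be illustrated with a minimal sketch. It is not one of the paper's two word form generators: the function names `generate_variants` and `expand_query` are hypothetical, the suffix rules are a deliberately naive assumption (vowel harmony only, no consonant gradation, only a handful of case forms), and the `#syn(...)` / `#combine(...)` wrapping follows Indri-style structured query syntax as an assumed way of grouping variants, not the configuration reported in the paper.

```python
from typing import List

BACK_VOWELS = set("aou")


def generate_variants(lemma: str) -> List[str]:
    """Naively generate a few frequent case forms of a Finnish noun.

    Assumption: the lemma ends in a short vowel and undergoes no stem
    changes. Consonant gradation is ignored, so a word such as "katu"
    would get the incorrect genitive "katun" instead of "kadun".
    """
    # Vowel harmony: back-vowel words take "a", all others take "ä".
    a = "a" if any(ch in BACK_VOWELS for ch in lemma) else "ä"
    forms = [
        lemma,                    # nominative singular
        lemma + "n",              # genitive singular
        lemma + a,                # partitive singular
        lemma + "ss" + a,         # inessive singular
        lemma + "st" + a,         # elative singular
        lemma + lemma[-1] + "n",  # illative singular (vowel lengthening + n)
        lemma + "t",              # nominative plural
    ]
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(forms))


def expand_query(keywords: List[str]) -> str:
    """Wrap each keyword's variants in a synonym group (assumed syntax)."""
    groups = ["#syn(" + " ".join(generate_variants(k)) + ")" for k in keywords]
    return "#combine(" + " ".join(groups) + ")"


if __name__ == "__main__":
    print(expand_query(["talo", "kylä"]))
```

Because the variants are matched against surface forms, a query built this way is aimed at an index of unprocessed inflected words; the sketch also shows why inexact generation produces some incorrect forms for words with stem alternations.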
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kettunen, K., Arvola, P. (2012). Generating Variant Keyword Forms for a Morphologically Complex Language Leads to Successful Information Retrieval with Finnish. In: Salampasis, M., Larsen, B. (eds) Multidisciplinary Information Retrieval. IRFC 2012. Lecture Notes in Computer Science, vol 7356. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31274-8_10
DOI: https://doi.org/10.1007/978-3-642-31274-8_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31273-1
Online ISBN: 978-3-642-31274-8
eBook Packages: Computer Science (R0)