Abstract
Until the introduction of the method for evaluation of stemming algorithms based on error counting, the effectiveness of these algorithms was compared by determining their retrieval performance on various experimental test collections. With this method, the performance of a stemmer is computed by counting the number of identifiable errors made while stemming words from various text samples, which makes the evaluation independent of Information Retrieval. Implementing the method requires manually grouping the words in each sample into disjoint sets of words that share the same semantic concept, so a single word can belong to only one concept. To perform this grouping automatically, the present work generalizes this constraint and allows one word to belong to several different concepts. Results with the generalized method confirm those obtained with the non-generalized method, but show considerably smaller differences between three affix removal stemmers. Four letter-successor-variety stemmers, evaluated here for the first time, appear slightly inferior to the three affix removal stemmers in terms of general accuracy (ERRT, error rate relative to truncation), but they are weight adjustable and, most importantly, need no linguistic knowledge of the language they are applied to.
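To make the idea of error counting over disjoint concept groups concrete, the following is a minimal sketch, not the authors' implementation. It assumes the usual pair-counting formulation: an understemming error is a pair of words from the same concept group that receive different stems, and an overstemming error is a pair of words from different groups that receive the same stem. The names `error_counts` and the toy truncation stemmers are hypothetical. The sketch covers neither the generalization to overlapping concepts nor the ERRT computation.

```python
from collections import Counter, defaultdict


def error_counts(groups, stem):
    """Return (understemming_index, overstemming_index).

    groups: list of disjoint lists of distinct words, one list per concept.
    stem:   function mapping a word to its stem (hypothetical stemmer).
    """
    total_words = sum(len(g) for g in groups)
    desired_merges = 0.0      # pairs that should share a stem
    desired_nonmerges = 0.0   # pairs that should not share a stem
    unachieved_merges = 0.0   # same concept, different stems
    wrong_merges = 0.0        # different concepts, same stem

    for g in groups:
        n = len(g)
        desired_merges += 0.5 * n * (n - 1)
        desired_nonmerges += 0.5 * n * (total_words - n)
        # Sizes of the stem classes that this concept group was split into.
        stem_sizes = Counter(stem(w) for w in g)
        unachieved_merges += 0.5 * sum(u * (n - u) for u in stem_sizes.values())

    # For each stem, count how many words it draws from each concept group.
    by_stem = defaultdict(Counter)
    for gi, g in enumerate(groups):
        for w in g:
            by_stem[stem(w)][gi] += 1
    for members in by_stem.values():
        ns = sum(members.values())
        wrong_merges += 0.5 * sum(v * (ns - v) for v in members.values())

    ui = unachieved_merges / desired_merges if desired_merges else 0.0
    oi = wrong_merges / desired_nonmerges if desired_nonmerges else 0.0
    return ui, oi


if __name__ == "__main__":
    sample = [
        ["connect", "connected", "connection"],
        ["contain", "container"],
        ["cat"],
    ]
    # Truncation to a fixed prefix length serves as a toy stemmer.
    for length in (3, 4, 8):
        print(length, error_counts(sample, lambda w, k=length: w[:k]))
```

Short truncation lengths merge unrelated groups and raise the overstemming index, while long ones split related words and raise the understemming index. This trade-off is what a measure relative to truncation is compared against.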
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
de Madariaga, R.S., del Castillo, J.R.F., Hilera, J.R. (2005). A Generalization of the Method for Evaluation of Stemming Algorithms Based on Error Counting. In: Consens, M., Navarro, G. (eds) String Processing and Information Retrieval. SPIRE 2005. Lecture Notes in Computer Science, vol 3772. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11575832_26
DOI: https://doi.org/10.1007/11575832_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29740-6
Online ISBN: 978-3-540-32241-2
eBook Packages: Computer Science (R0)