A Generalization of the Method for Evaluation of Stemming Algorithms Based on Error Counting

Ricardo Sánchez de Madariaga; José Raúl Fernández del Castillo; José Ramón Hilera

doi:10.1007/11575832_26

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3772)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Until the introduction of the method for evaluation of stemming algorithms based on error counting, the effectiveness of these algorithms was compared by determining their retrieval performance on various experimental test collections. With this method, the performance of a stemmer is computed by counting the number of identifiable errors made while stemming words from various text samples, which makes the evaluation independent of Information Retrieval. To implement the method, the words in each sample must be grouped manually into disjoint sets of words that share the same semantic concept, so that a single word can belong to only one concept. To perform this grouping automatically, the present work generalizes this constraint, allowing one word to belong to several different concepts. Results with the generalized method confirm those obtained with the non-generalized method, but show considerably smaller differences between three affix removal stemmers. Four letter successor variety stemmers, evaluated here for the first time, appear to be slightly inferior to the other three in terms of general accuracy (ERRT, error rate relative to truncation), but they are weight adjustable and, most importantly, need no linguistic knowledge of the language to which they are applied.
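
The error-counting idea can be illustrated with a minimal Python sketch. It covers only the original, non-generalized setting with disjoint concept groups, and it follows the usual pairwise reading of Paice's (1996) understemming and overstemming indices: a pair of words from the same concept that receive different stems is an understemming error, and a pair from different concepts that receive the same stem is an overstemming error. The function name `error_indices`, the toy word groups, and the toy stem table are all hypothetical and are not taken from the paper. ERRT additionally requires a baseline of truncation stemmers and is not computed here, and the abstract does not say how the counts change when a word may belong to several concepts, so that case is not attempted.

```python
from itertools import combinations


def error_indices(concept_groups, stem_of):
    """Return (UI, OI) for disjoint concept groups.

    concept_groups: dict mapping a concept label to a list of words
                    (each word is assumed to appear in exactly one group).
    stem_of:        dict mapping every word to the stem a stemmer produced.
    """
    # Invert the grouping so each word points at its single concept.
    concept_of = {w: c for c, words in concept_groups.items() for w in words}

    desired = unachieved = 0   # same-concept pairs / those left unmerged
    non_merge = wrong = 0      # different-concept pairs / those merged

    for a, b in combinations(sorted(concept_of), 2):
        same_stem = stem_of[a] == stem_of[b]
        if concept_of[a] == concept_of[b]:
            desired += 1
            unachieved += not same_stem   # understemming error
        else:
            non_merge += 1
            wrong += same_stem            # overstemming error

    ui = unachieved / desired if desired else 0.0
    oi = wrong / non_merge if non_merge else 0.0
    return ui, oi


# Toy sample: two hand-made concept groups and a made-up stemmer output.
groups = {
    "run": ["run", "running", "runs"],
    "rune": ["rune", "runes"],
}
stems = {
    "run": "run", "running": "runn", "runs": "run",
    "rune": "run", "runes": "rune",
}

ui, oi = error_indices(groups, stems)
print(f"UI = {ui:.3f}, OI = {oi:.3f}")
```

In this toy sample three of the four same-concept pairs are left unmerged and two of the six cross-concept pairs are merged, so a stemmer with these outputs would be penalized on both indices. Enumerating pairs is quadratic in the sample size; it is chosen here only because it makes the two error types explicit.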

Author information

Authors and Affiliations

Dept. of Computer Science, University of Alcalá, 28805, Madrid, Spain
Ricardo Sánchez de Madariaga, José Raúl Fernández del Castillo & José Ramón Hilera

Editor information

Editors and Affiliations

University of Toronto
Mariano Consens
Dept. of Computer Science, University of Chile
Gonzalo Navarro

Cite this paper

de Madariaga, R.S., del Castillo, J.R.F., Hilera, J.R. (2005). A Generalization of the Method for Evaluation of Stemming Algorithms Based on Error Counting. In: Consens, M., Navarro, G. (eds) String Processing and Information Retrieval. SPIRE 2005. Lecture Notes in Computer Science, vol 3772. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11575832_26

DOI: https://doi.org/10.1007/11575832_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29740-6
Online ISBN: 978-3-540-32241-2
eBook Packages: Computer Science, Computer Science (R0)
