Abstract
Stemming is a fundamental technique for processing large amounts of data in information retrieval and text mining. After processing, however, reversing the stemming is often desirable, e.g., for human interpretation or for methods that operate on sequences of characters. We present a formal analysis of the stemming inversion problem and show that the underlying optimization problem, which captures conceptual groups as known from under- and overstemming, is of high computational complexity. We present efficient heuristic algorithms for practical application in information retrieval and test our approach on real data.
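To make the inversion problem concrete, the following is a minimal sketch of one simple heuristic: map each stem back to the most frequent original word that produced it. It is not taken from the chapter and is not the author's algorithm; the toy stemmer `toy_stem`, the helper `build_inverse`, and the sample tokens are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical toy stemmer: strips a few common English suffixes.

    This stands in for a real stemmer and is an assumption of this sketch.
    """
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverse(corpus_tokens, stem=toy_stem):
    """Map each stem to the most frequent original word that produced it."""
    candidates = defaultdict(Counter)
    for token in corpus_tokens:
        candidates[stem(token)][token] += 1
    # Ties are broken by shorter word, then alphabetically, so the
    # result is deterministic.
    return {
        s: min(counts, key=lambda w: (-counts[w], len(w), w))
        for s, counts in candidates.items()
    }


tokens = "mining mined mines mining text texts text".split()
inverse = build_inverse(tokens)
print(inverse)
```

In this sketch several distinct words collapse to the same stem, so the inverse is not unique; choosing by frequency is just one possible selection rule and does not address the conceptual groups from under- and overstemming that the abstract mentions.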
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Feinerer, I. (2010). Analysis and Algorithms for Stemming Inversion. In: Cheng, PJ., Kan, MY., Lam, W., Nakov, P. (eds) Information Retrieval Technology. AIRS 2010. Lecture Notes in Computer Science, vol 6458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17187-1_28
DOI: https://doi.org/10.1007/978-3-642-17187-1_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17186-4
Online ISBN: 978-3-642-17187-1
eBook Packages: Computer Science, Computer Science (R0)