Analysis and Algorithms for Stemming Inversion

Feinerer, Ingo

doi:10.1007/978-3-642-17187-1_28

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6458)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

Stemming is a fundamental technique for processing large amounts of data in information retrieval and text mining. After processing, however, reversing the stemming step is often desirable, e.g., for human interpretation or for methods that operate on sequences of characters. We present a formal analysis of the stemming inversion problem and show that the underlying optimization problem, which captures conceptual groups as known from under- and overstemming, is of high computational complexity. We present efficient heuristic algorithms for practical application in information retrieval and test our approach on real data.
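
To make the notion of stemming inversion concrete, the following is a minimal sketch of one simple frequency-based heuristic: each stem is mapped back to the most frequent original word that produced it. It is not the algorithm from the paper. The suffix list, the function names (toy_stem, build_inversion_table, invert) and the tie-breaking rule are all assumptions made for illustration, and the toy stemmer merely stands in for a real one such as Porter's.

```python
from collections import Counter, defaultdict

# Hypothetical toy stemmer: strips a few common English suffixes.
# It stands in for a real stemmer purely for illustration.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inversion_table(tokens):
    """Map each stem to the most frequent original word that produced it."""
    candidates = defaultdict(Counter)
    for token in tokens:
        candidates[toy_stem(token)][token] += 1
    # Assumed tie-break: among equally frequent words, take the
    # lexicographically smallest one so the result is deterministic.
    return {
        stem: min(counts, key=lambda w: (-counts[w], w))
        for stem, counts in candidates.items()
    }


def invert(stems, table):
    """Replace each stem by its chosen completion; unknown stems stay as-is."""
    return [table.get(stem, stem) for stem in stems]


corpus = "mining mines mining mined text texts text".split()
table = build_inversion_table(corpus)
stems = [toy_stem(word) for word in corpus]
restored = invert(stems, table)
print(table)
print(restored)
```

Because several distinct words collapse onto one stem, the inversion is lossy: every occurrence of the stem "min" is restored to the same representative word. Choosing representatives well for whole conceptual groups is the optimization problem the abstract refers to.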

Author information

Authors and Affiliations

Vienna University of Technology, Austria
Ingo Feinerer

Editor information

Editors and Affiliations

Department of Computer Science and Information Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan R.O.C.
Pu-Jen Cheng
School of Computing, National University of Singapore (NUS), Computing 1, 13 Computing Drive, 117417, Singapore
Min-Yen Kan
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
Wai Lam
School of Computing, Computing 1, National University of Singapore (NUS), 13 Computing Drive, 117417, Singapore
Preslav Nakov

About this paper

Cite this paper

Feinerer, I. (2010). Analysis and Algorithms for Stemming Inversion. In: Cheng, PJ., Kan, MY., Lam, W., Nakov, P. (eds) Information Retrieval Technology. AIRS 2010. Lecture Notes in Computer Science, vol 6458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17187-1_28

DOI: https://doi.org/10.1007/978-3-642-17187-1_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17186-4
Online ISBN: 978-3-642-17187-1
eBook Packages: Computer Science, Computer Science (R0)