Analysis and Algorithms for Stemming Inversion

  • Conference paper
Information Retrieval Technology (AIRS 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6458)

Abstract

Stemming is a fundamental technique for processing large amounts of data in information retrieval and text mining. However, reversing the process afterwards is often desirable, e.g., for human interpretation or for methods that operate on sequences of characters. We present a formal analysis of the stemming inversion problem and show that the underlying optimization problem, which captures conceptual groups as known from under- and overstemming, is of high computational complexity. We present efficient heuristic algorithms for practical application in information retrieval and test our approach on real data.
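
To make the notion of stemming inversion concrete, the following minimal Python sketch maps a stem back to the most frequent surface form that produced it in a corpus. It is an illustration under stated assumptions, not the algorithm from the paper: `toy_stem` is a hypothetical suffix stripper standing in for a real stemmer, and the frequency-based choice in `invert_stem` is only one simple heuristic among many possible ones.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverse_index(tokens, stem=toy_stem):
    """Map each stem to a count of the surface forms that produced it."""
    index = defaultdict(Counter)
    for token in tokens:
        index[stem(token)][token] += 1
    return index


def invert_stem(stem_value, index):
    """Pick the most frequent surface form; ties go to the shorter,
    then alphabetically smaller word (an illustrative assumption)."""
    candidates = index.get(stem_value)
    if not candidates:
        return stem_value  # unseen stem: fall back to the stem itself
    return min(candidates, key=lambda w: (-candidates[w], len(w), w))


# Tiny made-up corpus, used only to exercise the functions.
corpus = ("mining mining mined mines retrieval retrieval "
          "stemming stemmed stemming").split()
index = build_inverse_index(corpus)
print(invert_stem("min", index))    # mining
print(invert_stem("stemm", index))  # stemming
```

A per-stem choice like this ignores the conceptual groups caused by under- and overstemming, which is the part the abstract identifies as computationally hard.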

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Feinerer, I. (2010). Analysis and Algorithms for Stemming Inversion. In: Cheng, PJ., Kan, MY., Lam, W., Nakov, P. (eds) Information Retrieval Technology. AIRS 2010. Lecture Notes in Computer Science, vol 6458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17187-1_28

  • DOI: https://doi.org/10.1007/978-3-642-17187-1_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-17186-4

  • Online ISBN: 978-3-642-17187-1

  • eBook Packages: Computer Science, Computer Science (R0)
