Classifying with Co-stems

A New Representation for Information Filtering

  • Conference paper
Advances in Information Retrieval (ECIR 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6611)

Abstract

Besides content, writing style is an important discriminator in information filtering tasks. Ideally, a solution to a filtering task employs a text representation that models both kinds of characteristics. In this respect, word stems clearly capture content, whereas word suffixes qualify as writing style indicators. Though the latter feature type is used for part-of-speech tagging, it has not yet been employed for information filtering in general. We propose a text representation that combines both the output of a stemming algorithm (stems) and the stem-reduced words (co-stems). A co-stem can be a prefix, an infix, a suffix, or a concatenation of prefixes, infixes, or suffixes. Using accepted standard corpora, we analyze the discriminative power of this representation for a broad range of information filtering tasks to provide new insights into the adequacy and task-specificity of text representation models. Altogether, we observe that co-stem-based representations outperform the classical bag-of-words model for several filtering tasks.
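
To make the stem/co-stem split concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy suffix-stripping rule in place of a real stemming algorithm, so every co-stem it produces is a suffix; the names `SUFFIXES`, `split_word`, and `represent` are hypothetical and chosen for illustration only.

```python
from collections import Counter

# Hypothetical toy suffix list, longest first. This is an assumption for
# illustration; a real system would use an established stemming algorithm.
SUFFIXES = ("ations", "ation", "ing", "ed", "ly", "es", "s")


def split_word(word, min_stem_len=3):
    """Split a word into (stem, co_stem) by naive suffix stripping.

    The co-stem is whatever the stemmer removed; it is empty when
    nothing was stripped.
    """
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem_len:
            return w[: -len(suffix)], suffix
    return w, ""


def represent(text):
    """Build two bags of features for a text: stems and co-stems."""
    stems, co_stems = Counter(), Counter()
    for token in text.split():
        token = token.strip(".,;:!?\"'()")
        if not token:
            continue
        stem, co_stem = split_word(token)
        stems[stem] += 1
        if co_stem:
            co_stems[co_stem] += 1
    return stems, co_stems


if __name__ == "__main__":
    stems, co_stems = represent("Filtering filters filtered text.")
    print(stems)     # content-oriented features
    print(co_stems)  # style-oriented features
```

In this sketch the two counters can be used separately or concatenated into one feature vector; which combination works best for a given filtering task is exactly the kind of question the paper investigates.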

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lipka, N., Stein, B. (2011). Classifying with Co-stems. In: Clough, P., et al. Advances in Information Retrieval. ECIR 2011. Lecture Notes in Computer Science, vol 6611. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20161-5_31

  • DOI: https://doi.org/10.1007/978-3-642-20161-5_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20160-8

  • Online ISBN: 978-3-642-20161-5

  • eBook Packages: Computer Science, Computer Science (R0)
