Classifying with Co-stems

Lipka, Nedim; Stein, Benno

doi:10.1007/978-3-642-20161-5_31


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6611)

Included in the following conference series:

European Conference on Information Retrieval


Abstract

Besides content, writing style is an important discriminator in information filtering tasks. Ideally, the solution of a filtering task employs a text representation that models both kinds of characteristics. In this respect, word stems clearly capture content, whereas word suffixes qualify as writing style indicators. Though the latter feature type is used for part-of-speech tagging, it has not yet been employed for information filtering in general. We propose a text representation that combines both the output of a stemming algorithm (stems) and the stem-reduced words (co-stems). A co-stem can be a prefix, an infix, a suffix, or a concatenation of prefixes, infixes, or suffixes. Using accepted standard corpora, we analyze the discriminative power of this representation for a broad range of information filtering tasks to provide new insights into the adequacy and task-specificity of text representation models. Altogether, we observe that co-stem-based representations outperform the classical bag-of-words model for several filtering tasks.
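To make the stem/co-stem split concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical toy suffix stripper (`split_word`, with a hand-picked `SUFFIXES` list) in place of an established stemming algorithm, so every co-stem it produces is a suffix; prefixes and infixes, which the abstract also allows, are not covered. The feature names `stem=` and `costem=` are likewise illustrative.

```python
from collections import Counter

# Hypothetical toy suffix list (an assumption for illustration only);
# a real system would rely on an established stemming algorithm.
SUFFIXES = ("ational", "ization", "ingly", "ness", "ment",
            "ing", "ed", "ly", "es", "s")

def split_word(word):
    """Return (stem, co_stem), stripping the longest matching suffix.

    The co-stem is whatever remains of the word once the stem is removed.
    A suffix is only stripped if at least three characters are left.
    """
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)], suffix
    return word, ""

def stem_costem_features(text):
    """Count stems and co-stems of a text in one combined feature bag."""
    features = Counter()
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")
        if not token:
            continue
        stem, co_stem = split_word(token)
        features["stem=" + stem] += 1
        if co_stem:  # words left unchanged contribute no co-stem
            features["costem=" + co_stem] += 1
    return features

if __name__ == "__main__":
    print(stem_costem_features("She walked quickly."))
```

Keeping the two feature types in separate namespaces lets a classifier weight content cues (stems) and style cues (co-stems) independently, which is the intuition the abstract describes.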



Author information

Authors and Affiliations

Bauhaus-Universität Weimar, 99421, Weimar, Germany
Nedim Lipka & Benno Stein


Editor information

Editors and Affiliations

Information School, University of Sheffield, Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK
Paul Clough
CLARITY: Centre for Sensor Web Technologies, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland
Colum Foley, Cathal Gurrin & Hyowon Lee
Centre for Next Generation Localisation, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland
Gareth J. F. Jones
TNO Human Factors, Brassersplein 2, 2612 CT, Delft, The Netherlands
Wessel Kraaij
Yahoo! Research, 177 Diagonal, 08018, Barcelona, Spain
Vanessa Murdock


Cite this paper

Lipka, N., Stein, B. (2011). Classifying with Co-stems. In: Clough, P., et al. Advances in Information Retrieval. ECIR 2011. Lecture Notes in Computer Science, vol 6611. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20161-5_31


DOI: https://doi.org/10.1007/978-3-642-20161-5_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20160-8
Online ISBN: 978-3-642-20161-5
eBook Packages: Computer Science, Computer Science (R0)
