Abstract
Besides the content the writing style is an important discriminator in information filtering tasks. Ideally, the solution of a filtering task employs a text representation that models both kinds of characteristics. In this respect word stems are clearly content capturing, whereas word suffixes qualify as writing style indicators. Though the latter feature type is used for part of speech tagging, it has not yet been employed for information filtering in general. We propose a text representation that combines both the output of a stemming algorithm (stems) and the stem-reduced words (co-stems). A co-stems can be a prefix, an infix, a suffix, or a concatenation of prefixes, infixes, or suffixes. Using accepted standard corpora, we analyze the discriminative power of this representation for a broad range of information filtering tasks to provide new insights into the adequacy and task-specificity of text representation models. Altogether we observe that co-stems-based representations outperform the classical bag of words model for several filtering tasks.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Blanzieri, E., Bryl, A.: A survey of learning-based techniques of email spam filtering. Artificial Intelligence Review 29(1), 63–92 (2008)
Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proc. of the Workshop on Computational Learning Theory, pp. 92–100 (1998)
Gottron, T., Lipka, N.: A comparison of language identification approaches on short, query-style texts. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz, U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 611–614. Springer, Heidelberg (2010)
Hanani, U., Shapira, B., Shoval, P.: Information filtering: Overview of issues, research and systems. User Modeling and User-Adapted Interaction 11(3), 203–259 (2001)
Koppel, M., Schler, J.: Authorship verification as a one-class classification problem. In: Proc. of ICML, p. 62 (2004)
Krovetz, R.: Viewing morphology as an inference process. In: Proc. of SIGIR, pp. 191–202 (1993)
Lang, K.: Newsweeder: learning to filter netnews. In: Proc. of ICML, pp. 331–339 (1995)
Lipka, N., Stein, B.: Identifying Featured Articles in Wikipedia: Writing Style Matters. In: Proc. of WWW, pp. 1147–1148 (2010)
Lovins, J.B.: Development of a stemming algorithm. Mechanical Translation and Computational Linguistics 11, 22–31 (1968)
Ntoulas, A., Najork, M., Manasse, M., Fetterly, D.: Detecting spam web pages through content analysis. In: Proc. of WWW, pp. 83–92 (2006)
Paice, C.D.: Another Stemmer. SIGIR Forum 24(3), 56–61 (1990)
Pang, B., Lee, L.: A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In: Proc. of ACL, pp. 271–278 (2004)
Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment classification using machine learning techniques. In: Proc. of the Conference on Empirical Methods in Natural Language Processing, pp. 79–86 (2002)
Porter, M.F.: An algorithm for suffix stripping. Program: Electronic Library & Information Systems 40(3), 211–218 (1980)
Priedhorsky, R., Chen, J., Lam, S.T.K., Panciera, K., Terveen, L., Riedl, J.: Creating, destroying, and restoring value in wikipedia. In: GROUP 2007: Proc. of the International ACM Conference on Supporting Group Work, pp. 259–268 (2007)
Santini, M.: Common criteria for genre classification: Annotation and granularity. In: Third International Workshop on Text-Based Information Retrieval (2006)
Schler, J., Koppel, M., Argamon, S., Pennebaker, J.: Effects of age and gender on blogging. In: Proc. of AAAI - Symposium on Computational Approaches for Analyzing Weblogs, pp. 191–197 (2006)
Stamatatos, E.: A survey of modern authorship attribution methods. JASIST 60, 538–556 (2009)
Stein, B., Eissen, S.M.Z., Lipka, N.: Web genre analysis: Use cases, retrieval models, and implementation issues. Genres on the Web 42, 167–189 (2011)
Tsur, O., Davidov, D., Rappoport, A.: A Great Catchy Name: Semi-Supervised Recognition of Sarcastic Sentences in Product Reviews. In: Proc. of AAAI - ICWSM (2010)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lipka, N., Stein, B. (2011). Classifying with Co-stems. In: Clough, P., et al. Advances in Information Retrieval. ECIR 2011. Lecture Notes in Computer Science, vol 6611. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20161-5_31
Download citation
DOI: https://doi.org/10.1007/978-3-642-20161-5_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20160-8
Online ISBN: 978-3-642-20161-5
eBook Packages: Computer ScienceComputer Science (R0)