On Compression-Based Text Classification

Marton, Yuval; Wu, Ning; Hellerstein, Lisa

doi:10.1007/978-3-540-31865-1_22

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3408)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word-stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.
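To make the general idea concrete, here is a minimal sketch of one common compression-based scheme: a document is assigned to the class whose training text lets an off-the-shelf compressor encode it most cheaply. It is an illustration under stated assumptions, not the authors' experimental procedure. The function names (`compressed_size`, `classify`), the toy corpora, and the choice of Python's `zlib` as the compressor are all assumptions made for this example.

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs for the UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, training: Mapping[str, str]) -> str:
    """Return the label whose training text makes `document` cheapest to encode.

    The cost of a label is the growth in compressed size when the document
    is appended to that label's training text. No tokenization or other
    preprocessing is applied; the compressor works on raw characters.
    """
    def extra_bytes(label: str) -> int:
        corpus = training[label]
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(training, key=extra_bytes)


# Hypothetical toy corpora, for illustration only.
training = {
    "sports": (
        "The team won the match after scoring a late goal in the final minute. "
        "The coach praised the defense and the goalkeeper."
    ),
    "finance": (
        "The bank raised interest rates as inflation and stock prices climbed. "
        "Investors sold bonds and bought shares."
    ),
}

if __name__ == "__main__":
    print(classify("scoring a late goal in the final minute", training))
    print(classify("interest rates as inflation and stock prices", training))
```

Because the compressor sees raw characters, repeated punctuation, word stems, and multi-word phrases can all lower the encoding cost, which is the kind of non-word feature the abstract refers to. Recompressing the training text for every document and class also hints at why running time is listed as a drawback.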

Author information

Authors and Affiliations

Department of Linguistics, University of Maryland, 1401 Marie Mount Hall, College Park, MD, 20742-7505
Yuval Marton
Department of Computer and Information Science, Polytechnic University, 5 Metrotech Center, Brooklyn, NY, 11201
Ning Wu & Lisa Hellerstein

Editor information

Editors and Affiliations

Departamento de Electrónica y Computación, Universidad de Santiago de Compostela, Spain
David E. Losada
Departamento de Ciencias de la Computación e Inteligencia Artificial E.T.S.I. Informática y de Telecomunicación, Universidad de Granada, 18071, Granada, Spain
Juan M. Fernández-Luna

Cite this paper

Marton, Y., Wu, N., Hellerstein, L. (2005). On Compression-Based Text Classification. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_22

Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25295-5
Online ISBN: 978-3-540-31865-1
eBook Packages: Computer Science, Computer Science (R0)