On Compression-Based Text Classification

  • Conference paper
Advances in Information Retrieval (ECIR 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3408)

Abstract

Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word-stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.
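
To illustrate the general idea of classifying with a compressor and no preprocessing, here is a minimal sketch. It is not the authors' implementation: it assumes Python's standard zlib module as a stand-in compressor (the abstract does not say which compressors were evaluated), and the function names and toy corpora are hypothetical. A document is assigned to the class whose training text it adds the fewest compressed bytes to.

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, training: Mapping[str, str]) -> str:
    """Return the label whose training text makes the document cheapest to encode."""

    def extra_bytes(label: str) -> int:
        corpus = training[label]
        # Cost of the document given the class: compressed size of the
        # concatenation minus compressed size of the class text alone.
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(training, key=extra_bytes)


# Hypothetical toy corpora, one raw string per class (assumption for illustration).
TOY_TRAINING = {
    "sports": (
        "the team won the match after scoring two goals in the second half. "
        "the coach praised the players and the goalkeeper."
    ),
    "cooking": (
        "stir the sauce and simmer the onions with garlic and olive oil. "
        "season the soup with salt and pepper before serving."
    ),
}

if __name__ == "__main__":
    query = "the coach praised the goalkeeper after the team won the match"
    print(classify(query, TOY_TRAINING))
```

Because the compressor works on raw characters, repeated punctuation, word fragments, and multi-word phrases all lower the encoding cost without any tokenization step. The sketch recompresses the class text for every query, which hints at the slow running time the abstract lists as a drawback.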

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Marton, Y., Wu, N., Hellerstein, L. (2005). On Compression-Based Text Classification. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_22

  • DOI: https://doi.org/10.1007/978-3-540-31865-1_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25295-5

  • Online ISBN: 978-3-540-31865-1

  • eBook Packages: Computer Science (R0)