Spam Filtering without Text Analysis

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 12)

Abstract

This paper introduces a new way to filter spam that uses Kolmogorov complexity theory as its theoretical background and a Support Vector Machine as its learning component. The idea is to skip the classical text analysis used by standard filtering techniques and instead measure the informative content of a message in order to classify it as spam or legitimate. Because the information content of a message can be estimated through compression techniques, we represent an e-mail as a multi-dimensional real vector and train a Support Vector Machine on these vectors. The resulting classifier achieves accuracy rates in the range of 90% to 97%, which places our combined technique among the top current spam filtering technologies.
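The abstract does not say which compressors or which vector components the authors use, so the following is only a minimal sketch of the general idea: estimating information content by compression and collecting the estimates into a real-valued vector. The choice of `zlib`, `bz2` and `lzma`, the use of compression ratios as features, and the function names `compression_ratio` and `message_to_vector` are all assumptions made for illustration, not the paper's method.

```python
import bz2
import lzma
import zlib
from typing import Callable, List


def compression_ratio(data: bytes, compress: Callable[[bytes], bytes]) -> float:
    """Return compressed size / original size (0.0 for empty input).

    A low ratio suggests redundant, low-information content; a ratio
    near or above 1.0 suggests content the compressor cannot shrink.
    """
    if not data:
        return 0.0
    return len(compress(data)) / len(data)


def message_to_vector(message: str) -> List[float]:
    """Map a message to a 3-dimensional vector of compression ratios.

    The three compressors are an illustrative assumption; the paper's
    actual feature set is not given in the abstract.
    """
    data = message.encode("utf-8", errors="replace")
    return [
        compression_ratio(data, zlib.compress),
        compression_ratio(data, bz2.compress),
        compression_ratio(data, lzma.compress),
    ]


if __name__ == "__main__":
    # A highly repetitive message compresses well, so its ratios are small.
    print(message_to_vector("buy now " * 200))
```

Vectors produced this way, paired with spam or legitimate labels, could then be passed to any Support Vector Machine implementation for training. Very short messages can yield ratios above 1.0 because of fixed compressor overhead, so a real feature set would probably need to account for message length.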


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Belabbes, S., Richard, G. (2008). Spam Filtering without Text Analysis. In: Jahankhani, H., Revett, K., Palmer-Brown, D. (eds) Global E-Security. ICGeS 2008. Communications in Computer and Information Science, vol 12. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69403-8_18

  • DOI: https://doi.org/10.1007/978-3-540-69403-8_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69402-1

  • Online ISBN: 978-3-540-69403-8

  • eBook Packages: Computer Science, Computer Science (R0)
