Abstract
This paper introduces a new way to filter spam that uses Kolmogorov complexity theory as its theoretical background and a Support Vector Machine as its learning component. Our idea is to skip the classical text analysis used by standard filtering techniques and to focus instead on measuring the informative content of a message in order to classify it as spam or legitimate. Exploiting the fact that the information content of a message can be estimated through compression techniques, we represent an e-mail as a multi-dimensional real vector and train a Support Vector Machine on these vectors. The resulting classifier achieves accuracy rates in the range of 90%-97%, which places our combined technique among the top current spam filtering technologies.
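To illustrate how compression can stand in for an information-content measure, the following minimal sketch maps a message to a small real vector of compression ratios. The choice of compressors (zlib, bz2, lzma), the ratio feature, and the names `compression_ratio` and `message_features` are assumptions made for illustration; the abstract does not state which compressors or vector components the authors used. The SVM training step is omitted.

```python
import bz2
import lzma
import random
import zlib
from typing import Callable, List


def compression_ratio(data: bytes, compress: Callable[[bytes], bytes]) -> float:
    """Compressed size divided by original size.

    A low ratio suggests redundant, low-information content; a ratio
    near (or above) 1 suggests content the compressor cannot shorten.
    """
    if not data:
        return 0.0
    return len(compress(data)) / len(data)


def message_features(raw: bytes) -> List[float]:
    """Hypothetical feature vector: one ratio per compressor (an assumption)."""
    return [
        compression_ratio(raw, zlib.compress),
        compression_ratio(raw, bz2.compress),
        compression_ratio(raw, lzma.compress),
    ]


# Two synthetic inputs of equal length: highly repetitive vs. pseudo-random.
repetitive = b"buy now " * 200
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))

repetitive_vec = message_features(repetitive)
noisy_vec = message_features(noisy)

print("repetitive:", [round(r, 3) for r in repetitive_vec])
print("noisy:     ", [round(r, 3) for r in noisy_vec])
```

Vectors of this kind, computed for labelled spam and legitimate messages, could then be passed to any SVM trainer. Real compressors only give an upper bound on Kolmogorov complexity, so the ratios are estimates rather than exact measures.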
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Belabbes, S., Richard, G. (2008). Spam Filtering without Text Analysis. In: Jahankhani, H., Revett, K., Palmer-Brown, D. (eds) Global E-Security. ICGeS 2008. Communications in Computer and Information Science, vol 12. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69403-8_18
DOI: https://doi.org/10.1007/978-3-540-69403-8_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69402-1
Online ISBN: 978-3-540-69403-8
eBook Packages: Computer Science, Computer Science (R0)