Spam Filtering through Anomaly Detection

  • Igor Santos
  • Carlos Laorden
  • Xabier Ugarte-Pedrero
  • Borja Sanz
  • Pablo G. Bringas
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 314)


More than 85% of received e-mails are spam. Spam is an important computer security issue because it is used to spread other threats such as computer viruses, worms and phishing. Classic anti-spam techniques, such as sender blacklisting or e-mail signatures, are no longer completely reliable. Machine-learning techniques trained on statistical representations of the terms that usually appear in e-mails are widely used in the literature. However, these methods demand a time-consuming training step with labelled data, and the limited availability of labelled training instances slows down the progress of filtering systems and offers advantages to spammers. In this paper, we present the first spam filtering method based on anomaly detection, which reduces the need to label spam messages and uses only a representation of legitimate e-mails. This approach represents legitimate e-mails as word frequency vectors; an e-mail is then classified as spam or legitimate by measuring its deviation from the representation of these legitimate e-mails. The method achieves high accuracy in detecting spam while maintaining a low false positive rate, reducing the spam-labelling effort.
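The classification step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes raw term-frequency vectors, cosine similarity as the deviation measure, and an arbitrary illustrative threshold; the paper's actual representation, deviation metric and cutoff may differ.

```python
from collections import Counter
import math

def tf_vector(text):
    """Term-frequency vector over lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class AnomalySpamFilter:
    """Sketch of the idea: model only legitimate mail, flag deviations.

    `threshold` is a hypothetical cutoff chosen for illustration; in
    practice it would be tuned empirically on validation data.
    """
    def __init__(self, legit_emails, threshold=0.2):
        self.model = [tf_vector(e) for e in legit_emails]
        self.threshold = threshold

    def is_spam(self, email):
        v = tf_vector(email)
        # Deviation is measured as distance from the closest legitimate
        # e-mail; low maximum similarity means the message is anomalous.
        best = max(cosine(v, m) for m in self.model)
        return best < self.threshold
```

A message sharing vocabulary with the legitimate corpus scores a high similarity and passes; a message with unfamiliar terms falls below the threshold and is flagged as spam. No labelled spam is needed at any point, which is the key property the abstract highlights.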


Keywords: Computer security · Spam filtering · Anomaly detection · Text classification





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Igor Santos (1)
  • Carlos Laorden (1)
  • Xabier Ugarte-Pedrero (1)
  • Borja Sanz (1)
  • Pablo G. Bringas (1)

  1. S3Lab, DeustoTech - Computing, Deusto Institute of Technology, University of Deusto, Bilbao, Spain
