
E-Mail Spam Filter Based on Unsupervised Neural Architectures and Thematic Categories: Design and Analysis

  • Conference paper
  • In: Computational Intelligence (IJCCI 2016)

Part of the book series: Studies in Computational Intelligence (SCI, volume 792)

Abstract

Spam, or unsolicited messages sent in bulk, is one of the threats that affect email and other media. Its sheer volume causes considerable economic and time losses. A solution to this issue is presented: a hybrid anti-spam filter based on unsupervised Artificial Neural Networks (ANNs). It consists of two steps, preprocessing and processing, each based on a different computation model: programmed and neural (using a Kohonen SOM), respectively. The system was optimized using a dataset built with ham from “Enron Email” and spam from two different sources: traditional (user’s inbox) and spamtrap-honeypot. The preprocessing was based on 13 thematic categories found in spam and ham, Term Frequency (TF) and three versions of Inverse Category Frequency (ICF). A total of 1260 system configurations were analyzed with the most widely used performance measures, and the optimal ones achieved \(\mathrm{AUC}>0.95\). Results were similar to those of other researchers on the same corpus, although they used different Machine Learning (ML) methods and a number of attributes several orders of magnitude greater. The system was further tested with different datasets, characterized by heterogeneous origins, dates, users and types, including samples of image spam. In these new tests the filter obtained \(0.75<\mathrm{AUC}<0.96\). The degradation in performance can be explained by differences in the characteristics of the datasets, particularly their dates. This phenomenon is called “topic drift”; it commonly affects all classifiers and, to a larger extent, those that use offline learning, as is the case here, especially in adversarial ML problems such as spam filtering.
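As a rough illustration of the TF and ICF weighting mentioned in the abstract, the sketch below computes the term frequency within each thematic category and scales it by one common textbook form of inverse category frequency, \(\mathrm{ICF}(t)=\log(|C|/\mathrm{cf}(t))\), where \(\mathrm{cf}(t)\) is the number of categories containing term \(t\). The abstract does not define the three ICF versions that were used, so this formula, the toy categories and all function names are assumptions and not the authors' implementation.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Relative frequency of each term within one list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def inverse_category_frequency(category_tokens):
    """ICF(t) = log(|C| / cf(t)), with cf(t) = number of categories containing t.

    Assumed textbook variant; the paper's three ICF versions are not given here.
    """
    n_categories = len(category_tokens)
    cf = Counter()
    for tokens in category_tokens.values():
        cf.update(set(tokens))  # count each term at most once per category
    return {term: math.log(n_categories / n) for term, n in cf.items()}


def tf_icf(category_tokens):
    """TF x ICF weight of every term, per category."""
    icf = inverse_category_frequency(category_tokens)
    return {
        category: {t: tf * icf[t] for t, tf in term_frequencies(tokens).items()}
        for category, tokens in category_tokens.items()
    }


# Hypothetical toy categories (the paper uses 13 thematic categories).
toy = {
    "finance": ["loan", "money", "free", "money"],
    "pharmacy": ["pills", "free", "cheap"],
    "work": ["meeting", "report", "money"],
}
weights = tf_icf(toy)
```

Under this variant, a term that occurs in every category receives weight zero, so only terms that discriminate between categories contribute to the resulting feature vector.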

Notes

  1. Each ROC curve was drawn with 102 specificity and sensitivity values, given by the same number of confidence thresholds, for enhanced ROC curve detail.
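The note above can be illustrated with a minimal sketch that evaluates specificity and sensitivity at 102 confidence thresholds and integrates the resulting ROC curve with the trapezoidal rule. The even spacing of thresholds over [0, 1], the function names and the toy scores are assumptions made for illustration; the source does not say how the thresholds were chosen.

```python
def roc_points(scores, labels, n_thresholds=102):
    """Return one (specificity, sensitivity) pair per confidence threshold.

    scores: spam confidence in [0, 1]; labels: 1 = spam, 0 = ham.
    Thresholds are evenly spaced over [0, 1] (an assumption).
    Both classes must be present, otherwise a division by zero occurs.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for i in range(n_thresholds):
        thr = i / (n_thresholds - 1)
        # A message is flagged as spam when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < thr)
        points.append((tn / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under sensitivity versus (1 - specificity)."""
    curve = sorted((1.0 - spec, sens) for spec, sens in points)
    curve = [(0.0, 0.0)] + curve + [(1.0, 1.0)]  # anchor both ends
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy data: six messages, three spam and three ham.
scores = [0.95, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
```

With more thresholds than distinct scores, as in this toy case, every step of the curve is resolved and the trapezoidal area matches the rank-based AUC; with coarser thresholds the curve is only an approximation.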

Author information

Corresponding author

Correspondence to Carmen Paz Suárez-Araujo.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Cabrera-León, Y., García Báez, P., Suárez-Araujo, C.P. (2019). E-Mail Spam Filter Based on Unsupervised Neural Architectures and Thematic Categories: Design and Analysis. In: Merelo, J.J., et al. Computational Intelligence. IJCCI 2016. Studies in Computational Intelligence, vol 792. Springer, Cham. https://doi.org/10.1007/978-3-319-99283-9_12
