Abstract
Spam, or unsolicited messages sent in bulk, is one of the threats that affect email and other media. Its sheer volume causes considerable losses of money and time. A solution to this issue is presented: a hybrid anti-spam filter based on unsupervised Artificial Neural Networks (ANNs). It consists of two stages, preprocessing and processing, each based on a different computation model: programmed and neural (using Kohonen SOM), respectively. The system was optimized using a dataset built with ham from “Enron Email” and spam from two different sources: traditional (user’s inbox) and spamtrap-honeypot. The preprocessing was based on 13 thematic categories found in spam and ham, Term Frequency (TF) and three versions of Inverse Category Frequency (ICF). A total of 1260 system configurations were analyzed with the most widely used performance measures, and the optimal ones achieved \(\mathrm{AUC}>0.95\). Results were similar to those obtained by other researchers on the same corpus, even though they used different Machine Learning (ML) methods and a number of attributes several orders of magnitude greater. The system was further tested with different datasets, characterized by heterogeneous origins, dates, users and types, including samples of image spam. In these new tests the filter obtained \(0.75<\mathrm{AUC}<0.96\). The degradation in performance can be explained by differences in the characteristics of the datasets, particularly their dates. This phenomenon, called “topic drift”, commonly affects all classifiers and, to a larger extent, those that use offline learning, as this system does, especially in adversarial ML problems such as spam filtering.
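As a rough illustration of the TF and ICF weighting mentioned in the abstract, the sketch below weights each term by its relative frequency within a category, multiplied by the logarithm of the number of categories divided by the number of categories containing the term. This is only one common ICF formulation, chosen here as an assumption; the abstract does not specify the three ICF versions used. The function name `tf_icf` and the two toy categories are hypothetical.

```python
import math
from collections import Counter


def tf_icf(category_tokens):
    """Return {category: {term: weight}} using TF x log(|C| / cf(term)).

    category_tokens maps a category name to the list of tokens observed
    in that category. The ICF variant used here is an assumption, not
    necessarily one of the three versions evaluated in the chapter.
    """
    n_categories = len(category_tokens)

    # Raw term counts per category.
    counts = {cat: Counter(tokens) for cat, tokens in category_tokens.items()}

    # Category frequency: in how many categories each term appears.
    cf = Counter()
    for term_counts in counts.values():
        cf.update(term_counts.keys())

    weights = {}
    for cat, term_counts in counts.items():
        total = sum(term_counts.values())
        weights[cat] = {
            term: (count / total) * math.log(n_categories / cf[term])
            for term, count in term_counts.items()
        }
    return weights


# Hypothetical toy data: two thematic categories instead of 13.
example = tf_icf({
    "finance": ["loan", "money", "free"],
    "pharmacy": ["pills", "free"],
})
```

With this formulation, a term that appears in every category (here `free`) gets weight zero, while category-specific terms such as `loan` keep a positive weight.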
Notes
1. Each ROC curve was drawn with 102 specificity and sensitivity values, given by the same number of confidence thresholds, for enhanced ROC curve detail.
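A minimal sketch of how a ROC curve with 102 points could be built from confidence scores follows. The threshold layout is an assumption: 101 evenly spaced thresholds from 0.00 to 1.00 plus one threshold above 1, so that the curve reaches the (0, 0) corner. The note does not state how the thresholds were actually chosen, and the function names `roc_points` and `auc_trapezoid` are hypothetical.

```python
def roc_points(scores, labels, n_thresholds=102):
    """Return n_thresholds (1 - specificity, sensitivity) pairs.

    scores: spam confidence values in [0, 1]; labels: 1 = spam, 0 = ham.
    Thresholds run from 0 to slightly above 1 (assumed layout).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both spam and ham samples are required")

    points = []
    for i in range(n_thresholds):
        threshold = i / (n_thresholds - 2)  # 0.00, 0.01, ..., 1.01 for 102
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        sensitivity = tp / positives
        specificity = tn / negatives
        points.append((1.0 - specificity, sensitivity))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Hypothetical toy scores: two spam and two ham messages, perfectly separated.
curve = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
area = auc_trapezoid(curve)
```

More thresholds give a finer curve and a trapezoidal AUC estimate that depends less on where the thresholds fall.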
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Cabrera-León, Y., García Báez, P., Suárez-Araujo, C.P. (2019). E-Mail Spam Filter Based on Unsupervised Neural Architectures and Thematic Categories: Design and Analysis. In: Merelo, J.J., et al. Computational Intelligence. IJCCI 2016. Studies in Computational Intelligence, vol 792. Springer, Cham. https://doi.org/10.1007/978-3-319-99283-9_12
DOI: https://doi.org/10.1007/978-3-319-99283-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99282-2
Online ISBN: 978-3-319-99283-9
eBook Packages: Intelligent Technologies and Robotics (R0)