E-Mail Spam Filter Based on Unsupervised Neural Architectures and Thematic Categories: Design and Analysis

Cabrera-León, Ylermi; García Báez, Patricio; Suárez-Araujo, Carmen Paz

doi:10.1007/978-3-319-99283-9_12

Part of the book series: Studies in Computational Intelligence (SCI, volume 792)

Included in the following conference series:

International Joint Conference on Computational Intelligence

Abstract

Spam, unsolicited messages sent in bulk, is one of the threats that affect email and other media. Its sheer volume causes considerable losses of money and time. This chapter presents a solution: a hybrid anti-spam filter based on unsupervised Artificial Neural Networks (ANNs). It consists of two stages, preprocessing and processing, each based on a different computation model: programmed and neural (using a Kohonen Self-Organizing Map, SOM), respectively. The system was optimized on a dataset built with ham from the “Enron Email” corpus and spam from two different sources: traditional (a user’s inbox) and spamtrap-honeypot. Preprocessing relied on 13 thematic categories found in spam and ham, Term Frequency (TF) and three versions of Inverse Category Frequency (ICF). A total of 1260 system configurations were analyzed with the most widely used performance measures, and the optimal ones achieved \(\mathrm{AUC}>0.95\). These results were similar to those of other researchers on the same corpus, even though they used different Machine Learning (ML) methods and a number of attributes several orders of magnitude greater. The system was further tested on different datasets with heterogeneous origins, dates, users and types, including samples of image spam. In these new tests the filter obtained \(0.75<\mathrm{AUC}<0.96\). The degradation in performance can be explained by differences in the characteristics of the datasets, particularly their dates. This phenomenon, called “topic drift”, commonly affects all classifiers and, to a larger extent, those that use offline learning, as this system does, especially in adversarial ML problems such as spam filtering.
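To make the TF and ICF weighting mentioned in the abstract more concrete, here is a minimal sketch of one common formulation, in which ICF is the logarithm of the number of categories divided by the number of categories that contain the term. The abstract does not say which three ICF versions the authors used, so this formula, the function name `tf_icf`, the three toy categories and the sample message are all illustrative assumptions, not the chapter's actual method.

```python
import math
from collections import Counter


def tf_icf(tokens, category_vocab):
    """Weight each term of one message by TF * ICF (assumed formulation).

    tokens: list of words from a single, already tokenized e-mail.
    category_vocab: dict mapping a category name to the set of words seen in it.
    """
    n_categories = len(category_vocab)
    counts = Counter(tokens)
    weights = {}
    for term, count in counts.items():
        # cf: number of categories whose vocabulary contains the term
        cf = sum(1 for vocab in category_vocab.values() if term in vocab)
        if cf == 0:
            continue  # term appears in no known category, so it is skipped
        tf = count / len(tokens)            # relative term frequency
        icf = math.log(n_categories / cf)   # rarer across categories => larger
        weights[term] = tf * icf
    return weights


# Toy data (hypothetical): three categories instead of the chapter's 13.
categories = {
    "finance": {"loan", "credit", "money"},
    "health": {"pills", "diet", "money"},
    "work": {"meeting", "report"},
}
message = ["money", "loan", "loan", "meeting"]
weights = tf_icf(message, categories)
print(weights)
```

In this toy run, "loan" gets the largest weight because it is frequent in the message and specific to one category, while "money" is discounted for appearing in two categories.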

Notes

1. Each ROC curve was drawn with 102 specificity and sensitivity values, given by the same number of confidence thresholds, for enhanced ROC curve detail.
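As an illustration of how a ROC curve can be sampled at 102 confidence thresholds and summarized by its AUC, here is a minimal sketch. The note does not say how the thresholds were spaced, so the even spacing from 0 to 1.01 in steps of 0.01, the function names `roc_points` and `auc`, and the toy scores and labels are assumptions made for this example only.

```python
def roc_points(scores, labels, n_thresholds=102):
    """Return one (1 - specificity, sensitivity) pair per threshold.

    scores: spam confidence in [0, 1] for each message.
    labels: 1 for spam, 0 for ham; both classes must be present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for i in range(n_thresholds):
        # Assumed spacing: 0.00, 0.01, ..., 1.00, 1.01 when n_thresholds == 102
        threshold = i / (n_thresholds - 2)
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    pts = sorted(points)  # order by false positive rate, then sensitivity
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data (hypothetical): six messages, three spam and three ham.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
print(len(points), round(area, 4))
```

The lowest threshold classifies every message as spam and the highest classifies none, so the sampled curve always runs from (1, 1) down to (0, 0).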

Author information

Authors and Affiliations

Instituto Universitario de Ciencias y Tecnologías Cibernéticas, Universidad de Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
Ylermi Cabrera-León & Carmen Paz Suárez-Araujo
Departamento de Ingeniería Informática y de Sistemas, Universidad de La Laguna, San Cristóbal de La Laguna, Spain
Patricio García Báez

Authors

Ylermi Cabrera-León
Patricio García Báez
Carmen Paz Suárez-Araujo

Corresponding author

Correspondence to Carmen Paz Suárez-Araujo.

Editor information

Editors and Affiliations

Department of Computer Architecture and Computer Technology, Universidad de Granada, Granada, Spain
Juan Julian Merelo
ISEL-Instituto Politécnico de Lisboa, Lisboa, Portugal
Fernando Melício
Facultad de Informática, Department of Information and Communications Engineering, University of Murcia, Murcia, Spain
José M. Cadenas
University of Coimbra, Coimbra, Portugal
António Dourado
Université Paris-Est Créteil (UPEC), Créteil, France
Kurosh Madani
University of Algarve, Faro, Portugal
António Ruano
INSTICC, Setúbal, Portugal
Joaquim Filipe

Cite this paper

Cabrera-León, Y., García Báez, P., Suárez-Araujo, C.P. (2019). E-Mail Spam Filter Based on Unsupervised Neural Architectures and Thematic Categories: Design and Analysis. In: Merelo, J.J., et al. Computational Intelligence. IJCCI 2016. Studies in Computational Intelligence, vol 792. Springer, Cham. https://doi.org/10.1007/978-3-319-99283-9_12

DOI: https://doi.org/10.1007/978-3-319-99283-9_12
Published: 04 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99282-2
Online ISBN: 978-3-319-99283-9
eBook Packages: Intelligent Technologies and Robotics (R0)