Detecting Targeted Malicious E-Mail Using Linear Regression Algorithm with Data Mining Techniques
E-mail is one of the most fundamental means of communication, which makes it a frequent target of terrorists, spammers, impostors, business fraudsters, and hackers. To combat such attacks, different data mining classifiers are used to identify spam mails. This paper introduces a system that imports data from e-mail accounts and applies three preprocessing steps: file conversion into a format suitable for the experiments, word-frequency counting with the Knuth–Morris–Pratt (KMP) string-searching algorithm, and feature selection using principal component analysis (PCA). Next, linear regression classification is used to predict spam mails, and association rule mining is then performed. The mean absolute error and root mean squared error are computed for both the training data and the test data. The errors on both data sets are negligible, which indicates that the classifier is well trained. Finally, the results are presented using visualization techniques.
Keywords: Preprocessing, KMP, PCA, Linear regression, Association rule mining, Mean absolute error, Root mean squared error, Visualization
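Two of the steps named in the abstract, KMP-based word-frequency counting and the two error measures, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the sample text, and the label values are hypothetical, and the counting shown here matches substrings rather than whole words, which is an assumption about how frequency is measured.

```python
from math import sqrt


def build_failure_table(pattern):
    """For each prefix of pattern, store the length of its longest
    proper prefix that is also a suffix (the KMP failure table)."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_count(text, pattern):
    """Count occurrences of pattern in text (overlaps included)
    in O(len(text) + len(pattern)) time."""
    if not pattern:
        return 0
    table = build_failure_table(pattern)
    count = 0
    k = 0  # number of pattern characters matched so far
    for ch in text:
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            count += 1
            k = table[k - 1]  # keep scanning for further matches
    return count


def mean_absolute_error(actual, predicted):
    """Average absolute difference between labels and predictions."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical usage: frequency of one word in a lowercased message body,
# and error measures for made-up labels (1 = spam, 0 = not spam).
body = "free offer, free prize, freedom".lower()
word_frequency = kmp_count(body, "free")
mae = mean_absolute_error([1, 0, 1, 0], [0.5, 0.5, 1.0, 0.0])
rmse = root_mean_squared_error([1, 0, 1, 0], [0.5, 0.5, 1.0, 0.0])
```

Because the match is on substrings, "freedom" contributes to the count for "free"; a whole-word count would need tokenization first. Word frequencies of this kind would form the feature columns that PCA reduces before the regression step.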
The authors would like to thank the reviewers of this paper for reading the manuscript and giving valuable suggestions.