MLSPD - Machine Learning Based Spam and Phishing Detection

  • Sanjay Kumar
  • Azfar Faizan
  • Ari Viinikainen
  • Timo Hamalainen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11280)


Spam emails have been a global menace since the rise of the Internet era. According to one estimate, around 50% of all emails are spam. Spam emails sent as part of a phishing scam can reach the masses with the aim of stealing information, committing identity theft, and performing other malicious actions. Previous studies showed that 91% of cyber attacks start with phishing emails, which contain Uniform Resource Locators (URLs). Although these URLs have several characteristics that distinguish them from ordinary website links, the human eye cannot easily notice them. Previous research also showed that traditional systems, such as blacklisting/whitelisting of IPs and spam filters, could not efficiently detect phishing and spam emails. Machine Learning (ML) approaches, however, have shown promising results in combating spamming and phishing attacks. To identify these threats, we used several ML algorithms to train spam and phishing detectors. The proposed framework is based on several linguistic and URL-based features. Our proposed model detects spam and phishing emails with accuracies of 89.2% and 97.7%, respectively.
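As a rough illustration of what "URL-based features" can look like, the following minimal Python sketch computes a handful of lexical properties of a URL. The function name `url_features` and the specific features chosen here are assumptions made for illustration; the abstract does not list the features actually used by the proposed framework.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Return a few illustrative lexical features for one URL.

    The feature set is hypothetical and is not taken from the paper.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A raw IP address in place of a domain name is a common warning sign.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "url_length": len(url),
        "host_is_ip": host_is_ip,
        "has_at_symbol": "@" in url,
        "num_dots_in_host": host.count("."),
        "has_hyphen_in_host": "-" in host,
        "uses_https": parsed.scheme == "https",
    }
```

A feature dictionary like this could be converted to a numeric vector and passed to any supervised classifier; which classifier and which features work best is an empirical question that the sketch does not address.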


Artificial Intelligence, Phishing, Spam emails, Supervised learning



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Sanjay Kumar (1)
  • Azfar Faizan (1)
  • Ari Viinikainen (1)
  • Timo Hamalainen (1)
  1. Faculty of Information Technology, University of Jyvaskyla, Jyvaskyla, Finland
