Everything Is in the Name – A URL Based Approach for Phishing Detection

  • Harshal Tupsamudre
  • Ajeet Kumar Singh
  • Sachin Lodha
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11527)


Phishing, in which a user is tricked into revealing sensitive information on a spoofed website, is one of the most common threats to cybersecurity. Most modern web browsers counter phishing attacks using a blacklist of confirmed phishing URLs. However, a major disadvantage of the blacklist method is that it is ineffective against newly generated phishes. Machine learning based techniques that rely on features extracted from the URL (e.g., URL length and bag-of-words) or from the web page (e.g., TF-IDF and form fields) are considered more effective at identifying new phishing attacks. The main benefit of URL based features over page based features is that the machine learning model can classify new URLs on the fly, even before the web browser loads the page, thus avoiding other potential dangers such as drive-by download attacks and cryptojacking attacks.
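As a rough illustration of the URL based features mentioned above, the following sketch computes a few numerical features and a bag-of-words token list by splitting on special characters. The function names, the particular numerical features beyond URL length, and the example URL are assumptions made for illustration; they are not taken from the paper.

```python
import re
from urllib.parse import urlparse


def special_char_tokens(url):
    """Lowercase the URL and split it on every non-alphanumeric character."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def numerical_features(url):
    """Return a few simple numerical features (hypothetical selection)."""
    host = urlparse(url).hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(c.isdigit() for c in url),
    }


# Hypothetical URL on a reserved example domain.
example = "http://login.example.com/account/verify?id=42"
tokens = special_char_tokens(example)
features = numerical_features(example)
print(tokens)
print(features)
```

Both functions need only the URL string, which is what allows a classifier built on such features to run before the page is fetched.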

In this work, we focus on improving the performance of URL based detection techniques. We show that, although a classifier trained on traditional bag-of-words features (tokenized using special characters) works well in many cases, it fails to recognize a very prevalent class of phishing URLs that combine a popular brand with one or more words. To overcome this weakness, we explore various alternative feature extraction techniques based on word segmentation and \(n\)-grams. We also construct and use a phishy-list of popular words that are highly indicative of phishing attacks. We verify the efficacy of each of these feature sets by training a logistic regression classifier on a large dataset consisting of 100,000 URLs. Our experimental results reveal that the feature set combining word segmentation, the phishy-list and numerical features (e.g., URL length) performs better than all other feature sets, as measured by misclassification and false negative rates.
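To make the difference between special-character tokenization and word segmentation concrete, here is a minimal sketch. It uses a toy dictionary-based dynamic program, not the segmenter used in the paper, and the vocabulary, the phishy-list contents, the cost values and the example URL are all hypothetical.

```python
import re

# Hypothetical toy vocabulary and phishy-list (not the paper's lists).
VOCAB = {"secure", "login", "pay", "bank", "update", "account", "example"}
PHISHY_WORDS = {"secure", "login", "update", "account"}


def segment(text, vocab=VOCAB):
    """Return a minimum-cost split of text into pieces.

    A vocabulary word costs 1; any other piece costs 10 per character,
    so splits that expose known words are preferred.
    """
    best = [(0, [])] + [None] * len(text)  # best[i] = (cost, pieces) for text[:i]
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            step = 1 if piece in vocab else 10 * len(piece)
            cost = best[start][0] + step
            if best[end] is None or cost < best[end][0]:
                best[end] = (cost, best[start][1] + [piece])
    return best[len(text)][1]


def char_ngrams(text, n=3):
    """Return all contiguous character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical URL that glues several words into one hostname label.
url = "http://secureloginexample.com/update"
plain_tokens = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]
words = [w for token in plain_tokens for w in segment(token)]
phishy_hits = [w for w in words if w in PHISHY_WORDS]
print(plain_tokens)
print(words)
print(phishy_hits)
```

Splitting on special characters leaves `secureloginexample` as a single opaque token, whereas segmentation exposes the individual words so they can be matched against a phishy-list or used as separate bag-of-words features.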


Phishing detection · Machine learning · Social engineering attacks



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Harshal Tupsamudre (1)
  • Ajeet Kumar Singh (1)
  • Sachin Lodha (1)
  1. TCS Research, Pune, India
