
Lexical Feature Based Feature Selection and Phishing URL Classification Using Machine Learning Techniques

  • Conference paper
Machine Learning, Image Processing, Network Security and Data Sciences (MIND 2020)

Abstract

Phishing is an illegitimate method of collecting confidential information from a person or organization. Attackers use phishing sites to steal information such as debit and credit card details, PINs, OTPs, and passwords. Researchers have used various techniques to detect such sites, but it is difficult to rely on any single technique because attackers keep devising new tactics. In this paper, URLs are classified as phishing or legitimate based on their lexical features. A feature selection technique is used to retain only the relevant features. To find the best possible combination, accuracy was evaluated for all combinations of features, varying the number of features each time. Performance is analyzed on different datasets with various parameters using four different machine learning techniques.
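The exhaustive search over feature combinations described in the abstract can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the abstract does not list the paper's lexical features, datasets, or classifiers, so the six features, the toy URLs, and the nearest-centroid classifier below are hypothetical stand-ins. With n candidate features the search scores 2^n - 1 non-empty subsets, which is only practical when n is small.

```python
from itertools import combinations
from urllib.parse import urlparse

# Hypothetical lexical features; the paper's actual feature set is not given in the abstract.
FEATURE_NAMES = ["has_at", "host_length", "num_digits", "num_dots", "num_hyphens", "url_length"]


def lexical_features(url):
    """Compute simple lexical features from the URL string alone (no page fetch)."""
    host = urlparse(url).netloc
    return {
        "has_at": int("@" in url),
        "host_length": len(host),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "url_length": len(url),
    }


def all_subsets(names):
    """Yield every non-empty combination of feature names, smallest subsets first."""
    for k in range(1, len(names) + 1):
        yield from combinations(names, k)


def centroid_accuracy(train, test, names):
    """Accuracy of a nearest-centroid classifier restricted to the given features.

    Stand-in classifier for illustration only; assumes both labels occur in train.
    """
    centroids = {}
    for label in (0, 1):
        rows = [feats for feats, y in train if y == label]
        centroids[label] = [sum(r[n] for r in rows) / len(rows) for n in names]
    correct = 0
    for feats, y in test:
        dist = {
            label: sum((feats[n] - c) ** 2 for n, c in zip(names, centroid))
            for label, centroid in centroids.items()
        }
        correct += int(min(dist, key=dist.get) == y)
    return correct / len(test)


def best_subset(train, test, names=FEATURE_NAMES):
    """Score every feature combination and return (accuracy, subset) of the best one."""
    best = (-1.0, ())
    for subset in all_subsets(names):
        acc = centroid_accuracy(train, test, subset)
        if acc > best[0]:
            best = (acc, subset)
    return best


# Made-up URLs for demonstration (label 1 = phishing, 0 = legitimate); not from the paper.
TRAIN_URLS = [
    ("http://secure-login.example-bank.com.verify123.test/update", 1),
    ("http://paypa1-account.test/confirm@session77", 1),
    ("https://example.org/about", 0),
    ("https://docs.example.com/guide", 0),
]
TEST_URLS = [
    ("http://signin-verify.account-update99.test/login", 1),
    ("https://example.net/contact", 0),
]
TRAIN = [(lexical_features(u), y) for u, y in TRAIN_URLS]
TEST = [(lexical_features(u), y) for u, y in TEST_URLS]

if __name__ == "__main__":
    accuracy, subset = best_subset(TRAIN, TEST)
    print(f"best accuracy {accuracy:.2f} with features {subset}")
```

Ties are resolved in favour of the first subset found, so smaller subsets win when accuracy is equal. A realistic evaluation would use cross-validation on a held-out split rather than a single toy test set.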

Author information

Corresponding author

Correspondence to Bireswar Banik.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Banik, B., Sarma, A. (2020). Lexical Feature Based Feature Selection and Phishing URL Classification Using Machine Learning Techniques. In: Bhattacharjee, A., Borgohain, S., Soni, B., Verma, G., Gao, XZ. (eds) Machine Learning, Image Processing, Network Security and Data Sciences. MIND 2020. Communications in Computer and Information Science, vol 1241. Springer, Singapore. https://doi.org/10.1007/978-981-15-6318-8_9

  • DOI: https://doi.org/10.1007/978-981-15-6318-8_9

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-6317-1

  • Online ISBN: 978-981-15-6318-8

  • eBook Packages: Computer Science, Computer Science (R0)
