Lexical Feature Based Feature Selection and Phishing URL Classification Using Machine Learning Techniques

Banik, Bireswar; Sarma, Abhijit

doi:10.1007/978-981-15-6318-8_9

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1241)

Included in the following conference series:

International Conference on Machine Learning, Image Processing, Network Security and Data Sciences

Abstract

Phishing is a fraudulent method of collecting confidential information from a person or organization. Attackers use phishing sites to steal information such as debit and credit card details, PINs, OTPs, and passwords. Researchers have used different techniques to detect such sites, but it is difficult to rely on any single technique because attackers keep adopting new tactics. In this paper, URLs are classified as phishing or legitimate based on their lexical features. A feature selection technique is used to retain only the relevant features. To find the best possible combination of features, accuracy was evaluated for all combinations of features, varying the number of features in each combination. Performance is analyzed on different datasets with various parameters using four different machine learning techniques.
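The combination search described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not list the lexical features, the classifiers, or the datasets, so the feature names in `FEATURES`, the nearest-centroid classifier, and the made-up URLs in `TRAIN` and `TEST` are hypothetical stand-ins rather than the authors' setup. The sketch only shows the general idea of scoring every feature subset of every size and keeping the best one per size.

```python
from itertools import combinations
from urllib.parse import urlparse

# Hypothetical lexical features (assumed for illustration; not the chapter's list).
FEATURES = {
    "url_length": lambda u: len(u),
    "num_dots": lambda u: u.count("."),
    "num_hyphens": lambda u: u.count("-"),
    "num_digits": lambda u: sum(c.isdigit() for c in u),
    "has_at": lambda u: int("@" in u),
    "path_depth": lambda u: urlparse(u).path.count("/"),
}

# Made-up URLs on reserved domains; label 1 = phishing, 0 = legitimate.
TRAIN = [
    ("http://secure-login.example.com.account-verify.test/update/confirm/id/8841", 1),
    ("http://account-help.example.test/signin/verify/0093", 1),
    ("https://www.example.com/about", 0),
    ("https://docs.example.org/guide", 0),
]
TEST = [
    ("http://bank-alerts.example.test/login/secure/7712/confirm", 1),
    ("https://www.example.org/contact", 0),
]


def extract(url, names):
    """Return the selected lexical features of a URL as a list of floats."""
    return [float(FEATURES[n](url)) for n in names]


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest_centroid_accuracy(train, test, names):
    """Fit one centroid per class on train and return accuracy on test.

    A stand-in classifier (assumption); features are left unscaled,
    so large-valued features dominate the distance.
    """
    by_label = {}
    for url, label in train:
        by_label.setdefault(label, []).append(extract(url, names))
    centroids = {label: centroid(rows) for label, rows in by_label.items()}

    correct = 0
    for url, label in test:
        x = extract(url, names)
        pred = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )
        correct += pred == label
    return correct / len(test)


def best_subsets(train, test, feature_names):
    """For each subset size k, return (accuracy, subset) of the best subset."""
    best = {}
    for k in range(1, len(feature_names) + 1):
        scored = [
            (nearest_centroid_accuracy(train, test, subset), subset)
            for subset in combinations(feature_names, k)
        ]
        best[k] = max(scored, key=lambda pair: pair[0])
    return best


if __name__ == "__main__":
    for k, (acc, subset) in best_subsets(TRAIN, TEST, list(FEATURES)).items():
        print(k, round(acc, 2), subset)
```

With n candidate features the search evaluates 2^n - 1 subsets, so an exhaustive search like this is only practical for a small feature set; a filter step that first discards irrelevant features keeps n manageable.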

Author information

Authors and Affiliations

Department of Computer Science, Gauhati University, Guwahati, India
Bireswar Banik & Abhijit Sarma

Corresponding author

Correspondence to Bireswar Banik.

Editor information

Editors and Affiliations

National Institute of Technology Silchar, Silchar, India
Arup Bhattacharjee
National Institute of Technology Silchar, Silchar, India
Samir Kr. Borgohain
National Institute of Technology Silchar, Silchar, India
Badal Soni
National Institute of Technology Kurukshetra, Kurukshetra, India
Gyanendra Verma
University of Eastern Finland, Kuopio, Finland
Xiao-Zhi Gao

About this paper

Cite this paper

Banik, B., Sarma, A. (2020). Lexical Feature Based Feature Selection and Phishing URL Classification Using Machine Learning Techniques. In: Bhattacharjee, A., Borgohain, S., Soni, B., Verma, G., Gao, XZ. (eds) Machine Learning, Image Processing, Network Security and Data Sciences. MIND 2020. Communications in Computer and Information Science, vol 1241. Springer, Singapore. https://doi.org/10.1007/978-981-15-6318-8_9

DOI: https://doi.org/10.1007/978-981-15-6318-8_9
Published: 15 June 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-6317-1
Online ISBN: 978-981-15-6318-8
eBook Packages: Computer Science, Computer Science (R0)
