Abstract
Malicious Uniform Resource Locators (URLs) are an important problem in web search and mining. Malicious URLs host unsolicited content (spam, phishing, drive-by downloads, etc.) and try to lure uneducated users into clicking on such links or downloading malware, which can result in the exfiltration of critical data. Traditional techniques for detecting such URLs rely on blacklists and rule-based methods. The main disadvantage of such methods is that they are not resistant to 0-day attacks, meaning that there will be at least one victim for each URL before it is added to the blacklist. Other techniques test URLs in a sandbox before they are clicked in the production or main environment. Such methods have two main drawbacks: the cost of sandboxing and the non-real-time response caused by the approval process in the test environment. In this paper, we propose a method that exploits semantic features in both domain names and URLs. The method is adaptive, meaning that the model can change dynamically based on new feedback received on 0-day attacks. We extract features from all sections of a URL separately. We then apply three machine learning methods to three different data sets. We provide an analysis of the most efficient value of N when applying N-grams to the domain names. The results show that Random Forest has the highest accuracy, over 96%, while also providing more interpretability as well as performance benefits.
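As a rough illustration of the feature extraction described in the abstract, the sketch below splits a URL into its sections and counts N-grams over the domain name. It is a minimal sketch under stated assumptions, not the implementation used in the paper: the helper names `split_url` and `char_ngrams`, the example URL, and the choice of character-level (rather than word-level) N-grams with N = 3 are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split a URL into sections from which features could be drawn separately.

    Hypothetical helper: the paper does not specify its exact sectioning.
    """
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "domain": parts.hostname or "",  # hostname is None if the URL has no host
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams of length n in text.

    Assumption: character-level n-grams; returns an empty Counter
    when the text is shorter than n.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Illustrative URL only, not taken from the paper's data sets.
sections = split_url("http://login.example.com/verify/account?id=42")
domain_trigrams = char_ngrams(sections["domain"], 3)
```

The resulting counts could be turned into a fixed-length feature vector and passed to a classifier such as Random Forest; repeating the extraction for several values of N is one way to compare which N works best for the domain names.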
Acknowledgments
This work was supported in part by FCT/MCTES (UNINOVA-CTS funding UID/EEA/00066/2019), UIDB/00066/2020 (CTS – Center of Technology and Systems), and the FCT/MCTES project CESME - Collaborative and Evolvable Smart Manufacturing Ecosystem, funding PRDC/EEI-AUT/32410/2017.
Copyright information
© 2020 IFIP International Federation for Information Processing
About this paper
Cite this paper
Ghalati, N.F., Ghalaty, N.F., Barata, J. (2020). Towards the Detection of Malicious URL and Domain Names Using Machine Learning. In: Camarinha-Matos, L., Farhadi, N., Lopes, F., Pereira, H. (eds) Technological Innovation for Life Improvement. DoCEIS 2020. IFIP Advances in Information and Communication Technology, vol 577. Springer, Cham. https://doi.org/10.1007/978-3-030-45124-0_10
DOI: https://doi.org/10.1007/978-3-030-45124-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45123-3
Online ISBN: 978-3-030-45124-0
eBook Packages: Computer Science, Computer Science (R0)