Towards the Detection of Malicious URL and Domain Names Using Machine Learning

  • Conference paper
Technological Innovation for Life Improvement (DoCEIS 2020)

Abstract

Detecting malicious Uniform Resource Locators (URLs) is an important problem in web search and mining. Malicious URLs host unsolicited content (spam, phishing, drive-by downloads, etc.) and try to lure inexperienced users into clicking such links or downloading malware, which can result in the exfiltration of critical data. Traditional techniques for detecting such URLs rely on blacklists and rule-based methods. The main disadvantage of such methods is that they are not resistant to 0-day attacks, meaning that there will be at least one victim for each URL before the blacklist is updated. Other techniques use a sandbox to test URLs before they are opened in the production or main environment. Such methods have two main drawbacks: the cost of the sandboxing, and the non-real-time response caused by the approval process in the test environment. In this paper, we propose a method that exploits semantic features in both domains and URLs. The method is adaptive, meaning that the model can change dynamically based on new feedback received on 0-day attacks. We extract features from all sections of a URL separately. We then apply three machine learning methods to three different sets of data. We provide an analysis of the features and of the most efficient value of N when applying N-grams to the domain names. The results show that Random Forest has the highest accuracy, over 96%, and at the same time provides more interpretability as well as performance benefits.
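As an illustration of the two steps named in the abstract (extracting features from each section of a URL separately, and applying N-grams to the domain name), a minimal Python sketch follows. It is not the authors' implementation: the abstract does not list the actual features or the chosen value of N, so the function names (`split_url`, `char_ngrams`, `url_features`), the length and dot-count features, and the default `n=3` are all assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split a URL into sections that features could be drawn from."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "domain": parts.hostname or "",  # hostname is None when absent
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character N-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def url_features(url: str, n: int = 3) -> dict:
    """Build a flat feature dictionary for one URL (hypothetical feature set)."""
    sections = split_url(url)
    # Per-section length features, computed for each section separately.
    features = {f"len_{name}": len(value) for name, value in sections.items()}
    features["domain_dots"] = sections["domain"].count(".")
    # Character N-gram counts taken from the domain name only.
    for gram, count in char_ngrams(sections["domain"], n).items():
        features[f"ngram_{gram}"] = count
    return features
```

Dictionaries of this shape could then be vectorized and passed to a classifier such as a Random Forest; varying `n` and comparing accuracy would be one way to look for the most efficient value of N.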

Acknowledgments

This work was supported in part by the FCT/MCTES (UNINOVA-CTS funding UID/EEA/00066/2019), UIDB/00066/2020 (CTS – Center of Technology and Systems), and the FCT/MCTES project CESME - Collaborative and Evolvable Smart Manufacturing Ecosystem, funding PRDC/EEI-AUT/32410/2017.

Author information

Corresponding author

Correspondence to Nastaran Farhadi Ghalati.

Copyright information

© 2020 IFIP International Federation for Information Processing

About this paper

Cite this paper

Ghalati, N.F., Ghalaty, N.F., Barata, J. (2020). Towards the Detection of Malicious URL and Domain Names Using Machine Learning. In: Camarinha-Matos, L., Farhadi, N., Lopes, F., Pereira, H. (eds) Technological Innovation for Life Improvement. DoCEIS 2020. IFIP Advances in Information and Communication Technology, vol 577. Springer, Cham. https://doi.org/10.1007/978-3-030-45124-0_10

  • DOI: https://doi.org/10.1007/978-3-030-45124-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-45123-3

  • Online ISBN: 978-3-030-45124-0

  • eBook Packages: Computer Science, Computer Science (R0)
