Automatic malware classification and new malware detection using machine learning

Abstract

The explosive growth of malware variants poses a major threat to information security. Traditional signature-based anti-virus systems fail to classify unknown malware into the corresponding families and to detect new kinds of malware. We therefore propose a machine-learning-based malware analysis system composed of three modules: data processing, decision making, and new malware detection. The data processing module extracts malware features from gray-scale images, Opcode n-grams, and import functions. The decision-making module uses these features to classify the malware and to identify suspicious malware. Finally, the detection module uses the shared nearest neighbor (SNN) clustering algorithm to discover new malware families. Our approach is evaluated on more than 20 000 malware instances collected by Kingsoft, ESET NOD32, and Anubis. The results show that our system can effectively classify unknown malware with a best accuracy of 98.9% and successfully detect 86.7% of the new malware.
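As a rough illustration of the building blocks named in the abstract, the sketch below shows one plausible way to lay raw bytes out as gray-scale pixel rows, count Opcode n-grams, and compute a shared-nearest-neighbor similarity between samples. It is not the authors' implementation: the function names, the image width, the Euclidean distance, and the mutual-neighbor rule (borrowed from the classic Jarvis-Patrick formulation) are all assumptions made for the example.

```python
from collections import Counter
from math import dist


def bytes_to_gray_rows(data: bytes, width: int = 4) -> list[list[int]]:
    """Lay raw bytes out row by row; each byte is one pixel intensity (0-255).

    The width is a hypothetical choice; the last row is zero-padded.
    """
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def opcode_ngrams(opcodes: list[str], n: int = 2) -> Counter:
    """Count every run of n consecutive opcodes in a disassembled sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def knn_sets(points: list[tuple[float, ...]], k: int) -> list[set[int]]:
    """For each point, the indices of its k nearest neighbors (Euclidean, assumed)."""
    result = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        result.append(set(others[:k]))
    return result


def snn_similarity(neighbors: list[set[int]], i: int, j: int) -> int:
    """Number of neighbors shared by i and j.

    Following the Jarvis-Patrick rule, the similarity is 0 unless each point
    appears in the other's neighbor list.
    """
    if i not in neighbors[j] or j not in neighbors[i]:
        return 0
    return len(neighbors[i] & neighbors[j])


if __name__ == "__main__":
    print(bytes_to_gray_rows(bytes([0, 255, 16, 32, 7])))
    print(opcode_ngrams(["push", "mov", "push", "mov", "call"]))
    feats = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    nbrs = knn_sets(feats, k=2)
    print(snn_similarity(nbrs, 0, 1), snn_similarity(nbrs, 0, 3))
```

In a full SNN clustering pipeline, these pairwise similarities would feed a density estimate and a clustering step; samples that fall into no known family's cluster would then be candidates for a new family. That step is omitted here.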

Key words

Malware classification; Machine learning; n-gram; Gray-scale image; Feature extraction; Malware detection

CLC number

TP309.5 

Copyright information

© Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2017

Authors and Affiliations

  1. College of Computer, National University of Defense Technology, Changsha, China
