A Better Understanding of Machine Learning Malware Misclassifcation

  • Conference paper
Information Systems Security and Privacy (ICISSP 2017)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 867)

Abstract

Machine learning-based malware detection systems have been widely proposed and used as a replacement for signature-based detection methods. Such systems have been shown to achieve a high detection rate on previously unseen malware samples. However, when malware is classified by its behavioural features, some new malware can go undetected, resulting in misclassification. Our aim is to better understand the underlying causes of malware misclassification, which will help in developing more robust malware detection systems. Towards this objective, this paper addresses several questions: Does misclassification increase over time? Do the changes that affect classification occur at the level of malware families, so that all instances belonging to certain families are hard to detect? Alternatively, can such changes be traced back to certain malware variants rather than families? And does misclassification increase when the distinct API functions used only by malware are removed, a technique that malware writers could use to evade detection? Our experiments showed that the changes in malware behaviour that affect classification occur mostly at the level of variants across malware families, where variants did not behave as expected. They also showed that machine learning-based systems can maintain a high detection rate even when malware attempts to evade detection by avoiding the distinct API functions that are used only by malware.
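
The last question in the abstract, whether detection degrades once malware avoids API functions that only malware uses, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the representation of a sample as a set of API names, the helper names `malware_only_apis` and `strip_apis`, and the toy traces. It is not the authors' pipeline; it only shows one way the "distinct" API functions might be identified and removed before a classifier is re-evaluated.

```python
from typing import Iterable, List, Set


def malware_only_apis(malware: Iterable[Set[str]],
                      benign: Iterable[Set[str]]) -> Set[str]:
    """Return API names seen in at least one malware sample and in no benign sample."""
    seen_in_malware = set().union(*malware)
    seen_in_benign = set().union(*benign)
    return seen_in_malware - seen_in_benign


def strip_apis(samples: Iterable[Set[str]], to_remove: Set[str]) -> List[Set[str]]:
    """Simulate evasion by dropping the given API names from every sample."""
    return [sample - to_remove for sample in samples]


# Hypothetical API traces, one set of called API names per sample.
malware_samples = [
    {"CreateFileW", "WriteProcessMemory", "CreateRemoteThread"},
    {"RegSetValueExW", "CreateFileW", "WriteProcessMemory"},
]
benign_samples = [
    {"CreateFileW", "ReadFile"},
    {"RegSetValueExW", "CreateFileW"},
]

# API functions that appear only in the malware traces.
distinct = malware_only_apis(malware_samples, benign_samples)

# Malware traces as they would look if those functions were never called.
evasive_malware = strip_apis(malware_samples, distinct)
```

A classifier trained on the original features could then be tested on `evasive_malware` to see how far the detection rate moves.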

Acknowledgments

We would like to express our sincere gratitude to Professor Peter Tino, Chair of Complex and Adaptive Systems at the University of Birmingham, for his valuable advice and suggestions. We would also like to thank VirusTotal for providing us with access to their intelligence service and to a private API.

Author information

Corresponding author

Correspondence to Nada Alruhaily.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Alruhaily, N., Chothia, T., Bordbar, B. (2018). A Better Understanding of Machine Learning Malware Misclassifcation. In: Mori, P., Furnell, S., Camp, O. (eds) Information Systems Security and Privacy. ICISSP 2017. Communications in Computer and Information Science, vol 867. Springer, Cham. https://doi.org/10.1007/978-3-319-93354-2_3

  • DOI: https://doi.org/10.1007/978-3-319-93354-2_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93353-5

  • Online ISBN: 978-3-319-93354-2

  • eBook Packages: Computer Science, Computer Science (R0)