Discovering Similarities in Malware Behaviors by Clustering of API Call Sequences

  • Fatima Al Shamsi
  • Wei Lee Woon
  • Zeyar Aung
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11304)


New genres of malware evade detection by using polymorphism, obfuscation, and encryption techniques. Hence, new strategies are needed to overcome the limitations of current malware analysis practices. In this paper, we propose an unsupervised learning (clustering) framework to complement the supervised learning (i.e., classifier-based malware detection) approach. We cluster malware instances to discover similarities in their dynamic behaviors and to detect new malware families. To this end, we use Application Programming Interface (API) call sequences to represent the behaviors of malware in a dynamic runtime environment. We investigate three sequence comparison algorithms, namely Optimal Matching (OM), Longest Common Subsequence (LCS), and Longest Common Prefix (LCP), for calculating the sequence-to-sequence distances used in hierarchical clustering. Among the three, LCP is found to be both the most effective in terms of clustering quality and the most efficient in terms of time complexity (linear time).
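
To make the LCP-based pipeline concrete, the following is a minimal Python sketch rather than the authors' implementation. It assumes the common definition of the LCP distance, d(x, y) = |x| + |y| - 2·LCP(x, y), which the abstract does not spell out. It also assumes average linkage for the agglomerative step. The function names and the four toy API traces are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def lcp_length(a, b):
    """Length of the longest common prefix of two sequences (linear time)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lcp_distance(a, b):
    """Assumed LCP distance: |a| + |b| - 2 * LCP(a, b)."""
    return len(a) + len(b) - 2 * lcp_length(a, b)


def agglomerative(seqs, k):
    """Average-linkage agglomerative clustering, stopping at k clusters.

    Returns a list of clusters, each a list of indices into seqs.
    """
    # Pairwise distances, stored once per unordered pair (i < j).
    dist = {
        (i, j): lcp_distance(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }

    def d(i, j):
        return dist[(i, j) if i < j else (j, i)]

    def linkage(c1, c2):
        # Average distance over all cross-cluster pairs.
        return sum(d(i, j) for i in c1 for j in c2) / (len(c1) * len(c2))

    clusters = [[i] for i in range(len(seqs))]
    while len(clusters) > k:
        # Merge the two closest clusters; a < b, so deleting b is safe.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical API call traces (illustrative only, not from the paper's data).
traces = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["CreateFileW", "WriteFile", "RegSetValueExW"],
    ["RegOpenKeyExW", "RegSetValueExW", "RegCloseKey"],
    ["RegOpenKeyExW", "RegSetValueExW", "CreateProcessW"],
]

print(agglomerative(traces, 2))
```

In this toy input, traces that share their first two calls end up in the same cluster. The prefix comparison stops at the first mismatch, which is what makes each pairwise distance linear in the sequence length.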


Keywords: Malware; API calls; Clustering; Malware patterns



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Abu Dhabi Systems and Information Centre, Abu Dhabi, UAE
  2. Khalifa University of Science and Technology, Masdar Institute, Abu Dhabi, UAE
