Automated Construction of Malware Families

  • Krishnendu Ghosh
  • Jeffery Mills
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11611)


Discovering malware families from the behavioral characteristics of a set of malware traces is an important step in malware detection. Malware in the wild often occurs as variants of one another. In this work, a data-dependent formalism is described for constructing malware families from trace data. The malware families are represented as an edge-labeled graph in which each node represents a malware trace and each edge describes the relationship between two traces. Each edge label is a numerical value representing the similarity between the two malware traces. Network-theoretical concepts such as hubs are evaluated on the edge-labeled graph. The formalism is demonstrated through experiments performed on multiple data sets of malware traces.


Keywords: Malware families; Malware traces; Kullback-Leibler divergence; Discrete-time Markov chain; Algorithm; Network theory
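
The abstract does not spell out the construction, so the following is only a minimal sketch of one way the pieces it names could fit together. It assumes that each trace is a sequence of event names, that each trace is modeled as a discrete-time Markov chain with add-one smoothing, and that the edge label is a symmetrized Kullback-Leibler divergence rate between two chains (lower values mean more similar traces). The function names, the threshold, the toy traces, and the use of node degree as a simple stand-in for hub analysis are all illustrative assumptions, not the authors' method.

```python
import math
from itertools import combinations


def transition_matrix(trace, alphabet, alpha=1.0):
    """Estimate a DTMC transition matrix from one trace (add-alpha smoothing)."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    n = len(alphabet)
    counts = [[alpha] * n for _ in range(n)]
    for src, dst in zip(trace, trace[1:]):
        counts[index[src]][index[dst]] += 1
    return [[c / sum(row) for c in row] for row in counts]


def stationary(P, iters=200):
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def kl_rate(P, Q):
    """KL divergence rate between two chains with strictly positive entries."""
    pi = stationary(P)
    n = len(P)
    return sum(
        pi[i] * P[i][j] * math.log(P[i][j] / Q[i][j])
        for i in range(n)
        for j in range(n)
    )


def build_family_graph(traces, threshold):
    """Return {(u, v): label} for trace pairs whose divergence is small enough."""
    alphabet = sorted({s for t in traces.values() for s in t})
    chains = {name: transition_matrix(t, alphabet) for name, t in traces.items()}
    edges = {}
    for u, v in combinations(sorted(chains), 2):
        # Symmetrize, since the KL divergence rate itself is not symmetric.
        d = 0.5 * (kl_rate(chains[u], chains[v]) + kl_rate(chains[v], chains[u]))
        if d <= threshold:
            edges[(u, v)] = d
    return edges


def node_degrees(edges, nodes):
    """Count incident edges per node; high-degree nodes are hub candidates."""
    degree = {name: 0 for name in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


# Hypothetical toy traces: t1 and t2 are near-variants, t3 behaves differently.
traces = {
    "t1": ["open", "read", "write", "close"] * 5,
    "t2": ["open", "read", "write", "close"] * 4 + ["open", "read", "close"],
    "t3": ["connect", "send", "recv", "send", "recv", "close"] * 3,
}
edges = build_family_graph(traces, threshold=0.1)
degree = node_degrees(edges, traces)
print(edges)
print(degree)
```

In this sketch the connected components of the resulting graph play the role of candidate families. The smoothing keeps every transition probability positive, so the divergence rate is always finite; a different smoothing constant or threshold would change which edges survive.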



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science, College of Charleston, Charleston, USA
  2. Department of Computer Science, Northern Kentucky University, Highland Heights, USA
