Abstract
Malware growth was exponential in the last years, therefore it is a tedious work to manually analyze them in order to observe when a new strain appears. In this article we present a dynamic analysis system which clusters suspicious executable files in different malware families, based on the behavioral similarities their running processes exhibit thus reducing the workload of malware analysts. We identified similarities between our approach and the problem of text clustering based on topic, achieving similar results to text clustering without semantic analysis involved. We modeled the behavior of a process by extracting sequences of Windows API functions called by that process during its execution. We separated the registered API calls on three levels, based on their impact on the system, and dealt with them as text-like terms. More complex terms were constructed with N-grams and the features were represented with TF-IDF scores. We clustered the processes with variants of the k-means algorithm and derived a method for analyzing cluster characteristics in order to determine the best number of clusters to be considered. Finally, we identified the API level and N-gram lengths required to obtain relevant clusters.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics (2007)
AV-TEST: Number of malware throughout 2009–2018. https://www.av-test.org/en/statistics/malware/
Bayer, U., Comparetti, P.M., Hlauschek, C., Kruegel, C., Kirda, E.: Scalable, behavior-based malware clustering. In: NDSS, vol. 9, pp. 8–11. Citeseer (2009)
Bergeron, J., Debbabi, M., Desharnais, J., Erhioui, M.M., Lavoie, Y., Tawbi, N., et al.: Static detection of malicious code in executable programs. Int. J. Req. Eng. 2001(184–189), 79 (2001)
Buchta, C., Kober, M., Feinerer, I., Hornik, K.: Spherical k-means clustering. J. Stat. Softw. 50(10), 1–22 (2012)
Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat.-Theory Methods 3(1), 1–27 (1974)
Cavnar, W.B., Trenkle, J.M., et al.: N-gram-based text categorization. Ann arbor mi 48113(2), 161–175 (1994)
Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 2, 224–227 (1979)
Galkovsky, M.: Dlls the dynamic way. MSDN Library Website (1999)
Hassani, M., Seidl, T.: Using internal evaluation measures to validate the quality of diverse stream clustering algorithms. Vietnam J. Comput. Sci. 4(3), 171–183 (2017)
Huang, A.: Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC 2008), Christchurch, New Zealand, pp. 49–56 (2008)
Khabia, A., Chandak, M.: A cluster based approach with n-grams at word level for document classification. Int. J. Comput. Appl. 117(23), 38–42 (2015)
Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets, pp. 7–15. Cambridge University Press, Cambridge (2014)
Li, P., Liu, L., Gao, D., Reiter, M.K.: On challenges in evaluating malware clustering. In: Jha, S., Sommer, R., Kreibich, C. (eds.) RAID 2010. LNCS, vol. 6307, pp. 238–255. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15512-3_13
Malwarebytes: Cybercrime tactics and techniques: Q1 2018. https://www.malwarebytes.com/pdf/white-papers/CTNT-Q1-2018.pdf
Perdisci, R., et al.: VAMO: towards a fully automated malware clustering validity analysis. In: Proceedings of the 28th Annual Computer Security Applications Conference, pp. 329–338. ACM (2012)
Qiao, Y., He, J., Yang, Y., Ji, L.: Analyzing malware by abstracting the frequent itemsets in API call sequences. In: 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, pp. 265–270. IEEE (2013)
Ramos, J., et al.: Using TF-IDF to determine word relevance in document queries. In: Proceedings of the First Instructional Conference on Machine Learning, vol. 242, pp. 133–142 (2003)
Rieck, K., Trinius, P., Willems, C., Holz, T.: Automatic analysis of malware behavior using machine learning. J. Comput. Secur. 19(4), 639–668 (2011)
Rosenberg, A., Hirschberg, J.: V-measure: a conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (2007)
Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975)
Sculley, D.: Web-scale k-means clustering. In: Proceedings of the 19th International Conference on World Wide Web, pp. 1177–1178. ACM (2010)
Shankarapani, M.K., Ramamoorthy, S., Movva, R.S., Mukkamala, S.: Malware detection using assembly and API call sequences. J. Comput. Virol. 7(2), 107–119 (2011)
Willems, C., Holz, T., Freiling, F.: Toward automated dynamic malware analysis using cwsandbox. IEEE Secur. Privacy 5(2), 32–39 (2007)
Acknowledgment
This work was supported by a grant of the Romanian National Authority for Scientific Research and Innovation, CNCS/CCCDI-UEFISCDI, project number PN-III-P2-2.1-PED-2016-2073, within PNCDI III.
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Széles, G.J., Coleşa, A. (2019). Malware Clustering Based on Called API During Runtime. In: Fournaris, A., Lampropoulos, K., Marín Tordera, E. (eds) Information and Operational Technology Security Systems. IOSec 2018. Lecture Notes in Computer Science(), vol 11398. Springer, Cham. https://doi.org/10.1007/978-3-030-12085-6_10
Download citation
DOI: https://doi.org/10.1007/978-3-030-12085-6_10
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-12084-9
Online ISBN: 978-3-030-12085-6
eBook Packages: Computer ScienceComputer Science (R0)