Adaptive Semantics-Aware Malware Classification

Kolosnjaji, Bojan; Zarras, Apostolis; Lengyel, Tamas; Webster, George; Eckert, Claudia

doi:10.1007/978-3-319-40667-1_21

Bojan Kolosnjaji¹⁶,
Apostolis Zarras¹⁶,
Tamas Lengyel¹⁶,
George Webster¹⁶ &
…
Claudia Eckert¹⁶

Part of the book series: Lecture Notes in Computer Science ((LNSC,volume 9721))

Included in the following conference series:

International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment

2463 Accesses
11 Citations

Abstract

Automatic malware classification is an essential improvement over the widely-deployed detection procedures using manual signatures or heuristics. Although there exists an abundance of methods for collecting static and behavioral malware data, there is a lack of adequate tools for analysis based on these collected features. Machine learning is a statistical solution to the automatic classification of malware variants based on heterogeneous information gathered by investigating malware code and behavioral traces. However, the recent increase in variety of malware instances requires further development of effective and scalable automation for malware classification and analysis processes.

In this paper, we investigate the topic modeling approaches as semantics-aware solutions to the classification of malware based on logs from dynamic malware analysis. We combine results of static and dynamic analysis to increase the reliability of inferred class labels. We utilize a semi-supervised learning architecture to make use of unlabeled data in classification. Using a nonparametric machine learning approach to topic modeling we design and implement a scalable solution while maintaining advantages of semantics-aware analysis. The outcomes of our experiments reveal that our approach brings a new and improved solution to the reoccurring problems in malware classification and analysis.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

The Cuckoo Sandbox. https://www.cuckoosandbox.org/
VirusTotal. http://www.virustotal.com
Alvarez, V.M.: Yara. http://plusvic.github.io/yara/
Anderson, B., Storlie, C., Lane, T.: Improving malware classification: bridging the static/dynamic gap. In: Workshop on Security and Artificial Intelligence (AISec) (2012)
Google Scholar
Bailey, M., Oberheide, J., Andersen, J., Mao, Z.M., Jahanian, F., Nazario, J.: Automated classification and analysis of internet malware. In: Kruegel, C., Lippmann, R., Clark, A. (eds.) RAID 2007. LNCS, vol. 4637, pp. 178–197. Springer, Heidelberg (2007)
Chapter Google Scholar
Bayer, U., Comparetti, P.M., Hlauschek, C., Kruegel, C., Kirda, E.: Scalable, behavior-based malware clustering. In: ISOC Network and Distributed System Security Symposium (NDSS) (2009)
Google Scholar
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
MATH Google Scholar
Chau, D.H., Nachenberg, C., Wilhelm, J., Wright, A., Faloutsos, C.: Polonium: tera-scale graph mining and inference for malware detection. In: SIAM International Conference on Data Mining (SDM) (2011)
Google Scholar
Dahl, G.E., Stokes, J.W., Deng, L., Yu, D.: Large-scale malware classification using random projections and neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013)
Google Scholar
Dumais, S.T.: Latent semantic analysis. Ann. Rev. Inf. Sci. Technol. 38(1), 188–230 (2004)
Article Google Scholar
Dumitras, T., Shou, D.: Toward a standard benchmark for computer security research: the Worldwide Intelligence Network Environment (WINE). In: Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS) (2011)
Google Scholar
Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Kdd (1996)
Google Scholar
Garcia-Teodoro, P., Diaz-Verdejo, J., Maciá-Fernández, G., Vázquez, E.: Anomaly-based network intrusion detection: techniques, systems and challenges. Comput. Secur. 28(1), 18–28 (2009)
Article Google Scholar
Hanif, Z., Calhoun, T., Trost, J.: Binarypig: Scalable Static Binary Analysis Over Hadoop. Black Hat, USA (2013)
Google Scholar
Hanif, Z., Lengyel, T.K., Webster, G.D.: Internet-Scale File Analysis. Black Hat, USA (2015)
Google Scholar
Heller, K., Svore, K., Keromytis, A.D., Stolfo, S.: One class support vector machines for detecting anomalous windows registry accesses. In: Workshop on Data Mining for Computer Security (DMSEC) (2003)
Google Scholar
Jang, J., Brumley, D., Venkataraman, S.: Bitshred: feature hashing malware for scalable triage and semantic analysis. In: Conference on Computer and Communications Security (CCS) (2011)
Google Scholar
Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley, New York (2004)
Book MATH Google Scholar
Lengyel, T.K., Maresca, S., Payne, B.D., Webster, G.D., Vogl, S., Kiayias, A.: Scalability, fidelity and stealth in the Drakvuf dynamic malware analysis system. In: Annual Computer Security Applications Conference (ACSAC) (2014)
Google Scholar
Leung, K., Leckie, C.: Unsupervised anomaly detection in network intrusion detection using clusters. In: Australasian Conference on Computer Science (2005)
Google Scholar
Maxwell, K.: Maltrieve. https://github.com/krmaxwell/maltrieve
Newman, D., Chemudugunta, C., Smyth, P., Steyvers, M.: Analyzing entities and topics in news articles using statistical topic models. In: Mehrotra, S., Zeng, D.D., Chen, H., Thuraisingham, B., Wang, F.-Y. (eds.) ISI 2006. LNCS, vol. 3975, pp. 93–104. Springer, Heidelberg (2006)
Chapter Google Scholar
Perdisci, R., U, M.C.: VAMO: towards a fully automated malware clustering validity analysis. In: Annual Computer Security Applications Conference (ACSAC) (2012)
Google Scholar
Pfoh, J., Schneider, C., Eckert, C.: Leveraging string kernels for malware detection. In: Lopez, J., Huang, X., Sandhu, R. (eds.) NSS 2013. LNCS, vol. 7873, pp. 206–219. Springer, Heidelberg (2013)
Chapter Google Scholar
Ramage, D., Hall, D., Nallapati, R., Manning, C.D.: Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In: Conference on Empirical Methods in Natural Language Processing (2009)
Google Scholar
Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Workshop on New Challenges for NLP Frameworks (2010)
Google Scholar
Rieck, K., Holz, T., Willems, C., Düssel, P., Laskov, P.: Learning and classification of malware behavior. In: Zamboni, D. (ed.) DIMVA 2008. LNCS, vol. 5137, pp. 108–125. Springer, Heidelberg (2008)
Chapter Google Scholar
Roberts, J.-M.: Virus Share. https://virusshare.com/
Schultz, M.G., Eskin, E., Zadok, E., Stolfo, S.J.: Data mining methods for detection of new malicious executables. In: Symposium on Security and Privacy (2001)
Google Scholar
Stringhini, G., Egele, M., Zarras, A., Holz, T., Kruegel, C., Vigna, G.: B@bel: leveraging email delivery for spam mitigation. In: USENIX Security Symposium (2012)
Google Scholar
Tegeler, F., Fu, X., Vigna, G., Kruegel, C.: Botfinder: finding bots in network traffic without deep packet inspection. In: International Conference on Emerging Networking Experiments and Technologies (CoNEXT) (2012)
Google Scholar
Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101(476), 1566–1581 (2006)
Article MathSciNet MATH Google Scholar
The MITRE Corporation. CRITS. https://crits.github.io/
VirusTotal. File Statistics. https://www.virustotal.com/en/statistics/
Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1, 1–305 (2008)
Article MATH Google Scholar
Wang, C., Paisley, J.W., Blei, D.M.: Online variational inference for the hierarchical Dirichlet process. In: International Conference on Artificial Intelligence and Statistics (2011)
Google Scholar
Warrender, C., Forrest, S., Pearlmutter, B.: Detecting intrusions using system calls: alternative data models. In: Symposium on Security and Privacy (1999)
Google Scholar
Wicherski, G.: Pehash: a novel approach to fast malware clustering. In: USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET) (2009)
Google Scholar
Xiao, H., Eckert, C.: Efficient online sequence prediction with side information. In: IEEE International Conference on Data Mining (ICDM) (2013)
Google Scholar
Xiao, H., Stibor, T.: A supervised topic transition model for detecting malicious system call sequences. In: Workshop on Knowledge Discovery, Modeling and Simulation (2011)
Google Scholar
Zarras, A., Papadogiannakis, A., Gawlik, R., Holz, T.: Automated generation of models for fast and precise detection of HTTP-based malware. In: Annual Conference on Privacy, Security and Trust (PST) (2014)
Google Scholar
Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with local and global consistency. Adv. Neural Inf. Process. Syst. 16(16), 321–328 (2004)
Google Scholar

Download references

Acknowledgments

The research was supported by the German Federal Ministry of Education and Research under grant 16KIS0328 (IUNO) and by the Bavarian State Ministry of Education, Science and the Arts as part of the FORSEC research association.

Author information

Authors and Affiliations

Technical University of Munich, Munich, Germany
Bojan Kolosnjaji, Apostolis Zarras, Tamas Lengyel, George Webster & Claudia Eckert

Authors

Bojan Kolosnjaji
View author publications
You can also search for this author in PubMed Google Scholar
Apostolis Zarras
View author publications
You can also search for this author in PubMed Google Scholar
Tamas Lengyel
View author publications
You can also search for this author in PubMed Google Scholar
George Webster
View author publications
You can also search for this author in PubMed Google Scholar
Claudia Eckert
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Bojan Kolosnjaji .

Editor information

Editors and Affiliations

IMDEA Software Institute, Pozuelo de Alarcón, Madrid, Spain
Juan Caballero
Mondragon University, Arrasate, Guipúzcoa, Spain
Urko Zurutuza
Universidad de Zaragoza, Zaragoza, Spain
Ricardo J. Rodríguez

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Kolosnjaji, B., Zarras, A., Lengyel, T., Webster, G., Eckert, C. (2016). Adaptive Semantics-Aware Malware Classification. In: Caballero, J., Zurutuza, U., Rodríguez, R. (eds) Detection of Intrusions and Malware, and Vulnerability Assessment. DIMVA 2016. Lecture Notes in Computer Science(), vol 9721. Springer, Cham. https://doi.org/10.1007/978-3-319-40667-1_21

Download citation

DOI: https://doi.org/10.1007/978-3-319-40667-1_21
Published: 12 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40666-4
Online ISBN: 978-3-319-40667-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics