Abstract
Industry reports and blogs have estimated the amount of malware based on known malicious files. This paper extends this analysis to the amount of unknown malware. The study is based on 26.7 million files referenced in telemetry reports from 50 million computers running commercial anti-malware (AM) products. To estimate the undetected malware, a classifier predicts the underlying nature of unknown files recorded in the telemetry reports. The telemetry classifier predicts that 69.6% (4.27 million) of the unknown files are malicious. Assuming the unknown files predicted to be malicious by the classifier are malware, the telemetry classifier also allows us to estimate the efficacy of the AM system indicating that signatures detected 82.8% (20.6 million) of the malicious files. We have validated our system by conducting a longitudinal study to measure the false positive and false negative rates over a period of thirteen months.
Chapter PDF
Similar content being viewed by others
References
Andrew, G., Gao, J.: Scalable training of l1-regularized log-linear models. In: Proc. of the 24th International Conference on Machine Learning (ICML), Corvalis, OR, pp. 33–40. ACM, New York (2007)
Bayer, U., Habibi, I., Balzarotti, D., Kirda, E., Kruegel, C.: A view on current malware behaviors. In: Proc. of 2nd USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET), Boston, MA, USA (2009)
Bayer, U., Comparetti, P.M., Hlauschek, C., Kruegel, C., Kirda, E.: Scalable, behavior-based malware clustering. In: Proc. of the 16th Annual Network and Distributed System Security Symposium (NDSS), San Diego, CA (February 2009)
Bayer, U., Kruegel, C., Kirda, E.: TTAnalyze: A tool for analyzing malware. In: Proc. of 15th Annual Conference of the European Institute for Computer Antivirus Research, EICAR (2006)
Bishop, C.: Pattern Recognition and Machine Learning. Springer (2006)
Brumley, D., Jager, I., Avgerinos, T., Schwartz, E.J.: BAP: A Binary Analysis Platform. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 463–469. Springer, Heidelberg (2011)
Nachenberg, C., Seshadri, V., Ramzan, Z.: An analysis of real-world effectiveness of reputation-based security. In: Proc. of Virus Bulletin Conference, VB, pp. 178–183 (2010)
Chau, D.H., Nachenberg, C., Wilhelm, J., Wright, A., Faloutsos, C.: Polonium: Tera-scale graph mining and inference for malware detection. In: Proc. of SIAM International Conference on Data Mining, SDM (2011)
Christodorescu, M., Jha, S., Kruegel, C.: Mining specifications of malicious behavior. In: Proc. of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 5–14 (2007)
Edelman, B.: Adverse selection in online “trust” certifications. In: Fifth Workshop on the Economics of Information Security, pp. 26–28 (2006)
Freund, Y., Schapire, R.: Large margin classification using the perceptron algorithm. Machine Learning, 277–296 (1999)
Friedman, J.: Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232 (2001)
Group, A.P.W.: Phishing activity trends report, 3rd quarter 2009 (2010), http://www.antiphishing.org/reports/apwg_report_Q3_2009.pdf
Haber, J.: Smartscreen application reputation in ie9 (2011), http://blogs.msdn.com/b/ie/archive/2011/05/17/smartscreen-174-application-reputation-in-ie9.aspx
Hu, W., Liao, Y., Vemuri, V.R.: Robust support vector machines for anomaly detection. In: Proc. 2003 International Conference on Machine Learning and Applications (ICMLA), pp. 23–24 (2003)
Idika, N., Mathur, A.: A survey of malware detection techniques. Tech. rep., Purdue Univ. (February 2007), http://www.eecs.umich.edu/techreports/cse/2007/CSE-TR-530-07.pdf
Iseclab: Anubis, analyzing unknown binaries, http://anubis.iseclab.org
Jacob, G., Comparetti, P.M., Neugschwandtner, M., Kruegel, C., Vigna, G.: A static, packer-agnostic filter to detect similar malware samples. In: Conference on Detection of Intrusions and Malware & Vulnerability Assessment, DIMVA (2012)
Jang, J., Brumley, D., Venkataraman, S.: Bitshred: feature hashing malware for scalable triage and semantic analysis. In: Proc. of the 18th ACM Conference on Computer and Communications Security (CCS), pp. 309–320 (2011)
Jiang, X., Wang, X., Xu, D.: Stealthy malware detection through vmm-based ”out-of-the-box” semantic view reconstruction. In: Proc. of the ACM Conference on Computer and Communications Security (CCS), pp. 128–138 (2007)
Kirda, E., Kruegel, C., Banks, G., Vigna, G., Kemmerer, R.A.: Behavior based spyware detection. In: Proc. of the 15th USENIX Security Symposium, pp. 273–288 (2006)
Kolter, J., Maloof, M.: Learning to detect and classify malicious executables in the wild. Journal of Machine Learning Research (JMLR), 2721–2744 (2006)
Manning, C.D., Raghavan, P., Schütze, H.: An Introduction to Information Retrieval. Cambridge University Press (2009)
Microsoft: Microsoft security intelligence report (July-December 2010) (2011), http://www.microsoft.com/security/sir/default.aspx
Moser, A., Kruegel, C., Kirda, E.: Limits of static analysis for malware detection. In: Proc. of the 23rd Annual Computer Security Applications Conference (ACSAC), pp. 421–430 (2007)
Neugschwandtner, M., Comparetti, P.M., Jacob, G., Kruegel, C.: Forecast – skimming off the malware cream. In: 27th Annual Computer Security Applications Conference, ACSAC (2011)
Oberheide, J., Cooke, E., Jahanian, F.: Cloudav: N-version antivirus in the network cloud. In: Proc. of the 17th Conference on Security Symposium, pp. 91–106 (2008)
Perdisci, R., Lanzi, A., Lee, W.: Mcboost: Boosting scalability in malware collection and analysis using statistical classification of executables. In: Proc. of the 2008 Annual Computer Security Applications Conference (ACSAC), pp. 301–310 (2008)
Preda, M., Christodorescu, M., Jha, S., Debray, S.: A semantics-based approach to malware detection. In: Proc. of the 34th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 377–388 (2007)
Schultz, M., Eskin, E., Zadok, E., Stolfo, S.: Data mining methods of detection of new malicious executables. In: Proc. of the 2001 IEEE Symposium on Security and Privacy (SP), pp. 38–49. IEEE Press, New York (2001)
Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal estimated sub-gradient solver for svm. In: Proc. of the 24th International Conference on Machine Learning (ICML), Corvalis, OR, pp. 807–814. ACM, New York (2007)
Song, D., Brumley, D., Yin, H., Caballero, J., Jager, I., Kang, M.G., Liang, Z., Newsome, J., Poosankam, P., Saxena, P.: BitBlaze: A New Approach to Computer Security via Binary Analysis. In: Sekar, R., Pujari, A.K. (eds.) ICISS 2008. LNCS, vol. 5352, pp. 1–25. Springer, Heidelberg (2008)
Stolfo, S., Wang, K., Li, W.: Towards stealthy malware detection. In: Christodorescu, M., Jha, S., Maughan, D., Song, D., Wang, C. (eds.) Malware Detection. Springer (2007)
Wicherski, G.: pehash: A novel approach to fast malware clustering. In: USENIX Workshop Large-Scale Exploits and Emergent Threats, LEET (2009)
Zhang, B., Yin, J., Hao, J., Zhang, D., Wang, S.: Malicious Codes Detection Based on Ensemble Learning. In: Xiao, B., Yang, L.T., Ma, J., Muller-Schloer, C., Hua, Y. (eds.) ATC 2007. LNCS, vol. 4610, pp. 468–477. Springer, Heidelberg (2007)
Zhang, J., Jin, R., Yang, Y., Hauptmann, A.G.: Modified logistic regression: An approximation to svm and its applications in large-scale text categorization. In: Proc. of the 20th International Conference on Machine Learning (ICML), Menlo Park, pp. 888–895 (2003)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Stokes, J.W. et al. (2012). Scalable Telemetry Classification for Automated Malware Detection. In: Foresti, S., Yung, M., Martinelli, F. (eds) Computer Security – ESORICS 2012. ESORICS 2012. Lecture Notes in Computer Science, vol 7459. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33167-1_45
Download citation
DOI: https://doi.org/10.1007/978-3-642-33167-1_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33166-4
Online ISBN: 978-3-642-33167-1
eBook Packages: Computer ScienceComputer Science (R0)