Reviewer Integration and Performance Measurement for Malware Detection

  • Brad Miller
  • Alex Kantchelian
  • Michael Carl Tschantz
  • Sadia Afroz
  • Rekha Bachwani
  • Riyaz Faizullabhoy
  • Ling Huang
  • Vaishaal Shankar
  • Tony Wu
  • George Yiu
  • Anthony D. Joseph
  • J. D. Tygar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9721)


We present and evaluate a large-scale malware detection system integrating machine learning with expert reviewers, treating reviewers as a limited labeling resource. We demonstrate that even in small numbers, reviewers can vastly improve the system’s ability to keep pace with evolving threats. We conduct our evaluation on a sample of VirusTotal submissions spanning 2.5 years and containing 1.1 million binaries with 778 GB of raw feature data. Without reviewer assistance, we achieve 72% detection at a 0.5% false positive rate, performing comparably to the best vendors on VirusTotal. Given a budget of 80 accurate reviews daily, we improve detection to 89% and are able to detect 42% of malicious binaries undetected upon initial submission to VirusTotal. Additionally, we identify a previously unnoticed temporal inconsistency in the labeling of training datasets. We compare the impact of training labels obtained at the same time training data is first seen with training labels obtained months later. We find that using training labels obtained well after samples appear, and thus unavailable in practice for current training data, inflates measured detection by almost 20 percentage points. We release our cluster-based implementation, as well as a list of all hashes in our evaluation and 3% of our entire dataset.
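The two quantities the abstract reports, detection at a fixed false positive rate and a fixed daily review budget, can be illustrated with a minimal Python sketch. This is not the released implementation: the function names are hypothetical, classifier scores are assumed to be plain floats where higher means more likely malicious, and picking the samples closest to the decision threshold for review is an assumed query strategy that the abstract does not specify.

```python
from typing import List, Sequence


def threshold_at_fpr(benign_scores: Sequence[float], max_fpr: float) -> float:
    """Return a threshold such that the fraction of benign scores
    strictly above it does not exceed max_fpr."""
    ordered = sorted(benign_scores, reverse=True)
    # Number of benign samples we may flag without exceeding max_fpr.
    allowed = int(max_fpr * len(ordered))
    if allowed >= len(ordered):
        return float("-inf")
    # Everything scoring strictly above this value is flagged.
    return ordered[allowed]


def detection_rate(malicious_scores: Sequence[float], threshold: float) -> float:
    """Fraction of malicious samples scoring strictly above the threshold."""
    flagged = sum(1 for s in malicious_scores if s > threshold)
    return flagged / len(malicious_scores)


def select_for_review(scores: Sequence[float], threshold: float,
                      budget: int) -> List[int]:
    """Indices of up to `budget` samples closest to the threshold
    (assumed strategy: review the least certain predictions first)."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - threshold))
    return ranked[:budget]


if __name__ == "__main__":
    benign = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
    malicious = [0.95, 0.8, 0.65, 0.5]
    t = threshold_at_fpr(benign, max_fpr=0.25)
    print("threshold:", t)
    print("detection rate:", detection_rate(malicious, t))
    print("review queue:", select_for_review(benign + malicious, t, budget=3))
```

In the setting the abstract describes, `max_fpr` would be 0.005 and `budget` would be 80 per day; the toy values above only keep the example readable.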



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Brad Miller (1)
  • Alex Kantchelian (2)
  • Michael Carl Tschantz (3)
  • Sadia Afroz (3)
  • Rekha Bachwani (4)
  • Riyaz Faizullabhoy (2)
  • Ling Huang (5)
  • Vaishaal Shankar (2)
  • Tony Wu (2)
  • George Yiu (6)
  • Anthony D. Joseph (2)
  • J. D. Tygar (2)

  1. Google Inc., Mountain View, USA
  2. UC Berkeley, Berkeley, USA
  3. International Computer Science Institute, Berkeley, USA
  4. Netflix, Los Gatos, USA
  5. DataVisor, Mountain View, USA
  6. Pinterest, San Francisco, USA
