Learning Latent Byte-Level Feature Representation for Malware Detection

  • Mahmood Yousefi-Azar
  • Len Hamey
  • Vijay Varadharajan
  • Shiping Chen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11304)


This paper proposes two byte-level feature representations of binary files for malware detection. Both are static representations that operate on the raw file bytes, so they need no third-party tools and are independent of the operating system. Sparse term-frequency simhashing (s-tf-simhashing) is a faster variant of tf-simhashing: it requires less computation than the original dense tf-simhashing and outperforms it. The binary word2vec (Bword2vec) representation embeds the semantic relationships of the n-grams into the code vectors. Bword2vec maps binary content to a word2vec representation whose feature space has a lower dimension than that of s-tf-simhashing, which further reduces the computation required by the classifier. We show that the proposed techniques can be used successfully to analyze both full malware apps and infected files. The experiments are conducted on real Android and PDF malware datasets.
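The abstract does not spell out how s-tf-simhashing is computed, so the following is only a minimal sketch of the general idea it names: count byte n-grams in a raw file and project the term-frequency vector into a fixed-size space with a mostly-zero random matrix. The function names, the n-gram size, the output dimension `dim`, the sparsity parameter `s`, and the use of a hash-seeded generator to derive each projection row are all assumptions made for illustration, not details taken from the paper.

```python
import hashlib
import math
import random
from collections import Counter


def byte_ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Term frequencies of overlapping byte n-grams in a raw file."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def sparse_row(ngram: bytes, dim: int, s: int) -> dict:
    """Hypothetical helper: one sparse random-projection row per n-gram.

    Each of the `dim` entries is +1 or -1 with probability 1/(2s) each and
    0 otherwise. The row is derived from a hash of the n-gram, so it is
    reproducible without storing a projection matrix.
    """
    seed = int.from_bytes(hashlib.sha256(ngram).digest()[:8], "big")
    rng = random.Random(seed)
    row = {}
    for j in range(dim):
        u = rng.random()
        if u < 1.0 / (2 * s):
            row[j] = 1.0
        elif u < 1.0 / s:
            row[j] = -1.0
    return row


def sparse_tf_projection(data: bytes, n: int = 2, dim: int = 64, s: int = 8) -> list:
    """Project n-gram term frequencies into a `dim`-dimensional real vector."""
    vec = [0.0] * dim
    scale = math.sqrt(s)  # gives each nonzero-or-zero entry unit variance
    for ngram, tf in byte_ngram_counts(data, n).items():
        # Only the nonzero entries of the row are visited, which is where
        # the saving over a dense projection comes from.
        for j, sign in sparse_row(ngram, dim, s).items():
            vec[j] += scale * sign * tf
    return vec
```

Calling `sparse_tf_projection(open(path, "rb").read())` would give a fixed-length feature vector for any file, regardless of format or platform, which could then be passed to a classifier. With `s = 8`, roughly seven in eight projection entries are zero, so most multiply-adds of a dense projection are skipped.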


Malware detection · Binary-level feature representation · Sparse term-frequency simhashing · Binary Word2vec



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Mahmood Yousefi-Azar (1, 3)
  • Len Hamey (1)
  • Vijay Varadharajan (2)
  • Shiping Chen (3)
  1. Department of Computing, Faculty of Science and Engineering, Macquarie University, Sydney, Australia
  2. Faculty of Engineering and Built Environment, University of Newcastle, Newcastle, Australia
  3. Commonwealth Scientific and Industrial Research Organisation, CSIRO, Data61, Sydney, Australia
