1 Introduction

Malware is a general term for software that intends to cause harm or inflict damage on computer systems. Recently, with the introduction of polymorphic and metamorphic techniques, malicious software has grown explosively in both quantity and sophistication. Malware classification has been an active research topic over the past decade; it is the problem of taking a malicious sample i and computing its family label j from a knowledge base. Hence, malware classification can associate a fresh variant with a known family, which is valuable for malware detection.

Research by Microsoft [11] indicates that the vast volume of data to be analyzed has become one of the major challenges for anti-malware companies. Traditional signature-based and behavior-based malware analysis techniques can hardly meet this demand, so a more effective method is needed [19]. Deep learning is a good candidate: with GPU acceleration, a model can be trained easily and detection becomes even more efficient. In addition, inspired by the excellent performance of convolutional neural networks in the image domain, we decided to explore a simpler model that converts malware classification tasks into image classification problems.

In fact, recent research on malware analysis, both static and dynamic, is moving from traditional approaches to deep learning. Ronen et al. [15] compare research papers that use the Microsoft malware classification challenge dataset (BIG2015). The results show that none of the 12 papers published in 2016 used deep learning, whereas 5 of the 17 papers in 2017 did. Although most of them still rely on solid domain knowledge, researchers are exploring end-to-end models that extract and fuse malware features automatically.

Nataraj et al. [13] propose the first malware embedding method, called NatarajImage, based on the binary file in 2011. Although NatarajImage is believed to be vulnerable to obfuscation and packing techniques, their work has been followed by many others [1, 7, 8, 18]. Andrew et al. [2] propose another malware embedding method, named AndrewImage, based on the disassembly file at the Black Hat 2015 conference. As far as we know, AndrewImage has not been studied in any research paper. Compared with NatarajImage, AndrewImage embeds instruction-level information, which gives it better robustness and interpretability. Unfortunately, AndrewImage uses so much zero padding that achieving high accuracy remains a challenge.

To summarize, this paper makes the following contributions:

1. We propose a novel image-based malware classification model, including a malware embedding method called YongImage and a simple deep neural network named malVecNet. YongImage directly embeds hexadecimal instructions and other metadata into a vector space. MalVecNet has better theoretical interpretability and can be trained more effectively. Our model converts malware analysis tasks into image classification problems that rely on neither domain knowledge nor time-consuming feature extraction.

2. We successfully train the model on the Microsoft malware classification challenge dataset; the results indicate that our model outperforms most of the related work. As far as we know, it is the state-of-the-art solution for large-scale malware classification tasks.

3. We release the code for malware embedding and for training the model described in this paper; it is available on GitHub (Footnote 1).

2 Malware Embedding

Formally, malware embedding maps malicious software into a vector space, which helps learning algorithms achieve better performance in malware analysis tasks. As with word embedding [12] in Natural Language Processing (NLP), this choice has several advantages: simplicity, effectiveness, and the observation that some image-based models trained on huge malware datasets outperform most traditional signature-based and behavior-based methods. The main difference is that malware embedding focuses more on how to choose the atomic units.

2.1 NatarajImage and AndrewImage

NatarajImage [13], as shown in Fig. 1, chooses the 8-bit bytes of the malware binary as atomic units and maps the whole Portable Executable (PE) file, or only the .text section, into a gray-scale image vector.

Fig. 1. NatarajImage embedding process

However, even without considering obfuscation and packing techniques, Ahmadi et al. [1] find that the texture patterns of NatarajImage for different malware families may be nearly identical. Therefore, models based on NatarajImage are vulnerable to attackers.

In contrast, AndrewImage [2], as shown in Fig. 2, chooses the hexadecimal instructions in the disassembly file as atomic units and embeds the malware instructions into a black-and-white image vector.

Fig. 2. AndrewImage embedding process

In fact, AndrewImage has an excellent semantic structure: one instruction per line, one row per vector. Unfortunately, this visual interpretability requires a large proportion of invalid zero padding, which makes the output image vector too large to train, so high accuracy remains a challenge.

2.2 YongImage

Inspired by previous work, we propose YongImage. As Fig. 3 shows, a PE malware disassembly file generated by IDA Pro (Footnote 2) contains two kinds of information: the hexadecimal instructions and the corresponding metadata, i.e. section name, instruction address, opcode and operands.

Fig. 3. YongImage embedding process

We embed this information with the following steps (a minimal sketch of the procedure is given below):

1. Encode the malware disassembly file with UTF-8 [20].
2. Obtain a gray vector by truncating each encoded value from the high-order bits down to 8 bits.
3. Reshape the gray vector to (m, 64).

Intuitively, the visual interpretability of YongImage is not as good as that of AndrewImage. In fact, YongImage retains instruction-level interpretability by reshaping the gray vector to (m, 64).
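To make these steps concrete, here is a minimal sketch of the embedding under our reading of the paper; the helper name embed_yongimage, the low-8-bit truncation and the zero padding of short files are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def embed_yongimage(asm_path, m=3200, width=64):
    """Hypothetical sketch of YongImage: decode the disassembly file,
    keep the low 8 bits of each code point as a gray value, reshape to (m, width)."""
    with open(asm_path, "r", encoding="utf-8", errors="ignore") as f:
        text = f.read()
    # Steps 1-2: encoded values truncated to 8 bits -> gray values in [0, 255]
    gray = np.array([ord(ch) & 0xFF for ch in text], dtype=np.uint8)
    # Step 3: truncate or zero-pad so the vector reshapes exactly to (m, width)
    gray = gray[: m * width]
    gray = np.pad(gray, (0, m * width - gray.size))
    return gray.reshape(m, width)
```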

Why (m, 64)? First, let us show that the optimal padding length in AndrewImage is \(L=64\).

The Intel 64 and IA-32 architectures encode instructions as subsets of the format shown in Fig. 4, which specifies a maximum instruction length of 15 bytes; in general, an instruction does not exceed 11 bytes [5]. For example, the hexadecimal instruction {8B 44 24 10} in Fig. 2 is only 4 bytes.

Fig. 4. Intel 64 and IA-32 architectures instruction format [5]

Therefore, the first idea is to pad the binary digits to 120 bits to cover all instructions, or to truncate them to an 88-bit vector to support most. In either case, however, the output vector is too large to train.

Fig. 5. The cumulative distribution functions of instruction length and instruction quantity in the BIG2015 dataset

By analyzing the samples in the Microsoft malware classification challenge dataset (BIG2015) [15], we obtain the cumulative distribution function (CDF) of instruction length, shown in Fig. 5(a), which indicates that \(99\%\) of instructions do not exceed 64 bits and \(82\%\) do not exceed 32 bits. Therefore, \(L=64\) is chosen to cover almost all instructions while keeping the vector size small. YongImage, by contrast, does not truncate each line vector to 64 bits; instead, we reshape the whole gray vector to (m, 64). Since a row width of 64 is almost sufficient for one line of the UTF-8-encoded disassembly file, m can be approximated as the number of instruction lines, i.e. the instruction quantity.

Another question for YongImage is how many lines of instructions should be embedded, i.e. how to choose m. Certainly, a larger m leads to better accuracy if training cost is ignored. However, once the model reaches a target accuracy, m should be as small as possible. Figure 5(b) shows that the instruction quantity per disassembly file varies greatly: \(50\%\) of files contain no more than 3200 instructions, and \(69\%\) contain no more than 6400. In our experiments, 3200 and 6400 are therefore the two candidate values of m.
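As a rough illustration of how these candidate values can be read off the data, the sketch below computes an empirical CDF over per-file instruction counts; the variable instruction_counts is a hypothetical list obtained by counting instruction lines in each disassembly file, not the authors' script.

```python
import numpy as np

def empirical_cdf_at(instruction_counts, thresholds=(3200, 6400)):
    """Fraction of files whose instruction count does not exceed each threshold."""
    counts = np.asarray(instruction_counts)
    return {t: float(np.mean(counts <= t)) for t in thresholds}

# On BIG2015 this should give roughly {3200: 0.50, 6400: 0.69}.
```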

3 Model Definition

Kim et al. [9] propose a simple yet novel convolutional neural network (CNN) architecture with little hyperparameter tuning and static vectors, which achieves excellent results. Inspired by their work, we propose a variant architecture named malVecNet, as shown in Fig. 6.

Fig. 6. Image-based malware classification model architecture

Malware embedding outputs an image vector of size (m, 64), where m is the number of instructions. Channel transformation then turns this vector into a new shape \((\frac{m}{k}, 64, k)\), in which k is the number of channels.
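Concretely, the channel transformation is just a reshape. The following is a minimal sketch, assuming that m is divisible by k, that k consecutive instruction rows are grouped together, and that a channels-last layout is used; none of these details are stated explicitly in the paper.

```python
import numpy as np

def to_channels(image, k):
    """Reshape a (m, 64) YongImage vector into (m // k, 64, k)."""
    m, width = image.shape
    assert m % k == 0, "m must be divisible by the channel count k"
    # Group k consecutive instruction rows, then move the group to the channel axis.
    return image.reshape(m // k, k, width).transpose(0, 2, 1)
```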

Next, we use a special value \(k=1\) to illustrate our malVecNet.

First, let \(S_j \in \mathbb {R}^{64}\) be the 64-dimensional instruction vector corresponding to the j-th of the m instructions. Then, a malware sample \(X_i\) is represented as:

$$\begin{aligned} {X_i} = [S_1, S_2, ..., S_j, ..., S_m]. \end{aligned}$$
(1)

Each convolution layer includes several filters \(w \in \mathbb {R}^{h \times c}\), each of which applies a window of size (h, c). For instance, \(c_{i,t}\) is the feature generated from the window \(X_{i:i+h}(t:t+c)\) by

$$\begin{aligned} c_{i,t} = w \cdot X_{i:i+h}(t:t+c) + b, \end{aligned}$$
(2)

where \(b \in \mathbb {R}\) is a bias term. When one row of the convolution is finished, we obtain a new row feature vector as follows:

$$\begin{aligned} c_i = [c_{i,1}, c_{i,2}, ..., c_{i, 64}], \end{aligned}$$
(3)

Similarly, when the whole convolution is completed, a new abstract instruction vector is generated, i.e.

$$\begin{aligned} c = [c_{1}, c_{2}, ..., c_{i}, ..., c_{m}]. \end{aligned}$$
(4)

Then, batch normalization (BN) [6] is applied to c. The details of BN on a mini-batch are given in Algorithm 1. Recent research has shown that BN smooths the objective function [16], which helps accelerate the training of deep neural networks.

Algorithm 1. Batch normalization on a mini-batch [6]
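Since Algorithm 1 is rendered as an image in the original, the following is a minimal NumPy sketch of batch normalization on a mini-batch as defined in [6]; the epsilon value and the initial gamma and beta are standard defaults, not values taken from the paper.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization over a mini-batch x of shape (batch, features)."""
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize each feature
    return gamma * x_hat + beta              # scale and shift (learnable in practice)
```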

After that, we apply a non-linear activation function f and max-pooling to \(e_{i}\), which reduces the dimension of the feature vector.

$$\begin{aligned} e = maxpooling(f([e_1, e_2, ..., e_i, ..., e_{m}])); e_i = BN_{\beta ,\gamma }(c_i) \end{aligned}$$
(5)

So far, we have described one CNN block of malVecNet, which stacks four such blocks. Therefore, the input to the global max-pooling layer is a normalized, activated and pooled feature vector, i.e.

$$\begin{aligned} e = [e_1, e_2,..., e_{g}]. \end{aligned}$$
(6)

where g is determined by the parameters of the preceding layers. To preserve the most important feature (the one with the highest value) while reducing the dimension of the final feature vector, we apply global max-pooling to e and take \(\hat{e} = max\left\{ e \right\} \) as the final instruction-level feature vector.

Finally, we use two fully-connected blocks and a softmax layer to get:

$$\begin{aligned} y = [y_1, y_2,..., y_{n}]. \end{aligned}$$
(7)

in which \(y_i\) is the predicted probability of malware family i and n is the number of malware families.

We construct the entire model following the idea of sentence-level classification. In theory, this design is better suited to instruction-level malware embedding methods such as AndrewImage and YongImage.
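The exact filter counts, kernel sizes and fully-connected widths are not specified in this section, so the following PyTorch-style sketch only illustrates the overall structure described above (four convolution blocks with BN, activation and max-pooling, global max-pooling, two fully-connected blocks and a softmax); all concrete layer sizes are placeholder assumptions.

```python
import torch.nn as nn

class MalVecNetSketch(nn.Module):
    """Structural sketch of malVecNet; filter counts and kernel sizes are assumptions."""
    def __init__(self, n_families=9, k=1):
        super().__init__()
        def block(c_in, c_out):
            # conv -> batch norm -> activation -> max-pooling, as in one CNN block
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
        self.features = nn.Sequential(block(k, 32), block(32, 64),
                                      block(64, 128), block(128, 256))
        self.global_pool = nn.AdaptiveMaxPool2d(1)      # global max-pooling
        self.classifier = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),             # two fully-connected blocks
            nn.Linear(128, n_families),                 # softmax is applied by the loss
        )

    def forward(self, x):                               # x: (batch, k, m // k, 64)
        z = self.global_pool(self.features(x)).flatten(1)
        return self.classifier(z)
```

With m = 3200 and k = 1, the input tensor has shape (batch, 1, 3200, 64).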

4 Experiments and Results

4.1 Dataset

BIG2015 [15] contains 21741 malware samples from 9 different families: Ramnit (F1), Lollipop (F2), Kelihos_ver3 (F3), Vundo (F4), Simda (F5), Tracur (F6), Kelihos_ver1 (F7), Obfuscator.ACY (F8) and Gatak (F9).

Fig. 7. The distribution of the dataset and the cross-validation results

Since only the 10868 training samples in BIG2015 are labeled, we use this portion as our experimental dataset. As shown in Fig. 7(a), the dataset is extremely unbalanced. Therefore, we combine the following two methods to eliminate the impact of this imbalance on the model. Firstly, F4, F5 and F7 are randomly up-sampled to 500 samples each. Secondly, loss-function weights inversely proportional to the class frequencies are applied to the input data [10].
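A minimal sketch of these two balancing steps as we read them; the helper names and the pure-NumPy implementation are our own illustrative choices, and the target of 500 samples comes from the text above.

```python
import numpy as np

def class_weights(y):
    """Loss weights inversely proportional to class frequency."""
    classes, counts = np.unique(y, return_counts=True)
    w = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), w.tolist()))

def upsample_family(X, y, family, target=500, seed=0):
    """Randomly up-sample one family (e.g. F4, F5 or F7) to `target` samples."""
    rng = np.random.default_rng(seed)
    idx = np.where(y == family)[0]
    if idx.size >= target:
        return X, y
    extra = rng.choice(idx, size=target - idx.size, replace=True)
    keep = np.concatenate([np.arange(y.size), extra])
    return X[keep], y[keep]
```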

4.2 Platform and Environment

We evaluate malVecNet in the platform environment presented in Table 1. More model details can be found in our GitHub repository (Footnote 3).

Table 1. The platform environment of the experiment

The loss function of the model is the cross entropy defined in Eq. (8),

$$\begin{aligned} loss = -\frac{1}{M} \sum _{i}^{M} \sum _{j}^{N} y_{ij} \log {p_{ij}}. \end{aligned}$$
(8)

where M is the number of samples in a mini-batch, N is the number of malware classes, \(y_{ij}\) is 1 if sample i belongs to class j and 0 otherwise, and \(p_{ij}\) is the predicted probability of sample i belonging to class j.
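For reference, Eq. (8) can be computed directly from the one-hot labels and the predicted probabilities; this short sketch adds a small constant inside the logarithm for numerical stability, which is our own assumption.

```python
import numpy as np

def cross_entropy(y_onehot, p, eps=1e-12):
    """Mean cross entropy over a mini-batch (Eq. 8).
    y_onehot, p: arrays of shape (M, N) with one-hot labels and predicted probabilities."""
    return float(-np.mean(np.sum(y_onehot * np.log(p + eps), axis=1)))
```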

At the same time, we choose accuracy, precision, recall and F1-score (Footnote 4) to evaluate the performance of our model.
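As a usage example, these four metrics can be computed with scikit-learn; the macro averaging is an assumption, since the paper does not state which averaging it reports.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Accuracy, precision, recall and F1-score for multi-class predictions."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```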

4.3 Hyperparameter

In this section, we discuss the hyperparameters of our model. The first is the instruction quantity m. The experimental results in Table 2 indicate that YongImage outperforms AndrewImage on all metrics regardless of m. In particular, when m reaches 6400, the accuracy of YongImage increases while that of AndrewImage decreases. One potential reason is that, as m increases, the large amount of invalid zero padding in AndrewImage introduces more interference.

Intuitively, a larger m covers more instruction information and should therefore achieve higher accuracy. In fact, when m increases to 6400, the accuracy of YongImage improves only slightly, while the training time increases by almost half. Therefore, our model uses \(m=3200\).

Table 2. The impact of instruction quantity m

The second hyperparameter is k, the initial number of channels. The original idea was to capture the correlation between instructions by stacking k instructions along the channel dimension, in the hope of higher model accuracy. However, the experimental results in Table 3 indicate that this design only helps accelerate model training. To achieve better accuracy, we finally choose \(k=1\).

Table 3. The impact of channel parameter k

We use tenfold cross-validation on BIG2015 to evaluate the above hyperparameters. In particular, the detailed results of YongImage with \(k=1\) and \(m=3200\) are shown in Fig. 7(b), where the average accuracy is \(99.49\%\) and the average training time is 1.70 h.
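A sketch of the tenfold cross-validation loop; StratifiedKFold and the train_and_score helper are illustrative choices, since the paper does not describe its exact splitting code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def tenfold_accuracy(X, y, train_and_score, seed=0):
    """Average accuracy over 10 stratified folds; `train_and_score` is a hypothetical
    callable that trains malVecNet on the training split and returns test accuracy."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = [train_and_score(X[tr], y[tr], X[te], y[te]) for tr, te in skf.split(X, y)]
    return float(np.mean(scores))
```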

4.4 Comparison with Other Work

In this part, we compare malVecNet with several methods that have performed well on BIG2015 in recent years. Certainly, due to differences in experimental platforms, the time metric should be treated only as a rough reference.

The results in Table 4 suggest that only the solutions of the Kaggle winner [17] (BIG2015 winner) and Ahmadi [1] are slightly more accurate than our malVecNet. However, both rely on time- and labor-consuming feature engineering, which is inefficient during both the training and detection phases. Hence, compared with them, malVecNet offers an orders-of-magnitude performance boost.

Garcia et al. [4] introduce a random forest based on NatarajImage to classify malware variants. Unfortunately, NatarajImage is believed to be vulnerable to obfuscation and packing techniques, so their solution is less robust.

Raff et al. [14] propose a novel distance metric, the Lempel-Ziv Jaccard Distance (LZJD), to classify malware from raw data, which achieves a larger performance improvement than the Normalized Compression Distance (NCD). However, models that combine distance metrics with clustering algorithms are easy to train but time-consuming at detection time.

Drew et al. [3] introduce the Strand gene sequence classifier to malware classification, which achieves \(98.89\%\) accuracy and requires only 0.75 h to train. Unfortunately, its detection time is still too long for large-scale malware.

Table 4. The comparison with other work

In fact, only the method of Yan et al. [18], which stacks a VGG network (based on NatarajImage) and an LSTM (based on opcodes), achieves performance similar to our malVecNet. However, our model still has clear advantages despite the differences in experimental platform. Firstly, since YongImage is the only input feature vector, preprocessing and training are relatively simple, and detection is faster. Secondly, benefiting from instruction-level embedding and sentence-level modeling, our solution is more robust and interpretable.

Therefore, as far as we know, our malVecNet is an advanced solution for large-scale malware classification tasks.

5 Conclusion

This research aims to explore a simple and practical model to convert malware classification tasks into image classification problems.

We propose a novel image-based malware classification model, including a malware embedding method called YongImage and a simple deep neural network named malVecNet. YongImage directly embeds hexadecimal instructions and other metadata into a vector space. MalVecNet has better theoretical interpretability and can be trained more effectively. Our model relies on neither solid domain knowledge nor time-consuming feature extraction.

We successfully train the model on the Microsoft malware classification challenge dataset; the results indicate that our model outperforms most of the related work. To the best of our knowledge, it is the state-of-the-art solution for large-scale malware classification tasks.