1 Introduction

Insider threat is becoming a serious security challenge for many organizations. It is generally defined as malicious action performed by an insider in a secure environment, often causing system sabotage, electronic fraud, or information theft. Hence, it is potentially harmful to individuals, organizations, and state security. Recently, insider threat detection has attracted considerable attention in both the academic and industrial communities.

Insider threat detection is an extremely complex and challenging task, for the following reasons. First, insiders perform unauthorized actions using their trusted access, so external network security devices (intrusion detection systems, firewalls, and anti-virus software) cannot detect them. Second, insider attacks manifest in various forms, such as a disgruntled employee planting a logic bomb to disrupt systems or stealing intellectual property for personal gain; this diversity increases the complexity of insider threat detection. Last but not least, insider attacks are often carried out during working hours, so an insider's anomalous behaviors are scattered among large amounts of normal working behavior, which further increases the difficulty of detection.

The key to insider threat detection is modeling a user's normal behavior in order to detect anomalous behavior. Much work has been proposed to address this issue [1, 2]. These methods aggregate all the actions of a user in one day to represent the user's behavior on that day. However, anomalous behavior happening within one day may then be missed; for example, a user logs on to his assigned computer after hours and uploads data to wikileaks.org. We argue that using per-user action sequences is very important in detecting insider threat.

To address this problem, we propose a novel insider threat detection method that determines whether user behavior is normal or anomalous. Directly using the LSTM to classify user action sequences is not effective, because the output of the LSTM contains only a single bit of information for each sequence. Instead, we use a trained LSTM to predict the next user action and use the series of hidden states of the LSTM to generate a fixed-size feature matrix that is given to the CNN classifier. The LSTM can capture long-term temporal dependencies in user action sequences, because its hidden units potentially record temporal behavior patterns.

To summarize, in this paper, we make the following contributions:

  (1) We present a novel insider threat detection method with LSTM and CNN based on user behavior.

  (2) We use the LSTM to learn the language of user behavior through user actions and to extract abstracted temporal features, which are the input of the CNN classifier.

  (3) Experimental results on a public dataset of insider threats show that our proposal can successfully detect insider threats, achieving AUC = 0.9449 in the best case.

The rest of this paper is organized as follows. We summarize related work in Sect. 2 and give a detailed description of our insider threat detection method in Sect. 3. Implementation details and experimental results are presented in Sect. 4. Finally, we conclude in Sect. 5.

2 Related Work

Related work falls into two main categories: insider threat detection and deep neural networks.

Insider Threat Detection:

The problem of insider threat detection is usually framed as an anomaly detection task. A comprehensive and structured overview of anomaly detection techniques was provided by Chandola et al. [3], who defined anomaly detection as the problem of finding patterns in data that do not conform to expected behavior. The key problem of anomaly detection is how to model a user's normal behavior profile, and a large body of work has been devoted to it, especially using machine learning.

Early work on anomaly detection based on user commands was proposed by Davison and Hirsh [4] and Lane and Brodley [5]. These approaches examine user command sequences and compute the degree to which the current command pattern matches historical command patterns in order to classify user behavior as normal or anomalous.

After that, anomaly detection began to take advantage of machine learning techniques, such as Naive Bayes [6], the Eigen Co-occurrence Matrix (ECM) [7], the One-Class Support Vector Machine (OC-SVM) [8], and Hidden Markov Models [9]. Schonlau et al. [17] compared the performance of six masquerade-detection algorithms on a data set of “truncated” UNIX shell commands for 70 users; the experimental results revealed that no single method completely dominated any other. Maxion and Townsend [6] applied the Naive Bayes classifier to the same data set [17], inspired by Naive Bayes text classification, and provided a thorough and detailed investigation of the classifier's classification errors in [18]. Oka et al. [7] argued that the causal relationships embedded in sequences of events should be considered when modeling a user's profile; they developed a layered-networks approach based on the ECM and extracted the causal relationships embedded in command sequences to enrich the user behavior model. Salem et al. [19] evaluated the detection accuracy of the nine methods mentioned above using the Schonlau dataset, but the results revealed that their detection rates were not high. Szymanski and Zhang [8] used an OC-SVM classifier for insider threat detection; however, the approach required mixing user data, which is hard to implement in a real-world setting. Rashid et al. [9] proposed an approach to insider threat detection using Hidden Markov Models (HMMs): an HMM models a user's normal behavior via user actions, and deviations from that normal behavior are regarded as anomalous. The effectiveness of the method is highly affected by the number of states, but the computational cost of the HMM increases as the number of states increases.

The works mentioned above use machine learning techniques to build a classifier. On the one hand, machine learning requires “feature engineering”, which is time-consuming and difficult. On the other hand, these classifiers are too simple, resulting in low detection rates.

Deep Neural Network:

Recently, deep neural networks, which can automatically learn powerful features, have led to new ideas for anomaly detection. Tang et al. [10] applied the deep learning methodology to build an anomaly detection system, but the experimental results in the testing phase were not good enough. Veeramachaneni et al. [11] used a neural network auto-encoder to detect insider threat. They aggregated a number of numeric features over a time window and fed these features to an ensemble of anomaly detection methods: Principal Component Analysis, neural networks, and a probabilistic model. However, individual user activity was not explicitly modeled over time. Tuor et al. [2] proposed a deep learning approach to detect anomalous network activity from system logs. They trained Recurrent Neural Networks (RNNs) to recognize the characteristic behavior of each user on a network and concurrently assessed whether user behavior is normal or anomalous. However, because this method aggregates features over one day for each user, it can miss anomalous behavior happening within a single day. In contrast, our model is trained on user action sequences with a DNN. The actions a user takes over a period of time on a system can be modeled as a sequence; the action sequences of a user's normal behavior occur frequently and regularly, and observed action sequences that deviate from these normal sequences are regarded as anomalous behavior. Therefore, our model can detect anomalous behavior through user actions, even anomalous behavior happening within one day.

3 Proposed Method

In this section, we present the details of our insider threat detection method. The proposal applies DNNs in two stages. The first stage extracts the abstracted temporal features of user behavior with the LSTM and outputs feature vectors, which are then transformed into fixed-size feature matrices. In the second stage, these fixed-size feature matrices are fed to the CNN, which classifies them as normal or anomalous.

3.1 Overview

An overview of our insider threat detection method is shown in Fig. 1. An individual action (e.g., logging on to an assigned computer after hours) represents one operation of a user, and the actions taken by a user in one day represent the user's behavior. Analogously to natural language modeling, an action corresponds to a word and an action sequence corresponds to a sentence. For that reason, we attempt to learn the language of user behavior as a new method for detecting insider threat. The LSTM is used to extract the features of user behavior, and the CNN uses these features to find anomalous behavior.

Fig. 1. Overview of the proposed method

Let \( {\text{U}} = \left\{ {u_{1} ,u_{2} , \cdots ,u_{K} } \right\} \) be the set of K users. For a user \( u_{k} \left( {1 \le k \le K} \right) \), we can obtain his action sequences over \( J \) days, \( {\mathbf{S}} = \left[ {{\mathbf{s}}_{{u_{k,1} }} ,{\mathbf{s}}_{{u_{k,2} }} , \cdots ,{\mathbf{s}}_{{u_{k,J} }} } \right] \), where \( {\mathbf{s}}_{{u_{k,j} }} \left( {1 \le j \le J} \right) \) is a vector denoting the action sequence on the day indexed by \( j \). In the training phase, we first obtain the action sequence \( {\mathbf{s}}_{{u_{k,j} }} \) that user \( u_{k} \) has performed within the day indexed by \( j \). Second, the action sequence \( {\mathbf{s}}_{{u_{k,j} }} \) is fed into the LSTM, which is trained as a feature extractor whose deep layer yields the abstracted feature vectors. Third, the feature vectors are transformed into a fixed-size matrix \( {\mathbf{M}}^{{u_{k,j} }} \), which potentially contains various abstracted temporal features that represent user behavior. Finally, we use these fixed-size matrices, annotated as normal or anomalous, to train the CNN. In the testing phase, we evaluate the approach with the trained LSTM and the trained CNN. The details of each step are described in the following subsections.

3.2 Training LSTM for Feature Extraction

Based on the user action sequences, we construct a feature extractor which can automatically extract abstracted temporal features from each input action sequence. The LSTM consists of an input layer, an embedding layer, three LSTM layers, and an output layer. The flow of the LSTM is shown in Fig. 2.

Fig. 2. Flow of LSTM training

For user \( u_{k} \) on the day indexed by \( j \), let \( {\text{T}} \) be the length of the action sequence, \( {\mathbf{s}}_{{u_{k,j} }} = \left[ {{\mathbf{x}}_{1}^{{u_{k,j} }} ,{\mathbf{x}}_{2}^{{u_{k,j} }} , \cdots ,{\mathbf{x}}_{T}^{{u_{k,j} }} } \right] \). \( {\mathbf{x}}_{t}^{{u_{k,j} }} \left( {1 \,\le\, t \,\le\, T} \right) \) represents an individual action at time instance \( t \). \( {\mathbf{h}}_{l, t}^{{u_{k,j} }} \left( {0\, \le\, l \,\le\, 3,1 \,\le\, t\, \le\, T} \right) \) denotes the hidden state of hidden layer \( l \) at time instance \( t \). \( {\mathbf{y}}_{t}^{{u_{k,j} }} \left( {1 \,\le \,t \,\le\, T} \right) \) denotes the output at time instance \( t \). Here we use one-hot encoding to embed the input \( {\mathbf{x}}_{t}^{{u_{k,j} }} \) as a vector \( {\mathbf{e}}_{t}^{{u_{k,j} }} \left( {1 \,\le\, t \,\le\, T} \right) \). The one-hot encoding is performed as follows:

  1. Creating a dictionary that associates actions with IDs; for example, logging on to an assigned PC after hours is denoted as 1, logging off an assigned PC after hours as 2, etc.

  2. Converting each action to a one-hot vector, which is 1 at the action's ID position and 0 elsewhere (a minimal sketch follows this list).
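
As a minimal sketch of this encoding (the action names and 0-based IDs below are illustrative assumptions; the actual 64-action vocabulary is enumerated in Table 1):

```python
import numpy as np

# A minimal sketch of the action dictionary. The paper uses 64 actions
# (Table 1); the action names here are illustrative only, and IDs are
# 0-based for indexing convenience.
action_to_id = {
    "logon_pc_after_hours": 0,
    "logoff_pc_after_hours": 1,
    "connect_usb_after_hours": 2,
    # ... the remaining actions of the 64-action vocabulary
}
VOCAB_SIZE = 64

def one_hot(action: str) -> np.ndarray:
    """Return a vector that is 1 at the action's ID position, 0 elsewhere."""
    vec = np.zeros(VOCAB_SIZE, dtype=np.float32)
    vec[action_to_id[action]] = 1.0
    return vec

# An action sequence for one user-day becomes a T x VOCAB_SIZE matrix.
sequence = ["logon_pc_after_hours", "connect_usb_after_hours",
            "logoff_pc_after_hours"]
E = np.stack([one_hot(a) for a in sequence])  # shape (3, 64)
```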

The LSTM with three hidden layers (\( l = 1,2,3 \)) is described by the following equations:

$$ {\mathbf{i}}_{l,t}^{{u_{k,j} }} = \sigma \left( {{\mathbf{W}}_{l}^{{\left( {i,x} \right)}} {\mathbf{h}}_{l - 1,t}^{{u_{k,j} }} + {\mathbf{W}}_{l}^{{\left( {i,h} \right)}} {\mathbf{h}}_{l,t - 1}^{{u_{k,j} }} + {\mathbf{b}}_{l}^{i} } \right) $$
(1)
$$ {\mathbf{f}}_{l,t}^{{u_{k,j} }} = \sigma \left( {{\mathbf{W}}_{l}^{{\left( {f,x} \right)}} {\mathbf{h}}_{l - 1,t}^{{u_{k,j} }} + {\mathbf{W}}_{l}^{{\left( {f,h} \right)}} {\mathbf{h}}_{l,t - 1}^{{u_{k,j} }} + {\mathbf{b}}_{l}^{f} } \right) $$
(2)
$$ {\mathbf{o}}_{l,t}^{{u_{k,j} }} = \sigma \left( {{\mathbf{W}}_{l}^{{\left( {o,x} \right)}} {\mathbf{h}}_{l - 1,t}^{{u_{k,j} }} + {\mathbf{W}}_{l}^{{\left( {o,h} \right)}} {\mathbf{h}}_{l,t - 1}^{{u_{k,j} }} + {\mathbf{b}}_{l}^{o} } \right) $$
(3)
$$ {\mathbf{g}}_{l,t}^{{u_{k,j} }} = \tanh \left( {{\mathbf{W}}_{l}^{{\left( {g,x} \right)}} {\mathbf{h}}_{l - 1,t}^{{u_{k,j} }} + {\mathbf{W}}_{l}^{{\left( {g,h} \right)}} {\mathbf{h}}_{l,t - 1}^{{u_{k,j} }} + {\mathbf{b}}_{l}^{g} } \right) $$
(4)
$$ {\mathbf{c}}_{l,t}^{{u_{k,j} }} = {\mathbf{f}}_{l,t}^{{u_{k,j} }} \odot {\mathbf{c}}_{l,t - 1}^{{u_{k,j} }} + {\mathbf{i}}_{l,t}^{{u_{k,j} }} \odot {\mathbf{g}}_{l,t}^{{u_{k,j} }} $$
(5)
$$ {\mathbf{h}}_{l,t}^{{u_{k,j} }} = {\mathbf{o}}_{l,t}^{{u_{k,j} }} \odot \tanh \left( {{\mathbf{c}}_{l,t}^{{u_{k,j} }} } \right) $$
(6)

where \( {\mathbf{h}}_{0,t}^{{u_{k,j} }} = {\mathbf{e}}_{t}^{{u_{k,j} }} \), and \( {\mathbf{c}}_{l,0}^{{u_{k,j} }} \), \( {\mathbf{h}}_{l,0}^{{u_{k,j} }} \) are set to the zero vector for all \( 1 \,\le\, l \,\le\, 3 \). \( \sigma \left( \cdot \right) \) is the sigmoid function and \( \odot \) denotes element-wise multiplication. Vector \( {\mathbf{g}}_{l,t}^{{u_{k,j} }} \) is a hidden representation, vector \( {\mathbf{i}}_{l,t}^{{u_{k,j} }} \) decides which values to update, vector \( {\mathbf{f}}_{l,t}^{{u_{k,j} }} \) decides what to forget, and vector \( {\mathbf{o}}_{l,t}^{{u_{k,j} }} \) decides what to output. The 24 weight matrices (\( {\mathbf{W}} \)) and 12 bias vectors (\( {\mathbf{b}} \)) are the learned parameters.
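
As a concrete reading of Eqs. (1)-(6), the following sketch computes one time step of one LSTM layer; the dictionary-of-matrices parameterization is an illustrative convention of ours, not the paper's notation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_below, h_prev, c_prev, W, b):
    """One time step of one LSTM layer, following Eqs. (1)-(6).

    h_below -- hidden state of the layer below at time t (h_{l-1,t});
               for the first layer this is the one-hot embedding e_t
    h_prev  -- this layer's hidden state at time t-1 (h_{l,t-1})
    c_prev  -- this layer's cell state at time t-1 (c_{l,t-1})
    W, b    -- dicts holding this layer's eight weight matrices and
               four bias vectors
    """
    i = sigmoid(W["ix"] @ h_below + W["ih"] @ h_prev + b["i"])  # Eq. (1): input gate
    f = sigmoid(W["fx"] @ h_below + W["fh"] @ h_prev + b["f"])  # Eq. (2): forget gate
    o = sigmoid(W["ox"] @ h_below + W["oh"] @ h_prev + b["o"])  # Eq. (3): output gate
    g = np.tanh(W["gx"] @ h_below + W["gh"] @ h_prev + b["g"])  # Eq. (4): candidate
    c = f * c_prev + i * g                                      # Eq. (5): cell state
    h = o * np.tanh(c)                                          # Eq. (6): hidden state
    return h, c
```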

The LSTM is trained repeatedly on user action sequences. First, we take an input series of user \( u_{k} \) as a vector \( {\mathbf{A}}^{{u_{k,j} }} = \left[ {{\mathbf{x}}_{1}^{{u_{k,j} }} ,{\mathbf{x}}_{2}^{{u_{k,j} }} , \cdots ,{\mathbf{x}}_{T}^{{u_{k,j} }} } \right] \). Second, the embedding layer converts the series of actions \( {\mathbf{A}}^{{u_{k,j} }} \) to one-hot vectors \( {\mathbf{E}}^{{u_{k,j} }} = \left[ {{\mathbf{e}}_{1}^{{u_{k,j} }} ,{\mathbf{e}}_{2}^{{u_{k,j} }} , \cdots ,{\mathbf{e}}_{T}^{{u_{k,j} }} } \right] \). Third, we sequentially input each one-hot vector \( {\mathbf{e}}_{t}^{{u_{k,j} }} \) to the LSTM, which outputs the prediction \( {\mathbf{y}}_{t}^{{u_{k,j} }} \). Finally, we calculate the cross-entropy loss by comparing the prediction \( {\mathbf{y}}_{t}^{{u_{k,j} }} \) with the answer \( {\mathbf{x}}_{t + 1}^{{u_{k,j} }} \), i.e., the next action in the sequence.

In the training phase, we apply Dropout [12] to the LSTM to reduce overfitting; the dropout operator is applied only to the non-recurrent connections. One epoch means that all training user action sequences have been input to the LSTM, and the order of the sequences is randomized in every epoch. The LSTM is trained for multiple epochs. After training, we obtain the trained feature extractor. We then extract the hidden state of the last hidden layer (the third layer in Fig. 2) for every input and obtain a series of feature vectors \( {\text{H}}^{{u_{k,j} }} = \left[ {{\mathbf{h}}_{3,1}^{{u_{k,j} }} ,{\mathbf{h}}_{3,2}^{{u_{k,j} }} , \cdots ,{\mathbf{h}}_{3,T}^{{u_{k,j} }} } \right] \).
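
The sketch below illustrates this training-and-extraction procedure under stated assumptions: the hidden size, dropout rate, and single-sequence batching are placeholders (the actual settings are given in Table 2 of Sect. 4), and nn.LSTM's dropout argument drops only the non-recurrent connections between stacked layers, approximating the scheme of [12]:

```python
import torch
import torch.nn as nn

VOCAB_SIZE, HIDDEN = 64, 128   # hidden size is an assumption; see Table 2

class FeatureExtractorLSTM(nn.Module):
    """Three-layer LSTM trained to predict the next user action."""
    def __init__(self):
        super().__init__()
        # dropout here affects only non-recurrent, between-layer connections
        self.lstm = nn.LSTM(VOCAB_SIZE, HIDDEN, num_layers=3,
                            dropout=0.5, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, one_hot_seq):            # (B, T, VOCAB_SIZE)
        h, _ = self.lstm(one_hot_seq)          # h: (B, T, HIDDEN), last layer
        return self.out(h), h                  # next-action logits, h_{3,t}

model = FeatureExtractorLSTM()
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

# One training step on a single action sequence: predict y_t and compare
# it with the answer x_{t+1}.
action_ids = torch.randint(0, VOCAB_SIZE, (12,))        # toy sequence, T = 12
one_hot_seq = torch.eye(VOCAB_SIZE)[action_ids].unsqueeze(0)
logits, _ = model(one_hot_seq)
loss = loss_fn(logits[0, :-1], action_ids[1:])          # y_t vs. x_{t+1}
optimizer.zero_grad(); loss.backward(); optimizer.step()

# After training, the last hidden layer's states form the feature vectors
# H = [h_{3,1}, ..., h_{3,T}]:
model.eval()                                            # disable dropout
with torch.no_grad():
    _, H = model(one_hot_seq)                           # (1, T, HIDDEN)
```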

3.3 Fixed-Size Feature Representations

Because the designed classifier accepts fixed-size representations and the number of actions differs between user action sequences, we need to construct a fixed-size feature matrix from each series of feature vectors, which is then provided as input to the CNN.

To deal with this, we fix a maximal length \( {\text{N}}^{{u_{k} }} \) and a minimal length \( n^{{u_{k} }} \) for the action sequences of user \( u_{k} \). We ignore all sequences whose length is shorter than \( n^{{u_{k} }} \). For all sequences with more than \( {\text{N}}^{{u_{k} }} \) steps, we keep only the first \( {\text{N}}^{{u_{k} }} \) actions. All sequences whose length \( {\text{T}} \) is between \( n^{{u_{k} }} \) and \( {\text{N}}^{{u_{k} }} \) are padded with zeros until their length reaches \( {\text{N}}^{{u_{k} }} \). In this way, we convert the series of feature vectors \( {\mathbf{H}}^{{u_{k,j} }} = \left[ {{\mathbf{h}}_{3,1}^{{u_{k,j} }} ,{\mathbf{h}}_{3,2}^{{u_{k,j} }} , \cdots ,{\mathbf{h}}_{3,T}^{{u_{k,j} }} } \right] \) into a matrix of dimensions \( {\text{N}}^{{u_{k} }} \times {\text{V}}^{{u_{k} }} \), where \( {\text{V}}^{{u_{k} }} \) is the dimension of the last hidden layer. Finally, each element of the matrix is mapped to \( \left[ {0,1} \right] \) by the sigmoid function, yielding the fixed-size feature matrix \( {\mathbf{M}}^{{u_{k,j} }} \).
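
A minimal sketch of this truncate/pad/squash step, with the per-user lengths \( n^{{u_{k} }} \) and \( {\text{N}}^{{u_{k} }} \) left as parameters:

```python
import numpy as np

def to_fixed_size(H, n_min, n_max):
    """Convert a (T, V) series of feature vectors into an (n_max, V)
    matrix, following Sect. 3.3:

    - sequences shorter than n_min are discarded (return None)
    - sequences longer than n_max are truncated to the first n_max rows
    - remaining sequences are zero-padded to n_max rows
    Each element is finally squashed into [0, 1] with the sigmoid.
    """
    T, V = H.shape
    if T < n_min:
        return None                            # ignore too-short sequences
    if T > n_max:
        M = H[:n_max]                          # keep only the first n_max
    else:
        M = np.vstack([H, np.zeros((n_max - T, V), dtype=H.dtype)])
    return 1.0 / (1.0 + np.exp(-M))            # map each element to [0, 1]
```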

3.4 Training CNN for Detecting Insider Threat

The final component of our approach is the classification stage. We use the CNN to classify the fixed-size feature matrices of user behavior as normal or anomalous. The CNN consists of an input layer, two convolution-pooling layers, a fully-connected layer, and an output layer. For user \( u_{k} \), the dimension of the input layer is \( {\text{N}}^{{u_{k} }} \times {\text{V}}^{{u_{k} }} \) and the dimension of the output layer is two. Figure 3 shows the structure of the CNN.

Fig. 3. Structure of the CNN

We first train the CNN using fixed-size feature matrices annotated as normal or anomalous; the softmax function is applied to the output of the CNN. After training, we use the trained CNN to calculate the anomaly probability of a user action sequence.
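
The following sketch shows one plausible instantiation of such a CNN; the filter counts, filter shapes, and activation function are illustrative placeholders (the settings actually used, including the relu/tanh variants, are in Table 3), and the input dimensions are assumed divisible by 4 so that the two 2x2 max-poolings divide evenly:

```python
import torch
import torch.nn as nn

class ThreatCNN(nn.Module):
    """Two convolution-pooling layers, a fully-connected layer, and a
    two-way output, as in Fig. 3. Filter counts/shapes are illustrative."""
    def __init__(self, n_rows, n_cols):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5, padding=2), nn.Tanh(),
            nn.MaxPool2d(2, stride=2),     # halves each spatial dimension
            nn.Conv2d(8, 16, kernel_size=5, padding=2), nn.Tanh(),
            nn.MaxPool2d(2, stride=2),
        )
        self.classifier = nn.Linear(16 * (n_rows // 4) * (n_cols // 4), 2)

    def forward(self, m):                  # m: (B, 1, n_rows, n_cols)
        z = self.features(m).flatten(1)
        return self.classifier(z)          # softmax is folded into the loss

# Anomaly probability for one fixed-size feature matrix (N^{u_k} x V^{u_k}):
model = ThreatCNN(n_rows=64, n_cols=128)          # toy dimensions
M = torch.rand(1, 1, 64, 128)
p_anomaly = torch.softmax(model(M), dim=1)[0, 1]  # probability of "anomalous"
```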

4 Experiments

This section reports the experimental validation of the proposed method. We apply our method to the CMU-CERT insider threat dataset [13], a synthetic dataset describing users' computer-based activity, which consists of information on several different activities over a period of 17 months. Below, we first describe the dataset and the evaluation method, and then present the experimental results of our approach.

4.1 Dataset

We perform experiments on the CERT insider threat dataset V4.2, because it contains more instances of insider threats than the other versions of the dataset. The dataset captures 17 months of activity logs of 1000 users (including 70 insiders) in an organization and covers five different types of activities: logon/logoff, email, device, file, and http. Each log line is parsed to obtain details such as a timestamp, user ID, PC ID, and action details. We choose a comprehensive set of 64 actions over the five types of activities and build 1000 user-specific profiles based on user action sequences. An example of a user action is visiting a job-hunting website between the hours of 8:00 am and 5:00 pm on an assigned computer. The enumeration of user actions is listed in Table 1.

Table 1. Enumeration of user actions
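
As an illustration of this parsing and encoding step, the sketch below maps a logon-log record to an action ID; the timestamp format, field layout, and ID assignments are assumptions for illustration, not the exact encoding of Table 1:

```python
from datetime import datetime

WORK_START, WORK_END = 8, 17          # 8:00 am - 5:00 pm working hours

def in_work_hours(ts: datetime) -> bool:
    return WORK_START <= ts.hour < WORK_END

def encode_logon_event(date_str: str, activity: str) -> int:
    """Map one parsed logon-log record to an action ID, distinguishing
    working-hours and after-hours variants of the same activity."""
    ts = datetime.strptime(date_str, "%m/%d/%Y %H:%M:%S")
    return {("logon", True): 0, ("logon", False): 1,
            ("logoff", True): 2, ("logoff", False): 3}[
        (activity.lower(), in_work_hours(ts))]

# e.g. a logon at 10:23 pm is the after-hours variant:
assert encode_logon_event("01/02/2011 22:23:14", "Logon") == 1
```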

Over the course of 17 months, the 1000 users generate 32,770,227 log lines. Among these are 7323 anomalous activity instances manually injected by domain experts, representing three insider threat scenarios.

We split the dataset into two subsets: training and testing. The former (~70% of the data) is used for model selection and hyper-parameter tuning; the latter (~30% of the data) is used for evaluating the performance of the model. Our classifications are made at the granularity of user-days. Note that we removed weekends from the data when classifying at the user-day granularity, because user behavior is qualitatively different on weekdays and weekends.
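
A sketch of this user-day grouping and weekend filtering, assuming the parsed events sit in a pandas DataFrame with hypothetical 'user', 'timestamp' (datetime64), and 'action_id' columns:

```python
import pandas as pd

def user_day_sequences(df: pd.DataFrame) -> pd.Series:
    """Group parsed events into one action sequence per user-day,
    dropping weekends (Saturday/Sunday)."""
    df = df[df["timestamp"].dt.dayofweek < 5].copy()   # Mon=0 ... Fri=4
    df["day"] = df["timestamp"].dt.date
    return (df.sort_values("timestamp")                # chronological order
              .groupby(["user", "day"])["action_id"]
              .apply(list))                            # one sequence per user-day
```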

4.2 Evaluation Method

The dataset used in the experiments is unbalanced, so we choose Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC) measure for evaluating the proposed method. On the one hand, an ROC curve visualizes the relation between the true positive rate (TPR) and the false positive rate (FPR) of a classifier; on the other hand, it allows the accuracy of two or more classifiers to be compared.
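
A minimal sketch of this evaluation with scikit-learn, taking the classifier's anomalous-class probabilities as scores:

```python
from sklearn.metrics import roc_curve, auc

def evaluate(y_true, anomaly_scores):
    """Compute the ROC curve (FPR/TPR pairs) and the AUC from
    ground-truth labels and per-sequence anomaly probabilities."""
    fpr, tpr, thresholds = roc_curve(y_true, anomaly_scores)
    return fpr, tpr, auc(fpr, tpr)
```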

4.3 Results

To compare the performance of the model under different parameters, we train our model with several parameter settings. When setting the parameters of the LSTM, we refer to the settings of [14], which uses an LSTM for language modeling. In addition, the LSTM is trained using the Adam [15] variant of gradient descent. The parameter settings of the LSTM are shown in Table 2.

Table 2. Parameters of the LSTM

The parameters of the CNN were set by referring to the settings of LeNet [16], which is used for recognizing handwritten digits. Let a(b) denote the number of filters a and the shape b of each filter for each convolutional layer. Max-pooling halves the size of the input with a stride of 2. The parameter settings of the CNN are shown in Table 3.

Table 3. Parameters of the CNN

We evaluated the ROC curves for each of these CNNs, and later compare the best-performing CNN against the logistic regression classifier-based architectures (see Fig. 5). Figure 4(a), (b), (c) and (d) show the ROC curves when CNN1, CNN2, CNN3 and CNN4, respectively, are used for classification. The results for the different parameter settings differ only slightly, and the relu activation function performs similarly to the tanh activation function under the same parameter settings. LSTM2 with CNN3 outperforms the other CNNs and achieves the best result of AUC = 0.9449.

Fig. 4. ROC curves for the CNNs

Figure 5 compares the ROC curves of the best-performing CNN3 with those of the logistic regression classifier-based architectures. The ROC results for the CNN classifier-based architectures are better than those of the logistic regression version with the same language model (LSTM2).

Fig. 5. ROC curves for CNN3 and logistic regression

5 Conclusion

In this paper, we proposed an insider threat detection method based on deep neural networks. Because insider threats manifest in various forms, it is not practical to model them explicitly. We therefore frame insider threat detection as an anomaly detection task and treat anomalous user behavior as indicative of insider threat. The LSTM extracts user behavior features from sequences of user actions and generates fixed-size feature matrices; the CNN classifies these fixed-size feature matrices as normal or anomalous. We evaluated the proposed method using the CERT Insider Threat dataset V4.2. Experimental results show that our method can successfully detect insider threats, obtaining AUC = 0.9449 in the best case.