Keywords

1 Introduction

Over the last two decades, many solutions have emerged to protect and secure computer systems. They are complementary but not always sufficient. Thus, antivirus software acts on a host computer and protects against viruses and malicious programs. If this solution is effective for isolated machines and viruses already known, it is not recommended to trust them, especially when connected to the Internet.

To solve the problem of network intrusion, firewalls come to the rescue. A firewall is used to control communications between a local network or host machine and the Internet. It filters traffic in both directions and blocks suspicious traffic according to a network security policy. It is therefore the tool that defines the software and the users that have the permission to connect to the Internet or to access the network [1]. With anti-virus, the firewall increases the security of data in a network. However, this combination remains powerless in front of malicious users with knowledge of all the requirements of security policy. Indeed, once a software or user has the permission to connect to the Internet or a network, there is no guarantee that they will not perform illegal operations. Moreover, many studies have shown that 60% to 70% of attacks come from within systems [2].

To handle this problem, in addition to antivirus and firewalls, intrusion detection systems (IDS) are used to monitor computer systems for a possible intrusion considered as unauthorized use or misuse of a computer system [3].

IDSs are designed to track intrusions and protect system vulnerabilities. They are very effective at recognizing intrusions for which they are programmed. However, they are less effective if the intruder changes the way he attacks. Whatever the performance of the IDS, they are often limited by the amount of data that the IDS can handle at a time. This limitation does not allow for permanent monitoring and leaves violations for intruders.

This research explores a first implementation of classical machine learning models for intrusion detection. These models take into account the historical IDSs data with their corresponding classes. Then, through a comparative study, we select the best method to use it for the inference in order to detect intrusions in real-time. In addition, we present in this study a hierarchical learning method that is proving to be very effective in comparison with the traditional classification method.

The rest of this paper is organised as follows: next section presents some related works, followed by the proposed strategy in Sect. 3. In Sect. 4 we show the experimental results and give some discussions. Finally, in Sect. 5 we conclude with some perspectives.

2 Related Works

Many recent researches try to handle the intrusion detection problem with the artificial intelligence and machine learning approaches.

The authors of [4] explore the issue of the game theory for modelling the problem of intrusion detection as a game between the intruder and the IDS according to a probabilistic model, the objective of their study is to find a frequency for an IDS verification activities that ensures the best net gain in the worst case. We think that is a good idea, but if the attacker changes his behaviour, the proposed approach will no more be able to intercept him effectively.

In [5] a new agent architecture is proposed. It combines case-based reasoning, reactive behavior and learning. Through this combination, the proposed agent can adapt itself to its environment and identify new intrusions not previously specified in system design. Even if the authors showed that the hybrid agent achieves good results compared to a reactive agent, their experimental study did not include other learning approaches such as support vector machine, K nearest neighbors... In addition, the learning set used in this study is very small, only 1000 records.

The authors in [6] proposed an intrusion detection framework based on an augmented features technique and an SVM algorithm. They validated their method on the NSL-KDD dataset, and stated that their method was superior to other approaches. However, they did not mention which data are used for the test. In addition, the application of features augmentations technique increases the risk of falling into an over fitting case, especially when processing large data, so we believe this is not an ideal choice for analyzing large network traffic for intrusion detection.

In [7], the authors applied a hybrid model of genetic algorithms with SVM and KPCA to intrusion detection. They used the KDD CUP99 dataset to validate their system. However, this dataset contains several redundancies, so the classifier will probably be skewed in favor of more frequent records. With the same way, the authors of [8] combined decision tree with genetic algorithms and features selection for intrusion detection. They used also the KDD dataset which is not really reliable for validating methods, it would have been more interested to take other datasets to confirm the proposed approach. An interesting multi-level hybrid intrusion detection model is presented in [9], it uses support vector machine and extreme learning machine to improve the efficiency of detecting known and unknown attacks. The authors apply this model on the KDD99 dataset, which has the previously mentioned drawbacks. On the same dataset, a new LSTM: long short term memory model is presented in [10] to deal with four classes of attacks, and the results are satisfactory. In [11] the authors present a text mining approach to detect intrusions, in which they present a new distance measure. However, most of logs features are in numerical format, and taking them as text data will considerably increase the complexity of the calculation in terms of memory capacity and also in terms of learning time. This is the biggest flaw in the last two papers mentioned.

A very interesting survey is presented in [12]. The paper describes the literature review of most of machine learning and data mining methods used for cyber security. However, the methods that are the most effective for cyber applications have not been established by this study, the authors affirm that given the richness and complexity of the methods, it is impossible to make one recommendation for each method, based on the type of attack the system is supposed to detect. In our study we draw inspiration from the methods cited by this paper such as decision trees, support vector machine, k nearest neighbors,... for a comparative study established on the most popular datasets. In addition, we enrich our comparative study with several neural networks models, and with a new proposed hierarchical classification method.

3 The Proposed Strategy

Contrary to the idea in [13], in which the authors present an hierarchical classification of the features for intrusion detection. In our study we propose a hierarchical method that starts by detecting malicious connexion with a binary classification, and then, in a second time, the algorithm tries to find the corresponding attack class by a multi-label classification as shown in Fig. 1. The idea is to detect an abnormal connection very quickly and launch a warning to the admin, then try to classify this connection in the corresponding attack class.

Fig. 1.
figure 1

The proposed hierarchical classification method

For this, we will design a hierarchical approach to the machine learning methods, and compare their performances with the classical algorithms of classification.

In the context of data mining and machine learning: classification is done using a model that is built on historical data. The goal of predictive classification is to accurately predict the target class for each record in new data. A classification task begins with build training data for which the target values (or class assignments) are known. Many classification algorithms use different techniques for finding relations between the predictor attribute’s values and the target attribute’s values in the build data [14, 15]. In the following subsections, a summarised overview of the implemented machine learning algorithms is reported.

3.1 Naïve Bayes Classifier

The naïve Bayes algorithm is based on Bayesian probability theory following assumptions of naive independence [16]. It is one of the most basic classification techniques with various applications, such as email spam detection, personal email sorting, and document categorization.

Even though it is often outperformed by other techniques. The main advantage of the naïve Bayes remains that it is less computationally intensive (in both CPU and memory), and it requires a small amount of training data. Moreover, the training time with Naive Bayes is significantly smaller as opposed to alternative methods [17].

3.2 K-Neighbors Approach

Nearest Neighbors is one of the simplest, and rather trivial classifiers is the rote classifier, which memorizes all training data and performs classification only if the attributes of the test object match one of the training examples exactly [18]. A more known variation, the k-nearest neighbor (kNN) classification [19], finds a group of k objects in the training set that are closest to the test object and bases the assignment of a label on the predominance of a particular class in this neighborhood. There are three key elements of this approach: a set of labeled objects, a distance or similarity metric to compute distance between objects, and the value of parameter k, which represents the number of nearest neighbors.

To classify an unlabeled object, the distance of this object to the labeled objects is computed, its k-nearest neighbours are identified, and the class labels of these nearest neighbours are then used to determine the class label of the object.

Given a training set DR and a test object \(z=(x',y')\) the algorithm computes the distance (or similarity) between z and all the training objects \((x,y) \in {DR} \) to determine its nearest-neighbor list: \(D_{z}\). (\(x_{i}\) is the training data of \(object_{i}\), while \(y_{i}\) is its class. Likewise, \(x'\) the data of the test object and \(y'\) is its class.) Once the nearest-neighbors list is obtained, the test object is classified based on the majority class of its nearest neighbors:

$$\begin{aligned} Majority\_Voting\_y' = argmax_{v} \sum _{x_{i},y_{i}\in {D_{z}} } I(v=y_{i}). \end{aligned}$$

where v is a class label, \(y_{i}\) is the class label for the \(i^{th}\) nearest neighbors, and I() is an indicator function that returns the value 1 if its argument is true and 0 otherwise.

3.3 Support Vector Machine

Support vector machines (SVM) have exhibited superb performance in binary classification tasks. Intuitively, SVM aims at searching for a hyperplane that separates the two classes of data with the largest margin (the margin is the distance between the hyperplane and the point closest to it) [20, 21].

Most discriminative classifiers, including SVMs, are essentially two-class classifiers. A standard method of dealing with multi-class problems is to create an ensemble of yes/no binary classifiers, one for each label. This method is called “one-vs-others” [22].

3.4 Random Forests

Random Forests are a part of ensemble learning. Ensemble learning [23] deals with methods which employ multiple learners to solve a problem. The capacity of working with several learners in the same time achieve better results than a single learner. Random forest works by creating various decision trees in the training phase and output class labels those have the majority vote [24]. They achieve high classification accuracy and can handle outliers and noise in the data. Random Forest is implemented in this work because it is less susceptible to over-fitting and it has previously shown good classification results.

3.5 Multilayer Perceptron Neural Networks

The basic unit in a neural network is called “neuron” or “unit”. Each neuron receives a set of inputs, which are denoted by the vector \(\overline{X}_{i}\) [25]. In addition, each neuron is associated with a set of weights A, which are used for computing a function f of its inputs. A typical function that is basically used in the neural network is the linear function, it is defined as follows: \(p_{i}=A.\overline{X}_{i}\). We assume that the class label is denoted by \(y_{i}\). The goal of this approach is to learn the set of weights A with the use of the training set. The idea is to start off with random weights, and gradually update them when a mistake is done by applying the current function on the training example. The magnitude of the update is regulated by a learning rate \(\mu \). This forms the core idea of the perceptron algorithm.

figure a

3.6 Convolutional Neural Network Classifier

A Convolutional Neural Network (CNN) is comprised of one or more convolutional layers, and then followed by one or more fully connected layers as in a standard multilayer neural network. The neurons of a convolutional layer are grouped in feature maps sharing the same weights, so the entire procedure becomes equivalent to convolution [27]. Convolutional layers are usually followed by a non-linear activation-layer, in order to capture more complex properties of the input data. The pooling layers are usually used for subsampling the preceding layer, by aggregating small rectangular values subsets. Maximum or average pooling is often applied by replacing the input values with the maximum or the average value, respectively. Finally, one or more dense layers are put in place, each followed by an activation-layer, which produce the classification result.

The training of CNNs is performed similarly to that of classical Multilayer Perceptron Networks, by minimizing a loss function using gradient descent-based methods and back-propagation of the error.

In this study, our best CNN model is reached after trying many architectures. It is inspired by the contribution of [28] in sentiments detection. Since its model demonstrated well performant results, we adapted it in our study for intrusion detection. It is mainly composed of four hidden layers in addition to the input and the output layer. In the following, we show a summarization of the architecture of our best CNN model:

  • The input layer is a convolution of one dimension with a number of neurones equals to the number of the dataset features.

  • The second layer is a max pooling 1D with a pool size equals to 4.

  • The third layer is a flatten with 512 neurones.

  • The fourth is a dense layer with 512 neurones with ‘relu’ as activation function.

  • The output layer is a dense layer with the ‘sigmoid’ function, and a number of neurones equals to the number of classes.

4 Experimentation Results and Discussion

In this section, we implement various machine learning models to classify the different sets of network traffic of the four well known benchmarks, which are summarised in Table 1:

Table 1. The datasets characteristics

4.1 Performance Evaluation

For the performance evaluation, we calculate the accuracy with the F-score of each approach. The F-score also called F-measure is based on the two primary metrics: precision and recall. Given a subject and a gold standard, precision is the proportion of cases that the subject classified as positive that were positive in the gold standard. It is equivalent to positive predictive value. Recall is the proportion of positive cases in the gold standard that were classified as positive by the subject. It is equivalent to sensitivity. The two metrics are often combined as their harmonic mean, the formula can be formulated as follows:

$$\begin{aligned} F = \frac{(1+\beta ^{2})\times recall \times precision}{(\beta ^{2} \times precision)+recall} \end{aligned}$$
$$\begin{aligned} Precision = \frac{TP}{TP+FP}, Recall = \frac{TP}{TP+FN} \end{aligned}$$

Where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. The F-measure can be used to balance the contribution of false negatives by weighting recall through a parameter \(\beta \ge 0\). In our case \(\beta \) is set to 1, F1-score is than equal to:

$$\begin{aligned} F1\_score = \frac{ 2 \times recall \times precision}{ precision+recall} \end{aligned}$$

4.2 Environment and Materials

We use in this implementation, Python language, Tensorflow tool, and Keras library. All the algorithms are executed and compared using the NVIDIA DGX-1Footnote 1. The DGX1 is an Nvidia server which specializes in using GPGPU to accelerate deep learning applications. The server features 8 GPUs based on the Volta daughter cards with HBM 2 memory, connected by an NVLink mesh network.

4.3 Multi-class Classification Results

In this paper, first, we classify the datasets as they are labelled, without any modification. The obtained results are shown in Table 2, columns of Multi-label classification.

Table 2. Classifications reports

We notice from these results that most of the approaches succeed in obtaining good training and test scores. However, the best of these models does not exceed 71% accuracy on the test benchmark.

Considering the delicacy of the domain, and the dangerousness that can generate an IDS which classifies network traffic as normal when it is an attack. We propose a hierarchical classification strategy to achieve greater accuracy.

4.4 Hierarchical Classification Strategy

Given the main objective of an intrusion detection system, which is to detect potential attacks, we have decided in this strategy to adopt an hierarchical classification.

First, we start with a binary classification, merging all attack classes into one large class and labelling it ‘attack’, and on the other hand keeping the ‘normal’ class without any modification.

Then, after separating normal network traffic to that which represents a potential attack, a multi-class classification can be applied within the “attack” class to know what type of attack it is.

Binary Classification (Level 1: Normal/Attack). We start the hierarchical classification strategy by the detection of an abnormal connexion. This is reached by the binary classification task. For this, we have merged all the connections which have a different label from ‘normal’ into a single class that we have labelled ‘attack’. The obtained results are show in Table 2, columns of hierarchical (level 1).

Classification of Attack Types (Level 2: Multi-class Classification). After applying a binary classification to detect abnormal connections. We implement a multi-class classification approach on the ‘attack’ class, to obtain more details on the type of attack. All the well known machine learning approaches are implemented and compared in this way. The results are noted in Table 2, columns of hierarchical classification (level2). We can imagine a third level, if we have subclasses in an attack class.

4.5 Discussion

From these results, we note that the proposed hierarchical approach has considerably improved the effectiveness of the classical classification approach on all the benchmarks studied.

Fig. 2.
figure 2

Classification methods comparison on KDD Dataset

For instance, on the KDD99 dataset, the proposed hierarchical classification approach surpassed the traditional multi-class approach. Like demonstrated on Fig. 2. While the best approach obtained an accuracy of 53% in a multi-class classification, the proposed approach allows to detect an abnormal connection with a rate of 84%, and to predict the attack type with a success rate of 61%.

Fig. 3.
figure 3

Classification methods comparison on NSLKDD Dataset

Also on the NSLKDD dataset, the proposed hierarchical classification approach surpassed the traditional multi-class approach. We can note on Fig. 3 that, while the best approach obtained an accuracy of 73% in a multi-class classification, the proposed approach allows to detect an abnormal connection with a rate of 82%, and to predict the attack type with a success rate of 75%.

Fig. 4.
figure 4

Classification methods comparison on UNSW15 Dataset

In Fig. 4, we note that the proposed approach of hierarchical classification has increased the accuracy rate. While, the best approach obtained an accuracy of 71% in a multi-class classification, the proposed approach allows to detect an abnormal connection with a rate of 86%, and to predict the attack type with a success rate of 79%.

Fig. 5.
figure 5

Classification methods comparison on CIC_IDS 2017 Dataset

We valid our hypothesis also on the CIC-IDS 2017 dataset. In Fig. 5, we note that the proposed approach of hierarchical classification has increased the accuracy rate. While, the best approach obtained an accuracy of 88% in a multi-class classification, the proposed approach allows to detect an abnormal connection with a rate of 92%, and to predict the attack type with a success rate of 95%.

According to the results of our comparative study, in general we can validate the hypothesis that to achieve effective intrusion detection, we must start with a binary classification (attack/normal) using our best neural network model, followed by the application of the KNN algorithm or the best neural network model to find out what type of attack is involved.

5 Conclusion

The study presented here leads us to make two conclusions. It appears first that the learning model best suited to the intrusion detection problem is that based on convolutional neural networks. Moreover, by comparing different learning strategies, the approach based on a hierarchical detection of the attacks (starting with a first level of binary classification discriminating only the compliant traffic of the nonconforming traffic) presents the best performances, well in front of the methods of multi-label classification. The system thus obtained has an intrusion detection rate of 95%. These results allow us to consider the implementation of a real-time intrusion detection system based on our CNN model and binary classification. This will require larger datasets and more powerful training infrastructure solutions to further improve the detection rate. Finally, one of the challenges of intrusion detection remains zero-day attack detection. It turns out that the method used to train our neural network gives him the ability to identify as invalid data he has never met during his training. The next task will be to develop this capacity and especially to measure its effectiveness.