
1 Introduction

The problem of securing electronic devices is as old as computers themselves, but over time computers have gained more and more resources, so intrusion detection systems (IDS) on these devices have become more efficient. Today, many small devices without the power of modern computers are connected to networks and are the target of numerous attacks. Moreover, every IoT system is different and has specific concerns depending on the type of attack (DDoS, blackhole, Sybil attack…) it needs to be protected from. Wireless Sensor Networks (WSN), for instance, have unique characteristics such as a limited power supply, low transmission bandwidth, small memory size, and limited data storage [3]. It is thus crucial to develop and deploy new IDS.

Section 2 presents a brief review of the literature and Sect. 3 presents the problem of selecting or simulating a dataset to test IDS. Section 4 presents the implementation, tests, and results of several machine learning algorithms for outlier detection on the CIDDS-001 dataset. Section 5 presents some simple decision rules that can be introduced in an IoT network to act as an IDS. Section 6 presents a summary of the conclusions and future work.

2 Related Work

Doshi et al. [4] simulate IoT networks with Raspberry Pi boards and virtual machines. They collect network data from their system and evaluate five algorithms on it: K-nearest neighbors, Support Vector Machine (SVM) with a linear kernel, decision tree, random forest, and neural network. The specificity of their work is the use of stateful features, which yields up to 30% better performance than without these features.

Hussain et al. [5] list, for each problem, several surveys that use machine learning techniques (Table 4 in their paper). For anomaly and intrusion detection:

  • K-means clustering and Decision Tree [6]

  • Artificial Neural Network ANN [7]

  • Novelty and Outlier Detection [8]

  • Decision Tree [9]

  • Naive Bayes [9, 10]

Butun et al. [3] classify IDS methodologies into three categories:

  1. Anomaly based detection:

    An activity profile is created for each member of the network, and deviations beyond a certain threshold are reported as anomalies. This method is suitable for detecting previously unknown attacks, but the profiles must be updated periodically because network behavior can change rapidly.

  2. Misuse based detection:

    A signature (profile) of a previously known attack is used as a reference to flag subsequent attacks. The disadvantage of this method is that it cannot detect new types of attacks, but its false positive rate is very low.

  3. Specification based detection:

    This is a mix of the previous two: "a set of specifications and constraints that describe the correct operation of a program or protocol is defined" [3]. However, it takes a lot of time to develop dedicated rules that achieve a low false-positive rate.

Some surveys report unclear results, and some report none at all. There are also very few simulations and implementations on real systems.

3 Dataset

Selecting a dataset to design and evaluate ML-based NIDS algorithms is not straightforward and can be a challenge in its own right.

One of the most widely used datasets is the KDD Cup 99 set, but it has several shortcomings, as emphasized by Tavallaee et al. [11]:

  • a large amount of redundant records;

  • parts of the training set were used as test sets in some studies;

  • the set is so large that studies are forced to use only part of it.

Ring et al. [12] compiled an exhaustive list of network-based intrusion detection datasets and compared them. One of the more recent datasets of manageable size is CIDDS-001 (Coburg Intrusion Detection Data Set) [13], which was described as follows:

“The CIDDS-001 data set was captured within an emulated small business environment in 2017, contains four weeks of unidirectional flow-based network traffic, and comes along with a detailed technical report with additional information. As special feature, the data set encompasses an external server which was attacked in the internet. In contrast to honeypots, this server was also regularly used by the clients from the emulated environment. The CIDDS-001 data set is publicly available and contains SSH brute force, DoS and port scan attacks as well as several attacks captured from the wild” [12].

The dataset contains 14 features, listed in Table 1.

Table 1. Features within the CIDDS-001 data set, from [13]

In our experiment, we used the “Class” attribute as the target for classification and removed the AttackType, AttackID and AttackDescription features, which are trivially correlated with the “attacker” class. Furthermore, since the IP addresses were anonymized, they do not convey information, so we also removed them. We used “Date first seen” only as the time axis for plots. Finally, we transformed Flags, Class and Proto, which are categorical features, into “dummy variables” by one-hot encoding.
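As an illustration, a minimal preprocessing sketch with pandas could look as follows; the column names and the file path are assumptions based on the CIDDS-001 CSV files (not necessarily the exact code used in our experiments), and the target is kept here as a plain label:

```python
import pandas as pd

# Column names are assumed to match the CIDDS-001 CSV files and may need
# adjusting; the file path is only a placeholder.
df = pd.read_csv("CIDDS-001/internal/week1.csv")

# Drop the attack meta-data (they leak the label) and the anonymized IPs.
df = df.drop(columns=["attackType", "attackID", "attackDescription",
                      "Src IP Addr", "Dst IP Addr"], errors="ignore")

# Keep the timestamp aside (used only as a time axis for plots).
timestamps = df.pop("Date first seen")

# Target label, and one-hot encoding of the remaining categorical features.
y = df.pop("class")
X = pd.get_dummies(df, columns=["Proto", "Flags"])
```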

Within the CIDDS-001 dataset, we used the internal-week1 subset of observations, as it contains 42 of the 92 attacks present in the entire dataset.

Anomalies are labeled as victim or attacker. However, this file contains more than 8 million rows, of which less than 20% are anomalies. Hence we face a case of imbalanced classes. In such a case, the more represented class can have a “masking effect” on the others; this has been studied in [14] for this dataset. Given the high number of instances available, rebalancing the classes can simply be done by subsampling the majority class (otherwise, one can also oversample the minority classes by creating new, synthetic instances). In our case, we decided to (a) shuffle the data, (b) keep half of the data for a final evaluation, and (c) subsample the other half to keep about 180,000 instances per class.
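A sketch of this rebalancing step, reusing X and y from the preprocessing sketch above; the 50/50 split and the 180,000 cap follow the description above, the rest (seeds, variable names) is illustrative:

```python
from sklearn.model_selection import train_test_split

# (a) shuffle and (b) keep half of the data for the final evaluation.
X_work, X_final, y_work, y_final = train_test_split(
    X, y, test_size=0.5, shuffle=True, random_state=0)

# (c) subsample the working half so that no class exceeds ~180000 instances.
work = X_work.assign(label=y_work.values)
balanced = (work.groupby("label", group_keys=False)
                .apply(lambda g: g.sample(min(len(g), 180_000), random_state=0)))
y_bal = balanced.pop("label")
X_bal = balanced
```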

4 Experiments and Results

The experiments were carried out on Google Colaboratory with 32 GB of RAM and Tensor Processing Unit (TPU) hardware acceleration.

The metrics used to evaluate the performance of the algorithms are the classification accuracy, precision, recall and F1-score. They are expressed by the equations below,

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
$$ \text{Precision} = \frac{TP}{TP + FP} $$
$$ \text{Recall} = \frac{TP}{TP + FN} $$
$$ \text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

where, TP, TN, FP and FN stand for true positives, true negatives, false positives and false negatives, respectively.
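Equivalently, these per-class metrics can be computed with scikit-learn; y_test and y_pred below are placeholders for the true and predicted labels:

```python
from sklearn.metrics import accuracy_score, classification_report

# Global accuracy plus per-class precision, recall and F1-score.
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=4))
```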

We shuffle the set, take 33% of it as the test set and the remaining 67% as the train set. Then, we classify the traffic with four algorithms: K-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF) and Neural Network (NN). We used the Python scikit-learn package for our tests.
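The following sketch illustrates the split and the first two classifiers, with the settings described in the next paragraphs (variable names reuse the rebalancing sketch; the random seed is arbitrary):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# 33% / 67% shuffled split of the rebalanced data.
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.33, shuffle=True, random_state=0)

# KNN with a single neighbour and uniform weights; DT with default parameters
# (Gini criterion, nodes expanded until the leaves are pure).
knn = KNeighborsClassifier(n_neighbors=1, weights="uniform").fit(X_train, y_train)
dt = DecisionTreeClassifier().fit(X_train, y_train)
print(knn.score(X_test, y_test), dt.score(X_test, y_test))
```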

For KNN, we use a single neighbour with a uniform weight function. We obtain a global accuracy of 99.27%; other metrics are reported in Table 2.

Table 2. Results with the K Nearest Neighbour algorithm

For the Decision Tree, we used the default parameters, that is, the Gini criterion for measuring the quality of a split and full expansion of the nodes. With these settings we already obtain a global accuracy of 99.89%; other performance metrics are given in Table 3.

Table 3. Results with the Decision Tree algorithm

As an illustration, the first nodes of the tree are shown in Fig. 1:

Fig. 1. First nodes of the decision tree

The decision tree has a depth of 31, a total of 563 nodes and 282 leaves, hence 281 tests. From this tree, it is possible to deduce, and even generate automatically, a classification script (see the source code [20] for an example). Running this script on an instance predicts its class with 99.88% accuracy.
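One possible way to obtain such a script, not necessarily the generator used in [20], is scikit-learn's export_text utility, which renders the fitted tree as nested threshold tests:

```python
from sklearn.tree import export_text

# Size of the fitted tree, then its tests in a human-readable form.
print("depth:", dt.get_depth(), "nodes:", dt.tree_.node_count,
      "leaves:", dt.get_n_leaves())
print(export_text(dt, feature_names=list(X_train.columns)))
```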

For the RF, we selected the best parameters using a grid search strategy, which consists of computing the performance, by cross validation, on a grid of possible parameter values, and then selecting the best estimator. We used global accuracy as the performance metric. In particular, we used 800 trees with a maximum depth of 20. We obtain a global accuracy of 99.95%, with the performances reported in Table 4.
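A sketch of this selection step; the grid values below are illustrative, except for the reported best ones (800 trees, maximum depth of 20):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Cross-validated grid search over the forest hyperparameters.
param_grid = {"n_estimators": [200, 400, 800], "max_depth": [10, 20, 30]}
search = GridSearchCV(RandomForestClassifier(n_jobs=-1), param_grid,
                      scoring="accuracy", cv=3)
search.fit(X_train, y_train)
rf = search.best_estimator_
print(search.best_params_, rf.score(X_test, y_test))
```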

Table 4. Results with the Random Forest algorithm

An interesting outcome of the random forest is that we can extract which features matter most for the classification. The features are shown by decreasing importance in Fig. 2.

Fig. 2. Relative importance of features in the CIDDS-001 dataset
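These importances can be read directly from the fitted forest, for instance as follows (reusing the rf estimator from the grid search sketch):

```python
import pandas as pd

# Feature importances of the fitted forest, sorted in decreasing order.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```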

For the Neural Network, we used the multi-layer perceptron classifier. This model optimizes the log-loss function using L-BFGS or stochastic gradient descent. We used 100 neurons in the hidden layer with the rectified linear unit (ReLU) activation function. The maximum number of iterations was set to 200. We finally obtained a global accuracy of 99.25%; performance metrics for the different classes are shown in Table 5.
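A sketch of this configuration with scikit-learn's MLPClassifier:

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer of 100 ReLU units, at most 200 training iterations.
nn = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                   max_iter=200).fit(X_train, y_train)
print(nn.score(X_test, y_test))
```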

Table 5. Results with the Neural Network algorithm

For the CIDDS-001 dataset and the ML-based algorithms, we obtained very high accuracies. As mentioned in [12], for this particular dataset, balancing the classes or not has very little influence on the accuracy, which is already very high. In any case, with a careful selection of the hyperparameters we recover results similar to the best RF-WHICD in [14] (where only two classes, normal/attacker, were considered) (Table 6).

Table 6. Comparison with related work

5 Rules

The most accurate algorithm is the RF, with 99.95% accuracy. However, it may be difficult to embed on an IoT device and it lacks interpretability. The problem of extracting simple rules from a forest of decision trees has been considered in the machine learning community; the goal is to find a trade-off between the modeling power of random forests and simple rules that are as interpretable as a (small) decision tree. The Skope-rules Python library [19] enables us to extract such rules from a random forest. In our experiments, we took all the instances to train the model and extract the rules.
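A minimal sketch of this extraction, assuming the skope-rules package (import name skrules); the precision and recall thresholds and the number of trees shown here are illustrative choices, and the exact encoding of the categorical features for this step is left open:

```python
from skrules import SkopeRules

# Extract rules for each anomalous class, treated as a one-vs-rest problem.
for target in ["victim", "attacker"]:
    clf = SkopeRules(feature_names=list(X.columns), n_estimators=30,
                     precision_min=0.9, recall_min=0.1, random_state=0)
    clf.fit(X, (y == target).astype(int))
    # rules_ is a list of (rule, (precision, recall, support)) tuples.
    print(target, clf.rules_[:3])
```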

For the victim class, the skope rules are:

  • Bytes > 100

  • Duration <= 0.03749999962747097

  • Flags == ASF

And for the attacker class, the identified rules are:

  • Dst Pt <= 261

  • Duration <= 0.032500000670552254

  • Flags == APSF

With these very simple rules, we already obtain a global accuracy of 86.88%. Furthermore, by a simple inspection of the data, we found that additionally classifying the instances carrying the AR TCP flags as victim, and those carrying the S flag as attacker, improves the accuracy to 98.45%, with only 0.35% misclassifications.
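As an illustration, and assuming the three conditions of each class are conjunctive and that the flag strings match the raw encoding of the Flags field, the resulting classifier can be written as a handful of tests:

```python
# Sketch of the rule-based classifier; thresholds are rounded from the
# values listed above, and `row` is one raw (non-encoded) flow record.
def classify(row):
    if row["Bytes"] > 100 and row["Duration"] <= 0.0375 and row["Flags"] == "ASF":
        return "victim"
    if row["Dst Pt"] <= 261 and row["Duration"] <= 0.0325 and row["Flags"] == "APSF":
        return "attacker"
    # Extra tests found by inspection of the data (see text above).
    if row["Flags"] == "AR":
        return "victim"
    if row["Flags"] == "S":
        return "attacker"
    return "normal"
```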

6 Conclusion

We have considered the problem of detecting anomalies or intrusions in IoT networks. We have first presented the context and reviewed approaches in the literature. We have focused on machine learning-based methods that can learn directly from data and identify the important features, without resorting to specialized models of the network or specialized signatures. We have then selected a dataset of network activity with several attacks, which is regularly used to develop NIDS and as a benchmark for proposals. Using standard open-source libraries, we have implemented and evaluated several ML-based algorithms, with performance at the state of the art. The sources are available and the results are easily reproducible [20].

Using and implementing such solutions in an IoT network requires considering the possible computational overhead. For a router or network supervisor, a traffic-capture module is needed for the incoming network traffic, and the classifiers, once trained as a decision tree or a random forest, can probably be implemented. For the IoT devices themselves, where power consumption and computational costs can be more severely constrained, it is possible to use a decision tree classifier (at most 20 comparison tests and an 840-line Python program) or even the very simple rules (3 tests) derived from a random forest, with 98.5% accuracy and a low false-positive rate. Testing these ideas on real devices and real data is the objective of future work.