1 Introduction

In the IoT era, billions of things are being connected to the Internet, resulting in unprecedented growth in the amount of generated traffic. This traffic presents different QoS and security challenges, and traffic classification emerges as a key enabling tool for meeting some of them. For example, IoT devices’ vulnerabilities have been exploited to perform critical network attacks (e.g. Mirai) [34]. Tracking these devices and detecting abnormal traffic are key to preventing harmful network attacks.

In this context, Machine Learning (ML) based methods were proposed for traffic classification and intrusion detection. Many supervised and unsupervised methods have been employed accordingly. While supervised methods are devoted to classifying the (attack) traffic based on known class labels, the unsupervised ones allow the detection of unknown traffic. The traditional ML methods rely on well-structured and hand-designed features. These features are extracted using statistical traffic measures (e.g. the maximum, minimum, and standard deviation of the packet sizes and of the packet inter-arrival times).
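As an illustration of such hand-designed features, the following minimal sketch computes a few summary statistics for one flow. The function name `flow_statistics` and the choice of statistics are assumptions made for illustration; they are not the feature set used later in the paper.

```python
import statistics

def flow_statistics(sizes, timestamps):
    """Summary statistics of packet sizes and inter-arrival times for one flow."""
    # Inter-arrival time: gap between consecutive packet timestamps.
    iats = [t1 - t0 for t0, t1 in zip(timestamps, timestamps[1:])]

    def summarize(values):
        return {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "std": statistics.pstdev(values),  # population standard deviation
        }

    return {"size": summarize(sizes), "iat": summarize(iats)}

# Hypothetical flow: four packets with sizes in bytes and timestamps in seconds.
features = flow_statistics([60, 1500, 60, 590], [0.0, 0.5, 1.0, 2.0])
print(features["size"]["max"], features["iat"]["min"])
```

Because every value here is an aggregate over the whole flow, changing packet sizes or timing shifts the features directly, which is the weakness discussed next.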

However, mutation techniques alter the statistical features of the traffic, making it very challenging to identify the original traffic type. Moreover, mutation techniques might change the packet characteristics while leaving the statistical features of the flow unchanged. In this case, the detection of abnormal traffic can be evaded.

Recently, Deep Learning (DL) has attracted a lot of attention due to its representation learning capabilities. Using a new data representation in this paper, we propose an unsupervised DL model to detect abnormal traffic and de-anonymize the mutated one. Generative DL architectures, namely Autoencoders (AE) and Generative Adversarial Networks (GAN), have been applied mainly in the computer vision domain to detect abnormality and to denoise images. AE is a DL architecture for extracting a data representation, and GAN has the capability to generate fake data samples to enhance the discrimination between real and fake data. We combine AE and GAN in an architecture that consists of an encoder, a decoder, and a discriminator. The encoder-decoder pair forms a denoising AE responsible for learning the original data representation and for denoising the mutated one. In parallel, the discriminator is trained to differentiate between the mutated traffic (abnormal) and the denoised traffic (normal). The training of the proposed model relies on data collected from real IoT devices and IoT attacks. The testing results show the robustness of the proposed method in detecting mutated traffic and recovering the original one. Note that the proposed model is not limited to IoT traffic and can be applied to any type of network traffic.

The rest of the paper is organized as follows: in Sect. 2, we present an overview of related concepts. In Sect. 3, we present our proposed model. Section 4 details the implementation and presents the evaluation results. In Sect. 5, we discuss the results and present our future work. Finally, we conclude in Sect. 6.

2 Background and Related Work

In this section, we present the related work including a review of the generative DL architectures, their application in detecting abnormality, traffic classification and intrusion detection methods, along with the obfuscation techniques that can affect their accuracy.

2.1 Unsupervised Deep Learning

In this subsection, we explain the DL architectures that we based our work on. These architectures can learn the input data representation and are therefore called generative DL models.

Generative Adversarial Network: Introduced by Goodfellow et al. in [24], GAN consists of two parts: the generator (G) and the discriminator (D). From a game theoretic perspective, GAN can be interpreted as a zero-sum or min-max game between the generator and the discriminator. The generator tries to learn the input data representation to generate data samples very similar to the real ones. The discriminator tries to maximize the probability of distinguishing between fake and real input. The GAN objective function can be presented as follows: \(V(G,D) = E_{x\sim p_{data}(x)}[\log D(x)] + E_{z\sim p_{z}(z)}[\log(1-D(G(z)))]\), where x is the input data, \(p_{data}(x)\) is the data distribution, D(x) is the discriminator output, \(p_{z}(z)\) is the prior distribution of the generator input, z is a sample from \(p_{z}(z)\), and G(z) is the generator output (the fake data). The generator aims at minimizing the probability of fake data detection by the discriminator, which means that the G objective is to find \(\min_{G} E_{z\sim p_{z}(z)}[\log(1-D(G(z)))]\). The discriminator aims at maximizing the probability of detecting real data as real and fake data as fake, which means that the D objective is to find \(\max_{D} \left(E_{x\sim p_{data}(x)}[\log D(x)] + E_{z\sim p_{z}(z)}[\log(1-D(G(z)))]\right)\). Thus, the GAN objective is to find \(\min_{G}\max_{D} V(G,D)\). GAN is applied primarily for synthetic data generation, but it has also been applied for image anomaly detection [19, 27, 50]. In fact, the adversarial learning permits the discriminator to detect abnormal input data. Furthermore, the generative learning permits the generator to learn the real data representation, which makes GAN suitable for image denoising [40, 47].
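To make the objective concrete, the sketch below estimates V(G, D) from a handful of discriminator outputs, replacing each expectation with a sample mean. The function name `gan_value` and the numeric values are illustrative assumptions, not taken from [24].

```python
import math

def gan_value(d_real, d_fake):
    """Sample estimate of V(G, D).

    d_real: discriminator outputs D(x) on real samples.
    d_fake: discriminator outputs D(G(z)) on generated samples.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake well yields a high value ...
confident = gan_value([0.9, 0.8], [0.1, 0.2])
# ... while a fooled discriminator (output 0.5 everywhere) yields a lower one.
fooled = gan_value([0.5, 0.5], [0.5, 0.5])
print(confident > fooled)
```

The discriminator pushes this value up, while the generator pushes it down by making D(G(z)) approach the output obtained on real data.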

Autoencoders: Being a generative model, AE is a type of DL network that is specialized in extracting the input data representation. The AE consists of two parts: an encoder and a decoder. The encoder extracts a compressed data code by estimating a function f such that \(z = f(x)\), where x is the input data, and z is the extracted representation or latent variable. The decoder aims at reconstructing the input data by relying on the extracted representation. In other terms, the decoder tries to infer the inverse function g such that \(y = g(f(x)) = g(z)\). The AE objective function can be represented as the minimization of the difference between g(f(x)) and x. In other terms, the AE aims at minimizing the reconstruction error.

A type of AE is the probabilistic autoencoder, which aims to infer the distribution of x, \(p_{\theta}(x)\), by means of another distribution \(q_{\phi}(z/x)\) (the probability of the latent variable z given the input x). A well-known type of probabilistic AE is the Variational Autoencoder (VAE), which imposes a prior restriction on p(z) to be a normal distribution. In this case, the problem reduces to maximizing the Evidence Lower Bound (ELBO) or, equivalently, minimizing the Kullback–Leibler (KL) divergence between \(q_{\phi}(z/x)\) and \(p_{\theta}(z/x)\); the ELBO itself contains the regularization term \(KL(q_{\phi}(z/x)||p(z))\). Having the ability to extract the real data distribution, VAEs have been applied for anomaly detection. Indeed, when the reconstruction error is large, an anomaly is detected in the input data. Borrowing the adversarial concept from GAN, Adversarial AEs (AAE) were introduced by Makhzani et al. in [30]. Similar to the GAN, AAE includes a discriminator that tries to differentiate between the samples drawn from the latent variable prior p(z) and the latent codes produced by the encoder from the real data. In this case, the discriminator aims to minimize \(L_{dis}= -\frac{1}{N}\left[\sum_{i=0}^{N-1} \log(d_{x}(z_{i})) + \sum_{j=N}^{2N} \log(1-d_{x}(z_{j}))\right]\), and the generator tries to minimize \(L_{prior}= \frac{1}{N}\sum_{i=0}^{N-1} \log(1 - d_{x}(z_{i}))\), where N is the number of samples, and \(d_{x}(z_{i})\) is the discriminator output of the latent space variable. In this case, if the total loss function is optimized, \(q_{\phi}(z/x)\) will be very similar to p(z), or in other terms, \(KL(q_{\phi}(z/x)||p(z))\) will be minimized, and thus the log likelihood of the original data distribution will be maximized. AAEs were also applied for anomaly detection in images [5, 23, 44, 46]. Furthermore, a recent work has considered adding a denoising function to AAEs for image denoising [17]. In this case, the corrupted data \(\tilde{x}\) is considered as input and two methods were proposed for model representation. The first consists of matching \(\tilde{q}_{\phi}(z/x)\) to p(z), and the second consists of matching \(q_{\phi}(z/\tilde{x})\) to p(z). However, in our work, we use a sparse AE that aims to minimize the Mean Square Error (MSE) between the reconstructed data and the original data. In addition, applying the adversarial concept, we choose to train a discriminator to detect abnormal traffic when the reconstruction error is high. Thus, unlike previous work, the generator part of GAN is omitted [13].
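The two AAE losses above can be evaluated numerically as in the following sketch. It assumes the common convention that the first batch of discriminator outputs corresponds to samples drawn from the prior p(z) and the second batch to codes produced by the encoder; the function names and numbers are hypothetical.

```python
import math

def discriminator_loss(d_prior, d_encoded):
    """L_dis: the discriminator should output 1 on prior samples and 0 on encoder codes."""
    n = len(d_prior)
    total = sum(math.log(d) for d in d_prior)
    total += sum(math.log(1.0 - d) for d in d_encoded)
    return -total / n

def prior_loss(d_encoded):
    """L_prior: minimized by the encoder when its codes are scored as prior samples."""
    return sum(math.log(1.0 - d) for d in d_encoded) / len(d_encoded)

# Discriminator separates the two batches well: low L_dis.
good_dis = discriminator_loss([0.9, 0.9], [0.1, 0.1])
# Discriminator cannot separate them: higher L_dis.
poor_dis = discriminator_loss([0.5, 0.5], [0.5, 0.5])
# Encoder codes that fool the discriminator (scores near 1) give a lower L_prior.
fooling = prior_loss([0.9, 0.9])
detected = prior_loss([0.1, 0.1])
print(good_dis < poor_dis, fooling < detected)
```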

2.2 ML Based Traffic Classification and Intrusion Detection

Traffic classification is an essential network function, which is necessary for traffic engineering, QoS management, and security management. Different methods have been proposed for traffic classification using different sets of features [8, 9, 18, 22, 32, 33]. More recently, DL has been applied for traffic classification using new features and new data representations [14, 21, 36, 39, 45].

On the other hand, intrusion detection is an essential network security component. Several approaches have been adopted to detect and prevent network attacks. Traditional Intrusion Detection Systems (IDSes) are rule-based, where the attack signature is known by identifying some patterns in the packet’s fields. However, this method fails to detect unknown attacks (e.g. zero-day attacks) and requires the inspection of the packet header. In addition, traffic anonymization can be used to thwart detection by traditional IDSes. ML methods have been proposed to detect network attacks and traffic abnormality by means of statistical features [6]. However, the traditional methods require the extraction of the features by computing statistical measures that might be data dependent. In addition, the detection of abnormal traffic is not a straightforward task, given that the abnormal traffic might present similar statistical behavior to the normal one [4].

Recently, DL has been applied in the network domain for intrusion detection [3, 43]. More specifically, IoT, which presents aggravated security challenges, has received special attention from the intrusion detection perspective [10, 16]. Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have been applied for payload-based attack detection in [29]. While supervised learning was mainly considered for identifying specific attacks [48], the detection of unknown and zero-day attacks calls for unsupervised-learning-based methods. However, the proposed unsupervised [31] or semi-supervised [41] DL methods for intrusion detection use statistical features. To the best of our knowledge, no previous work has considered the recovery (denoising) of the mutated traffic. In addition, this is the first work to consider the detection of mutated traffic in addition to the (unknown) attack traffic.

2.3 Traffic Obfuscation Techniques

With the aim of protecting user privacy, a new research direction considers traffic obfuscation to thwart classification. In this context, many obfuscation techniques have been proposed [15, 25]. These methods can be classified into the following categories: steganography, tunneling, anonymization, mutation, morphing, and physical layer obfuscation. While steganography and physical layer obfuscation techniques require specific protocols to recover the original data, the remaining techniques can be applied without imposing changes to the current network protocols. Anonymization and tunneling hide some packet-related information (port numbers and internet addresses) by encryption and the establishment of virtual connections. However, statistical ML methods are still able to classify the anonymized or tunneled traffic. Mutation and morphing are two techniques that modify the statistical traffic characteristics by changing the packet size and the packet Inter-Arrival Times (IAT) [11]. This has a great impact on the accuracy of ML-based classification and intrusion detection. Even though the obfuscation techniques were intended to protect user privacy, attackers might use them to perform their attacks without being detected. In this context, padding and traffic shaping are proposed to modify the packet size and IAT respectively [7, 35].

Recently, the GAN DL architecture has been employed to adapt malware traffic to the normal one [28, 37, 42, 51]. On the contrary, in this paper, we will consider a discriminative denoising DL model to detect abnormality by relying on the trained discriminator and to de-anonymize mutated traffic by means of a denoising AE.

Our contributions in this paper can thus be summarized by the following points:

  • The discriminative part of GAN is combined with a denoising AE to allow mutated traffic detection and recovery.

  • A new data representation is used to permit the de-anonymization of mutated traffic by applying DL-based techniques.

3 Proposed Abnormal Traffic Detection

In this section, we detail our proposed DL model, the attack model, and the traffic representation method.

3.1 Attack Model

Our attack model consists of an attacker trying to modify the packet size (padding) and IAT (shaping) in such a way as to hide any information that serves for attack detection or traffic classification. The mutations are therefore of two types, padding and shaping [28, 29], as shown in Fig. 1. In the following, we summarize the packet padding techniques listed in [7, 12, 35], with s being the original packet size and m(s) being the packet size after mutation.

Fig. 1. Attack model

Fig. 2. Proposed model

  1. Padding to the Maximum Transmission Unit (MTU): this technique consists of padding all the flow packets to the same size, which is the MTU. In this case, the mutation can be expressed by \(m(s) = MTU\).

  2. Linear padding: this technique consists of linearly padding the flow packet sizes. In this case, the mutation equation can be expressed as follows: \(m(s) = \lceil s/c\rceil \cdot c\), where c is a parameter to choose, and \(\lceil s/c\rceil\) is the ceiling of s/c.

  3. Exponential padding: this technique consists of padding the packet size in an exponential manner, that is, to the next power of two, capped at the MTU. The mutation can be expressed by the following equation: \(m(s) = \min(2^{\lceil \log_{2}(s)\rceil}, MTU)\).

  4. Elephants and mice padding: this technique consists of padding the packet to a certain size c (mice) if the original packet size is less than c. If not, the packet size is padded to the MTU (elephant).

  5. Random padding: this technique consists of padding the packet to a size chosen randomly from the interval \([s, MTU]\). In this case, the mutation function can be expressed as follows: \(m(s) = RAND([s,MTU])\).

  6. Probabilistic padding: this technique assumes that the packet sizes follow a normal distribution. In this case, the mutated size is chosen randomly based on a normal distribution, where the mean \((\mu)\) and standard deviation \((\sigma)\) are computed considering the original traffic packet sizes. In this case, \(m(s) = GAUSS(\mu,\sigma)\) [12].

    For the IAT shaping techniques, we list in the following the ones included in [49] in addition to a new technique proposed in [12]:

  7. Constant IAT: this technique consists of sending the packets at a fixed IAT.

  8. Variable IAT: this technique consists of sending the packets at a variable interval of time randomly chosen from the interval \([I_{1},I_{2}]\).

  9. Probabilistic IAT shaping: this technique assumes that the IAT follows a normal distribution. Thus, the packets are transmitted at an interval chosen randomly based on a normal distribution, where the mean and standard deviation are computed based on the original traffic packet IATs.

    The last method, described in [20], combines shaping and padding.

  10. Fixed size and fixed IAT: this method consists of padding the size of all the flow packets to the MTU and sending all the packets at a fixed time interval.
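The ten mutation techniques above can be sketched as small functions, as shown below. This is an illustrative sketch rather than the implementation used in our experiments: the MTU value of 1500 bytes, the default parameters (c, the fixed interval), the cap of linear padding at the MTU, and the clamping of the Gaussian samples are all assumptions.

```python
import random

MTU = 1500  # assumed Ethernet MTU in bytes

def pad_mtu(s):                        # (1) m(s) = MTU
    return MTU

def pad_linear(s, c=128):              # (2) m(s) = ceil(s / c) * c, capped at MTU (assumption)
    return min(((s + c - 1) // c) * c, MTU)

def pad_exponential(s):                # (3) next power of two, capped at MTU
    return min(1 << (s - 1).bit_length(), MTU)

def pad_mice_elephants(s, c=100):      # (4) small packets to c, others to MTU
    return c if s < c else MTU

def pad_random(s, rng):                # (5) uniform size in [s, MTU]
    return rng.randint(s, MTU)

def pad_gaussian(s, mu, sigma, rng):   # (6) normal sample, clamped to [s, MTU] (assumption)
    return max(s, min(MTU, round(rng.gauss(mu, sigma))))

def shape_constant(n, interval=0.1):   # (7) fixed IAT for n packets
    return [interval] * n

def shape_variable(n, i1, i2, rng):    # (8) uniform IAT in [i1, i2]
    return [rng.uniform(i1, i2) for _ in range(n)]

def shape_gaussian(n, mu, sigma, rng): # (9) normal IAT, clamped at zero (assumption)
    return [max(0.0, rng.gauss(mu, sigma)) for _ in range(n)]

def fixed_size_and_iat(sizes, interval=0.1):  # (10) MTU-sized packets at a fixed IAT
    return [MTU] * len(sizes), [interval] * len(sizes)

rng = random.Random(0)
sizes = [60, 100, 590, 1024]
linear = [pad_linear(s) for s in sizes]
exponential = [pad_exponential(s) for s in sizes]
mice = [pad_mice_elephants(s) for s in sizes]
randomized = [pad_random(s, rng) for s in sizes]
print(linear, exponential, mice)
```

Note that techniques (1), (4), and (10) collapse many distinct sizes onto one or two values, whereas (5), (6), (8), and (9) inject randomness.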

This model considers two types of adversaries, including:

  • Malicious adversary: this type of attackers aims at mutating the attack traffic to evade intrusion detection.

  • Benign adversary: this type of attackers aims at hiding the traffic characteristics to protect the user privacy.

3.2 Deep Learning Model

Our proposed model, illustrated in Fig. 2, consists of two parts: a denoising AE and a discriminator. The denoising AE consists of an encoder and a decoder. The encoder aims at estimating the function f that maps the input space of X to the latent space of the latent variable Z, where Z is a compressed representation of X. The denoising AE is fed with a dataset containing the mutated version \(\tilde{X}\) of the initial data X. The AE aims at minimizing the reconstruction error \(L(X,\tilde{X})\). In our case, the loss function is the MSE function, \(L(X,\tilde{X}) = (X-g(f(\tilde{X})))^2\), where X is the original data and \(g(f(\tilde{X}))\) is the data reconstructed from the mutated input. The discriminator aims at differentiating between normal and abnormal traffic. In fact, the discriminator is trained to classify the mutated data (i.e. the mutated input data) as abnormal and the reconstructed data (i.e. the decoder output) as normal. The output function of the discriminator is a sigmoid function \(p(y = y^{(j)}/x) = 1/(1 + e^{-y^{(j)}})\), where \(y^{(j)}\) is the output class that can take two values: 0 for \(j=0\) and 1 for \(j=1\), and the loss function is a cross entropy function \(L_{D} = 1/m\sum_{j=0}^{1} \left(y^{(j)} \log(y^{(j)}) + (1-y^{(j)}) \log(1- y^{(j)})\right)\).
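The two losses can be illustrated with plain Python as follows. This sketch uses the conventional form of binary cross entropy (a leading minus sign, averaged over samples, with predicted probabilities inside the logarithms) and assumes label 1 for abnormal and 0 for normal; these conventions, the function names, and the toy values are assumptions made for illustration.

```python
import math

def mse(original, reconstructed):
    """Mean squared error between the original data and a reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / len(original)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Conventional binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

# Toy flow features scaled to [0, 1]: original, mutated (padded to maximum), reconstructed.
x = [0.2, 0.4, 0.6]
x_mutated = [1.0, 1.0, 1.0]
x_reconstructed = [0.25, 0.4, 0.55]
degradation = mse(x, x_mutated)
reconstruction = mse(x, x_reconstructed)

# Discriminator targets: mutated input is abnormal (1), decoder output is normal (0).
labels = [1, 0]
good = binary_cross_entropy(labels, [sigmoid(2.0), sigmoid(-2.0)])
bad = binary_cross_entropy(labels, [sigmoid(-2.0), sigmoid(2.0)])
print(reconstruction < degradation, good < bad)
```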

3.3 Proposed Model Workflow

After training the model, the proposed scheme workflow, illustrated in Fig. 3, consists of the following steps: (1) the traffic is passed to the discriminator; (2) if the traffic is normal, it is passed to the classifier of normal traffic; (3) if not, it is passed to the denoiser; (4) after denoising, it is passed again to the discriminator; (5) if normal traffic is detected, it is passed to the classifier; (6) if not, the traffic is detected as abnormal, so it might be obfuscated or attack traffic. The same workflow is repeated for the abnormal traffic to identify the attack type or detect unknown attack traffic.
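The six steps can be summarized by the control flow below. The callables `is_normal`, `denoise`, and `classify` are hypothetical stand-ins for the trained discriminator, denoising AE, and classifier; the toy rules used to exercise the function are assumptions with no relation to the trained models.

```python
def process_flow(flow, is_normal, denoise, classify):
    """Route one flow through discriminator, denoiser, and classifier."""
    if is_normal(flow):                  # steps (1) and (2)
        return ("normal", classify(flow))
    recovered = denoise(flow)            # step (3)
    if is_normal(recovered):             # steps (4) and (5)
        return ("mutated", classify(recovered))
    return ("abnormal", None)            # step (6)

# Toy stand-ins: flows are lists of packet sizes.
is_normal = lambda f: all(s <= 200 for s in f)
denoise = lambda f: [60 if s == 1500 else s for s in f]
classify = lambda f: "sensor"

results = [
    process_flow([60, 80], is_normal, denoise, classify),
    process_flow([1500, 1500], is_normal, denoise, classify),
    process_flow([900, 900], is_normal, denoise, classify),
]
print(results)
```

The intermediate label "mutated" is introduced here only to distinguish flows that became normal after denoising from flows that were normal on the first pass.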

In fact, two cases are considered. First, when training the model on normal traffic, the aim is to detect attack traffic and the mutated one as abnormal. In this case, the denoising aims at recovering the normal traffic for classification. However, when training the model with attack traffic, at the testing phase the aim is to detect unknown attacks and to denoise the mutated attack traffic to know the exact attack type. Note that the classifier will be trained in a supervised mode to classify the traffic based on the IoT device type in the normal traffic case and based on the attack type in the attack traffic case.

Fig. 3. Proposed scheme workflow

3.4 Data Representation

In our case, the data consists of collected network traffic. This traffic is filtered by flows, where a flow is defined as the set of packets having the same source IP address, source port number, destination IP address, destination port number, and protocol (TCP or UDP). For each packet, we extract three features: size (s), IAT (t), and direction (d). For each flow, we consider the first 16 (\(4 \times 4\)) packets in either direction. In fact, as shown in Fig. 4, the extracted features can be visualized as \(4 \times 4\) RGB images, where the RGB coordinates of the \(i^{th}\) pixel are given by the features of the \(i^{th}\) packet [std]. Thus, every feature represents a color channel. In our case, R is given for the size, G is given for the inter-arrival time, and B is given for the direction. This data representation is explained in detail in [38].
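A minimal sketch of this representation is given below. The packet dictionary layout, the normalization constants (`mtu`, `max_iat`), the zero padding of short flows, and the row-major pixel order are assumptions for illustration; the exact procedure is the one described in [38].

```python
from collections import defaultdict

def group_flows(packets):
    """Group packets into bidirectional flows keyed by the 5-tuple of the first packet."""
    flows = defaultdict(list)
    for p in packets:
        fwd = (p["src"], p["sport"], p["dst"], p["dport"], p["proto"])
        rev = (p["dst"], p["dport"], p["src"], p["sport"], p["proto"])
        if rev in flows:
            flows[rev].append((p["time"], p["size"], 1))  # reverse direction
        else:
            flows[fwd].append((p["time"], p["size"], 0))  # forward direction
    return flows

def flow_to_image(flow, side=4, mtu=1500, max_iat=1.0):
    """Map the first side*side packets to a side x side grid of (R, G, B) pixels."""
    pixels = []
    prev_time = None
    for time, size, direction in flow[: side * side]:
        iat = 0.0 if prev_time is None else time - prev_time
        prev_time = time
        # R = size, G = inter-arrival time, B = direction, each scaled to [0, 1].
        pixels.append((min(size / mtu, 1.0), min(iat / max_iat, 1.0), float(direction)))
    pixels += [(0.0, 0.0, 0.0)] * (side * side - len(pixels))  # zero-pad short flows
    return [pixels[r * side:(r + 1) * side] for r in range(side)]

packets = [
    {"src": "10.0.0.2", "sport": 5000, "dst": "10.0.0.9", "dport": 443,
     "proto": "TCP", "time": 0.00, "size": 60},
    {"src": "10.0.0.9", "sport": 443, "dst": "10.0.0.2", "dport": 5000,
     "proto": "TCP", "time": 0.25, "size": 1500},
    {"src": "10.0.0.2", "sport": 5000, "dst": "10.0.0.9", "dport": 443,
     "proto": "TCP", "time": 0.75, "size": 590},
]
flows = group_flows(packets)
image = flow_to_image(next(iter(flows.values())))
print(image[0][:3])
```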

Fig. 4. Data representation

4 Implementation

In this section, we detail the data collection and preprocessing, the model implementation, and the evaluation results.

4.1 Data Collection and Preprocessing

For collecting normal IoT traffic, we used a set of IoT devices, including Wi-Fi-enabled devices and hub-connected devices. The Wi-Fi devices are: D-Link HD 180-Degree Wi-Fi Camera DCS-8200LH, D-Link Wi-Fi Smart Plug DSP-W215, D-Link Wi-Fi Siren DCH-S220, and D-Link Wi-Fi Water Sensor DCH-S160, while the hub-connected devices are components of the Samsung SmartThings Home Monitoring Kit, including a motion sensor, a multi-purpose sensor, and a smart plug. These devices were installed in a home environment and were left to function normally. The Wi-Fi enabled devices are routed through a laptop to the Internet. A bridge is created at this laptop to forward the incoming traffic to the Ethernet interface connected to the home gateway. For the hub-connected devices, the hub is connected to the laptop by Ethernet and this laptop is configured to forward the incoming traffic to the wireless interface connected to the home gateway. In both cases, the traffic is collected at the laptop. One day of traffic for each device was considered for training and one day of traffic was used for testing.

For IoT attack traffic training data, we consider the dataset collected in [26]. This dataset consists of multiple PCAP files for each type of attack. We choose one file for each type of attack for training and one file for testing.

To preprocess the data and extract the flows, the dpkt Python library was used [1]. In total, for training, we have 10320 flows of normal traffic and 3308 flows of attack traffic. For testing, we have 258 flows of normal traffic, 350 flows of attack traffic, and 2000 flows of (unknown) attack traffic. The normal traffic is categorized into five classes based on the device model while the attack traffic is categorized into three classes: data theft, denial of service and scanning.

4.2 Model Implementation and Training

The proposed model was implemented in Python using the TensorFlow library [2]. The encoder, the decoder, and the discriminator each consist of a 2-layer fully connected neural network with 1000 neurons per layer. The output layer of both the decoder and the discriminator is a sigmoid layer; the decoder output has the same dimension as the input, whereas the discriminator output is of dimension one. The model is trained for 100 epochs with a batch size of 100, a learning rate of \(10^{-3}\), and an exponential decay rate of 0.9 for the first-moment estimate (beta1). The Rectified Linear Unit (ReLU) was used as the activation function for all hidden layers, and the weights are optimized using the Adam optimizer. For generating mutated data, we implement the mutation techniques (1) to (10) listed in Sect. 3.1. The training data is randomly mutated: each data sample is mutated with one of the 10 mutation techniques, chosen uniformly at random.
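The random mutation of the training data can be sketched as follows. The helper `mutate_dataset` and the two stand-in mutation functions are hypothetical; in practice the list would hold the ten techniques of Sect. 3.1.

```python
import random

def mutate_dataset(samples, mutations, seed=0):
    """Pair each clean sample with a mutated copy, picking one mutation uniformly at random."""
    rng = random.Random(seed)
    pairs = []
    for x in samples:
        mutate = rng.choice(mutations)   # uniform choice among the techniques
        pairs.append((mutate(x), x))     # (denoiser input, reconstruction target)
    return pairs

# Stand-in mutations acting on size features scaled to [0, 1].
pad_to_max = lambda x: [1.0] * len(x)
double_size = lambda x: [min(1.0, 2 * v) for v in x]

samples = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
pairs = mutate_dataset(samples, [pad_to_max, double_size])
print(len(pairs))
```

Keeping the clean sample as the target is what makes the AE a denoiser: it sees the mutated input but is penalized against the original.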

For testing the effectiveness of the denoising process on the classification accuracy, we implement a CNN classifier with three convolutional layers, two max pooling layers, and two fully connected layers with one dropout layer with 50% dropout probability. Similarly, ReLU is applied for activation in the hidden layers, and the Adam algorithm is used for optimization with a learning rate of \(10^{-3}\). In addition, we apply cross validation for the classifier training with 4 folds. The performance metric used to evaluate the classifier is the accuracy, which is the ratio of the correctly classified samples to the total number of samples.
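The accuracy metric and a 4-fold split can be written as in the sketch below. It splits indices into contiguous folds without shuffling, which is a simplifying assumption; the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Ratio of correctly classified samples to the total number of samples."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def k_fold_indices(n_samples, k=4):
    """Return k (train_indices, validation_indices) pairs covering all samples once."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, validation))
        start = stop
    return folds

score = accuracy([0, 1, 2, 1], [0, 1, 1, 1])
folds = k_fold_indices(10, k=4)
print(score, [len(v) for _, v in folds])
```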

4.3 Evaluation and Results

For each of the experiments (1) to (10), the corresponding mutation technique is applied to the testing data. First, the traffic is passed to the discriminator. Then, it is passed to the denoising AE. Finally, the mutated and denoised traffic are passed to a classifier that classifies the flow based on the device model for normal traffic and based on the attack type for attack traffic.

Samples of the mutated traffic and the resulting images after denoising are included in Fig. 5. It is visually clear from the included images that the denoiser succeeds in learning the original data representation regardless of the mutation level. The difference between the denoised images and the original ones shows that the model does not overfit the data but rather learns the representation and the noise pattern.

For the normal traffic, the denoiser succeeds in recovering flows very similar to the original ones for most of the mutation techniques, except for the mutations (8) and (9). Similarly, for the attack traffic, the denoiser recovers the main flow representation, except for the mutation (2). In Table 1, the MSE between the original data and the mutated one (mutation degradation), the reconstruction loss, and the abnormality detection rate are reported for the normal traffic and the attack traffic.

Table 1. Testing evaluation results

The results present a high detection rate for the different mutation techniques, except for the mutation techniques (4), (6), and (9). This can be explained by the fact that (6) and (9) are normal distribution based mutations. In these cases, the main traffic characteristics remain unchanged, which makes it harder to differentiate between the original and mutated traffic. (4) is a mice-elephant mutation of the packet sizes. However, in our case (i.e. IoT traffic), most of the packets are of small size and therefore the mutation (4) has a limited effect on the traffic characteristics. Moreover, it can be noticed that the MSE between the recovered data and the original one (reconstruction loss) is lower than the MSE between the mutated data and the original one (mutation degradation) in most of the cases. This means that the denoising process decreases the degradation effect by reconstructing a version of the data that is closer to the original one.

Fig. 5. Visualized images representing network traffic

Table 2 presents the accuracy of the classification based on the traffic label: device model for normal traffic and attack type for attack traffic. The mutation techniques (1), (4), (7), and (10) noticeably reduce the accuracy and mislead the classification in the normal traffic case; after denoising, the accuracy increases. However, for the techniques (5), (6), (8), and (9), the denoiser fails to reconstruct the original traffic. This is due to the fact that the randomness creates denoised traffic of a random type. Moreover, for the mutation technique (2), the mutation is linear and does not affect the CNN classifier accuracy, since the CNN is immune to the linear degradation of the image. However, for statistical based machine learning methods, the mutation techniques (2), (5), and (6) noticeably affect the classification accuracy. To see the effect of the mutation on the statistical based classification, we include the results of the statistical based classification of the mutated and denoised traffic in Table 2. The statistical classification uses a subset of the Moore features (see Table 3) and Random Forest (RF) as the classifier. It can be noticed that overall our representation method with the CNN classifier outperforms the RF method before and after denoising in the normal traffic case. In the attack traffic case, however, our method gives better results only after denoising.

Table 2. Classification results
Table 3. Feature set for statistical classification

5 Discussion and Future Work

The results show that unsupervised DL architectures are very powerful in learning the data representation. AE is a well-known DL generative model that is highly effective in extracting a compressed representation from image-type data. Consequently, after representing network traffic as images, we applied AE to extract the representation patterns from IoT traffic. Moreover, the capability of AE to denoise images is used to overcome the mutation technique challenges. The results show the effectiveness of the proposed method in recovering the original traffic representation for different levels of mutation, some of which are rather severe (e.g. fixed packet size and IAT). In fact, the considered mutation techniques cause any statistical classifier to fail. However, one limitation of this work is that we assume that the mutation technique used by the attacker is known ahead of time, so that we can decide whether to apply the classifier before or after denoising. While we considered mutation techniques in this paper, we will study additional obfuscation techniques as future work. In the morphing case for example, in addition to the abnormal traffic detection using the discriminative part of the GAN architecture, we also need to consider the generative part. The generator will be trained to generate morphed traffic similar to the original one. In this case, the model convergence will not be a straightforward task.

6 Conclusion

Applying ML for network management and network security has gained a lot of interest in the last decade. However, new traffic obfuscation techniques have been developed to thwart classification and avoid detection for the sake of user privacy. These techniques have also been employed by attackers to mount their attacks without being detected. While the obfuscation techniques might be detected by behavioral or statistical ML techniques, the recovery of the initial traffic is infeasible with such techniques. In this paper, inspired by a promising DL application, namely image denoising, we transform the traffic into images and then combine two well-established DL architectures (AE and GAN) to reconstruct the original traffic and detect the abnormal one. The test results show the effectiveness and robustness of the proposed model in detecting abnormal traffic in all its variants.