
1 Introduction

Deep learning has provided solutions to many problems that previously posed major challenges to artificial intelligence. Thanks to great advances in computing power and the availability of vast amounts of data, it has been deployed in numerous applications such as computer vision, natural language processing and many other domains [1], and much more complex and advanced algorithms can now be trained [2].

In the cybersecurity domain, anomaly detection is a critical mechanism for threat detection, and network behavior anomaly detection is a complementary technology to systems that detect security threats based on packet signatures. Network anomaly detection, the continuous monitoring of a network for unusual events or trends, is an ideal platform on which to apply deep learning. Deep anomaly detection [3] has thus seen rapid development, including self-taught-learning-based deep learning [4] and the deep autoencoding Gaussian mixture model (DAGMM) [5]. Deep learning significantly increases model complexity, resulting in substantial improvements in detection accuracy. However, deep learning requires tremendous amounts of data to be well trained, and supervised learning not only requires large amounts of data but also requires that the data be labelled.

The DAGMM proposed in [5] significantly improves the F1 score compared to other state-of-the-art methods, including deep learning methods. It employs dimensionality reduction and feature embedding via an autoencoder-based compression network, and then performs density estimation via an estimation network. The entire process is unsupervised, and hence no labelled data are needed for training the networks. However, DAGMM still requires a huge amount of data to train its models.

Data availability poses a huge challenge for deep learning systems. Compounding the problem, not everyone has amassed huge amounts of data, and even those who have may find that the data are not labelled or cannot be shared collectively due to privacy and security concerns. Herein lies the interest in federated learning, where model parameters instead of training data are exchanged through a centralized master model in a secure manner [6], thereby preserving the privacy of individual datasets as well as alleviating the challenge of limited datasets.

This paper proposes the federated learning assisted deep autoencoding Gaussian mixture model (FDAGMM) for network anomaly detection in scenarios where there are insufficient data to train deep learning models. FDAGMM can thus remedy the poor performance of DAGMM caused by limited datasets, and its superiority is empirically demonstrated with extensive experiments. In industry scenarios, the presented solution is expected to solve the problem of lacking training data that individual organizations are unwilling to share in a centralized mode.

2 Related Work

2.1 Anomaly Detection

Network Anomaly Detection, also called Network Intrusion Detection, has been studied for over 20 years [7]. Network intrusion detection systems are either signature (rule)-based or anomaly-based. The former uses patterns of well-known attacks or weak spots of the system to identify intrusions whereas the latter uses machine learning methods to determine whether the deviation from the established normal usage patterns can be flagged as intrusions [8].

Machine learning methods applied to network anomaly detection include genetic algorithms, support vector machines (SVM), self-organizing maps (SOM), random forests, XGBoost, KNN, naive Bayes networks, etc. However, many suffer from low accuracy and high False Positive Rate (FPR). More recently, there has been research in deep anomaly detection such as [3] and [4]. The use of deep learning significantly increases model complexity and results in substantial performance improvements in terms of various metrics, such as F1 score, accuracy, precision and False Positive Rate (FPR). However, the more complex the model, the more labelled training data it needs to be well trained.

The Deep Autoencoding Gaussian Mixture Model (DAGMM) was recently proposed in [5] and produces good results with no need to label the training data. The model consists of a compression network and an estimation network, which are trained end-to-end instead of using decoupled two-stage training and the standard Expectation-Maximization (EM) algorithm. The compression network is an autoencoder that embeds the features into a low-dimensional representation and meanwhile yields the reconstruction error for each input data point; both are further fed into the estimation network, which acts as a Gaussian Mixture Model (GMM). DAGMM outperforms many state-of-the-art methods in terms of F1 score. However, even though it does not require labelled training data, it demands a large amount of unlabelled data, which normal users do not have or are unwilling or unable to share.

KDDCUP 99 [9] has been the most widely used dataset for the evaluation of anomaly detection methods since it was prepared by [10] in 1999. It contains 5 million simulated tcpdump connection records, each of which has 41 features and is labelled as either normal or an attack, with exactly one specific attack type. The attacks fall into the following four main categories:

  • DoS attack: denial-of-service, e.g. smurf;

  • R2L: unauthorized access from a remote machine, e.g. guessing password;

  • U2R: unauthorized access to local superuser (root) privileges, e.g. various “buffer overflow” attacks;

  • Probing: surveillance and other probing, e.g. port scanning.
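As a concrete illustration of how the raw labels relate to these categories, the following minimal Python sketch maps the example attacks named above to their categories. The label strings (including the trailing dot used in common KDDCUP 99 distributions) and the helper itself are assumptions for illustration, not part of the original study.

```python
# Minimal sketch: map raw KDDCUP 99 label strings to the four attack
# categories listed above. Only the example attacks mentioned in the text
# are covered; a full mapping would enumerate every label in the dataset.
CATEGORY_OF = {
    "smurf": "DoS",                # denial-of-service
    "guess_passwd": "R2L",         # unauthorized access from a remote machine
    "buffer_overflow": "U2R",      # unauthorized access to root privileges
    "portsweep": "Probing",        # surveillance / port scanning
}

def categorize(label: str) -> str:
    """Return the coarse category for a raw label such as 'smurf.'."""
    label = label.rstrip(".")
    if label == "normal":
        return "normal"
    return CATEGORY_OF.get(label, "unknown")

print(categorize("smurf."))        # -> 'DoS'
```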

2.2 Federated Learning

Federated learning was proposed by Konečný et al. [11, 12] to exploit the privacy-sensitive data stored on mobile devices to boost the power of various deep/machine learning models. In typical federated learning (FL), clients, e.g. smart phones, suffer from unstable communication. In addition, their data are unbalanced, non-IID and massively distributed. These features distinguish FL from conventional distributed machine learning [13, 14].

Fig. 1. Overview of federated learning.

As shown in Fig. 1, federated learning (FL) involves two components, i.e. central aggregation and local training, with K clients indexed by k. The whole process is divided into communication rounds, in which clients are synchronously trained with local stochastic gradient descent (SGD) on their datasets \(\mathcal {P}_k\). The central server receives parameters \(\omega ^k\) from the local clients, where \(k \in S\) and S refers to the participating subset of m clients in each communication round. These updated parameters are then aggregated [11, 12, 15].
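Concretely, the aggregation step can be written as the sample-size-weighted average used in FedAvg-style FL [11, 12, 15] and in Algorithm 1 below, where \( n_k = |\mathcal {P}_k| \):

$$\begin{aligned} { \omega _{t+1}=\sum _{k \in S} \frac{n_{k}}{n}\, \omega _{t+1}^{k}, \qquad n=\sum _{k \in S} n_{k} }\end{aligned}$$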

The setting of federated learning follows the principle of focused collection or data minimization proposed in the White House report [16]. In this setting, local models are trained on data that are stored on the clients. The server does not need the training data, which contain private information. Only client parameters are sent to the server, and they are aggregated to obtain the central model. After communication, clients receive the aggregated updates as the initial parameters of the subsequent training. Then, their models are retrained on private data with local stochastic gradient descent (local SGD) [11, 12, 17, 18].

In the typical FL that is executed on mobile devices, clients suffer from unstable communication. Hence it is reasonable to assume that only a subset of all clients, i.e. the aforementioned participating subset, is ready to get involved in each communication round. However, in our proposed federated deep autoencoding Gaussian mixture model (FDAGMM) for anomaly detection, the assumption does not hold anymore since the clients here commonly refer to companies or organizations with cutting-edge equipment and facilities.

2.3 Deep Autoencoding Gaussian Mixture Model

Two-step approaches that sequentially conduct dimensionality reduction and density estimation are widely used since they alleviate the curse of dimensionality to some extent [19]. Although fruitful progress has been achieved, these approaches suffer from a decoupled training process, inconsistent optimization goals, and the loss of valuable information caused by the first step, i.e. the dimensionality reduction. Motivated by these issues, Zong et al. proposed the Deep Autoencoding Gaussian Mixture Model (DAGMM) [5].

Fig. 2. Overview of deep autoencoding Gaussian mixture model.

As shown in Fig. 2, DAGMM consists of a compression network and an estimation network. It works as follows:

  • Dimensionality Reduction: Given the raw features of a sample \( \mathbf {x} \), the compression network, which is a deep autoencoder, conducts dimensionality reduction to output its low-dimensional representation \( \mathbf {z} \) as follows:

    $$\begin{aligned} { \begin{aligned} \mathbf {z}_{c}&=h\left( \mathbf {x} ; \theta _{e}\right) \\ \mathbf {x}^{\prime }&=g\left( \mathbf {z}_{c} ; \theta _{d}\right) \\ \mathbf {z}_{r}&=f\left( \mathbf {x}, \mathbf {x}^{\prime }\right) \\ \mathbf {z}&=\left[ \mathbf {z}_{c}, \mathbf {z}_{r}\right] \\ \end{aligned} }\end{aligned}$$
    (1)

    where \( \theta _{e} \) and \( \theta _{d} \) are the parameters of the encoder and decoder respectively, \( \mathbf {x}^{\prime } \) is the reconstruction of \( \mathbf {x} \) generated by the autoencoder, \( \mathbf {z}_{c} \) is the learned low-dimensional representation, and \( \mathbf {z}_{r} \) denotes the features derived from the reconstruction error. \( h(\cdot ) \), \( g(\cdot ) \), and \( f(\cdot ) \) denote the encoding, decoding and reconstruction-error calculation functions respectively.

  • Density Estimation: The subsequent estimation network takes \( \mathbf {z} \) from the compression network as its input. It performs density estimation with a Gaussian Mixture Model (GMM). A multi-layer neural network, denoted as \( M L N(\cdot ) \), is adopted to predict the mixture membership for each sample as follows:

    $$\begin{aligned} { \begin{aligned} \mathbf {p}&=M L N\left( \mathbf {z} ; \theta _{m}\right) \\ \hat{\gamma }&={\text {softmax}}(\mathbf {p}) \\ \end{aligned} }\end{aligned}$$
    (2)

    where \( \theta _{m} \) corresponds to the parameters of MLN, K indicates the number of mixture components, and \( \hat{\gamma } \) is a K-dimensional vector for the soft mixture-component membership prediction. Given the batch size N, for all \(1 \le k \le K\), the parameters of the GMM are further estimated as follows:

    $$\begin{aligned} { \begin{aligned} \hat{\phi }_{k}&=\sum _{i=1}^{N} \frac{\hat{\gamma }_{i k}}{N} \\ \hat{\mu }_{k}&=\frac{\sum _{i=1}^{N} \hat{\gamma }_{i k} \mathbf {z}_{i}}{\sum _{i=1}^{N} \hat{\gamma }_{i k}} \\ \hat{\mathbf {\Sigma }}_{k}&=\frac{\sum _{i=1}^{N} \hat{\gamma }_{i k}\left( \mathbf {z}_{i}-\hat{\mu }_{k}\right) \left( \mathbf {z}_{i}-\hat{\mu }_{k}\right) ^{T}}{\sum _{i=1}^{N} \hat{\gamma }_{i k}} \\ \end{aligned} }\end{aligned}$$
    (3)

    where \(\hat{\gamma }_{i}\) is the membership prediction, and \(\hat{\phi }_{k}\), \(\hat{\mu }_{k}\), \(\hat{\mathbf {\Sigma }}_{k}\) are the mixture probability, mean, and covariance for component k in GMM respectively. Furthermore, sample energy can be inferred as:

    $$\begin{aligned} { E(\mathbf {z})=-\log \left( \sum _{k=1}^{K} \hat{\phi }_{k} \frac{\exp \left( -\frac{1}{2}\left( \mathbf {z}-\hat{\mu }_{k}\right) ^{T} \hat{\mathbf {\Sigma }}_{k}^{-1}\left( \mathbf {z}-\hat{\mu }_{k}\right) \right) }{\sqrt{\left| 2 \pi \hat{\mathbf {\Sigma }}_{k}\right| }}\right) }\end{aligned}$$
    (4)

    where \( |\cdot | \) denotes the determinant of a matrix.

Based on these three components, i.e. the reconstruction error of the autoencoder \(L\left( \mathbf {x}_{i}, \mathbf {x}_{i}^{\prime }\right) \), the sample energy \(E\left( \mathbf {z}_{i}\right) \), and a penalty term \(P(\hat{\mathbf {\Sigma }})\), the objective function of DAGMM is then constructed as:

$$\begin{aligned} { J\left( \theta _{e}, \theta _{d}, \theta _{m}\right) =\frac{1}{N} \sum _{i=1}^{N} L\left( \mathbf {x}_{i}, \mathbf {x}_{i}^{\prime }\right) +\frac{\lambda _{1}}{N} \sum _{i=1}^{N} E\left( \mathbf {z}_{i}\right) +\lambda _{2} P(\hat{\mathbf {\Sigma }}) }\end{aligned}$$
(5)
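To make the data flow of Eqs. (1)–(5) concrete, the following PyTorch sketch mirrors the compression network, the estimation network, the GMM parameter estimation and the sample energy. Layer sizes, `latent_dim`, `n_gmm` and the weights `lam1`/`lam2` are illustrative placeholders rather than the configuration of Table 3, and the two reconstruction-error features follow the choices reported in [5].

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class DAGMMSketch(nn.Module):
    """Compression and estimation networks of Eqs. (1)-(2); layer sizes,
    latent_dim and n_gmm are illustrative placeholders, not Table 3."""

    def __init__(self, in_dim, latent_dim=1, n_gmm=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 60), nn.Tanh(),
                                     nn.Linear(60, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 60), nn.Tanh(),
                                     nn.Linear(60, in_dim))
        # MLN(.) predicting soft mixture memberships, Eq. (2).
        self.estimator = nn.Sequential(nn.Linear(latent_dim + 2, 10), nn.Tanh(),
                                       nn.Linear(10, n_gmm), nn.Softmax(dim=1))

    def forward(self, x):
        z_c = self.encoder(x)                               # Eq. (1)
        x_rec = self.decoder(z_c)
        # z_r: two reconstruction-error features (relative Euclidean distance
        # and cosine similarity, as in [5]).
        rel = (x - x_rec).norm(dim=1, keepdim=True) / (x.norm(dim=1, keepdim=True) + 1e-12)
        cos = F.cosine_similarity(x, x_rec, dim=1).unsqueeze(1)
        z = torch.cat([z_c, rel, cos], dim=1)
        gamma = self.estimator(z)                           # soft memberships
        return x_rec, z, gamma


def gmm_energy(z, gamma, eps=1e-6):
    """GMM parameters (Eq. 3) and per-sample energy (Eq. 4)."""
    d = z.size(1)
    phi = gamma.mean(dim=0)                                           # (K,)
    mu = (gamma.t() @ z) / (gamma.sum(dim=0).unsqueeze(1) + eps)      # (K, d)
    diff = z.unsqueeze(1) - mu.unsqueeze(0)                           # (N, K, d)
    sigma = torch.einsum('nk,nkd,nke->kde', gamma, diff, diff) \
        / (gamma.sum(dim=0).view(-1, 1, 1) + eps)                     # (K, d, d)
    sigma = sigma + eps * torch.eye(d)                                # numerical stability
    expo = -0.5 * torch.einsum('nkd,kde,nke->nk', diff, torch.inverse(sigma), diff)
    log_norm = 0.5 * torch.logdet(2 * math.pi * sigma)                # log sqrt(|2*pi*Sigma|)
    energy = -torch.logsumexp(expo - log_norm + torch.log(phi + eps), dim=1)
    return energy, sigma


def dagmm_objective(x, x_rec, z, gamma, lam1=0.1, lam2=0.005):
    """Objective of Eq. (5); lam1 and lam2 are placeholder weights."""
    energy, sigma = gmm_energy(z, gamma)
    recon = ((x - x_rec) ** 2).sum(dim=1).mean()                      # L(x, x')
    penalty = (1.0 / torch.diagonal(sigma, dim1=-2, dim2=-1)).sum()   # P(Sigma_hat)
    return recon + lam1 * energy.mean() + lam2 * penalty
```

At inference time, samples whose energy \( E(\mathbf {z}) \) exceeds a chosen threshold are flagged as anomalies.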

3 Federated Deep Autoencoding Gaussian Mixture Model

As discussed in the previous sections, limited data samples lead to the performance deterioration of DAGMM [5]. Therefore, the motivation of the proposed FDAGMM is to address this issue by extending the data sources while preserving the data privacy of the individual clients. Under the framework of federated learning (FL), not only can FDAGMM improve its performance with more data, but privacy is also appropriately protected.

The rest of Sect. 3 is divided into two subsections, namely server execution and client update. The two main components of FDAGMM are introduced in the form of pseudo-code. FDAGMM shares most of its notation with FL, except that \( \omega \) is replaced with \( \theta \) to be consistent with DAGMM, as shown below:

$$\begin{aligned} { \begin{aligned} \theta = \{ \theta _{\{e, d\}}, \theta _m \} \end{aligned} }\end{aligned}$$
(6)

where the parameters of the autoencoder include those of the encoder and the decoder, i.e. \( \theta _{\{e, d\}} = \{ \theta _e, \theta _d \} \), and those of the estimation network are denoted as \( \theta _{m} \). Moreover, superscripts and subscripts are adopted to specify the client k and the communication round t to which the parameters belong, i.e. \( \theta ^k_t \).

3.1 Server Execution

Algorithm 1 shows the Server Execution that is carried out on the central server. It consists of an initialization operation followed by T communication rounds. In the initialization step (Line 2), \( \theta _0 \) is initialized.

Under the FL framework, the training process consists of communication rounds indexed by t (Line 3). Lines 4–6 call the sub-function Client Update for the K clients in parallel. Then, in Line 7, aggregation is performed to update \( \theta \), the parameters of the central model. Here \( n_k \) and n indicate the number of instances belonging to client k and the total number of all involved samples, respectively.

Algorithm 1. Server Execution.
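A minimal Python sketch of this server loop is given below. It assumes that each client holds a local DAGMM and a private dataset, and that `client_update` implements Algorithm 2 (a corresponding sketch is given in Sect. 3.2); it illustrates the weighted aggregation rule rather than reproducing the exact implementation.

```python
import copy

def server_execution(global_model, local_models, local_datasets, rounds, client_update):
    """Sketch of Algorithm 1: initialize the central parameters, then run
    T communication rounds of client updates followed by weighted aggregation."""
    theta = copy.deepcopy(global_model.state_dict())         # Line 2: initialize theta_0
    n_k = [len(d) for d in local_datasets]                   # samples per client
    n = float(sum(n_k))                                      # total number of samples
    for t in range(rounds):                                  # Line 3: communication rounds
        updates = [client_update(k, copy.deepcopy(theta), local_models[k], local_datasets[k])
                   for k in range(len(local_models))]        # Lines 4-6 (parallel in practice)
        # Line 7: theta <- sum_k (n_k / n) * theta^k
        theta = {name: sum((n_k[k] / n) * updates[k][name] for k in range(len(updates)))
                 for name in theta}
    global_model.load_state_dict(theta)
    return global_model
```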

3.2 Client Update

Algorithm 2. Client Update.

Client Update (Algorithm 2) takes k and \(\theta \) as its input, where k indexes a specific client and \( \theta \) denotes the parameters of the central model in the current round. E denotes the number of local epochs. Line 2 splits the data into batches, whereas Lines 3–7 train the local DAGMM batch by batch with the private data stored on each client. \( \eta \) denotes the learning rate and \( J(\cdot ) \) is the loss function, whose definition is detailed in Eq. (5). Line 8 returns the updated local parameters.
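A corresponding sketch of the client side is given below, reusing `DAGMMSketch` and `dagmm_objective` from the sketch in Sect. 2.3. The choice of optimizer, the number of local epochs E, the batch size and the learning rate \( \eta \) are placeholders, and `local_data` is assumed to be a `TensorDataset` of pre-processed feature vectors.

```python
import torch
from torch.utils.data import DataLoader

def client_update(k, theta, model, local_data, epochs=1, batch_size=1024, lr=1e-4):
    """Sketch of Algorithm 2: load the central parameters, run local SGD on
    client k's private data, and return the updated local parameters."""
    model.load_state_dict(theta)                             # receive central parameters
    opt = torch.optim.SGD(model.parameters(), lr=lr)         # eta: learning rate
    batches = DataLoader(local_data, batch_size=batch_size, shuffle=True)  # Line 2
    model.train()
    for _ in range(epochs):                                  # E local epochs
        for (x,) in batches:                                 # Lines 3-7: batch-wise local SGD
            opt.zero_grad()
            x_rec, z, gamma = model(x)
            loss = dagmm_objective(x, x_rec, z, gamma)       # J(.) of Eq. (5)
            loss.backward()
            opt.step()
    return model.state_dict()                                # Line 8: return local parameters
```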

Fig. 3. Aggregation of federated deep autoencoding Gaussian mixture model.

Figure 3 shows an example illustrating the aggregation of FDAGMM. The abscissa and ordinate denote the communication round and the local client respectively. Two local devices, i.e. Client 0 and Client 1, and a server are involved. The cross located at (Client 0; T) indicates that Client 0 participates in updating the central model in round T.

4 Experimental Results and Analysis

4.1 Experimental Design

Due to the lack of public datasets for anomaly detection, especially for Intrusion Detection Systems (IDSs), very few datasets can be directly adopted in the evaluation of the proposed FDAGMM.

We perform extensive experiments with a public dataset KDDCUP 99, which is the most widely used in the evaluation of various anomaly detection approaches and systems. Table 1 shows the statistics of KDDCUP 99.

Table 1. Statistics of KDDCUP 99
Fig. 4. Stacked bar-charts of the attack types belonging to the two clients.

Data Pre-processing. It is therefore necessary to construct datasets that simulate the private data of individual clients and fulfill the associated requirements. In the FL setting, the private datasets stored on the clients are non-IID and unbalanced. In the experiments, the whole KDDCUP 99 dataset is split into two sets belonging to Client 1 and Client 2 by selecting records from the complete KDDCUP 99. These two clients play the roles of two companies, where Client 1, with limited data, asks for help from Client 2 under the framework of FDAGMM.

Attack instances are separated according to their types. Each training set includes only half of the attack samples belonging to its client, while each test set consists of normal instances and the other half of the attack instances. The data belonging to Client 1 have an anomaly ratio similar to that of the other client. Since there is a very sharp distinction between rare and common attacks, as reflected in Fig. 4, smurf and neptune are treated as Common Attacks and the rest comprise the Rare Attacks. The details are shown in Table 2. The experiments are expected to show that Client 1 can improve its performance with the help of Client 2.

Table 2. Common and rare attacks

As shown in Fig. 4(a), the scenario in which Client 1 holds 1% of the common and 100% of the rare attack instances (bars in orange) and Client 2 holds the rest, i.e. 99% of the common attacks (bars in blue), is denoted as Scenario 1. The scenarios corresponding to the remaining three sub-figures, (b) to (d), are denoted as Scenario 2, Scenario 3 and Scenario 4 respectively.
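A Scenario-1-style split of the attack records could be realized along the lines of the following sketch. The `label` column name, the label strings with trailing dots and the sampling procedure are assumptions for illustration; the handling of normal records and the half/half train-test split described above are omitted here.

```python
import pandas as pd

COMMON = {"smurf.", "neptune."}                 # common attacks (Table 2)

def split_clients(attacks: pd.DataFrame, common_share_c1=0.01, seed=0):
    """Client 1 keeps all rare attacks plus a small share of the common ones;
    Client 2 receives the remaining common attacks (Scenario 1)."""
    common = attacks[attacks["label"].isin(COMMON)]
    rare = attacks[~attacks["label"].isin(COMMON)]
    c1_common = common.sample(frac=common_share_c1, random_state=seed)
    client1 = pd.concat([rare, c1_common])      # 100% rare + 1% common
    client2 = common.drop(c1_common.index)      # remaining 99% common
    return client1, client2
```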

Table 3. Default setting of DAGMM

Parameter Setting of DAGMM. For a fair comparison, all experiments adopt the default DAGMM setting as the hyper-parameters of the local models in the proposed FDAGMM. The settings are summarized in Table 3.

4.2 Experiments on KDDCUP 99

The experiments are designed to evaluate the effectiveness of the proposed FDAGMM. According to their attack types, i.e. rare or common, the instances making up each training set are distributed between the two clients. Four scenarios with distinct combinations of attacks are considered, as illustrated in Fig. 4. Three metrics are adopted to measure the performance of the compared algorithms: Precision, Recall and F1-Score.
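For reference, the sketch below shows one common way of turning the sample energies of Eq. (4) into these three metrics, by flagging the samples with the highest energies as anomalies; the threshold fraction is a placeholder and not necessarily the procedure used in these experiments.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def evaluate(energies, labels, anomaly_fraction=0.2):
    """Flag the top `anomaly_fraction` of sample energies as anomalies and
    report Precision, Recall and F1-Score against ground truth (1 = attack)."""
    threshold = np.percentile(energies, 100 * (1 - anomaly_fraction))
    preds = (np.asarray(energies) > threshold).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary")
    return precision, recall, f1
```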

All the experiments are run independently five times, each time with a fixed random seed. The average (AVG) F1-Score values are presented in Fig. 5. Tables 4 and 5 show the AVG and standard deviation (STDEV) of the precision and recall values respectively. In the tables, the AVG value is listed first, with the standard deviation in parentheses.

The names of the algorithms compared in Tables 4 and 5 indicate not only the adopted technique but also the training set involved, i.e. training data from Client 0 or Client 1.

  • DAGMM_C0 employs DAGMM and trains the model with only data from Client 0. Test set comprises half of the attack instances belonging to Client 0 and the corresponding proportion of normal samples.

  • DAGMM_C1 employs DAGMM and trains the model with only data from Client 1. Test set comprises half of the attack instances belonging to Client 1 and the corresponding proportion of normal samples. DAGMM_C1 thus complements the training on the limited data of Client 0 to increase the detection performance on Client 0 without compromising the privacy of its data.

  • DAGMM_C0&C1 employs DAGMM and trains the model with a mixture of the data from the two clients. Test set comprises half of the attack instances belonging to Client 0 and the corresponding proportion of normal samples. DAGMM_C0&C1 (Ideal Bound) denotes the performance limit of FDAGMM. This is the ideal scenario in which both clients are willing to share their data so that DAGMM can be trained to achieve the best performance.

  • FDAGMM involves two clients, each of which employs a DAGMM and trains the model with only its own data. Test set comprises half of the attack instances belonging to Client 0 and the corresponding proportion of normal samples. The performance of FDAGMM reflects how much help Client 0 can receive from Client 1 under the FL framework, in other words, how closely the presented FDAGMM can approach the ideal performance limit, i.e. DAGMM_C0&C1 (Ideal Bound), where there is an abundance of data for training and the clients are willing to contribute their data for collective training of the DAGMM.

Table 4. Comparative studies on FDAGMM: precision.
Table 5. Comparative studies on FDAGMM: recall.
Fig. 5. Comparative studies on FDAGMM on KDDCUP 99.

Based on the results shown in these tables and figures, the following observations can be made:

  • The proposed FDAGMM outperforms DAGMM on all metrics, i.e. F1-Score, Precision and Recall for all four scenarios.

  • As indicated by its lower STDEV values, FDAGMM's performance is more stable than that of DAGMM.

  • The more Non-IID and unbalanced the data distribution is across clients, the more challenging the scenario tends to be, which is reflected by the blue dotted lines corresponding to DAGMM_C0&C1 in Fig. 5.

5 Conclusion

Under the FL framework, with the help of other clients holding sufficient records with similar features, we show that the unsatisfactory performance of DAGMM caused by limited datasets can be addressed and improved using FDAGMM. Empirical studies comparing the performance of the proposed FDAGMM and DAGMM under four distinct scenarios demonstrate the superiority of FDAGMM in terms of all the associated performance metrics.

This study follows the assumption that all local models adopt the same neural network architecture and share the same hyperparameters, which implies that all the involved data records share the same feature structure. This renders FDAGMM less versatile for deployment in other application domains. In future research, we will develop a new federated learning assisted DAGMM to address this weakness.