Deep Gaussian Process autoencoders for novelty detection
Abstract
Novelty detection is one of the classic problems in machine learning that has applications across several domains. This paper proposes a novel autoencoder based on Deep Gaussian Processes for novelty detection tasks. Learning the proposed model is made tractable and scalable through the use of random feature approximations and stochastic variational inference. The result is a flexible model that is easy to implement and train, and can be applied to general novelty detection tasks, including large-scale problems and data with mixed-type features. The experiments indicate that the proposed model achieves competitive results with state-of-the-art novelty detection methods.
Keywords
Novelty detection, Deep Gaussian Processes, Autoencoder, Unsupervised learning, Stochastic variational inference
1 Introduction
Novelty detection is a fundamental task across numerous domains, with applications in data cleaning (Liu et al. 2004), fault detection and damage control (Dereszynski and Dietterich 2011; Worden et al. 2000), fraud detection related to credit cards (Hodge and Austin 2004) and network security (Pokrajac et al. 2007), along with several medical applications such as brain tumor (Prastawa et al. 2004) and breast cancer (Greensmith et al. 2006) detection. Novelty detection targets the recognition of anomalies in test data which differ significantly from the training set (Pimentel et al. 2014), so this problem is also known as “anomaly detection”. Challenges in performing novelty detection stem from the fact that labelled data identifying anomalies in the training set is usually scarce and expensive to obtain, and that very little is usually known about the distribution of such novelties. Meanwhile, the training set itself might be corrupted by outliers and this might impact the ability of novelty detection methods to accurately characterize the distribution of samples associated with a nominal behavior of the system under study. Furthermore, there are many applications, such as the ones that we study in this work, where the volume and heterogeneity of data might pose serious computational challenges to react to novelties in a timely manner and to develop flexible novelty detection algorithms. As an example, the Airline IT company Amadeus provides booking platforms handling millions of transactions per second, resulting in more than 3 million bookings per day and Petabytes of stored data. This company manages almost half of the flight bookings worldwide and is targeted by fraud attempts leading to revenue losses and indemnifications. Detecting novelties in such large volumes of data is a daunting task for a human operator; thus, an automated and scalable approach is truly desirable.
Because of the difficulty in obtaining labelled data and since the scarcity of anomalies is challenging for supervised methods (Japkowicz and Stephen 2002), novelty detection is normally approached as an unsupervised machine learning problem (Pimentel et al. 2014). The considerations above suggest some desirable scalability and generalization properties that novelty detection algorithms should have.
We have recently witnessed the rise of deep learning techniques as the preferred choice for supervised learning problems, due to their large representational power and the possibility to train these models at scale (LeCun et al. 2015); examples of deep learning techniques achieving state-of-the-art performance on a wide variety of tasks include computer vision (Krizhevsky et al. 2012), speech recognition (Hinton et al. 2012), and natural language processing (Collobert and Weston 2008). A natural question is whether such impressive results can extend beyond supervised learning to unsupervised learning and further to novelty detection. Deep learning techniques for unsupervised learning are currently an active research topic (Kingma and Welling 2014; Goodfellow et al. 2014), but it is still unclear whether these can compete with state-of-the-art novelty detection methods. We are not aware of recent surveys on neural networks for novelty detection, and the latest one we could find is almost 15 years old (Markou and Singh 2003) and misses the recent developments in this domain.
Key challenges with the use of deep learning methods in general learning tasks are (1) the necessity to specify a suitable architecture for the problem at hand and (2) the necessity to control their generalization. While various forms of regularization have been proposed to mitigate the overfitting problem and improve generalization, e.g., through the use of dropout (Srivastava et al. 2014; Gal and Ghahramani 2016), there are still open questions on how to devise principled ways of applying deep learning methods to general learning tasks. Deep Gaussian Processes (dgps) are ideal candidates to simultaneously tackle issues (1) and (2) above. dgps are deep nonparametric probabilistic models implementing a composition of probabilistic processes that implicitly allows for the use of an infinite number of neurons at each layer (Damianou and Lawrence 2013; Duvenaud et al. 2014). Also, their probabilistic nature induces a form of regularization that prevents overfitting, and allows for a principled way of carrying out model selection (Neal 1996). While dgps are particularly appealing to tackle general deep learning problems, their training is computationally intractable. Recently, there have been contributions in the direction of making the training of these models tractable (Bui et al. 2016; Cutajar et al. 2017; Bradshaw et al. 2017), and these are currently in the position to compete with Deep Neural Networks (dnns) in terms of scalability and accuracy, while providing superior quantification of uncertainty (Gal and Ghahramani 2016; Cutajar et al. 2017; Gal et al. 2017).
In this paper, we introduce an unsupervised model for novelty detection based on dgps in autoencoder configuration. We train the proposed dgp autoencoder (dgp-ae) by approximating the dgp layers using random feature expansions, and by performing stochastic variational inference on the resulting approximate model. The key features of the proposed approach are as follows: (1) dgp-aes are unsupervised probabilistic models that can model highly complex data distributions and offer a scoring method for novelty detection; (2) dgp-aes can model any type of data including cases with mixed-type features, such as continuous, discrete, and count data; (3) dgp-aes training does not require any expensive and potentially numerically troublesome matrix factorizations, but only tensor products; (4) dgp-aes can be trained using mini-batch learning, and could therefore exploit distributed and GPU computing; (5) dgp-aes training using stochastic variational inference can be easily implemented taking advantage of automatic differentiation tools, making for a very practical and scalable method for novelty detection. Even though we leave this for future work, it is worth mentioning that dgp-aes can easily include the use of special representations based, e.g., on convolutional filters for applications involving images, and allow for end-to-end training of the model and the filters.
We compare dgp-aes with a number of competitors that have been proposed in the literature of deep learning to tackle large-scale unsupervised learning problems, such as Variational Autoencoders (vae) (Kingma and Welling 2014), Variational Auto-Encoded Deep Gaussian Process (vae-dgp) (Dai et al. 2016) and Neural Autoregressive Distribution Estimator (nade) (Uria et al. 2016). Through a series of experiments, where we also compare against state-of-the-art novelty detection methods such as Isolation Forest (Liu et al. 2008) and Robust Kernel Density Estimation (Kim and Scott 2012), we demonstrate that dgp-aes offer flexible modeling capabilities with a practical learning algorithm, while achieving state-of-the-art performance.
The paper is organized as follows: Sect. 2 introduces the problem of novelty detection and reviews the related work on the state-of-the-art. Section 3 presents the proposed dgp-ae for novelty detection, while Sects. 4 and 5 report the experiments and conclusions.
2 Novelty detection
Consider an unsupervised learning problem where we are given a set of input vectors \(X = [\mathbf {x}_1, \ldots , \mathbf {x}_n]^{\top }\). Novelty detection is the task of classifying new test points \(\mathbf {x}_*\), based on the criterion that they significantly differ from the input vectors X, that is, the data available at training time. Test points that differ in this way are assumed to be generated by a different generative process and are called anomalies. Novelty detection is thus a one-class classification problem, which aims at constructing a model describing the distribution of nominal samples in a dataset. Unsupervised learning methods allow for the prediction on test data \(\mathbf {x}_*\); given a model with parameters \({\varvec{\theta }}\), define predictions as \(h(\mathbf {x}_* | X, {\varvec{\theta }})\). Assuming \(h(\mathbf {x}_* | X, {\varvec{\theta }})\) to be continuous, it is possible to interpret it as a means of scoring test points as novelties. The resulting scores allow for a ranking of test points \(\mathbf {x}_*\) highlighting the patterns which differ the most from the training data X. In particular, it is possible to define a threshold \(\alpha \) and flag a test point \(\mathbf {x}_*\) as a novelty when \(h(\mathbf {x}_* | X, {\varvec{\theta }}) > \alpha \).
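The thresholding and ranking rule can be sketched in a few lines of Python. This is only an illustration: the scores and the threshold are made-up values, and `flag_novelties` is a hypothetical helper name rather than part of the proposed method.

```python
def flag_novelties(scores, alpha):
    """Return indices of test points whose score h(x*) exceeds alpha,
    ordered from the most to the least novel."""
    flagged = [i for i, s in enumerate(scores) if s > alpha]
    return sorted(flagged, key=lambda i: scores[i], reverse=True)


# Illustrative scores h(x* | X, theta) for five test points (assumed values).
scores = [0.10, 2.70, 0.40, 1.90, 0.05]
novel = flag_novelties(scores, alpha=1.0)
print(novel)
```

In practice the threshold \(\alpha \) would be chosen from a validation set or from an expected contamination rate; the ranking itself does not depend on it.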
Novelty detection has been thoroughly investigated by theoretical studies (Pimentel et al. 2014; Hodge and Austin 2004). The evaluation of state-of-the-art methods was also reported in experimental papers (Emmott et al. 2016), including experiments on the methods' scalability (Domingues et al. 2018) and resistance to the curse of dimensionality (Zimek et al. 2012). In one of the most recent surveys on novelty detection (Pimentel et al. 2014), methods have been classified into the following categories. (1) Probabilistic approaches estimate the probability density function of X defined by the model parameters \({\varvec{\theta }}\). Novelties are scored by the likelihood function \(P(\mathbf {x}_* | {\varvec{\theta }})\), which computes the probability for a test point to be generated by the trained distribution. These approaches are generative, and provide a simple understanding of the underlying data through parameterized distributions. (2) Distance-based methods compute the pairwise distance between samples using various similarity metrics. Patterns with a small number of neighbors within a specified radius, or distant from the center of dense clusters of points, receive a high novelty score. (3) Domain-based methods learn the domain of the nominal class as a decision boundary. The label assigned to test points is then based on their location with respect to the boundary. (4) Information theoretic approaches measure the increase of entropy induced by including a test point in the nominal class. As an alternative, (5) isolation methods target the isolation of outliers from the remaining samples. As such, these techniques focus on isolating anomalies instead of profiling nominal patterns. (6) The most suitable unsupervised neural networks for novelty detection are autoencoders, i.e., networks learning a compressed representation of the training data by minimizing the error between the input data and the reconstructed output.
Test points showing a high reconstruction error are labelled as novelties. Our model belongs to this last category, and extends it by proposing a nonparametric and probabilistic approach to alleviate issues related to the choice of a suitable architecture while accounting for the uncertainty in the autoencoder mappings; crucially, we show that this can be achieved while learning the model at scale.
3 Deep Gaussian Process autoencoders for novelty detection
In this section, we introduce the proposed dgp-ae model and describe the approximation that we use to make inference tractable and scalable. Each iteration of the algorithm is linear in the dimensionality of the input, batch size, dimensionality of the latent representation and number of Monte Carlo samples used in the approximation of the objective function, which highlights the tractability of the model. We also discuss the inference scheme based on stochastic variational inference, and show how predictions can be made. Finally, we present ways in which we can make the proposed dgp-ae model handle various types of data, e.g., mixing continuous and categorical features. We refer the reader to Cutajar et al. (2017) for a detailed derivation of the random feature approximation of dgps and variational inference of the resulting model. In this work, we extend this dgp formulation to autoencoders.
3.1 Deep Gaussian Process autoencoders
An autoencoder is a model combining an encoder and a decoder. The encoder part takes each input \(\mathbf {x}\) and maps it into a set of latent variables \(\mathbf {z}\), whereas the decoder part maps latent variables \(\mathbf {z}\) into the inputs \(\mathbf {x}\). Because of their structure, autoencoders are able to jointly learn latent representations for a given dataset and a model to produce \(\mathbf {x}\) given latent variables \(\mathbf {z}\). Typically this is achieved by minimizing a reconstruction error.
Autoencoders are not generative models, and variational autoencoders have recently been proposed to enable this feature (Dai et al. 2016; Kingma and Welling 2014). In the context of novelty detection, the possibility to learn a generative model might be desirable but not essential, so in this work we focus in particular on autoencoders. Having said that, we believe that extending variational autoencoders using the proposed framework is possible, as well as empowering the current model to enable generative modeling; we leave these avenues of research for future work. In this work, we propose to construct the encoder and the decoder functions of autoencoders using dgps. As a result, we aim at jointly learning a probabilistic nonlinear projection based on dgps (the encoder) and a dgp-based latent variable model (the decoder).
The building blocks of dgps are gps, which are priors over functions; formally, a gp is a set of random variables characterized by the property that any subset of them is jointly Gaussian (Rasmussen and Williams 2006). The gp covariance function models the covariance between the random variables at different inputs, and it is possible to specify a parametric function for their mean.
Denote by \(F^{(i)}\) the collection of the multivariate functions \(\mathbf {f}^{(i)}\) evaluated at the inputs \(F^{(i-1)}\), and define \(F^{(0)} := X\). The encoder part of the proposed dgp-ae model maps the inputs X into a set of latent variables \(Z := F^{(j)}\) through a dgp, whereas the decoder is another dgp mapping Z into X. The dgp controlling the decoding part of the model assumes a likelihood function that allows one to express the likelihood of the observed data X as \(p\left( X | F^{(N_{\mathrm {L}})}, {\varvec{\theta }}^{(N_{\mathrm {L}})}\right) \). The likelihood reflects the choice of the mappings between latent variables and the type of data being modelled, and it can include and mix various types and dimensionality; Sect. 3.5 discusses this in more detail.
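As a compact restatement of this notation, and assuming that layers are indexed \(i = 1, \ldots , N_{\mathrm {L}}\) with the encoder ending at layer \(j\) (this index range is a reading of the text above, not an explicit statement of it), the composition can be sketched as

\[
F^{(0)} := X, \qquad F^{(i)} = \mathbf {f}^{(i)}\left( F^{(i-1)}\right) \quad (i = 1, \ldots , N_{\mathrm {L}}), \qquad Z := F^{(j)} \quad (1 \le j < N_{\mathrm {L}}).
\]

Under this reading the encoder is \(\mathbf {f}^{(j)} \circ \cdots \circ \mathbf {f}^{(1)}\), the decoder is \(\mathbf {f}^{(N_{\mathrm {L}})} \circ \cdots \circ \mathbf {f}^{(j+1)}\), and the output of the last layer enters the likelihood \(p\left( X | F^{(N_{\mathrm {L}})}, {\varvec{\theta }}^{(N_{\mathrm {L}})}\right) \).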
3.2 Random feature expansions for dgp-aes
In our framework, the choice of the covariance function induces different basis functions. For example, a possible approximation of the arc-cosine kernel (Cho and Saul 2009) yields Rectified Linear Units (relu) basis functions (Cutajar et al. 2017) resulting in faster computations compared to the approximation of the rbf covariance, given that derivatives of relu basis functions are cheap to evaluate.
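To illustrate how the covariance choice changes the basis functions, the sketch below builds random features for a single input vector using the standard constructions: trigonometric random Fourier features for the rbf kernel and rectified projections for the arc-cosine kernel of order 1. It is a plain-Python sketch under stated assumptions, not the implementation used in this work: the function names, the unit lengthscale, the fixed (non-variational) frequencies and the number of features are all illustrative choices.

```python
import math
import random


def project(x, omega):
    """Inner products between input x and each random frequency vector."""
    return [sum(xi * wi for xi, wi in zip(x, w)) for w in omega]


def rbf_features(x, omega, sigma2=1.0):
    """Random Fourier features: trigonometric basis functions."""
    proj = project(x, omega)
    scale = math.sqrt(sigma2 / len(omega))
    return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]


def arccos_features(x, omega, sigma2=1.0):
    """Order-1 arc-cosine approximation: relu basis functions."""
    proj = project(x, omega)
    scale = math.sqrt(2.0 * sigma2 / len(omega))
    return [scale * max(0.0, p) for p in proj]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


rng = random.Random(0)
d, n_rf = 3, 2000  # input dimension and number of random features (assumed)
# Frequencies drawn from N(0, I), i.e. an rbf kernel with unit lengthscale.
omega = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_rf)]

x, y = [0.2, -0.1, 0.4], [0.0, 0.3, 0.1]
approx = dot(rbf_features(x, omega), rbf_features(y, omega))
exact = math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))
print(round(approx, 3), round(exact, 3))
```

The inner product of the rbf features approaches the exact kernel value as the number of random features grows, while the relu features only need a comparison and a multiplication per projection.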
3.3 Stochastic variational inference for dgp-aes
3.4 Predictions with dgp-aes
For a given test set \(X_*\) containing multiple test samples, it is possible to use the predictive distribution as a scoring function to identify novelties. In particular, we can rank the predictive probabilities \(p(\mathbf {x}_* | X, {\varvec{\Omega }}, {\varvec{\Theta }})\) for all test points to identify the ones that have the lowest probability under the given dgp-ae model. In practice, for numerical stability, our implementation uses log-sum operations to compute \(\log [p(\mathbf {x}_* | X, {\varvec{\Omega }}, {\varvec{\Theta }})]\), and we use this as the scoring function.
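A minimal sketch of this scoring step is given below, assuming that the predictive probability of a test point is approximated by averaging its likelihood over Monte Carlo samples, so that its logarithm becomes a log-mean-exp of per-sample log-likelihoods. The helper name `log_mean_exp` and the numerical values are hypothetical.

```python
import math


def log_mean_exp(log_values):
    """Numerically stable log of the average of exp(log_values)."""
    m = max(log_values)
    total = sum(math.exp(v - m) for v in log_values)
    return m + math.log(total) - math.log(len(log_values))


# Hypothetical per-sample log-likelihoods for two test points.
log_liks = {
    "nominal": [-3.1, -2.9, -3.4],
    "suspect": [-1050.0, -1047.5, -1052.0],
}
scores = {name: log_mean_exp(v) for name, v in log_liks.items()}
# Lowest predictive log-probability first: the most likely novelties.
ranking = sorted(scores, key=scores.get)
print(ranking)
```

Subtracting the maximum before exponentiating keeps the computation finite for the second point, whose likelihoods would underflow to zero if exponentiated directly.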
3.5 Likelihood functions
4 Experiments
We evaluate the performance of our model by monitoring the convergence of the mean log-likelihood (mll) and by measuring the area under the Precision–Recall curve, namely the mean average precision (map). These metrics are taken on real-world datasets described in Sect. 4.2. In addition, we compare our model against the state-of-the-art neural networks suitable for outlier detection that are described in Sect. 4.1. To demonstrate the value of our proposal as a competitive novelty detection method, we include top-performing novelty detection methods from other domains, namely Isolation Forest (Liu et al. 2008) and Robust Kernel Density Estimation (rkde) (Kim and Scott 2012), which are recommended for outlier detection in Emmott et al. (2016).
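For reference, the average precision of a single run can be computed as sketched below, using the common step-wise definition in which precision is accumulated at each rank where an anomaly is retrieved. The exact procedure used to compute the area is not stated here, so this is an assumption; the function name and the toy data are illustrative. Scores are taken to be higher for more anomalous points, so a log-likelihood score would be negated first.

```python
def average_precision(scores, labels):
    """Average precision for anomaly scores (higher = more anomalous).

    labels: 1 for an anomaly, 0 for a nominal sample.
    Assumes at least one anomaly and does not treat tied scores specially.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at each retrieved anomaly
    return total / n_pos


scores = [0.9, 0.8, 0.3, 0.7, 0.1]
labels = [1, 0, 0, 1, 0]
ap = average_precision(scores, labels)
print(round(ap, 4))
```

The map is then the mean of this quantity over the repeated runs on a dataset.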
4.1 Selected methods
In order to retrieve a continuous score for the outliers and be able to compare the convergence of the likelihood for the selected models, our comparison focuses on probabilistic neural networks. Our dgp-ae is benchmarked against the Variational Autoencoder (vae) (Kingma and Welling 2014) and the Neural Autoregressive Distribution Estimator (nade) (Uria et al. 2016). We also include standard dnn autoencoders with sigmoid activation functions and dropout regularization to give a wider context to the reader. We initially intended to include Real nvp (Dinh et al. 2016) and Wasserstein gan (Arjovsky et al. 2017), but we found these networks and their implementations tightly tailored to images. The one-class classification with gps recently developed (Kemmler et al. 2013) is actually a supervised learning task where the authors regress on the labels and use heuristics to score novelties. Since this work is neither probabilistic nor a neural network, we did not include it. Parameter selection for the following methods was achieved by grid-search and maximized the average map over testing datasets labelled for novelty detection and described in Sect. 4.2. We append the depth of the networks as a suffix to the name, e.g., vae-2.
dgp-ae-g, dgp-ae-gs We train the proposed dgp-ae model for 100,000 iterations using 100 random features at each hidden layer. Due to the network topology, we use a number of multivariate gps equal to the number of input features when using a single-layer configuration, but use a multivariate gp of dimension 3 for the latent variables representation when using more than one layer. In the remainder of the paper, the term layer describes a hidden layer composed of two inner layers \(\varPhi ^{(i)}\) and \(F^{(i+1)}\). As observed in Duvenaud et al. (2014) and Neal (1996), deep architectures require feeding the input forward to the hidden layers in order to model meaningful functions. In the experiments involving more than 2 layers, we follow this advice by feeding the input forward to the encoding layers and the latent variables forward to the decoding layers. The weights are optimized using a batch size of 200 and a learning rate of 0.01. The parameters \(q({\varvec{\Omega }})\) and \({\varvec{\Theta }}\) are fixed for 1000 and 7000 iterations, respectively. \(N_{\mathrm {MC}}\) is set to 1 during the training, while we use \(N_{\mathrm {MC}} = 100\) at test time to score samples with higher accuracy. dgp-ae-g uses a Gaussian likelihood for continuous and one-hot encoded categorical variables. dgp-ae-gs is a modified dgp-ae-g where categorical features are modelled by a softmax likelihood as previously described. These networks use an rbf covariance function, except when the arc suffix is used, e.g., dgp-ae-g-1-arc.
vae-dgp-2^{1} This network performs inducing points approximation to train a dgp model with variational inference. The network uses 2 hidden layers of dimensionality \(\max (\frac{d}{2}, 5)\) and \(\max (\frac{d}{3}, 4)\), and is trained for 1000 iterations over all training samples. All layers use an rbf kernel with 40 inducing points. The MLP in the recognition model has two layers of dimensionality 300 and 150.
vae-1, vae-2^{2} The variational autoencoder is a generative model which compresses the representation of the training data into a layer of latent variables, optimized with stochastic gradient descent. The sum of the reconstruction error and the latent loss, i.e., the negative of the Kullback–Leibler divergence between the approximate posterior over the latent variables and the prior, gives the loss term optimized during the training. The networks were trained for 4000 iterations using 50 hidden units and a batch size of 1000 samples. A learning rate of 0.001 was selected to optimize the weights. vae-1 is a shallow network using one layer for latent variables representation, while vae-2 uses a two-layer architecture with a first layer for encoding and a second one for decoding, each containing 100 hidden units. We use the reconstruction error to score novelties.
nade-2^{3} This neural network is an autoencoder suitable for density estimation. The network uses mixtures of Gaussians to model p(x). The network yields an autoregressive model, which implies that the joint distribution is modelled such that the probability for a given feature depends on the previous features fed to the network, i.e., \(p(\varvec{x}) = \prod _{d=1}^{D} p(x_{o_d}|\varvec{x}_{o_{<d}})\), where \(x_{o_d}\) is the feature of index d of \(\varvec{x}\) under the ordering o and D is the number of features. We train a deep and orderless nade for 5000 iterations using batches of 200 samples, a learning rate of 0.005 and a weight decay of 0.02. Training the network for more iterations increases the risk of the training failing due to runtime errors. The network has a 2-layer topology with 100 hidden units and a relu activation function. The number of components for the mixture of Gaussians was set to 20, and we use Bernoulli distributions instead of Gaussians to model datasets exclusively composed of categorical data. 15% of the training data was used for validation to select the final weights.
ae-1, ae-5 These two neural networks are feedforward autoencoders using sigmoid activation functions in the hidden layers and a dropout rate of 0.5 to provide regularization. The first network is a single layer autoencoder with a number of hidden units equal to the number of features, while the second one has a 5-layer topology with 80% of the number of input features on the second and fourth layer, and 60% on the third layer. The networks are trained for 100,000 iterations with a batch size of 200 samples and a learning rate of 0.01. The reconstruction error is used to detect outliers.
Isolation forest^{4} is a random forest algorithm performing recursive random splits over the feature domain until each sample is isolated from the rest of the dataset. As a result, outliers are separated after few splits and are located in nodes close to the root of the trees. The average path length required to reach the node containing the specified point is used for scoring. A contamination rate of 5% was used for this experiment.
rkde^{5} is a probabilistic method which assigns a kernel function to each training sample, then sums the local contribution of each function to give an estimate of the density. The experiment uses the cross-validation bandwidth (lkcv) as a smoothing parameter on the shape of the density, and the Huber loss function to provide a robust estimation of the maximum likelihood.
4.2 Datasets
uci and proprietary datasets benchmarked. The value in parentheses under Categ. dims is the number of binary features after one-hot encoding of the categorical features
Dataset | Nominal class | Anomaly class | Numeric dims | Categ. dims | Samples | Anomalies |
---|---|---|---|---|---|---|
mammography | −1 | 1 | 6 | 0 (0) | 11,183 | 260 (2.32%) |
magic-gamma-sub | g | h | 10 | 0 (0) | 12,332 | 408 (3.20%)\({}^{\mathrm{a}}\) |
wine-quality | 4, 5, 6, 7, 8 | 3, 9 | 11 | 0 (0) | 4898 | 25 (0.51%) |
mushroom-sub | e | p | 0 | 22 (107) | 4368 | 139 (3.20%)\({}^{\mathrm{a}}\) |
car | unacc, acc, good | vgood | 0 | 6 (21) | 1728 | 65 (3.76%) |
german-sub | 1 | 2 | 7 | 13 (54) | 723 | 23 (3.18%)\({}^{\mathrm{a}}\) |
pnr | 0 | 1, 2, 3, 4, 5 | 82 | 0 (0) | 20,000 | 121 (0.61%) |
transactions | 0 | 1 | 41 | 1 (9) | 10,000 | 21 (0.21%) |
shared-access | 0 | 1 | 49 | 0 (0) | 18,722 | 37 (0.20%) |
payment-sub | 0 | 1 | 37 | 0 (0) | 73,848 | 2769 (3.75%) |
airline | 1 | 0 | 8 | 0 (0) | 3,188,179 | 203,501 (6.00%) |
4.3 Results
This section shows the outlier detection capabilities of the methods and monitors the mll to exhibit convergence. We also study the impact of depth and dimensionality on dgp-aes, and plot the latent representations learnt by the network.
4.3.1 Method comparison
Mean area under the precision–recall curve (map) per dataset and algorithm (5 runs)
Dataset | dgp-ae-g-1 | dgp-ae-g-2 | dgp-ae-gs-1 | dgp-ae-gs-2 | vae-dgp-2 | ae-1 | ae-5 | vae-1 | vae-2 | nade-2 | rkde | IForest |
---|---|---|---|---|---|---|---|---|---|---|---|---|
mammography | 0.222 | 0.183 | 0.222 | 0.183 | 0.221 | 0.118 | 0.075 | 0.119 | 0.148 | 0.193 | 0.231 | 0.244 |
magic-gamma-sub | 0.260 | 0.340 | 0.260 | 0.340 | 0.235 | 0.253 | 0.125 | 0.230 | 0.305 | 0.398 | 0.402 | 0.290 |
wine-quality | 0.224 | 0.203 | 0.224 | 0.203 | 0.075 | 0.106 | 0.042 | 0.064 | 0.124 | 0.102 | 0.051 | 0.059 |
mushroom-sub | 0.811 | 0.677 | 0.940 | 0.892 | 0.636 | 0.725 | 0.331 | 0.758 | 0.479 | 0.596 | 0.839 | 0.546 |
car | 0.050 | 0.061 | 0.043 | 0.067 | 0.045 | 0.044 | 0.032 | 0.071 | 0.050 | 0.030 | 0.034 | 0.041 |
german-sub | 0.066 | 0.077 | 0.106 | 0.098 | 0.113 | 0.065 | 0.103 | 0.104 | 0.062 | 0.118 | 0.109 | 0.079 |
pnr | 0.190 | 0.172 | 0.190 | 0.172 | 0.201 | 0.059 | 0.107 | 0.100 | 0.106 | 0.006 | 0.146 | 0.124 |
transactions | 0.756 | 0.752 | 0.810 | 0.835 | 0.509 | 0.563 | 0.510 | 0.532 | 0.760 | 0.373 | 0.585 | 0.564 |
shared-access | 0.692 | 0.738 | 0.692 | 0.738 | 0.668 | 0.546 | 0.766 | 0.471 | 0.527 | 0.239 | 0.783 | 0.746 |
payment-sub | 0.173 | 0.173 | 0.168 | 0.168 | 0.137 | 0.157 | 0.129 | 0.175 | 0.143 | 0.101 | 0.180 | 0.142 |
airline | 0.081 | 0.079 | 0.081 | 0.079 | 0.060 | 0.063 | 0.059 | 0.068 | 0.074 | 0.064 | – | 0.069 |
average \({}^{\mathrm{a}}\) | 0.344 | 0.338 | 0.366 | 0.370 | 0.284 | 0.264 | 0.222 | 0.262 | 0.270 | 0.216 | 0.336 | 0.284 |
Looking at the average performance, our dgp autoencoders achieve the best results for novelty detection. dgps performed well on all datasets, including high-dimensional cases, and outperform the other methods on wine-quality, airline and pnr. By fitting a softmax likelihood instead of a Gaussian on one-hot encoded features, dgp-ae-gs-1 achieves better performance than dgp-ae-g-1 on 3 out of the 4 datasets containing categorical variables, namely mushroom-sub, german-sub and transactions, while showing similar results on the car dataset. This representation allows dgps to reach the best performance on half of the datasets and to outperform state-of-the-art algorithms for novelty detection, such as rkde and IForest. Despite the low-dimensional representation of the latent variables, dgp-ae-g-2 achieves performance comparable to dgp-ae-g-1, which suggests good dimensionality reduction abilities. The use of a softmax likelihood in dgp-ae-gs-2 resulted in better novelty detection capabilities than dgp-ae-g-2 on the 4 datasets containing categorical features. vae-dgp-2 achieves good results but is outperformed on most small datasets.
vae-1 also shows good outlier detection capabilities and handles binary features better than vae-2. However, the multilayer architecture outperforms its shallow counterpart on large datasets containing more than 10,000 samples. Both algorithms perform better than nade-2, which fails on high-dimensional datasets such as mushroom-sub, pnr or transactions. We performed additional tests with an increased number of units for nade-2 to cope with the large dimensionality, but we obtained similar results.
While ae-1 shows unexpected detection capabilities for a very simple model, ae-5 reaches the lowest performance. Compressing the data to a feature space 40% smaller than the input space along with dropout layers may cause loss of information resulting in an inaccurate model.
4.3.2 Convergence monitoring
While the likelihood is the objective function of most networks, the monitoring of this metric reveals occasional decreases of the mll for all methods during the training process. Minor decreases are part of the gradient optimization, but the larger ones indicate convergence issues for complex datasets. This is observed for vae-dgp-2 and vae-1 on mammography, or dgp-ae-g-1-arc and vae-1 on mushroom-sub.
Our dgps show the best likelihood on most datasets, in particular when using the arc kernel, with the exception of pnr and mushroom-sub where the rbf kernel is much more efficient. These results demonstrate the efficiency of regularization for dgps and their excellent ability to generalize while fitting complex models.
In contrast, nade-2 barely reaches the likelihood of ae-1 and ae-5 at convergence. In addition, the network requires an extensive tuning of its parameters and has a computationally expensive prediction step. We tweaked the parameters to increase the model complexity, e.g., number of components and units, but it did not improve the optimized likelihood.
vae-dgp-2 does not reach a competitive likelihood, even with deeper architectures, and shows a computationally expensive prediction step.
Looking at the overall results of these networks, we observe that the model, depicted here by the likelihood, is refined during the entire training process, while the average precision quickly stabilizes. This behavior implies that the ordering of data points according to their outlier score converges much faster, even though small changes can still occur.
The left panel of Fig. 3 reports the convergence of dgp-ae-g for configurations ranging from one to ten layers. The plot highlights the correlation between a higher test likelihood and a higher average precision. Single-layer models show a good convergence of the mll on most datasets, though they are outperformed by deeper models, especially 4-layer networks, on magic-gamma-sub, payment-sub and airline. Deep architectures result in models of higher capacity at the cost of needing larger datasets to be able to model complex representations, with a resulting slower convergence behavior. Using moderately deep networks can thus show better results on datasets where a single layer is not sufficient to capture the complexity of the data. Interestingly, the bound on the model evidence makes it possible to carry out model selection to decide on the best architecture for the model at hand (Cutajar et al. 2017).
In the right panel of Fig. 3, we increase the dimensionality of the latent representation, fixing the architecture to a dgp-ae-g-2. Both the test likelihood and the average precision show that a univariate gp is not sufficient to accurately model the input data. The limitations of this configuration are observed on mammography, payment-sub and airline, where more complex representations achieve better performance. Increasing the number of gps results in a higher number of weights for the model, and thus in a slower convergence. While configurations using 5 gps already perform a significant dimensionality reduction, they achieve good performance and are suitable for efficient novelty detection.
4.3.3 Latent representation
In Fig. 4, we draw 300 Monte Carlo samples from the approximate posterior over the weights \(\mathbf {W}\) to construct a latent representation of the old faithful dataset. We use a gmm with two components to cluster the input data, and color the latent representation based on the resulting labels. The point highlighted on the left panel of the plot by a cross is mapped into the green points on the right.
5 Conclusions
In this paper, we introduced a novel deep probabilistic model for novelty detection. The proposed dgp-ae model is an autoencoder where the encoding and the decoding mappings are governed by dgps. We make the inference of the model tractable and scalable by approximating the dgps using random feature expansions and by inferring the resulting model through stochastic variational inference, which can exploit distributed and GPU computing. The proposed dgp-ae is able to flexibly model data with mixed-type features, a topic that is actively investigated in the recent literature (Vergari et al. 2018). Furthermore, the model is easy to implement using automatic differentiation tools, and is characterized by robust training given that, unlike most gp-based models (Dai et al. 2016), it only involves tensor products and no matrix factorizations.
Through a series of experiments, we demonstrated that dgp-aes achieve competitive results against state-of-the-art novelty detection methods and dnn-based novelty detection methods. Crucially, dgp-aes achieve this performance with a practical learning method, making deep probabilistic modeling an attractive approach for general novelty detection tasks. The encoded latent representation is probabilistic and it yields uncertainty that can be used to turn the proposed autoencoder into a generative model; we leave this investigation for future work, as well as the possibility of using dgps to model the mappings in variational autoencoders.
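As an illustration of why random feature expansions require only matrix products, the sketch below approximates an RBF kernel matrix with trigonometric random features and compares it with the exact kernel. The kernel choice, the lengthscale, the number of features and the helper names are assumptions made for this example; no Cholesky or other factorization of the kernel matrix is needed at any point.

```python
import numpy as np

def rbf_kernel(x, lengthscale):
    """Exact RBF kernel matrix, used here only as a reference."""
    sq = np.sum(x**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

def random_features(x, omega):
    """Feature map whose inner products approximate the RBF kernel."""
    proj = x @ omega
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omega.shape[1])

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 3))                      # placeholder inputs
lengthscale = 1.5
n_rf = 2000                                           # number of spectral samples

# Spectral frequencies of the RBF kernel are Gaussian with scale 1 / lengthscale.
omega = rng.standard_normal((3, n_rf)) / lengthscale

phi = random_features(x, omega)                       # (20, 2 * n_rf)
k_approx = phi @ phi.T                                # matrix products only
k_exact = rbf_kernel(x, lengthscale)
max_err = float(np.max(np.abs(k_approx - k_exact)))
print(max_err)
```

The approximation error shrinks as the number of random features grows, which is the trade-off between accuracy and cost that this kind of expansion exposes.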
Acknowledgements
The authors wish to thank the Amadeus Middleware Fraud Detection team directed by Virginie Amar and Jeremie Barlet, led by the product owner Christophe Allexandre and composed of Jean-Blas Imbert, Jiang Wu, Damien Fontanes and Yang Pu for building and labeling the transactions, shared-access and payment-sub datasets. MF gratefully acknowledges support from the AXA Research Fund.
References
- Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein gan. arXiv:1701.07875v2
- Bache, K., & Lichman, M. (2013). UCI machine learning repository. Irvine: University of California, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml.
- Bradshaw, J., Alexander, & Ghahramani, Z. (2017). Adversarial examples, uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks. arXiv:1707.02476
- Bui, T. D., Hernández-Lobato, D., Hernández-Lobato, J. M., Li, Y., & Turner, R. E. (2016). Deep Gaussian Processes for regression using approximate expectation propagation. In M. Balcan & K. Q. Weinberger (Eds.), Proceedings of the 33rd international conference on machine learning, ICML 2016, New York City, NY, USA, June 19–24, 2016, volume 48 of JMLR workshop and conference proceedings (pp. 1472–1481). JMLR.org.
- Cho, Y., & Saul, L. K. (2009). Kernel methods for deep learning. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems (Vol. 22, pp. 342–350). Curran Associates, Inc.
- Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on machine learning, ICML ’08 (pp. 160–167). New York, NY: ACM.
- Cutajar, K., Bonilla, E. V., Michiardi, P., & Filippone, M. (2017). Random feature expansions for Deep Gaussian Processes. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th international conference on machine learning, volume 70 of proceedings of machine learning research (pp. 884–893). Sydney: International Convention Centre, PMLR.
- Dai, Z., Damianou, A., González, J., & Lawrence, N. (2016). Variationally auto-encoded Deep Gaussian Processes. In Proceedings of the fourth international conference on learning representations (ICLR 2016).
- Damianou, A. C., & Lawrence, N. D. (2013). Deep Gaussian Processes. In Proceedings of the sixteenth international conference on artificial intelligence and statistics, AISTATS 2013, Scottsdale, AZ, USA, April 29–May 1, 2013, volume 31 of JMLR proceedings (pp. 207–215). JMLR.org.
- Davis, J., & Goadrich, M. (2006). The relationship between precision–recall and ROC curves. In Proceedings of the 23rd international conference on machine learning, ICML ’06 (pp. 233–240). New York, NY: ACM.
- Dereszynski, E. W., & Dietterich, T. G. (2011). Spatiotemporal models for data-anomaly detection in dynamic environmental monitoring campaigns. ACM Transactions on Sensor Networks (TOSN), 8(1), 3.
- Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using real NVP. arXiv:1605.08803
- Domingues, R., Filippone, M., Michiardi, P., & Zouaoui, J. (2018). A comparative evaluation of outlier detection algorithms: Experiments and analyses. Pattern Recognition, 74, 406–421.
- Duvenaud, D. K., Rippel, O., Adams, R. P., & Ghahramani, Z. (2014). Avoiding pathologies in very deep networks. In Proceedings of the seventeenth international conference on artificial intelligence and statistics, AISTATS 2014, Reykjavik, Iceland, April 22–25, 2014, volume 33 of JMLR workshop and conference proceedings (pp. 202–210). JMLR.org.
- Emmott, A., Das, S., Dietterich, T., Fern, A., & Wong, W.-K. (2016). A meta-analysis of the anomaly detection problem. arXiv:1503.01158v2
- Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd international conference on machine learning, volume 48, ICML’16 (pp. 1050–1059). JMLR.org.
- Gal, Y., Hron, J., & Kendall, A. (2017). Concrete dropout. arXiv:1705.07832
- García, S., Fernández, A., Luengo, J., & Herrera, F. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences, 180(10), 2044–2064.
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (Vol. 27, pp. 2672–2680). Curran Associates, Inc.
- Graves, A. (2011). Practical variational inference for neural networks. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (Vol. 24, pp. 2348–2356). Curran Associates, Inc.
- Greensmith, J., Twycross, J., & Aickelin, U. (2006). Dendritic cells for anomaly detection. In IEEE international conference on evolutionary computation, 2006 (pp. 664–671). https://doi.org/10.1109/CEC.2006.1688374.
- Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-R., Jaitly, N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82–97.
- Hodge, V., & Austin, J. (2004). A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2), 85–126.
- Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429–449.
- Kemmler, M., Rodner, E., Wacker, E.-S., & Denzler, J. (2013). One-class classification with Gaussian processes. Pattern Recognition, 46(12), 3507–3518.
- Kim, J., & Scott, C. D. (2012). Robust kernel density estimation. Journal of Machine Learning Research, 13, 2529–2565.
- Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the second international conference on learning representations (ICLR 2014).
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th international conference on neural information processing systems, NIPS’12 (pp. 1097–1105). Curran Associates Inc.
- LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
- Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation forest. In Proceedings of the 2008 eighth IEEE international conference on data mining, ICDM ’08 (pp. 413–422). IEEE Computer Society.
- Liu, H., Shah, S., & Jiang, W. (2004). On-line outlier detection and data cleaning. Computers & Chemical Engineering, 28(9), 1635–1647.
- Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579–2605.
- Markou, M., & Singh, S. (2003). Novelty detection: A review-part 2: Neural network based approaches. Signal Processing, 83(12), 2499–2521.
- Neal, R. M. (1996). Bayesian learning for neural networks (lecture notes in statistics) (1st ed.). Berlin: Springer.
- Pimentel, M. A., Clifton, D. A., Clifton, L., & Tarassenko, L. (2014). A review of novelty detection. Signal Processing, 99, 215–249.
- Pokrajac, D., Lazarevic, A., & Latecki, L. J. (2007). Incremental local outlier detection for data streams. In Computational intelligence and data mining, 2007. CIDM 2007. IEEE symposium on (pp. 504–515). IEEE.
- Prastawa, M., Bullitt, E., Ho, S., & Gerig, G. (2004). A brain tumor segmentation framework based on outlier detection. Medical Image Analysis, 8(3), 275–283.
- Rasmussen, C. E., & Williams, C. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.
- Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22, 400–407.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
- Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.
- Uria, B., Côté, M.-A., Gregor, K., Murray, I., & Larochelle, H. (2016). Neural autoregressive distribution estimation. Journal of Machine Learning Research, 17(205), 1–37.
- Vergari, A., Peharz, R., Di Mauro, N., Molina, A., Kersting, K., & Esposito, F. (2018). Sum-product autoencoding: Encoding and decoding representations using sum-product networks. In Proceedings of the AAAI conference on artificial intelligence (AAAI).
- Worden, K., Manson, G., & Fieller, N. R. (2000). Damage detection using outlier analysis. Journal of Sound and Vibration, 229(3), 647–667.
- Zimek, A., Schubert, E., & Kriegel, H.-P. (2012). A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5), 363–387.