
1 Introduction

In recent years, deep learning has outperformed other classes of predictive models in many applications. In some of these, e.g. autonomous driving or medical diagnostics, the reliability of a prediction is of the highest interest. In classification tasks, thresholding on the highest softmax probability or on the entropy of the classification distribution (the softmax output) is a commonly used approach to quantify the classification uncertainty of neural networks, see e.g. [11]. However, misclassifications are oftentimes not detected by these metrics, and it is also well known that these metrics can be fooled easily. Many works have demonstrated how an input can be designed to fool a neural network such that it incorrectly classifies the input with high confidence (termed adversarial examples, see e.g. [9, 13, 18, 19]). This underlines the need for measures of uncertainty.

A basic statistical study of the performance of softmax probability thresholding on several datasets was developed in [11]. This work also assigns proper out-of-distribution candidate datasets to many common datasets. For instance a network trained on MNIST is applied to images of handwritten letters, scaled gray scale images from CIFAR10, and different types of noise. This represents a baseline for comparisons.

Using classical approaches from uncertainty quantification for modeling input uncertainty and/or model uncertainty, the detection rate of misclassifications can be improved. Using the baseline in [11], an approach named ODIN, which is based on input uncertainty, was published in [15]. This approach shows improved results compared to pure softmax probability thresholding. Uncertainty in the weights of a neural network can be modeled using Bayesian neural networks. A practically feasible approximation to Bayesian neural networks was introduced in [8], known as Monte-Carlo dropout, which also improves over classical softmax probability thresholding.

Since the softmax removes one dimension from its input by normalization, some works also perform outlier detection on the softmax input (the penultimate layer) and outperform softmax probability thresholding as well, see [2].

In this work we propose a different approach to measuring the uncertainty of a neural network, based on gradient information. Technically, we compute the gradient of the negative log-likelihood of a single sample during inference, where the class argument in the log-likelihood is the predicted class. We then extract compressed representations of the gradients, e.g., the norm of the gradient restricted to a chosen layer. A large gradient norm, for instance, is interpreted as a sign that, if the prediction were true, major re-learning would be necessary for the CNN. We interpret this ‘re-learning stress’ as uncertainty and study the performance of different gradient metrics in two meta classification tasks: separating correct and incorrect predictions and detecting in- and out-of-distribution samples.

The closest approaches to ours are probably [2, 11], as they also establish a self-evaluation procedure for neural networks. However, they only incorporate (non-gradient) metrics for particular layers close to the network's output, while we consider gradient metrics extracted from all layers. Just as [2, 11], our approach does not make use of input or model uncertainty. These approaches, as well as ours, are somewhat orthogonal to classical uncertainty quantification and should be combinable with input uncertainty and model uncertainty as used in [15] and [8], respectively.

The remainder of this work is structured as follows: First, in Sect. 2 we introduce the (gradient) metrics, the concept of meta classification and the threshold-independent performance measures for meta classification, AUROC and AUPR, that are used in the experiments. In Sect. 3 we introduce the network architecture and the experiment setup, including the choice of data sets. We use EMNIST [6] digits as the known concept on which the CNN is trained and EMNIST letters, CIFAR10 images as well as different types of noise as unknown/unlearned concepts. Then we statistically investigate the separation performance of our metrics for correct vs. incorrect classifications provided by CNNs. This is followed by a performance study for the detection of in- and out-of-distribution samples (detection of unlearned concepts) in Sect. 4. To this end, we also combine the available metrics for training and comparing different meta classifiers. In this section meta classifiers are trained using known concepts only, i.e., EMNIST digits. Afterwards, in Sect. 5, we insert unlearned concepts (which thereby become known unknowns) into the training of the meta classifiers. While the softmax baseline achieves an AUROC value of \(95.83\%\), our approach gains \(0.81\%\) in terms of AUROC and even more in terms of AUPR values.

2 Entropy, Softmax Baseline and Gradient Metrics

Given an input \(x \in \mathbb {R}^n\), weights \(w \in \mathbb {R}^p\) and class labels \(y \in \mathcal {C} = \{1,\ldots ,q\} \), we denote the output of a neural network by \(f(y|x,w) \in [0,1]\). The entropy of the estimated class distribution conditioned on the input (also called Shannon information, [16])

$$\begin{aligned} E(x,w)=-\frac{1}{\log (q)}\sum _{y\in \mathcal {C}} f(y|x,w)\log (f(y|x,w)), \end{aligned}$$
(1)

is a well-known dispersion measure that is widely used for quantifying the classification uncertainty of neural networks. In the following we use the term entropy in this sense; it should not be confused with the entropy of the (joint, non-estimated) statistical distribution of inputs and labels. The softmax baseline proposed by [11] is calculated as

$$\begin{aligned} S(x,w) = \max \limits _{y\in \mathcal {C}} f(y|x,w). \end{aligned}$$
(2)

Using the maximum a posteriori principle (MAP), the predicted class is defined by

$$\begin{aligned} \hat{y}(x,w) := \mathop {{{\mathrm{arg\,max}}}}\limits _{y\in \mathcal {C}}f(y|x,w) \end{aligned}$$
(3)

according to the Bayes decision rule [3], or as the one-hot encoded label \(\hat{g}(x,w) \in \{0,1\}^q\) with

$$\begin{aligned} \hat{g}_k(x,w) = {\left\{ \begin{array}{ll} 1, &{} k = \hat{y}(x,w)\\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(4)

for \(k=1,\ldots ,q\). Given an input sample \(x^i\) with one-hot label \(y^i\), predicted class label \(\hat{g}^i\) (from Eq. (4)) and a loss function \(L=L(f(y|x^i,w),y^i)\), we can calculate the gradient of the loss function with respect to the weights, \(\nabla _w L = \nabla _w L(f(y|x^i,w),\hat{g}^i)\). In our experiments we use the gradient of the negative log-likelihood at the predicted class label, i.e.,

$$\begin{aligned} L=L(f(y|x^i,w),\hat{g}^i)=-\sum _{y\in \mathcal {C}}\hat{g}^i_y\log \left( f(y|x^i,w)\right) =-\log \left( f(\hat{y}|x^i,w)\right) . \end{aligned}$$
(5)

We apply the following metrics to this gradient:

  • Absolute norm (\(\Vert \nabla _{w}L\Vert _1\))

  • Euclidean norm (\(\Vert \nabla _{w}L\Vert _2\))

  • Minimum (\(\min \left( \nabla _{w}L\right) \))

  • Maximum (\(\max \left( \nabla _{w}L\right) \))

  • Mean (\(\text {mean}\left( \nabla _{w}L\right) \))

  • Skewness (\(\text {skew}\left( \nabla _{w}L\right) \))

  • Kurtosis (\(\text {kurt}\left( \nabla _{w}L\right) \))

These metrics can be applied either layerwise, by restricting the gradient to the weights belonging to a single layer of the neural network, or globally to the whole gradient over all layers.
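To make this concrete, the following is a minimal sketch (assuming TensorFlow 2/Keras and SciPy; the function name and input shape are illustrative, not part of the original implementation) of how the gradient of the negative log-likelihood at the predicted class, Eq. (5), can be reduced to the scalar metrics listed above, here computed layerwise.

```python
import numpy as np
import tensorflow as tf
from scipy.stats import kurtosis, skew

def gradient_metrics(model, x):
    """x: a single input of shape (1, 28, 28, 1); returns one metric dict per weight tensor."""
    with tf.GradientTape() as tape:
        probs = model(x, training=False)                    # softmax output f(y|x,w)
        nll = -tf.math.log(tf.reduce_max(probs, axis=-1))   # -log f(y_hat|x,w), cf. Eq. (5)
    grads = tape.gradient(nll, model.trainable_variables)

    metrics = {}
    for var, g in zip(model.trainable_variables, grads):
        g = g.numpy().ravel()
        metrics[var.name] = {
            "abs_norm": np.abs(g).sum(),           # ||grad||_1
            "eucl_norm": np.sqrt((g ** 2).sum()),  # ||grad||_2
            "min": g.min(),
            "max": g.max(),
            "mean": g.mean(),
            "skew": skew(g),
            "kurt": kurtosis(g),
        }
    return metrics
```

Ravelling all weight gradients into one vector before applying the same reductions yields the global variants of the metrics.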

The metrics can be sampled over the input X and conditioned on the event of either correct or incorrect classification. Let T(w) and F(w) denote the subsets of correctly and incorrectly classified samples for the network \(f(y|x,w)\), respectively. Given a metric M (e.g. the entropy E or any gradient based one), the two conditional distributions \(M(X,w)|_{T(w)}\) and \(M(X,w)|_{F(w)}\) are investigated further. For a threshold t, we measure \(P(M(X,w)< t \, | \, {T(w)})\) and \(P(M(X,w)\ge t \, | \, {F(w)} )\) by sampling X. If both probabilities are high, t gives a good separation between correctly and incorrectly classified samples. This concept can be transferred to the detection of out-of-distribution samples by defining these as incorrectly classified. We term this procedure (classifying \(M(X,w)< t\) vs. \(M(X,w)\ge t\)) meta classification.

Since there are many possible ways to choose a threshold t, we report threshold-independent results using the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall curve (AUPR). For any chosen threshold t we define

$$\begin{aligned} TP&= \#\{\text {correctly predicted positive cases}\},\\ TN&= \#\{\text {correctly predicted negative cases}\},\\ FP&= \#\{\text {incorrectly predicted positive cases}\},\\ FN&= \#\{\text {incorrectly predicted negative cases}\}.\\ \end{aligned}$$

and can compute the quantities

$$\begin{aligned} R = TPR&= \frac{TP}{TP+FN}\quad&\text {(True positive rate or Recall)},\\ FPR&= \frac{FP}{FP+TN}\quad&\text {(False positive rate)},\\ P&= \frac{TP}{TP+FP}\quad&\text {(Precision)}.\\ \end{aligned}$$

When dealing with threshold-dependent classification techniques, one calculates TPR (R), FPR and P for many different thresholds in the value range of the variable. The AUROC is the area under the receiver operating characteristic curve, which has the FPR as abscissa and the TPR as ordinate. The AUPR is the area under the precision-recall curve, which has the recall as abscissa and the precision as ordinate. For more information on these performance measures see [7].

The AUPR is in general more informative for datasets with a strong imbalance between positive and negative cases, and it is sensitive to which class is defined as the positive case. Because of this, we compute both AUPR-In and AUPR-Out, for which the definition of the positive case is reversed. In addition, the values of the considered variable are multiplied by \(-1\) when switching between AUPR-In and AUPR-Out, as in [11].
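As a small sketch of this threshold-independent evaluation (assuming scikit-learn and a metric for which larger values indicate higher uncertainty; for the softmax baseline the sign convention is reversed), AUROC, AUPR-In and AUPR-Out can be computed as follows.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def meta_classification_scores(metric_values, correct):
    """metric_values: uncertainty metric per sample; correct: 1 for correctly classified
    (or in-distribution) samples, 0 otherwise."""
    m = np.asarray(metric_values, dtype=float)
    correct = np.asarray(correct)
    auroc = roc_auc_score(correct, -m)                  # small metric value -> positive case
    aupr_in = average_precision_score(correct, -m)      # correct cases are the positive class
    aupr_out = average_precision_score(1 - correct, m)  # incorrect cases are the positive class
    return auroc, aupr_in, aupr_out
```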

3 Meta Classification – A Benchmark Between Maximum Softmax Probability and Gradient Metrics

We perform all our statistical experiments on the EMNIST data set [6], which contains \(28 \,\times \,28\) gray scale images of \(280,\!000\) handwritten digits (0–9) and \(411,\!302\) handwritten letters (a–z, A–Z). We train the CNNs only on the digits, in order to test their behavior on untrained concepts. We split the EMNIST data set (after a random permutation) as follows:

  • \(60,\!000\) digits (0–9) for training

  • \(20,\!000\) digits (0–9) for validation

  • \(200,\!000\) digits (0–9) for testing

  • \(20,\!000\) letters (a–z, A–Z) as untrained concepts

Additionally, we include the CIFAR10 dataset [12], shrunk and converted to gray scale, as well as \(20,\!000\) images generated from uniform random noise. All concepts can be seen in Fig. 1.

Fig. 1. Different concepts used for our statistical experiments

The architecture of the CNNs consists of three convolutional (conv) layers with 16 filters of size \(3\times 3\) each and a stride of 1, as well as a dense layer with a 10-way softmax output. Each of the first two conv layers is equipped with a leaky ReLU activation

$$\begin{aligned} \phi (x) = {\left\{ \begin{array}{ll} x, &{} x \ge 0\\ \alpha x, &{} x < 0 \end{array}\right. } \end{aligned}$$
(6)

and followed by \(2\times 2\) max pooling. We employ \(L^2\) regularization with a regularization parameter of \(10^{-3}\). Additionally, dropout [17] is applied after the first and third conv layer. The dropout rate is \(33\%\).

The models are trained using stochastic gradient descent with a batch size of 256, momentum of 0.9 and categorical cross entropy as cost function. The initial learning rate is 0.1 and is reduced by a factor of 10 every time the average validation accuracy stagnates, until a lower limit of 0.001 for the learning rate is reached. All models were trained and evaluated using Keras [5] with TensorFlow backend [1]. Note that the parameters were chosen from experience and not tuned to any extent. The goal is not to achieve a high accuracy, but to reliably detect the uncertainty of a neural network.
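The following is a minimal Keras sketch of the architecture and training setup described above. Details that the text leaves open, such as the leaky ReLU slope, the padding mode and the activation of the third conv layer, are assumptions and marked as such in the comments.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_cnn(alpha=0.1):                      # leaky ReLU slope: assumed value
    reg = regularizers.l2(1e-3)                # L2 regularization, lambda = 1e-3
    return tf.keras.Sequential([
        layers.Input((28, 28, 1)),
        layers.Conv2D(16, 3, strides=1, padding="same", kernel_regularizer=reg),  # padding assumed
        layers.LeakyReLU(alpha),
        layers.MaxPooling2D(2),
        layers.Dropout(0.33),                  # dropout after the first conv layer
        layers.Conv2D(16, 3, strides=1, padding="same", kernel_regularizer=reg),
        layers.LeakyReLU(alpha),
        layers.MaxPooling2D(2),
        layers.Conv2D(16, 3, strides=1, padding="same", kernel_regularizer=reg),  # activation not stated in the text
        layers.Dropout(0.33),                  # dropout after the third conv layer
        layers.Flatten(),
        layers.Dense(10, activation="softmax", kernel_regularizer=reg),
    ])

model = build_cnn()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])
```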

Fig. 2. Empirical distributions for the entropy, Euclidean norm and minimum applied to correctly and incorrectly predicted digits from the test data (green and red) of one CNN. Further distributions are generated from EMNIST samples with unlearned letters (blue), CIFAR10 images (gray) and uniform noise images (purple). (Color figure online)

In this section, we study the performance of the gradient metrics, the softmax baseline and the entropy in terms of AUROC and AUPR for the EMNIST test data, thus considering the error and success prediction problem formulated in [11]. First of all, we demonstrate that gradient metrics are indeed able to provide good separations. Results for the entropy, Euclidean norm and minimum are shown in Fig. 2 (green and red). Note that we have left out the mean, skewness and kurtosis metrics, as their violin plots showed that they are not suitable for a threshold-based meta classifier.

In what follows we define EMNISTc as the set containing all correctly classified samples of the EMNIST test set and EMNISTw as the set containing all incorrectly classified ones. From now on we resample the data splitting and use ensembles of CNNs. More precisely, the random splitting of the \(280,\!000\) digit images into training, validation and test data is repeated 10 times and we train one CNN for each splitting. In this way we obtain 10 CNNs that differ with respect to initial weights, training, validation and test data. We then repeat the above meta classification for each of the CNNs. With this non-parametric bootstrap, we try to get as close as possible to a true sampling of the statistical law underlying the EMNIST ensemble of data and obtain results with statistical validity.

Table 1. AUROC, AUPR-In (EMNISTc as positive case) and AUPR-Out (EMNISTc as negative case) values for the threshold classification on the softmax baseline, entropy as well as selected gradient metrics. All values are in percentage and averaged over 10 differently initialized CNNs with distinct splittings of the training data. Values in brackets are the standard deviation of the mean in percentage. To get the standard deviation within the sample, multiply by \(\sqrt{10}\).

Table 1 shows that the softmax baseline as well as some selected gradient metrics exhibit comparable performance on the test set in the error and success prediction task. Column one corresponds to the empirical distributions depicted in Fig. 2 for \(200,\!000\) test images.

In a next step we aggregate the entropy and all gradient based metrics (evaluated on the gradient of each layer in the CNN) in a more sophisticated classification technique. To this end, we choose a variety of regularized and unregularized logistic regression techniques, namely a Generalized Linear Model (GLM) equipped with the logit link function, the Least Absolute Shrinkage and Selection Operator (LASSO) with an \(L^1\) regularization term and regularization parameter \(\lambda _1=1\), ridge regression with an \(L^2\) regularization term and regularization parameter \(\lambda _2=1\), and finally the elastic net with one half \(L^1\) and one half \(L^2\) regularization, i.e., \(\lambda _1=\lambda _2=0.5\). For details about these methods, cf. [10].
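A sketch of these four meta classifiers (assuming a recent scikit-learn; note that scikit-learn parameterizes the regularization strength via its inverse C, so the values below only mirror \(\lambda _1=\lambda _2=1\) resp. 0.5 approximately) could look as follows, with X_val holding the stacked metric features on the validation set and y_val indicating correct classification.

```python
from sklearn.linear_model import LogisticRegression

meta_classifiers = {
    "GLM":         LogisticRegression(penalty=None, max_iter=1000),
    "LASSO":       LogisticRegression(penalty="l1", C=1.0, solver="saga", max_iter=1000),
    "Ridge":       LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
    "Elastic net": LogisticRegression(penalty="elasticnet", l1_ratio=0.5, C=1.0,
                                      solver="saga", max_iter=1000),
}

# X_val: metric features (entropy + layerwise gradient metrics) on the EMNIST validation set
# y_val: 1 = correctly classified, 0 = incorrectly classified
# for name, clf in meta_classifiers.items():
#     clf.fit(X_val, y_val)
```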

To include a non-linear classifier, we train a feed-forward NN with one hidden layer containing 15 rectified linear units (ReLUs), \(L^2\) weight decay of \(10^{-3}\) and a 2-way softmax output. The neural network is trained in the same fashion as the CNNs with stochastic gradient descent. Both groups of classifiers are trained on the EMNIST validation set. Results for the logistic regression techniques can be seen in Table 2 (column one) and those for the neural network in Table 3 (first row of each evaluation metric). For comparison we also include the entropy and softmax baseline in each table. The regression techniques perform equally well or better compared to the softmax baseline. This, however, is not true for the NN. For the logistic regression types, including more features from early layers did not improve the performance; the neural network, however, showed improved results. This means the additional information in those layers can only be utilized by a non-linear classifier.
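A corresponding minimal sketch of the non-linear meta classifier, as far as it is specified above (one hidden layer with 15 ReLUs, \(L^2\) weight decay of \(10^{-3}\), 2-way softmax output), might look as follows; the input dimension n_features depends on how many metric features are supplied.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_meta_nn(n_features):
    reg = regularizers.l2(1e-3)
    return tf.keras.Sequential([
        layers.Input((n_features,)),
        layers.Dense(15, activation="relu", kernel_regularizer=reg),    # hidden layer, 15 ReLUs
        layers.Dense(2, activation="softmax", kernel_regularizer=reg),  # correct vs. incorrect
    ])
```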

4 Recognition of Unlearned Concepts

A (C)NN, being a statistical classifier, classifies inside the prescribed label space. In this section, we empirically test the hypothesis that test samples from outside the label space will all be misclassified, however at a statistically different level of the entropy or gradient metrics, respectively. We test this hypothesis for three cases: First, we feed the CNN with images from the EMNIST letter set and determine the entropy as well as the values of all gradient metrics for each of them. Secondly, we follow the same procedure, however the inputs are gray scale CIFAR10 images coarsened to \(28\times 28\) pixels. Finally, we use uncorrelated noise that is uniformly distributed in the gray scales with the same resolution. Roughly speaking, we test empirical distributions for unlearned data that is close to the learned concept, as in the case of EMNIST letters, data that represents a somewhat remote concept, as in the case of CIFAR10, or data that, as in the case of noise, does not represent any concept at all.
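The out-of-distribution inputs can be prepared along the following lines (a sketch; the interpolation method used for coarsening is an assumption, as the text only fixes the target resolution of \(28\times 28\) pixels).

```python
import numpy as np
import tensorflow as tf

# CIFAR10: converted to gray scale and coarsened to 28x28 pixels
(cifar_x, _), _ = tf.keras.datasets.cifar10.load_data()
cifar_gray = tf.image.rgb_to_grayscale(cifar_x.astype("float32") / 255.0)
cifar_28 = tf.image.resize(cifar_gray, (28, 28)).numpy()   # bilinear resize assumed

# uncorrelated noise, uniformly distributed in the gray scales
uniform_noise = np.random.uniform(0.0, 1.0, size=(20_000, 28, 28, 1)).astype("float32")
```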

We classify the output of the CNN on such inputs as an incorrect label; in this way we address the in- and out-of-distribution detection problem from [11], while still detecting misclassifications in the prescribed label space. The empirical distributions for the unlearned concepts can be seen in Fig. 2. As we can observe, the distributions for incorrectly classified samples are, in a statistical sense, significantly different from those for correctly classified ones. The gradient metrics, however, do not separate the noise samples very well, but they still result in an overall good separation of the other concepts, similar to the entropy. The threshold classification evaluation metrics can be seen in Table 1. For the logistic regression results in Table 2, one can see that the GLM is inferior to the other methods. Regression techniques with a regularization term, like LASSO, ridge and elastic net, perform best. We get AUROC values similar to those of the threshold classification with single metrics, but improve between \(5\%\) and \(14.08\%\) over the softmax baseline in terms of AUPR-Out values for unknown concepts, showing a better generalization.

Table 2. Average AUROC, AUPR-In and AUPR-Out values for different regression types trained on the validation set and all metric features including the entropy but excluding the softmax baseline. The values are averaged over 10 CNNs and displayed in percentage. The values in brackets are the standard deviations of the mean in percentage. To get the standard deviation within the sample, multiply by \(\sqrt{10}\).

5 Meta Classification with Known Unknowns

In the previous section we trained the meta classifier on the training or validation data only. This means it has no knowledge of the entropy or metric distributions for unlearned concepts; hence we followed a puristic approach, treating out-of-distribution cases as unknown unknowns. The classification accuracy can be improved by extending the training set of the meta classifier with the entropy and gradient metric values of a few unlearned concepts and labeling them as false, i.e., incorrectly predicted (a small sketch of this augmentation follows the list below). As in the previous sections, we then train meta classifiers on the metrics. For this we use the same data sets as [11], namely the Omniglot handwritten characters set [14], the notMNIST dataset [4] consisting of letters from different fonts, the CIFAR10 dataset [12] coarsened and converted to gray scale, as well as normal and uniform noise. In order to investigate the influence of unknown concepts in the training set of the meta classifier, we use the LASSO regression and the NN introduced in Sect. 3 and supply them with different training sets, consisting of

  • EMNIST validation set

  • EMNIST validation set and 200 uniform noise images

  • EMNIST validation set, 200 uniform noise images and 200 CIFAR10 images

  • EMNIST validation set, 200 uniform noise images, 200 CIFAR10 images and 200 omniglot images
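The augmentation referred to above can be sketched as follows (illustrative names; the metric features of the unlearned concept are assumed to be extracted in the same way as for the EMNIST validation set): the features of, e.g., 200 uniform noise images are appended to the meta classifier's training set and labeled as incorrect predictions.

```python
import numpy as np

def augment_with_known_unknowns(X_val, y_val, X_unknown, n=200, seed=0):
    """X_val, y_val: metric features and labels (1 = correct) on the EMNIST validation set;
    X_unknown: metric features of an unlearned concept (e.g. uniform noise images)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_unknown), size=n, replace=False)
    X_aug = np.vstack([X_val, X_unknown[idx]])
    y_aug = np.concatenate([y_val, np.zeros(n, dtype=y_val.dtype)])  # known unknowns labeled as false
    return X_aug, y_aug
```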

Table 3. AUROC, AUPR-In (EMNISTc is the positive case) and AUPR-Out (EMNISTc is the negative case) values for a NN meta classifier. “All” contains Omniglot, notMNIST, CIFAR10, normal noise and uniform noise. We used 200 samples of each concept that was additionally included in the training set. The supplied features are all gradient based metrics as well as the entropy. The displayed values are averages over 5 differently initialized NN meta classifiers for each of the 10 CNNs trained on the EMNIST dataset. All values are in percentage and the values in brackets are the standard deviations of the mean in percentage. To get the standard deviation within the sample, multiply by \(\sqrt{10}\).

We omit the results for the LASSO here, since they are inferior to those of the NN. When known unknowns are included in the training set, the NN performs far better on the unknown concepts, even though the amount of additional training data is small. Notably, the validation set together with only 200 uniform noise images already increases the AUPR-Out values for all unknown concepts significantly, by \(13.74\%\), which is even comparable to using all concepts. Together with the fact that noise is available at virtually no cost, this makes it a very promising candidate for improving the generalization of the meta classifier without the need to generate labels for more datasets. The in-distribution detection rate of correct and wrong predictions also increases when using additional training concepts, so that including noise in the training set of the meta classifier is purely beneficial. Our experiments show, however, that normal noise does not have as high an influence on the performance as uniform noise and even decreases the in-distribution meta classification performance. All in all, we reach a \(3.48\%\) higher performance on the out-of-distribution examples compared to the softmax baseline in terms of AUPR-Out and \(0.81\%\) in terms of AUROC, whereas the increase in AUPR-In is marginal (\(0.12\%\)).

6 Conclusion and Outlook

We introduced a new set of metrics that measure the uncertainty of deep CNNs. These metrics perform comparably to the widely used entropy and maximum softmax probability when meta-classifying whether a certain classification proposed by the underlying CNN is presumably correct or incorrect. Here the performance is measured by AUROC, AUPR-In and AUPR-Out. Entropy and softmax probability perform equally well or slightly better than any single member of the new gradient based metrics for the detection of unknown concepts like EMNIST letters, gray scale converted CIFAR10 images and uniform noise when simple thresholding criteria are applied. Still, our new metrics capture the contributions of different layers and weights to the total uncertainty. Combining the gradient metrics with the entropy in a more complex meta classifier increases the ability to identify out-of-distribution examples, so that in some cases these meta classifiers outperform the baseline. Additional calibration by including a few samples of unknown concepts increases the performance significantly. Uniform noise proved to raise the overall performance without the need for more labels. Overall, the results for the classification of correct and incorrect predictions improved when the meta classifier was supplied with more distinct concepts in the training set. It seems that the larger number of uncertainty metrics helps to better separate the correctly classified samples from the variety of out-of-sample classes, which would be difficult if only one metric were available. Note that this increase in meta classification performance is particularly valuable if one does not want to deteriorate the classification performance of the underlying classifier by adding extra classes for the known unknowns.

As future work we want to evaluate the performance and robustness of such gradient metrics on different tasks in pattern recognition. Further features could be generated by applying the metrics to activations rather than gradients. One could also investigate the possibility of generating artificial samples, labeled as incorrect, for the training set of the meta classifier in order to further improve the results.