CEUR Workshop Proceedings Vol-2640, paper 18: Is Uncertainty Quantification in Deep Learning Sufficient for Out-of-Distribution Detection? (https://ceur-ws.org/Vol-2640/paper_18.pdf)
                    Is Uncertainty Quantification in Deep Learning Sufficient
                               for Out-of-Distribution Detection?

          Adrian Schwaiger, Poulami Sinhamahapatra, Jens Gansloser, Karsten Roscher
                     Fraunhofer IKS, Fraunhofer Institute for Cognitive Systems
                                {firstname.lastname}@iks.fraunhofer.de


                          Abstract

Reliable information about the uncertainty of predictions from deep neural networks could greatly facilitate their utilization in safety-critical applications. Current approaches for uncertainty quantification usually focus on in-distribution data, where a high uncertainty should be assigned to incorrect predictions. In contrast, we focus on out-of-distribution data, where a network cannot make correct predictions and should therefore always report high uncertainty. In this paper, we compare several state-of-the-art uncertainty quantification methods for deep neural networks regarding their ability to detect novel inputs. We evaluate them on image classification tasks with regard to metrics reflecting requirements important for safety-critical applications. Our results show that a portion of out-of-distribution inputs can be detected with reasonable loss in overall accuracy. However, current uncertainty quantification approaches alone are not sufficient for an overall reliable out-of-distribution detection.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   Introduction

Many state-of-the-art methods for solving perceptual tasks are based on Deep Neural Networks (DNNs). However, the lack of interpretability of these networks is still a problem when DNNs are employed in safety-critical applications, e.g., for autonomous driving or in medical diagnosis. In these domains, mistakes are not just a minor annoyance but can have severe consequences. Therefore, thorough safety analysis and argumentation are an integral part of the development of such systems. Unfortunately, the black-box nature of DNNs and the fact that already slight changes in the input can have drastic effects on the output make this task almost impossible for complex DNN-based computer vision pipelines.

One approach to address this problem is quantifying the predictive uncertainty of a DNN for each given input. Reliable uncertainty estimates can be utilized by a safety envelope [Weiss et al., 2018] that encapsulates the high-performance DNN. Whenever the uncertainty of a prediction is high, the result of the DNN is discarded and the prediction of a verified, lower-performance safety path is used instead. In this context, the performance of different Uncertainty Quantification (UQ) approaches for DNNs has already been investigated on In-Distribution (ID) data, i.e., data that is conceptually similar to the data the network has been trained on [Henne et al., 2020]. However, the viability of UQ for detecting Out-of-Distribution (OOD) inputs, i.e., data that differs strongly from the training data, is still an open question. The detection of such inputs is important, as DNNs are not able to provide a correct prediction for them. For instance, a network trained to distinguish between cats and dogs will always output one or the other, and very often with high confidence, even when challenged with an OOD sample, e.g., the image of a car. As it is not feasible to construct a dataset that guarantees the coverage of all relevant concepts in sufficient quantity for open-world scenarios, approaches to detect OOD inputs are important to ensure the safety of the overall system and to detect violations of its operational design domain.

In this paper, we investigate several state-of-the-art methods for UQ in combination with popular DNN architectures for image classification. We use three datasets from different application domains to train the models and apply them to test sets containing in- and out-of-distribution samples. We focus on the trade-off between remaining accuracy and remaining error under the assumption that inputs with uncertain predictions are handled by a fallback mechanism and therefore count toward neither of them. Since the acceptable remaining error or minimal performance may vary from application to application, we highlight the relationship between the two instead of assuming arbitrary limits.

2   Related Work

Machine Learning in Safety-Critical Domains  Arguing the safety of Machine Learning (ML) algorithms for complex tasks still remains an open research question. Insufficiencies of DNNs on perception tasks include, e.g., susceptibility towards distributional shifts and lack of interpretability; general mitigation strategies, e.g., the incorporation of uncertainty and proper specification of the data acquisition process, are discussed in [Willers et al., 2020].
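The safety-envelope pattern from the introduction can be sketched as follows. This is an illustrative sketch only: `predict`, `uncertainty`, and `fallback` are hypothetical stand-ins, not the authors' implementation, and the threshold is application-specific.

```python
def safety_envelope(x, predict, uncertainty, fallback, threshold=0.5):
    """Use the high-performance DNN only when its uncertainty is low;
    otherwise discard its result and defer to a verified safety path."""
    if uncertainty(x) > threshold:
        return fallback(x)   # verified, lower-performance safety path
    return predict(x)        # high-performance DNN prediction
```

With a reliable uncertainty measure, OOD inputs would take the fallback branch; the experiments below probe how well current UQ methods fulfil that assumption.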
One way to argue the safety is by formulating confidence arguments to gather evidence for the performance of an ML system [Burton et al., 2019]. To aid the formulation of such arguments, the authors provide an overview of the most common failure cases and propose assurance claim points to further break down the task. Another direction in the domain of autonomous vehicles to assure safety is the creation of a specification based on formal rules and physical constraints, as is done within RSS [Shalev-Shwartz et al., 2018]. As this approach implicitly requires perfect perception, which in a real-world scenario is unattainable, PURSS [Salay et al., 2020] has been proposed as an extension to allow the integration of perceptual uncertainty into the otherwise rigid specifications.

Interpretability  The lack of interpretability of DNNs is a hindrance for using them in safety-critical applications, as it makes thorough safety analyses almost impossible. One approach to address this problem is the visualization of the learned features and their interplay with each other [Olah et al., 2020]. While this only enables qualitative analyses, the authors suggest that it can aid in gaining a better understanding of DNNs and facilitate other work in this domain. A different direction is the formation of human-understandable features in DNNs. For instance, by specifying desired concepts, a network can be incentivized to learn corresponding features, which in turn can be used for quantitative analyses [Kim et al., 2018].

Verification of DNNs  The ability to verify DNNs would facilitate any safety argumentation greatly. Approaches concerning verification include linear approximations of the learned function in order to subsequently solve them using existing verification tools [Katz et al., 2017]. The problems of scalability and the definition of proper specifications, however, prevent their application to complex perception tasks. Nevertheless, it is an active field of research and promising approaches exist, e.g., for the verification of direct perception utilizing an input property characterizer and an approach to verification based on assumed guarantees [Cheng et al., 2019].

Out-of-Distribution Detection  In real-world machine learning applications, detecting OOD samples in the test data, which indicate a distributional shift from the training data, is paramount. It has been recognized as an important problem for AI safety [Amodei et al., 2016]. Neural network classifiers tend to incorrectly classify OOD samples with high confidence. These high-confidence predictions often result from the softmax function: its probabilities are computed with the fast-growing exponential function, where a minor change in the input can lead to a substantial increase in the output. In this direction, [Hendrycks and Gimpel, 2018] proposed a baseline method to detect OOD samples based on the observation that a well-trained neural network tends to assign higher softmax scores to ID samples than to OOD samples. This approach was further extended in ODIN [Liang et al., 2018] by using temperature scaling in the softmax function [Guo et al., 2017] and adding small controlled perturbations to the inputs such that the softmax score gap between ID and OOD samples is further enlarged. Here, while the network is trained with the default softmax, during the test phase the tempered softmax forces the network to be sure of its decisions. In [DeVries and Taylor, 2018], the authors propose Learned Confidence estimates to classify a sample as ID or OOD by appending a confidence estimation branch to the network. Similarly, Metric Learning [Masana et al., 2018] adds an additional output branch and maps it into a manifold, where the Euclidean distance from such manifolds is used as a measure for detecting possible OOD samples. A probabilistic approach given in [Lee et al., 2018] uses features (lower- and upper-level) from any pre-trained classifier and maps them into class-conditional Gaussian distributions under Gaussian discriminant analysis, which results in a confidence score based on the Mahalanobis distance. Finally, the most popular method of computing probabilistic statistics uses ensembles of predictions of discriminative classifiers trained on ID data, as proposed by [Lakshminarayanan et al., 2017]. It has emerged as a popular non-Bayesian approach for predictive UQ, also used for detecting OOD samples during inference. An alternative direction for approaching the OOD detection problem is the use of generative model-based methods, which are appealing as they do not require labeled data and directly model the input distribution. These methods fit a generative model p(x) to the ID data and then evaluate the likelihood of new inputs under that model, as in [Ren et al., 2019], [Serrà et al., 2019]. Moreover, many self-supervised approaches [Hendrycks et al., 2019], [Mohseni et al., 2020], which also do not need labeled data, have shown promise in OOD detection, often with accuracy comparable to supervised methods.

3   Uncertainty Quantification for OOD Detection

In the previous section, dedicated OOD detection techniques have been presented. However, it is reasonable to investigate the usage of UQ for this task as well. The idea is that a DNN should assign a high uncertainty to OOD inputs, as nothing comparable has been encountered before.

In [Osawa et al., 2019] the authors, i.a., compare Bayesian UQ methods w.r.t. their performance in detecting OOD samples. Although their results are promising, the chosen task is not as complex, because the defined datasets for ID and OOD are very dissimilar. The performance of different uncertainty quantifiers in distinguishing samples from more similar distributions has been investigated in [Pawlowski et al., 2017]. Their findings are promising and also encourage further research in this area.

3.1   Predictive Uncertainty Quantification of DNNs

A common approach to probabilistic UQ for neural networks is to rely on Bayesian methods (e.g., variational Bayes or Markov chain Monte Carlo), where the posterior distribution over the network parameters is computed. However, exact Bayesian inference is usually intractable, thus the posterior can only be computed approximately. Recently, non-Bayesian methods have gained in popularity, as they often allow for simpler implementation and faster training. In this work, we focus on methods for predictive UQ that are fast to train, reasonably easy to implement, and suitable for large-scale problems often seen in image classification tasks.
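The softmax-score baseline and its ODIN-style temperature scaling discussed above can be sketched as follows. This is a minimal numpy illustration: ODIN's input perturbation step is omitted, and the detection threshold is an assumed, application-specific value.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Tempered softmax: T = 1 gives the default softmax; a larger T is
    applied at test time to widen the score gap between ID and OOD inputs."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def is_ood(logits, T=1000.0, threshold=0.9):
    """Flag an input as OOD when its maximum (tempered) softmax score
    falls below a threshold, as in the softmax-score baseline."""
    return softmax(logits, T).max() < threshold
```

Because the exponential grows so quickly, even a confidently wrong prediction yields a near-one maximum score at T = 1; tempering flattens the distribution so that the remaining score differences between ID and OOD inputs become easier to threshold.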
A straightforward approach to UQ is to interpret the classification scores as probabilities, e.g., by applying the softmax function to the prediction scores. However, modern DNNs tend to be not well calibrated, i.e., the predicted probability for an input sample does not represent the true accuracy of the network. This is especially true for DNNs with high model capacity and a lack of regularization [Guo et al., 2017]. One approach for DNN calibration is to learn a scaling of the predicted probabilities using a validation set, where the parameters of the DNN are fixed.

In addition to that, softmax probabilities viewed alone are often overconfident for OOD samples [Gal and Ghahramani, 2016]. Nevertheless, for a given network, ID samples tend to have greater softmax values than OOD samples, which can be used as a baseline for OOD detection [Hendrycks and Gimpel, 2018].

Deep Ensembles  Ensembles of deep neural networks, i.e., deep ensembles, are a well-known method to improve prediction accuracy. However, deep ensembles can also be used as a non-Bayesian uncertainty estimator [Lakshminarayanan et al., 2017]. A number of randomly initialized neural networks are trained independently on the same training data. To compute the predictive distribution, the individual prediction probabilities of all neural networks in the ensemble are averaged. Additionally, [Lakshminarayanan et al., 2017] propose to use proper scoring functions as loss functions and adversarial training to smooth the predictive distributions.

Monte-Carlo Dropout  MC-Dropout can be interpreted as a form of ensembling with shared network parameters or, alternatively, as approximate Bayesian inference [Gal and Ghahramani, 2016]. Usually, dropout is used during training for regularization to prevent overfitting. However, dropout can also be used during inference to estimate the predictive distribution. The empirical predictive mean and variance are calculated from multiple stochastic forward passes, where each forward pass can be seen as sampling from a posterior distribution over the network weights. Since MC-Dropout does not require any change in the network architecture, it is easy to implement and to use with existing architectures.

Learned Confidence  A different, sampling-free approach to estimate uncertainty is proposed in [DeVries and Taylor, 2018], where the network learns an explicit confidence score as a second optimization objective. A confidence layer is added after the last network layer, in parallel to the class prediction layer. The optimization objective is then the sum of the classification loss and the confidence loss.

Evidential Deep Learning  Evidential Deep Learning [Sensoy et al., 2018] is inspired by Dempster-Shafer theory and is another sampling-free approach. For classification tasks, the parameters of a Dirichlet distribution are learned, from which the total evidence for each of the classes and the epistemic uncertainty regarding the prediction as a whole can be calculated. The authors also conducted some experiments regarding OOD detection and showed that their method generally assigned higher uncertainties to OOD inputs.

4   Evaluation

In the following, the previously presented UQ methods, Deep Ensembles (DE), Monte-Carlo Dropout (MCDO), Learned Confidence (LC), and Evidential Deep Learning (EDL), are compared to each other and to the default softmax confidences, which serve as a baseline. The task, hereby, is to classify images correctly and confidently.

4.1   Experimental Setup

To provide a comprehensive comparison, we trained each of the UQ methods on three different model architectures: VGG16 [Simonyan and Zisserman, 2015] as a standard network architecture, SqueezeNet [Iandola et al., 2016] for its small size and suitability for embedded systems, and the recently introduced EfficientNet [Tan and Le, 2019] as a high-performing and efficient architecture. The model variant B0 of EfficientNet was adopted for our use cases. All models use dropout regularization to allow the application of MCDO. Each deep ensemble consists of 5 networks, and the number of sampling steps for MCDO has been set to 50. Increasing the number of members or sampling steps further led only to minor improvements. For LC, the last dense layer of each model is replaced by a prediction and a confidence branch, which are then concatenated again to form the final prediction, as in [DeVries and Taylor, 2018]. Additionally, we set the hyperparameters for the loss function of LC to λ = 0.1 and β = 0.3, which generally showed the best results in our experiments. For EDL, using softplus as the evidence function in combination with the expected cross-entropy loss employing the digamma function, as described in [Sensoy et al., 2018], yielded the best results and is used in all experiments presented in this paper.

As training datasets we used CIFAR-10, the German Traffic Sign Recognition Benchmark (GTSRB) [Stallkamp et al., 2011], and NWPU-RESISC45 [Cheng et al., 2017]. CIFAR-10 contains small images separated into 10 different classes, e.g., automobile, truck, or dog. GTSRB is a collection of German traffic signs; the number of classes amounts to 43. NWPU-RESISC45 has larger aerial images which are categorized into 45 different classes, e.g., forest, freeway, or railway station. Additionally, we used images from CIFAR-100 as OOD samples for CIFAR-10 and Belgium Traffic Signs (BTSRB) [Timofte et al., 2014] as OOD samples for GTSRB. While CIFAR-100 and CIFAR-10 already have distinct classes, for BTSRB we only included classes that had no equivalent in GTSRB. As we found no suitable OOD datasets for NWPU-RESISC45, we split it into two datasets. The OOD dataset includes 9 classes: airplane, airport, beach, harbor, island, lake, river, sea ice, and ship. These are semantically separated from the remaining 36 classes used as the ID dataset. Overall, the ID and OOD dataset pairs are quite similar to each other, which makes the task of OOD detection more difficult. This was done purposefully, as it transfers better to safety-critical applications, where OOD inputs coming from the exact same sensor in similar environments must be detected.

We trained the models from scratch using random initializations and used Adam as the optimizer. Early stopping has been applied if the validation loss did not change for several epochs, to prevent overfitting whilst ensuring fully trained networks. Augmentations have not been applied, to rule out potential side effects introduced by the specific configuration used.
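The predictive distributions of DE and MCDO described above are both obtained by averaging member predictions (5 ensemble members, or 50 stochastic dropout forward passes, in our setup). A minimal numpy sketch of this averaging and of an entropy-based uncertainty score, not the training code:

```python
import numpy as np

def predictive_distribution(member_probs):
    """Average per-member class probabilities (ensemble members for DE,
    stochastic dropout forward passes for MCDO) into one prediction."""
    return np.mean(np.asarray(member_probs, dtype=float), axis=0)

def predictive_entropy(probs):
    """Entropy of the averaged distribution, usable as an uncertainty
    score: disagreeing members yield a flat average and high entropy."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())
```

Two members that confidently predict opposite classes average to a uniform distribution, i.e., maximal entropy; this disagreement on unfamiliar inputs is what makes the averaged prediction useful for OOD detection.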
4.2   Evaluation Metrics

Following the cue of our previous work in [Henne et al., 2020], similar evaluation metrics have been used in this paper: maximizing the Remaining Accuracy Rate (RAR) while minimizing the Remaining Error Rate (RER). RAR takes into account the number of samples which have been correctly classified by the classifier as well as declared confident ("certain" and "correct") for a given threshold by the respective UQ method. RER, on the other hand, is the fraction of inputs that are classified incorrectly but with a high confidence ("certain" and "incorrect").

All trained networks were evaluated first on a test set with only ID data. Subsequently, the same model was tested on a second test set where OOD samples corresponding to 17.65% of the size of the ID data were added, to obtain a dataset with 85% ID and 15% OOD samples. The amount of OOD samples was chosen arbitrarily to improve the visual presentation of the plots. However, it has no impact on the overall observations, since we focus on the relative performance between best and worst case.

4.3   Results and Discussion

Remaining Accuracy and Error

The results are shown in Figure 1. Due to space restrictions, the graphs for VGG16 could not be included. Each curve consists of the RAR and RER plotted for each threshold t ∈ [0; 1] with a sampling step size of 0.001. The blue curves represent the performance on the ID dataset, the green curves show the performance on the dataset with combined ID and OOD samples. Furthermore, the green curves have been normalized regarding the RAR by the amount of OOD samples. Thereby, the influence on the accuracy due to the additional OOD samples is eliminated and only the error introduced by them is factored in. For one, this better represents the application case, as the DNNs are not supposed to classify OOD samples correctly and only have to detect them. Second, due to the normalization, the behavior regarding the OOD detection can be interpreted better visually. Given a perfect OOD detection method, both curves would be the same, as all OOD samples would be rejected. The black curves show the worst case, i.e., if none of the OOD samples are rejected. They have also been normalized like the green curves.

On GTSRB, DE can detect most of the OOD samples with a minor loss in accuracy. Although SqueezeNet has a slightly lower base accuracy, it is slightly better at rejecting OOD samples. For GTSRB, this also holds for MCDO and softmax. EDL, on the other hand, shows a better OOD discrimination ability on the other two architectures for higher RER, but can reduce the error almost completely with the highest accuracy left. LC performs sub-par with SqueezeNet, which might be due to the low number of parameters, as we already noticed in [Henne et al., 2020].

On CIFAR-10 using EfficientNet, all but DE perform equally with only minor differences. An exception to this are softmax and MCDO, which for an RER of < 3.5% drop in a straight line, suggesting that there are no thresholds which can produce error rates in that range. For error rates < 0.5%, all but softmax show the same accuracy. Upon further investigation, we noticed the distributions of classes among the undetected OOD samples were similar, hinting towards samples that are universally hard to reject. For SqueezeNet, DE shows significantly the best performance, followed by EDL. Softmax and MCDO perform equally, and LC again performs the worst with this architecture. Using VGG16, DE still outperforms the other methods, but the difference is much less significant. EDL and MCDO perform more or less equally, with EDL being slightly better for high RER and MCDO being slightly better for really low RER. LC and softmax also show similar performance. Softmax again is not able to produce different RER in lower ranges; however, this can mostly be attributed to the ID samples.

On NWPU-RESISC45, DE performs best for EfficientNet in terms of maximum RAR achieved. Next, softmax and MCDO behave similarly but with a slight decrease in RAR. For RER < 5%, both of these methods show a slight kink in the curve, showing their sensitivity to a certain range of thresholds. But even in this range, DE clearly achieves much better RAR at the cost of < 1% RER. LC comes close to 80% RAR, but at the cost of much higher RER. Finally, EDL performs similarly to the others for RER < 5%, but is vastly outperformed in terms of overall RAR. For SqueezeNet, DE again performs best, followed by MCDO, EDL, and softmax in close proximity. Nonetheless, LC, as pointed out earlier, performs the worst with this architecture. For VGG16, all the methods perform sub-optimally in comparison to the other architectures, with a maximum RAR of nearly 75% achieved by DE. Similar to the trend above, DE is followed by EDL with comparable RAR for RER < 5%. Softmax and MCDO follow them, but spread over a larger RER range. LC again is not able to produce RER in lower ranges and has a much larger RER compared to the similar RAR achieved by the other UQ methods.

Based on the observations above, DE performs best across all methods and datasets. LC had originally been proposed as an OOD detection method rather than a UQ method. However, LC has shown consistently sub-optimal performance in almost all the scenarios above, particularly with smaller architectures like SqueezeNet or higher-resolution datasets like NWPU-RESISC45. MCDO and softmax perform averagely in most cases. Most UQ methods, including EDL, tend to have quite competent RAR in lower RER ranges, but on initial investigation it has also been observed that there always exist some harder sample categories which are almost too difficult for most UQ methods to certainly reject.

Quality of Uncertainty Estimation

To further assess the novelty detection capabilities of the methods, we show the ratio of inputs marked as uncertain for a given threshold. We thereby show the comparison for the three possible cases: ID inputs predicted correctly, ID inputs predicted incorrectly, and OOD inputs. Corresponding to each of the three cases, we plot, over all thresholds, the fraction of samples having high uncertainty. An ideal method, for some given threshold, is certain for all correct predictions and
[Figure 1: RAR plotted over RER for each UQ method (Softmax, Deep Ensemble, Monte Carlo Dropout, Evidential Deep Learning, Learned Confidence) on GTSRB, CIFAR-10, and NWPU-RESISC45, using the EfficientNet and SqueezeNet architectures.]
                                                                  0.0   0.1   0.2    0.3
                                                                                       0.60.0   0.1   0.2     0.3
                                                                                                               0.8 0.0   0.1   0.2   0.3
                                                                                                                                     1.0
                                                                           RER
                                                  ID             Normalized ID+OOD              Lower bound

Figure 1: Remaining Error Rate (RER) vs. Remaining Accuracy Rate (RAR) for EfficientNet and SqueezeNet on the GTSRB, CIFAR-10, and
NWPU-RESISC45 datasets. The plots show the performance first on the ID dataset (blue), then on a dataset consisting of ID and OOD
samples (green). The lower bound (black) represents the worst-case scenario in which the network fails to reject any of the OOD samples.
[Figure 2 plot grid: Uncertainty Ratio (y-axis) vs. Confidence Threshold (x-axis) for Softmax, Monte-Carlo Dropout, Deep Ensembles, Evidential Deep Learning, and Learned Confidence; legend: correct, incorrect, ood]

Figure 2: The ratio of inputs marked as uncertain for the three cases — correctly classified, incorrectly classified, and OOD inputs — over
the range of thresholds for EfficientNet on CIFAR-10.
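The curves of Figure 2 are the per-category fractions of inputs flagged as uncertain (confidence below the threshold) at each threshold. A minimal sketch, in which the function name and category encoding are our own assumptions:

```python
import numpy as np

def uncertainty_ratios(confidences, labels, thresholds):
    """For each category ('correct', 'incorrect', 'ood'), return the
    fraction of its inputs whose confidence falls below each threshold."""
    conf = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels)
    curves = {}
    for cat in ("correct", "incorrect", "ood"):
        cat_conf = conf[labels == cat]
        curves[cat] = np.array([np.mean(cat_conf < t) for t in thresholds])
    return curves
```

A well-behaved UQ method would push the 'incorrect' and 'ood' curves toward 1 at low thresholds while keeping the 'correct' curve near 0.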

[Figure 3 plot row: RER (x-axis) vs. RAR (y-axis) for Deep Ensembles on GTSRB, CIFAR-10, and NWPU-RESISC45; legend: EfficientNet, SqueezeNet, VGG16]

Figure 3: Remaining Error Rate (RER) vs. Remaining Accuracy Rate (RAR) for Deep Ensembles on the normalized ID+OOD dataset trained
on GTSRB, CIFAR-10 and NWPU-RESISC45 with EfficientNet, SqueezeNet and VGG16.


uncertain for all incorrect predictions as well as predictions for OOD inputs. In Figure 2, the uncertainty ratios are shown only for CIFAR-10 and EfficientNet, but we observe the same findings in our other considered configurations. Most interestingly, the curves for incorrectly classified and OOD inputs match very closely. This raises the question: how correlated are these two categories, and will better UQ methods be able to better detect novel inputs? This could be subject for future research. The plots again indicate that very low error rates can only be achieved at the cost of sacrificing a lot of accuracy. It is also worth mentioning that EDL and LC exhibit a smoother behavior over the range of thresholds, especially compared to Softmax and MCDO, and are therefore less sensitive towards small changes in the choice of a threshold.

Influence of Model Architecture
While the choice of architecture is important for the performance with respect to accuracy, its influence on the OOD detection ability is not as significant, visually represented by how closely the blue and green curves match. An exception to this are some configurations with LC, especially with SqueezeNet. On CIFAR-10, all architectures perform mostly the same regarding their novelty detection ability, and on the easier GTSRB dataset SqueezeNet has a slight edge. For NWPU-RESISC45, VGG16 rejects OOD samples slightly better; however, its baseline accuracy is about 20% lower for all UQ methods. Figure 3 shows the overall performance of DE for all architectures on the combined ID + OOD datasets.

5    Conclusion and Future Work
In this paper, we investigated the question of whether uncertainty quantification is sufficient for detecting out-of-distribution inputs. To that end, we applied different state-of-the-art methods and network architectures to three image classification tasks. While all tested UQ methods assign high uncertainty to some of the OOD samples, their rejection capabilities will not suffice for most safety-critical applications,
especially considering that in the real world even more difficult OOD inputs can occur. If UQ should be applied, deep ensembles consistently showed the best trade-off between performance and remaining error, but mostly due to their better accuracy baseline to begin with.
   A closer look at our results revealed that in many cases all methods fail on OOD inputs from the same classes. This hints at the possibility that certain OOD inputs are conceptually harder (or even impossible) to identify, either by UQ methods or in general. However, further research is needed to provide more evidence. In addition, many novelty detection approaches have been proposed in recent years. It would be interesting to see how they perform compared to the UQ methods presented here. Furthermore, their error patterns may provide additional insights into the difficulties of OOD detection in general.
   Additionally, it is worthwhile investigating whether our findings also transfer to other tasks, e.g., object detection or instance segmentation, and to other types of input data, for instance, radar or lidar point clouds. While there are similar base components at play (object detectors even use the investigated networks as feature extractors), the transferability of our results is not guaranteed.

Acknowledgments
This work was partially supported by the Bavarian Ministry of Economic Affairs, Regional Development and Energy through the Center for Analytics-Data-Applications (ADA-Center) within the framework of "BAYERN DIGITAL II" and within the Intel Collaborative Research Institute - Safe Automated Vehicles.

References
[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete Problems in AI Safety. arXiv:1606.06565 [cs], July 2016.
[Burton et al., 2019] Simon Burton, Lydia Gauerhof, Bibhuti Bhusan Sethy, Ibrahim Habli, and Richard Hawkins. Confidence Arguments for Evidence of Performance in Machine Learning for Highly Automated Driving Functions. In Computer Safety, Reliability, and Security, LNCS, pages 365–377, Cham, 2019. Springer International Publishing.
[Cheng et al., 2017] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE, 105(10):1865–1883, October 2017.
[Cheng et al., 2019] Chih-Hong Cheng, Chung-Hao Huang, Thomas Brunner, and Vahid Hashemi. Towards Safety Verification of Direct Perception Neural Networks. arXiv:1904.04706 [cs], November 2019.
[DeVries and Taylor, 2018] Terrance DeVries and Graham W. Taylor. Learning Confidence for Out-of-Distribution Detection in Neural Networks. arXiv:1802.04865 [cs, stat], February 2018.
[Gal and Ghahramani, 2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proc. ICML 2016, volume 48, pages 1050–1059. PMLR, June 2016.
[Guo et al., 2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern Neural Networks. In Proc. ICML 2017, pages 1321–1330. JMLR.org, August 2017.
[Hendrycks and Gimpel, 2018] Dan Hendrycks and Kevin Gimpel. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In Proc. ICML 2017. JMLR.org, October 2018.
[Hendrycks et al., 2019] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty. In Advances in Neural Information Processing Systems 32, pages 15663–15674. Curran Associates, Inc., October 2019.
[Henne et al., 2020] Maximilian Henne, Adrian Schwaiger, Karsten Roscher, and Gereon Weiss. Benchmarking Uncertainty Estimation Methods for Deep Learning With Safety-Related Metrics. In Proc. SafeAI@AAAI 2020, volume 2560 of CEUR Workshop Proceedings, pages 83–90, 2020.
[Iandola et al., 2016] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360, 2016.
[Katz et al., 2017] Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification, LNCS, pages 97–117, Cham, 2017. Springer International Publishing.
[Kim et al., 2018] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In Proc. ICML 2018, pages 2668–2677, July 2018.
[Lakshminarayanan et al., 2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. In Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.
[Lee et al., 2018] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. arXiv:1807.03888 [cs, stat], October 2018.
[Liang et al., 2018] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. arXiv:1706.02690 [cs, stat], February 2018.
[Masana et al., 2018] Marc Masana, Idoia Ruiz, Joan Serrat, Joost van de Weijer, and Antonio M. Lopez. Metric Learning for Novelty and Anomaly Detection. In Proc. BMVC 2018, August 2018.
[Mohseni et al., 2020] Sina Mohseni, Mandar Pitale, JBS Yadawa, and Zhangyang Wang. Self-Supervised Learning for Generalizable Out-of-Distribution Detection. In Proc. AAAI 2020, 2020.
[Olah et al., 2020] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom In: An Introduction to Circuits. Distill, 5(3), March 2020.
[Osawa et al., 2019] Kazuki Osawa, Siddharth Swaroop, Mohammad Emtiyaz E Khan, Anirudh Jain, Runa Eschenhagen, Richard E Turner, and Rio Yokota. Practical Deep Learning with Bayesian Principles. In Advances in Neural Information Processing Systems 32, pages 4287–4299. Curran Associates, Inc., 2019.
[Pawlowski et al., 2017] Nick Pawlowski, Miguel Jaques, and Ben Glocker. Efficient variational Bayesian neural network ensembles for outlier detection. In Proc. ICLR 2017. OpenReview.net, 2017.
[Ren et al., 2019] Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood Ratios for Out-of-Distribution Detection. In Advances in Neural Information Processing Systems 32, pages 14707–14718. Curran Associates, Inc., 2019.
[Salay et al., 2020] Rick Salay, Krzysztof Czarnecki, Maria Soledad Elli, Ignacio J. Alvarez, Sean Sedwards, and Jack Weast. PURSS: Towards Perceptual Uncertainty Aware Responsibility Sensitive Safety with ML. In Proc. SafeAI@AAAI 2020, volume 2560 of CEUR Workshop Proceedings, pages 91–95, 2020.
[Sensoy et al., 2018] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential Deep Learning to Quantify Classification Uncertainty. In Advances in Neural Information Processing Systems 31, pages 3179–3189. Curran Associates, Inc., 2018.
[Serrà et al., 2019] Joan Serrà, David Álvarez, Vicenç Gómez, Olga Slizovskaia, José F. Núñez, and Jordi Luque. Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models. In Proc. ICLR 2020, September 2019.
[Shalev-Shwartz et al., 2018] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. On a Formal Model of Safe and Scalable Self-driving Cars. arXiv:1708.06374 [cs, stat], October 2018.
[Simonyan and Zisserman, 2015] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proc. ICLR 2015, 2015.
[Stallkamp et al., 2011] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, pages 1453–1460, July 2011.
[Tan and Le, 2019] Mingxing Tan and Quoc Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proc. ICML 2019, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. PMLR, June 2019.
[Timofte et al., 2014] Radu Timofte, Karel Zimmermann, and Luc Van Gool. Multi-view traffic sign detection, recognition, and 3D localisation. Machine Vision and Applications, 25(3):633–647, April 2014.
[Weiss et al., 2018] Gereon Weiss, Philipp Schleiss, Daniel Schneider, and Mario Trapp. Towards integrating undependable self-adaptive systems in safety-critical environments. In Proc. SEAMS 2018, pages 26–32. ACM, May 2018.
[Willers et al., 2020] Oliver Willers, Sebastian Sudholt, Shervin Raafatnia, and Stephanie Abrecht. Safety Concerns and Mitigation Approaches Regarding the Use of Deep Learning in Safety-Critical Perception Tasks. arXiv:2001.08001 [cs, stat], January 2020.