CEUR Workshop Proceedings Vol-2640, paper 18: Is Uncertainty Quantification in Deep Learning Sufficient for Out-of-Distribution Detection? (https://ceur-ws.org/Vol-2640/paper_18.pdf)
                    Is Uncertainty Quantification in Deep Learning Sufficient
                               for Out-of-Distribution Detection?

          Adrian Schwaiger, Poulami Sinhamahapatra, Jens Gansloser, Karsten Roscher
                     Fraunhofer IKS, Fraunhofer Institute for Cognitive Systems
                                {firstname.lastname}@iks.fraunhofer.de


                          Abstract

Reliable information about the uncertainty of predictions from deep neural networks could greatly facilitate their utilization in safety-critical applications. Current approaches for uncertainty quantification usually focus on in-distribution data, where a high uncertainty should be assigned to incorrect predictions. In contrast, we focus on out-of-distribution data, where a network cannot make correct predictions and should therefore always report high uncertainty. In this paper, we compare several state-of-the-art uncertainty quantification methods for deep neural networks regarding their ability to detect novel inputs. We evaluate them on image classification tasks with regard to metrics reflecting requirements important for safety-critical applications. Our results show that a portion of out-of-distribution inputs can be detected with reasonable loss in overall accuracy. However, current uncertainty quantification approaches alone are not sufficient for an overall reliable out-of-distribution detection.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   Introduction

Many state-of-the-art methods for solving perceptual tasks are based on Deep Neural Networks (DNNs). However, the lack of interpretability of these networks is still a problem when DNNs are employed in safety-critical applications, e.g., for autonomous driving or in medical diagnosis. In these domains, mistakes are not just a minor annoyance but can have severe consequences. Therefore, thorough safety analysis and argumentation are an integral part of the development of such systems. Unfortunately, the black-box nature of DNNs and the fact that already slight changes in the input can have drastic effects on the output make this task almost impossible for complex DNN-based computer vision pipelines.

One approach to address this problem is quantifying the predictive uncertainty of a DNN for each given input. Reliable uncertainty estimates can be utilized by a safety envelope [Weiss et al., 2018] that encapsulates the high-performance DNN. Whenever the uncertainty of a prediction is high, the result of the DNN is discarded and the prediction of a verified, lower-performance safety path is used instead. In this context, the performance of different Uncertainty Quantification (UQ) approaches for DNNs has already been investigated on In-Distribution (ID) data, i.e., data that is conceptually similar to the data the network has been trained on [Henne et al., 2020]. However, the viability of UQ for detecting Out-of-Distribution (OOD) inputs, i.e., data that differs strongly from the training data, is still an open question. The detection of such inputs is important, as DNNs are not able to provide a correct prediction for them. For instance, a network trained to distinguish between cats and dogs will always output one or the other, and very often with high confidence, even when challenged with an OOD sample, e.g., the image of a car. As it is not feasible to construct a dataset that guarantees the coverage of all relevant concepts in sufficient quantity for open-world scenarios, approaches to detect OOD inputs are important to ensure the safety of the overall system and to detect violations of its operational design domain.

In this paper, we investigate several state-of-the-art methods for UQ in combination with popular DNN architectures for image classification. We use three datasets from different application domains to train the models and apply them to test sets containing in- and out-of-distribution samples. We focus on the trade-off between remaining accuracy and remaining error under the assumption that inputs with uncertain predictions are handled by a fallback mechanism and therefore count toward neither of them. Since the acceptable remaining error or minimal performance may vary from application to application, we highlight the relationship between the two instead of assuming arbitrary limits.

2   Related Work

Machine Learning in Safety-Critical Domains  Arguing the safety of Machine Learning (ML) algorithms for complex tasks still remains an open research question. Insufficiencies of DNNs on perception tasks include, e.g., susceptibility towards distributional shifts and lack of interpretability; general mitigation strategies, e.g., the incorporation of uncertainty and proper specification of the data acquisition process, are discussed in [Willers et al., 2020].
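The safety-envelope pattern from the introduction can be sketched as follows. This is an illustrative sketch only: `predict`, `uncertainty`, and `fallback` are hypothetical stand-ins, not the authors' implementation, and the threshold is application-specific.

```python
def safety_envelope(x, predict, uncertainty, fallback, threshold=0.5):
    """Use the high-performance DNN only when its uncertainty is low;
    otherwise discard its result and defer to a verified safety path."""
    if uncertainty(x) > threshold:
        return fallback(x)   # verified, lower-performance safety path
    return predict(x)        # high-performance DNN prediction
```

With a reliable uncertainty measure, OOD inputs would take the fallback branch; the experiments below probe how well current UQ methods fulfil that assumption.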
One way to argue the safety is by formulating confidence arguments to gather evidence for the performance of an ML system [Burton et al., 2019]. To aid the formulation of such arguments, the authors provide an overview of the most common failure cases and propose assurance claim points to further break down the task. Another direction in the domain of autonomous vehicles to assure safety is the creation of a specification based on formal rules and physical constraints, as is done within RSS [Shalev-Shwartz et al., 2018]. As this approach implicitly requires perfect perception, which in a real-world scenario is unattainable, PURSS [Salay et al., 2020] has been proposed as an extension to allow the integration of perceptual uncertainty into the otherwise rigid specifications.

Interpretability  The lack of interpretability of DNNs is a hindrance for using them in safety-critical applications, as it makes thorough safety analyses almost impossible. One approach to address this problem is the visualization of the learned features and their interplay with each other [Olah et al., 2020]. While this only enables qualitative analyses, the authors suggest that it can aid in gaining a better understanding of DNNs and facilitate other work in this domain. A different direction is the formation of human-understandable features in DNNs. For instance, by specifying desired concepts, a network can be incentivized to learn corresponding features, which in turn can be used for quantitative analyses [Kim et al., 2018].

Verification of DNNs  The ability to verify DNNs would facilitate any safety argumentation greatly. Approaches concerning verification include linear approximations of the learned function in order to subsequently solve them using existing verification tools [Katz et al., 2017]. The problems of scalability and the definition of proper specifications, however, prevent their application to complex perception tasks. Nevertheless, it is an active field of research and promising approaches exist, e.g., for the verification of direct perception utilizing an input property characterizer and an approach to verification based on assumed guarantees [Cheng et al., 2019].

Out-of-Distribution Detection  In real-world machine learning applications, detecting OOD samples in the test data, which indicate a distributional shift from the training data, is paramount. It has been recognized as an important problem for AI safety [Amodei et al., 2016]. Neural network classifiers tend to incorrectly classify OOD samples with high confidence. These high-confidence predictions often result from the softmax function: its probabilities are computed with the fast-growing exponential function, where a minor change in the input can lead to a substantial increase in the output. In this direction, [Hendrycks and Gimpel, 2018] proposed a baseline method to detect OOD samples based on the observation that a well-trained neural network tends to assign higher softmax scores to ID samples than to OOD samples. This approach was further extended in ODIN [Liang et al., 2018] by using temperature scaling in the softmax function [Guo et al., 2017] and adding small controlled perturbations to the inputs such that the softmax score gap between ID and OOD samples is further enlarged. Here, while the network is trained with the default softmax, during the test phase the tempered softmax forces the network to be sure of its decisions. In [DeVries and Taylor, 2018], the authors propose Learned Confidence estimates to classify a sample as ID or OOD by appending a confidence estimation branch to the network. Similarly, Metric Learning [Masana et al., 2018] adds an additional output branch and maps it into a manifold, where the Euclidean distance from such manifolds is used as a measure for detecting possible OOD samples. A probabilistic approach given in [Lee et al., 2018] uses features (lower- and upper-level) from any pre-trained classifier and maps them into class-conditional Gaussian distributions under Gaussian discriminant analysis, which results in a confidence score based on the Mahalanobis distance. Finally, the most popular method of computing probabilistic statistics uses ensembles of predictions of discriminative classifiers trained on ID data, as proposed by [Lakshminarayanan et al., 2017]. It has emerged as a popular non-Bayesian approach for predictive UQ, also used for detecting OOD samples during inference. An alternative direction for approaching the OOD detection problem is the use of generative model-based methods, which are appealing as they do not require labeled data and directly model the input distribution. These methods fit a generative model p(x) to the ID data and then evaluate the likelihood of new inputs under that model, as in [Ren et al., 2019], [Serrà et al., 2019]. Moreover, many self-supervised approaches [Hendrycks et al., 2019], [Mohseni et al., 2020], which also do not need labeled data, have shown promise in OOD detection, often with accuracy comparable to supervised methods.

3   Uncertainty Quantification for OOD Detection

In the previous section, dedicated OOD detection techniques have been presented. However, it is reasonable to investigate the usage of UQ for this task as well. The idea is that a DNN should assign a high uncertainty to OOD inputs, as nothing comparable has been encountered before.

In [Osawa et al., 2019] the authors, i.a., compare Bayesian UQ methods w.r.t. their performance in detecting OOD samples. Although their results are promising, the chosen task is not as complex, because the defined datasets for ID and OOD are very dissimilar. The performance of different uncertainty quantifiers in distinguishing samples from more similar distributions has been investigated in [Pawlowski et al., 2017]. Their findings are promising and also encourage further research in this area.

3.1   Predictive Uncertainty Quantification of DNNs

A common approach to probabilistic UQ for neural networks is to rely on Bayesian methods (e.g., variational Bayes or Markov chain Monte Carlo), where the posterior distribution over the network parameters is computed. However, exact Bayesian inference is usually intractable, thus the posterior can only be computed approximately. Recently, non-Bayesian methods have gained in popularity, as they often allow for simpler implementation and faster training. In this work, we focus on methods for predictive UQ that are fast to train, reasonably easy to implement, and suitable for large-scale problems often seen in image classification tasks.
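The softmax-score baseline and its ODIN-style temperature scaling discussed above can be sketched as follows. This is a minimal numpy illustration: ODIN's input perturbation step is omitted, and the detection threshold is an assumed, application-specific value.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Tempered softmax: T = 1 gives the default softmax; a larger T is
    applied at test time to widen the score gap between ID and OOD inputs."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def is_ood(logits, T=1000.0, threshold=0.9):
    """Flag an input as OOD when its maximum (tempered) softmax score
    falls below a threshold, as in the softmax-score baseline."""
    return softmax(logits, T).max() < threshold
```

Because the exponential grows so quickly, even a confidently wrong prediction yields a near-one maximum score at T = 1; tempering flattens the distribution so that the remaining score differences between ID and OOD inputs become easier to threshold.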
A straightforward approach to UQ is to interpret the classification scores as probabilities, e.g., by applying the softmax function to the prediction scores. However, modern DNNs tend to be not well calibrated, i.e., the predicted probability for an input sample does not represent the true accuracy of the network. This is especially true for DNNs with high model capacity and a lack of regularization [Guo et al., 2017]. One approach for DNN calibration is to learn a scaling of the predicted probabilities using a validation set, where the parameters of the DNN are fixed.

In addition to that, softmax probabilities viewed alone are often overconfident for OOD samples [Gal and Ghahramani, 2016]. Nevertheless, for a given network, ID samples tend to have greater softmax values than OOD samples, which can be used as a baseline for OOD detection [Hendrycks and Gimpel, 2018].

Deep Ensembles  Ensembles of deep neural networks, i.e., deep ensembles, are a well-known method to improve prediction accuracy. However, deep ensembles can also be used as a non-Bayesian uncertainty estimator [Lakshminarayanan et al., 2017]. A number of randomly initialized neural networks are trained independently on the same training data. To compute the predictive distribution, the individual prediction probabilities of all neural networks in the ensemble are averaged. Additionally, [Lakshminarayanan et al., 2017] propose to use proper scoring functions as loss functions and adversarial training to smooth the predictive distributions.

Monte-Carlo Dropout  MC-Dropout can be interpreted as a form of ensembling with shared network parameters or, alternatively, as approximate Bayesian inference [Gal and Ghahramani, 2016]. Usually, dropout is used during training for regularization to prevent overfitting. However, dropout can also be used during inference to estimate the predictive distribution. The empirical predictive mean and variance are calculated from multiple stochastic forward passes, where each forward pass can be seen as sampling from a posterior distribution over the network weights. Since MC-Dropout does not require any change in the network architecture, it is easy to implement and to use with existing architectures.

Learned Confidence  A different, sampling-free approach to estimate uncertainty is proposed in [DeVries and Taylor, 2018], where the network learns an explicit confidence score as a second optimization objective. A confidence layer is added after the last network layer, in parallel to the class prediction layer. The optimization objective is then the sum of the classification loss and the confidence loss.

Evidential Deep Learning  Evidential Deep Learning [Sensoy et al., 2018] is inspired by Dempster-Shafer theory and is another sampling-free approach. For classification tasks, the parameters of a Dirichlet distribution are learned, from which the total evidence for each of the classes and the epistemic uncertainty regarding the prediction as a whole can be calculated. The authors also conducted some experiments regarding OOD detection and showed that their method generally assigned higher uncertainties to OOD inputs.

4   Evaluation

In the following, the previously presented UQ methods, Deep Ensembles (DE), Monte-Carlo Dropout (MCDO), Learned Confidence (LC), and Evidential Deep Learning (EDL), are compared to each other and to the default softmax confidences, which serve as a baseline. The task, hereby, is to classify images correctly and confidently.

4.1   Experimental Setup

To provide a comprehensive comparison, we trained each of the UQ methods on three different model architectures: VGG16 [Simonyan and Zisserman, 2015] as a standard network architecture, SqueezeNet [Iandola et al., 2016] for its small size and suitability for embedded systems, and the recently introduced EfficientNet [Tan and Le, 2019] as a high-performing and efficient architecture. The model variant B0 of EfficientNet was adopted for our use cases. All models use dropout regularization to allow the application of MCDO. Each deep ensemble consists of 5 networks, and the number of sampling steps for MCDO has been set to 50. Increasing the number of members or sampling steps further led only to minor improvements. For LC, the last dense layer of each model is replaced by a prediction and a confidence branch, which are then concatenated again to form the final prediction, as in [DeVries and Taylor, 2018]. Additionally, we set the hyperparameters for the loss function of LC to λ = 0.1 and β = 0.3, which generally showed the best results in our experiments. For EDL, using softplus as the evidence function in combination with the expected cross-entropy loss employing the digamma function, as described in [Sensoy et al., 2018], yielded the best results and is used in all experiments presented in this paper.

As training datasets we used CIFAR-10, the German Traffic Sign Recognition Benchmark (GTSRB) [Stallkamp et al., 2011], and NWPU-RESISC45 [Cheng et al., 2017]. CIFAR-10 contains small images separated into 10 different classes, e.g., automobile, truck, or dog. GTSRB is a collection of German traffic signs; the number of classes amounts to 43. NWPU-RESISC45 has larger aerial images which are categorized into 45 different classes, e.g., forest, freeway, or railway station. Additionally, we used images from CIFAR-100 as OOD samples for CIFAR-10 and Belgium Traffic Signs (BTSRB) [Timofte et al., 2014] as OOD samples for GTSRB. While CIFAR-100 and CIFAR-10 already have distinct classes, for BTSRB we only included classes that had no equivalent in GTSRB. As we found no suitable OOD datasets for NWPU-RESISC45, we split it into two datasets. The OOD dataset includes 9 classes: airplane, airport, beach, harbor, island, lake, river, sea ice, and ship. These are semantically separated from the remaining 36 classes used as the ID dataset. Overall, the ID and OOD dataset pairs are quite similar to each other, which makes the task of OOD detection more difficult. This was done purposefully, as it transfers better to safety-critical applications, where OOD inputs coming from the exact same sensor in similar environments must be detected.

We trained the models from scratch using random initializations and used Adam as the optimizer. Early stopping has been applied if the validation loss did not change for several epochs, to prevent overfitting whilst ensuring fully trained networks. Augmentations have not been applied, to rule out potential side effects introduced by the specific configuration used.
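The predictive distributions of DE and MCDO described above are both obtained by averaging member predictions (5 ensemble members, or 50 stochastic dropout forward passes, in our setup). A minimal numpy sketch of this averaging and of an entropy-based uncertainty score, not the training code:

```python
import numpy as np

def predictive_distribution(member_probs):
    """Average per-member class probabilities (ensemble members for DE,
    stochastic dropout forward passes for MCDO) into one prediction."""
    return np.mean(np.asarray(member_probs, dtype=float), axis=0)

def predictive_entropy(probs):
    """Entropy of the averaged distribution, usable as an uncertainty
    score: disagreeing members yield a flat average and high entropy."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())
```

Two members that confidently predict opposite classes average to a uniform distribution, i.e., maximal entropy; this disagreement on unfamiliar inputs is what makes the averaged prediction useful for OOD detection.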
4.2   Evaluation Metrics

Following the cue of our previous work in [Henne et al., 2020], similar evaluation metrics have been used in this paper: maximizing the Remaining Accuracy Rate (RAR) while minimizing the Remaining Error Rate (RER). RAR takes into account the number of samples which have been correctly classified by the classifier as well as declared confident ("certain" and "correct") for a given threshold by the respective UQ method. RER, on the other hand, is the fraction of inputs that are classified incorrectly but with a high confidence ("certain" and "incorrect").

All trained networks were evaluated first on a test set with only ID data. Subsequently, the same model was tested on a second test set where OOD samples corresponding to 17.65% of the size of the ID data were added, to obtain a dataset with 85% ID and 15% OOD samples. The amount of OOD samples was chosen arbitrarily to improve the visual presentation of the plots. However, it has no impact on the overall observations, since we focus on the relative performance between best and worst case.

4.3   Results and Discussion

Remaining Accuracy and Error

The results are shown in Figure 1. Due to space restrictions, the graphs for VGG16 could not be included. Each curve consists of the RAR and RER plotted for each threshold t ∈ [0; 1] with a sampling step size of 0.001. The blue curves represent the performance on the ID dataset, the green curves show the performance on the dataset with combined ID and OOD samples. Furthermore, the green curves have been normalized regarding the RAR by the amount of OOD samples. Thereby, the influence on the accuracy due to the additional OOD samples is eliminated and only the error introduced by them is factored in. For one, this better represents the application case, as the DNNs are not supposed to classify OOD samples correctly and only have to detect them. Second, due to the normalization, the behavior regarding the OOD detection can be interpreted better visually. Given a perfect OOD detection method, both curves would be the same, as all OOD samples would be rejected. The black curves show the worst case, i.e., if none of the OOD samples are rejected. They have also been normalized like the green curves.

On GTSRB, DE can detect most of the OOD samples with a minor loss in accuracy. Although SqueezeNet has a slightly lower base accuracy, it is slightly better at rejecting OOD samples. For GTSRB, this also holds for MCDO and softmax. EDL, on the other hand, shows a better OOD discrimination ability on the other two architectures for higher RER, but can reduce the error almost completely with the highest accuracy left. LC performs sub-par with SqueezeNet, which might be due to the low number of parameters, as we already noticed in [Henne et al., 2020].

On CIFAR-10 using EfficientNet, all but DE perform equally with only minor differences. An exception to this are softmax and MCDO, which for an RER of < 3.5% drop in a straight line, suggesting that there are no thresholds which can produce error rates in that range. For error rates < 0.5%, all but softmax show the same accuracy. Upon further investigation, we noticed the distributions of classes among the undetected OOD samples were similar, hinting towards samples that are universally hard to reject. For SqueezeNet, DE shows significantly the best performance, followed by EDL. Softmax and MCDO perform equally, and LC again performs the worst with this architecture. Using VGG16, DE still outperforms the other methods, but the difference is much less significant. EDL and MCDO perform more or less equally, with EDL being slightly better for high RER and MCDO being slightly better for really low RER. LC and softmax also show similar performance. Softmax again is not able to produce different RER in lower ranges; however, this can mostly be attributed to the ID samples.

On NWPU-RESISC45, DE performs best for EfficientNet in terms of maximum RAR achieved. Next, softmax and MCDO behave similarly but with a slight decrease in RAR. For RER < 5%, both of these methods show a slight kink in the curve, showing their sensitivity to a certain range of thresholds. But even in this range, DE clearly achieves much better RAR at the cost of < 1% RER. LC comes close to 80% RAR, but at the cost of much higher RER. Finally, EDL performs similarly to the others for RER < 5%, but is vastly outperformed in terms of overall RAR. For SqueezeNet, DE again performs best, followed by MCDO, EDL, and softmax in close proximity. Nonetheless, LC, as pointed out earlier, performs the worst with this architecture. For VGG16, all the methods perform sub-optimally in comparison to the other architectures, with a maximum RAR of nearly 75% achieved by DE. Similar to the trend above, DE is followed by EDL with comparable RAR for RER < 5%. Softmax and MCDO follow them, but spread over a larger RER range. LC again is not able to produce RER in lower ranges and has a much larger RER compared to the similar RAR achieved by the other UQ methods.

Based on the observations above, DE performs best across all methods and datasets. LC had originally been proposed as an OOD detection method rather than a UQ method. However, LC has shown consistently sub-optimal performance in almost all the scenarios above, particularly with smaller architectures like SqueezeNet or higher-resolution datasets like NWPU-RESISC45. MCDO and softmax perform averagely in most cases. Most UQ methods, including EDL, tend to have quite competent RAR in lower RER ranges, but on initial investigation it has also been observed that there always exist some harder sample categories which are almost too difficult for most UQ methods to certainly reject.

Quality of Uncertainty Estimation

To further assess the novelty detection capabilities of the methods, we show the ratio of inputs marked as uncertain for a given threshold. We thereby show the comparison for the three possible cases: ID inputs predicted correctly, ID inputs predicted incorrectly, and OOD inputs. Corresponding to each of the three cases, we plot, over all thresholds, the fraction of samples having high uncertainty. An ideal method, for some given threshold, is certain for all correct predictions and
[Figure 1: RAR plotted over RER for each UQ method (Softmax, Deep Ensemble, Monte Carlo Dropout, Evidential Deep Learning, Learned Confidence) on GTSRB, CIFAR-10, and NWPU-RESISC45, using the EfficientNet and SqueezeNet architectures.]
                                                                  0.0   0.1   0.2    0.3
                                                                                       0.60.0   0.1   0.2     0.3
                                                                                                               0.8 0.0   0.1   0.2   0.3
                                                                                                                                     1.0
                                                                           RER
                                                  ID             Normalized ID+OOD              Lower bound

Figure 1: Remaining Error Rate (RER) vs. Remaining Accuracy Rate (RAR) for EfficientNet and SqueezeNet on the GTSRB, CIFAR-10, and
NWPU-RESISC45 datasets. The plots show the performance first on the ID dataset (blue), then on a dataset consisting of ID and OOD
samples (green). The lower bound (black) represents the worst-case scenario in which the network fails to reject any of the OOD samples.
[Figure 2 plot grid: Uncertainty Ratio (y-axis) vs. Confidence Threshold (x-axis) for Softmax, Monte-Carlo Dropout, Deep Ensembles, Evidential Deep Learning, and Learned Confidence; legend: correct, incorrect, ood]

Figure 2: The ratio of inputs marked as uncertain for the three cases — correctly classified, incorrectly classified, and OOD inputs — over
the range of thresholds for EfficientNet on CIFAR-10.
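The curves of Figure 2 are the per-category fractions of inputs flagged as uncertain (confidence below the threshold) at each threshold. A minimal sketch, in which the function name and category encoding are our own assumptions:

```python
import numpy as np

def uncertainty_ratios(confidences, labels, thresholds):
    """For each category ('correct', 'incorrect', 'ood'), return the
    fraction of its inputs whose confidence falls below each threshold."""
    conf = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels)
    curves = {}
    for cat in ("correct", "incorrect", "ood"):
        cat_conf = conf[labels == cat]
        curves[cat] = np.array([np.mean(cat_conf < t) for t in thresholds])
    return curves
```

A well-behaved UQ method would push the 'incorrect' and 'ood' curves toward 1 at low thresholds while keeping the 'correct' curve near 0.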

[Figure 3 plot row: RER (x-axis) vs. RAR (y-axis) for Deep Ensembles on GTSRB, CIFAR-10, and NWPU-RESISC45; legend: EfficientNet, SqueezeNet, VGG16]

Figure 3: Remaining Error Rate (RER) vs. Remaining Accuracy Rate (RAR) for Deep Ensembles on the normalized ID+OOD dataset trained
on GTSRB, CIFAR-10 and NWPU-RESISC45 with EfficientNet, SqueezeNet and VGG16.


uncertain for all incorrect predictions as well as predictions for OOD inputs. In Figure 2, the uncertainty ratios are shown only for CIFAR-10 and EfficientNet, but we observe the same findings in our other considered configurations. Most interestingly, the curves for incorrectly classified and OOD inputs match very closely. This raises the question: how correlated are these two categories, and will better UQ methods be able to better detect novel inputs? This could be subject for future research. The plots again indicate that very low error rates can only be achieved at the cost of sacrificing a lot of accuracy. It is also worth mentioning that EDL and LC exhibit a smoother behavior over the range of thresholds, especially compared to Softmax and MCDO, and are therefore less sensitive towards small changes in the choice of a threshold.

Influence of Model Architecture
While the choice of architecture is important for the performance with respect to accuracy, its influence on the OOD detection ability is not as significant, visually represented by how closely the blue and green curves match. An exception to this are some configurations with LC, especially with SqueezeNet. On CIFAR-10, all architectures perform mostly the same regarding their novelty detection ability, and on the easier GTSRB dataset SqueezeNet has a slight edge. For NWPU-RESISC45, VGG16 rejects OOD samples slightly better; however, its baseline accuracy is about 20% lower for all UQ methods. Figure 3 shows the overall performance of DE for all architectures on the combined ID + OOD datasets.

5    Conclusion and Future Work
In this paper, we investigated the question of whether uncertainty quantification is sufficient for detecting out-of-distribution inputs. To that end, we applied different state-of-the-art methods and network architectures to three image classification tasks. While all tested UQ methods assign high uncertainty to some of the OOD samples, their rejection capabilities will not suffice for most safety-critical applications,
especially considering that in the real world even more difficult OOD inputs can occur. If UQ should be applied, deep ensembles consistently showed the best trade-off between performance and remaining error, but mostly due to their better accuracy baseline to begin with.
   A closer look at our results revealed that in many cases all methods fail on OOD inputs from the same classes. This hints at the possibility that certain OOD inputs are conceptually harder (or even impossible) to identify, either by UQ methods or in general. However, further research is needed to provide more evidence. In addition, many novelty detection approaches have been proposed in recent years. It would be interesting to see how they perform compared to the UQ methods presented here. Furthermore, their error patterns may provide additional insights into the difficulties of OOD detection in general.
   Additionally, it is worthwhile investigating whether our findings also transfer to other tasks, e.g., object detection or instance segmentation, and to other types of input data, for instance, radar or lidar point clouds. While there are similar base components at play (object detectors even use the investigated networks as feature extractors), the transferability of our results is not guaranteed.

Acknowledgments
This work was partially supported by the Bavarian Ministry of Economic Affairs, Regional Development and Energy through the Center for Analytics-Data-Applications (ADA-Center) within the framework of "BAYERN DIGITAL II" and within the Intel Collaborative Research Institute - Safe Automated Vehicles.

References
[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete Problems in AI Safety. arXiv:1606.06565 [cs], July 2016.
[Burton et al., 2019] Simon Burton, Lydia Gauerhof, Bibhuti Bhusan Sethy, Ibrahim Habli, and Richard Hawkins. Confidence Arguments for Evidence of Performance in Machine Learning for Highly Automated Driving Functions. In Computer Safety, Reliability, and Security, LNCS, pages 365–377, Cham, 2019. Springer International Publishing.
[Cheng et al., 2017] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE, 105(10):1865–1883, October 2017.
[Cheng et al., 2019] Chih-Hong Cheng, Chung-Hao Huang, Thomas Brunner, and Vahid Hashemi. Towards Safety Verification of Direct Perception Neural Networks. arXiv:1904.04706 [cs], November 2019.
[DeVries and Taylor, 2018] Terrance DeVries and Graham W. Taylor. Learning Confidence for Out-of-Distribution Detection in Neural Networks. arXiv:1802.04865 [cs, stat], February 2018.
[Gal and Ghahramani, 2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proc. ICML 2016, volume 48, pages 1050–1059. PMLR, June 2016.
[Guo et al., 2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern Neural Networks. In Proc. ICML 2017, pages 1321–1330. JMLR.org, August 2017.
[Hendrycks and Gimpel, 2018] Dan Hendrycks and Kevin Gimpel. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In Proc. ICML 2017. JMLR.org, October 2018.
[Hendrycks et al., 2019] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty. In Advances in Neural Information Processing Systems 32, pages 15663–15674. Curran Associates, Inc., October 2019.
[Henne et al., 2020] Maximilian Henne, Adrian Schwaiger, Karsten Roscher, and Gereon Weiss. Benchmarking Uncertainty Estimation Methods for Deep Learning With Safety-Related Metrics. In Proc. SafeAI@AAAI 2020, volume 2560 of CEUR Workshop Proceedings, pages 83–90, 2020.
[Iandola et al., 2016] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360, 2016.
[Katz et al., 2017] Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification, LNCS, pages 97–117, Cham, 2017. Springer International Publishing.
[Kim et al., 2018] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In Proc. ICML 2018, pages 2668–2677, July 2018.
[Lakshminarayanan et al., 2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. In Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.
[Lee et al., 2018] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. arXiv:1807.03888 [cs, stat], October 2018.
[Liang et al., 2018] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. arXiv:1706.02690 [cs, stat], February 2018.
[Masana et al., 2018] Marc Masana, Idoia Ruiz, Joan Serrat, Joost van de Weijer, and Antonio M. Lopez. Metric Learning for Novelty and Anomaly Detection. In Proc. BMVC 2018, August 2018.
[Mohseni et al., 2020] Sina Mohseni, Mandar Pitale, JBS Yadawa, and Zhangyang Wang. Self-Supervised Learning for Generalizable Out-of-Distribution Detection. In Proc. AAAI 2020, 2020.
[Olah et al., 2020] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom In: An Introduction to Circuits. Distill, 5(3), March 2020.
[Osawa et al., 2019] Kazuki Osawa, Siddharth Swaroop, Mohammad Emtiyaz E Khan, Anirudh Jain, Runa Eschenhagen, Richard E Turner, and Rio Yokota. Practical Deep Learning with Bayesian Principles. In Advances in Neural Information Processing Systems 32, pages 4287–4299. Curran Associates, Inc., 2019.
[Pawlowski et al., 2017] Nick Pawlowski, Miguel Jaques, and Ben Glocker. Efficient variational Bayesian neural network ensembles for outlier detection. In Proc. ICLR 2017. OpenReview.net, 2017.
[Ren et al., 2019] Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood Ratios for Out-of-Distribution Detection. In Advances in Neural Information Processing Systems 32, pages 14707–14718. Curran Associates, Inc., 2019.
[Salay et al., 2020] Rick Salay, Krzysztof Czarnecki, Maria Soledad Elli, Ignacio J. Alvarez, Sean Sedwards, and Jack Weast. PURSS: Towards Perceptual Uncertainty Aware Responsibility Sensitive Safety with ML. In Proc. SafeAI@AAAI 2020, volume 2560 of CEUR Workshop Proceedings, pages 91–95, 2020.
[Sensoy et al., 2018] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential Deep Learning to Quantify Classification Uncertainty. In Advances in Neural Information Processing Systems 31, pages 3179–3189. Curran Associates, Inc., 2018.
[Serrà et al., 2019] Joan Serrà, David Álvarez, Vicenç Gómez, Olga Slizovskaia, José F. Núñez, and Jordi Luque. Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models. In Proc. ICLR 2020, September 2019.
[Shalev-Shwartz et al., 2018] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. On a Formal Model of Safe and Scalable Self-driving Cars. arXiv:1708.06374 [cs, stat], October 2018.
[Simonyan and Zisserman, 2015] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proc. ICLR 2015, 2015.
[Stallkamp et al., 2011] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, pages 1453–1460, July 2011.
[Tan and Le, 2019] Mingxing Tan and Quoc Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proc. ICML 2019, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. PMLR, June 2019.
[Timofte et al., 2014] Radu Timofte, Karel Zimmermann, and Luc Van Gool. Multi-view traffic sign detection, recognition, and 3D localisation. Machine Vision and Applications, 25(3):633–647, April 2014.
[Weiss et al., 2018] Gereon Weiss, Philipp Schleiss, Daniel Schneider, and Mario Trapp. Towards integrating undependable self-adaptive systems in safety-critical environments. In Proc. SEAMS 2018, pages 26–32. ACM, May 2018.
[Willers et al., 2020] Oliver Willers, Sebastian Sudholt, Shervin Raafatnia, and Stephanie Abrecht. Safety Concerns and Mitigation Approaches Regarding the Use of Deep Learning in Safety-Critical Perception Tasks. arXiv:2001.08001 [cs, stat], January 2020.