Is Uncertainty Quantification in Deep Learning Sufficient for Out-of-Distribution Detection?

Adrian Schwaiger, Poulami Sinhamahapatra, Jens Gansloser, Karsten Roscher
Fraunhofer IKS, Fraunhofer Institute for Cognitive Systems
{firstname.lastname}@iks.fraunhofer.de

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Reliable information about the uncertainty of predictions from deep neural networks could greatly facilitate their utilization in safety-critical applications. Current approaches for uncertainty quantification usually focus on in-distribution data, where a high uncertainty should be assigned to incorrect predictions. In contrast, we focus on out-of-distribution data where a network cannot make correct predictions and therefore should always report high uncertainty. In this paper, we compare several state-of-the-art uncertainty quantification methods for deep neural networks regarding their ability to detect novel inputs. We evaluate them on image classification tasks with regard to metrics reflecting requirements important for safety-critical applications. Our results show that a portion of out-of-distribution inputs can be detected with reasonable loss in overall accuracy. However, current uncertainty quantification approaches alone are not sufficient for an overall reliable out-of-distribution detection.

1 Introduction

Many state-of-the-art methods for solving perceptual tasks are based on Deep Neural Networks (DNNs). However, the lack of interpretability of these networks is still a problem when DNNs are employed in safety-critical applications, e.g., for autonomous driving or in medical diagnosis. In these domains, mistakes are not just a minor annoyance but can have severe consequences. Therefore, thorough safety analysis and argumentation are an integral part of the development of such systems. Unfortunately, the black-box nature of DNNs and the fact that even slight changes in the input can have drastic effects on the output make this task almost impossible for complex DNN-based computer vision pipelines.

One approach to address this problem is quantifying the predictive uncertainty of a DNN for each given input. Reliable uncertainty estimates can be utilized by a safety-envelope [Weiss et al., 2018] that encapsulates the high-performance DNN. Whenever the uncertainty of a prediction is high, the result of the DNN is discarded and the prediction of a verified, lower-performance safety path is used instead. In this context, the performance of different Uncertainty Quantification (UQ) approaches for DNNs has already been investigated on In-Distribution (ID) data, i.e., data that is conceptually similar to the data the network has been trained on [Henne et al., 2020]. However, the viability of UQ to detect Out-of-Distribution (OOD) inputs, i.e., data that differs strongly from the training data, is still an open question. The detection of such inputs is important, as DNNs are not able to provide a correct prediction for them. For instance, a network trained to distinguish between cats and dogs will always output one or the other, and very often with high confidence, even when challenged with an OOD sample, e.g., with the image of a car. As it is not feasible to construct a dataset that guarantees the coverage of all relevant concepts in sufficient quantity for open world scenarios, approaches to detect OOD inputs are important to ensure the safety of the overall system and to detect violations of its operational design domain.

In this paper, we investigate several state-of-the-art methods for UQ in combination with popular DNN architectures for image classification. We use three datasets from different application domains to train the models and apply them to test sets containing in- and out-of-distribution samples. We focus on the trade-off between remaining accuracy and remaining error under the assumption that inputs with uncertain predictions are handled by a fallback mechanism and therefore count towards neither of them. Since the acceptable remaining error or minimal performance may vary from application to application, we highlight the relationship between the two instead of assuming arbitrary limits.

2 Related Work

Machine Learning in Safety-Critical Domains
Arguing the safety of Machine Learning (ML) algorithms for complex tasks still remains an open research question. Insufficiencies of DNNs on perception tasks, e.g., susceptibility towards distributional shifts and lack of interpretability, as well as general mitigation strategies, e.g., the incorporation of uncertainty and proper specification of the data acquisition process, are discussed in [Willers et al., 2020]. One way to argue the safety is by formulating confidence arguments to gather evidence for the performance of an ML system [Burton et al., 2019]. To aid the formulation of such arguments, the authors provide an overview of the most common failure cases and propose assurance claim points to further break down the task. Another direction in the domain of autonomous vehicles to assure the safety is the creation of a specification based on formal rules and physical constraints, as it is done within RSS [Shalev-Shwartz et al., 2018]. As this approach implicitly requires perfect perception, which in a real-world scenario is unattainable, PURSS [Salay et al., 2020] has been proposed as an extension to allow the integration of perceptual uncertainty into the otherwise rigid specifications.

Interpretability
The lack of interpretability of DNNs is a hindrance for using them in safety-critical applications, as it makes thorough safety analyses almost impossible. One approach to address this problem is the visualization of the learned features and their interplay with each other [Olah et al., 2020]. While this only enables qualitative analyses, the authors suggest that it can aid in gaining a better understanding of DNNs and facilitate other work in this domain. A different direction is the formation of human-understandable features in DNNs. For instance, by specifying desired concepts a network can be incentivized to learn corresponding features which in turn can be used for quantitative analyses [Kim et al., 2018].

Verification of DNNs
The ability to verify DNNs would facilitate any safety argumentation greatly. Approaches concerning the verification include linear approximations of the learned function in order to subsequently solve them using existing verification tools [Katz et al., 2017]. The problems of scalability and the definition of proper specifications, however, prevent their application to complex perception tasks. Nevertheless, it is an active field of research and promising approaches exist, e.g., for the verification of direct perception utilizing an input property characterizer and an approach to verification based on assumed guarantees [Cheng et al., 2019].

Out-of-Distribution Detection
In real-world machine learning applications, detecting OOD samples in the test data, which basically indicate a distributional shift from the training data, is of paramount importance. It has been recognized as an important problem for AI safety [Amodei et al., 2016]. Neural network classifiers tend to incorrectly classify OOD samples with high confidence. The high-confidence predictions are often a result of the softmax function, since its probabilities are computed with the fast-growing exponential function, where a minor addition to the input can lead to a substantial increase in the output. In this direction, [Hendrycks and Gimpel, 2018] proposed a baseline method to detect OOD samples based on the observation that a well-trained neural network tends to assign higher softmax scores to ID samples than to OOD samples. This approach was further extended in ODIN [Liang et al., 2018] by using temperature scaling in the softmax function [Guo et al., 2017], and adding small controlled perturbations to inputs such that the softmax score gap between ID and OOD samples is further enlarged. Here, while the network is trained with the default softmax, during the test phase the tempered softmax forces the network to be sure of its decisions. In [DeVries and Taylor, 2018], the authors propose Learned Confidence estimates to classify a sample as ID or OOD by appending a confidence estimation branch to the network. Similar to this, Metric Learning [Masana et al., 2018] adds an additional output branch and maps it into a manifold where the Euclidean distance from such manifolds is used as a measure for detecting possible OOD samples. A probabilistic approach given in [Lee et al., 2018] uses features (lower and upper level) from any pre-trained classifier and maps them into class conditional Gaussian distributions under Gaussian discriminant analysis, which results in a confidence score based on the Mahalanobis distance. Finally, the most popular method of computing probabilistic statistics uses ensembles of predictions of discriminative classifiers trained on ID data, as proposed by [Lakshminarayanan et al., 2017]. It has emerged as a popular non-Bayesian approach for predictive UQ, also used for detecting OOD samples during inference. An alternative direction of approaching the OOD detection problem is the use of generative model-based methods, which are appealing as they do not require labeled data and directly model the input distribution. These methods fit a generative model p(x) to the ID data, and then evaluate the likelihood of new OOD inputs under that model as in [Ren et al., 2019], [Serrà et al., 2019]. Moreover, many self-supervised approaches [Hendrycks et al., 2019], [Mohseni et al., 2020], which also do not need labeled data, have shown promise in OOD detection, often with accuracy comparable to supervised methods.

3 Uncertainty Quantification for OOD Detection

In the previous section, dedicated OOD detection techniques have been presented. However, it is reasonable to investigate the usage of UQ for this task as well. The idea is that a DNN should assign a high uncertainty to OOD inputs, as nothing comparable has been encountered before.

In [Osawa et al., 2019] the authors, among other things, compare Bayesian UQ methods with respect to their performance in detecting OOD samples. Although their results are promising, the chosen task is not as complex, because the defined datasets for ID and OOD are very dissimilar. The performance of different uncertainty quantifiers in distinguishing samples from more similar distributions has been investigated in [Pawlowski et al., 2017]. Their findings are promising and also encourage further research in that area.

3.1 Predictive Uncertainty Quantification of DNNs

A common approach to probabilistic UQ for neural networks is to rely on Bayesian methods (e.g., variational Bayes or Markov chain Monte Carlo), where the posterior distribution over the network parameters is computed. However, exact Bayesian inference is usually intractable, thus the posterior can only be computed approximately. Recently, non-Bayesian methods gained in popularity, which often allow for simpler implementation and faster training. In this work, we focus on methods for predictive UQ that are fast to train, reasonably easy to implement and suitable for large-scale problems often seen in image classification tasks.

A straightforward approach to UQ is to interpret the classification scores as probabilities, e.g., by applying the softmax function to the prediction scores. However, modern DNNs tend to be not well calibrated, i.e., the predicted probability for an input sample does not represent the true accuracy of the network. This is especially true for DNNs with high model capacity and lack of regularization [Guo et al., 2017]. One approach for DNN calibration is to learn a scaling of the predicted probabilities using a validation set, where the parameters of the DNN are fixed.

In addition to that, softmax probabilities viewed alone are often overconfident for OOD samples [Gal and Ghahramani, 2016]. Nevertheless, for a given network ID samples tend to have greater softmax values than OOD samples, which can be used as a baseline for OOD detection [Hendrycks and Gimpel, 2018].

Deep Ensembles
Ensembles of deep neural networks, i.e., deep ensembles, are a well-known method to improve prediction accuracy. However, deep ensembles can also be used as a non-Bayesian uncertainty estimator [Lakshminarayanan et al., 2017]. A number of randomly initialized neural networks are trained independently on the same training data. To compute the predictive distribution, the individual prediction probabilities of all neural networks in the ensemble are averaged. Additionally, [Lakshminarayanan et al., 2017] propose to use proper scoring functions as loss functions and adversarial training to smooth the predictive distributions.

Monte-Carlo Dropout
MC-Dropout can be interpreted as a form of ensembles with shared network parameters or, alternatively, as approximate Bayesian inference [Gal and Ghahramani, 2016]. Usually, dropout is used during training for regularization to prevent overfitting. However, dropout can also be used during inference to estimate the predictive distribution. The empirical predictive mean and variance are calculated from multiple stochastic forward passes, where each forward pass can be seen as sampling from a posterior distribution over the network weights. Since MC-Dropout does not require any change in the network architecture, it is easy to implement and to use with existing architectures.

Learned Confidence
A different, sampling-free approach to estimate uncertainty is proposed in [DeVries and Taylor, 2018], where the network learns an explicit confidence score as a second optimization objective. A confidence layer is added after the last network layer, in parallel to the class prediction layer. The optimization objective is then the sum of the classification loss and the confidence loss.

Evidential Deep Learning
Evidential Deep Learning [Sensoy et al., 2018] is inspired by the Dempster-Shafer theory and is another sampling-free approach. For classification tasks the parameters of a Dirichlet distribution are learned, from which the total evidence for each of the classes and the epistemic uncertainty regarding the prediction as a whole can be calculated. The authors also conducted some experiments regarding OOD detection and showed that their method generally assigned higher uncertainties to OOD inputs.

4 Evaluation

In the following, the previously presented UQ methods, Deep Ensembles (DE), Monte-Carlo Dropout (MCDO), Learned Confidence (LC), and Evidential Deep Learning (EDL), are compared to each other and to the default softmax confidences, which serve as a baseline. The task, hereby, is to classify images correctly and confidently.

4.1 Experimental Setup

To provide a comprehensive comparison, we trained each of the UQ methods on three different model architectures: VGG16 [Simonyan and Zisserman, 2015] as a standard network architecture, SqueezeNet [Iandola et al., 2016] for its small size and suitability for embedded systems, and the recently introduced EfficientNet [Tan and Le, 2019] as a high-performing and efficient architecture. The model variant B0 for EfficientNet was adopted for our use-cases. All models use dropout regularization to allow the application of MCDO. Each deep ensemble consists of 5 networks and the number of sampling steps for MCDO has been set to 50. Increasing the number of members or sampling steps further led only to minor improvements. For LC the last dense layer of each model is replaced by a prediction and a confidence branch, which then are concatenated again to form the final prediction, as in [DeVries and Taylor, 2018]. Additionally, we set the hyperparameters for the loss function of LC to λ = 0.1 and β = 0.3, which generally showed the best results in our experiments. For EDL, using softplus as evidence function in combination with the expected cross entropy loss employing the digamma function, as described in [Sensoy et al., 2018], yielded the best results and is used in all experiments presented in this paper.

As training datasets we used CIFAR-10, German Traffic Sign Recognition Benchmark (GTSRB) [Stallkamp et al., 2011], and NWPU-RESISC45 [Cheng et al., 2017]. CIFAR-10 contains small images separated into 10 different classes, e.g., automobile, truck or dog. GTSRB is a collection of German traffic signs with 43 classes. NWPU-RESISC45 has larger aerial images which are categorized into 45 different classes, e.g., forest, freeway or railway station. Additionally, we used images from CIFAR-100 as OOD samples for CIFAR-10 and Belgium Traffic Signs (BTSRB) [Timofte et al., 2014] as OOD samples for GTSRB. While CIFAR-100 and CIFAR-10 already have distinct classes, for BTSRB we only included classes that had no equivalent in GTSRB. As we found no suitable OOD datasets for NWPU-RESISC45, we split it into two datasets. The OOD dataset includes 9 classes: airplane, airport, beach, harbor, island, lake, river, sea ice, and ship. These are semantically separated from the remaining 36 classes used as ID dataset. Overall, the ID and OOD dataset pairs are quite similar to each other, which makes the task of OOD detection more difficult. This was done purposefully, as it transfers better to safety-critical applications, where OOD inputs that come from the exact same sensor in similar environments must be detected.

We trained the models from scratch using random initializations and used Adam as optimizer. Early stopping has been applied if the validation loss did not change for several epochs to prevent overfitting whilst ensuring fully trained networks. Augmentations have not been applied, to rule out potential side effects introduced by the specific configuration used.

4.2 Evaluation Metrics

Following our previous work in [Henne et al., 2020], similar evaluation metrics have been used in this paper. The goal is to maximize the Remaining Accuracy Rate (RAR) while minimizing the Remaining Error Rate (RER). RAR is the fraction of inputs that have been correctly classified by the classifier and declared confident (“certain” and “correct”) for a given threshold by the respective UQ method. RER, on the other hand, is the fraction of inputs that is classified incorrectly but with a high confidence (“certain” and “incorrect”).

All trained networks were evaluated first on a test set with only ID data. Subsequently, the same model was tested on a second test set where OOD samples corresponding to 17.65% of the size of the ID data were added, to obtain a dataset with 85% ID and 15% OOD samples. The amount of OOD samples was chosen arbitrarily to improve the visual presentation of the plots. However, it has no impact on the overall observations since we focus on the relative performance between best and worst case.

4.3 Results and Discussion

Remaining Accuracy and Error
The results are shown in Figure 1. Due to space restrictions, the graphs for VGG16 could not be included. Each curve consists of the RAR and RER plotted for each threshold t ∈ [0; 1] with a sampling step size of 0.001. The blue curves represent the performance on the ID dataset, the green curves show the performance on the dataset with combined ID and OOD samples. Furthermore, the green curves have been normalized regarding the RAR by the amount of OOD samples. Thereby, the influence on the accuracy due to additional OOD samples is eliminated and only the error introduced by them is factored in. For one, this better represents the application case, as the DNNs are not supposed to classify OOD samples correctly and only have to detect them. Second, due to the normalization, the behavior regarding the OOD detection can be better interpreted visually. Given a perfect OOD detection method, both curves would be the same, as all OOD samples would be rejected. The black curves show the worst case, i.e., if none of the OOD samples are rejected. They also have been normalized like the green curves.

On GTSRB, DE can detect most of the OOD samples, with a minor loss in accuracy. Although SqueezeNet has a slightly lower base accuracy, it is slightly better in rejecting OOD samples. For GTSRB, this also holds for MCDO and softmax. EDL on the other hand shows a better OOD discrimination ability in the other two architectures for higher RER, but can reduce the error almost completely with the highest accuracy left. LC performs sub-par with SqueezeNet, which might be due to the low number of parameters, as we already noticed in [Henne et al., 2020].

On CIFAR-10 using EfficientNet, all but DE perform equally with only minor differences. An exception to this are softmax and MCDO, which for an RER of < 3.5% drop in a straight line, suggesting that there are no thresholds which can produce error rates in that range. For error rates < 0.5%, all but softmax show the same accuracy. Upon further investigation, we noticed that the distributions of classes among the undetected OOD samples were similar, hinting towards samples that are universally hard to reject. For SqueezeNet, DE shows by far the best performance, followed by EDL. Softmax and MCDO perform equally and LC again performs the worst using this architecture. Using VGG16, DE still outperforms the other methods but the difference is much less significant. EDL and MCDO more or less perform equally, with EDL being slightly better for high RER and MCDO being slightly better for really low RER. LC and softmax also show similar performance. Softmax again is not able to produce different RER in lower ranges; however, this can mostly be attributed to the ID samples.

On NWPU-RESISC45, DE performs the best for EfficientNet in terms of maximum RAR achieved. Next, softmax and MCDO behave similarly but with a slight decrease in RAR. For RER < 5%, both of these methods show a slight kink in the curve, showing their sensitivity to a certain range of thresholds. But even in this range, DE clearly achieves much better RAR at the cost of < 1% RER. LC reaches close to 80% RAR, but at the cost of much higher RER. Finally, EDL performs similarly to the others for RER < 5%, but is vastly outperformed in terms of overall RAR. For SqueezeNet, DE again performs the best, followed by MCDO, EDL and softmax in close proximity. Nonetheless, LC, as pointed out earlier, performs the worst with this architecture. For VGG16, all the methods perform sub-optimally in comparison to the other architectures, with a maximum RAR of nearly 75% achieved by DE. Similar to the trend above, DE is followed by EDL, which has an RAR comparable to DE for RER < 5%. Softmax and MCDO follow them, but spread over larger RER. LC is again not able to produce RER in lower ranges and has much larger RER compared to similar RAR achieved by other UQ methods.

Based on the observations above, DE performs best across all methods and datasets. LC had been originally proposed as an OOD detection method rather than a UQ method. However, LC has shown consistently sub-optimal performance in almost all the scenarios above, particularly with smaller architectures like SqueezeNet or higher-resolution datasets like NWPU-RESISC45. MCDO and softmax show average performance in most cases. Most UQ methods including EDL tend to have quite competitive RAR for lower RER ranges, but on initial investigation it has also been observed that there always exist some harder sample categories which are almost too difficult to reject with certainty for most UQ methods.

Quality of Uncertainty Estimation
To further assess the novelty detection capabilities of the methods, we show the ratio of inputs marked as uncertain for a given threshold. We, thereby, show the comparison for the three possible cases: ID inputs predicted correctly, ID inputs predicted incorrectly and OOD inputs. Corresponding to each of the three cases we plot, over all thresholds, the fraction of samples having high uncertainty. An ideal method, for some given threshold, is certain for all correct predictions and

[Figure 1, plots not reproduced. Grid of panels with RER on the x-axis and RAR on the y-axis. Columns: Softmax, Deep Ensemble, Monte Carlo Dropout, Evidential Deep Learning, Learned Confidence. Rows: EfficientNet GTSRB, SqueezeNet GTSRB, EfficientNet CIFAR-10, SqueezeNet CIFAR-10, NWPU-RESISC45 EfficientNet, NWPU-RESISC45 SqueezeNet. Legend: ID, Normalized ID+OOD, Lower bound.]
Figure 1: Remaining Error Rate (RER) vs. Remaining Accuracy Rate (RAR) for EfficientNet and SqueezeNet on the GTSRB, CIFAR-10 and NWPU-RESISC45 datasets. The plots show the performance first on the ID dataset (blue), then on the dataset consisting of the ID and OOD samples (green). The lower bound (black) represents the worst-case scenario where the network fails to reject any of the OOD samples.
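As an illustration of how curves like those in Figure 1 can be obtained from the RAR and RER definitions in Section 4.2, the following minimal sketch sweeps a confidence threshold over per-sample predictions. It is not the authors' implementation. The function names `rar_rer` and `rar_rer_curve`, the convention that a sample is "certain" when its confidence is at least the threshold, and the choice to normalize RAR by the number of ID samples only (one reading of the normalization described for the green curves) are all assumptions made for this example.

```python
def rar_rer(confidence, correct, is_ood, threshold):
    """Return (RAR, RER) for one confidence threshold.

    confidence: per-sample confidence in [0, 1]; a sample counts as
                "certain" when confidence >= threshold (assumed convention).
    correct:    True where the predicted class equals the label.
    is_ood:     True for OOD samples, which can never count as correct.
    """
    n_all = len(confidence)
    n_id = sum(1 for ood in is_ood if not ood)
    certain_correct = 0
    certain_wrong = 0
    for conf, ok, ood in zip(confidence, correct, is_ood):
        if conf < threshold:
            continue  # uncertain: handed to the fallback, counts for neither
        if ok and not ood:
            certain_correct += 1
        else:
            certain_wrong += 1  # includes every accepted OOD sample
    # Assumed normalization: RAR relative to the ID samples only, so that
    # adding OOD samples does not lower the attainable accuracy.
    return certain_correct / n_id, certain_wrong / n_all


def rar_rer_curve(confidence, correct, is_ood, steps=1000):
    """Sweep thresholds 0, 1/steps, ..., 1; rows are (threshold, RAR, RER)."""
    return [
        (i / steps, *rar_rer(confidence, correct, is_ood, i / steps))
        for i in range(steps + 1)
    ]


# Toy data: three ID samples followed by two OOD samples.
conf = [0.9, 0.8, 0.6, 0.95, 0.3]
corr = [True, False, True, False, False]
ood = [False, False, False, True, True]
rar, rer = rar_rer(conf, corr, ood, threshold=0.7)
print(f"RAR={rar:.3f}, RER={rer:.3f}")  # RAR=0.333, RER=0.400
```

With `steps=1000` the sweep matches the step size of 0.001 stated in Section 4.3. On an ID-only test set the two denominators coincide, so the same function covers both evaluation settings. The 17.65% figure from Section 4.2 follows from the same bookkeeping: adding 0.1765 N OOD samples to N ID samples gives an OOD share of 0.1765 / 1.1765 ≈ 15%.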
uncertain for all incorrect predictions as well as predictions for OOD inputs. In Figure 2, the uncertainty ratios are shown only for CIFAR-10 and EfficientNet, but we observe the same findings in our other considered configurations. Most interestingly, the curves for incorrectly classified and OOD inputs match very closely. Obviously, this raises the question: How correlated are these two categories and will better UQ methods be able to better detect novel inputs? This could be a subject for future research. The plots again indicate that very low error rates can only be achieved at the cost of sacrificing a lot of accuracy. It is also worth mentioning that EDL and LC exhibit a smoother behavior over the range of thresholds, especially compared to Softmax and MCDO, and are therefore less sensitive towards small changes in the choice of a threshold.

[Figure 2, plots not reproduced. One panel each for Softmax, Deep Ensembles, Monte-Carlo Dropout, Evidential Deep Learning and Learned Confidence, with Confidence Threshold on the x-axis and Uncertainty Ratio on the y-axis. Legend: correct, incorrect, ood.]
Figure 2: The ratio of inputs marked as uncertain for the three cases (correctly classified, incorrectly classified, and OOD inputs) over the range of thresholds for EfficientNet on CIFAR-10.

Influence of Model Architecture
While the choice of architecture is important for the performance with respect to accuracy, its influence on the OOD detection ability is not as significant, visually represented by how closely the blue and green curves match. An exception to this are some configurations with LC, especially with SqueezeNet. On CIFAR-10, all architectures perform mostly the same regarding their novelty detection ability and on the easier dataset GTSRB SqueezeNet has a slight edge. For NWPU-RESISC45, VGG16 rejects OOD samples slightly better; however, its baseline accuracy is about 20% lower for all UQ methods. Figure 3 shows the overall performance of DE for all architectures on the combined ID + OOD datasets.

[Figure 3, plots not reproduced. Panels: Deep Ensembles / GTSRB, Deep Ensembles / CIFAR-10, Deep Ensembles / NWPU-RESISC45, with RER on the x-axis and RAR on the y-axis. Legend: EfficientNet, SqueezeNet, VGG16.]
Figure 3: Remaining Error Rate (RER) vs. Remaining Accuracy Rate (RAR) for Deep Ensembles on the normalized ID+OOD dataset trained on GTSRB, CIFAR-10 and NWPU-RESISC45 with EfficientNet, SqueezeNet and VGG16.

5 Conclusion and Future Work

In this paper, we investigated the question whether uncertainty quantification is sufficient for detecting out-of-distribution inputs. To that end, we applied different state-of-the-art methods and network architectures to three image classification tasks. While all tested UQ methods assign high uncertainty to some of the OOD samples, their rejection capabilities will not suffice for most safety-critical applications, especially considering that in the real world even more difficult OOD inputs can occur. If UQ is to be applied, deep ensembles consistently showed the best trade-off between performance and remaining error, but mostly due to their better accuracy baseline to begin with.

A closer look at our results revealed that in many cases all methods fail on OOD inputs from the same classes. This hints at the possibility that certain OOD inputs are conceptually harder (or even impossible) to identify either by UQ methods or even in general. However, further research is needed to provide more evidence. In addition, many novelty detection approaches have been proposed in recent years. It would be interesting to see how they perform compared to the UQ methods presented here. Furthermore, their error patterns may provide additional insights into the difficulties of OOD detection in general.

Additionally, it is worthwhile investigating whether our findings also transfer to other tasks, e.g., object detection or instance segmentation, and to other types of input data, for instance, radar or lidar point clouds. While there are similar base components at play (object detectors even use the investigated networks as feature extractors), the transferability of our results is not guaranteed.

Acknowledgments

This work was partially supported by the Bavarian Ministry of Economic Affairs, Regional Development and Energy through the Center for Analytics – Data – Applications (ADA-Center) within the framework of “BAYERN DIGITAL II” and within the Intel Collaborative Research Institute – Safe Automated Vehicles.

References

[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete Problems in AI Safety. arXiv:1606.06565 [cs], July 2016.

[Burton et al., 2019] Simon Burton, Lydia Gauerhof, Bibhuti Bhusan Sethy, Ibrahim Habli, and Richard Hawkins. Confidence Arguments for Evidence of Performance in Machine Learning for Highly Automated Driving Functions. In Computer Safety, Reliability, and Security, LNCS, pages 365–377, Cham, 2019. Springer International Publishing.

[Cheng et al., 2017] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE, 105(10):1865–1883, October 2017.

[Cheng et al., 2019] Chih-Hong Cheng, Chung-Hao Huang, Thomas Brunner, and Vahid Hashemi. Towards Safety Verification of Direct Perception Neural Networks. arXiv:1904.04706 [cs], November 2019.

[DeVries and Taylor, 2018] Terrance DeVries and Graham W. Taylor. Learning Confidence for Out-of-Distribution Detection in Neural Networks. arXiv:1802.04865 [cs, stat], February 2018.

[Gal and Ghahramani, 2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proc. ICML 2016, volume 48, pages 1050–1059. PMLR, June 2016.

[Guo et al., 2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern Neural Networks. In Proc. ICML 2017, pages 1321–1330. JMLR.org, August 2017.

[Hendrycks and Gimpel, 2018] Dan Hendrycks and Kevin Gimpel. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In Proc. ICML 2017. JMLR.org, October 2018.

[Hendrycks et al., 2019] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty. In Advances in Neural Information Processing Systems 32, pages 15663–15674. Curran Associates, Inc., October 2019.

[Henne et al., 2020] Maximilian Henne, Adrian Schwaiger, Karsten Roscher, and Gereon Weiss. Benchmarking Uncertainty Estimation Methods for Deep Learning With Safety-Related Metrics. In Proc. SafeAI@AAAI 2020, volume 2560 of CEUR Workshop Proceedings, pages 83–90, 2020.

[Iandola et al., 2016] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360, 2016. eprint: 1602.07360.

[Katz et al., 2017] Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification, LNCS, pages 97–117, Cham, 2017. Springer International Publishing.

[Kim et al., 2018] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In Proc. ICML 2018, pages 2668–2677, July 2018.

[Lakshminarayanan et al., 2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.

[Lee et al., 2018] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. arXiv:1807.03888 [cs, stat], October 2018.

[Liang et al., 2018] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. arXiv:1706.02690 [cs, stat], February 2018.

[Masana et al., 2018] Marc Masana, Idoia Ruiz, Joan Serrat, Joost van de Weijer, and Antonio M. Lopez. Metric Learning for Novelty and Anomaly Detection. In Proc. BMVC 2018, August 2018.

[Mohseni et al., 2020] Sina Mohseni, Mandar Pitale, JBS Yadawa, and Zhangyang Wang. Self-Supervised Learning for Generalizable Out-of-Distribution Detection. In Proc. AAAI 2020, page 8, 2020.

[Olah et al., 2020] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom In: An Introduction to Circuits. Distill, 5(3):10.23915/distill.00024.001, March 2020.

[Osawa et al., 2019] Kazuki Osawa, Siddharth Swaroop, Mohammad Emtiyaz E Khan, Anirudh Jain, Runa Eschenhagen, Richard E Turner, and Rio Yokota. Practical Deep Learning with Bayesian Principles. In Advances in Neural Information Processing Systems 32, pages 4287–4299. Curran Associates, Inc., 2019.

[Pawlowski et al., 2017] Nick Pawlowski, Miguel Jaques, and Ben Glocker. Efficient variational Bayesian neural network ensembles for outlier detection. In Proc. ICLR 2017. OpenReview.net, 2017.

[Ren et al., 2019] Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems 32, pages 14707–14718. Curran Associates, Inc., 2019.

[Salay et al., 2020] Rick Salay, Krzysztof Czarnecki, Maria Soledad Elli, Ignacio J. Alvarez, Sean Sedwards, and Jack Weast. PURSS: Towards Perceptual Uncertainty Aware Responsibility Sensitive Safety with ML. In Proc. SafeAI@AAAI 2020, volume 2560 of CEUR Workshop Proceedings, pages 91–95, 2020.

[Sensoy et al., 2018] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential Deep Learning to Quantify Classification Uncertainty. In Advances in Neural Information Processing Systems 31, pages 3179–3189. Curran Associates, Inc., 2018.

[Tan and Le, 2019] Mingxing Tan and Quoc Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proc. ICML 2019, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. PMLR, June 2019.

[Timofte et al., 2014] Radu Timofte, Karel Zimmermann, and Luc Van Gool. Multi-view traffic sign detection, recognition, and 3D localisation. Machine Vision and Applications, 25(3):633–647, April 2014.

[Weiss et al., 2018] Gereon Weiss, Philipp Schleiss, Daniel Schneider, and Mario Trapp. Towards integrating undependable self-adaptive systems in safety-critical environments. In Proc. SEAMS 2018, pages 26–32. ACM, May 2018.

[Willers et al., 2020] Oliver Willers, Sebastian Sudholt, Shervin Raafatnia, and Stephanie Abrecht. Safety Concerns and Mitigation Approaches Regarding the Use of Deep Learning in Safety-Critical Perception Tasks. arXiv:2001.08001 [cs, stat], January 2020.

[Serrà et al., 2019] Joan Serrà, David Álvarez, Vicenç Gómez, Olga Slizovskaia, José F. Núñez, and Jordi Luque.
Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models. In Proc. ICLR 2020, September 2019. [Shalev-Shwartz et al., 2018] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. On a Formal Model of Safe and Scalable Self-driving Cars. ArXiv170806374 Cs Stat, October 2018. [Simonyan and Zisserman, 2015] Karen Simonyan and An- drew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proc. ICLR 2015, 2015. [Stallkamp et al., 2011] Johannes Stallkamp, Marc Schlips- ing, Jan Salmen, and Christian Igel. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, pages 1453–1460, July 2011.