A Comparison of Uncertainty Estimation Approaches in Deep Learning Components for Autonomous Vehicle Applications

Fabio Arnez, Huascar Espinoza, Ansgar Radermacher and François Terrier
CEA LIST, Gif-sur-Yvette, France
{fabio.arnez, huascar.espinoza, ansgar.radermacher, francois.terrier}@cea.fr

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

A key factor for ensuring safety in Autonomous Vehicles (AVs) is to avoid any abnormal behavior under undesirable and unpredicted circumstances. As AVs increasingly rely on Deep Neural Networks (DNNs) to perform safety-critical tasks, different methods for uncertainty quantification have recently been proposed to measure the inevitable sources of error in data and models. However, uncertainty quantification in DNNs remains a challenging task: these methods require a higher computational load and a higher memory footprint, and they introduce extra latency, which can be prohibitive in safety-critical applications. In this paper, we provide a brief and comparative survey of methods for uncertainty quantification in DNNs, along with existing metrics to evaluate uncertainty predictions. We are particularly interested in understanding the advantages and downsides of each method for specific AV tasks and types of uncertainty sources.

1 Introduction

In the last decade, Deep Neural Networks (DNNs) have enabled great advances in real-world applications such as Autonomous Vehicles (AVs), where they perform complex tasks like object detection and tracking or vehicle control. Despite the substantial performance improvements introduced by DNNs, they still have significant safety shortcomings due to their complexity, opacity, and lack of interpretability [McAllister et al., 2017]. In particular, DNNs are brittle to operational domain shift and even to small data corruptions or perturbations [Kuutti et al., 2020]. This impedes ensuring the reliability of DNN models, which is a precondition for safety-critical systems that must comply with automotive industry safety standards and avoid jeopardizing human lives.

A concrete safety problem is to detect abnormal situations under uncertain environment conditions and DNN-specific unpredictability. These situations are difficult to analyze during the system development phases in a way that allows them to be properly mitigated at runtime. Indeed, even when a DNN model achieves great performance on a validation set drawn from its operational environment, it is currently impossible to test and provide the same performance guarantees in all the possible environment configurations the system could encounter in the real world [Kuutti et al., 2020]. A common practice to overcome this problem is runtime monitoring of DNN components, so that safety can be ensured even if the component was not fully validated at design time [Henne et al., 2020; Koopman et al., 2019]. A central aspect of enabling DNN monitoring is to provide a runtime treatment of the uncertainties associated with DNN predictions [McAllister et al., 2017; Koopman et al., 2019].

In this paper, we review common uncertainty estimation methods for DNNs and compare their performance and benefits for different AV tasks. These methods offer a potential solution for runtime DNN confidence prediction and for the detection of Out-of-Distribution (OOD) samples, since prediction probability scores in DNNs do not provide a true representation of uncertainty [Mohseni et al., 2019]. However, these methods still demand a high computational load, incorporate extra latency, and require a larger memory footprint. We compare these factors since they can represent a major impediment in safety-critical applications with tight time constraints and limited computation hardware. We also briefly survey uncertainty metrics that evaluate the performance of quantification methods, as another critical factor for ensuring safety in AV systems.

The remainder of the paper is structured as follows. Section 2 describes the sources of uncertainty in deep learning for AVs. Section 3 presents a comparison of recent works in AV tasks that include uncertainty estimation methods for DNNs; it provides a brief review of common uncertainty estimation methods in deep learning, as well as metrics for predictive uncertainty evaluation in classification and regression tasks. Section 4 discusses open challenges and possible directions for future work.
2 Background

2.1 Sources of Uncertainty in Deep Learning for Autonomous Vehicles

Autonomous vehicles have to deal with dynamic, non-stationary, and highly unpredictable operational environments. Taking into account every detail of the operational environment at design time is an intractable task. Instead, the operational environment is constrained so that it considers only a subset of all possible situations the system can encounter in operation. This process is known as Operational Design Domain (ODD) adoption [Koopman and Fratrik, 2019], and safety requirements are built on top of the ODD specification.

Given the constrained operational environment within the system ODD, ensuring safety in an AV requires the identification of unfamiliar contexts by modeling the AV's uncertainty [McAllister et al., 2017]. However, many factors, not only related to the environment, affect system performance by introducing some degree of uncertainty. [Czarnecki and Salay, 2018] identify a set of factors that contribute to uncertainty in the perception function of an AV and in this way affect its performance. From this set, we pay special attention to sensor properties, model uncertainty, situation and scenario coverage, and operational domain uncertainty. In the context of DNNs, the first two factors can be modeled using uncertainty estimation methods, while the last two correspond to some degree of dataset shift (i.e., breaking the independent and identically distributed assumption between training and testing data) and to Out-of-Distribution (OOD) samples [Quionero-Candela et al., 2009; Mohseni et al., 2019].

Sensor properties like range, resolution, noise characteristics, and calibration can influence the amount of information in the samples delivered to a machine learning model during training or testing. In consequence, the effect of these properties is captured as noise and ambiguities inherent to the obtained samples. This type of noise in the data is known as aleatoric uncertainty, and it represents the incapability of completely sensing all the details of the environment [Kendall and Gal, 2017; Lee et al., 2019b; Gustafsson et al., 2019]. Aleatoric uncertainty can be further classified into homoscedastic uncertainty (uncertainty that remains constant for different samples) and heteroscedastic uncertainty (uncertainty that varies between samples).

Model uncertainty is often referred to as epistemic uncertainty and accounts for uncertainty in the model parameters. This type of uncertainty captures the ignorance of the model as a consequence of a dataset that does not represent the ODD well, or that is not sufficiently large [Kendall and Gal, 2017; Lee et al., 2019b]. Epistemic uncertainty is expected to increase in unknown situations (e.g., ODD conditions that differ in weather or lighting), and it can be explained away by incorporating more data.

Situation and scenario coverage is related to the degree to which situations and scenarios from an ODD are reflected in the training and operation stages, while operational domain uncertainty refers to a discrepancy between the ODD situations and scenarios present at training and those encountered at operation (e.g., scenarios from two different ODDs) [Czarnecki and Salay, 2018]. In both cases, uncertainty can be reduced by incorporating more data or by adjusting the ODD specification. However, it is extremely important to detect and discover OOD samples (i.e., outliers), especially those that have not been seen before, since they can lead to highly confident predictions that are wrong, i.e., the unknown-unknowns [Bansal and Weld, 2018].

In a similar fashion to the cases presented before, the automotive industry standard ISO/PAS 21448 or SOTIF (Safety Of The Intended Functionality) [ISO, 2019] provides a process to identify unknown and potentially unsafe scenarios in order to minimize risk by recognizing the performance limitations of sensors, algorithms, or user misuse. Unsafe scenarios can be further classified into known-unsafe (e.g., out-of-ODD samples) or unknown-unsafe (e.g., OOD samples). Once an unknown-unsafe scenario or situation is identified, it becomes a known-unsafe scenario that can be mitigated at design time [Rau et al., ; Mohseni et al., 2019].
2.2 Uncertainty Estimation Methods for DNNs

In recent years, many probabilistic deep learning methods have been proposed to obtain an uncertainty measure from an approximation to the (highly multi-modal) predictive distribution, as well as methods for calibrating the outputs of DNNs. In general, there are two approaches to DNN predictive uncertainty calculation: sampling-based and sampling-free methods. Sampling-based methods rely on taking multiple predictive samples for the same input to compute the estimator that will be associated with uncertainty, whereas sampling-free methods require only a single predictive output. These methods are further discussed in Section 3.

Neural Network Calibration

Confidence calibration represents the degree to which a model's predicted probability estimates the true correctness likelihood [Guo et al., 2017]. Under ideal circumstances, we expect the normalized outputs of a DNN (i.e., softmax outputs) to correspond to the true correctness likelihood [Guo et al., 2017]. From a frequentist perspective, this can be viewed as a discrepancy measure between local confidence (or uncertainty) predictions and the expected performance in the long run [Hubschneider et al., 2019; Lakshminarayanan et al., 2017]. For example, we expect a class predicted with probability p to be correct p% of the time: out of 100 samples predicted with confidence 0.9, we expect 90 correct predictions. DNNs can be calibrated using Temperature Scaling, a simple post-processing technique [Guo et al., 2017], or, more recently, Dirichlet calibration [Kull et al., 2019]. For the regression setting, [Kuleshov et al., 2018; Hubschneider et al., 2019] formalize the calibration notion for continuous variables, in which a p% confidence interval should contain the true outcome p% of the time.

Despite the improvements achieved with calibration methods, they cannot be seen as a complete solution to the uncertainty estimation problem, since calibration is performed relative to a validation dataset [Kull et al., 2019; Ashukha et al., 2020] (i.e., calibration methods rely on in-distribution samples to learn a calibration map). In the presence of OOD samples, a model is no longer calibrated. This limits the contribution of calibration techniques to scenarios where huge training datasets are available.
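To make the post-processing step concrete, the following is a minimal sketch of temperature scaling in the spirit of [Guo et al., 2017], assuming a trained classifier whose validation logits and labels have already been collected; the function name and optimizer settings are illustrative choices, not prescribed by the original work.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Fit a single temperature T > 0 on held-out validation logits by
    minimizing the NLL (cross-entropy) of the T-scaled logits."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so that T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def nll_closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(nll_closure)
    return float(log_t.exp())

# Usage: T = fit_temperature(val_logits, val_labels)
# Calibrated probabilities: F.softmax(test_logits / T, dim=1).
# Dividing all logits by a single scalar T never changes the arg max,
# so accuracy is preserved and only the confidence is rescaled.
```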
3 Comparison of Uncertainty Estimation Methods in AV Domain

In this section, we compare and analyze some common uncertainty estimation methods in terms of out-of-the-box calibration of the predictions (i.e., without prior calibration), computational budget, memory footprint, and the changes required in the DNN to apply each method (architecture, loss function, and others). We have chosen the most representative works, to the best of our knowledge, in each application. Some of the listed works introduce improvements by combining several methods. This comparison is summarized in Table 1.

3.1 Methods Limited to Aleatoric Uncertainty

The first four methods listed in Table 1 deal exclusively with aleatoric uncertainty. In classification tasks, uncertainty is usually represented by the normalized logits at the output layer (e.g., the softmax output), which can be interpreted as a probability distribution related to aleatoric uncertainty [Gustafsson et al., 2019]. Unfortunately, normalized outputs used as probability distributions fail to capture model uncertainty, and this very often results in overconfident predictions that are wrong [Guo et al., 2017], especially in the presence of dataset shift. To overcome the problems of softmax, [Gast and Roth, 2018] propose to use a Dirichlet distribution instead.

In a regression configuration, deep learning models do not have an uncertainty representation by default. Instead, the outputs of the DNN are made to parameterize a probability distribution (e.g., Gaussian, Laplace) to obtain a probabilistic representation. This modification of the architecture allows DNNs to learn aleatoric uncertainty from the data itself by using the heteroscedastic loss and maximum likelihood [Kendall and Gal, 2017; Ilg et al., 2018]. Similarly, in the heteroscedastic version of classification, [Kendall and Gal, 2017] place a Gaussian distribution over the output logits (i.e., each logit with its respective variance) before the softmax layer is applied. An alternative approach replaces the inputs, outputs, and activation functions of a DNN with probability distributions [Gast and Roth, 2018]. This method allows a fixed uncertainty at the input to be propagated to the output of the DNN by employing Assumed Density Filtering (ADF).
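As an illustration, here is a minimal sketch of the heteroscedastic Gaussian likelihood described above, assuming a PyTorch regression model; the class and function names are ours, and the loss uses the numerically stable log-variance parameterization commonly associated with [Kendall and Gal, 2017].

```python
import torch
import torch.nn as nn

class HeteroscedasticHead(nn.Module):
    """Regression head that predicts a mean and a log-variance per output
    dimension, so the network can learn input-dependent (aleatoric) noise."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.mu = nn.Linear(in_features, out_features)
        self.log_var = nn.Linear(in_features, out_features)  # s = log(sigma^2)

    def forward(self, h: torch.Tensor):
        return self.mu(h), self.log_var(h)

def heteroscedastic_nll(mu, log_var, target):
    # Gaussian NLL up to a constant: the exp(-s) factor down-weights the
    # residual of noisy samples, while the 0.5 * s term keeps the network
    # from predicting an arbitrarily large variance everywhere.
    return (0.5 * torch.exp(-log_var) * (target - mu) ** 2 + 0.5 * log_var).mean()
```

During training, the network can attenuate the residual of noisy samples by predicting a larger log-variance, which is how aleatoric uncertainty is learned from the data itself.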
3.2 Bayesian Neural Networks

Bayesian Neural Networks (BNNs) aim to learn a distribution over the weights instead of point estimates. In this way, we look for the posterior distribution of the weights given the data, p(w|D), by applying Bayes' theorem to the data likelihood and a chosen prior distribution over the weights p(w):

$$p(w \mid D) = \frac{p(D \mid w)\,p(w)}{p(D)} = \frac{p(D \mid w)\,p(w)}{\int p(D \mid w)\,p(w)\,dw} \qquad (1)$$

Given the posterior distribution p(w|D), we obtain the predictive posterior distribution for a new input $x^*$ by marginalizing over the model parameters:

$$p(y^* \mid x^*, D) = \int p(y^* \mid x^*, w)\,p(w \mid D)\,dw \qquad (2)$$

Instead of relying on only one configuration of the weights, we use every possible configuration of the weights (all possible models), weighted by the posterior on the parameters, to make a prediction, i.e., $p(y^* \mid x^*, D) = \mathbb{E}_{p(w \mid D)}[p(y^* \mid x^*, w)]$. This represents the Bayesian Model Average (BMA) and accounts for epistemic uncertainty [Wilson and Izmailov, 2020; Gal, 2016; Blundell et al., 2015].

Unfortunately, the integrals in (1) and (2) are intractable. Thus, we must build a distribution that approximates the true posterior distribution over the weights, q(w) ≈ p(w|D). Two main paradigms exist to build q(w): Markov Chain Monte Carlo (MCMC) and Variational Inference (VI) methods. In the former, the gold standard is Hamiltonian Monte Carlo (HMC), and other methods like Stochastic Gradient MCMC (SG-MCMC) have been explored. However, MCMC methods are in general hard to scale to large DNNs due to the high-dimensional and multi-modal posterior distribution [Gustafsson et al., 2019]. In the latter case, VI methods approximate the posterior over the weights with a simpler distribution $q_\phi(w)$ (e.g., a Gaussian) parameterized by $\phi$. The parameters of $q_\phi(w)$ are found by minimizing the KL-divergence to p(w|D).

A particularly scalable and easy-to-implement sampling-based method for approximate VI is Monte Carlo Dropout (MCD) [Gal and Ghahramani, 2016]. In this method, dropout regularization is also applied at test time, so that $q_\phi(w)$ is a Bernoulli distribution. Dropout is only performed in some of the deeper layers of the DNN, to better model high-level features and to avoid slow training [Mukhoti and Gal, 2018; Kendall et al., 2015]. Dropout probabilities can be set manually, or the network can tune the dropout rates during training [Gal et al., 2017].

All the MCD-related methods listed in Table 1 refer to this approximation of BNNs. It can be noted from the performance comparison criteria that the need to take multiple forward passes (output samples) for the same input, in order to approximate the distribution in Equation 2, represents a major impediment in safety-critical applications with tight time constraints and limited computation hardware.

To get a representation of both types of uncertainty (aleatoric and epistemic), the methods presented in Section 3.1 have been used in combination with MCD. For example, in a regression configuration, a set of T samples is taken from the predictions of a DNN that parameterizes a distribution at its output: $\{\hat{y}_t, \hat{\sigma}_t\}_{t=1}^{T}$. However, since aleatoric uncertainty is learned from the data itself (by using the heteroscedastic loss), this approach can produce wrong uncertainty estimates for samples that include a higher level of uncertainty than that observed during training. Another approach, presented in [Loquercio et al., 2020], applies MCD to take samples from a DNN whose inputs, outputs, and activation functions are replaced by probability distributions according to [Gast and Roth, 2018]. This method permits propagating uncertainty at the input to the output of the DNN using ADF (e.g., sensor noise can be propagated to the output of the DNN). This is an appealing method for AV applications, where sensor properties are commonly known. Interestingly, the authors show that this method can be applied to already-trained DNNs and is architecture agnostic.
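The following is a minimal sketch of MCD inference, which approximates the BMA in Equation (2) by a Monte Carlo average over T stochastic passes. It assumes a PyTorch model containing dropout layers; calling model.train() is the simplest way to keep dropout active at test time, although a careful implementation would re-enable only the dropout modules and keep batch-norm layers in eval mode.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor, num_samples: int = 20):
    """Monte Carlo Dropout: keep dropout active at test time and aggregate
    num_samples stochastic forward passes for the same input."""
    model.train()  # re-enables dropout (see caveat about batch-norm above)
    outputs = torch.stack([model(x) for _ in range(num_samples)])  # (T, N, ...)
    mean = outputs.mean(dim=0)      # predictive mean used as the final output
    variance = outputs.var(dim=0)   # spread across passes: epistemic uncertainty proxy
    return mean, variance
```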
| Method | Autonomous Vehicle Task | Aleatoric | Epistemic | Out-of-the-box Calibration | Computational Budget | Memory Footprint | Changes in DNN |
|---|---|---|---|---|---|---|---|
| Softmax logits as parameters of a prob. dist. | Object Detection [Feng et al., 2019b] | ✓ | ✗ | Bad | Fair | Low | Small |
| Outputs as parameters of a prob. dist. | Object Detection [Feng et al., 2019b] | ✓ | ✗ | Bad | Fair | Low | Small |
| Inputs, activations and outputs as prob. dist. & ADF | Optical Flow [Gast and Roth, 2018] | ✓ | ✗ | Undefined | Low | Low | Mid |
| Point estimate & MCD regression | Steering Angle Prediction [Hubschneider et al., 2019; Michelmore et al., 2018; Michelmore et al., 2019] | ✗ | ✓ | Fair | Fair | Low | None |
| Softmax & MCD | Traffic Sign Recognition [Henne et al., 2020]; Semantic Segmentation [Phan et al., 2019; Mukhoti and Gal, 2018; Gustafsson et al., 2019]; Steering Angle Prediction [Hubschneider et al., 2019] | ✗ | ✓ | Fair | Fair | Low | None |
| Deep Ensembles | Traffic Sign Recognition [Henne et al., 2020]; Semantic Segmentation [Gustafsson et al., 2019]; Depth Estimation [Gustafsson et al., 2019] | ✓ | ✓ | Good | High | High | Small |
| Bootstrap Ensembles | Steering Angle Prediction [Hubschneider et al., 2019]; Optical Flow [Ilg et al., 2018] | ✓ | ✓ | Bad | Fair | Fair | Mid |
| Softmax logits as parameters of a prob. dist. & MCD | Object Detection [Feng et al., 2018] | ✓ | ✓ | Fair | High | Low | Small |
| Outputs as parameters of a prob. dist. & MCD | Object Detection [Feng et al., 2018]; Steering Angle Prediction [Lee et al., 2019b; Lee et al., 2019c; Lee et al., 2019a]; Depth Estimation [Gustafsson et al., 2019] | ✓ | ✓ | Fair | High | Low | Small |
| Inputs, activations and outputs as prob. dist. & ADF & MCD | Steering Angle Prediction [Loquercio et al., 2020] | ✓ | ✓ | Undefined | High | Low | Mid |
| MDNs | Steering Angle Prediction [Hubschneider et al., 2019; Choi et al., 2018]; Future Prediction [Makansi et al., 2019] | ✓ | ✓ | Bad | Low | Low | None |
| MDNs with stages | Future Prediction [Makansi et al., 2019] | ✓ | ✓ | Undefined | Low | Low | High |

Table 1: Uncertainty Estimation Methods Comparison (✓ = captured, ✗ = not captured)

3.3 Deep Ensembles

A Deep Ensemble (DE) is another sampling-based method, in which M DNNs are trained to obtain the predictive distribution p(y|x) [Lakshminarayanan et al., 2017]. Each DNN learns a set of parameters w that are point estimates, starting from a different random initialization and repeating the minimization M times. In an ensemble, predictions are averaged and can be considered as an equally weighted mixture model:

$$p(y \mid x) = \frac{1}{M}\sum_{i=1}^{M} p\left(y \mid x, \hat{w}^{(i)}\right), \qquad \{\hat{w}^{(i)}\}_{i=1}^{M} \qquad (3)$$

For classification, Equation (3) corresponds to an average of the softmax probabilities. For regression, the outputs that parameterize a probability distribution are averaged to represent the mean and variance of the mixture. In this manner, both types of uncertainty (aleatoric and epistemic) can easily be captured. Although DE is considered a non-Bayesian method, expression (3) represents an approximation of (2), since $\{\hat{w}^{(i)}\}_{i=1}^{M}$ can be seen as samples taken from a distribution that approximates the true posterior, obtained by exploring different modes of p(w|D) [Fort et al., 2019; Wilson and Izmailov, 2020].

As presented in Table 1, the DE method tends to outperform approximate Bayesian inference methods like MCD in terms of both uncertainty estimates and accuracy [Gustafsson et al., 2019]. A recent work by [Snoek et al., 2019] also shows that DE is more robust to dataset shift. These works suggest that DE should be considered the new standard method for predictive distributions and uncertainty estimation. However, DE has some drawbacks, especially if the target application is safety-critical. DE requires a higher computational load and a larger memory footprint, as shown in Table 1: for the training and testing stages, the number of parameters and the inference time scale linearly with M. To mitigate this problem, [Osband et al., 2016] propose a fused version of ensembles with multiple heads, where all the heads share the convolutional layers (feature extractors) and each head is trained using bootstrap samples.
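A minimal sketch of ensemble prediction for a classification task, assuming M independently trained PyTorch models; following Equation (3), the member softmax outputs are averaged with equal weights, and the disagreement between members can serve as an epistemic uncertainty cue. The function name is ours.

```python
import torch

@torch.no_grad()
def ensemble_predict(models, x: torch.Tensor):
    """Deep Ensemble prediction for classification, following Eq. (3):
    the equally weighted average of the member softmax outputs."""
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models])  # (M, N, C)
    mean_probs = probs.mean(dim=0)           # mixture prediction p(y|x)
    disagreement = probs.var(dim=0).sum(-1)  # member disagreement: epistemic cue
    return mean_probs, disagreement
```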
3.4 Mixture Density Networks

A Mixture Density Network (MDN) [Bishop, 1994] is a sampling-free method for regression tasks, where the aim is to train a DNN that predicts the parameters of a Gaussian Mixture Model (GMM) given an input x. A GMM models the conditional distribution as a weighted sum of K Gaussians:

$$p(y \mid x) = \sum_{i=1}^{K} \pi_i(x)\, \mathcal{N}\big(y \mid \mu_i(x), \Sigma_i(x)\big) \qquad (4)$$

where $\pi_i(x)$, $\mu_i(x)$, $\Sigma_i(x)$ represent the parameters of the GMM as functions of the input x for the K mixture components. For training, the Negative Log-Likelihood (NLL) is used as the loss function.

By using the law of total variance, [Choi et al., 2018] formalized the acquisition of aleatoric and epistemic uncertainty in MDNs. As a first step, the expectation of the GMM is obtained as a weighted sum of the mixture components: $\mathbb{E}[y \mid x] = \sum_{i=1}^{K} \pi_i(x)\mu_i(x)$. The predictive variance is then composed of the weighted sum of the variances and the weighted variance of the means:

$$\mathbb{V}[y \mid x] = \sum_{i=1}^{K} \pi_i(x)\Sigma_i(x) + \sum_{i=1}^{K} \pi_i(x)\left\|\mu_i(x) - \sum_{j=1}^{K} \pi_j(x)\mu_j(x)\right\|^2 \qquad (5)$$

where the first term represents the aleatoric uncertainty and the second term represents the epistemic uncertainty. We refer the reader to [Choi et al., 2018] for more details about uncertainty acquisition in MDNs.

As pointed out in Table 1, the sampling-free nature of this method reduces the computational load and the memory footprint, and it permits modeling complex distributions, with respect to the methods described before. These characteristics are attractive for real-time applications. However, MDNs suffer from numerical instability in high-dimensional problems and from mode collapse when regularization techniques are used [Makansi et al., 2019].
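A minimal sketch of this decomposition, assuming the MDN outputs mixture weights, means, and diagonal variances as tensors; the function mirrors Equations (4)-(5) following [Choi et al., 2018], with the names being our own.

```python
import torch

def mdn_uncertainty(pi: torch.Tensor, mu: torch.Tensor, var: torch.Tensor):
    """Law-of-total-variance decomposition for an MDN (Eqs. 4-5).
    Shapes: pi (N, K); mu and var (N, K, D), with var holding the
    diagonal of each component covariance."""
    w = pi.unsqueeze(-1)                             # (N, K, 1) mixture weights
    mean = (w * mu).sum(dim=1)                       # E[y|x] = sum_i pi_i mu_i
    aleatoric = (w * var).sum(dim=1)                 # weighted sum of component variances
    epistemic = (w * (mu - mean.unsqueeze(1)) ** 2).sum(dim=1)  # weighted variance of means
    return mean, aleatoric, epistemic                # total variance = aleatoric + epistemic
```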
3.5 Quality Metrics for Uncertainty Estimation

In this section, we discuss common metrics for evaluating the quality of uncertainty estimates.

Classification Metrics. Different measures of uncertainty exist for classification tasks. The variation ratio and information metrics such as predictive entropy and mutual information can be used in classification settings to represent uncertainty [Gal, 2016]. The variation ratio is a measure of dispersion, mutual information captures model confidence, and predictive entropy accounts for both epistemic and aleatoric uncertainty [Mukhoti and Gal, 2018; Michelmore et al., 2018; Phan et al., 2019]. [Mukhoti and Gal, 2018] propose specific performance metrics for semantic segmentation to evaluate Bayesian models. Since there is no ground truth for uncertainty estimation, [Snoek et al., 2019; Lakshminarayanan et al., 2017] argue for proper scoring rules such as NLL and the Brier score. NLL depends on the predictive uncertainty and is commonly evaluated on a held-out set; however, it can overestimate tail probabilities. The Brier score measures the accuracy of predictive probabilities as a sum of squared differences between the predicted probability vector and the target; nonetheless, this score tends to miss infrequent events. Other evaluation metrics, independent of score values, are the Area Under the Receiver Operating Characteristic (AUROC), the Area Under the Precision-Recall Curve (AUPRC), and the Area Under the Risk-Coverage curve (AURC) [Hendrycks and Gimpel, 2016; Ding et al., 2019].
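For sampling-based methods, these information metrics can be computed directly from the stacked class probabilities of T stochastic forward passes (e.g., the MCD or ensemble outputs sketched earlier); a minimal sketch, with mutual information obtained as the gap between the total and the expected entropy.

```python
import torch

def classification_uncertainty(probs: torch.Tensor, eps: float = 1e-12):
    """Predictive entropy and mutual information from T stochastic forward
    passes (MC Dropout samples or ensemble members); probs has shape (T, N, C)."""
    mean_probs = probs.mean(dim=0)                                    # (N, C)
    predictive_entropy = -(mean_probs * (mean_probs + eps).log()).sum(dim=-1)
    expected_entropy = -(probs * (probs + eps).log()).sum(dim=-1).mean(dim=0)
    mutual_information = predictive_entropy - expected_entropy       # epistemic part
    return predictive_entropy, mutual_information
```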
Regression Metrics. Similarly, in regression tasks, NLL is a proper scoring rule for a likelihood that follows a Gaussian distribution [Lakshminarayanan et al., 2017; Kendall and Gal, 2017]. Furthermore, [Ilg et al., 2018] introduce a relative measure for uncertainty estimation, the Area Under the Sparsification Error (AUSE) curve, which measures the difference between the dispersion of the predictions (affected by predictive uncertainty) and an oracle, in terms of the true prediction error, e.g., the Root Mean Squared Error (RMSE) [Gustafsson et al., 2019].

Calibration Metrics. For classification tasks, common quality metrics are the Expected Calibration Error (ECE) and the Maximum Calibration Error (MCE) [Guo et al., 2017]. The former measures the difference between expected accuracy and expected confidence; the latter identifies the largest discrepancy between accuracy and confidence, which is of particular interest in safety-critical applications. For a regression configuration, [Kuleshov et al., 2018] use a calibration error that represents the sum of weighted squared differences between the expected and observed (empirical) confidence levels; correspondingly, in [Gustafsson et al., 2019] the authors propose the Area Under the Calibration Error curve (AUCE) as an absolute measure of uncertainty. The aforementioned authors use reliability diagrams (i.e., calibration plots) to obtain a visual representation of model calibration. Regardless of their drawbacks with OOD samples, calibration plots and measures are used extensively to compare the predictive quality of uncertainty estimation methods.
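A minimal sketch of ECE with equal-width confidence bins, assuming NumPy arrays of top-label confidences and 0/1 correctness indicators; the function name is ours.

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray,
                               num_bins: int = 10) -> float:
    """ECE with equal-width bins: the gap between average confidence and
    empirical accuracy per bin, weighted by the fraction of samples in the
    bin. MCE would instead take the maximum per-bin gap.
    confidences: (N,) top-label probabilities; correct: (N,) 0/1 indicators."""
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() equals |B_m| / N
    return float(ece)
```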
3.6 Considerations per AV Task Type

In the context of AVs, a broad variety of uncertainty estimation methods have been applied to (end-to-end) steering angle prediction. In some works, only epistemic uncertainty was captured, by using MCD [Michelmore et al., 2018; Michelmore et al., 2019]. Usually, however, both types of uncertainty are captured [Lee et al., 2019b; Lee et al., 2019c; Lee et al., 2019a], by using the method proposed by [Kendall and Gal, 2017], or by using DE, bootstrap ensembles, or MDNs. The calibration plots presented in [Hubschneider et al., 2019] show that MCD has better out-of-the-box calibration than bootstrap ensembles or MDNs; the last two methods are overconfident in their predictions. In this particular task, safety mechanisms that trigger when the uncertainty estimates surpass a given or learned threshold have been proposed in order to improve vehicle safety [Michelmore et al., 2018; Michelmore et al., 2019; Lee et al., 2019b].

Under the modular pipeline paradigm for AV control, probabilistic modeling has mainly been applied to perception tasks like object detection from 3D lidar, semantic segmentation, and depth estimation. For 3D object detection from lidar point clouds, [Feng et al., 2018] estimate aleatoric and epistemic uncertainty using the methods proposed by [Kendall and Gal, 2017]. However, epistemic uncertainty estimation with MCD introduces a high computational cost. A later work by [Feng et al., 2019b] leverages aleatoric uncertainties to greatly improve performance and to reduce the computational load with respect to MCD. In [Feng et al., 2019a], the authors show that the predictions for classification and regression are miscalibrated, and they propose methods to fix the calibration of DNNs and produce better uncertainty estimates.

For semantic segmentation, [Phan et al., 2019; Mukhoti and Gal, 2018; Gustafsson et al., 2019] model aleatoric uncertainty from the softmax output and epistemic uncertainty by using MCD or ensembles. Common uncertainty metrics in this case are the predictive entropy and the mutual information [Mukhoti and Gal, 2018]. For depth estimation, [Gustafsson et al., 2019] compare DE with heteroscedastic regression in combination with MCD [Kendall and Gal, 2017]. In both of the previous tasks (semantic segmentation and depth estimation), DE achieves better performance and calibration than the MCD variants [Gustafsson et al., 2019]. However, in DE the computational cost at training and testing grows linearly with the number of ensemble members. Similarly, for traffic sign recognition, DE exhibits the best-calibrated outputs, but in this case MCD in combination with softmax also produces well-calibrated outputs, close to those of DE [Henne et al., 2020].

For optical flow, [Gast and Roth, 2018] capture aleatoric uncertainty by replacing the inputs, outputs, and activation functions with probability distributions. This method allows a fixed value of uncertainty at the input to be propagated to the output of the DNN. [Ilg et al., 2018] present an alternative approach, where DE and bootstrap ensembles are used to obtain the predictive uncertainty.

For future prediction, [Makansi et al., 2019] propose an improvement to MDNs to predict the multi-modal distribution of the positions of a vehicle in the future. This method comprises two stages: a sampling network and a fitting network. The former receives the current position of the vehicle as input and outputs a fixed number of hypotheses for future positions. The latter fits a mixture distribution to the hypotheses estimated by the first network. This improvement helps avoid mode collapse in MDNs; however, high-dimensional outputs remain challenging for this approach.

4 Conclusions

We presented a comparative survey of uncertainty estimation methods for both classification and regression tasks in the AV domain, together with a general comparative analysis of these methods. From this analysis, we can see that DE has become a gold standard for uncertainty quantification in many AV tasks, thanks to its high-quality uncertainty predictions and its robustness to OOD samples. However, its high computational load and large memory footprint can hinder its use in safety-critical applications with hardware limitations or tight time constraints. Here, sampling-free methods are an interesting avenue for future research. New robust (to OOD) and lightweight approaches should be explored in the AV domain to produce good-quality uncertainty estimates. We also observed that the predictions of these methods are often uncalibrated (overconfident or underconfident) and that calibration methods are usually applied only to classification tasks. We encourage the application of calibration methods to regression tasks as well, using the methods proposed by [Kuleshov et al., 2018], instead of limiting the assessment of predictions to reliability diagrams. We also suggest studying and comparing uncertainty estimation methods under dataset-shift conditions to assess their robustness. For future work, we plan to incorporate uncertainty information into the Responsibility-Sensitive Safety model [Shalev-Shwartz et al., 2017]. This generalizes the approach of [Salay et al., 2020] by considering component uncertainty from different AV subsystems and propagating it through them. These subsystems could include DNNs, e.g., for planning and control.

Acknowledgments

This work has received funding from the COMP4DRONES project, under Joint Undertaking (JU) grant agreement N° 826610. The JU receives support from the European Union's Horizon 2020 research and innovation programme and from Spain, Austria, Belgium, Czech Republic, France, Italy, Latvia, Netherlands.

References

[Ashukha et al., 2020] Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. arXiv preprint arXiv:2002.06470, 2020.
[Bansal and Weld, 2018] Gagan Bansal and Daniel S Weld. A coverage-based utility model for identifying unknown unknowns. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[Bishop, 1994] Christopher M Bishop. Mixture density networks. 1994.
[Blundell et al., 2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613-1622, 2015.
[Choi et al., 2018] Sungjoon Choi, Kyungjae Lee, Sungbin Lim, and Songhwai Oh. Uncertainty-aware learning from demonstration using density networks with sampling-free variance modeling. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6915-6922. IEEE, 2018.
[Czarnecki and Salay, 2018] Krzysztof Czarnecki and Rick Salay. Towards a framework to manage perceptual uncertainty for safe automated driving. In International Conference on Computer Safety, Reliability, and Security, pages 439-445. Springer, 2018.
[Ding et al., 2019] Yukun Ding, Jinglan Liu, Jinjun Xiong, and Yiyu Shi. Evaluation of neural network uncertainty estimation with application to resource-constrained platforms. arXiv preprint arXiv:1903.02050, 2019.
[Feng et al., 2018] Di Feng, Lars Rosenbaum, and Klaus Dietmayer. Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 3266-3273. IEEE, 2018.
[Feng et al., 2019a] Di Feng, Lars Rosenbaum, Claudius Glaeser, Fabian Timm, and Klaus Dietmayer. Can we trust you? on calibration of a probabilistic object detector for autonomous driving. arXiv preprint arXiv:1909.12358, 2019.
[Feng et al., 2019b] Di Feng, Lars Rosenbaum, Fabian Timm, and Klaus Dietmayer. Leveraging heteroscedastic aleatoric uncertainties for robust real-time lidar 3d object detection. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 1280-1287. IEEE, 2019.
[Fort et al., 2019] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
[Gal and Ghahramani, 2016] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050-1059, 2016.
[Gal et al., 2017] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581-3590, 2017.
[Gal, 2016] Yarin Gal. Uncertainty in deep learning. University of Cambridge, 1:3, 2016.
[Gast and Roth, 2018] Jochen Gast and Stefan Roth. Lightweight probabilistic deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3369-3378, 2018.
[Guo et al., 2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1321-1330. JMLR.org, 2017.
[Gustafsson et al., 2019] Fredrik K Gustafsson, Martin Danelljan, and Thomas B Schön. Evaluating scalable bayesian deep learning methods for robust computer vision. arXiv preprint arXiv:1906.01620, 2019.
[Hendrycks and Gimpel, 2016] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
[Henne et al., 2020] Maximilian Henne, Adrian Schwaiger, Karsten Roscher, and Gereon Weiss. Benchmarking uncertainty estimation methods for deep learning with safety-related metrics. 2020.
[Hubschneider et al., 2019] Christian Hubschneider, Robin Hutmacher, and J Marius Zöllner. Calibrating uncertainty models for steering angle estimation. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 1511-1518. IEEE, 2019.
[Ilg et al., 2018] Eddy Ilg, Ozgun Cicek, Silvio Galesso, Aaron Klein, Osama Makansi, Frank Hutter, and Thomas Brox. Uncertainty estimates and multi-hypotheses networks for optical flow. In Proceedings of the European Conference on Computer Vision (ECCV), pages 652-667, 2018.
[ISO, 2019] ISO. PAS 21448 - Road vehicles - Safety of the intended functionality. International Organization for Standardization, 2019.
[Kendall and Gal, 2017] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574-5584, 2017.
[Kendall et al., 2015] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.
[Koopman and Fratrik, 2019] Philip Koopman and Frank Fratrik. How many operational design domains, objects, and events? 2019.
[Koopman et al., 2019] Philip Koopman, Beth Osyk, and Jack Weast. Autonomous vehicles meet the physical world: Rss, variability, uncertainty, and proving safety. In International Conference on Computer Safety, Reliability, and Security, pages 245-253. Springer, 2019.
[Kuleshov et al., 2018] Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. arXiv preprint arXiv:1807.00263, 2018.
[Kull et al., 2019] Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. In Advances in Neural Information Processing Systems, pages 12295-12305, 2019.
[Kuutti et al., 2020] Sampo Kuutti, Richard Bowden, Yaochu Jin, Phil Barber, and Saber Fallah. A survey of deep learning applications to autonomous vehicle control. IEEE Transactions on Intelligent Transportation Systems, 2020.
[Lakshminarayanan et al., 2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402-6413, 2017.
[Lee et al., 2019a] Keuntaek Lee, Gabriel Nakajima An, Viacheslav Zakharov, and Evangelos A Theodorou. Perceptual attention-based predictive control. arXiv preprint arXiv:1904.11898, 2019.
[Lee et al., 2019b] Keuntaek Lee, Kamil Saigol, and Evangelos A Theodorou. Early failure detection of deep end-to-end control policy by reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 8543-8549. IEEE, 2019.
[Lee et al., 2019c] Keuntaek Lee, Ziyi Wang, Bogdan Vlahov, Harleen Brar, and Evangelos A Theodorou. Ensemble bayesian decision making with redundant deep perceptual control policies. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pages 831-837. IEEE, 2019.
[Loquercio et al., 2020] Antonio Loquercio, Mattia Segu, and Davide Scaramuzza. A general framework for uncertainty estimation in deep learning. IEEE Robotics and Automation Letters, 5(2):3153-3160, 2020.
[Makansi et al., 2019] Osama Makansi, Eddy Ilg, Ozgun Cicek, and Thomas Brox. Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7144-7153, 2019.
[McAllister et al., 2017] Rowan McAllister, Yarin Gal, Alex Kendall, Mark Van Der Wilk, Amar Shah, Roberto Cipolla, and Adrian Weller. Concrete problems for autonomous vehicle safety: Advantages of bayesian deep learning. International Joint Conferences on Artificial Intelligence, Inc., 2017.
[Michelmore et al., 2018] Rhiannon Michelmore, Marta Kwiatkowska, and Yarin Gal. Evaluating uncertainty quantification in end-to-end autonomous driving control. arXiv preprint arXiv:1811.06817, 2018.
[Michelmore et al., 2019] Rhiannon Michelmore, Matthew Wicker, Luca Laurenti, Luca Cardelli, Yarin Gal, and Marta Kwiatkowska. Uncertainty quantification with statistical guarantees in end-to-end autonomous driving control. arXiv preprint arXiv:1909.09884, 2019.
[Mohseni et al., 2019] Sina Mohseni, Mandar Pitale, Vasu Singh, and Zhangyang Wang. Practical solutions for machine learning safety in autonomous vehicles. arXiv preprint arXiv:1912.09630, 2019.
[Mukhoti and Gal, 2018] Jishnu Mukhoti and Yarin Gal. Evaluating bayesian deep learning methods for semantic segmentation. arXiv preprint arXiv:1811.12709, 2018.
[Osband et al., 2016] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. In Advances in Neural Information Processing Systems, pages 4026-4034, 2016.
[Phan et al., 2019] Buu Phan, Samin Khan, Rick Salay, and Krzysztof Czarnecki. Bayesian uncertainty quantification with synthetic data. In International Conference on Computer Safety, Reliability, and Security, pages 378-390. Springer, 2019.
[Quionero-Candela et al., 2009] Joaquin Quionero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
[Rau et al., ] Paul Rau, Christopher Becker, and John Brewer. Approach for deriving scenarios for safety of the intended functionality.
[Salay et al., 2020] Rick Salay, Krzysztof Czarnecki, Maria Soledad Elli, Ignacio J Alvarez, Sean Sedwards, and Jack Weast. PURSS: Towards perceptual uncertainty aware responsibility sensitive safety with ML. In SafeAI@AAAI, pages 91-95, 2020.
[Shalev-Shwartz et al., 2017] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. On a formal model of safe and scalable self-driving cars. arXiv preprint arXiv:1708.06374, 2017.
[Snoek et al., 2019] Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13969-13980, 2019.
[Wilson and Izmailov, 2020] Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791, 2020.