A Comparison of Uncertainty Estimation Approaches in Deep Learning Components for Autonomous Vehicle Applications

Fabio Arnez, Huascar Espinoza, Ansgar Radermacher and François Terrier
CEA LIST, Gif-sur-Yvette, France
{fabio.arnez, huascar.espinoza, ansgar.radermacher, francois.terrier}@cea.fr

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

A key factor for ensuring safety in Autonomous Vehicles (AVs) is to avoid any abnormal behavior under undesirable and unpredicted circumstances. As AVs increasingly rely on Deep Neural Networks (DNNs) to perform safety-critical tasks, different methods for uncertainty quantification have recently been proposed to measure the inevitable sources of error in data and models. However, uncertainty quantification in DNNs remains a challenging task: these methods require a higher computational load and a higher memory footprint, and they introduce extra latency, which can be prohibitive in safety-critical applications. In this paper, we provide a brief and comparative survey of methods for uncertainty quantification in DNNs, along with existing metrics to evaluate uncertainty predictions. We are particularly interested in understanding the advantages and downsides of each method for specific AV tasks and types of uncertainty sources.

1 Introduction

In the last decade, Deep Neural Networks (DNNs) have enabled great advances in real-world applications such as Autonomous Vehicles (AVs), where they perform complex tasks like object detection and tracking or vehicle control. Despite the substantial performance improvements introduced by DNNs, they still have significant safety shortcomings due to their complexity, opacity, and lack of interpretability [McAllister et al., 2017]. In particular, DNNs are brittle to operational domain shift and even to small data corruptions or perturbations [Kuutti et al., 2020]. This impedes ensuring the reliability of DNN models, which is a precondition for safety-critical systems that must comply with automotive industry safety standards and avoid jeopardizing human lives.

A concrete safety problem is to detect abnormal situations under uncertain environment conditions and DNN-specific unpredictability. These situations are difficult to analyze during the system development phases in a way that allows them to be properly mitigated at runtime. Indeed, even when a DNN model achieves great performance on a validation set drawn from its operational environment, it is currently impossible to test and provide the same performance guarantees in all the possible environment configurations the system could encounter in the real world [Kuutti et al., 2020]. A common practice to overcome this problem is runtime monitoring of DNN components, so that safety can be ensured even if the component was not fully validated at design time [Henne et al., 2020; Koopman et al., 2019]. A central aspect of enabling DNN monitoring is to provide a runtime treatment of the uncertainties associated with DNN predictions [McAllister et al., 2017; Koopman et al., 2019].

In this paper, we review common uncertainty estimation methods for DNNs and compare their performance and benefits for different AV tasks. These methods offer a potential solution for runtime DNN confidence prediction and for the detection of Out-of-Distribution (OOD) samples, since prediction probability scores in DNNs do not provide a true representation of uncertainty [Mohseni et al., 2019]. However, these methods still demand a high computational load, incorporate extra latency, and require a larger memory footprint. We compare these factors since they can represent a major impediment in safety-critical applications with tight time constraints and limited computation hardware. We also briefly survey uncertainty metrics that evaluate the performance of quantification methods, as another critical factor for ensuring safety in AV systems.

The remainder of the paper is structured as follows. Section 2 describes the sources of uncertainty in deep learning for AVs. Section 3 presents a comparison of recent works in AV tasks that include uncertainty estimation methods for DNNs; it provides a brief review of common uncertainty estimation methods in deep learning, as well as metrics for predictive uncertainty evaluation in classification and regression tasks. Section 4 discusses open challenges and possible directions for future work.
2 Background

2.1 Sources of Uncertainty in Deep Learning for Autonomous Vehicles

Autonomous vehicles have to deal with dynamic, non-stationary, and highly unpredictable operational environments. Taking into account every detail of the operational environment at design time is an intractable task. Instead, the operational environment is constrained so that it considers only a subset of all possible situations the system can encounter in operation. This process is known as Operational Design Domain (ODD) adoption [Koopman and Fratrik, 2019], and safety requirements are built on top of the ODD specification.

Given the constrained operational environment within the system ODD, ensuring safety in an AV requires the identification of unfamiliar contexts by modeling the AV's uncertainty [McAllister et al., 2017]. However, many factors, not only related to the environment, affect system performance by introducing some degree of uncertainty. [Czarnecki and Salay, 2018] identify a set of factors that contribute to uncertainty in the perception function of an AV and in this way affect its performance. From this set, we pay special attention to sensor properties, model uncertainty, situation and scenario coverage, and operational domain uncertainty. In the context of DNNs, the first two factors can be modeled using uncertainty estimation methods, while the last two correspond to some degree of dataset shift (i.e., breaking the independent and identically distributed assumption between training and testing data) and to Out-of-Distribution (OOD) samples [Quionero-Candela et al., 2009; Mohseni et al., 2019].

Sensor properties like range, resolution, noise characteristics, and calibration can influence the amount of information in the samples delivered to a machine learning model during training or testing. In consequence, the effect of these properties is captured as noise and ambiguities inherent to the obtained samples. This type of noise in the data is known as aleatoric uncertainty, and it represents the incapability of completely sensing all the details of the environment [Kendall and Gal, 2017; Lee et al., 2019b; Gustafsson et al., 2019]. Aleatoric uncertainty can be further classified into homoscedastic uncertainty (uncertainty that remains constant for different samples) and heteroscedastic uncertainty (uncertainty that varies between samples).

Model uncertainty is often referred to as epistemic uncertainty and accounts for uncertainty in the model parameters. This type of uncertainty captures the ignorance of the model as a consequence of a dataset that does not represent the ODD well, or that is not sufficiently large [Kendall and Gal, 2017; Lee et al., 2019b]. Epistemic uncertainty is expected to increase in unknown situations (e.g., ODD conditions that differ in weather or lighting), and it can be explained away by incorporating more data.

Situation and scenario coverage is related to the degree to which situations and scenarios from an ODD are reflected in the training and operation stages, while operational domain uncertainty refers to a discrepancy between the ODD situations and scenarios present at training and those encountered at operation (e.g., scenarios from two different ODDs) [Czarnecki and Salay, 2018]. In both cases, uncertainty can be reduced by incorporating more data or by adjusting the ODD specification. However, it is extremely important to detect and discover OOD samples (i.e., outliers), especially those that have not been seen before, since they can lead to highly confident predictions that are wrong, i.e., the unknown-unknowns [Bansal and Weld, 2018].

In a similar fashion to the cases presented before, the automotive industry standard ISO/PAS 21448 or SOTIF (Safety Of The Intended Functionality) [ISO, 2019] provides a process to identify unknown and potentially unsafe scenarios in order to minimize risk by recognizing the performance limitations of sensors, algorithms, or user misuse. Unsafe scenarios can be further classified into known-unsafe (e.g., out-of-ODD samples) or unknown-unsafe (e.g., OOD samples). Once an unknown-unsafe scenario or situation is identified, it becomes a known-unsafe scenario that can be mitigated at design time [Rau et al., ; Mohseni et al., 2019].
2.2 Uncertainty Estimation Methods for DNNs

In recent years, many probabilistic deep learning methods have been proposed to obtain an uncertainty measure from an approximation to the (highly multi-modal) predictive distribution, as well as methods for calibrating the outputs of DNNs. In general, there are two approaches to DNN predictive uncertainty calculation: sampling-based and sampling-free methods. Sampling-based methods rely on taking multiple predictive samples for the same input to compute the estimator that will be associated with uncertainty, whereas sampling-free methods require only a single predictive output. These methods are further discussed in Section 3.

Neural Network Calibration

Confidence calibration represents the degree to which a model's predicted probability estimates the true correctness likelihood [Guo et al., 2017]. Under ideal circumstances, we expect the normalized outputs of a DNN (i.e., softmax outputs) to correspond to the true correctness likelihood [Guo et al., 2017]. From a frequentist perspective, this can be viewed as a discrepancy measure between local confidence (or uncertainty) predictions and the expected performance in the long run [Hubschneider et al., 2019; Lakshminarayanan et al., 2017]. For example, we expect a class predicted with probability p to be correct p% of the time: out of 100 samples predicted with confidence 0.9, we expect 90 correct predictions. DNNs can be calibrated using Temperature Scaling, a simple post-processing technique [Guo et al., 2017], or, more recently, Dirichlet calibration [Kull et al., 2019]. For the regression setting, [Kuleshov et al., 2018; Hubschneider et al., 2019] formalize the calibration notion for continuous variables, in which a p% confidence interval should contain the true outcome p% of the time.

Despite the improvements achieved with calibration methods, they cannot be seen as a complete solution to the uncertainty estimation problem, since calibration is performed relative to a validation dataset [Kull et al., 2019; Ashukha et al., 2020] (i.e., calibration methods rely on in-distribution samples to learn a calibration map). In the presence of OOD samples, a model is no longer calibrated. This limits the contribution of calibration techniques to scenarios where huge training datasets are available.
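To make the post-processing step concrete, the following is a minimal sketch of temperature scaling in the spirit of [Guo et al., 2017], assuming a trained classifier whose validation logits and labels have already been collected; the function name and optimizer settings are illustrative choices, not prescribed by the original work.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Fit a single temperature T > 0 on held-out validation logits by
    minimizing the NLL (cross-entropy) of the T-scaled logits."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so that T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def nll_closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(nll_closure)
    return float(log_t.exp())

# Usage: T = fit_temperature(val_logits, val_labels)
# Calibrated probabilities: F.softmax(test_logits / T, dim=1).
# Dividing all logits by a single scalar T never changes the arg max,
# so accuracy is preserved and only the confidence is rescaled.
```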
3 Comparison of Uncertainty Estimation Methods in AV Domain

In this section, we compare and analyze some common uncertainty estimation methods in terms of out-of-the-box calibration of the predictions (i.e., without prior calibration), computational budget, memory footprint, and the changes required in the DNN to apply each method (architecture, loss function, and others). We have chosen the most representative works, to the best of our knowledge, in each application. Some of the listed works introduce improvements by combining several methods. This comparison is summarized in Table 1.

3.1 Methods Limited to Aleatoric Uncertainty

The first four methods listed in Table 1 deal exclusively with aleatoric uncertainty. In classification tasks, uncertainty is usually represented by the normalized logits at the output layer (e.g., the softmax output), which can be interpreted as a probability distribution related to aleatoric uncertainty [Gustafsson et al., 2019]. Unfortunately, normalized outputs used as probability distributions fail to capture model uncertainty, and this very often results in overconfident predictions that are wrong [Guo et al., 2017], especially in the presence of dataset shift. To overcome the problems of softmax, [Gast and Roth, 2018] propose to use a Dirichlet distribution instead.

In a regression configuration, deep learning models do not have an uncertainty representation by default. Instead, the outputs of the DNN are made to parameterize a probability distribution (e.g., Gaussian, Laplace) to obtain a probabilistic representation. This modification of the architecture allows DNNs to learn aleatoric uncertainty from the data itself by using the heteroscedastic loss and maximum likelihood [Kendall and Gal, 2017; Ilg et al., 2018]. Similarly, in the heteroscedastic version of classification, [Kendall and Gal, 2017] place a Gaussian distribution over the output logits (i.e., each logit with its respective variance) before the softmax layer is applied. An alternative approach replaces the inputs, outputs, and activation functions of a DNN with probability distributions [Gast and Roth, 2018]. This method allows a fixed uncertainty at the input to be propagated to the output of the DNN by employing Assumed Density Filtering (ADF).
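As an illustration, here is a minimal sketch of the heteroscedastic Gaussian likelihood described above, assuming a PyTorch regression model; the class and function names are ours, and the loss uses the numerically stable log-variance parameterization commonly associated with [Kendall and Gal, 2017].

```python
import torch
import torch.nn as nn

class HeteroscedasticHead(nn.Module):
    """Regression head that predicts a mean and a log-variance per output
    dimension, so the network can learn input-dependent (aleatoric) noise."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.mu = nn.Linear(in_features, out_features)
        self.log_var = nn.Linear(in_features, out_features)  # s = log(sigma^2)

    def forward(self, h: torch.Tensor):
        return self.mu(h), self.log_var(h)

def heteroscedastic_nll(mu, log_var, target):
    # Gaussian NLL up to a constant: the exp(-s) factor down-weights the
    # residual of noisy samples, while the 0.5 * s term keeps the network
    # from predicting an arbitrarily large variance everywhere.
    return (0.5 * torch.exp(-log_var) * (target - mu) ** 2 + 0.5 * log_var).mean()
```

During training, the network can attenuate the residual of noisy samples by predicting a larger log-variance, which is how aleatoric uncertainty is learned from the data itself.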
3.2 Bayesian Neural Networks

Bayesian Neural Networks (BNNs) aim to learn a distribution over the weights instead of point estimates. In this way, we look for the posterior distribution of the weights given the data, p(w|D), by applying Bayes' theorem to the data likelihood and a chosen prior distribution over the weights p(w):

$$p(w \mid D) = \frac{p(D \mid w)\,p(w)}{p(D)} = \frac{p(D \mid w)\,p(w)}{\int p(D \mid w)\,p(w)\,dw} \qquad (1)$$

Given the posterior distribution p(w|D), we obtain the predictive posterior distribution for a new input $x^*$ by marginalizing over the model parameters:

$$p(y^* \mid x^*, D) = \int p(y^* \mid x^*, w)\,p(w \mid D)\,dw \qquad (2)$$

Instead of relying on only one configuration of the weights, we use every possible configuration of the weights (all possible models), weighted by the posterior on the parameters, to make a prediction, i.e., $p(y^* \mid x^*, D) = \mathbb{E}_{p(w \mid D)}[p(y^* \mid x^*, w)]$. This represents the Bayesian Model Average (BMA) and accounts for epistemic uncertainty [Wilson and Izmailov, 2020; Gal, 2016; Blundell et al., 2015].

Unfortunately, the integrals in (1) and (2) are intractable. Thus, we must build a distribution that approximates the true posterior distribution over the weights, q(w) ≈ p(w|D). Two main paradigms exist to build q(w): Markov Chain Monte Carlo (MCMC) and Variational Inference (VI) methods. In the former, the gold standard is Hamiltonian Monte Carlo (HMC), and other methods like Stochastic Gradient MCMC (SG-MCMC) have been explored. However, MCMC methods are in general hard to scale to large DNNs due to the high-dimensional and multi-modal posterior distribution [Gustafsson et al., 2019]. In the latter case, VI methods approximate the posterior over the weights with a simpler distribution $q_\phi(w)$ (e.g., a Gaussian) parameterized by $\phi$. The parameters of $q_\phi(w)$ are found by minimizing the KL-divergence to p(w|D).

A particularly scalable and easy-to-implement sampling-based method for approximate VI is Monte Carlo Dropout (MCD) [Gal and Ghahramani, 2016]. In this method, dropout regularization is also applied at test time, so that $q_\phi(w)$ is a Bernoulli distribution. Dropout is only performed in some of the deeper layers of the DNN, to better model high-level features and to avoid slow training [Mukhoti and Gal, 2018; Kendall et al., 2015]. Dropout probabilities can be set manually, or the network can tune the dropout rates during training [Gal et al., 2017].

All the MCD-related methods listed in Table 1 refer to this approximation of BNNs. It can be noted from the performance comparison criteria that the need to take multiple forward passes (output samples) for the same input, in order to approximate the distribution in Equation 2, represents a major impediment in safety-critical applications with tight time constraints and limited computation hardware.

To get a representation of both types of uncertainty (aleatoric and epistemic), the methods presented in Section 3.1 have been used in combination with MCD. For example, in a regression configuration, a set of T samples is taken from the predictions of a DNN that parameterizes a distribution at its output: $\{\hat{y}_t, \hat{\sigma}_t\}_{t=1}^{T}$. However, since aleatoric uncertainty is learned from the data itself (by using the heteroscedastic loss), this approach can produce wrong uncertainty estimates for samples that include a higher level of uncertainty than that observed during training. Another approach, presented in [Loquercio et al., 2020], applies MCD to take samples from a DNN whose inputs, outputs, and activation functions are replaced by probability distributions according to [Gast and Roth, 2018]. This method permits propagating uncertainty at the input to the output of the DNN using ADF (e.g., sensor noise can be propagated to the output of the DNN). This is an appealing method for AV applications, where sensor properties are commonly known. Interestingly, the authors show that this method can be applied to already-trained DNNs and is architecture agnostic.
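The following is a minimal sketch of MCD inference, which approximates the BMA in Equation (2) by a Monte Carlo average over T stochastic passes. It assumes a PyTorch model containing dropout layers; calling model.train() is the simplest way to keep dropout active at test time, although a careful implementation would re-enable only the dropout modules and keep batch-norm layers in eval mode.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor, num_samples: int = 20):
    """Monte Carlo Dropout: keep dropout active at test time and aggregate
    num_samples stochastic forward passes for the same input."""
    model.train()  # re-enables dropout (see caveat about batch-norm above)
    outputs = torch.stack([model(x) for _ in range(num_samples)])  # (T, N, ...)
    mean = outputs.mean(dim=0)      # predictive mean used as the final output
    variance = outputs.var(dim=0)   # spread across passes: epistemic uncertainty proxy
    return mean, variance
```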
| Method | Autonomous Vehicle Task | Aleatoric | Epistemic | Out-of-the-box Calibration | Computational Budget | Memory Footprint | Changes in DNN |
|---|---|---|---|---|---|---|---|
| Softmax logits as parameters of a prob. dist. | Object Detection [Feng et al., 2019b] | ✓ | ✗ | Bad | Fair | Low | Small |
| Outputs as parameters of a prob. dist. | Object Detection [Feng et al., 2019b] | ✓ | ✗ | Bad | Fair | Low | Small |
| Inputs, activations and outputs as prob. dist. & ADF | Optical Flow [Gast and Roth, 2018] | ✓ | ✗ | Undefined | Low | Low | Mid |
| Point estimate & MCD regression | Steering Angle Prediction [Hubschneider et al., 2019; Michelmore et al., 2018; Michelmore et al., 2019] | ✗ | ✓ | Fair | Fair | Low | None |
| Softmax & MCD | Traffic Sign Recognition [Henne et al., 2020]; Semantic Segmentation [Phan et al., 2019; Mukhoti and Gal, 2018; Gustafsson et al., 2019]; Steering Angle Prediction [Hubschneider et al., 2019] | ✗ | ✓ | Fair | Fair | Low | None |
| Deep Ensembles | Traffic Sign Recognition [Henne et al., 2020]; Semantic Segmentation [Gustafsson et al., 2019]; Depth Estimation [Gustafsson et al., 2019] | ✓ | ✓ | Good | High | High | Small |
| Bootstrap Ensembles | Steering Angle Prediction [Hubschneider et al., 2019]; Optical Flow [Ilg et al., 2018] | ✓ | ✓ | Bad | Fair | Fair | Mid |
| Softmax logits as parameters of a prob. dist. & MCD | Object Detection [Feng et al., 2018] | ✓ | ✓ | Fair | High | Low | Small |
| Outputs as parameters of a prob. dist. & MCD | Object Detection [Feng et al., 2018]; Steering Angle Prediction [Lee et al., 2019b; Lee et al., 2019c; Lee et al., 2019a]; Depth Estimation [Gustafsson et al., 2019] | ✓ | ✓ | Fair | High | Low | Small |
| Inputs, activations and outputs as prob. dist. & ADF & MCD | Steering Angle Prediction [Loquercio et al., 2020] | ✓ | ✓ | Undefined | High | Low | Mid |
| MDNs | Steering Angle Prediction [Hubschneider et al., 2019; Choi et al., 2018]; Future Prediction [Makansi et al., 2019] | ✓ | ✓ | Bad | Low | Low | None |
| MDNs with stages | Future Prediction [Makansi et al., 2019] | ✓ | ✓ | Undefined | Low | Low | High |

Table 1: Uncertainty Estimation Methods Comparison (✓ = captured, ✗ = not captured)

3.3 Deep Ensembles

A Deep Ensemble (DE) is another sampling-based method, in which M DNNs are trained to obtain the predictive distribution p(y|x) [Lakshminarayanan et al., 2017]. Each DNN learns a set of parameters w that are point estimates, starting from a different random initialization and repeating the minimization M times. In an ensemble, predictions are averaged and can be considered as an equally weighted mixture model:

$$p(y \mid x) = \frac{1}{M}\sum_{i=1}^{M} p\left(y \mid x, \hat{w}^{(i)}\right), \qquad \{\hat{w}^{(i)}\}_{i=1}^{M} \qquad (3)$$

For classification, Equation (3) corresponds to an average of the softmax probabilities. For regression, the outputs that parameterize a probability distribution are averaged to represent the mean and variance of the mixture. In this manner, both types of uncertainty (aleatoric and epistemic) can easily be captured. Although DE is considered a non-Bayesian method, expression (3) represents an approximation of (2), since $\{\hat{w}^{(i)}\}_{i=1}^{M}$ can be seen as samples taken from a distribution that approximates the true posterior, obtained by exploring different modes of p(w|D) [Fort et al., 2019; Wilson and Izmailov, 2020].

As presented in Table 1, the DE method tends to outperform approximate Bayesian inference methods like MCD in terms of both uncertainty estimates and accuracy [Gustafsson et al., 2019]. A recent work by [Snoek et al., 2019] also shows that DE is more robust to dataset shift. These works suggest that DE should be considered the new standard method for predictive distributions and uncertainty estimation. However, DE has some drawbacks, especially if the target application is safety-critical. DE requires a higher computational load and a larger memory footprint, as shown in Table 1: for the training and testing stages, the number of parameters and the inference time scale linearly with M. To mitigate this problem, [Osband et al., 2016] propose a fused version of ensembles with multiple heads, where all the heads share the convolutional layers (feature extractors) and each head is trained using bootstrap samples.
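A minimal sketch of ensemble prediction for a classification task, assuming M independently trained PyTorch models; following Equation (3), the member softmax outputs are averaged with equal weights, and the disagreement between members can serve as an epistemic uncertainty cue. The function name is ours.

```python
import torch

@torch.no_grad()
def ensemble_predict(models, x: torch.Tensor):
    """Deep Ensemble prediction for classification, following Eq. (3):
    the equally weighted average of the member softmax outputs."""
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models])  # (M, N, C)
    mean_probs = probs.mean(dim=0)           # mixture prediction p(y|x)
    disagreement = probs.var(dim=0).sum(-1)  # member disagreement: epistemic cue
    return mean_probs, disagreement
```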
3.4 Mixture Density Networks

A Mixture Density Network (MDN) [Bishop, 1994] is a sampling-free method for regression tasks, where the aim is to train a DNN that predicts the parameters of a Gaussian Mixture Model (GMM) given an input x. A GMM models the conditional distribution as a weighted sum of K Gaussians:

$$p(y \mid x) = \sum_{i=1}^{K} \pi_i(x)\, \mathcal{N}\big(y \mid \mu_i(x), \Sigma_i(x)\big) \qquad (4)$$

where $\pi_i(x)$, $\mu_i(x)$, $\Sigma_i(x)$ represent the parameters of the GMM as functions of the input x for the K mixture components. For training, the Negative Log-Likelihood (NLL) is used as the loss function.

By using the law of total variance, [Choi et al., 2018] formalized the acquisition of aleatoric and epistemic uncertainty in MDNs. As a first step, the expectation of the GMM is obtained as a weighted sum of the mixture components: $\mathbb{E}[y \mid x] = \sum_{i=1}^{K} \pi_i(x)\mu_i(x)$. The predictive variance is then composed of the weighted sum of the variances and the weighted variance of the means:

$$\mathbb{V}[y \mid x] = \sum_{i=1}^{K} \pi_i(x)\Sigma_i(x) + \sum_{i=1}^{K} \pi_i(x)\left\|\mu_i(x) - \sum_{j=1}^{K} \pi_j(x)\mu_j(x)\right\|^2 \qquad (5)$$

where the first term represents the aleatoric uncertainty and the second term represents the epistemic uncertainty. We refer the reader to [Choi et al., 2018] for more details about uncertainty acquisition in MDNs.

As pointed out in Table 1, the sampling-free nature of this method reduces the computational load and the memory footprint, and it permits modeling complex distributions, with respect to the methods described before. These characteristics are attractive for real-time applications. However, MDNs suffer from numerical instability in high-dimensional problems and from mode collapse when regularization techniques are used [Makansi et al., 2019].
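A minimal sketch of this decomposition, assuming the MDN outputs mixture weights, means, and diagonal variances as tensors; the function mirrors Equations (4)-(5) following [Choi et al., 2018], with the names being our own.

```python
import torch

def mdn_uncertainty(pi: torch.Tensor, mu: torch.Tensor, var: torch.Tensor):
    """Law-of-total-variance decomposition for an MDN (Eqs. 4-5).
    Shapes: pi (N, K); mu and var (N, K, D), with var holding the
    diagonal of each component covariance."""
    w = pi.unsqueeze(-1)                             # (N, K, 1) mixture weights
    mean = (w * mu).sum(dim=1)                       # E[y|x] = sum_i pi_i mu_i
    aleatoric = (w * var).sum(dim=1)                 # weighted sum of component variances
    epistemic = (w * (mu - mean.unsqueeze(1)) ** 2).sum(dim=1)  # weighted variance of means
    return mean, aleatoric, epistemic                # total variance = aleatoric + epistemic
```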
3.5 Quality Metrics for Uncertainty Estimation

In this section, we discuss common metrics for evaluating the quality of uncertainty estimates.

Classification Metrics. Different measures of uncertainty exist for classification tasks. The variation ratio and information metrics such as predictive entropy and mutual information can be used in classification settings to represent uncertainty [Gal, 2016]. The variation ratio is a measure of dispersion, mutual information captures model confidence, and predictive entropy accounts for both epistemic and aleatoric uncertainty [Mukhoti and Gal, 2018; Michelmore et al., 2018; Phan et al., 2019]. [Mukhoti and Gal, 2018] propose specific performance metrics for semantic segmentation to evaluate Bayesian models. Since there is no ground truth for uncertainty estimation, [Snoek et al., 2019; Lakshminarayanan et al., 2017] argue for proper scoring rules such as NLL and the Brier score. NLL depends on the predictive uncertainty and is commonly evaluated on a held-out set; however, it can overestimate tail probabilities. The Brier score measures the accuracy of predictive probabilities as a sum of squared differences between the predicted probability vector and the target; nonetheless, this score tends to miss infrequent events. Other evaluation metrics, independent of score values, are the Area Under the Receiver Operating Characteristic (AUROC), the Area Under the Precision-Recall Curve (AUPRC), and the Area Under the Risk-Coverage curve (AURC) [Hendrycks and Gimpel, 2016; Ding et al., 2019].
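For sampling-based methods, these information metrics can be computed directly from the stacked class probabilities of T stochastic forward passes (e.g., the MCD or ensemble outputs sketched earlier); a minimal sketch, with mutual information obtained as the gap between the total and the expected entropy.

```python
import torch

def classification_uncertainty(probs: torch.Tensor, eps: float = 1e-12):
    """Predictive entropy and mutual information from T stochastic forward
    passes (MC Dropout samples or ensemble members); probs has shape (T, N, C)."""
    mean_probs = probs.mean(dim=0)                                    # (N, C)
    predictive_entropy = -(mean_probs * (mean_probs + eps).log()).sum(dim=-1)
    expected_entropy = -(probs * (probs + eps).log()).sum(dim=-1).mean(dim=0)
    mutual_information = predictive_entropy - expected_entropy       # epistemic part
    return predictive_entropy, mutual_information
```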
Regression Metrics. Similarly, in regression tasks, NLL is a proper scoring rule for a likelihood that follows a Gaussian distribution [Lakshminarayanan et al., 2017; Kendall and Gal, 2017]. Furthermore, [Ilg et al., 2018] introduce a relative measure for uncertainty estimation, the Area Under the Sparsification Error (AUSE) curve, which measures the difference between the dispersion of the predictions (affected by predictive uncertainty) and an oracle, in terms of the true prediction error, e.g., the Root Mean Squared Error (RMSE) [Gustafsson et al., 2019].

Calibration Metrics. For classification tasks, common quality metrics are the Expected Calibration Error (ECE) and the Maximum Calibration Error (MCE) [Guo et al., 2017]. The former measures the difference between expected accuracy and expected confidence; the latter identifies the largest discrepancy between accuracy and confidence, which is of particular interest in safety-critical applications. For a regression configuration, [Kuleshov et al., 2018] use a calibration error that represents the sum of weighted squared differences between the expected and observed (empirical) confidence levels; correspondingly, in [Gustafsson et al., 2019] the authors propose the Area Under the Calibration Error curve (AUCE) as an absolute measure of uncertainty. The aforementioned authors use reliability diagrams (i.e., calibration plots) to obtain a visual representation of model calibration. Regardless of their drawbacks with OOD samples, calibration plots and measures are used extensively to compare the predictive quality of uncertainty estimation methods.
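A minimal sketch of ECE with equal-width confidence bins, assuming NumPy arrays of top-label confidences and 0/1 correctness indicators; the function name is ours.

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray,
                               num_bins: int = 10) -> float:
    """ECE with equal-width bins: the gap between average confidence and
    empirical accuracy per bin, weighted by the fraction of samples in the
    bin. MCE would instead take the maximum per-bin gap.
    confidences: (N,) top-label probabilities; correct: (N,) 0/1 indicators."""
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() equals |B_m| / N
    return float(ece)
```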
3.6 Considerations per AV Task Type

In the context of AVs, a broad variety of uncertainty estimation methods have been applied to (end-to-end) steering angle prediction. In some works, only epistemic uncertainty was captured, by using MCD [Michelmore et al., 2018; Michelmore et al., 2019]. Usually, however, both types of uncertainty are captured [Lee et al., 2019b; Lee et al., 2019c; Lee et al., 2019a], by using the method proposed by [Kendall and Gal, 2017], or by using DE, bootstrap ensembles, or MDNs. The calibration plots presented in [Hubschneider et al., 2019] show that MCD has better out-of-the-box calibration than bootstrap ensembles or MDNs; the last two methods are overconfident in their predictions. In this particular task, safety mechanisms that trigger when the uncertainty estimates surpass a given or learned threshold have been proposed in order to improve vehicle safety [Michelmore et al., 2018; Michelmore et al., 2019; Lee et al., 2019b].

Under the modular pipeline paradigm for AV control, probabilistic modeling has mainly been applied to perception tasks like object detection from 3D lidar, semantic segmentation, and depth estimation. For 3D object detection from lidar point clouds, [Feng et al., 2018] estimate aleatoric and epistemic uncertainty using the methods proposed by [Kendall and Gal, 2017]. However, epistemic uncertainty estimation with MCD introduces a high computational cost. A later work by [Feng et al., 2019b] leverages aleatoric uncertainties to greatly improve performance and to reduce the computational load with respect to MCD. In [Feng et al., 2019a], the authors show that the predictions for classification and regression are miscalibrated, and they propose methods to fix the calibration of DNNs and produce better uncertainty estimates.

For semantic segmentation, [Phan et al., 2019; Mukhoti and Gal, 2018; Gustafsson et al., 2019] model aleatoric uncertainty from the softmax output and epistemic uncertainty by using MCD or ensembles. Common uncertainty metrics in this case are the predictive entropy and the mutual information [Mukhoti and Gal, 2018]. For depth estimation, [Gustafsson et al., 2019] compare DE with heteroscedastic regression in combination with MCD [Kendall and Gal, 2017]. In both of the previous tasks (semantic segmentation and depth estimation), DE achieves better performance and calibration than the MCD variants [Gustafsson et al., 2019]. However, in DE the computational cost at training and testing grows linearly with the number of ensemble members. Similarly, for traffic sign recognition, DE exhibits the best-calibrated outputs, but in this case MCD in combination with softmax also produces well-calibrated outputs, close to those of DE [Henne et al., 2020].

For optical flow, [Gast and Roth, 2018] capture aleatoric uncertainty by replacing the inputs, outputs, and activation functions with probability distributions. This method allows a fixed value of uncertainty at the input to be propagated to the output of the DNN. [Ilg et al., 2018] present an alternative approach, where DE and bootstrap ensembles are used to obtain the predictive uncertainty.

For future prediction, [Makansi et al., 2019] propose an improvement to MDNs to predict the multi-modal distribution of the positions of a vehicle in the future. This method comprises two stages: a sampling network and a fitting network. The former receives the current position of the vehicle as input and outputs a fixed number of hypotheses for future positions. The latter fits a mixture distribution to the hypotheses estimated by the first network. This improvement helps avoid mode collapse in MDNs; however, high-dimensional outputs remain challenging for this approach.

4 Conclusions

We presented a comparative survey of uncertainty estimation methods for both classification and regression tasks in the AV domain, together with a general comparative analysis of these methods. From this analysis, we can see that DE has become a gold standard for uncertainty quantification in many AV tasks, thanks to its high-quality uncertainty predictions and its robustness to OOD samples. However, its high computational load and large memory footprint can hinder its use in safety-critical applications with hardware limitations or tight time constraints. Here, sampling-free methods are an interesting avenue for future research. New robust (to OOD) and lightweight approaches should be explored in the AV domain to produce good-quality uncertainty estimates. We also observed that the predictions of these methods are often uncalibrated (overconfident or underconfident) and that calibration methods are usually applied only to classification tasks. We encourage the application of calibration methods to regression tasks as well, using the methods proposed by [Kuleshov et al., 2018], instead of limiting the assessment of predictions to reliability diagrams. We also suggest studying and comparing uncertainty estimation methods under dataset-shift conditions to assess their robustness. For future work, we plan to incorporate uncertainty information into the Responsibility-Sensitive Safety model [Shalev-Shwartz et al., 2017]. This generalizes the approach of [Salay et al., 2020] by considering component uncertainty from different AV subsystems and propagating it through them. These subsystems could include DNNs, e.g., for planning and control.

Acknowledgments

This work has received funding from the COMP4DRONES project, under Joint Undertaking (JU) grant agreement N° 826610. The JU receives support from the European Union's Horizon 2020 research and innovation programme and from Spain, Austria, Belgium, Czech Republic, France, Italy, Latvia, Netherlands.

References

[Ashukha et al., 2020] Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. arXiv preprint arXiv:2002.06470, 2020.
[Bansal and Weld, 2018] Gagan Bansal and Daniel S Weld. A coverage-based utility model for identifying unknown unknowns. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[Bishop, 1994] Christopher M Bishop. Mixture density networks. 1994.
[Blundell et al., 2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613-1622, 2015.
[Choi et al., 2018] Sungjoon Choi, Kyungjae Lee, Sungbin Lim, and Songhwai Oh. Uncertainty-aware learning from demonstration using density networks with sampling-free variance modeling. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6915-6922. IEEE, 2018.
[Czarnecki and Salay, 2018] Krzysztof Czarnecki and Rick Salay. Towards a framework to manage perceptual uncertainty for safe automated driving. In International Conference on Computer Safety, Reliability, and Security, pages 439-445. Springer, 2018.
[Ding et al., 2019] Yukun Ding, Jinglan Liu, Jinjun Xiong, and Yiyu Shi. Evaluation of neural network uncertainty estimation with application to resource-constrained platforms. arXiv preprint arXiv:1903.02050, 2019.
[Feng et al., 2018] Di Feng, Lars Rosenbaum, and Klaus Dietmayer. Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 3266-3273. IEEE, 2018.
[Feng et al., 2019a] Di Feng, Lars Rosenbaum, Claudius Glaeser, Fabian Timm, and Klaus Dietmayer. Can we trust you? on calibration of a probabilistic object detector for autonomous driving. arXiv preprint arXiv:1909.12358, 2019.
[Feng et al., 2019b] Di Feng, Lars Rosenbaum, Fabian Timm, and Klaus Dietmayer. Leveraging heteroscedastic aleatoric uncertainties for robust real-time lidar 3d object detection. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 1280-1287. IEEE, 2019.
[Fort et al., 2019] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
[Gal and Ghahramani, 2016] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050-1059, 2016.
[Gal et al., 2017] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581-3590, 2017.
[Gal, 2016] Yarin Gal. Uncertainty in deep learning. University of Cambridge, 1:3, 2016.
[Gast and Roth, 2018] Jochen Gast and Stefan Roth. Lightweight probabilistic deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3369-3378, 2018.
[Guo et al., 2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1321-1330. JMLR.org, 2017.
[Gustafsson et al., 2019] Fredrik K Gustafsson, Martin Danelljan, and Thomas B Schön. Evaluating scalable bayesian deep learning methods for robust computer vision. arXiv preprint arXiv:1906.01620, 2019.
[Hendrycks and Gimpel, 2016] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
[Henne et al., 2020] Maximilian Henne, Adrian Schwaiger, Karsten Roscher, and Gereon Weiss. Benchmarking uncertainty estimation methods for deep learning with safety-related metrics. 2020.
[Hubschneider et al., 2019] Christian Hubschneider, Robin Hutmacher, and J Marius Zöllner. Calibrating uncertainty models for steering angle estimation. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 1511-1518. IEEE, 2019.
[Ilg et al., 2018] Eddy Ilg, Ozgun Cicek, Silvio Galesso, Aaron Klein, Osama Makansi, Frank Hutter, and Thomas Brox. Uncertainty estimates and multi-hypotheses networks for optical flow. In Proceedings of the European Conference on Computer Vision (ECCV), pages 652-667, 2018.
[ISO, 2019] ISO. PAS 21448 - Road vehicles - Safety of the intended functionality. International Organization for Standardization, 2019.
[Kendall and Gal, 2017] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574-5584, 2017.
[Kendall et al., 2015] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.
[Koopman and Fratrik, 2019] Philip Koopman and Frank Fratrik. How many operational design domains, objects, and events? 2019.
[Koopman et al., 2019] Philip Koopman, Beth Osyk, and Jack Weast. Autonomous vehicles meet the physical world: Rss, variability, uncertainty, and proving safety. In International Conference on Computer Safety, Reliability, and Security, pages 245-253. Springer, 2019.
[Kuleshov et al., 2018] Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. arXiv preprint arXiv:1807.00263, 2018.
[Kull et al., 2019] Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. In Advances in Neural Information Processing Systems, pages 12295-12305, 2019.
[Kuutti et al., 2020] Sampo Kuutti, Richard Bowden, Yaochu Jin, Phil Barber, and Saber Fallah. A survey of deep learning applications to autonomous vehicle control. IEEE Transactions on Intelligent Transportation Systems, 2020.
[Lakshminarayanan et al., 2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402-6413, 2017.
[Lee et al., 2019a] Keuntaek Lee, Gabriel Nakajima An, Viacheslav Zakharov, and Evangelos A Theodorou. Perceptual attention-based predictive control. arXiv preprint arXiv:1904.11898, 2019.
[Lee et al., 2019b] Keuntaek Lee, Kamil Saigol, and Evangelos A Theodorou. Early failure detection of deep end-to-end control policy by reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 8543-8549. IEEE, 2019.
[Lee et al., 2019c] Keuntaek Lee, Ziyi Wang, Bogdan Vlahov, Harleen Brar, and Evangelos A Theodorou. Ensemble bayesian decision making with redundant deep perceptual control policies. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pages 831-837. IEEE, 2019.
[Loquercio et al., 2020] Antonio Loquercio, Mattia Segu, and Davide Scaramuzza. A general framework for uncertainty estimation in deep learning. IEEE Robotics and Automation Letters, 5(2):3153-3160, 2020.
[Makansi et al., 2019] Osama Makansi, Eddy Ilg, Ozgun Cicek, and Thomas Brox. Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7144-7153, 2019.
[McAllister et al., 2017] Rowan McAllister, Yarin Gal, Alex Kendall, Mark Van Der Wilk, Amar Shah, Roberto Cipolla, and Adrian Weller. Concrete problems for autonomous vehicle safety: Advantages of bayesian deep learning. International Joint Conferences on Artificial Intelligence, Inc., 2017.
[Michelmore et al., 2018] Rhiannon Michelmore, Marta Kwiatkowska, and Yarin Gal. Evaluating uncertainty quantification in end-to-end autonomous driving control. arXiv preprint arXiv:1811.06817, 2018.
[Michelmore et al., 2019] Rhiannon Michelmore, Matthew Wicker, Luca Laurenti, Luca Cardelli, Yarin Gal, and Marta Kwiatkowska. Uncertainty quantification with statistical guarantees in end-to-end autonomous driving control. arXiv preprint arXiv:1909.09884, 2019.
[Mohseni et al., 2019] Sina Mohseni, Mandar Pitale, Vasu Singh, and Zhangyang Wang. Practical solutions for machine learning safety in autonomous vehicles. arXiv preprint arXiv:1912.09630, 2019.
[Mukhoti and Gal, 2018] Jishnu Mukhoti and Yarin Gal. Evaluating bayesian deep learning methods for semantic segmentation. arXiv preprint arXiv:1811.12709, 2018.
[Osband et al., 2016] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. In Advances in Neural Information Processing Systems, pages 4026-4034, 2016.
[Phan et al., 2019] Buu Phan, Samin Khan, Rick Salay, and Krzysztof Czarnecki. Bayesian uncertainty quantification with synthetic data. In International Conference on Computer Safety, Reliability, and Security, pages 378-390. Springer, 2019.
[Quionero-Candela et al., 2009] Joaquin Quionero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
[Rau et al., ] Paul Rau, Christopher Becker, and John Brewer. Approach for deriving scenarios for safety of the intended functionality.
[Salay et al., 2020] Rick Salay, Krzysztof Czarnecki, Maria Soledad Elli, Ignacio J Alvarez, Sean Sedwards, and Jack Weast. PURSS: Towards perceptual uncertainty aware responsibility sensitive safety with ML. In SafeAI@AAAI, pages 91-95, 2020.
[Shalev-Shwartz et al., 2017] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. On a formal model of safe and scalable self-driving cars. arXiv preprint arXiv:1708.06374, 2017.
[Snoek et al., 2019] Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13969-13980, 2019.
[Wilson and Izmailov, 2020] Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791, 2020.