The impact of averaging logits over probabilities on ensembles of neural networks

Cedrique Rovile Njieutcheu Tassi1, Jakob Gawlikowski2, Auliya Unnisa Fitri2 and Rudolph Triebel3
1 German Aerospace Center (DLR), Institute of Optical Sensor Systems, Rutherfordstraße 2, 12489 Berlin, Germany
2 German Aerospace Center (DLR), Institute of Data Science, Mälzerstraße 3-5, 07745 Jena, Germany
3 German Aerospace Center (DLR), Institute of Robotics and Mechatronics, Münchener Straße 20, 82234 Wessling, Germany

Abstract
Model averaging has become a standard for improving neural networks in terms of accuracy, calibration, and the ability to detect false predictions (FPs). However, recent findings show that model averaging does not necessarily lead to calibrated confidences, especially for underconfident networks. While existing methods for improving the calibration of combined networks focus on recalibrating, building, or sampling calibrated models, we focus on the combination process itself. Specifically, we evaluate the impact of averaging logits instead of probabilities on the quality of confidence (QoC). We compare combining logits instead of probabilities of members (networks) for models such as ensembles, Monte Carlo Dropout (MCD), and Mixture of Monte Carlo Dropout (MMCD). The comparison is based on experimental results on three datasets using three different architectures. We show that averaging logits instead of probabilities increases the confidence, thereby improving the confidence calibration of underconfident models. For example, for MCD evaluated on CIFAR10, averaging logits instead of probabilities reduces the expected calibration error (ECE) from 12.04% to 5.40%. However, the increase in confidence can harm the confidence calibration of overconfident models and the separability between true predictions (TPs) and FPs.
For example, for MMCD evaluated on FashionMNIST, the average confidence on FPs due to the noisy data increases from 51.31% to 94.58% when averaging logits instead of probabilities. While averaging logits can be applied to underconfident models to improve the calibration on test data, we suggest averaging probabilities for safety- and mission-critical applications where the separability of TPs and FPs is of paramount importance.

Keywords
Model averaging, Combination process, Logit averaging, Probability averaging, Ensemble, Monte Carlo Dropout (MCD), Mixture of Monte Carlo Dropout (MMCD), Quality of confidence (QoC), Confidence calibration, Separating true predictions (TPs) and false predictions (FPs)

The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022)
Cedrique.NjieutcheuTassi@dlr.de (C. R. N. Tassi); Jakob.Gawlikowski@dlr.de (J. Gawlikowski); Auliya.Fitri@dlr.de (A. U. Fitri); Rudolph.Triebel@dlr.de (R. Triebel)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Recently, averaging the predictions of multiple stochastic or deterministic networks has become a standard approach for improving accuracy [1, 2] and uncertainty estimates [3]. Generally, the quality of uncertainty estimates (e.g., the QoC) is assessed by the degree of calibration and/or the ability to detect FPs. Model averaging can yield well-calibrated confidence [4, 5] and is one of the state-of-the-art methods for detecting FPs caused by out-of-distribution examples [4, 3]. However, recent findings [6, 7, 8] show that model averaging does not necessarily lead to calibrated confidence, especially when the networks are built using modern regularization techniques, such as mixup [9] or label smoothing [10, 11]. This is because modern regularization techniques can (strongly) regularize networks, resulting in underconfidence, and averaging underconfident networks produces even more underconfident networks. For example, [7] showed that averaging networks trained with modern regularization techniques resulted in more underconfident networks and therefore miscalibrated predictions. [12] supported this argument by theoretically and empirically showing that averaging calibrated networks does not always lead to calibrated confidences. Calibrating the confidences of averaged networks has received little attention in the literature. Generally, post-processing calibration methods, such as temperature scaling [13], can be used to recalibrate the confidences of averaged networks, as demonstrated in [8, 12]. As shown in [14] and further supported by [8], confidence calibration in model averaging is correlated with the diversity inherent in the individual networks: the more diverse the networks, the better the calibration. Motivated by this observation, [14] promoted model diversity using structured dropout to reduce calibration errors. [7] proposed class-adjusted mixup, which trains less confident networks by evaluating the difference between accuracy (estimated on a validation dataset after each training epoch) and the confidence of each training sample to activate or deactivate mixup training for overconfidence (average confidence > accuracy) or underconfidence (average confidence < accuracy), respectively. All these methods for improving the calibration of combined networks focus on recalibrating, building, or sampling calibrated networks. In contrast, this work focuses on the combination process itself. Specifically, we address the question: What is the impact of averaging logits instead of probabilities of multiple (stochastic or deterministic) networks on the QoC?

We hypothesized that averaging logits instead of probabilities of multiple networks increases the confidence of the averaged network. This is because logits (the inputs to the softmax), which can be interpreted as found evidence for the possible classes [15], are continuous values normalized by the softmax to produce discrete probabilities. The softmax normalization of continuous values (logits) to discrete values (probabilities) causes information loss and a possible robustness to changes in the magnitudes of logits. This implies that the softmax is a nonlinear function that maps multiple logit vectors with large differences in magnitudes to the same discrete probability vector. We evaluated the impact of the increase in confidence caused by averaging logits instead of probabilities on the QoC. Specifically, we evaluated the QoC by assessing the degree of confidence calibration, which measures the difference between the predicted (average confidence) and true probabilities (empirical accuracy). Furthermore, we evaluated the QoC by assessing the ability to separate TPs and FPs. To provide empirical evidence, we compared logit averaging against probability averaging for different averaged models, such as ensembles, MCD, and MMCD. The comparison was based on results from experiments conducted on three datasets, namely MNIST, FashionMNIST, and CIFAR10, evaluated on VGGNet, ResNet, and DenseNet, respectively.

Results show that averaging logits instead of probabilities preserves accuracy but increases confidence. For example, for MCD evaluated on CIFAR10 (see Table 2), the accuracy remained around 85.36% while the average confidence increased from 73.35% to 80.04% when we averaged logits instead of probabilities. Furthermore, given underconfident models, the increase in the degree of confidence reduces the calibration error on the test data. For example, for MCD evaluated on CIFAR10, the ECE dropped from 12.04% to 5.40% when the average confidence increased from 73.35% to 80.04%. However, given overconfident models, the increase in the degree of confidence increases the calibration error on the test data. For example, for the ensemble evaluated on CIFAR10 (see Table 3), the ECE increased from 3.03% to 7.40% when the average confidence increased from 89.43% to 96.17%. Finally, for underconfident or overconfident models, the increase in the degree of confidence can harm the separability between TPs and FPs. This is because averaging logits instead of probabilities increases the confidence of both TPs and FPs. Therefore, FPs can be made with high confidence similar to TPs. For example, for MMCD evaluated on FashionMNIST (see Table 4), the average confidence on FPs due to the noisy data increased from 51.31% to 94.58% when averaging logits instead of probabilities. In summary, we provide empirical evidence demonstrating how combining logits instead of probabilities of multiple (stochastic or deterministic) networks

• preserves accuracy, but increases the confidence on TPs and FPs.
• reduces the calibration error for underconfident networks, but increases the calibration error for overconfident networks.
• can harm the separability between TPs and FPs.

2. Related works

The combination process describes how multiple members are combined and the information type (e.g., logits or probabilities) that is combined. Several approaches, such as stacking [16] and voting [17, 18, 19], have been reported for aggregating multiple predictions. Some of these approaches have been reviewed and discussed in [20, 21] and experimentally compared in [18, 16] to find the one with the best accuracy. It was found that which approach improves accuracy most depends on several factors, such as the number of members, the diversity inherent in individual members, and the accuracy of individual members. In [22], we compared approaches such as averaging, plurality voting, and majority voting to find the one that best captures uncertainty, and found that averaging captures uncertainty better than voting approaches. Before our work, [23] argued that simple averaging approaches are more robust than voting approaches, an argument further supported by [24]. This is because the averaging approach considers all members' predictions, whereas plurality/majority voting ignores uncertain predictions and therefore reduces the uncertainty in the combined members' prediction. Although various combination approaches have been presented and compared in the literature, the information type that is combined has received relatively little attention. [25] showed that averaging quantiles rather than probabilities improves predictive performance. Generally, for neural networks and classification problems in particular, multiple members (networks) are combined by averaging probabilities [16]. [16] evaluated the impact of combining logits instead of probabilities on accuracy; however, the impact on the QoC remains unclear. Thus, we investigated the impact of combining logits instead of probabilities on the QoC.

3. Background

In the context of image classification, let the training data 𝐷_train = {𝑥_𝑖 ∈ ℝ^(𝐻×𝑊×𝐶), 𝑦_𝑖 ∈ 𝑈^𝐾}, 𝑖 ∈ [1, 𝑁], be a realization of independently and identically distributed random variables (𝑥, 𝑦) ∈ 𝑋 × 𝑌, where 𝑥_𝑖 denotes the 𝑖-th input and 𝑦_𝑖 its corresponding one-hot encoded class label from the set of standard unit vectors of ℝ^𝐾, 𝑈^𝐾.
𝑋 and 𝑌 denote the input and label spaces. 𝐻 × 𝑊 × 𝐶 denotes the dimension of the input images, where 𝐻, 𝑊, and 𝐶 refer to the height, width, and number of channels, respectively. 𝐾 and 𝑁 denote the numbers of possible output classes and samples within the training data, respectively.

3.1. Convolutional neural network (CNN)

A CNN is a nonlinear function 𝑓_𝜃 parameterized by model parameters 𝜃, called the network weights. Here, it maps input images 𝑥_𝑖 ∈ ℝ^(𝐻×𝑊×𝐶) to class labels 𝑦_𝑖 ∈ 𝑈^𝐾,

  𝑓_𝜃 : 𝑥_𝑖 ∈ ℝ^(𝐻×𝑊×𝐶) → 𝑦_𝑖 ∈ [0, 1]^𝐾;  𝑓_𝜃(𝑥_𝑖) = 𝑦_𝑖.  (1)

The network parameters are optimized on the training dataset 𝐷_train. Given a new data sample 𝑥 ∈ ℝ^(𝐻×𝑊×𝐶), a trained CNN 𝑓_𝜃 predicts the corresponding target 𝑦 = 𝑓_𝜃(𝑥) using the set of trained weights 𝜃. The network output (logit) is given by 𝑧 = 𝑓_𝜃(𝑥), from which a probability vector 𝑝(𝑦|𝑥, 𝐷_train) = softmax(𝑧) can be computed. In the following, this probability vector will be abbreviated by 𝑝 and its entries by 𝑝_𝑘 with 𝑘 = 1, …, 𝐾 and ∑_{𝑘=1}^{𝐾} 𝑝_𝑘 = 1. Further, we get the predicted confidence 𝑐 = max_𝑘(𝑝_𝑘) and the predicted class label 𝑦 = arg max_𝑘(𝑝_𝑘).

3.2. Monte Carlo Dropout (MCD)

MCD was investigated in [26, 27, 28] for uncertainty estimation. It is one of the most widespread Bayesian methods reviewed in [3]. It approximates the prediction 𝑝(𝑦|𝑥, 𝐷_train) using the mean of 𝑆 stochastic forward passes, 𝑝(𝑦|𝑥, 𝜃_1), …, 𝑝(𝑦|𝑥, 𝜃_𝑆), representing 𝑆 stochastic CNNs parameterized by samples 𝜃_1, 𝜃_2, …, 𝜃_𝑆. That is,

  𝑝(𝑦|𝑥, 𝐷_train) ≈ (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑝(𝑦|𝑥, 𝜃_𝑠) ≈ (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑓_{𝜃_𝑠}(𝑥).  (2)

Specifically, MCD approximates the prediction with a dropout distribution realized by sampling weights with masks drawn from known distributions, such as a Gaussian, a Bernoulli, or a cascade of Gaussian and Bernoulli distributions [22]. For example, given the activation vector 𝑎 fed to an MCD layer (placed, for example, at the input of the first fully-connected layer) and assuming that sampling is realized with masks drawn from a cascade of Gaussian and Bernoulli distributions, the MCD layer samples the 𝑗-th element of 𝑎 as 𝑎^𝑠_𝑗 = 𝑎_𝑗 · 𝛼_𝑗 · 𝛽_𝑗 with 𝛼_𝑗 ∼ 𝒩(1, 𝜎² = 𝑞/(1 − 𝑞)) and 𝛽_𝑗 ∼ Bernoulli(𝑞). Here, 𝑞 denotes the dropout probability. In this work, we refer to MCD as an average of 𝑆 stochastic CNNs.

3.3. Ensemble

An (explicit) ensemble was investigated in [4, 27, 28] for uncertainty estimation. It approximates the prediction 𝑝(𝑦|𝑥, 𝐷_train) by learning different settings. Given a set of CNNs 𝑓_{𝜃_𝑚} for 𝑚 ∈ {1, 2, …, 𝑀}, the ensemble prediction is obtained by averaging over the predictions of the CNNs. That is,

  𝑝(𝑦|𝑥, 𝐷_train) := (1/𝑀) ∑_{𝑚=1}^{𝑀} 𝑝(𝑦|𝑥, 𝜃_𝑚) := (1/𝑀) ∑_{𝑚=1}^{𝑀} 𝑓_{𝜃_𝑚}(𝑥).  (3)

In this work, we refer to an ensemble as an average of 𝑀 deterministic CNNs.

3.4. Mixture of Monte Carlo Dropout (MMCD)

MMCD was investigated in [29, 30, 31] for uncertainty estimation. It combines both MCD and the ensemble. For prediction estimation, MCD evaluates a single feature representation, but additionally considers the uncertainty associated with that feature representation. In contrast, an ensemble evaluates multiple feature representations without considering the uncertainty associated with the individual feature representations. Hence, MMCD applies MCD to an ensemble to evaluate multiple feature representations and to consider the uncertainty associated with each of them. Given a set of CNNs 𝑓_{𝜃_𝑚} for 𝑚 ∈ {1, 2, …, 𝑀}, the MMCD prediction is obtained by averaging over the predictions of all stochastic CNNs. That is,

  𝑝(𝑦|𝑥, 𝐷_train) ≈ (1/(𝑀·𝑆)) ∑_{𝑚=1}^{𝑀} ∑_{𝑠=1}^{𝑆} 𝑝(𝑦|𝑥, 𝜃_{𝑚𝑠}) ≈ (1/(𝑀·𝑆)) ∑_{𝑚=1}^{𝑀} ∑_{𝑠=1}^{𝑆} 𝑓_{𝜃_{𝑚𝑠}}(𝑥).  (4)

In this work, we refer to MMCD as an average of 𝑀·𝑆 stochastic CNNs.

Figure 1: Example showing the difference between (a) averaging logits and (b) averaging probabilities in an ensemble.
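The probability-averaging rules of equations (2)–(4) can be sketched in a few lines of NumPy. This is our own minimal illustration under assumed array shapes and toy inputs, not the authors' code:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last (class) axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits):
    # Eq. (3): average the probabilities of M deterministic CNNs.
    # logits has shape (M, K).
    return softmax(logits).mean(axis=0)

def mcd_predict(logits):
    # Eq. (2): average the probabilities of S stochastic forward passes.
    # logits has shape (S, K).
    return softmax(logits).mean(axis=0)

def mmcd_predict(logits):
    # Eq. (4): average over all M * S stochastic CNNs.
    # logits has shape (M, S, K).
    return softmax(logits).mean(axis=(0, 1))

rng = np.random.default_rng(0)
p = mmcd_predict(rng.normal(size=(5, 100, 10)))  # M=5, S=100, K=10
assert np.isclose(p.sum(), 1.0)  # the combined output is still a distribution
```

Because every member's softmax output already sums to one, any of these averages again yields a valid probability vector, which is why the combined confidence can be read off as max_k(p_k) exactly as for a single CNN.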
4. Combining logits instead of probabilities

The output layer of a CNN-based classifier includes 𝐾 output neurons with a softmax activation function, which normalizes its inputs (continuous values) to produce discrete probabilities 𝑝_𝑘 (with 𝑘 = 1, …, 𝐾 and ∑_{𝑘=1}^{𝐾} 𝑝_𝑘 = 1) representing the probability that the input image belongs to the class associated with the 𝑘-th output neuron. The inputs to the softmax function are logits, interpreted as evidence for the possible classes [15]. The discrete probability 𝑝_𝑘 is interpreted as the model confidence that the input belongs to the class associated with the 𝑘-th output neuron. Given the logit vector 𝑧 = [𝑧_1 … 𝑧_𝐾]^𝑇, the softmax estimates 𝑝 = [𝑝_1 … 𝑝_𝐾]^𝑇 as

  𝑝 = softmax(𝑧) = (1/∑_{𝑘=1}^{𝐾} exp(𝑧_𝑘)) [exp(𝑧_1) … exp(𝑧_𝐾)]^𝑇.  (5)

From Figure 1, given an ensemble of 𝑀 deterministic CNNs with logits 𝑧^𝑚, the average logit 𝑧̄ can be estimated as

  𝑧̄ := (1/𝑀) ∑_{𝑚=1}^{𝑀} 𝑧^𝑚 := (1/𝑀) ∑_{𝑚=1}^{𝑀} 𝑓_{𝜃_𝑚}(𝑥),  (6)

and the predicted probability vector of the ensemble of deterministic CNNs can be reformulated as

  𝑝(𝑦|𝑥, 𝐷_train) = softmax(𝑧̄).  (7)

Given MCD, representing an ensemble of 𝑆 stochastic CNNs with logits 𝑧^𝑠, we can estimate the average logit 𝑧̄ as

  𝑧̄ ≈ (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑧^𝑠 ≈ (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑓_{𝜃_𝑠}(𝑥),  (8)

and reformulate the predicted probability vector of MCD as shown in (7). Similarly, given MMCD, representing an ensemble of 𝑀·𝑆 stochastic CNNs with logits 𝑧^{𝑚𝑠}, we can estimate the average logit 𝑧̄ as

  𝑧̄ ≈ (1/(𝑀·𝑆)) ∑_{𝑚=1}^{𝑀} ∑_{𝑠=1}^{𝑆} 𝑧^{𝑚𝑠} ≈ (1/(𝑀·𝑆)) ∑_{𝑚=1}^{𝑀} ∑_{𝑠=1}^{𝑆} 𝑓_{𝜃_{𝑚𝑠}}(𝑥),  (9)

and reformulate the predicted probability vector of MMCD as shown in (7).

Figure 2: Example showing how averaging logits instead of probabilities increases the confidence of an ensemble of four deterministic CNNs: 𝑧̄ = (1/4) ∑_{𝑚=1}^{4} 𝑧^𝑚, 𝑝_𝑧̄ = softmax(𝑧̄), and 𝑝̄ = (1/4) ∑_{𝑚=1}^{4} 𝑝^𝑚 with 𝑝^𝑚 = softmax(𝑧^𝑚). One can see that averaging logits (𝑝_𝑧̄) results in more confident predictions than averaging probabilities (𝑝̄). This is attributed to averaging logits being more sensitive to the magnitude of the logit values than averaging probabilities: a 𝑧^𝑚 with large values contributes most to 𝑧̄. In our example, 𝑧̄ is mostly influenced by the values of 𝑧^1; the contributions of 𝑧^2, 𝑧^3, and 𝑧^4 to 𝑧̄ are minor. In contrast, 𝑝̄ is influenced by the values of all probability vectors 𝑝^𝑚 and is therefore less sensitive to the magnitude of the individual logits.

From Figure 2, averaging logits instead of probabilities of multiple stochastic or deterministic CNNs increases the confidence of the averaged CNNs. Intuitively, logit averaging provides the best evidence (characterized by a low level of uncertainty caused by the reduction of inductive biases inherent in individual logits) for making decisions, whereas probability averaging provides the best confidence associated with decisions made using weak evidence (characterized by a high level of uncertainty caused by inductive biases inherent in individual logits). This implies that a decision made using probability averaging considers more uncertainty than one made using logit averaging. In this work, we evaluated the impact on the QoC of the possible increase in the degree of confidence caused by applying logit averaging instead of probability averaging.
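The effect illustrated in Figure 2 is easy to reproduce numerically. In the toy example below (our own made-up logits, not the paper's data), one member produces large-magnitude logits and dominates the logit average, so logit averaging yields a markedly higher confidence than probability averaging while the predicted class stays the same:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Four hypothetical ensemble members; member 1 has large-magnitude logits
# and therefore dominates the logit average, as in Figure 2.
Z = np.array([[9.0, 1.0, 0.0],
              [1.2, 1.0, 0.8],
              [1.1, 1.0, 0.9],
              [0.9, 1.0, 1.1]])

p_bar = softmax(Z).mean(axis=0)    # probability averaging, eq. (3)
p_z_bar = softmax(Z.mean(axis=0))  # logit averaging, eqs. (6)-(7)

# Same predicted class, but logit averaging is far more confident.
assert p_bar.argmax() == p_z_bar.argmax()
assert p_z_bar.max() > p_bar.max()
```

Note that logit averaging followed by a single softmax cannot recover the probability average, since softmax is nonlinear; this is exactly the sensitivity to logit magnitudes that the paper exploits and warns about.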
5. Experiments

5.1. Experimental setup

We hypothesized that the QoC of CNNs (strongly) depends on the task difficulty (specified via the training data), the underlying architecture, and/or the training procedure (mostly influenced by the regularization strength). Therefore, we compared logit and probability averaging on three datasets to evaluate the impact of the task difficulty on the QoC. Moreover, we compared logit and probability averaging using three different architectures to evaluate the impact of the underlying architecture on the QoC. Specifically, we evaluated MNIST [32] on VGGNets [1], FashionMNIST [33] on ResNets [2], and CIFAR10 [34] on DenseNets [35]. Finally, we compared logit and probability averaging on CNNs trained using two regularization strengths (strong and weak regularization, summarized in Table 1) to evaluate the impact of the regularization strength on the QoC. We observed that strong and weak regularization result in underconfident and overconfident CNNs, respectively. All CNNs were regularized using batch normalization [36] layers placed before each convolutional activation function. All CNNs were randomly initialized and trained with random shuffling of the training samples, using categorical cross-entropy and stochastic gradient descent with a momentum of 0.9, a learning rate of 0.02, a batch size of 128, and 100 epochs. All images were standardized and normalized by dividing pixel values by 255. For all MCD and MMCD models, we sampled activations of the first fully-connected layer using masks drawn from a cascade of Bernoulli and Gaussian distributions [22] with a dropout probability of 0.5. We performed 100 stochastic forward passes (𝑆 = 100) and considered ensembles consisting of five deterministic CNNs (𝑀 = 5).

Table 1: Summary of values assigned to regularization hyper-parameters.

  Hyper-parameter                                                      Weak regularization   Strong regularization
  Probability of dropout applied at inputs to max pooling layers       -                     0.05
  Probability of dropout applied at inputs to fully-connected layers   0.05                  0.5
  Rotation range [degree]                                              [-5, +5]              [-45, +45]
  Width and height shift range [pixel]                                 [-1, +1]              [-5, +5]
  Scale intensity range                                                [0.95, 1.05]          [0.9, 1.2]
  Shear intensity range                                                0.05                  0.1
  Additive Gaussian noise standard deviation range                     0.05                  0.5

5.2. Evaluation metrics

The QoC was evaluated by assessing the degree of confidence calibration. Specifically, we evaluated the calibration error using measures such as the negative log-likelihood (NLL) applied in [4, 5, 31], the expected calibration error (ECE) applied in [13, 8, 12], and the Brier score (BS) applied in [4]. Low values of NLL, ECE, and BS indicate a low calibration error, and vice versa. Furthermore, we evaluated the QoC by assessing the ability to separate TPs and FPs. Here, we evaluated the average confidence on evaluation data causing TPs or FPs. Given evaluation data causing TPs, we expect the average confidence to be high; for evaluation data causing FPs, we expect it to be low. Moreover, we evaluated the ability to separate TPs and FPs using the area under the receiver operating characteristic (AUC-ROC) applied in [37, 5]. The AUC-ROC summarizes the trade-off between the fraction of TPs that are correctly detected and the fraction of FPs that are undetected across different thresholds. In summary, in addition to the NLL, ECE, and BS, we evaluated the accuracy, the average confidence, and the AUC-ROC.

5.3. Evaluation data

We used five kinds of evaluation data for different purposes, namely test data, subsets of the correctly classified test data, out-of-domain data, swapped data, and noisy data.

Test data represent the test data of the experimental datasets, namely MNIST, CIFAR10, and FashionMNIST. These datasets include both correctly classified and misclassified test samples. Test data are used for estimating the accuracy, NLL, ECE, and BS. We expect the accuracy to be high and the NLL, ECE, and BS to be low on test data.

Figure 3: Examples of evaluation data for experiments conducted on CIFAR10: (a) test data, (b) swapped data, (c) noisy data.

Subsets of the correctly classified test data include 1000 correctly classified test samples from the experimental data. Since CNNs will make TPs on these data, we used them for evaluating the average confidence on TPs.

Swapped data were simulated using subsets of the correctly classified test data structurally perturbed by dividing the images into four regions and diagonally permuting the regions. As shown in Figure 3b, the upper left and right regions are permuted with the bottom right and left regions, respectively. Swapped data thus contain structurally perturbed objects. We expect CNNs to make FPs on swapped data, and therefore used these data for evaluating the average confidence on FPs caused by structurally perturbed objects.

Noisy data were simulated using subsets of the correctly classified test data perturbed by additive Gaussian noise with a standard deviation of 500. As shown in Figure 3c, noisy data contain noise within the given images. We expect CNNs to make FPs on these data, and therefore used them for evaluating the average confidence on FPs caused by noisy objects.

Out-of-domain data were simulated using 1000 test samples of CIFAR100 [34]. Since CNNs will make FPs on these data, we used them for evaluating the average confidence on FPs caused by unknown objects.

In general, we expect the average confidence to be high on TPs and low on FPs.

5.4. Experimental results

We evaluate the conducted experiments with respect to accuracy and QoC. Table 2 and Table 3 summarize the accuracy, average confidence, NLL, ECE, and BS of the different models using the two averaging approaches, for CNNs trained using strong regularization (causing underconfidence) and weak regularization (causing overconfidence), respectively. The results show that averaging logits instead of probabilities does not strongly affect the accuracy; averaging logits can thus preserve accuracy. Furthermore, averaging logits instead of probabilities significantly increases the average confidence; Figure 2 illustrates why. Further, Table 2 shows that averaging logits instead of probabilities significantly decreases the NLL, ECE, and BS for underconfident CNNs (trained using strong regularization). This means that averaging logits, unlike averaging probabilities, reduces the calibration error for underconfident CNNs. This is because the stronger the regularization, the lower the confidence and the higher the gap between accuracy and average confidence. Here, the increase in the degree of confidence caused by averaging logits instead of probabilities reduces this gap. For example, Table 2 shows that, on CIFAR10, averaging logits instead of probabilities for MMCD reduces the gap between accuracy and average confidence from 18.24 (= |88.75 − 70.51|)% to 9.52 (= |88.94 − 79.42|)%.

However, the increase in the degree of confidence caused by averaging logits instead of probabilities increases the calibration error for overconfident CNNs (trained using weak regularization). Table 3 provides empirical evidence for this claim by showing that, on CIFAR10 and FashionMNIST, the NLL, ECE, and BS of the ensembles increase when logits are averaged instead of probabilities. We argued that the more overconfident the CNNs, the higher the confidence and the higher the gap between accuracy and average confidence.
Table 2: Comparison of accuracy [%], average confidence [%] (in brackets), NLL [10⁻²], ECE [10⁻²], and BS [10⁻²] of different models using two approaches for averaging underconfident CNNs trained using strong regularization: averaging probabilities (AP) and averaging logits (AL). The results were obtained on the test data described in Section 5.3.

                Accuracy (Avg. confidence)↑      NLL↓           ECE↓           BS↓
                AP             AL                AP     AL      AP     AL      AP     AL
  CIFAR10 (DenseNets)
  Ensemble      89.52 (84.31)  89.60 (87.97)     34.66  32.81   5.23   2.47    16.13  15.38
  MCD           85.36 (73.35)  85.37 (80.04)     52.13  46.55   12.04  5.40    23.32  21.57
  MMCD          88.75 (70.51)  88.94 (79.42)     50.83  40.99   18.24  9.55    21.82  18.34
  FashionMNIST (ResNets)
  Ensemble      92.70 (87.86)  92.58 (90.16)     22.57  20.99   5.15   2.86    11.37  10.87
  MCD           90.56 (79.22)  90.56 (83.95)     35.45  30.18   11.47  6.85    15.82  14.57
  MMCD          92.65 (76.37)  92.73 (83.78)     35.57  26.96   16.31  9.10    14.87  12.47
  MNIST (VGGNets)
  Ensemble      99.04 (98.24)  99.04 (98.89)     3.25   2.90    1.03   0.52    1.52   1.41
  MCD           98.16 (94.53)  98.16 (96.48)     8.73   6.87    3.81   1.98    2.99   2.79
  MMCD          99.03 (94.67)  99.04 (97.46)     6.91   4.13    4.49   1.75    1.89   1.52

Table 3: Comparison of accuracy [%], average confidence [%] (in brackets), NLL [10⁻²], ECE [10⁻²], and BS [10⁻²] of ensembles using two approaches for averaging overconfident CNNs trained using weak regularization: averaging probabilities (AP) and averaging logits (AL). The results were obtained on the test data described in Section 5.3.

                           Accuracy (Avg. confidence)↑      NLL↓           ECE↓          BS↓
                           AP             AL                AP     AL      AP    AL      AP     AL
  CIFAR10 (DenseNets)      88.67 (89.43)  88.88 (96.17)     40.69  54.23   3.03  7.40    16.69  18.07
  FashionMNIST (ResNets)   94.49 (95.86)  94.58 (98.43)     20.20  28.00   1.98  4.11    8.36   9.32

Here, the increase in the degree of confidence caused by averaging logits instead of probabilities further increases the gap between accuracy and average confidence and therefore increases the calibration error. For example, Table 3 shows that, on CIFAR10, averaging the logits of the ensemble increases the gap between accuracy and average confidence from 0.76 (= |88.67 − 89.43|)% to 7.29 (= |88.88 − 96.17|)%.

In Table 4, the average confidence on TPs and FPs is shown for underconfident models using both averaging approaches. The results show that averaging logits instead of probabilities increases the confidence on both TPs and FPs. The increase in average confidence is sometimes very large for FPs due to noisy data. For example, for MMCD evaluated on FashionMNIST, the average confidence on the noisy data increases from 51.31% to 94.58% when averaging logits. This is because noisy data can increase the magnitude of the logits, and averaging logits is more sensitive to changes in the magnitude of the logits than averaging probabilities (see Figure 2). The increase in the degree of confidence caused by averaging logits can harm the separability of TPs and FPs. For example, the increase in the average confidence on the noisy data from 51.31% to 94.58% causes the AUC-ROC obtained from evaluating the degree of confidence to decrease from 84.80% to 42.42%.

Table 4: Comparison of average confidence [%] of different models using two approaches (averaging probabilities (AP) and averaging logits (AL)) for averaging underconfident networks trained using strong regularization and evaluated on TPs and FPs: TPs were obtained on subsets of the correctly classified test data, while FPs were obtained on the swapped, noisy, and out-of-domain (OOD) data described in Section 5.3.

                TP↑             FP (OOD)↓       FP (Swapped)↓   FP (Noisy)↓
                AP      AL      AP      AL      AP      AL      AP      AL
  CIFAR10 (DenseNets)
  Ensemble      93.94   96.63   35.39   40.08   51.84   56.03   39.42   58.69
  MCD           81.39   88.45   31.61   33.27   40.39   44.69   44.83   69.53
  MMCD          79.48   89.53   22.81   23.83   36.26   40.67   28.01   33.08
  FashionMNIST (ResNets)
  Ensemble      88.01   90.16   55.48   63.21   59.30   67.91   81.39   99.82
  MCD           79.39   83.76   47.08   50.36   55.75   59.29   41.23   65.79
  MMCD          76.40   83.76   42.76   49.09   45.73   52.70   51.31   94.58
  MNIST (VGGNets)
  Ensemble      99.09   99.55   57.16   80.45   51.96   62.01   69.58   88.84
  MCD           95.12   97.11   64.36   69.17   58.92   62.84   97.95   99.53
  MMCD          95.37   98.17   48.89   63.56   43.53   49.39   57.17   78.14

6. Discussion

The term 'combination process' encompasses how multiple networks are combined and the information type that is combined. It was found in [23, 24, 22] that simple averaging is more robust and captures uncertainty better than voting approaches, because simple averaging weights all predictions equally, while voting ignores uncertain predictions. In this work, we compared averaging logits against averaging probabilities. We empirically showed that averaging logits instead of probabilities increases the confidence while preserving the accuracy, for both underconfident and overconfident networks. This might be because logit averaging preserves the position of the maximum element of the individual logit vectors but is more sensitive to the magnitude of the logit values than probability averaging. Thus, logit values with a large magnitude contribute the most to the average logit. In this way, the magnitude of the logit values induces a non-uniform weighting under logit averaging, which is lost under probability averaging. Furthermore, we provided empirical evidence showing that, for underconfident networks (trained using strong regularization), the increase in confidence caused by averaging logits instead of probabilities reduces the calibration error on the test data, because it reduces the gap between accuracy and average confidence. However, for overconfident networks (trained using weak regularization), the increase in confidence caused by averaging logits instead of probabilities increases the calibration error on the test data, because it further widens the gap between accuracy and average confidence. These findings suggest that for underconfident networks we can average logits instead of probabilities to reduce the calibration error, whereas for overconfident networks we should average probabilities instead of logits to avoid increasing it. Although the increase in confidence caused by averaging logits reduces the calibration error on the test data for underconfident networks, we empirically showed that it can harm the separability of TPs and FPs. This is because averaging logits increases the confidence on both TPs and FPs; therefore, FPs can also be made with high confidence, similar to TPs. These findings suggest that reducing the calibration error on the test data and improving the separability of TPs and FPs can be two contradicting goals: improving one may be to the detriment of the other. Furthermore, for two models 𝐴 and 𝐵, if 𝐴 is better calibrated than 𝐵, then 𝐴 does not necessarily separate TPs and FPs better than 𝐵. This implies that calibration methods may be insufficient for separating TPs and FPs and therefore for ensuring safe decision-making. Additionally, existing methods for confidence calibration may not help in separating TPs and FPs. Subsequently, future work will evaluate the ability of existing confidence calibration methods to separate TPs and FPs. We also recommend that researchers evaluate both the calibration error of their proposed confidence calibration method and its ability to separate TPs and FPs. Finally, for mission- and safety-critical applications where the separability of TPs and FPs is of paramount importance, we suggest averaging probabilities to avoid the negative impact of logit averaging on the ability to separate TPs and FPs.

7. Conclusion

Averaging logits instead of probabilities of stochastic or deterministic networks increases the degree of confidence on both TPs and FPs. This reduces the calibration error on the test data for underconfident networks but harms the separability of TPs and FPs. Our empirical results show that there is a trade-off between improving calibration on the test data and improving the separability of TPs and FPs. Additionally, the increase in the degree of confidence increases the calibration error on the test data for overconfident networks. Therefore, averaging logits should only be applied when combining underconfident networks.
7. Conclusion

Averaging logits instead of probabilities of stochastic or deterministic networks increases the degree of confidence on both TPs and FPs. This reduces the calibration error on the test data for underconfident networks but harms the separability of TPs and FPs. Our empirical results show that there is a trade-off between improving calibration on the test data and improving the separability of TPs and FPs. Additionally, the increase in the degree of confidence increases the calibration error on the test data for overconfident networks. Therefore, averaging logits should only be applied when combining underconfident networks. For example, we can average the logits instead of the probabilities of an ensemble of networks trained with mixup or other modern data augmentation techniques to improve calibration on the test data. Notwithstanding this, for mission- and safety-critical applications where the separability of TPs and FPs is essential, we suggest the traditional averaging of probabilities. However, it remains unclear whether the findings of this paper would change if the given networks or the averaged logits were calibrated, for example, with temperature scaling [13]. This suggests a new research direction.
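Temperature scaling [13], named above as an open question, divides the logits by a single scalar 𝑇 > 0 fitted on validation data before the softmax, so the predicted class never changes while the confidence is softened (𝑇 > 1) or sharpened (𝑇 < 1). A minimal sketch, assuming a simple grid search over 𝑇 and made-up validation logits (the original method of Guo et al. [13] fits 𝑇 by gradient-based NLL minimization):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def nll(logits, labels, temperature):
    """Average negative log-likelihood of the temperature-scaled softmax."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = softmax([v / temperature for v in z])
        total -= math.log(p[y])
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Pick the T > 0 minimizing validation NLL via grid search (a stand-in
    for the gradient-based fit of the original method)."""
    grid = grid or [0.05 * k for k in range(1, 101)]  # T in (0, 5]
    return min(grid, key=lambda t: nll(logits, labels, t))

# Made-up overconfident validation logits and true labels:
# the last two samples are misclassified despite sharp logits.
val_logits = [[8.0, 0.0, 0.0], [0.0, 7.5, 0.0], [6.0, 5.5, 0.0], [0.0, 5.0, 4.8]]
val_labels = [0, 1, 1, 2]

T = fit_temperature(val_logits, val_labels)
# A fitted T > 1 softens the overconfident predictions while leaving
# every argmax (and hence the accuracy) unchanged.
```

Whether recalibrating the members or the averaged logit in this way preserves the conclusions above is precisely the open question; the sketch only illustrates the mechanism.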
References

[1] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015. URL: http://arxiv.org/abs/1409.1556.
[2] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[3] J. Gawlikowski, C. R. N. Tassi, M. Ali, J. Lee, M. Humt, J. Feng, A. Kruspe, R. Triebel, P. Jung, R. Roscher, et al., A survey of uncertainty in deep neural networks, arXiv preprint arXiv:2107.03342 (2021).
[4] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in Neural Information Processing Systems 30 (2017).
[5] S. Thulasidasan, G. Chennupati, J. A. Bilmes, T. Bhattacharya, S. Michalak, On mixup training: Improved calibration and predictive uncertainty for deep neural networks, Advances in Neural Information Processing Systems 32 (2019).
[6] Y. Qin, X. Wang, A. Beutel, E. Chi, Improving calibration through the relationship with adversarial robustness, in: A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, 2021. URL: https://openreview.net/forum?id=NJex-5TZIQa.
[7] Y. Wen, G. Jerfel, R. Muller, M. W. Dusenberry, J. Snoek, B. Lakshminarayanan, D. Tran, Combining ensembles and data augmentation can harm your calibration, arXiv preprint arXiv:2010.09875 (2020).
[8] R. Rahaman, A. H. Thiery, Uncertainty quantification and deep ensembles, Advances in Neural Information Processing Systems 34 (2021).
[9] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, in: International Conference on Learning Representations, 2018.
[10] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[11] R. Müller, S. Kornblith, G. Hinton, When does label smoothing help?, Curran Associates Inc., Red Hook, NY, USA, 2019.
[12] X. Wu, M. Gales, Should ensemble members be calibrated?, arXiv preprint arXiv:2101.05397 (2021).
[13] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1321–1330.
[14] Z. Zhang, A. V. Dalca, M. R. Sabuncu, Confidence calibration for convolutional neural networks using structured dropout, arXiv preprint arXiv:1906.09551 (2019).
[15] M. Sensoy, L. Kaplan, M. Kandemir, Evidential deep learning to quantify classification uncertainty, Advances in Neural Information Processing Systems 31 (2018).
[16] C. Ju, A. Bibaut, M. van der Laan, The relative performance of ensemble methods with deep convolutional neural networks for image classification, Journal of Applied Statistics 45 (2018) 2800–2818.
[17] L. I. Kuncheva, Combining pattern classifiers: Methods and algorithms, John Wiley & Sons, 2014.
[18] M. Van Erp, L. Vuurpijl, L. Schomaker, An overview and comparison of voting methods for pattern recognition, in: Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition, IEEE, 2002, pp. 195–200.
[19] T. Tajti, New voting functions for neural network algorithms, in: Annales Mathematicae et Informaticae, volume 52, Eszterházy Károly Egyetem Líceum Kiadó, 2020, pp. 229–242.
[20] T. G. Dietterich, Machine-learning research, AI Magazine 18 (1997) 97–97.
[21] S. Tulyakov, S. Jaeger, V. Govindaraju, D. Doermann, Review of classifier combination methods, Machine Learning in Document Analysis and Recognition (2008) 361–386.
[22] C. R. Njieutcheu Tassi, Bayesian convolutional neural network: Robustly quantify uncertainty for misclassifications detection, in: Mediterranean Conference on Pattern Recognition and Artificial Intelligence, Springer, 2019, pp. 118–132.
[23] R. T. Clemen, Combining forecasts: A review and annotated bibliography, International Journal of Forecasting 5 (1989) 559–583.
[24] J. Kittler, M. Hatef, R. P. Duin, J. Matas, On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 226–239.
[25] K. C. Lichtendahl Jr., Y. Grushka-Cockayne, R. L. Winkler, Is it better to average probabilities or quantiles?, Management Science 59 (2013) 1594–1611.
[26] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.
[27] W. H. Beluch, T. Genewein, A. Nürnberger, J. M. Köhler, The power of ensembles for active learning in image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9368–9377.
[28] F. K. Gustafsson, M. Danelljan, T. B. Schon, Evaluating scalable Bayesian deep learning methods for robust computer vision, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 318–319.
[29] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, S. Levine, Uncertainty-aware reinforcement learning for collision avoidance, arXiv preprint arXiv:1702.01182 (2017).
[30] B. Lütjens, M. Everett, J. P. How, Safe reinforcement learning with model uncertainty estimates, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 8662–8668.
[31] A. G. Wilson, P. Izmailov, Bayesian deep learning and a probabilistic perspective of generalization, Advances in Neural Information Processing Systems 33 (2020) 4697–4708.
[32] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998) 2278–2324.
[33] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747 (2017).
[34] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images (2009).
[35] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[36] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, PMLR, 2015, pp. 448–456.
[37] D. Hendrycks, K. Gimpel, A baseline for detecting misclassified and out-of-distribution examples in neural networks, in: Proceedings of the International Conference on Learning Representations, 2017.