=Paper=
{{Paper
|id=Vol-3215/paper_19
|storemode=property
|title=The impact of averaging logits over probabilities on ensembles of neural networks
|pdfUrl=https://ceur-ws.org/Vol-3215/19.pdf
|volume=Vol-3215
|authors=Cedrique Rovile Njieutcheu Tassi,Jakob Gawlikowski,Auliya Unnisa Fitri,Rudolph Triebel
|dblpUrl=https://dblp.org/rec/conf/ijcai/TassiGFT22
}}
==The impact of averaging logits over probabilities on ensembles of neural networks==
Cedrique Rovile Njieutcheu Tassi¹, Jakob Gawlikowski², Auliya Unnisa Fitri² and Rudolph Triebel³

¹ German Aerospace Center (DLR), Institute of Optical Sensor Systems, Rutherfordstraße 2, 12489 Berlin, Germany
² German Aerospace Center (DLR), Institute of Data Science, Mälzerstraße 3-5, 07745 Jena, Germany
³ German Aerospace Center (DLR), Institute of Robotics and Mechatronics, Münchener Straße 20, 82234 Wessling, Germany
Abstract

Model averaging has become a standard for improving neural networks in terms of accuracy, calibration, and the ability to detect false predictions (FPs). However, recent findings show that model averaging does not necessarily lead to calibrated confidences, especially for underconfident networks. While existing methods for improving the calibration of combined networks focus on recalibrating, building, or sampling calibrated models, we focus on the combination process itself. Specifically, we evaluate the impact of averaging logits instead of probabilities on the quality of confidence (QoC). We compare averaging logits with averaging probabilities of the members (networks) of models such as ensembles, Monte Carlo Dropout (MCD), and Mixture of Monte Carlo Dropout (MMCD). The comparison is based on experimental results on three datasets using three different architectures. We show that averaging logits instead of probabilities increases the confidence, thereby improving the confidence calibration of underconfident models. For example, for MCD evaluated on CIFAR10, averaging logits instead of probabilities reduces the expected calibration error (ECE) from 12.04% to 5.40%. However, the increase in confidence can harm the confidence calibration of overconfident models and the separability between true predictions (TPs) and FPs. For example, for MMCD evaluated on FashionMNIST, the average confidence on FPs due to the noisy data increases from 51.31% to 94.58% when averaging logits instead of probabilities. While averaging logits can be applied to underconfident models to improve the calibration on test data, we suggest averaging probabilities for safety- and mission-critical applications where the separability of TPs and FPs is of paramount importance.
Keywords
Model averaging, Combination process, Logit averaging, Probability averaging, Ensemble, Monte Carlo Dropout (MCD),
Mixture of Monte Carlo Dropout (MMCD), Quality of confidence (QoC), Confidence calibration, Separating true predictions
(TPs) and false predictions (FPs)
The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022)
Contact: Cedrique.NjieutcheuTassi@dlr.de (C. R. N. Tassi); Jakob.Gawlikowski@dlr.de (J. Gawlikowski); Auliya.Fitri@dlr.de (A. U. Fitri); Rudolph.Triebel@dlr.de (R. Triebel)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Recently, averaging the predictions of multiple stochastic or deterministic networks has become a standard approach for improving accuracy [1, 2] and uncertainty estimates [3]. Generally, the quality of uncertainty estimates (e.g., the QoC) is assessed by the degree of calibration and/or the ability to detect FPs. Model averaging can yield well-calibrated confidence [4, 5] and is one of the state-of-the-art methods for detecting FPs caused by out-of-distribution examples [4, 3]. However, recent findings [6, 7, 8] show that model averaging does not necessarily lead to calibrated confidence, especially when the networks are built using modern regularization techniques, such as mixup [9] or label smoothing [10, 11]. This is because modern regularization techniques can (strongly) regularize networks, resulting in underconfidence. Furthermore, averaging underconfident networks produces even more underconfident networks. For example, [7] showed that averaging networks trained with modern regularization techniques resulted in more underconfident networks and therefore miscalibrated predictions. [12] supported this argument by theoretically and empirically showing that averaging calibrated networks does not always lead to calibrated confidences. Calibrating the confidences of averaged networks has received little attention in the literature. Generally, post-processing calibration methods, such as temperature scaling [13], can be used to recalibrate the confidences of averaged networks, as demonstrated in [8, 12]. According to [14], and further supported by [8], confidence calibration in model averaging is correlated with the diversity inherent in the individual networks: the more diverse the networks, the better the calibration. Motivated by this observation, [14] promoted model diversity using structured dropout to reduce calibration errors. [7] proposed class-adjusted mixup, which trains less confident networks by evaluating the difference between accuracy (estimated on a validation dataset after each training epoch) and the confidence of each training sample to activate or deactivate mixup training for overconfidence (average confidence > accuracy) or underconfidence (average confidence < accuracy), respectively.
All these methods for improving the calibration of combined networks focus on recalibrating, building, or sampling calibrated networks. In contrast, this work focuses on the combination process itself. Specifically, we address the question: What is the impact of averaging logits instead of probabilities of multiple (stochastic or deterministic) networks on the QoC?

We hypothesized that averaging logits instead of probabilities of multiple networks increases the confidence of the averaged network. This is because logits (the inputs to the softmax), which can be interpreted as found evidence for the possible classes [15], are continuous values normalized by the softmax to produce discrete probabilities. The softmax normalization of continuous values (logits) to discrete values (probabilities) causes information loss and a possible robustness to changes in the magnitudes of the logits: the softmax is a nonlinear function that maps multiple logit vectors with large differences in magnitudes to the same discrete probability vector.
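To illustrate this information loss, consider the shift invariance of the softmax: adding a constant to every logit leaves the probability vector unchanged, so logit vectors of very different magnitudes collapse to the same probabilities. The following minimal NumPy sketch (an illustration only, not code from our experiments) makes this concrete.

```python
import numpy as np

def softmax(z):
    # Subtracting the max is the usual numerically stable formulation.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z_small = np.array([2.0, 1.0, 0.0])
z_large = z_small + 100.0  # same logit gaps, much larger magnitude

# Both logit vectors yield the identical probability vector,
# i.e., the magnitude information is lost after normalization.
print(softmax(z_small))  # [0.665 0.245 0.090]
print(softmax(z_large))  # [0.665 0.245 0.090]
```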
We evaluated the impact of this increase in confidence caused by averaging logits instead of probabilities on the QoC. Specifically, we evaluated the QoC by assessing the degree of confidence calibration, which measures the difference between the predicted probability (average confidence) and the true probability (empirical accuracy). Furthermore, we evaluated the QoC by assessing the ability to separate TPs and FPs. To provide empirical evidence, we considered logit averaging against probability averaging and compared both approaches using different averaged models, such as the ensemble, MCD, and MMCD. The comparison was based on results from different experiments conducted on three datasets, namely, MNIST, FashionMNIST, and CIFAR10, evaluated on VGGNet, ResNet, and DenseNet, respectively.

Results show that averaging logits instead of probabilities preserves accuracy but increases confidence. For example, for MCD evaluated on CIFAR10 (see Table 2), the accuracy remained around 85.36% while the average confidence increased from 73.35% to 80.04% when we averaged logits instead of probabilities. Furthermore, given underconfident models, the increase in the degree of confidence reduces the calibration error on the test data. For example, for MCD evaluated on CIFAR10, the ECE dropped from 12.04% to 5.40% when the average confidence increased from 73.35% to 80.04%. However, given overconfident models, the increase in the degree of confidence increased the calibration error on the test data. For example, for the ensemble evaluated on CIFAR10 (see Table 3), the ECE increased from 3.03% to 7.40% when the average confidence increased from 89.43% to 96.17%. Finally, for underconfident or overconfident models, the increase in the degree of confidence can harm the separability between TPs and FPs. This is because averaging logits instead of probabilities increases the confidence of both TPs and FPs; therefore, FPs can be made with high confidence similar to TPs. For example, for MMCD evaluated on FashionMNIST (see Table 4), the average confidence on FPs due to the noisy data increased from 51.31% to 94.58% when averaging logits instead of probabilities. In summary, we provide empirical evidence demonstrating how combining logits instead of probabilities of multiple (stochastic or deterministic) networks

• preserves accuracy, but increases the confidence on TPs and FPs.
• reduces the calibration error (given underconfident networks), but increases the calibration error (given overconfident networks).
• can harm the separability between TPs and FPs.

2. Related works

The combination process describes how multiple members are combined and the information type (e.g., logits or probabilities) that is combined. Several approaches, such as stacking [16] and voting [17, 18, 19], have been reported for aggregating multiple predictions. Some of these approaches have been reviewed and discussed in [20, 21] and experimentally compared in [18, 16] to find the one with the best accuracy. It was found that one approach improves accuracy better than another depending on several factors, such as the number of members, the diversity inherent in the individual members, and the accuracy of the individual members. However, in [22], we compared approaches such as averaging, plurality voting, and majority voting to find the one that better captures uncertainty. We found that the averaging approach captures uncertainty better than the voting approaches. Before our work, [23] argued that simple averaging approaches are more robust than voting approaches, an argument further supported by [24]. This is because the averaging approach considers all members' predictions, whereas plurality/majority voting ignores uncertain predictions and therefore reduces the uncertainty in the combined members' prediction. Although various combination approaches have been presented and compared in the literature, the information type that is combined has received relatively little attention. [25] showed that averaging quantiles rather than probabilities improves the predictive performance. Generally, for neural networks, and classification problems in particular, multiple members (networks) are combined by averaging probabilities [16]. [16] evaluated the impact of combining logits instead of probabilities on accuracy; however, the impact on the QoC remains unclear. Thus, we investigated the impact of combining logits instead of probabilities on the QoC.
3. Background

In the context of image classification, let the training data $D_{train} = \{x_i \in \mathbb{R}^{H \times W \times C}, y_i \in U^K\}_{i \in [1,N]}$ be a realization of independently and identically distributed random variables $(x, y) \in X \times Y$, where $x_i$ denotes the $i$-th input and $y_i$ its corresponding one-hot encoded class label from the set of standard unit vectors of $\mathbb{R}^K$, $U^K$. $X$ and $Y$ denote the input and label spaces. $H \times W \times C$ denotes the dimension of the input images, where $H$, $W$, and $C$ refer to the height, width, and number of channels, respectively. $K$ and $N$ denote the numbers of possible output classes and samples within the training data, respectively.

3.1. Convolutional neural network (CNN)

A CNN is a nonlinear function $f_\theta$ parameterized by model parameters $\theta$, called the network weights. Here, it maps input images $x_i \in \mathbb{R}^{H \times W \times C}$ to class labels $y_i \in U^K$,

$$f_\theta : x_i \in \mathbb{R}^{H \times W \times C} \to y_i \in [0, 1]^K; \quad f_\theta(x_i) = y_i. \tag{1}$$
The network parameters are optimized on the training dataset $D_{train}$. Given a new data sample $x \in \mathbb{R}^{H \times W \times C}$, a trained CNN $f_\theta$ predicts the corresponding target $y = f_\theta(x)$ using the set of trained weights $\theta$. The network output (logit) is given by $z = f_\theta(x)$, from which a probability vector $p(y|x, D_{train}) = \mathrm{softmax}(z)$ can be computed. In the following, this probability vector will be abbreviated by $p$ and its entries by $p_k$ with $k = 1, \dots, K$ and $\sum_{k=1}^{K} p_k = 1$. Further, we get the predicted confidence $c = \max_k(p_k)$ and the predicted class label $y = \arg\max_k(p_k)$.

3.2. Monte Carlo Dropout (MCD)

MCD was investigated in [26, 27, 28] for uncertainty estimation. It is one of the most widespread Bayesian methods reviewed in [3]. It approximates the prediction $p(y|x, D_{train})$ using the mean of $S$ stochastic forward passes, $p(y|x, \theta_1), \dots, p(y|x, \theta_S)$, representing $S$ stochastic CNNs parameterized by samples $\theta_1, \theta_2, \dots, \theta_S$. That is,

$$p(y|x, D_{train}) \approx \frac{1}{S} \sum_{s=1}^{S} p(y|x, \theta_s) \approx \frac{1}{S} \sum_{s=1}^{S} f_{\theta_s}(x). \tag{2}$$

Specifically, MCD approximates the prediction with a dropout distribution realized by sampling weights with masks drawn from known distributions, such as the Gaussian, the Bernoulli, or a cascade of the Gaussian and Bernoulli distributions [22]. For example, given the activation vector $a$ fed to an MCD layer (placed, for example, at the input of the first fully-connected layer) and assuming that sampling is realized with masks drawn from a cascade of Gaussian and Bernoulli distributions, the MCD layer samples the $j$-th element of $a$ as $a_j^s = a_j \cdot \alpha_j \cdot \beta_j$ with $\alpha_j \sim \mathcal{N}(1, \sigma^2 = q/(1-q))$ and $\beta_j \sim \mathrm{Bernoulli}(q)$. Here, $q$ denotes the dropout probability. In this work, we refer to MCD as an average of $S$ stochastic CNNs.

3.3. Ensemble

An (explicit) ensemble was investigated in [4, 27, 28] for uncertainty estimation. It approximates the prediction $p(y|x, D_{train})$ by learning different settings. Given a set of CNNs $f_{\theta_m}$ for $m \in \{1, 2, \dots, M\}$, the ensemble prediction is obtained by averaging over the predictions of the CNNs. That is,

$$p(y|x, D_{train}) := \frac{1}{M} \sum_{m=1}^{M} p(y|x, \theta_m) := \frac{1}{M} \sum_{m=1}^{M} f_{\theta_m}(x). \tag{3}$$

In this work, we refer to an ensemble as an average of $M$ deterministic CNNs.

3.4. Mixture of Monte Carlo Dropout (MMCD)

MMCD was investigated in [29, 30, 31] for uncertainty estimation. It combines both MCD and the ensemble. For prediction estimation, MCD evaluates a single feature representation, but additionally considers the uncertainty associated with that feature representation. In contrast, an ensemble evaluates multiple feature representations without considering the uncertainty associated with the individual feature representations. Hence, MMCD applies MCD to an ensemble to evaluate multiple feature representations while considering the uncertainty associated with the individual feature representations. Given a set of CNNs $f_{\theta_m}$ for $m \in \{1, 2, \dots, M\}$, the MMCD prediction is obtained by averaging over the predictions of all stochastic CNNs. That is,

$$p(y|x, D_{train}) \approx \frac{1}{M \cdot S} \sum_{m=1}^{M} \sum_{s=1}^{S} p(y|x, \theta_{ms}) \approx \frac{1}{M \cdot S} \sum_{m=1}^{M} \sum_{s=1}^{S} f_{\theta_{ms}}(x). \tag{4}$$

In this work, we refer to MMCD as an average of $M \cdot S$ stochastic CNNs.
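As a concrete illustration of the cascaded Gaussian–Bernoulli sampling in Section 3.2, the following minimal NumPy sketch (an illustration with hypothetical shapes, not our training code) draws stochastic realizations $a^s$ of an activation vector $a$.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcd_cascade_mask(a, q):
    """One stochastic realization a^s of the activation vector a, using the
    cascade of Gaussian and Bernoulli masks stated in Section 3.2:
    a_j^s = a_j * alpha_j * beta_j,
    with alpha_j ~ N(1, q / (1 - q)) and beta_j ~ Bernoulli(q)."""
    alpha = rng.normal(loc=1.0, scale=np.sqrt(q / (1.0 - q)), size=a.shape)
    beta = rng.binomial(n=1, p=q, size=a.shape)
    return a * alpha * beta

# Hypothetical activations of the first fully-connected layer.
a = rng.standard_normal(512)
# S = 100 stochastic forward passes, as in our experimental setup (Sec. 5.1).
samples = np.stack([mcd_cascade_mask(a, q=0.5) for _ in range(100)])
```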
Figure 1: Example showing the difference between averaging logits (a) and averaging probabilities (b) in an ensemble.

4. Combining logits instead of probabilities

The output layer of a CNN-based classifier includes $K$ output neurons with a softmax activation function, which normalizes its inputs (continuous values) to produce discrete probabilities $p_k$ (with $k = 1, \dots, K$ and $\sum_{k=1}^{K} p_k = 1$) representing the probability that the input image belongs to the class associated with the $k$-th output neuron. The inputs to the softmax function are logits, interpreted as evidence for the possible classes [15]. The discrete probability $p_k$ is interpreted as the model confidence that the input belongs to the class associated with the $k$-th output neuron. Given the logit vector $z = [z_1 \dots z_K]^T$, the softmax estimates $p = [p_1 \dots p_K]^T$ as

$$p = \mathrm{softmax}(z) = \frac{1}{\sum_{k=1}^{K} \exp(z_k)} \left[\exp(z_1) \dots \exp(z_K)\right]^T. \tag{5}$$

From Figure 1, given an ensemble of $M$ deterministic CNNs with logits $z^m$, the average logit $\bar{z}$ can be estimated as

$$\bar{z} := \frac{1}{M} \sum_{m=1}^{M} z^m := \frac{1}{M} \sum_{m=1}^{M} f_{\theta_m}(x), \tag{6}$$

and the predicted probability vector of the ensemble of deterministic CNNs can be reformulated as

$$p(y|x, D_{train}) = \mathrm{softmax}(\bar{z}). \tag{7}$$

Given MCD representing an ensemble of $S$ stochastic CNNs with logits $z^s$, we can estimate the average logit $\bar{z}$ as

$$\bar{z} \approx \frac{1}{S} \sum_{s=1}^{S} z^s \approx \frac{1}{S} \sum_{s=1}^{S} f_{\theta_s}(x), \tag{8}$$

and reformulate the predicted probability vector of MCD as shown in (7). Similarly, given MMCD representing an ensemble of $M \cdot S$ stochastic CNNs with logits $z^{ms}$, we can estimate the average logit $\bar{z}$ as

$$\bar{z} \approx \frac{1}{M \cdot S} \sum_{m=1}^{M} \sum_{s=1}^{S} z^{ms} \approx \frac{1}{M \cdot S} \sum_{m=1}^{M} \sum_{s=1}^{S} f_{\theta_{ms}}(x), \tag{9}$$

and reformulate the predicted probability vector of MMCD as shown in (7). From Figure 2, averaging logits instead of probabilities of multiple stochastic or deterministic CNNs increases the confidence of the averaged CNNs. Intuitively, logit averaging provides the best evidence (characterized by a low level of uncertainty caused by the reduction of the inductive biases inherent in the individual logits) for making decisions. In contrast, probability averaging provides the best confidence associated with decisions made using weak evidence (characterized by a high level of uncertainty caused by the inductive biases inherent in the individual logits). This implies that a decision made using probability averaging considers more uncertainty than one made using logit averaging. In this work, we evaluated the impact of the possible increase in the degree of confidence caused by applying logit averaging instead of probability averaging on the QoC.
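The contrast between (6)–(7) and probability averaging can be reproduced in a few lines. The following NumPy sketch (with made-up logits for an ensemble of four members, mirroring the setting of Figure 2) shows how a single large-magnitude member dominates the average logit but not the average probability.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits of M = 4 ensemble members for one input (K = 3 classes).
# Member 1 produces logits with a much larger magnitude than the others.
Z = np.array([[12.0, 2.0, 1.0],
              [ 1.5, 1.0, 0.5],
              [ 1.2, 0.8, 0.6],
              [ 1.1, 0.9, 0.7]])

p_logit_avg = softmax(Z.mean(axis=0))  # Eq. (6)-(7): average logits, then softmax
p_prob_avg = softmax(Z).mean(axis=0)   # softmax per member, then average

print(p_logit_avg.max())  # ~0.91: dominated by the large-magnitude member
print(p_prob_avg.max())   # ~0.59: every member's probability weighs in equally
```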
5. Experiments

5.1. Experimental setup

We hypothesized that the QoC of CNNs (strongly) depends on the task difficulty (specified by the training data), the underlying architecture, and/or the training procedure (mostly influenced by the regularization strength). Therefore, we compared logit and probability averaging on three datasets to evaluate the impact of the task difficulty on the QoC. Moreover, we compared logit and probability averaging using three different architectures to evaluate the impact of the underlying architecture on the QoC. Specifically, we evaluated MNIST [32] on VGGNets [1], FashionMNIST [33] on ResNets [2], and CIFAR10 [34] on DenseNets [35]. Finally, we compared logit and probability averaging on CNNs trained using two regularization strengths (strong and weak regularization, summarized in Table 1) to evaluate the impact of the regularization strength on the QoC. We observed that strong and weak regularization result in underconfident and overconfident CNNs, respectively. All CNNs were regularized using batch normalization [36] layers placed before each convolutional activation function. All CNNs were randomly initialized and trained with random shuffling of the training samples. All CNNs were trained using the categorical cross-entropy and stochastic gradient descent with a momentum of 0.9, a learning rate of 0.02, a batch size of 128, and 100 epochs. All images were standardized and normalized by dividing pixel values by 255. For all MCD and MMCD models, we sampled the activations of the first fully-connected layer using masks drawn from a cascade of Bernoulli and Gaussian distributions [22] and a dropout probability of 0.5. We performed 100 stochastic forward passes ($S = 100$) and considered ensembles consisting of five deterministic CNNs ($M = 5$).

Table 1: Summary of values assigned to regularization hyper-parameters.

| Hyper-parameter | Weak regularization | Strong regularization |
|---|---|---|
| Probability of dropout applied at inputs to max pooling layers | - | 0.05 |
| Probability of dropout applied at inputs to fully-connected layers | 0.05 | 0.5 |
| Rotation range [degree] | [-5, +5] | [-45, +45] |
| Width and height shift range [pixel] | [-1, +1] | [-5, +5] |
| Scale intensity range | [0.95, 1.05] | [0.9, 1.2] |
| Shear intensity range | 0.05 | 0.1 |
| Additive Gaussian noise standard deviation range | 0.05 | 0.5 |

Figure 2: Example showing how averaging logits instead of probabilities increases the confidence of an ensemble of four deterministic CNNs: $\bar{z} = \frac{1}{4}\sum_{m=1}^{4} z^m$, $p_{\bar{z}} = \mathrm{softmax}(\bar{z})$, and $\bar{p} = \frac{1}{4}\sum_{m=1}^{4} p^m$ with $p^1 = \mathrm{softmax}(z^1)$, $p^2 = \mathrm{softmax}(z^2)$, $p^3 = \mathrm{softmax}(z^3)$, and $p^4 = \mathrm{softmax}(z^4)$. One can see that averaging logits ($p_{\bar{z}}$) results in more confident predictions than averaging probabilities ($\bar{p}$). This is attributed to averaging logits being more sensitive to the magnitude of the logit values than averaging probabilities. Here, $z^m$ with large values contributes most to $\bar{z}$; in our example, $\bar{z}$ is mostly influenced by the values of $z^1$, while the contributions of $z^2$, $z^3$, and $z^4$ to $\bar{z}$ are minor. In contrast, $\bar{p}$ is influenced by the values of all probability vectors $p^m$ and is therefore less sensitive to the magnitude of the individual logits.
5.2. Evaluation metrics

The QoC was evaluated by assessing the degree of confidence calibration. Specifically, we evaluated the calibration error using measures such as the negative log likelihood (NLL) applied in [4, 5, 31], the expected calibration error (ECE) applied in [13, 8, 12], and the Brier score (BS) applied in [4]. Low values of NLL, ECE, and BS indicate a low calibration error and vice versa. Furthermore, we evaluated the QoC by assessing the ability to separate TPs and FPs. Here, we evaluated the average confidence on evaluation data causing TPs or FPs. Given evaluation data causing TPs, we expect the average confidence on the evaluation data to be high; for evaluation data causing FPs, we expect the average confidence to be low. Moreover, we evaluated the ability to separate TPs and FPs using the area under the receiver operating characteristic (AUC-ROC) applied in [37, 5]. The AUC-ROC summarizes the trade-off between the fraction of TPs that are correctly detected and the fraction of FPs that remain undetected under different thresholds. In summary, in addition to the NLL, ECE, and BS, we evaluated the accuracy, average confidence, and AUC-ROC.
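For reference, the following is a minimal sketch of the two central measures, assuming the usual equal-width binning for the ECE [13] and using scikit-learn's roc_auc_score for the TP/FP separability; it illustrates the standard definitions rather than our exact evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(conf, correct, n_bins=15):
    """ECE with equal-width confidence bins: the accuracy-confidence gap
    per bin, weighted by the fraction of samples falling in that bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap
    return ece

# conf: predicted confidence per sample; correct: 1 for a TP, 0 for an FP.
conf = np.array([0.95, 0.90, 0.80, 0.60, 0.55])
correct = np.array([1, 1, 1, 0, 0])

print(expected_calibration_error(conf, correct))
# Separability of TPs and FPs when thresholding the confidence:
print(roc_auc_score(correct, conf))  # 1.0 here: all TPs rank above all FPs
```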
5.3. Evaluation data

We used five types of evaluation data for different purposes, namely test data, subsets of the correctly classified test data, out-of-domain data, swapped data, and noisy data (a code sketch of the swapped and noisy perturbations is given after this list).

Figure 3: Examples of evaluation data for experiments conducted on CIFAR10: (a) test data, (b) swapped data, (c) noisy data.

Test data represent the test data of the experimental datasets, namely, MNIST, CIFAR10, and FashionMNIST. These datasets include both correctly classified and misclassified test data. The test data are used for estimating the accuracy, NLL, ECE, and BS. We expect the accuracy to be high and the NLL, ECE, and BS to be low on the test data.

Subsets of the correctly classified test data include 1000 correctly classified test samples from the experimental datasets. Since CNNs will make TPs on these data, we used them for evaluating the average confidence on TPs.

Swapped data were simulated using subsets of the correctly classified test data structurally perturbed by dividing the images into four regions and diagonally permuting the regions. From Figure 3b, the upper left and upper right regions are permuted with the bottom right and bottom left regions, respectively. Swapped data thus contain structurally perturbed objects within the given images. We expect CNNs to make FPs on swapped data. Therefore, we used these data for evaluating the average confidence on FPs caused by structurally perturbed objects.

Noisy data were simulated using subsets of the correctly classified test data perturbed by applying additive Gaussian noise with a standard deviation of 500. From Figure 3c, noisy data contain noise within the given images. We expect CNNs to make FPs on these data. Therefore, we used these data for evaluating the average confidence on FPs caused by noisy objects.

Out-of-domain data were simulated using 1000 test samples of CIFAR100 [34]. Since CNNs will make FPs on these data, we used them for evaluating the average confidence on FPs caused by unknown objects.

In general, we expect the average confidence to be high on TPs and low on FPs.
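The two simulated perturbations can be written compactly. The following NumPy sketch assumes images with even spatial dimensions and raw pixel values in [0, 255]; applying the standard deviation of 500 on the raw pixel scale is our reading of the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def swap_quadrants(img):
    """Swapped data: split the image into four regions and permute them
    diagonally (upper left <-> bottom right, upper right <-> bottom left).
    Assumes even height and width, e.g., 32x32 CIFAR10 images."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    out = img.copy()
    out[:h, :w] = img[h:, w:]  # upper left   <- bottom right
    out[h:, w:] = img[:h, :w]  # bottom right <- upper left
    out[:h, w:] = img[h:, :w]  # upper right  <- bottom left
    out[h:, :w] = img[:h, w:]  # bottom left  <- upper right
    return out

def add_gaussian_noise(img, std=500.0):
    """Noisy data: additive Gaussian noise with standard deviation 500 on
    the raw pixel scale, clipped back to the valid range."""
    noisy = img.astype(np.float64) + rng.normal(0.0, std, size=img.shape)
    return np.clip(noisy, 0, 255)

img = rng.integers(0, 256, size=(32, 32, 3))  # stand-in for a CIFAR10 image
swapped, noisy = swap_quadrants(img), add_gaussian_noise(img)
```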
gued that the more overconfident the CNNs, the higher
Table 2
Comparison of accuracy[%], average confidence[%] (in bracket), NLL[10−2 ], ECE[10−2 ], and BS[10−2 ] of different models
using two approaches for averaging underconfident CNNs trained using strong regularization: average probabilities (AP) and
average logits (AL). The results were obtained using the test data described in Section 5.3.
Accuracy (Average confidence)↑ NLL ↓ ECE ↓ BS ↓
AP AL AP AL AP AL AP AL
CIFAR10 (DenseNets)
Ensemble 89.52 (84.31) 89.60 (87.97) 34.66 32.81 5.23 2.47 16.13 15.38
MCD 85.36 (73.35) 85.37 (80.04) 52.13 46.55 12.04 5.40 23.32 21.57
MMCD 88.75 (70.51) 88.94 (79.42) 50.83 40.99 18.24 9.55 21.82 18.34
FashionMNIST (ResNets)
Ensemble 92.70 (87.86) 92.58 (90.16) 22.57 20.99 5.15 2.86 11.37 10.87
MCD 90.56 (79.22) 90.56 (83.95) 35.45 30.18 11.47 6.85 15.82 14.57
MMCD 92.65 (76.37) 92.73 (83.78) 35.57 26.96 16.31 9.10 14.87 12.47
MNIST (VGGNets)
Ensemble 99.04 (98.24) 99.04 (98.89) 3.25 2.90 1.03 0.52 1.52 1.41
MCD 98.16 (94.53) 98.16 (96.48) 8.73 6.87 3.81 1.98 2.99 2.79
MMCD 99.03 (94.67) 99.04 (97.46) 6.91 4.13 4.49 1.75 1.89 1.52
Table 3
Comparison of accuracy[%], average confidence[%] (in bracket), NLL[10−2 ], ECE[10−2 ], and BS[10−2 ] of ensembles using
two approaches for averaging overconfident CNNs trained using weak regularization: average probabilities (AP) and average
logits (AL). The results were obtained using the test data described in Section 5.3.
Accuracy (Average confidence) ↑ NLL ↓ ECE ↓ BS ↓
AP AL AP AL AP AL AP AL
CIFAR10 (DenseNets) 88.67 (89.43) 88.88 (96.17) 40.69 54.23 3.03 7.40 16.69 18.07
FashionMNIST (ResNets) 94.49 (95.86) 94.58 (98.43) 20.20 28.00 1.98 4.11 8.36 9.32
the confidence and the higher the gap between accuracy 84.80% to 42.42%.
and average confidence. Here, the increase in the de-
gree of confidence caused by averaging logits instead of
probabilities further increases the gap between the ac- 6. Discussion
curacy and average confidence and therefore, increases
The term ‘combination process’ encompasses how mul-
the calibration error. For example, Table 3 shows that, on
tiple networks are combined and the information type
CIFAR10, averaging logits of the ensemble increases the
combined. It was found in [23, 24, 22] that simple averag-
gap between the accuracy and average confidence from
ing is more robust and captures uncertainty better than
0.76(= |88.67−89.43|)% to 7.29(= |88.88−96.17|)%.
voting approaches. This is because the simple averag-
In Table 4, the average confidence on TPs and FPs is
ing equally weights all predictions, while voting ignores
shown for underconfident models using both averaging
uncertain predictions. In this work, we compared the
approaches. The results show that averaging logits instead
process of averaging logits instead of probabilities. We
of probabilities increases the confidence level on TPs and
empirically showed that averaging logits instead of prob-
FPs. The increase in the average confidence is sometimes
abilities increases the confidence while preserving the
very large for FPs due to the noisy data. For example, for
accuracy for underconfident or overconfident networks.
MMCD evaluated on FashionMNIST, the average confi-
This might be because logit averaging preserves the po-
dence on the noisy data increases from 51.31% to 94.58%
sition of the max element of individual logit vectors, but
when averaging logits. This is because noisy data can
is more sensitive to the magnitude of logit values than
increase the magnitude of logits and averaging logits is
probability averaging. Thus, logit values with a large
more sensitive to changes in the magnitude of logits than
magnitude contribute the most to the average logit. In
averaging probabilities (see Figure 2). The increase in the
this way, the magnitude of logit values induces a non-
degree of confidence caused by averaging logits can harm
uniform weighting (for logit averaging), which is lost
the separability of TPs and FPs. For example, the increase
(for probability averaging). Furthermore, we provided
in the average confidence on the noisy data from 51.31%
empirical evidence showing that for underconfident net-
to 94.58% causes the AUC-ROC obtained based on the
works (trained using strong regularization), the increase
evaluation of the degree of confidence to decrease from
in the confidence caused by averaging logits instead of
Table 4: Comparison of average confidence [%] of different models using two approaches (average probabilities (AP) and average logits (AL)) for averaging underconfident networks trained using strong regularization and evaluated on TPs and FPs: TPs were obtained on subsets of the correctly classified test data, while FPs were obtained on the swapped, noisy, and out-of-domain (OOD) data described in Section 5.3.

| Model | TP ↑ AP | TP ↑ AL | FP (OOD) ↓ AP | FP (OOD) ↓ AL | FP (Swapped) ↓ AP | FP (Swapped) ↓ AL | FP (Noisy) ↓ AP | FP (Noisy) ↓ AL |
|---|---|---|---|---|---|---|---|---|
| CIFAR10 (DenseNets) | | | | | | | | |
| Ensemble | 93.94 | 96.63 | 35.39 | 40.08 | 51.84 | 56.03 | 39.42 | 58.69 |
| MCD | 81.39 | 88.45 | 31.61 | 33.27 | 40.39 | 44.69 | 44.83 | 69.53 |
| MMCD | 79.48 | 89.53 | 22.81 | 23.83 | 36.26 | 40.67 | 28.01 | 33.08 |
| FashionMNIST (ResNets) | | | | | | | | |
| Ensemble | 88.01 | 90.16 | 55.48 | 63.21 | 59.30 | 67.91 | 81.39 | 99.82 |
| MCD | 79.39 | 83.76 | 47.08 | 50.36 | 55.75 | 59.29 | 41.23 | 65.79 |
| MMCD | 76.40 | 83.76 | 42.76 | 49.09 | 45.73 | 52.70 | 51.31 | 94.58 |
| MNIST (VGGNets) | | | | | | | | |
| Ensemble | 99.09 | 99.55 | 57.16 | 80.45 | 51.96 | 62.01 | 69.58 | 88.84 |
| MCD | 95.12 | 97.11 | 64.36 | 69.17 | 58.92 | 62.84 | 97.95 | 99.53 |
| MMCD | 95.37 | 98.17 | 48.89 | 63.56 | 43.53 | 49.39 | 57.17 | 78.14 |

6. Discussion

The term "combination process" encompasses how multiple networks are combined and the information type that is combined. It was found in [23, 24, 22] that simple averaging is more robust and captures uncertainty better than voting approaches. This is because simple averaging weights all predictions equally, while voting ignores uncertain predictions. In this work, we compared the process of averaging logits with that of averaging probabilities. We empirically showed that averaging logits instead of probabilities increases the confidence while preserving the accuracy for underconfident or overconfident networks. This might be because logit averaging preserves the position of the maximal element of the individual logit vectors, but is more sensitive to the magnitude of the logit values than probability averaging. Thus, logit values with a large magnitude contribute the most to the average logit. In this way, the magnitude of the logit values induces a non-uniform weighting (for logit averaging), which is lost (for probability averaging). Furthermore, we provided empirical evidence showing that for underconfident networks (trained using strong regularization), the increase in confidence caused by averaging logits instead of probabilities reduces the calibration error on the test data. This is because the increase in the degree of confidence reduces the gap between accuracy and average confidence. However, for overconfident networks (trained using weak regularization), the increase in confidence caused by averaging logits instead of probabilities increases the calibration error on the test data. This is because the increase in confidence further increases the gap between accuracy and average confidence. These findings suggest that for underconfident networks, we can average logits instead of probabilities to reduce the calibration error, whereas for overconfident networks, we should average probabilities instead of logits to avoid increasing the calibration error. Although the increase in confidence caused by averaging logits reduces the calibration error on the test data for underconfident networks, we empirically showed that it can harm the separability of TPs and FPs. This is because averaging logits increases the confidence on both TPs and FPs; therefore, FPs can also be made with high confidence, similar to TPs. These findings suggest that reducing the calibration error on the test data and improving the separability of TPs and FPs can be two contradicting goals: improving one may come at the detriment of the other. Furthermore, for two models $A$ and $B$, if $A$ is better calibrated than $B$, then $A$ does not necessarily separate TPs and FPs better than $B$. This implies that calibration methods may be insufficient for separating TPs and FPs and, therefore, for ensuring safe decision-making. Additionally, existing methods for confidence calibration may not help in separating TPs and FPs. Subsequently, future work will evaluate the ability of existing methods for confidence calibration to separate TPs and FPs. We also recommend that researchers evaluate both the calibration error of their proposed method for confidence calibration and the ability of their proposed method to separate TPs and FPs. Finally, for mission- and safety-critical applications where the separability of TPs and FPs is of paramount importance, we suggest averaging probabilities to avoid the negative impact of logit averaging on the ability to separate TPs and FPs.

7. Conclusion

Averaging logits instead of averaging probabilities of stochastic or deterministic networks increases the degree of confidence on TPs and FPs. This reduces the calibration error on the test data for underconfident networks, but harms the separability of TPs and FPs. Our empirical results show that there is a trade-off between improving the calibration on the test data and improving the separability of TPs and FPs. Additionally, the increase in the degree of confidence increases the calibration error on the test data for overconfident networks. Therefore, averaging logits should only be applied when combining underconfident networks. For example, we can average logits instead of probabilities of an ensemble of networks trained with mixup or other modern data augmentation techniques to improve the calibration on the test data. Notwithstanding this, for mission- and safety-critical applications where the separability of TPs and FPs is essential, we suggest the traditional averaging of probabilities. However, it remains unclear whether the findings of this paper would change if the given networks or the average logit were calibrated, for example, with temperature scaling [13]. This suggests a new research direction.
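To make this open question concrete, temperature scaling [13] applied to the average logit of (6) could look like the following minimal sketch (an illustration with stand-in data, not an experiment from this paper), where the temperature $T$ is fit on a held-out validation set by minimizing the NLL.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def nll(T, z_bar, labels):
    """Mean negative log likelihood of softmax(z_bar / T)."""
    p = softmax(z_bar / T)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

# Stand-in validation data: z_bar would be the average logits (Eq. 6) of the
# combined networks on a held-out set, and labels the true class indices.
z_bar = np.random.default_rng(0).normal(size=(1000, 10))
labels = np.random.default_rng(1).integers(0, 10, size=1000)

res = minimize_scalar(nll, bounds=(0.05, 10.0), args=(z_bar, labels),
                      method="bounded")
T = res.x  # calibrated temperature; probabilities become softmax(z_bar / T)
```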
References

[1] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015. URL: http://arxiv.org/abs/1409.1556.
[2] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[3] J. Gawlikowski, C. R. N. Tassi, M. Ali, J. Lee, M. Humt, J. Feng, A. Kruspe, R. Triebel, P. Jung, R. Roscher, et al., A survey of uncertainty in deep neural networks, arXiv preprint arXiv:2107.03342 (2021).
[4] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in Neural Information Processing Systems 30 (2017).
[5] S. Thulasidasan, G. Chennupati, J. A. Bilmes, T. Bhattacharya, S. Michalak, On mixup training: Improved calibration and predictive uncertainty for deep neural networks, Advances in Neural Information Processing Systems 32 (2019).
[6] Y. Qin, X. Wang, A. Beutel, E. Chi, Improving calibration through the relationship with adversarial robustness, in: A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, 2021. URL: https://openreview.net/forum?id=NJex-5TZIQa.
[7] Y. Wen, G. Jerfel, R. Muller, M. W. Dusenberry, J. Snoek, B. Lakshminarayanan, D. Tran, Combining ensembles and data augmentation can harm your calibration, arXiv preprint arXiv:2010.09875 (2020).
[8] R. Rahaman, A. H. Thiery, Uncertainty quantification and deep ensembles, Advances in Neural Information Processing Systems 34 (2021).
[9] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, in: International Conference on Learning Representations, 2018.
[10] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[11] R. Müller, S. Kornblith, G. Hinton, When does label smoothing help?, Curran Associates Inc., Red Hook, NY, USA, 2019.
[12] X. Wu, M. Gales, Should ensemble members be calibrated?, arXiv preprint arXiv:2101.05397 (2021).
[13] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1321–1330.
[14] Z. Zhang, A. V. Dalca, M. R. Sabuncu, Confidence calibration for convolutional neural networks using structured dropout, arXiv preprint arXiv:1906.09551 (2019).
[15] M. Sensoy, L. Kaplan, M. Kandemir, Evidential deep learning to quantify classification uncertainty, Advances in Neural Information Processing Systems 31 (2018).
[16] C. Ju, A. Bibaut, M. van der Laan, The relative performance of ensemble methods with deep convolutional neural networks for image classification, Journal of Applied Statistics 45 (2018) 2800–2818.
[17] L. I. Kuncheva, Combining pattern classifiers: methods and algorithms, John Wiley & Sons, 2014.
[18] M. Van Erp, L. Vuurpijl, L. Schomaker, An overview and comparison of voting methods for pattern recognition, in: Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition, IEEE, 2002, pp. 195–200.
[19] T. Tajti, New voting functions for neural network algorithms, in: Annales Mathematicae et Informaticae, volume 52, Eszterházy Károly Egyetem Líceum Kiadó, 2020, pp. 229–242.
[20] T. G. Dietterich, Machine-learning research, AI Magazine 18 (1997) 97–97.
[21] S. Tulyakov, S. Jaeger, V. Govindaraju, D. Doermann, Review of classifier combination methods, Machine Learning in Document Analysis and Recognition (2008) 361–386.
[22] C. R. Njieutcheu Tassi, Bayesian convolutional neural network: Robustly quantify uncertainty for misclassifications detection, in: Mediterranean Conference on Pattern Recognition and Artificial Intelligence, Springer, 2019, pp. 118–132.
[23] R. T. Clemen, Combining forecasts: A review and annotated bibliography, International Journal of Forecasting 5 (1989) 559–583.
[24] J. Kittler, M. Hatef, R. P. Duin, J. Matas, On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 226–239.
[25] K. C. Lichtendahl Jr, Y. Grushka-Cockayne, R. L. Winkler, Is it better to average probabilities or quantiles?, Management Science 59 (2013) 1594–1611.
[26] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.
[27] W. H. Beluch, T. Genewein, A. Nürnberger, J. M. Köhler, The power of ensembles for active learning in image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9368–9377.
[28] F. K. Gustafsson, M. Danelljan, T. B. Schon, Evaluating scalable Bayesian deep learning methods for robust computer vision, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 318–319.
[29] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, S. Levine, Uncertainty-aware reinforcement learning for collision avoidance, arXiv preprint arXiv:1702.01182 (2017).
[30] B. Lütjens, M. Everett, J. P. How, Safe reinforcement learning with model uncertainty estimates, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 8662–8668.
[31] A. G. Wilson, P. Izmailov, Bayesian deep learning and a probabilistic perspective of generalization, Advances in Neural Information Processing Systems 33 (2020) 4697–4708.
[32] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998) 2278–2324.
[33] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747 (2017).
[34] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images (2009).
[35] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[36] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, PMLR, 2015, pp. 448–456.
[37] D. Hendrycks, K. Gimpel, A baseline for detecting misclassified and out-of-distribution examples in neural networks, in: Proceedings of the International Conference on Learning Representations, 2017.