<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">The impact of averaging logits over probabilities on ensembles of neural networks</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Cedrique</forename><forename type="middle">Rovile</forename><surname>Njieutcheu Tassi</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">German Aerospace Center (DLR)</orgName>
								<orgName type="department" key="dep2">Institute of Optical Sensor Systems</orgName>
								<address>
									<addrLine>Rutherfordstraße 2</addrLine>
									<postCode>12489</postCode>
									<settlement>Berlin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jakob</forename><surname>Gawlikowski</surname></persName>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">German Aerospace Center (DLR)</orgName>
								<orgName type="department" key="dep2">Institute of Data Science</orgName>
								<address>
									<addrLine>Mälzerstraße 3-5</addrLine>
									<postCode>07745</postCode>
									<settlement>Jena</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Auliya</forename><forename type="middle">Unnisa</forename><surname>Fitri</surname></persName>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">German Aerospace Center (DLR)</orgName>
								<orgName type="department" key="dep2">Institute of Data Science</orgName>
								<address>
									<addrLine>Mälzerstraße 3-5</addrLine>
									<postCode>07745</postCode>
									<settlement>Jena</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rudolph</forename><surname>Triebel</surname></persName>
							<affiliation key="aff2">
								<orgName type="department" key="dep1">German Aerospace Center (DLR)</orgName>
								<orgName type="department" key="dep2">Institute of Robotics and Mechatronics</orgName>
								<address>
									<addrLine>Münchener Straße 20</addrLine>
									<postCode>82234</postCode>
									<settlement>Wessling</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">The impact of averaging logits over probabilities on ensembles of neural networks</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">251D5E357E785C67228F345F12838CEA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T23:21+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Model averaging</term>
					<term>Combination process</term>
					<term>Logit averaging</term>
					<term>Probability averaging</term>
					<term>Ensemble</term>
					<term>Monte Carlo Dropout (MCD)</term>
					<term>Mixture of Monte Carlo Dropout (MMCD)</term>
					<term>Quality of confidence (QoC)</term>
					<term>Confidence calibration</term>
					<term>Separating true predictions (TPs) and false predictions (FPs)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Model averaging has become a standard for improving neural networks in terms of accuracy, calibration, and the ability to detect false predictions (FPs). However, recent findings show that model averaging does not necessarily lead to calibrated confidences, especially for underconfident networks. While existing methods for improving the calibration of combined networks focus on recalibrating, building, or sampling calibrated models, we focus on the combination process. Specifically, we evaluate the impact of averaging logits instead of probabilities on the quality of confidence (QoC). We compare combining logits instead of probabilities of members (networks) for models such as ensembles, Monte Carlo Dropout (MCD), and Mixture of Monte Carlo Dropout (MMCD). The comparison is based on experimental results on three datasets using three different architectures. We show that averaging logits instead of probabilities increases the confidence, thereby improving the confidence calibration of underconfident models. For example, for MCD evaluated on CIFAR10, averaging logits instead of probabilities reduces the expected calibration error (ECE) from 12.03% to 5.44%. However, the increase in confidence can harm the confidence calibration of overconfident models and the separability between true predictions (TPs) and FPs. For example, for MMCD evaluated on FashionMNIST, the average confidence on FPs due to the noisy data increases from 51.31% to 94.58% when averaging logits instead of probabilities. While averaging logits can be applied to underconfident models to improve the calibration on test data, we suggest averaging probabilities for safety- and mission-critical applications where the separability of TPs and FPs is of paramount importance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Recently, averaging the predictions of multiple stochastic or deterministic networks has become a standard approach for improving accuracy <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> and uncertainty estimates <ref type="bibr" target="#b2">[3]</ref>. Generally, the quality of uncertainty estimates (e.g.: QoC) is assessed by the degree of calibration and/or the ability to detect FPs. Model averaging can yield well-calibrated confidence <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref> and is one of the state-of-the-art methods for detecting FPs caused by out-of-distribution examples <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b2">3]</ref>. However, recent findings <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref> show that model averaging does not necessarily lead to calibrated confidence, especially when the networks are built using modern regularization techniques, such as mixup <ref type="bibr" target="#b8">[9]</ref> or label smoothing <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>. This is because modern regularization techniques can (strongly) regularize networks, resulting in underconfidence. Furthermore, averaging underconfident networks</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022)</head><p>Cedrique.NjieutcheuTassi@dlr.de (C. R. N. Tassi); Jakob.Gawlikowski@dlr.de (J. Gawlikowski); Auliya.Fitri@dlr.de (A. U. Fitri); Rudolph.Triebel@dlr.de (R. Triebel) produces an even more underconfident network. For example, <ref type="bibr" target="#b6">[7]</ref> showed that averaging networks trained with modern regularization techniques resulted in more underconfident networks and therefore miscalibrated predictions. <ref type="bibr" target="#b11">[12]</ref> supported this argument by theoretically and empirically showing that averaging calibrated networks does not always lead to calibrated confidences. Calibrating the confidences of averaged networks has received little attention in the literature. Generally, post-processing calibration methods, such as temperature scaling <ref type="bibr" target="#b12">[13]</ref>, can be used to recalibrate the confidences of averaged networks, as demonstrated in <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b11">12]</ref>. According to <ref type="bibr" target="#b13">[14]</ref>, and as further supported by <ref type="bibr" target="#b7">[8]</ref>, confidence calibration in model averaging is correlated with the diversity inherent in the individual networks: the more diverse the networks, the better the calibration. Motivated by this observation, <ref type="bibr" target="#b13">[14]</ref> promoted model diversity using structured dropout to reduce calibration errors. <ref type="bibr" target="#b6">[7]</ref> proposed class-adjusted mixup, which trains less confident networks by evaluating the difference between the accuracy (estimated on a validation dataset after each training epoch) and the confidence of each training sample, activating or deactivating mixup training for overconfidence (average confidence &gt; accuracy) or underconfidence (average confidence &lt; accuracy), respectively. 
All these methods for improving the calibration of combined networks focus on recalibrating, building, or sampling calibrated networks. This work, in contrast, focuses on the combination process itself. Specifically, we address the question: What is the impact of averaging logits instead of probabilities of multiple (stochastic or deterministic) networks on the QoC?</p><p>We hypothesized that averaging logits instead of probabilities of multiple networks increases the confidence of the averaged network. This is because logits (the inputs to the softmax), which can be interpreted as the evidence found for the possible classes <ref type="bibr" target="#b14">[15]</ref>, are continuous values normalized using the softmax to produce discrete probabilities. The softmax normalization of continuous values (logits) to discrete values (probabilities) causes information loss and a resulting insensitivity to changes in the magnitudes of logits. That is, the softmax is a nonlinear function that maps multiple logit vectors with large differences in magnitude to the same discrete probability vector. We evaluated the impact of the increase in confidence caused by averaging logits instead of probabilities on the QoC. Specifically, we evaluated the QoC by assessing the degree of confidence calibration, which measures the difference between the predicted probability (average confidence) and the true probability (empirical accuracy). Furthermore, we evaluated the QoC by assessing the ability of the confidence to separate TPs and FPs. To provide empirical evidence, we set logit averaging against probability averaging and compared both approaches using different averaged models, such as ensembles, MCD, and MMCD. 
The comparison was based on results from different experiments conducted on three datasets, namely, MNIST, FashionMNIST, and CIFAR10 evaluated on VGGNet, ResNet, and DenseNet, respectively.</p><p>Results show that averaging logits instead of probabilities preserves accuracy, but increases confidence. For example, for MCD evaluated on CIFAR10 (see Table <ref type="table">2</ref>), the accuracy remained around 85.36% while the average confidence increased from 73.35% to 80.04% when we averaged logits instead of probabilities. Furthermore, given underconfident models, the increase in the degree of confidence reduces the calibration error on the test data. For example, for MCD evaluated on CIFAR10, ECE dropped from 12.04% to 5.40% when the average confidence increased from 73.35% to 80.04%. However, given overconfident models, the increase in the degree of confidence increased the calibration error on the test data. For example, for the ensemble evaluated on CIFAR10 (see Table <ref type="table" target="#tab_3">3</ref>), ECE increased from 3.03% to 7.40% when the average confidence increased from 89.43% to 96.17%. Finally, for underconfident or overconfident models, the increase in the degree of confidence can harm the separability between TPs and FPs. This is because averaging logits instead of probabilities increases the confidence of both TPs and FPs. Therefore, FPs can be made with high confidence similar to TPs. For example, for MMCD evaluated on FashionMNIST (see Table <ref type="table">4</ref>), the average confidence on FPs due to the noisy data increased from 51.31% to 94.58% when averaging logits instead of probabilities. In summary, we provide empirical evidence demonstrating how combining logits instead of probabilities of multiple (stochastic or deterministic) networks • preserves accuracy, but increases the confidence on TPs and FPs. 
• reduces the calibration error (given underconfident networks), but increases the calibration error (given overconfident networks). • can harm the separability between TPs and FPs.</p></div>
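The softmax information-loss argument above can be illustrated with a minimal NumPy sketch (the logit values are illustrative, not from the paper's experiments): softmax is invariant to adding a constant to all logits, so logit vectors with very different magnitudes map to the same probability vector, while scaling the logits changes the confidence.

```python
import numpy as np

def softmax(z):
    # Subtracting the max improves numerical stability; softmax is shift-invariant.
    e = np.exp(z - z.max())
    return e / e.sum()

z_small = np.array([2.0, 1.0, 0.0])
z_large = z_small + 100.0  # same relative evidence, much larger magnitude

p_small = softmax(z_small)
p_large = softmax(z_large)
assert np.allclose(p_small, p_large)  # the magnitude information is lost

# Scaling the logits, however, changes the confidence (max probability):
p_scaled = softmax(3.0 * z_small)
print(p_small.max(), p_scaled.max())  # the scaled logits are more confident
```

This is the sense in which distinct logit vectors collapse to one discrete probability vector after normalization.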
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related works</head><p>The combination process describes how multiple members are combined and the information type (e.g., logits or probabilities) that is combined. Several approaches, such as stacking <ref type="bibr" target="#b15">[16]</ref> and voting <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19]</ref>, have been reported for aggregating multiple predictions. Some of these approaches have been reviewed and discussed in <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref> and experimentally compared in <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b15">16]</ref> to find the one with the best accuracy. Whether one approach improves accuracy more than another was found to depend on several factors, such as the number of members, the diversity inherent in individual members, and the accuracy of individual members. In <ref type="bibr" target="#b21">[22]</ref>, however, we compared approaches such as averaging, plurality voting, and majority voting to find the one that best captures uncertainty. We found that the averaging approach captures uncertainty better than the voting approaches. Before our work, <ref type="bibr" target="#b22">[23]</ref> argued that simple averaging approaches are more robust than voting approaches, an argument further supported by <ref type="bibr" target="#b23">[24]</ref>. This is because the averaging approach considers all members' predictions, whereas plurality/majority voting ignores uncertain predictions and therefore reduces the uncertainty in the combined members' prediction. Although various combination approaches have been presented and compared in the literature, the information type that is combined has received relatively little attention. 
<ref type="bibr" target="#b24">[25]</ref> showed that averaging quantiles rather than probabilities improves the predictive performance. Generally, for neural networks and classification problems in particular, multiple members (networks) are combined by averaging probabilities <ref type="bibr" target="#b15">[16]</ref>. <ref type="bibr" target="#b15">[16]</ref> evaluated the impact of combining logits instead of probabilities on accuracy; however, the impact on the QoC remains unclear. Thus, we investigated the impact of combining logits instead of probabilities on the QoC.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Background</head><p>In the context of image classification, let the training data 𝐷𝑡𝑟𝑎𝑖𝑛 = {𝑥𝑖 ∈ R 𝐻×𝑊 ×𝐶 , 𝑦𝑖 ∈ 𝑈 𝐾 } 𝑖∈[1,𝑁 ] be a realization of independently and identically distributed random variables (𝑥, 𝑦) ∈ 𝑋 × 𝑌 , where 𝑥𝑖 denotes the 𝑖 𝑡ℎ input and 𝑦𝑖 its corresponding one-hot encoded class label from 𝑈 𝐾 , the set of standard unit vectors of R 𝐾 . 𝑋 and 𝑌 denote the input and label spaces. 𝐻 × 𝑊 × 𝐶 denotes the dimension of the input images, where 𝐻, 𝑊 , and 𝐶 refer to the height, width, and number of channels, respectively. 𝐾 and 𝑁 denote the numbers of possible output classes and samples within the training data, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Convolutional neural network (CNN)</head><p>A CNN is a nonlinear function 𝑓 𝜃 parameterized by model parameters 𝜃, called the network weights. Here, it maps input images 𝑥𝑖 ∈ R 𝐻×𝑊 ×𝐶 to class labels 𝑦𝑖 ∈ 𝑈 𝐾 ,</p><formula xml:id="formula_0">𝑓 𝜃 : 𝑥𝑖 ∈ R 𝐻×𝑊 ×𝐶 → 𝑦𝑖 ∈ [0, 1] 𝐾 ; 𝑓 𝜃 (𝑥𝑖) = 𝑦𝑖 (1)</formula><p>The network parameters are optimized on the training dataset 𝐷𝑡𝑟𝑎𝑖𝑛. Given a new data sample 𝑥 ∈ R 𝐻×𝑊 ×𝐶 , a trained CNN 𝑓 𝜃 predicts the corresponding target 𝑦 = 𝑓 𝜃 (𝑥) using the set of trained weights 𝜃. The pre-softmax network output (the logit vector) is 𝑧, from which a probability vector 𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧) can be computed. In the following, this probability vector is abbreviated by 𝑝 and its entries by 𝑝 𝑘 with 𝑘 = 1, . . . , 𝐾 and ∑︀ 𝐾 𝑘=1 𝑝 𝑘 = 1. Further, we get the predicted confidence 𝑐 = max 𝑘 (𝑝 𝑘 ) and the predicted class label 𝑦 = arg max 𝑘 (𝑝 𝑘 ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Monte Carlo Dropout (MCD)</head><p>MCD was investigated in <ref type="bibr" target="#b25">[26,</ref><ref type="bibr" target="#b26">27,</ref><ref type="bibr" target="#b27">28]</ref> for uncertainty estimation. It is one of the most widespread Bayesian methods reviewed in <ref type="bibr" target="#b2">[3]</ref>. It approximates the prediction 𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) using the mean of 𝑆 stochastic forward passes, 𝑝(𝑦|𝑥, 𝜃1), ..., 𝑝(𝑦|𝑥, 𝜃𝑆), representing 𝑆 stochastic CNNs parameterized by samples 𝜃1, 𝜃2, ..., and 𝜃𝑆. That is,</p><formula xml:id="formula_1">𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) ≈ 1 𝑆 𝑆 ∑︁ 𝑠=1 𝑝(𝑦|𝑥, 𝜃𝑠) ≈ 1 𝑆 𝑆 ∑︁ 𝑠=1 𝑓𝜃 𝑠 (𝑥). (2)</formula><p>Specifically, MCD approximates the prediction with a dropout distribution realized by sampling weights with masks drawn from known distributions, such as Gaussian, Bernoulli, or a cascade of Gaussian and Bernoulli distributions <ref type="bibr" target="#b21">[22]</ref>. For example, given the activation vector 𝑎 fed to an MCD layer (placed, for example, at the input of the first fully-connected layer) and assuming that sampling is realized with masks drawn from a cascade of Gaussian and Bernoulli distributions, the MCD layer samples the 𝑗 𝑡ℎ element of 𝑎 as</p><formula xml:id="formula_2">𝑎 𝑠 𝑗 = 𝑎𝑗 * 𝛼𝑗 * 𝛽𝑗 with 𝛼𝑗 ∼ 𝒩 (1, 𝜎 2 = 𝑞/(1 − 𝑞)) and 𝛽𝑗 ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑞).</formula><p>Here, 𝑞 denotes the dropout probability. In this work, we refer to MCD as an average of 𝑆 stochastic CNNs.</p></div>
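The cascade mask sampling above can be sketched in a few lines of NumPy (the layer placement and the surrounding network are omitted; the activation vector and seed are illustrative, and the Bernoulli parameterization follows the formula in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def mcd_cascade_mask(a, q=0.5):
    """One stochastic realization of the activations `a`, sampled with a
    cascade of Gaussian and Bernoulli masks: a_j^s = a_j * alpha_j * beta_j,
    with alpha_j ~ N(1, q/(1-q)) and beta_j ~ Bernoulli(q)."""
    alpha = rng.normal(loc=1.0, scale=np.sqrt(q / (1.0 - q)), size=a.shape)
    beta = rng.binomial(n=1, p=q, size=a.shape)
    return a * alpha * beta

a = np.ones(1000)                                              # toy activation vector
samples = np.stack([mcd_cascade_mask(a) for _ in range(100)])  # S = 100 passes
print(samples.mean())  # close to q = 0.5, the mean of the cascade mask
```

Each call yields one stochastic forward pass of the layer; averaging the resulting network outputs over the S passes gives the MCD prediction of Eq. (2).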
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Ensemble</head><p>An (explicit) ensemble was investigated in <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b26">27,</ref><ref type="bibr" target="#b27">28]</ref> for uncertainty estimation. It approximates the prediction 𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) by combining CNNs trained under different settings. Given a set of CNNs 𝑓 𝜃𝑚 for 𝑚 ∈ {1, 2, ..., 𝑀 }, the ensemble prediction is obtained by averaging over the predictions of the CNNs. That is,</p><formula xml:id="formula_3">𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) := 1 𝑀 𝑀 ∑︁ 𝑚=1 𝑝(𝑦|𝑥, 𝜃𝑚) := 1 𝑀 𝑀 ∑︁ 𝑚=1 𝑓 𝜃𝑚 (𝑥).<label>(3)</label></formula><p>In this work, we refer to an ensemble as an average of 𝑀 deterministic CNNs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Mixture of Monte Carlo Dropout (MMCD)</head><p>MMCD was investigated in <ref type="bibr" target="#b28">[29,</ref><ref type="bibr" target="#b29">30,</ref><ref type="bibr" target="#b30">31]</ref> for uncertainty estimation. It combines 𝑀 MCD models, each performing 𝑆 stochastic forward passes, and approximates the prediction 𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) by averaging over all 𝑀 • 𝑆 passes. That is,</p><formula xml:id="formula_4">𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) ≈ 1 𝑀 • 𝑆 𝑀 ∑︁ 𝑚=1 𝑆 ∑︁ 𝑠=1 𝑝(𝑦|𝑥, 𝜃𝑚 𝑠 ) ≈ 1 𝑀 • 𝑆 𝑀 ∑︁ 𝑚=1 𝑆 ∑︁ 𝑠=1 𝑓 𝜃𝑚 𝑠 (𝑥).<label>(4)</label></formula><p>In this work, we refer to MMCD as an average of 𝑀 • 𝑆 stochastic CNNs.</p></div>
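Eq. (4) can be sketched as follows; `ToyMCDModel` is a hypothetical stand-in for a trained MCD network whose `predict` keeps dropout active at test time, and the logits and noise scale are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyMCDModel:
    """Hypothetical stand-in for a trained MCD network: each call returns
    a noisy probability vector (dropout active at test time)."""
    def __init__(self, logits):
        self.logits = np.asarray(logits, dtype=float)

    def predict(self, x):
        z = self.logits + rng.normal(scale=0.5, size=self.logits.shape)
        e = np.exp(z - z.max())
        return e / e.sum()

def mmcd_predict(models, x, S=100):
    # Eq. (4): average the probabilities of all M * S stochastic forward passes.
    probs = [m.predict(x) for m in models for _ in range(S)]
    return np.mean(probs, axis=0)

models = [ToyMCDModel([2.0, 1.0, 0.0]) for _ in range(5)]  # M = 5
p = mmcd_predict(models, x=None, S=100)
print(p.sum())  # a valid probability vector: entries sum to 1
```

Setting `models` to a single MCD network recovers Eq. (2), and setting `S=1` with deterministic members recovers the ensemble average of Eq. (3).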
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Combining logits instead of probabilities</head><p>The output layer of a CNN-based classifier includes 𝐾 output neurons with a softmax activation function, which normalizes its inputs (continuous values) to produce discrete probabilities 𝑝 𝑘 (with 𝑘 = 1, . . . , 𝐾 and ∑︀ 𝐾 𝑘=1 𝑝 𝑘 = 1) representing the probability that the input image belongs to the class associated with the 𝑘 𝑡ℎ output neuron. The inputs to the softmax function are the logits, which are interpreted as evidence for the possible classes <ref type="bibr" target="#b14">[15]</ref>. The discrete probability 𝑝 𝑘 is interpreted as the model confidence that the input belongs to the class associated with the 𝑘 𝑡ℎ output neuron. Given the logit vector 𝑧 = [︀ 𝑧1 . . . 𝑧𝐾 ]︀ 𝑇 , the softmax estimates 𝑝 = [︀ 𝑝1 . . . 𝑝𝐾 ]︀ 𝑇 as</p><formula xml:id="formula_5">𝑝 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧) = 1 ∑︀ 𝐾 𝑘=1 exp(𝑧 𝑘 ) [︀ exp(𝑧1) . . . exp(𝑧𝐾 ) ]︀ 𝑇 .<label>(5)</label></formula><p>From Figure <ref type="figure" target="#fig_1">1</ref>, given an ensemble of 𝑀 deterministic CNNs with logits 𝑧 𝑚 , the average logit 𝑧 can be estimated as</p><formula xml:id="formula_6">𝑧 := 1 𝑀 𝑀 ∑︁ 𝑚=1 𝑧 𝑚 := 1 𝑀 𝑀 ∑︁ 𝑚=1 𝑓 𝜃𝑚 (𝑥),<label>(6)</label></formula><p>and the predicted probability vector of the ensemble of deterministic CNNs can be reformulated as</p><formula xml:id="formula_7">𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧).<label>(7)</label></formula><p>Given MCD representing an ensemble of 𝑆 stochastic CNNs with logits 𝑧 𝑠 , we can estimate the average logit 𝑧 as</p><formula xml:id="formula_8">𝑧 ≈ 1 𝑆 𝑆 ∑︁ 𝑠=1 𝑧 𝑠 ≈ 1 𝑆 𝑆 ∑︁ 𝑠=1 𝑓 𝜃𝑠 (𝑥),<label>(8)</label></formula><p>and reformulate the predicted probability vector of MCD, as shown in (7). 
Similarly, given MMCD representing an ensemble of 𝑀 • 𝑆 stochastic CNNs with logits 𝑧 𝑚𝑠 , we can estimate the average logit 𝑧 as</p><formula xml:id="formula_9">𝑧 ≈ 1 𝑀 • 𝑆 𝑀 ∑︁ 𝑚=1 𝑆 ∑︁ 𝑠=1 𝑧 𝑚𝑠 ≈ 1 𝑀 • 𝑆 𝑀 ∑︁ 𝑚=1 𝑆 ∑︁ 𝑠=1 𝑓𝜃 𝑚𝑠 (𝑥),<label>(9)</label></formula><p>and reformulate the predicted probability vector of MMCD, as shown in (7). From Figure <ref type="figure">2</ref>, averaging logits instead of probabilities of multiple stochastic or deterministic CNNs increases the confidence of the averaged CNNs. Intuitively, logit averaging provides the best evidence (characterized by a low level of uncertainty caused by the reduction of inductive biases inherent in individual logits) for making decisions. In contrast, probability averaging provides the best confidence associated with decisions made using weak evidence (characterized by a high level of uncertainty caused by inductive biases inherent in individual logits). This implies that a decision made using probability averaging considers more uncertainty than one made using logit averaging. In this work, we evaluated the impact of the possible increase in the degree of confidence caused by applying logit averaging instead of probability averaging on the QoC.</p></div>
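The confidence increase can be reproduced with a small NumPy example in the spirit of Figure 2 (the logit values are illustrative): one member with large-magnitude logits dominates the logit average of Eqs. (6)-(7), while the probability average weights all members' normalized outputs.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Four members; z^1 has a much larger magnitude than the others.
Z = np.array([
    [10.0, 2.0, 1.0],
    [ 0.5, 1.0, 0.8],
    [ 0.6, 0.9, 1.1],
    [ 0.9, 0.7, 1.0],
])

p_AP = np.mean([softmax(z) for z in Z], axis=0)  # average probabilities (AP)
p_AL = softmax(Z.mean(axis=0))                   # average logits (AL), Eqs. (6)-(7)

print(p_AP.max(), p_AL.max())  # AL is markedly more confident than AP
```

Here both approaches predict the same class, but the AL confidence is far higher because the first member's large logits dominate the average logit.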
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Experimental setup</head><p>[Caption of Figure 2: 𝑝 = 1/4 ∑︀ 4 𝑚=1 𝑝 𝑚 with 𝑝 1 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 1 ), 𝑝 2 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 2 ), 𝑝 3 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 3 ), and 𝑝 4 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 4 ). One can see that averaging logits (𝑝 𝑧 ) results in more confident predictions than averaging probabilities (𝑝). This is attributed to averaging logits being more sensitive to the magnitude of logit values than averaging probabilities: 𝑧 𝑚 with large values contributes most to 𝑧. In the example, 𝑧 is mostly influenced by the values of 𝑧 1 , while the contributions of 𝑧 2 , 𝑧 3 , and 𝑧 4 are minor. However, 𝑝 is influenced by the values of all probability vectors 𝑝 𝑚 and is therefore less sensitive to the magnitude of individual logits.]</p><p>We hypothesized that the QoC of CNNs (strongly) depends on the task difficulty (specified by the training data), the underlying architecture, and/or the training procedure (mostly influenced by the regularization strength). Therefore, we compared logit and probability averaging on three datasets to evaluate the impact of the task difficulty on the QoC. Moreover, we compared logit and probability averaging using three different architectures to evaluate the impact of the underlying architecture on the QoC. Specifically, we evaluated MNIST <ref type="bibr" target="#b31">[32]</ref> on VGGNets <ref type="bibr" target="#b0">[1]</ref>, FashionMNIST <ref type="bibr" target="#b32">[33]</ref> on ResNets <ref type="bibr" target="#b1">[2]</ref>, and CIFAR10 <ref type="bibr" target="#b33">[34]</ref> on DenseNets <ref type="bibr" target="#b34">[35]</ref>. Finally, we compared logit and probability averaging on CNNs trained using two regularization strengths (strong and weak regularization, summarized in Table <ref type="table">1</ref>) to evaluate the impact of the regularization strength on the QoC. We observed that strong and weak regularization result in underconfident and overconfident CNNs, respectively. 
All CNNs were regularized using batch normalization <ref type="bibr" target="#b35">[36]</ref> layers placed before each convolutional activation function. All CNNs were randomly initialized and trained with random shuffling of training samples. All CNNs were trained using the categorical cross-entropy loss and stochastic gradient descent with a momentum of 0.9, a learning rate of 0.02, a batch size of 128, and 100 epochs. All images were standardized and normalized by dividing pixel values by 255. For all MCD and MMCD, we sampled activations of the first fully-connected layer using masks drawn from a cascade of Bernoulli and Gaussian distributions <ref type="bibr" target="#b21">[22]</ref> and a dropout probability of 0.5. We performed 100 stochastic forward passes (𝑆 = 100) and considered ensembles consisting of five deterministic CNNs (𝑀 = 5).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Summary of values assigned to regularization hyperparameters. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Evaluation metrics</head><p>QoC was evaluated by assessing the degree of confidence calibration. Specifically, we evaluated the calibration error using measures such as the negative log likelihood (NLL) applied in <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b30">31]</ref>, the expected calibration error (ECE) applied in <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b11">12]</ref>, and the Brier score (BS) applied in <ref type="bibr" target="#b3">[4]</ref>. Low values of NLL, ECE, and BS indicate low calibration error and vice versa. Furthermore, we evaluated QoC by assessing the ability of the confidence to separate TPs and FPs. Here, we evaluated the average confidence on evaluation data causing TPs or FPs. Given evaluation data causing TPs, we expect the average confidence on the evaluation data to be high. However, for evaluation data causing FPs, we expect the average confidence to be low. Moreover, we evaluated the ability to separate TPs and FPs using the area under the receiver operating characteristic curve (AUROC) applied in <ref type="bibr" target="#b36">[37,</ref><ref type="bibr" target="#b4">5]</ref>. AUROC summarizes the trade-off between the fraction of TPs that are correctly detected and the fraction of FPs that remain undetected across different thresholds. In summary, in addition to the NLL, ECE, and BS, we evaluated the accuracy, average confidence, and AUROC.</p></div>
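A minimal binned ECE in the style of [13] can be sketched as follows (the bin count and toy confidences are illustrative, not the paper's evaluation code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: the bin-weighted average of |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# A perfectly calibrated toy case: confidence 0.8 with 8/10 predictions correct.
conf = np.full(10, 0.8)
corr = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
print(expected_calibration_error(conf, corr))  # 0.0
```

An overconfident model (mean confidence above accuracy in each bin) drives this quantity up, which is exactly the gap the tables in Section 5.4 report.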
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Evaluation data</head><p>Test data were used for evaluating the accuracy, NLL, ECE, and BS. We expect the accuracy to be high and the NLL, ECE, and BS to be low on test data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Subsets of the correctly classified test data</head><p>comprise 1000 correctly classified test samples from the experimental data. Since CNNs will make TPs on these data, we used these data for evaluating the average confidence on TPs.</p><p>Swapped data were simulated using subsets of the correctly classified test data structurally perturbed by dividing images into four regions and diagonally permuting the regions. From Figure <ref type="figure" target="#fig_5">3b</ref>, the upper-left and upper-right regions are permuted with the bottom-right and bottom-left regions, respectively. Swapped data include structurally perturbed objects within the given images. We expect CNNs to make FPs on swapped data. Therefore, we used these data for evaluating the average confidence on FPs caused by structurally perturbed objects.</p><p>Noisy data were simulated using subsets of the correctly classified test data perturbed by applying additive Gaussian noise with a standard deviation of 500. From Figure <ref type="figure" target="#fig_5">3c</ref>, noisy data include noise within the given images. We expect CNNs to make FPs on these data. Therefore, we used these data for evaluating the average confidence on FPs caused by noisy objects.</p><p>Out-of-domain data were simulated using 1000 test samples of CIFAR100 <ref type="bibr" target="#b33">[34]</ref>. Since CNNs will make FPs on these data, we used these data for evaluating the average confidence on FPs caused by unknown objects.</p><p>In general, we expect the average confidence to be high on TPs and to be low on FPs.</p></div>
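The swapped and noisy perturbations can be sketched as follows (single-channel images; the noise standard deviation follows the text, while the image values and seed are illustrative):

```python
import numpy as np

def swap_quadrants(img):
    """'Swapped data': split the image into four regions and permute them
    diagonally (upper-left <-> bottom-right, upper-right <-> bottom-left).
    For odd sizes, the trailing row/column is left unchanged."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    out = img.copy()
    out[:h, :w] = img[h:2 * h, w:2 * w]      # bottom-right -> upper-left
    out[h:2 * h, w:2 * w] = img[:h, :w]      # upper-left -> bottom-right
    out[:h, w:2 * w] = img[h:2 * h, :w]      # bottom-left -> upper-right
    out[h:2 * h, :w] = img[:h, w:2 * w]      # upper-right -> bottom-left
    return out

def add_noise(img, std=500.0, rng=np.random.default_rng(0)):
    """'Noisy data': additive Gaussian noise with a standard deviation of 500."""
    return img + rng.normal(scale=std, size=img.shape)

img = np.arange(16.0).reshape(4, 4)
swapped = swap_quadrants(img)
print(swapped[0, 0])  # 10.0: the old bottom-right corner value
```

Applying `swap_quadrants` twice recovers the original image, which is a convenient sanity check for the diagonal permutation.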
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Experimental results</head><p>We evaluate the conducted experiments with respect to accuracy and QoC. Table <ref type="table">2</ref> and Table <ref type="table" target="#tab_3">3</ref> summarize the accuracy, average confidence, NLL, ECE, and BS of different models using the two averaging approaches and CNNs trained using strong regularization (causing underconfidence) and weak regularization (causing overconfidence). The results show that averaging logits instead of probabilities does not strongly affect the accuracy. This means that averaging logits can preserve accuracy. Furthermore, averaging logits instead of probabilities significantly increases the average confidence. Figure <ref type="figure">2</ref> illustrates why the confidence increases. Further, Table <ref type="table">2</ref> shows that averaging logits instead of probabilities significantly decreases the NLL, ECE, and BS for underconfident CNNs (trained using strong regularization). This means that averaging logits, unlike averaging probabilities, reduces the calibration error for underconfident CNNs. This is because the stronger the regularization, the lower the confidence and the higher the gap between accuracy and average confidence. Here, the increase in the degree of confidence caused by averaging logits instead of probabilities reduces the gap between accuracy and average confidence. For example, Table <ref type="table">2</ref> shows that averaging logits instead of probabilities of the ensemble reduces the gap between accuracy and average confidence from 18.24% (= |88.75 − 70.51|) to 9.52% (= |88.94 − 79.42|) on CIFAR10.</p><p>However, the increase in the degree of confidence caused by averaging logits instead of probabilities increases the calibration error for overconfident CNNs (trained using weak regularization). 
Table <ref type="table" target="#tab_3">3</ref> provides empirical evidence for this claim by showing that, on CIFAR10 and FashionMNIST, the NLL, ECE, and BS of the ensembles increase when logits are averaged instead of probabilities. We argued that the more overconfident the CNNs, the higher</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Comparison of accuracy[%], average confidence[%] (in brackets), NLL[10 −2 ], ECE[10 −2 ], and BS[10 −2 ] of different models using two approaches for averaging underconfident CNNs trained using strong regularization: average probabilities (AP) and average logits (AL). The results were obtained using the test data described in Section 5.3. the confidence and the higher the gap between accuracy and average confidence. Here, the increase in the degree of confidence caused by averaging logits instead of probabilities further increases the gap between the accuracy and average confidence and therefore increases the calibration error. For example, Table <ref type="table" target="#tab_3">3</ref> shows that, on CIFAR10, averaging logits of the ensemble increases the gap between the accuracy and average confidence from 0.76% (= |88.67 − 89.43|) to 7.29% (= |88.88 − 96.17|).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In Table <ref type="table">4</ref>, the average confidence on TPs and FPs is shown for underconfident models using both averaging approaches. The results show that averaging logits instead of probabilities increases the confidence level on both TPs and FPs. The increase in the average confidence is sometimes very large for FPs, particularly on noisy data. For example, for MMCD evaluated on FashionMNIST, the average confidence on the noisy data increases from 51.31% to 94.58% when logits are averaged instead of probabilities. This is because noisy data can increase the magnitude of the logits, and averaging logits is more sensitive to changes in the magnitude of logits than averaging probabilities (see Figure <ref type="figure">2</ref>). The increase in the degree of confidence caused by averaging logits can harm the separability of TPs and FPs. For example, the increase in the average confidence on the noisy data from 51.31% to 94.58% causes the AUC-ROC obtained by evaluating the degree of confidence to decrease from 84.80% to 42.42%.</p></div>
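The drop in separability can be reproduced in spirit with the rank-based (Mann-Whitney) formulation of AUC-ROC; the confidence values below are hypothetical stand-ins, not the values from Table 4:

```python
import numpy as np

def auc_roc(conf_tp, conf_fp):
    """AUC-ROC via the Mann-Whitney formulation: the probability that a
    randomly chosen TP receives a higher confidence than a randomly
    chosen FP (ties count half)."""
    tp = np.asarray(conf_tp)[:, None]
    fp = np.asarray(conf_fp)[None, :]
    return (tp > fp).mean() + 0.5 * (tp == fp).mean()

# Hypothetical confidences: averaging probabilities leaves FPs on noisy
# data at low confidence, while averaging logits pushes them up into the
# range of the TPs (illustrative values only).
conf_tp = [0.95, 0.90, 0.92, 0.97]
conf_fp_ap = [0.51, 0.55, 0.48]  # FPs, average probabilities
conf_fp_al = [0.94, 0.96, 0.93]  # FPs, average logits

print(auc_roc(conf_tp, conf_fp_ap))  # 1.0: TPs and FPs fully separable
print(auc_roc(conf_tp, conf_fp_al))  # ~0.42: near (below) chance level
```

Once FPs carry confidences in the same range as TPs, no confidence threshold can tell them apart, which is exactly the failure mode reported above.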
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion</head><p>The term 'combination process' encompasses both how multiple networks are combined and the type of information combined. It was found in <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b23">24,</ref><ref type="bibr" target="#b21">22]</ref> that simple averaging is more robust and captures uncertainty better than voting approaches. This is because simple averaging weights all predictions equally, while voting ignores uncertain predictions. In this work, we compared averaging logits with averaging probabilities. We empirically showed that averaging logits instead of probabilities increases the confidence while preserving the accuracy for both underconfident and overconfident networks. This might be because logit averaging preserves the position of the maximum element of the individual logit vectors but is more sensitive to the magnitude of logit values than probability averaging. Thus, logit values with a large magnitude contribute the most to the average logit. In this way, the magnitude of the logit values induces a nonuniform weighting for logit averaging, which is lost for probability averaging. Furthermore, we provided empirical evidence showing that for underconfident networks (trained using strong regularization), the increase in the confidence caused by averaging logits instead of probabilities reduces the calibration error on the test data. This is because the increase in the degree of confidence narrows the gap between accuracy and average confidence. However, for overconfident networks (trained using weak regularization), the increase in confidence caused by averaging logits instead of probabilities increases the calibration error on the test data, because it further widens the gap between accuracy and average confidence. This finding suggests that for underconfident networks, we can average logits instead of probabilities to reduce the calibration error, whereas for overconfident networks, we should average probabilities instead of logits to avoid increasing the calibration error. Although the increase in confidence caused by averaging logits reduces the calibration error on the test data for underconfident networks, we empirically showed that it can harm the separability of TPs and FPs. This is because averaging logits increases the confidence on both TPs and FPs; therefore, FPs can also be made with high confidence, similar to TPs. These findings suggest that reducing the calibration error on the test data and improving the separability of TPs and FPs can be two contradicting goals: improving one may come at the detriment of the other. Furthermore, for two models 𝐴 and 𝐵, if 𝐴 is better calibrated than 𝐵, then 𝐴 does not necessarily separate TPs and FPs better than 𝐵. This implies that calibration methods may be insufficient for separating TPs and FPs and therefore for ensuring safe decision-making. Additionally, existing methods for confidence calibration may not help in separating TPs and FPs. Consequently, future work will evaluate the ability of existing confidence calibration methods to separate TPs and FPs. We also recommend that researchers evaluate both the calibration error of their proposed method for confidence calibration and the ability of their proposed method to separate TPs and FPs. Finally, for mission- and safety-critical applications where the separability of TPs and FPs is of paramount importance, we suggest averaging probabilities to avoid the negative impact of logit averaging on the ability to separate TPs and FPs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Comparison of average confidence[%] of different models using two approaches (average probabilities (AP) and average logits (AL)) for averaging underconfident networks trained using strong regularization and evaluated on TPs and FPs: TPs were obtained on subsets of the correctly classified test data, while FPs were obtained on the swapped, noisy, and out-of-domain (OOD) data described in Section 5.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>Averaging logits instead of averaging probabilities of stochastic or deterministic networks increases the degree of confidence on both TPs and FPs. This reduces the calibration error on the test data for underconfident networks but harms the separability of TPs and FPs. Our empirical results show that there is a trade-off between improving calibration on the test data and improving the separability of TPs and FPs. Additionally, the increase in the degree of confidence increases the calibration error on the test data for overconfident networks. Therefore, averaging logits should only be applied when combining underconfident networks. For example, we can average logits instead of probabilities of an ensemble of networks trained with mixup or other modern data augmentation techniques to improve calibration on the test data. Notwithstanding this, for mission- and safety-critical applications where the separability of TPs and FPs is essential, we suggest the traditional averaging of probabilities. However, it remains unclear whether the findings of this paper would change if the given networks or the average logit were calibrated, for example, with temperature scaling <ref type="bibr" target="#b12">[13]</ref>. This suggests a new research direction.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>(a)</head><label></label><figDesc>Logit averaging (b) Probability averaging</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example showing the difference between averaging logits and averaging probabilities in an ensemble.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Example showing how averaging logits instead of probabilities increases the confidence of an ensemble of four deterministic CNNs: 𝑧 = 1 4 ∑︀ 4 𝑚=1 𝑧 𝑚 , 𝑝 𝑧 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧), and 𝑝 = 1 4 ∑︀ 4 𝑚=1 𝑝 𝑚 with 𝑝 1 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 1 ), 𝑝 2 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 2 ), 𝑝 3 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 3 ), and 𝑝 4 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 4 ). One can see that averaging logits (𝑝 𝑧 ) results in more confident predictions than averaging probabilities (𝑝). This is attributed to averaging logits being more sensitive to the magnitude of logit values than averaging probabilities. Here, 𝑧 𝑚 with large values contributes most to 𝑧. In our example, 𝑧 is mostly influenced by the values of 𝑧 1 ; that is, the contributions of 𝑧 2 , 𝑧 3 , and 𝑧 4 to 𝑧 are minor. However, 𝑝 is influenced by the values of all probability vectors 𝑝 𝑚 and is therefore less sensitive to the magnitude of individual logits.</figDesc></figure>
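The behaviour described in the Figure 2 caption can be reproduced with a small numerical sketch; the logit values below are hypothetical, not the ones from the figure:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logit vectors of an ensemble of four CNNs for one input
# (3 classes). Member 1 produces large-magnitude logits; the others are
# close to uniform. Illustrative values only.
logits = np.array([
    [8.0, 1.0, 0.5],
    [1.2, 1.0, 0.9],
    [1.1, 0.8, 1.0],
    [1.0, 0.9, 0.8],
])

p_avg_logits = softmax(logits.mean(axis=0))  # average logits, then softmax
p_avg_probs = softmax(logits).mean(axis=0)   # softmax each member, then average

# Both approaches keep the position of the max element (same prediction) ...
assert p_avg_logits.argmax() == p_avg_probs.argmax()
# ... but averaging logits is markedly more confident (about 0.78 vs 0.53
# here), because member 1's large-magnitude logits dominate the average.
print(p_avg_logits.max(), p_avg_probs.max())
```

The same mechanism explains both the reduced calibration error for underconfident ensembles and the inflated confidence on FPs driven by large-magnitude logits.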
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>(a)</head><label></label><figDesc>Test data (b) Swapped data (c) Noisy data</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Examples of evaluation data for experiments conducted on CIFAR10.</figDesc><graphic coords="6,89.29,94.15,119.50,118.71" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>We used five evaluation data sets for different purposes, namely test data, subsets of the correctly classified test data, out-of-domain data, swapped data, and noisy data.</figDesc><table><row><cell>Test data represent the test data from the experimental data, namely, MNIST, CIFAR10, and FashionMNIST. These datasets include both correctly classified and misclassified test data. Test data are used for estimating the accuracy, NLL, ECE, and</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Comparison of accuracy[%], average confidence[%] (in brackets), NLL[10 −2 ], ECE[10 −2 ], and BS[10 −2 ] of ensembles using two approaches for averaging overconfident CNNs trained using weak regularization: average probabilities (AP) and average logits (AL). The results were obtained using the test data described in Section 5.3.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell cols="2">Accuracy (Average confidence) ↑</cell><cell cols="2">NLL ↓</cell><cell cols="2">ECE ↓</cell><cell>BS ↓</cell></row><row><cell></cell><cell>AP</cell><cell></cell><cell>AL</cell><cell></cell><cell>AP</cell><cell>AL</cell><cell>AP</cell><cell>AL</cell><cell>AP</cell><cell>AL</cell></row><row><cell cols="3">CIFAR10 (DenseNets)</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>Ensemble</cell><cell cols="2">89.52 (84.31)</cell><cell cols="2">89.60 (87.97)</cell><cell>34.66</cell><cell>32.81</cell><cell>5.23</cell><cell>2.47</cell><cell>16.13</cell><cell>15.38</cell></row><row><cell>MCD</cell><cell cols="2">85.36 (73.35)</cell><cell cols="2">85.37 (80.04)</cell><cell>52.13</cell><cell>46.55</cell><cell>12.04</cell><cell>5.40</cell><cell>23.32</cell><cell>21.57</cell></row><row><cell>MMCD</cell><cell cols="2">88.75 (70.51)</cell><cell cols="2">88.94 (79.42)</cell><cell>50.83</cell><cell>40.99</cell><cell>18.24</cell><cell>9.55</cell><cell>21.82</cell><cell>18.34</cell></row><row><cell cols="3">FashionMNIST (ResNets)</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>Ensemble</cell><cell cols="2">92.70 (87.86)</cell><cell cols="2">92.58 (90.16)</cell><cell>22.57</cell><cell>20.99</cell><cell>5.15</cell><cell>2.86</cell><cell>11.37</cell><cell>10.87</cell></row><row><cell>MCD</cell><cell cols="2">90.56 (79.22)</cell><cell cols="2">90.56 (83.95)</cell><cell>35.45</cell><cell>30.18</cell><cell>11.47</cell><cell>6.85</cell><cell>15.82</cell><cell>14.57</cell></row><row><cell>MMCD</cell><cell cols="2">92.65 (76.37)</cell><cell cols="2">92.73 (83.78)</cell><cell>35.57</cell><cell>26.96</cell><cell>16.31</cell><cell>9.10</cell><cell>14.87</cell><cell>12.47</cell></row><row><cell cols="2">MNIST (VGGNets)</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>Ensemble</cell><cell cols="2">99.04 (98.24)</cell><cell cols="2">99.04 (98.89)</cell><cell>3.25</cell><cell>2.90</cell><cell>1.03</cell><cell>0.52</cell><cell>1.52</cell><cell>1.41</cell></row><row><cell>MCD</cell><cell cols="2">98.16 (94.53)</cell><cell cols="2">98.16 (96.48)</cell><cell>8.73</cell><cell>6.87</cell><cell>3.81</cell><cell>1.98</cell><cell>2.99</cell><cell>2.79</cell></row><row><cell>MMCD</cell><cell cols="2">99.03 (94.67)</cell><cell cols="2">99.04 (97.46)</cell><cell>6.91</cell><cell>4.13</cell><cell>4.49</cell><cell>1.75</cell><cell>1.89</cell><cell>1.52</cell></row><row><cell></cell><cell></cell><cell cols="4">Accuracy (Average confidence) ↑</cell><cell cols="2">NLL ↓</cell><cell cols="2">ECE ↓</cell><cell>BS ↓</cell></row><row><cell></cell><cell></cell><cell>AP</cell><cell></cell><cell>AL</cell><cell></cell><cell>AP</cell><cell>AL</cell><cell>AP</cell><cell>AL</cell><cell>AP</cell><cell>AL</cell></row><row><cell cols="2">CIFAR10 (DenseNets)</cell><cell cols="2">88.67 (89.43)</cell><cell cols="2">88.88 (96.17)</cell><cell>40.69</cell><cell>54.23</cell><cell>3.03</cell><cell>7.40</cell><cell>16.69</cell><cell>18.07</cell></row><row><cell cols="2">FashionMNIST (ResNets)</cell><cell cols="2">94.49 (95.86)</cell><cell cols="2">94.58 (98.43)</cell><cell>20.20</cell><cell>28.00</cell><cell>1.98</cell><cell>4.11</cell><cell>8.36</cell><cell>9.32</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Very deep convolutional networks for large-scale image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/1409.1556" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Gawlikowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">R N</forename><surname>Tassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Humt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kruspe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Triebel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Roscher</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.03342</idno>
		<title level="m">A survey of uncertainty in deep neural networks</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Simple and scalable predictive uncertainty estimation using deep ensembles</title>
		<author>
			<persName><forename type="first">B</forename><surname>Lakshminarayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pritzel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Blundell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">On mixup training: Improved calibration and predictive uncertainty for deep neural networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Thulasidasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chennupati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Bilmes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bhattacharya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Michalak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Improving calibration through the relationship with adversarial robustness</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Beutel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chi</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=NJex-5TZIQa" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Beygelzimer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Dauphin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Vaughan</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Jerfel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Muller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Dusenberry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Snoek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lakshminarayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tran</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.09875</idno>
		<title level="m">Combining ensembles and data augmentation can harm your calibration</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Uncertainty quantification and deep ensembles</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rahaman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">H</forename><surname>Thiery</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">mixup: Beyond empirical risk minimization</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cisse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">N</forename><surname>Dauphin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lopez-Paz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Rethinking the inception architecture for computer vision</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wojna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2818" to="2826" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">When Does Label Smoothing Help?</title>
		<author>
			<persName><forename type="first">R</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kornblith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>Curran Associates Inc</publisher>
			<pubPlace>Red Hook, NY, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gales</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2101.05397</idno>
		<title level="m">Should ensemble members be calibrated?</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">On calibration of modern neural networks</title>
		<author>
			<persName><forename type="first">C</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pleiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1321" to="1330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V</forename><surname>Dalca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Sabuncu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1906.09551</idno>
		<title level="m">Confidence calibration for convolutional neural networks using structured dropout</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Evidential deep learning to quantify classification uncertainty</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sensoy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kandemir</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The relative performance of ensemble methods with deep convolutional neural networks for image classification</title>
		<author>
			<persName><forename type="first">C</forename><surname>Ju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bibaut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Der Laan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Applied Statistics</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="2800" to="2818" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Combining pattern classifiers: methods and algorithms</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">I</forename><surname>Kuncheva</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014">2014</date>
			<publisher>John Wiley &amp; Sons</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">An overview and comparison of voting methods for pattern recognition</title>
		<author>
			<persName><forename type="first">M</forename><surname>Van Erp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Vuurpijl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schomaker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings Eighth International Workshop on Frontiers in Handwriting Recognition</title>
				<meeting>Eighth International Workshop on Frontiers in Handwriting Recognition</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="195" to="200" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">New voting functions for neural network algorithms</title>
		<author>
			<persName><forename type="first">T</forename><surname>Tajti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Annales Mathematicae et Informaticae</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page" from="229" to="242" />
			<date type="published" when="2020">2020</date>
			<publisher>Eszterházy Károly Egyetem Líceum Kiadó</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Machine-learning research</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">G</forename><surname>Dietterich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AI Magazine</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="97" to="97" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Review of classifier combination methods, Machine learning in document analysis and recognition</title>
		<author>
			<persName><forename type="first">S</forename><surname>Tulyakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jaeger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Govindaraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Doermann</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="361" to="386" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Bayesian convolutional neural network: Robustly quantify uncertainty for misclassifications detection</title>
		<author>
			<persName><forename type="first">N</forename><surname>Tassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Rovile</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Mediterranean Conference on Pattern Recognition and Artificial Intelligence</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="118" to="132" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Combining forecasts: A review and annotated bibliography</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T</forename><surname>Clemen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Forecasting</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="559" to="583" />
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">On combining classifiers</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kittler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hatef</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">P</forename><surname>Duin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Matas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="226" to="239" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Is it better to average probabilities or quantiles?</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">C</forename><surname>Lichtendahl</surname><genName>Jr</genName></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Grushka-Cockayne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L</forename><surname>Winkler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Management Science</title>
		<imprint>
			<biblScope unit="volume">59</biblScope>
			<biblScope unit="page" from="1594" to="1611" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Dropout as a Bayesian approximation: Representing model uncertainty in deep learning</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Gal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ghahramani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1050" to="1059" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">The power of ensembles for active learning in image classification</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">H</forename><surname>Beluch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Genewein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nürnberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Köhler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="9368" to="9377" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Evaluating scalable Bayesian deep learning methods for robust computer vision</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">K</forename><surname>Gustafsson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Danelljan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Schön</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="318" to="319" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Kahn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Villaflor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Pong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abbeel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1702.01182</idno>
		<title level="m">Uncertainty-aware reinforcement learning for collision avoidance</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Safe reinforcement learning with model uncertainty estimates</title>
		<author>
			<persName><forename type="first">B</forename><surname>Lütjens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Everett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>How</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2019 International Conference on Robotics and Automation (ICRA)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="8662" to="8668" />
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Bayesian deep learning and a probabilistic perspective of generalization</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Izmailov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="4697" to="4708" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Gradient-based learning applied to document recognition</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Haffner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE</title>
				<meeting>the IEEE</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="volume">86</biblScope>
			<biblScope unit="page" from="2278" to="2324" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Rasul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vollgraf</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1708.07747</idno>
		<title level="m">Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<title level="m">Learning multiple layers of features from tiny images</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Densely connected convolutional networks</title>
		<author>
			<persName><forename type="first">G</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>van der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="4700" to="4708" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Batch normalization: Accelerating deep network training by reducing internal covariate shift</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="448" to="456" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">A baseline for detecting misclassified and out-of-distribution examples in neural networks</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gimpel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of International Conference on Learning Representations</title>
				<meeting>International Conference on Learning Representations</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
