The impact of averaging logits over probabilities on ensembles of neural networks

Cedrique Rovile Njieutcheu Tassi1, Jakob Gawlikowski2, Auliya Unnisa Fitri2 and Rudolph Triebel3
1 German Aerospace Center (DLR), Institute of Optical Sensor Systems, Rutherfordstraße 2, 12489 Berlin, Germany
2 German Aerospace Center (DLR), Institute of Data Science, Mälzerstraße 3-5, 07745 Jena, Germany
3 German Aerospace Center (DLR), Institute of Robotics and Mechatronics, Münchener Straße 20, 82234 Wessling, Germany

Abstract
Model averaging has become a standard for improving neural networks in terms of accuracy, calibration, and the ability to detect false predictions (FPs). However, recent findings show that model averaging does not necessarily lead to calibrated confidences, especially for underconfident networks. While existing methods for improving the calibration of combined networks focus on recalibrating, building, or sampling calibrated models, we focus on the combination process itself. Specifically, we evaluate the impact of averaging logits instead of probabilities on the quality of confidence (QoC). We compare combining logits instead of probabilities of members (networks) for models such as ensembles, Monte Carlo Dropout (MCD), and Mixture of Monte Carlo Dropout (MMCD). The comparison is based on experimental results on three datasets using three different architectures. We show that averaging logits instead of probabilities increases the confidence, thereby improving the confidence calibration of underconfident models. For example, for MCD evaluated on CIFAR10, averaging logits instead of probabilities reduces the expected calibration error (ECE) from 12.04% to 5.40%. However, the increase in confidence can harm the confidence calibration of overconfident models and the separability between true predictions (TPs) and FPs.
For example, for MMCD evaluated on FashionMNIST, the average confidence on FPs due to the noisy data increases from 51.31% to 94.58% when averaging logits instead of probabilities. While averaging logits can be applied to underconfident models to improve the calibration on test data, we suggest averaging probabilities for safety- and mission-critical applications where the separability of TPs and FPs is of paramount importance.

Keywords
Model averaging, Combination process, Logit averaging, Probability averaging, Ensemble, Monte Carlo Dropout (MCD), Mixture of Monte Carlo Dropout (MMCD), Quality of confidence (QoC), Confidence calibration, Separating true predictions (TPs) and false predictions (FPs)

The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022)
Cedrique.NjieutcheuTassi@dlr.de (C. R. N. Tassi); Jakob.Gawlikowski@dlr.de (J. Gawlikowski); Auliya.Fitri@dlr.de (A. U. Fitri); Rudolph.Triebel@dlr.de (R. Triebel)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Recently, averaging the predictions of multiple stochastic or deterministic networks has become a standard approach for improving accuracy [1, 2] and uncertainty estimates [3]. Generally, the quality of uncertainty estimates (e.g., the QoC) is assessed by the degree of calibration and/or the ability to detect FPs. Model averaging can yield well-calibrated confidence [4, 5] and is one of the state-of-the-art methods for detecting FPs caused by out-of-distribution examples [4, 3]. However, recent findings [6, 7, 8] show that model averaging does not necessarily lead to calibrated confidence, especially when the networks are built using modern regularization techniques, such as mixup [9] or label smoothing [10, 11]. This is because modern regularization techniques can (strongly) regularize networks, resulting in underconfidence, and averaging underconfident networks produces even more underconfident networks. For example, [7] showed that averaging networks trained with modern regularization techniques resulted in more underconfident networks and therefore miscalibrated predictions. [12] supported this argument by theoretically and empirically showing that averaging calibrated networks does not always lead to calibrated confidences. Calibrating the confidences of averaged networks has received little attention in the literature. Generally, post-processing calibration methods, such as temperature scaling [13], can be used to recalibrate the confidences of averaged networks, as demonstrated in [8, 12]. As shown in [14] and further supported by [8], confidence calibration in model averaging is correlated with the diversity inherent in the individual networks: the more diverse the networks, the better the calibration. Motivated by this observation, [14] promoted model diversity using structured dropout to reduce calibration errors. [7] proposed class-adjusted mixup, which trains less confident networks by evaluating the difference between accuracy (estimated on a validation dataset after each training epoch) and the confidence of each training sample to activate or deactivate mixup training for overconfidence (average confidence > accuracy) or underconfidence (average confidence < accuracy), respectively. All these methods for improving the calibration of combined networks focus on recalibrating, building, or sampling calibrated networks. In contrast, this work focuses on the combination process itself. Specifically, we address the question: What is the impact of averaging logits instead of probabilities of multiple (stochastic or deterministic) networks on the QoC?

We hypothesized that averaging logits instead of probabilities of multiple networks increases the confidence of the averaged network. This is because logits (the inputs to the softmax), which can be interpreted as found evidence for the possible classes [15], are continuous values normalized by the softmax to produce discrete probabilities. The softmax normalization of continuous values (logits) to discrete values (probabilities) causes information loss and a possible robustness to changes in the magnitudes of logits. This implies that the softmax is a nonlinear function that maps multiple logit vectors with large differences in magnitudes to the same discrete probability vector. We evaluated the impact of the increase in confidence caused by averaging logits instead of probabilities on the QoC. Specifically, we evaluated the QoC by assessing the degree of confidence calibration, which measures the difference between the predicted (average confidence) and true probabilities (empirical accuracy). Furthermore, we evaluated the QoC by assessing the ability to separate TPs and FPs. To provide empirical evidence, we compared logit averaging against probability averaging for different averaged models, such as ensembles, MCD, and MMCD. The comparison was based on results from experiments conducted on three datasets, namely MNIST, FashionMNIST, and CIFAR10, evaluated on VGGNet, ResNet, and DenseNet, respectively.

Results show that averaging logits instead of probabilities preserves accuracy but increases confidence. For example, for MCD evaluated on CIFAR10 (see Table 2), the accuracy remained around 85.36% while the average confidence increased from 73.35% to 80.04% when we averaged logits instead of probabilities. Furthermore, given underconfident models, the increase in the degree of confidence reduces the calibration error on the test data. For example, for MCD evaluated on CIFAR10, the ECE dropped from 12.04% to 5.40% when the average confidence increased from 73.35% to 80.04%. However, given overconfident models, the increase in the degree of confidence increases the calibration error on the test data. For example, for the ensemble evaluated on CIFAR10 (see Table 3), the ECE increased from 3.03% to 7.40% when the average confidence increased from 89.43% to 96.17%. Finally, for underconfident or overconfident models, the increase in the degree of confidence can harm the separability between TPs and FPs. This is because averaging logits instead of probabilities increases the confidence of both TPs and FPs. Therefore, FPs can be made with high confidence similar to TPs. For example, for MMCD evaluated on FashionMNIST (see Table 4), the average confidence on FPs due to the noisy data increased from 51.31% to 94.58% when averaging logits instead of probabilities. In summary, we provide empirical evidence demonstrating how combining logits instead of probabilities of multiple (stochastic or deterministic) networks

• preserves accuracy, but increases the confidence on TPs and FPs.
• reduces the calibration error for underconfident networks, but increases the calibration error for overconfident networks.
• can harm the separability between TPs and FPs.

2. Related works

The combination process describes how multiple members are combined and the information type (e.g., logits or probabilities) that is combined. Several approaches, such as stacking [16] and voting [17, 18, 19], have been reported for aggregating multiple predictions. Some of these approaches have been reviewed and discussed in [20, 21] and experimentally compared in [18, 16] to find the one with the best accuracy. It was found that which approach improves accuracy most depends on several factors, such as the number of members, the diversity inherent in individual members, and the accuracy of individual members. In [22], we compared approaches such as averaging, plurality voting, and majority voting to find the one that best captures uncertainty, and found that averaging captures uncertainty better than voting approaches. Before our work, [23] argued that simple averaging approaches are more robust than voting approaches, an argument further supported by [24]. This is because the averaging approach considers all members' predictions, whereas plurality/majority voting ignores uncertain predictions and therefore reduces the uncertainty in the combined members' prediction. Although various combination approaches have been presented and compared in the literature, the information type that is combined has received relatively little attention. [25] showed that averaging quantiles rather than probabilities improves predictive performance. Generally, for neural networks and classification problems in particular, multiple members (networks) are combined by averaging probabilities [16]. [16] evaluated the impact of combining logits instead of probabilities on accuracy; however, the impact on the QoC remains unclear. Thus, we investigated the impact of combining logits instead of probabilities on the QoC.

3. Background

In the context of image classification, let the training data 𝐷_train = {𝑥_𝑖 ∈ ℝ^(𝐻×𝑊×𝐶), 𝑦_𝑖 ∈ 𝑈^𝐾}, 𝑖 ∈ [1, 𝑁], be a realization of independently and identically distributed random variables (𝑥, 𝑦) ∈ 𝑋 × 𝑌, where 𝑥_𝑖 denotes the 𝑖-th input and 𝑦_𝑖 its corresponding one-hot encoded class label from the set of standard unit vectors of ℝ^𝐾, 𝑈^𝐾.
𝑋 and 𝑌 denote the input and label spaces. 𝐻 × 𝑊 × 𝐶 denotes the dimension of the input images, where 𝐻, 𝑊, and 𝐶 refer to the height, width, and number of channels, respectively. 𝐾 and 𝑁 denote the numbers of possible output classes and samples within the training data, respectively.

3.1. Convolutional neural network (CNN)

A CNN is a nonlinear function 𝑓_𝜃 parameterized by model parameters 𝜃, called the network weights. Here, it maps input images 𝑥_𝑖 ∈ ℝ^(𝐻×𝑊×𝐶) to class labels 𝑦_𝑖 ∈ 𝑈^𝐾,

  𝑓_𝜃 : 𝑥_𝑖 ∈ ℝ^(𝐻×𝑊×𝐶) → 𝑦_𝑖 ∈ [0, 1]^𝐾;  𝑓_𝜃(𝑥_𝑖) = 𝑦_𝑖.  (1)

The network parameters are optimized on the training dataset 𝐷_train. Given a new data sample 𝑥 ∈ ℝ^(𝐻×𝑊×𝐶), a trained CNN 𝑓_𝜃 predicts the corresponding target 𝑦 = 𝑓_𝜃(𝑥) using the set of trained weights 𝜃. The network output (logit) is given by 𝑧 = 𝑓_𝜃(𝑥), from which a probability vector 𝑝(𝑦|𝑥, 𝐷_train) = softmax(𝑧) can be computed. In the following, this probability vector will be abbreviated by 𝑝 and its entries by 𝑝_𝑘 with 𝑘 = 1, …, 𝐾 and ∑_{𝑘=1}^{𝐾} 𝑝_𝑘 = 1. Further, we get the predicted confidence 𝑐 = max_𝑘(𝑝_𝑘) and the predicted class label 𝑦 = arg max_𝑘(𝑝_𝑘).

3.2. Monte Carlo Dropout (MCD)

MCD was investigated in [26, 27, 28] for uncertainty estimation. It is one of the most widespread Bayesian methods reviewed in [3]. It approximates the prediction 𝑝(𝑦|𝑥, 𝐷_train) using the mean of 𝑆 stochastic forward passes, 𝑝(𝑦|𝑥, 𝜃_1), …, 𝑝(𝑦|𝑥, 𝜃_𝑆), representing 𝑆 stochastic CNNs parameterized by samples 𝜃_1, 𝜃_2, …, 𝜃_𝑆. That is,

  𝑝(𝑦|𝑥, 𝐷_train) ≈ (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑝(𝑦|𝑥, 𝜃_𝑠) ≈ (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑓_{𝜃_𝑠}(𝑥).  (2)

Specifically, MCD approximates the prediction with a dropout distribution realized by sampling weights with masks drawn from known distributions, such as a Gaussian, a Bernoulli, or a cascade of Gaussian and Bernoulli distributions [22]. For example, given the activation vector 𝑎 fed to an MCD layer (placed, for example, at the input of the first fully-connected layer) and assuming that sampling is realized with masks drawn from a cascade of Gaussian and Bernoulli distributions, the MCD layer samples the 𝑗-th element of 𝑎 as 𝑎^𝑠_𝑗 = 𝑎_𝑗 · 𝛼_𝑗 · 𝛽_𝑗 with 𝛼_𝑗 ∼ 𝒩(1, 𝜎² = 𝑞/(1 − 𝑞)) and 𝛽_𝑗 ∼ Bernoulli(𝑞). Here, 𝑞 denotes the dropout probability. In this work, we refer to MCD as an average of 𝑆 stochastic CNNs.

3.3. Ensemble

An (explicit) ensemble was investigated in [4, 27, 28] for uncertainty estimation. It approximates the prediction 𝑝(𝑦|𝑥, 𝐷_train) by learning different settings. Given a set of CNNs 𝑓_{𝜃_𝑚} for 𝑚 ∈ {1, 2, …, 𝑀}, the ensemble prediction is obtained by averaging over the predictions of the CNNs. That is,

  𝑝(𝑦|𝑥, 𝐷_train) := (1/𝑀) ∑_{𝑚=1}^{𝑀} 𝑝(𝑦|𝑥, 𝜃_𝑚) := (1/𝑀) ∑_{𝑚=1}^{𝑀} 𝑓_{𝜃_𝑚}(𝑥).  (3)

In this work, we refer to an ensemble as an average of 𝑀 deterministic CNNs.

3.4. Mixture of Monte Carlo Dropout (MMCD)

MMCD was investigated in [29, 30, 31] for uncertainty estimation. It combines both MCD and the ensemble. For prediction estimation, MCD evaluates a single feature representation, but additionally considers the uncertainty associated with that feature representation. In contrast, an ensemble evaluates multiple feature representations without considering the uncertainty associated with the individual feature representations. Hence, MMCD applies MCD to an ensemble to evaluate multiple feature representations and to consider the uncertainty associated with each of them. Given a set of CNNs 𝑓_{𝜃_𝑚} for 𝑚 ∈ {1, 2, …, 𝑀}, the MMCD prediction is obtained by averaging over the predictions of all stochastic CNNs. That is,

  𝑝(𝑦|𝑥, 𝐷_train) ≈ (1/(𝑀·𝑆)) ∑_{𝑚=1}^{𝑀} ∑_{𝑠=1}^{𝑆} 𝑝(𝑦|𝑥, 𝜃_{𝑚𝑠}) ≈ (1/(𝑀·𝑆)) ∑_{𝑚=1}^{𝑀} ∑_{𝑠=1}^{𝑆} 𝑓_{𝜃_{𝑚𝑠}}(𝑥).  (4)

In this work, we refer to MMCD as an average of 𝑀·𝑆 stochastic CNNs.

Figure 1: Example showing the difference between (a) averaging logits and (b) averaging probabilities in an ensemble.
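The probability-averaging rules of equations (2)–(4) can be sketched in a few lines of NumPy. This is our own minimal illustration under assumed array shapes and toy inputs, not the authors' code:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last (class) axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits):
    # Eq. (3): average the probabilities of M deterministic CNNs.
    # logits has shape (M, K).
    return softmax(logits).mean(axis=0)

def mcd_predict(logits):
    # Eq. (2): average the probabilities of S stochastic forward passes.
    # logits has shape (S, K).
    return softmax(logits).mean(axis=0)

def mmcd_predict(logits):
    # Eq. (4): average over all M * S stochastic CNNs.
    # logits has shape (M, S, K).
    return softmax(logits).mean(axis=(0, 1))

rng = np.random.default_rng(0)
p = mmcd_predict(rng.normal(size=(5, 100, 10)))  # M=5, S=100, K=10
assert np.isclose(p.sum(), 1.0)  # the combined output is still a distribution
```

Because every member's softmax output already sums to one, any of these averages again yields a valid probability vector, which is why the combined confidence can be read off as max_k(p_k) exactly as for a single CNN.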
4. Combining logits instead of probabilities

The output layer of a CNN-based classifier includes 𝐾 output neurons with a softmax activation function, which normalizes its inputs (continuous values) to produce discrete probabilities 𝑝_𝑘 (with 𝑘 = 1, …, 𝐾 and ∑_{𝑘=1}^{𝐾} 𝑝_𝑘 = 1) representing the probability that the input image belongs to the class associated with the 𝑘-th output neuron. The inputs to the softmax function are logits, interpreted as evidence for the possible classes [15]. The discrete probability 𝑝_𝑘 is interpreted as the model confidence that the input belongs to the class associated with the 𝑘-th output neuron. Given the logit vector 𝑧 = [𝑧_1 … 𝑧_𝐾]^𝑇, the softmax estimates 𝑝 = [𝑝_1 … 𝑝_𝐾]^𝑇 as

  𝑝 = softmax(𝑧) = (1/∑_{𝑘=1}^{𝐾} exp(𝑧_𝑘)) [exp(𝑧_1) … exp(𝑧_𝐾)]^𝑇.  (5)

From Figure 1, given an ensemble of 𝑀 deterministic CNNs with logits 𝑧^𝑚, the average logit 𝑧̄ can be estimated as

  𝑧̄ := (1/𝑀) ∑_{𝑚=1}^{𝑀} 𝑧^𝑚 := (1/𝑀) ∑_{𝑚=1}^{𝑀} 𝑓_{𝜃_𝑚}(𝑥),  (6)

and the predicted probability vector of the ensemble of deterministic CNNs can be reformulated as

  𝑝(𝑦|𝑥, 𝐷_train) = softmax(𝑧̄).  (7)

Given MCD, representing an ensemble of 𝑆 stochastic CNNs with logits 𝑧^𝑠, we can estimate the average logit 𝑧̄ as

  𝑧̄ ≈ (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑧^𝑠 ≈ (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑓_{𝜃_𝑠}(𝑥),  (8)

and reformulate the predicted probability vector of MCD as shown in (7). Similarly, given MMCD, representing an ensemble of 𝑀·𝑆 stochastic CNNs with logits 𝑧^{𝑚𝑠}, we can estimate the average logit 𝑧̄ as

  𝑧̄ ≈ (1/(𝑀·𝑆)) ∑_{𝑚=1}^{𝑀} ∑_{𝑠=1}^{𝑆} 𝑧^{𝑚𝑠} ≈ (1/(𝑀·𝑆)) ∑_{𝑚=1}^{𝑀} ∑_{𝑠=1}^{𝑆} 𝑓_{𝜃_{𝑚𝑠}}(𝑥),  (9)

and reformulate the predicted probability vector of MMCD as shown in (7).

Figure 2: Example showing how averaging logits instead of probabilities increases the confidence of an ensemble of four deterministic CNNs: 𝑧̄ = (1/4) ∑_{𝑚=1}^{4} 𝑧^𝑚, 𝑝_𝑧̄ = softmax(𝑧̄), and 𝑝̄ = (1/4) ∑_{𝑚=1}^{4} 𝑝^𝑚 with 𝑝^𝑚 = softmax(𝑧^𝑚). One can see that averaging logits (𝑝_𝑧̄) results in more confident predictions than averaging probabilities (𝑝̄). This is attributed to averaging logits being more sensitive to the magnitude of the logit values than averaging probabilities: a 𝑧^𝑚 with large values contributes most to 𝑧̄. In our example, 𝑧̄ is mostly influenced by the values of 𝑧^1; the contributions of 𝑧^2, 𝑧^3, and 𝑧^4 to 𝑧̄ are minor. In contrast, 𝑝̄ is influenced by the values of all probability vectors 𝑝^𝑚 and is therefore less sensitive to the magnitude of the individual logits.

From Figure 2, averaging logits instead of probabilities of multiple stochastic or deterministic CNNs increases the confidence of the averaged CNNs. Intuitively, logit averaging provides the best evidence (characterized by a low level of uncertainty caused by the reduction of inductive biases inherent in individual logits) for making decisions, whereas probability averaging provides the best confidence associated with decisions made using weak evidence (characterized by a high level of uncertainty caused by inductive biases inherent in individual logits). This implies that a decision made using probability averaging considers more uncertainty than one made using logit averaging. In this work, we evaluated the impact on the QoC of the possible increase in the degree of confidence caused by applying logit averaging instead of probability averaging.
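The effect illustrated in Figure 2 is easy to reproduce numerically. In the toy example below (our own made-up logits, not the paper's data), one member produces large-magnitude logits and dominates the logit average, so logit averaging yields a markedly higher confidence than probability averaging while the predicted class stays the same:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Four hypothetical ensemble members; member 1 has large-magnitude logits
# and therefore dominates the logit average, as in Figure 2.
Z = np.array([[9.0, 1.0, 0.0],
              [1.2, 1.0, 0.8],
              [1.1, 1.0, 0.9],
              [0.9, 1.0, 1.1]])

p_bar = softmax(Z).mean(axis=0)    # probability averaging, eq. (3)
p_z_bar = softmax(Z.mean(axis=0))  # logit averaging, eqs. (6)-(7)

# Same predicted class, but logit averaging is far more confident.
assert p_bar.argmax() == p_z_bar.argmax()
assert p_z_bar.max() > p_bar.max()
```

Note that logit averaging followed by a single softmax cannot recover the probability average, since softmax is nonlinear; this is exactly the sensitivity to logit magnitudes that the paper exploits and warns about.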
5. Experiments

5.1. Experimental setup

We hypothesized that the QoC of CNNs (strongly) depends on the task difficulty (specified via the training data), the underlying architecture, and/or the training procedure (mostly influenced by the regularization strength). Therefore, we compared logit and probability averaging on three datasets to evaluate the impact of the task difficulty on the QoC. Moreover, we compared logit and probability averaging using three different architectures to evaluate the impact of the underlying architecture on the QoC. Specifically, we evaluated MNIST [32] on VGGNets [1], FashionMNIST [33] on ResNets [2], and CIFAR10 [34] on DenseNets [35]. Finally, we compared logit and probability averaging on CNNs trained using two regularization strengths (strong and weak regularization, summarized in Table 1) to evaluate the impact of the regularization strength on the QoC. We observed that strong and weak regularization result in underconfident and overconfident CNNs, respectively. All CNNs were regularized using batch normalization [36] layers placed before each convolutional activation function. All CNNs were randomly initialized and trained with random shuffling of the training samples, using categorical cross-entropy and stochastic gradient descent with a momentum of 0.9, a learning rate of 0.02, a batch size of 128, and 100 epochs. All images were standardized and normalized by dividing pixel values by 255. For all MCD and MMCD models, we sampled activations of the first fully-connected layer using masks drawn from a cascade of Bernoulli and Gaussian distributions [22] with a dropout probability of 0.5. We performed 100 stochastic forward passes (𝑆 = 100) and considered ensembles consisting of five deterministic CNNs (𝑀 = 5).

Table 1: Summary of values assigned to regularization hyper-parameters.

  Hyper-parameter                                                      Weak regularization   Strong regularization
  Probability of dropout applied at inputs to max pooling layers       -                     0.05
  Probability of dropout applied at inputs to fully-connected layers   0.05                  0.5
  Rotation range [degree]                                              [-5, +5]              [-45, +45]
  Width and height shift range [pixel]                                 [-1, +1]              [-5, +5]
  Scale intensity range                                                [0.95, 1.05]          [0.9, 1.2]
  Shear intensity range                                                0.05                  0.1
  Additive Gaussian noise standard deviation range                     0.05                  0.5

5.2. Evaluation metrics

The QoC was evaluated by assessing the degree of confidence calibration. Specifically, we evaluated the calibration error using measures such as the negative log-likelihood (NLL) applied in [4, 5, 31], the expected calibration error (ECE) applied in [13, 8, 12], and the Brier score (BS) applied in [4]. Low values of NLL, ECE, and BS indicate a low calibration error, and vice versa. Furthermore, we evaluated the QoC by assessing the ability to separate TPs and FPs. Here, we evaluated the average confidence on evaluation data causing TPs or FPs. Given evaluation data causing TPs, we expect the average confidence to be high; for evaluation data causing FPs, we expect it to be low. Moreover, we evaluated the ability to separate TPs and FPs using the area under the receiver operating characteristic (AUC-ROC) applied in [37, 5]. The AUC-ROC summarizes the trade-off between the fraction of TPs that are correctly detected and the fraction of FPs that are undetected across different thresholds. In summary, in addition to the NLL, ECE, and BS, we evaluated the accuracy, the average confidence, and the AUC-ROC.

5.3. Evaluation data

We used five kinds of evaluation data for different purposes, namely test data, subsets of the correctly classified test data, out-of-domain data, swapped data, and noisy data.

Test data represent the test data of the experimental datasets, namely MNIST, CIFAR10, and FashionMNIST. These datasets include both correctly classified and misclassified test samples. Test data are used for estimating the accuracy, NLL, ECE, and BS. We expect the accuracy to be high and the NLL, ECE, and BS to be low on test data.

Figure 3: Examples of evaluation data for experiments conducted on CIFAR10: (a) test data, (b) swapped data, (c) noisy data.

Subsets of the correctly classified test data include 1000 correctly classified test samples from the experimental data. Since CNNs will make TPs on these data, we used them for evaluating the average confidence on TPs.

Swapped data were simulated using subsets of the correctly classified test data structurally perturbed by dividing the images into four regions and diagonally permuting the regions. As shown in Figure 3b, the upper left and right regions are permuted with the bottom right and left regions, respectively. Swapped data thus contain structurally perturbed objects. We expect CNNs to make FPs on swapped data, and therefore used these data for evaluating the average confidence on FPs caused by structurally perturbed objects.

Noisy data were simulated using subsets of the correctly classified test data perturbed by additive Gaussian noise with a standard deviation of 500. As shown in Figure 3c, noisy data contain noise within the given images. We expect CNNs to make FPs on these data, and therefore used them for evaluating the average confidence on FPs caused by noisy objects.

Out-of-domain data were simulated using 1000 test samples of CIFAR100 [34]. Since CNNs will make FPs on these data, we used them for evaluating the average confidence on FPs caused by unknown objects.

In general, we expect the average confidence to be high on TPs and low on FPs.

5.4. Experimental results

We evaluate the conducted experiments with respect to accuracy and QoC. Table 2 and Table 3 summarize the accuracy, average confidence, NLL, ECE, and BS of the different models using the two averaging approaches, for CNNs trained using strong regularization (causing underconfidence) and weak regularization (causing overconfidence), respectively. The results show that averaging logits instead of probabilities does not strongly affect the accuracy; averaging logits can thus preserve accuracy. Furthermore, averaging logits instead of probabilities significantly increases the average confidence; Figure 2 illustrates why. Further, Table 2 shows that averaging logits instead of probabilities significantly decreases the NLL, ECE, and BS for underconfident CNNs (trained using strong regularization). This means that averaging logits, unlike averaging probabilities, reduces the calibration error for underconfident CNNs. This is because the stronger the regularization, the lower the confidence and the higher the gap between accuracy and average confidence. Here, the increase in the degree of confidence caused by averaging logits instead of probabilities reduces this gap. For example, Table 2 shows that, on CIFAR10, averaging logits instead of probabilities for MMCD reduces the gap between accuracy and average confidence from 18.24 (= |88.75 − 70.51|)% to 9.52 (= |88.94 − 79.42|)%.

However, the increase in the degree of confidence caused by averaging logits instead of probabilities increases the calibration error for overconfident CNNs (trained using weak regularization). Table 3 provides empirical evidence for this claim by showing that, on CIFAR10 and FashionMNIST, the NLL, ECE, and BS of the ensembles increase when logits are averaged instead of probabilities. We argued that the more overconfident the CNNs, the higher the confidence and the higher the gap between accuracy and average confidence.
Table 2: Comparison of accuracy [%], average confidence [%] (in brackets), NLL [10⁻²], ECE [10⁻²], and BS [10⁻²] of different models using two approaches for averaging underconfident CNNs trained using strong regularization: averaging probabilities (AP) and averaging logits (AL). The results were obtained on the test data described in Section 5.3.

                Accuracy (Avg. confidence)↑      NLL↓           ECE↓           BS↓
                AP             AL                AP     AL      AP     AL      AP     AL
  CIFAR10 (DenseNets)
  Ensemble      89.52 (84.31)  89.60 (87.97)     34.66  32.81   5.23   2.47    16.13  15.38
  MCD           85.36 (73.35)  85.37 (80.04)     52.13  46.55   12.04  5.40    23.32  21.57
  MMCD          88.75 (70.51)  88.94 (79.42)     50.83  40.99   18.24  9.55    21.82  18.34
  FashionMNIST (ResNets)
  Ensemble      92.70 (87.86)  92.58 (90.16)     22.57  20.99   5.15   2.86    11.37  10.87
  MCD           90.56 (79.22)  90.56 (83.95)     35.45  30.18   11.47  6.85    15.82  14.57
  MMCD          92.65 (76.37)  92.73 (83.78)     35.57  26.96   16.31  9.10    14.87  12.47
  MNIST (VGGNets)
  Ensemble      99.04 (98.24)  99.04 (98.89)     3.25   2.90    1.03   0.52    1.52   1.41
  MCD           98.16 (94.53)  98.16 (96.48)     8.73   6.87    3.81   1.98    2.99   2.79
  MMCD          99.03 (94.67)  99.04 (97.46)     6.91   4.13    4.49   1.75    1.89   1.52

Table 3: Comparison of accuracy [%], average confidence [%] (in brackets), NLL [10⁻²], ECE [10⁻²], and BS [10⁻²] of ensembles using two approaches for averaging overconfident CNNs trained using weak regularization: averaging probabilities (AP) and averaging logits (AL). The results were obtained on the test data described in Section 5.3.

                           Accuracy (Avg. confidence)↑      NLL↓           ECE↓          BS↓
                           AP             AL                AP     AL      AP    AL      AP     AL
  CIFAR10 (DenseNets)      88.67 (89.43)  88.88 (96.17)     40.69  54.23   3.03  7.40    16.69  18.07
  FashionMNIST (ResNets)   94.49 (95.86)  94.58 (98.43)     20.20  28.00   1.98  4.11    8.36   9.32

Here, the increase in the degree of confidence caused by averaging logits instead of probabilities further increases the gap between accuracy and average confidence and therefore increases the calibration error. For example, Table 3 shows that, on CIFAR10, averaging the logits of the ensemble increases the gap between accuracy and average confidence from 0.76 (= |88.67 − 89.43|)% to 7.29 (= |88.88 − 96.17|)%.

In Table 4, the average confidence on TPs and FPs is shown for underconfident models using both averaging approaches. The results show that averaging logits instead of probabilities increases the confidence on both TPs and FPs. The increase in average confidence is sometimes very large for FPs due to noisy data. For example, for MMCD evaluated on FashionMNIST, the average confidence on the noisy data increases from 51.31% to 94.58% when averaging logits. This is because noisy data can increase the magnitude of the logits, and averaging logits is more sensitive to changes in the magnitude of the logits than averaging probabilities (see Figure 2). The increase in the degree of confidence caused by averaging logits can harm the separability of TPs and FPs. For example, the increase in the average confidence on the noisy data from 51.31% to 94.58% causes the AUC-ROC obtained from evaluating the degree of confidence to decrease from 84.80% to 42.42%.

Table 4: Comparison of average confidence [%] of different models using two approaches (averaging probabilities (AP) and averaging logits (AL)) for averaging underconfident networks trained using strong regularization and evaluated on TPs and FPs: TPs were obtained on subsets of the correctly classified test data, while FPs were obtained on the swapped, noisy, and out-of-domain (OOD) data described in Section 5.3.

                TP↑             FP (OOD)↓       FP (Swapped)↓   FP (Noisy)↓
                AP      AL      AP      AL      AP      AL      AP      AL
  CIFAR10 (DenseNets)
  Ensemble      93.94   96.63   35.39   40.08   51.84   56.03   39.42   58.69
  MCD           81.39   88.45   31.61   33.27   40.39   44.69   44.83   69.53
  MMCD          79.48   89.53   22.81   23.83   36.26   40.67   28.01   33.08
  FashionMNIST (ResNets)
  Ensemble      88.01   90.16   55.48   63.21   59.30   67.91   81.39   99.82
  MCD           79.39   83.76   47.08   50.36   55.75   59.29   41.23   65.79
  MMCD          76.40   83.76   42.76   49.09   45.73   52.70   51.31   94.58
  MNIST (VGGNets)
  Ensemble      99.09   99.55   57.16   80.45   51.96   62.01   69.58   88.84
  MCD           95.12   97.11   64.36   69.17   58.92   62.84   97.95   99.53
  MMCD          95.37   98.17   48.89   63.56   43.53   49.39   57.17   78.14

6. Discussion

The term 'combination process' encompasses how multiple networks are combined and the information type that is combined. It was found in [23, 24, 22] that simple averaging is more robust and captures uncertainty better than voting approaches, because simple averaging weights all predictions equally, while voting ignores uncertain predictions. In this work, we compared averaging logits against averaging probabilities. We empirically showed that averaging logits instead of probabilities increases the confidence while preserving the accuracy, for both underconfident and overconfident networks. This might be because logit averaging preserves the position of the maximum element of the individual logit vectors but is more sensitive to the magnitude of the logit values than probability averaging. Thus, logit values with a large magnitude contribute the most to the average logit. In this way, the magnitude of the logit values induces a non-uniform weighting under logit averaging, which is lost under probability averaging. Furthermore, we provided empirical evidence showing that, for underconfident networks (trained using strong regularization), the increase in confidence caused by averaging logits instead of probabilities reduces the calibration error on the test data, because it reduces the gap between accuracy and average confidence. However, for overconfident networks (trained using weak regularization), the increase in confidence caused by averaging logits instead of probabilities increases the calibration error on the test data, because it further widens the gap between accuracy and average confidence. These findings suggest that for underconfident networks we can average logits instead of probabilities to reduce the calibration error, whereas for overconfident networks we should average probabilities instead of logits to avoid increasing it. Although the increase in confidence caused by averaging logits reduces the calibration error on the test data for underconfident networks, we empirically showed that it can harm the separability of TPs and FPs. This is because averaging logits increases the confidence on both TPs and FPs; therefore, FPs can also be made with high confidence, similar to TPs. These findings suggest that reducing the calibration error on the test data and improving the separability of TPs and FPs can be two contradicting goals: improving one may be to the detriment of the other. Furthermore, for two models 𝐴 and 𝐵, if 𝐴 is better calibrated than 𝐵, then 𝐴 does not necessarily separate TPs and FPs better than 𝐵. This implies that calibration methods may be insufficient for separating TPs and FPs and therefore for ensuring safe decision-making. Additionally, existing methods for confidence calibration may not help in separating TPs and FPs. Subsequently, future work will evaluate the ability of existing confidence calibration methods to separate TPs and FPs. We also recommend that researchers evaluate both the calibration error of their proposed confidence calibration method and its ability to separate TPs and FPs. Finally, for mission- and safety-critical applications where the separability of TPs and FPs is of paramount importance, we suggest averaging probabilities to avoid the negative impact of logit averaging on the ability to separate TPs and FPs.

7. Conclusion

Averaging logits instead of probabilities of stochastic or deterministic networks increases the degree of confidence on both TPs and FPs. This reduces the calibration error on the test data for underconfident networks but harms the separability of TPs and FPs. Our empirical results show that there is a trade-off between improving calibration on the test data and improving the separability of TPs and FPs. Additionally, the increase in the degree of confidence increases the calibration error on the test data for overconfident networks. Therefore, averaging logits should only be applied when combining underconfident networks.
7. Conclusion

Averaging logits instead of probabilities of stochastic or deterministic networks increases the degree of confidence on both TPs and FPs. This reduces the calibration error on the test data for underconfident networks but harms the separability of TPs and FPs. Our empirical results show that there is a trade-off between improving calibration on the test data and improving the separability of TPs and FPs. Additionally, the increase in the degree of confidence increases the calibration error on the test data for overconfident networks. Therefore, averaging logits should only be applied when combining underconfident networks. For example, we can average the logits instead of the probabilities of an ensemble of networks trained with mixup or other modern data augmentation techniques to improve calibration on the test data. Notwithstanding this, for mission- and safety-critical applications where the separability of TPs and FPs is essential, we suggest the traditional averaging of probabilities. However, it remains unclear whether the findings of this paper would change if the given networks or the averaged logits were calibrated, for example, with temperature scaling [13]. This suggests a new research direction.
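Temperature scaling [13], named above as an open question, divides the logits by a single scalar 𝑇 > 0 fitted on validation data before the softmax, so the predicted class never changes while the confidence is softened (𝑇 > 1) or sharpened (𝑇 < 1). A minimal sketch, assuming a simple grid search over 𝑇 and made-up validation logits (the original method of Guo et al. [13] fits 𝑇 by gradient-based NLL minimization):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def nll(logits, labels, temperature):
    """Average negative log-likelihood of the temperature-scaled softmax."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = softmax([v / temperature for v in z])
        total -= math.log(p[y])
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Pick the T > 0 minimizing validation NLL via grid search (a stand-in
    for the gradient-based fit of the original method)."""
    grid = grid or [0.05 * k for k in range(1, 101)]  # T in (0, 5]
    return min(grid, key=lambda t: nll(logits, labels, t))

# Made-up overconfident validation logits and true labels:
# the last two samples are misclassified despite sharp logits.
val_logits = [[8.0, 0.0, 0.0], [0.0, 7.5, 0.0], [6.0, 5.5, 0.0], [0.0, 5.0, 4.8]]
val_labels = [0, 1, 1, 2]

T = fit_temperature(val_logits, val_labels)
# A fitted T > 1 softens the overconfident predictions while leaving
# every argmax (and hence the accuracy) unchanged.
```

Whether recalibrating the members or the averaged logit in this way preserves the conclusions above is precisely the open question; the sketch only illustrates the mechanism.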
References

[1] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015. URL: http://arxiv.org/abs/1409.1556.
[2] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[3] J. Gawlikowski, C. R. N. Tassi, M. Ali, J. Lee, M. Humt, J. Feng, A. Kruspe, R. Triebel, P. Jung, R. Roscher, et al., A survey of uncertainty in deep neural networks, arXiv preprint arXiv:2107.03342 (2021).
[4] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in Neural Information Processing Systems 30 (2017).
[5] S. Thulasidasan, G. Chennupati, J. A. Bilmes, T. Bhattacharya, S. Michalak, On mixup training: Improved calibration and predictive uncertainty for deep neural networks, Advances in Neural Information Processing Systems 32 (2019).
[6] Y. Qin, X. Wang, A. Beutel, E. Chi, Improving calibration through the relationship with adversarial robustness, in: A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, 2021. URL: https://openreview.net/forum?id=NJex-5TZIQa.
[7] Y. Wen, G. Jerfel, R. Muller, M. W. Dusenberry, J. Snoek, B. Lakshminarayanan, D. Tran, Combining ensembles and data augmentation can harm your calibration, arXiv preprint arXiv:2010.09875 (2020).
[8] R. Rahaman, A. H. Thiery, Uncertainty quantification and deep ensembles, Advances in Neural Information Processing Systems 34 (2021).
[9] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, in: International Conference on Learning Representations, 2018.
[10] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[11] R. Müller, S. Kornblith, G. Hinton, When does label smoothing help?, Curran Associates Inc., Red Hook, NY, USA, 2019.
[12] X. Wu, M. Gales, Should ensemble members be calibrated?, arXiv preprint arXiv:2101.05397 (2021).
[13] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1321–1330.
[14] Z. Zhang, A. V. Dalca, M. R. Sabuncu, Confidence calibration for convolutional neural networks using structured dropout, arXiv preprint arXiv:1906.09551 (2019).
[15] M. Sensoy, L. Kaplan, M. Kandemir, Evidential deep learning to quantify classification uncertainty, Advances in Neural Information Processing Systems 31 (2018).
[16] C. Ju, A. Bibaut, M. van der Laan, The relative performance of ensemble methods with deep convolutional neural networks for image classification, Journal of Applied Statistics 45 (2018) 2800–2818.
[17] L. I. Kuncheva, Combining pattern classifiers: Methods and algorithms, John Wiley & Sons, 2014.
[18] M. Van Erp, L. Vuurpijl, L. Schomaker, An overview and comparison of voting methods for pattern recognition, in: Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition, IEEE, 2002, pp. 195–200.
[19] T. Tajti, New voting functions for neural network algorithms, in: Annales Mathematicae et Informaticae, volume 52, Eszterházy Károly Egyetem Líceum Kiadó, 2020, pp. 229–242.
[20] T. G. Dietterich, Machine-learning research, AI Magazine 18 (1997) 97–97.
[21] S. Tulyakov, S. Jaeger, V. Govindaraju, D. Doermann, Review of classifier combination methods, Machine Learning in Document Analysis and Recognition (2008) 361–386.
[22] C. R. Njieutcheu Tassi, Bayesian convolutional neural network: Robustly quantify uncertainty for misclassifications detection, in: Mediterranean Conference on Pattern Recognition and Artificial Intelligence, Springer, 2019, pp. 118–132.
[23] R. T. Clemen, Combining forecasts: A review and annotated bibliography, International Journal of Forecasting 5 (1989) 559–583.
[24] J. Kittler, M. Hatef, R. P. Duin, J. Matas, On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 226–239.
[25] K. C. Lichtendahl Jr., Y. Grushka-Cockayne, R. L. Winkler, Is it better to average probabilities or quantiles?, Management Science 59 (2013) 1594–1611.
[26] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.
[27] W. H. Beluch, T. Genewein, A. Nürnberger, J. M. Köhler, The power of ensembles for active learning in image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9368–9377.
[28] F. K. Gustafsson, M. Danelljan, T. B. Schon, Evaluating scalable Bayesian deep learning methods for robust computer vision, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 318–319.
[29] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, S. Levine, Uncertainty-aware reinforcement learning for collision avoidance, arXiv preprint arXiv:1702.01182 (2017).
[30] B. Lütjens, M. Everett, J. P. How, Safe reinforcement learning with model uncertainty estimates, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 8662–8668.
[31] A. G. Wilson, P. Izmailov, Bayesian deep learning and a probabilistic perspective of generalization, Advances in Neural Information Processing Systems 33 (2020) 4697–4708.
[32] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998) 2278–2324.
[33] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747 (2017).
[34] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images (2009).
[35] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[36] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, PMLR, 2015, pp. 448–456.
[37] D. Hendrycks, K. Gimpel, A baseline for detecting misclassified and out-of-distribution examples in neural networks, in: Proceedings of the International Conference on Learning Representations, 2017.