<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">The impact of averaging logits over probabilities on ensembles of neural networks</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Cedrique</forename><forename type="middle">Rovile</forename><surname>Njieutcheu Tassi</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">German Aerospace Center (DLR)</orgName>
								<orgName type="department" key="dep2">Institute of Optical Sensor Systems</orgName>
								<address>
									<addrLine>Rutherfordstraße 2</addrLine>
									<postCode>12489</postCode>
									<settlement>Berlin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jakob</forename><surname>Gawlikowski</surname></persName>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">German Aerospace Center (DLR)</orgName>
								<orgName type="department" key="dep2">Institute of Data Science</orgName>
								<address>
									<addrLine>Mälzerstraße 3-5</addrLine>
									<postCode>07745</postCode>
									<settlement>Jena</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Auliya</forename><forename type="middle">Unnisa</forename><surname>Fitri</surname></persName>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">German Aerospace Center (DLR)</orgName>
								<orgName type="department" key="dep2">Institute of Data Science</orgName>
								<address>
									<addrLine>Mälzerstraße 3-5</addrLine>
									<postCode>07745</postCode>
									<settlement>Jena</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rudolph</forename><surname>Triebel</surname></persName>
							<affiliation key="aff2">
								<orgName type="department" key="dep1">German Aerospace Center (DLR)</orgName>
								<orgName type="department" key="dep2">Institute of Robotics and Mechatronics</orgName>
								<address>
									<addrLine>Münchener Straße 20</addrLine>
									<postCode>82234</postCode>
									<settlement>Wessling</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">The impact of averaging logits over probabilities on ensembles of neural networks</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">251D5E357E785C67228F345F12838CEA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T23:21+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Model averaging</term>
					<term>Combination process</term>
					<term>Logit averaging</term>
					<term>Probability averaging</term>
					<term>Ensemble</term>
					<term>Monte Carlo Dropout (MCD)</term>
					<term>Mixture of Monte Carlo Dropout (MMCD)</term>
					<term>Quality of confidence (QoC)</term>
					<term>Confidence calibration</term>
					<term>Separating true predictions (TPs) and false predictions (FPs)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Model averaging has become a standard for improving neural networks in terms of accuracy, calibration, and the ability to detect false predictions (FPs). However, recent findings show that model averaging does not necessarily lead to calibrated confidences, especially for underconfident networks. While existing methods for improving the calibration of combined networks focus on recalibrating, building, or sampling calibrated models, we focus on the combination process. Specifically, we evaluate the impact of averaging logits instead of probabilities on the quality of confidence (QoC). We compare combining logits instead of probabilities of members (networks) for models such as ensembles, Monte Carlo Dropout (MCD), and Mixture of Monte Carlo Dropout (MMCD). The comparison is based on experimental results on three datasets using three different architectures. We show that averaging logits instead of probabilities increases the confidence, thereby improving the confidence calibration of underconfident models. For example, for MCD evaluated on CIFAR10, averaging logits instead of probabilities reduces the expected calibration error (ECE) from 12.03% to 5.44%. However, the increase in confidence can harm the confidence calibration of overconfident models and the separability between true predictions (TPs) and FPs. For example, for MMCD evaluated on FashionMNIST, the average confidence on FPs due to the noisy data increases from 51.31% to 94.58% when averaging logits instead of probabilities. While averaging logits can be applied to underconfident models to improve the calibration on test data, we suggest averaging probabilities for safety- and mission-critical applications where the separability of TPs and FPs is of paramount importance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Recently, averaging the predictions of multiple stochastic or deterministic networks has become a standard approach for improving accuracy <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> and uncertainty estimates <ref type="bibr" target="#b2">[3]</ref>. Generally, the quality of uncertainty estimates (e.g.: QoC) is assessed by the degree of calibration and/or the ability to detect FPs. Model averaging can yield well-calibrated confidence <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref> and is one of the state-of-the-art methods for detecting FPs caused by out-of-distribution examples <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b2">3]</ref>. However, recent findings <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref> show that model averaging does not necessarily lead to calibrated confidence, especially when the networks are built using modern regularization techniques, such as mixup <ref type="bibr" target="#b8">[9]</ref> or label smoothing <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>. This is because modern regularization techniques can (strongly) regularize networks, resulting in underconfidence. Furthermore, averaging underconfident networks</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022)</head><p>Cedrique.NjieutcheuTassi@dlr.de (C. R. N. Tassi); Jakob.Gawlikowski@dlr.de (J. Gawlikowski); Auliya.Fitri@dlr.de (A. U. Fitri); Rudolph.Triebel@dlr.de (R. Triebel) produces an even more underconfident network. For example, <ref type="bibr" target="#b6">[7]</ref> showed that averaging networks trained with modern regularization techniques resulted in more underconfident networks and therefore miscalibrated predictions. <ref type="bibr" target="#b11">[12]</ref> supported this argument by theoretically and empirically showing that averaging calibrated networks does not always lead to calibrated confidences. Calibrating the confidences of averaged networks has received little attention in the literature. Generally, post-processing calibration methods, such as temperature scaling <ref type="bibr" target="#b12">[13]</ref>, can be used to recalibrate the confidences of averaged networks, as demonstrated in <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b11">12]</ref>. According to <ref type="bibr" target="#b13">[14]</ref>, and as further supported by <ref type="bibr" target="#b7">[8]</ref>, confidence calibration in model averaging is correlated with the diversity inherent in the individual networks: the more diverse the networks, the better the calibration. Motivated by this observation, <ref type="bibr" target="#b13">[14]</ref> promoted model diversity using structured dropout to reduce calibration errors. <ref type="bibr" target="#b6">[7]</ref> proposed class-adjusted mixup, which trains less confident networks by evaluating the difference between the accuracy (estimated on a validation dataset after each training epoch) and the confidence of each training sample, activating or deactivating mixup training for overconfidence (average confidence &gt; accuracy) or underconfidence (average confidence &lt; accuracy), respectively. 
All these methods for improving the calibration of combined networks focus on recalibrating, building, or sampling calibrated networks. This work, in contrast, focuses on the combination process itself. Specifically, we address the question: What is the impact of averaging logits instead of probabilities of multiple (stochastic or deterministic) networks on the QoC?</p><p>We hypothesized that averaging logits instead of probabilities of multiple networks increases the confidence of the averaged network. This is because logits (the inputs to the softmax), which can be interpreted as the evidence found for the possible classes <ref type="bibr" target="#b14">[15]</ref>, are continuous values normalized using the softmax to produce discrete probabilities. The softmax normalization of continuous values (logits) to discrete values (probabilities) causes information loss and a resulting insensitivity to changes in the magnitudes of logits. That is, the softmax is a nonlinear function that maps multiple logit vectors with large differences in magnitude to the same discrete probability vector. We evaluated the impact of the increase in confidence caused by averaging logits instead of probabilities on the QoC. Specifically, we evaluated the QoC by assessing the degree of confidence calibration, which measures the difference between the predicted probability (average confidence) and the true probability (empirical accuracy). Furthermore, we evaluated the QoC by assessing the ability of the confidence to separate TPs and FPs. To provide empirical evidence, we set logit averaging against probability averaging and compared both approaches using different averaged models, such as ensembles, MCD, and MMCD. 
The comparison was based on results from different experiments conducted on three datasets, namely, MNIST, FashionMNIST, and CIFAR10 evaluated on VGGNet, ResNet, and DenseNet, respectively.</p><p>Results show that averaging logits instead of probabilities preserves accuracy, but increases confidence. For example, for MCD evaluated on CIFAR10 (see Table <ref type="table">2</ref>), the accuracy remained around 85.36% while the average confidence increased from 73.35% to 80.04% when we averaged logits instead of probabilities. Furthermore, given underconfident models, the increase in the degree of confidence reduces the calibration error on the test data. For example, for MCD evaluated on CIFAR10, ECE dropped from 12.04% to 5.40% when the average confidence increased from 73.35% to 80.04%. However, given overconfident models, the increase in the degree of confidence increased the calibration error on the test data. For example, for the ensemble evaluated on CIFAR10 (see Table <ref type="table" target="#tab_3">3</ref>), ECE increased from 3.03% to 7.40% when the average confidence increased from 89.43% to 96.17%. Finally, for underconfident or overconfident models, the increase in the degree of confidence can harm the separability between TPs and FPs. This is because averaging logits instead of probabilities increases the confidence of both TPs and FPs. Therefore, FPs can be made with high confidence similar to TPs. For example, for MMCD evaluated on FashionMNIST (see Table <ref type="table">4</ref>), the average confidence on FPs due to the noisy data increased from 51.31% to 94.58% when averaging logits instead of probabilities. In summary, we provide empirical evidence demonstrating how combining logits instead of probabilities of multiple (stochastic or deterministic) networks • preserves accuracy, but increases the confidence on TPs and FPs. 
• reduces the calibration error (given underconfident networks), but increases the calibration error (given overconfident networks). • can harm the separability between TPs and FPs.</p></div>
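The softmax information-loss argument above can be illustrated with a minimal NumPy sketch (the logit values are illustrative, not from the paper's experiments): softmax is invariant to adding a constant to all logits, so logit vectors with very different magnitudes map to the same probability vector, while scaling the logits changes the confidence.

```python
import numpy as np

def softmax(z):
    # Subtracting the max improves numerical stability; softmax is shift-invariant.
    e = np.exp(z - z.max())
    return e / e.sum()

z_small = np.array([2.0, 1.0, 0.0])
z_large = z_small + 100.0  # same relative evidence, much larger magnitude

p_small = softmax(z_small)
p_large = softmax(z_large)
assert np.allclose(p_small, p_large)  # the magnitude information is lost

# Scaling the logits, however, changes the confidence (max probability):
p_scaled = softmax(3.0 * z_small)
print(p_small.max(), p_scaled.max())  # the scaled logits are more confident
```

This is the sense in which distinct logit vectors collapse to one discrete probability vector after normalization.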
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related works</head><p>The combination process describes how multiple members are combined and the information type (e.g., logits or probabilities) that is combined. Several approaches, such as stacking <ref type="bibr" target="#b15">[16]</ref> and voting <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19]</ref>, have been reported for aggregating multiple predictions. Some of these approaches have been reviewed and discussed in <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref> and experimentally compared in <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b15">16]</ref> to find the one with the best accuracy. Whether one approach improves accuracy more than another was found to depend on several factors, such as the number of members, the diversity inherent in individual members, and the accuracy of individual members. In <ref type="bibr" target="#b21">[22]</ref>, however, we compared approaches such as averaging, plurality voting, and majority voting to find the one that best captures uncertainty. We found that the averaging approach captures uncertainty better than the voting approaches. Before our work, <ref type="bibr" target="#b22">[23]</ref> argued that simple averaging approaches are more robust than voting approaches, an argument further supported by <ref type="bibr" target="#b23">[24]</ref>. This is because the averaging approach considers all members' predictions, whereas plurality/majority voting ignores uncertain predictions and therefore reduces the uncertainty in the combined members' prediction. Although various combination approaches have been presented and compared in the literature, the information type that is combined has received relatively little attention. 
<ref type="bibr" target="#b24">[25]</ref> showed that averaging quantiles rather than probabilities improves the predictive performance. Generally, for neural networks and classification problems in particular, multiple members (networks) are combined by averaging probabilities <ref type="bibr" target="#b15">[16]</ref>. <ref type="bibr" target="#b15">[16]</ref> evaluated the impact of combining logits instead of probabilities on accuracy; however, the impact on the QoC remains unclear. Thus, we investigated the impact of combining logits instead of probabilities on the QoC.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Background</head><p>In the context of image classification, let the training data 𝐷𝑡𝑟𝑎𝑖𝑛 = {𝑥𝑖 ∈ R 𝐻×𝑊 ×𝐶 , 𝑦𝑖 ∈ 𝑈 𝐾 } 𝑖∈[1,𝑁 ] be a realization of independently and identically distributed random variables (𝑥, 𝑦) ∈ 𝑋 × 𝑌 , where 𝑥𝑖 denotes the 𝑖 𝑡ℎ input and 𝑦𝑖 its corresponding one-hot encoded class label from 𝑈 𝐾 , the set of standard unit vectors of R 𝐾 . 𝑋 and 𝑌 denote the input and label spaces. 𝐻 × 𝑊 × 𝐶 denotes the dimension of the input images, where 𝐻, 𝑊 , and 𝐶 refer to the height, width, and number of channels, respectively. 𝐾 and 𝑁 denote the numbers of possible output classes and samples within the training data, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Convolutional neural network (CNN)</head><p>A CNN is a nonlinear function 𝑓 𝜃 parameterized by model parameters 𝜃, called the network weights. Here, it maps input images 𝑥𝑖 ∈ R 𝐻×𝑊 ×𝐶 to class labels 𝑦𝑖 ∈ 𝑈 𝐾 ,</p><formula xml:id="formula_0">𝑓 𝜃 : 𝑥𝑖 ∈ R 𝐻×𝑊 ×𝐶 → 𝑦𝑖 ∈ [0, 1] 𝐾 ; 𝑓 𝜃 (𝑥𝑖) = 𝑦𝑖 (1)</formula><p>The network parameters are optimized on the training dataset 𝐷𝑡𝑟𝑎𝑖𝑛. Given a new data sample 𝑥 ∈ R 𝐻×𝑊 ×𝐶 , a trained CNN 𝑓 𝜃 predicts the corresponding target 𝑦 = 𝑓 𝜃 (𝑥) using the set of trained weights 𝜃. The pre-softmax network output (the logit vector) is 𝑧, from which a probability vector 𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧) can be computed. In the following, this probability vector is abbreviated by 𝑝 and its entries by 𝑝 𝑘 with 𝑘 = 1, . . . , 𝐾 and ∑︀ 𝐾 𝑘=1 𝑝 𝑘 = 1. Further, we get the predicted confidence 𝑐 = max 𝑘 (𝑝 𝑘 ) and the predicted class label 𝑦 = arg max 𝑘 (𝑝 𝑘 ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Monte Carlo Dropout (MCD)</head><p>MCD was investigated in <ref type="bibr" target="#b25">[26,</ref><ref type="bibr" target="#b26">27,</ref><ref type="bibr" target="#b27">28]</ref> for uncertainty estimation. It is one of the most widespread Bayesian methods reviewed in <ref type="bibr" target="#b2">[3]</ref>. It approximates the prediction 𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) using the mean of 𝑆 stochastic forward passes, 𝑝(𝑦|𝑥, 𝜃1), ..., 𝑝(𝑦|𝑥, 𝜃𝑆), representing 𝑆 stochastic CNNs parameterized by samples 𝜃1, 𝜃2, ..., and 𝜃𝑆. That is,</p><formula xml:id="formula_1">𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) ≈ 1 𝑆 𝑆 ∑︁ 𝑠=1 𝑝(𝑦|𝑥, 𝜃𝑠) ≈ 1 𝑆 𝑆 ∑︁ 𝑠=1 𝑓𝜃 𝑠 (𝑥). (2)</formula><p>Specifically, MCD approximates the prediction with a dropout distribution realized by sampling weights with masks drawn from known distributions, such as Gaussian, Bernoulli, or a cascade of Gaussian and Bernoulli distributions <ref type="bibr" target="#b21">[22]</ref>. For example, given the activation vector 𝑎 fed to an MCD layer (placed, for example, at the input of the first fully-connected layer) and assuming that sampling is realized with masks drawn from a cascade of Gaussian and Bernoulli distributions, the MCD layer samples the 𝑗 𝑡ℎ element of 𝑎 as</p><formula xml:id="formula_2">𝑎 𝑠 𝑗 = 𝑎𝑗 * 𝛼𝑗 * 𝛽𝑗 with 𝛼𝑗 ∼ 𝒩 (1, 𝜎 2 = 𝑞/(1 − 𝑞)) and 𝛽𝑗 ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑞).</formula><p>Here, 𝑞 denotes the dropout probability. In this work, we refer to MCD as an average of 𝑆 stochastic CNNs.</p></div>
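The cascade mask sampling above can be sketched in a few lines of NumPy (the layer placement and the surrounding network are omitted; the activation vector and seed are illustrative, and the Bernoulli parameterization follows the formula in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def mcd_cascade_mask(a, q=0.5):
    """One stochastic realization of the activations `a`, sampled with a
    cascade of Gaussian and Bernoulli masks: a_j^s = a_j * alpha_j * beta_j,
    with alpha_j ~ N(1, q/(1-q)) and beta_j ~ Bernoulli(q)."""
    alpha = rng.normal(loc=1.0, scale=np.sqrt(q / (1.0 - q)), size=a.shape)
    beta = rng.binomial(n=1, p=q, size=a.shape)
    return a * alpha * beta

a = np.ones(1000)                                              # toy activation vector
samples = np.stack([mcd_cascade_mask(a) for _ in range(100)])  # S = 100 passes
print(samples.mean())  # close to q = 0.5, the mean of the cascade mask
```

Each call yields one stochastic forward pass of the layer; averaging the resulting network outputs over the S passes gives the MCD prediction of Eq. (2).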
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Ensemble</head><p>An (explicit) ensemble was investigated in <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b26">27,</ref><ref type="bibr" target="#b27">28]</ref> for uncertainty estimation. It approximates the prediction 𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) by combining CNNs trained under different settings. Given a set of CNNs 𝑓 𝜃𝑚 for 𝑚 ∈ {1, 2, ..., 𝑀 }, the ensemble prediction is obtained by averaging over the predictions of the CNNs. That is,</p><formula xml:id="formula_3">𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) := 1 𝑀 𝑀 ∑︁ 𝑚=1 𝑝(𝑦|𝑥, 𝜃𝑚) := 1 𝑀 𝑀 ∑︁ 𝑚=1 𝑓 𝜃𝑚 (𝑥).<label>(3)</label></formula><p>In this work, we refer to an ensemble as an average of 𝑀 deterministic CNNs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Mixture of Monte Carlo Dropout (MMCD)</head><p>MMCD was investigated in <ref type="bibr" target="#b28">[29,</ref><ref type="bibr" target="#b29">30,</ref><ref type="bibr" target="#b30">31]</ref> for uncertainty estimation. It combines 𝑀 MCD models, each performing 𝑆 stochastic forward passes, and approximates the prediction 𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) by averaging over all 𝑀 • 𝑆 passes. That is,</p><formula xml:id="formula_4">𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) ≈ 1 𝑀 • 𝑆 𝑀 ∑︁ 𝑚=1 𝑆 ∑︁ 𝑠=1 𝑝(𝑦|𝑥, 𝜃𝑚 𝑠 ) ≈ 1 𝑀 • 𝑆 𝑀 ∑︁ 𝑚=1 𝑆 ∑︁ 𝑠=1 𝑓 𝜃𝑚 𝑠 (𝑥).<label>(4)</label></formula><p>In this work, we refer to MMCD as an average of 𝑀 • 𝑆 stochastic CNNs.</p></div>
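Eq. (4) can be sketched as follows; `ToyMCDModel` is a hypothetical stand-in for a trained MCD network whose `predict` keeps dropout active at test time, and the logits and noise scale are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyMCDModel:
    """Hypothetical stand-in for a trained MCD network: each call returns
    a noisy probability vector (dropout active at test time)."""
    def __init__(self, logits):
        self.logits = np.asarray(logits, dtype=float)

    def predict(self, x):
        z = self.logits + rng.normal(scale=0.5, size=self.logits.shape)
        e = np.exp(z - z.max())
        return e / e.sum()

def mmcd_predict(models, x, S=100):
    # Eq. (4): average the probabilities of all M * S stochastic forward passes.
    probs = [m.predict(x) for m in models for _ in range(S)]
    return np.mean(probs, axis=0)

models = [ToyMCDModel([2.0, 1.0, 0.0]) for _ in range(5)]  # M = 5
p = mmcd_predict(models, x=None, S=100)
print(p.sum())  # a valid probability vector: entries sum to 1
```

Setting `models` to a single MCD network recovers Eq. (2), and setting `S=1` with deterministic members recovers the ensemble average of Eq. (3).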
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Combining logits instead of probabilities</head><p>The output layer of a CNN-based classifier includes 𝐾 output neurons with a softmax activation function, which normalizes its inputs (continuous values) to produce discrete probabilities 𝑝 𝑘 (with 𝑘 = 1, . . . , 𝐾 and ∑︀ 𝐾 𝑘=1 𝑝 𝑘 = 1) representing the probability that the input image belongs to the class associated with the 𝑘 𝑡ℎ output neuron. The inputs to the softmax function are the logits, which are interpreted as evidence for the possible classes <ref type="bibr" target="#b14">[15]</ref>. The discrete probability 𝑝 𝑘 is interpreted as the model confidence that the input belongs to the class associated with the 𝑘 𝑡ℎ output neuron. Given the logit vector 𝑧 = [︀ 𝑧1 . . . 𝑧𝐾 ]︀ 𝑇 , the softmax estimates 𝑝 = [︀ 𝑝1 . . . 𝑝𝐾 ]︀ 𝑇 as</p><formula xml:id="formula_5">𝑝 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧) = 1 ∑︀ 𝐾 𝑘=1 exp(𝑧 𝑘 ) [︀ exp(𝑧1) . . . exp(𝑧𝐾 ) ]︀ 𝑇 .<label>(5)</label></formula><p>From Figure <ref type="figure" target="#fig_1">1</ref>, given an ensemble of 𝑀 deterministic CNNs with logits 𝑧 𝑚 , the average logit 𝑧 can be estimated as</p><formula xml:id="formula_6">𝑧 := 1 𝑀 𝑀 ∑︁ 𝑚=1 𝑧 𝑚 := 1 𝑀 𝑀 ∑︁ 𝑚=1 𝑓 𝜃𝑚 (𝑥),<label>(6)</label></formula><p>and the predicted probability vector of the ensemble of deterministic CNNs can be reformulated as</p><formula xml:id="formula_7">𝑝(𝑦|𝑥, 𝐷𝑡𝑟𝑎𝑖𝑛) = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧).<label>(7)</label></formula><p>Given MCD representing an ensemble of 𝑆 stochastic CNNs with logits 𝑧 𝑠 , we can estimate the average logit 𝑧 as</p><formula xml:id="formula_8">𝑧 ≈ 1 𝑆 𝑆 ∑︁ 𝑠=1 𝑧 𝑠 ≈ 1 𝑆 𝑆 ∑︁ 𝑠=1 𝑓 𝜃𝑠 (𝑥),<label>(8)</label></formula><p>and reformulate the predicted probability vector of MCD, as shown in (7). 
Similarly, given MMCD representing an ensemble of 𝑀 • 𝑆 stochastic CNNs with logits 𝑧 𝑚𝑠 , we can estimate the average logit 𝑧 as</p><formula xml:id="formula_9">𝑧 ≈ 1 𝑀 • 𝑆 𝑀 ∑︁ 𝑚=1 𝑆 ∑︁ 𝑠=1 𝑧 𝑚𝑠 ≈ 1 𝑀 • 𝑆 𝑀 ∑︁ 𝑚=1 𝑆 ∑︁ 𝑠=1 𝑓𝜃 𝑚𝑠 (𝑥),<label>(9)</label></formula><p>and reformulate the predicted probability vector of MMCD, as shown in (7). From Figure <ref type="figure">2</ref>, averaging logits instead of probabilities of multiple stochastic or deterministic CNNs increases the confidence of the averaged CNNs. Intuitively, logit averaging provides the best evidence (characterized by a low level of uncertainty caused by the reduction of inductive biases inherent in individual logits) for making decisions. In contrast, probability averaging provides the best confidence associated with decisions made using weak evidence (characterized by a high level of uncertainty caused by inductive biases inherent in individual logits). This implies that a decision made using probability averaging considers more uncertainty than one made using logit averaging. In this work, we evaluated the impact of the possible increase in the degree of confidence caused by applying logit averaging instead of probability averaging on the QoC.</p></div>
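The confidence increase can be reproduced with a small NumPy example in the spirit of Figure 2 (the logit values are illustrative): one member with large-magnitude logits dominates the logit average of Eqs. (6)-(7), while the probability average weights all members' normalized outputs.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Four members; z^1 has a much larger magnitude than the others.
Z = np.array([
    [10.0, 2.0, 1.0],
    [ 0.5, 1.0, 0.8],
    [ 0.6, 0.9, 1.1],
    [ 0.9, 0.7, 1.0],
])

p_AP = np.mean([softmax(z) for z in Z], axis=0)  # average probabilities (AP)
p_AL = softmax(Z.mean(axis=0))                   # average logits (AL), Eqs. (6)-(7)

print(p_AP.max(), p_AL.max())  # AL is markedly more confident than AP
```

Here both approaches predict the same class, but the AL confidence is far higher because the first member's large logits dominate the average logit.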
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Experimental setup</head><p>[Caption of Figure 2: 𝑝 = 1/4 ∑︀ 4 𝑚=1 𝑝 𝑚 with 𝑝 1 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 1 ), 𝑝 2 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 2 ), 𝑝 3 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 3 ), and 𝑝 4 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 4 ). One can see that averaging logits (𝑝 𝑧 ) results in more confident predictions than averaging probabilities (𝑝). This is attributed to averaging logits being more sensitive to the magnitude of logit values than averaging probabilities: 𝑧 𝑚 with large values contributes most to 𝑧. In the example, 𝑧 is mostly influenced by the values of 𝑧 1 , while the contributions of 𝑧 2 , 𝑧 3 , and 𝑧 4 are minor. However, 𝑝 is influenced by the values of all probability vectors 𝑝 𝑚 and is therefore less sensitive to the magnitude of individual logits.]</p><p>We hypothesized that the QoC of CNNs (strongly) depends on the task difficulty (specified by the training data), the underlying architecture, and/or the training procedure (mostly influenced by the regularization strength). Therefore, we compared logit and probability averaging on three datasets to evaluate the impact of the task difficulty on the QoC. Moreover, we compared logit and probability averaging using three different architectures to evaluate the impact of the underlying architecture on the QoC. Specifically, we evaluated MNIST <ref type="bibr" target="#b31">[32]</ref> on VGGNets <ref type="bibr" target="#b0">[1]</ref>, FashionMNIST <ref type="bibr" target="#b32">[33]</ref> on ResNets <ref type="bibr" target="#b1">[2]</ref>, and CIFAR10 <ref type="bibr" target="#b33">[34]</ref> on DenseNets <ref type="bibr" target="#b34">[35]</ref>. Finally, we compared logit and probability averaging on CNNs trained using two regularization strengths (strong and weak regularization, summarized in Table <ref type="table">1</ref>) to evaluate the impact of the regularization strength on the QoC. We observed that strong and weak regularization result in underconfident and overconfident CNNs, respectively. 
All CNNs were regularized using batch normalization <ref type="bibr" target="#b35">[36]</ref> layers placed before each convolutional activation function. All CNNs were randomly initialized and trained with random shuffling of training samples. All CNNs were trained using the categorical cross-entropy loss and stochastic gradient descent with a momentum of 0.9, a learning rate of 0.02, a batch size of 128, and 100 epochs. All images were standardized and normalized by dividing pixel values by 255. For all MCD and MMCD, we sampled activations of the first fully-connected layer using masks drawn from a cascade of Bernoulli and Gaussian distributions <ref type="bibr" target="#b21">[22]</ref> and a dropout probability of 0.5. We performed 100 stochastic forward passes (𝑆 = 100) and considered ensembles consisting of five deterministic CNNs (𝑀 = 5).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Summary of values assigned to regularization hyperparameters. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Evaluation metrics</head><p>QoC was evaluated by assessing the degree of confidence calibration. Specifically, we evaluated the calibration error using measures such as the negative log likelihood (NLL) applied in <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b30">31]</ref>, the expected calibration error (ECE) applied in <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b11">12]</ref>, and the Brier score (BS) applied in <ref type="bibr" target="#b3">[4]</ref>. Low values of NLL, ECE, and BS indicate low calibration error and vice versa. Furthermore, we evaluated QoC by assessing the ability of the confidence to separate TPs and FPs. Here, we evaluated the average confidence on evaluation data causing TPs or FPs. Given evaluation data causing TPs, we expect the average confidence on the evaluation data to be high. However, for evaluation data causing FPs, we expect the average confidence to be low. Moreover, we evaluated the ability to separate TPs and FPs using the area under the receiver operating characteristic curve (AUROC) applied in <ref type="bibr" target="#b36">[37,</ref><ref type="bibr" target="#b4">5]</ref>. AUROC summarizes the trade-off between the fraction of TPs that are correctly detected and the fraction of FPs that remain undetected across different thresholds. In summary, in addition to the NLL, ECE, and BS, we evaluated the accuracy, average confidence, and AUROC.</p></div>
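A minimal binned ECE in the style of [13] can be sketched as follows (the bin count and toy confidences are illustrative, not the paper's evaluation code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: the bin-weighted average of |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# A perfectly calibrated toy case: confidence 0.8 with 8/10 predictions correct.
conf = np.full(10, 0.8)
corr = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
print(expected_calibration_error(conf, corr))  # 0.0
```

An overconfident model (mean confidence above accuracy in each bin) drives this quantity up, which is exactly the gap the tables in Section 5.4 report.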
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Evaluation data</head><p>Test data were used for evaluating the accuracy, NLL, ECE, and BS. We expect the accuracy to be high and the NLL, ECE, and BS to be low on test data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Subsets of the correctly classified test data</head><p>comprise 1000 correctly classified test samples from the experimental data. Since CNNs will make TPs on these data, we used these data for evaluating the average confidence on TPs.</p><p>Swapped data were simulated using subsets of the correctly classified test data structurally perturbed by dividing images into four regions and diagonally permuting the regions. From Figure <ref type="figure" target="#fig_5">3b</ref>, the upper-left and upper-right regions are permuted with the bottom-right and bottom-left regions, respectively. Swapped data include structurally perturbed objects within the given images. We expect CNNs to make FPs on swapped data. Therefore, we used these data for evaluating the average confidence on FPs caused by structurally perturbed objects.</p><p>Noisy data were simulated using subsets of the correctly classified test data perturbed by applying additive Gaussian noise with a standard deviation of 500. From Figure <ref type="figure" target="#fig_5">3c</ref>, noisy data include noise within the given images. We expect CNNs to make FPs on these data. Therefore, we used these data for evaluating the average confidence on FPs caused by noisy objects.</p><p>Out-of-domain data were simulated using 1000 test samples of CIFAR100 <ref type="bibr" target="#b33">[34]</ref>. Since CNNs will make FPs on these data, we used these data for evaluating the average confidence on FPs caused by unknown objects.</p><p>In general, we expect the average confidence to be high on TPs and to be low on FPs.</p></div>
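The swapped and noisy perturbations can be sketched as follows (single-channel images; the noise standard deviation follows the text, while the image values and seed are illustrative):

```python
import numpy as np

def swap_quadrants(img):
    """'Swapped data': split the image into four regions and permute them
    diagonally (upper-left <-> bottom-right, upper-right <-> bottom-left).
    For odd sizes, the trailing row/column is left unchanged."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    out = img.copy()
    out[:h, :w] = img[h:2 * h, w:2 * w]      # bottom-right -> upper-left
    out[h:2 * h, w:2 * w] = img[:h, :w]      # upper-left -> bottom-right
    out[:h, w:2 * w] = img[h:2 * h, :w]      # bottom-left -> upper-right
    out[h:2 * h, :w] = img[:h, w:2 * w]      # upper-right -> bottom-left
    return out

def add_noise(img, std=500.0, rng=np.random.default_rng(0)):
    """'Noisy data': additive Gaussian noise with a standard deviation of 500."""
    return img + rng.normal(scale=std, size=img.shape)

img = np.arange(16.0).reshape(4, 4)
swapped = swap_quadrants(img)
print(swapped[0, 0])  # 10.0: the old bottom-right corner value
```

Applying `swap_quadrants` twice recovers the original image, which is a convenient sanity check for the diagonal permutation.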
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Experimental results</head><p>We evaluate the conducted experiments with respect to accuracy and QoC. Table <ref type="table">2</ref> and Table <ref type="table" target="#tab_3">3</ref> summarize the accuracy, average confidence, NLL, ECE, and BS of different models using the two averaging approaches and CNNs trained using strong regularization (causing underconfidence) and weak regularization (causing overconfidence). The results show that averaging logits instead of probabilities does not strongly affect the accuracy. This means that averaging logits can preserve accuracy. Furthermore, averaging logits instead of probabilities significantly increases the average confidence. Figure <ref type="figure">2</ref> illustrates why the confidence increases. Further, Table <ref type="table">2</ref> shows that averaging logits instead of probabilities significantly decreases the NLL, ECE, and BS for underconfident CNNs (trained using strong regularization). This means that averaging logits, unlike averaging probabilities, reduces the calibration error for underconfident CNNs. This is because the stronger the regularization, the lower the confidence and the higher the gap between accuracy and average confidence. Here, the increase in the degree of confidence caused by averaging logits instead of probabilities reduces the gap between accuracy and average confidence. For example, Table <ref type="table">2</ref> shows that averaging logits instead of probabilities of the ensemble reduces the gap between accuracy and average confidence from 18.24% (= |88.75 − 70.51|) to 9.52% (= |88.94 − 79.42|) on CIFAR10.</p><p>However, the increase in the degree of confidence caused by averaging logits instead of probabilities increases the calibration error for overconfident CNNs (trained using weak regularization). 
Table <ref type="table" target="#tab_3">3</ref> provides empirical evidence for this claim by showing that, on CIFAR10 and FashionMNIST, the NLL, ECE, and BS of the ensembles increase when logits are averaged instead of probabilities. We argued that the more overconfident the CNNs, the higher</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Comparison of accuracy[%], average confidence[%] (in brackets), NLL[10 −2 ], ECE[10 −2 ], and BS[10 −2 ] of different models using two approaches for averaging underconfident CNNs trained using strong regularization: average probabilities (AP) and average logits (AL). The results were obtained using the test data described in Section 5.3. the confidence and the higher the gap between accuracy and average confidence. Here, the increase in the degree of confidence caused by averaging logits instead of probabilities further increases the gap between the accuracy and average confidence and therefore increases the calibration error. For example, Table <ref type="table" target="#tab_3">3</ref> shows that, on CIFAR10, averaging logits of the ensemble increases the gap between the accuracy and average confidence from 0.76% (= |88.67 − 89.43|) to 7.29% (= |88.88 − 96.17|).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In Table <ref type="table">4</ref>, the average confidence on TPs and FPs is shown for underconfident models using both averaging approaches. The results show that averaging logits instead of probabilities increases the confidence level on both TPs and FPs. The increase in the average confidence is sometimes very large for FPs, particularly on noisy data. For example, for MMCD evaluated on FashionMNIST, the average confidence on the noisy data increases from 51.31% to 94.58% when logits are averaged instead of probabilities. This is because noisy data can increase the magnitude of the logits, and averaging logits is more sensitive to changes in the magnitude of logits than averaging probabilities (see Figure <ref type="figure">2</ref>). The increase in the degree of confidence caused by averaging logits can harm the separability of TPs and FPs. For example, the increase in the average confidence on the noisy data from 51.31% to 94.58% causes the AUC-ROC obtained by evaluating the degree of confidence to decrease from 84.80% to 42.42%.</p></div>
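The drop in separability can be reproduced in spirit with the rank-based (Mann-Whitney) formulation of AUC-ROC; the confidence values below are hypothetical stand-ins, not the values from Table 4:

```python
import numpy as np

def auc_roc(conf_tp, conf_fp):
    """AUC-ROC via the Mann-Whitney formulation: the probability that a
    randomly chosen TP receives a higher confidence than a randomly
    chosen FP (ties count half)."""
    tp = np.asarray(conf_tp)[:, None]
    fp = np.asarray(conf_fp)[None, :]
    return (tp > fp).mean() + 0.5 * (tp == fp).mean()

# Hypothetical confidences: averaging probabilities leaves FPs on noisy
# data at low confidence, while averaging logits pushes them up into the
# range of the TPs (illustrative values only).
conf_tp = [0.95, 0.90, 0.92, 0.97]
conf_fp_ap = [0.51, 0.55, 0.48]  # FPs, average probabilities
conf_fp_al = [0.94, 0.96, 0.93]  # FPs, average logits

print(auc_roc(conf_tp, conf_fp_ap))  # 1.0: TPs and FPs fully separable
print(auc_roc(conf_tp, conf_fp_al))  # ~0.42: near (below) chance level
```

Once FPs carry confidences in the same range as TPs, no confidence threshold can tell them apart, which is exactly the failure mode reported above.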
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion</head><p>The term 'combination process' encompasses both how multiple networks are combined and the type of information combined. It was found in <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b23">24,</ref><ref type="bibr" target="#b21">22]</ref> that simple averaging is more robust and captures uncertainty better than voting approaches. This is because simple averaging weights all predictions equally, while voting ignores uncertain predictions. In this work, we compared averaging logits with averaging probabilities. We empirically showed that averaging logits instead of probabilities increases the confidence while preserving the accuracy for both underconfident and overconfident networks. This might be because logit averaging preserves the position of the maximum element of the individual logit vectors but is more sensitive to the magnitude of logit values than probability averaging. Thus, logit values with a large magnitude contribute the most to the average logit. In this way, the magnitude of the logit values induces a nonuniform weighting for logit averaging, which is lost for probability averaging. Furthermore, we provided empirical evidence showing that for underconfident networks (trained using strong regularization), the increase in the confidence caused by averaging logits instead of probabilities reduces the calibration error on the test data. This is because the increase in the degree of confidence narrows the gap between accuracy and average confidence. However, for overconfident networks (trained using weak regularization), the increase in confidence caused by averaging logits instead of probabilities increases the calibration error on the test data, because it further widens the gap between accuracy and average confidence. This finding suggests that for underconfident networks, we can average logits instead of probabilities to reduce the calibration error, whereas for overconfident networks, we should average probabilities instead of logits to avoid increasing the calibration error. Although the increase in confidence caused by averaging logits reduces the calibration error on the test data for underconfident networks, we empirically showed that it can harm the separability of TPs and FPs. This is because averaging logits increases the confidence on both TPs and FPs; therefore, FPs can also be made with high confidence, similar to TPs. These findings suggest that reducing the calibration error on the test data and improving the separability of TPs and FPs can be two contradicting goals: improving one may come at the detriment of the other. Furthermore, for two models 𝐴 and 𝐵, if 𝐴 is better calibrated than 𝐵, then 𝐴 does not necessarily separate TPs and FPs better than 𝐵. This implies that calibration methods may be insufficient for separating TPs and FPs and therefore for ensuring safe decision-making. Additionally, existing methods for confidence calibration may not help in separating TPs and FPs. Consequently, future work will evaluate the ability of existing confidence calibration methods to separate TPs and FPs. We also recommend that researchers evaluate both the calibration error of their proposed method for confidence calibration and the ability of their proposed method to separate TPs and FPs. Finally, for mission- and safety-critical applications where the separability of TPs and FPs is of paramount importance, we suggest averaging probabilities to avoid the negative impact of logit averaging on the ability to separate TPs and FPs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Comparison of average confidence[%] of different models using two approaches (average probabilities (AP) and average logits (AL)) for averaging underconfident networks trained using strong regularization and evaluated on TPs and FPs: TPs were obtained on subsets of the correctly classified test data, while FPs were obtained on the swapped, noisy, and out-of-domain (OOD) data described in Section 5.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>Averaging logits instead of averaging probabilities of stochastic or deterministic networks increases the degree of confidence on both TPs and FPs. This reduces the calibration error on the test data for underconfident networks but harms the separability of TPs and FPs. Our empirical results show that there is a trade-off between improving calibration on the test data and improving the separability of TPs and FPs. Additionally, the increase in the degree of confidence increases the calibration error on the test data for overconfident networks. Therefore, averaging logits should only be applied when combining underconfident networks. For example, we can average logits instead of probabilities of an ensemble of networks trained with mixup or other modern data augmentation techniques to improve calibration on the test data. Notwithstanding this, for mission- and safety-critical applications where the separability of TPs and FPs is essential, we suggest the traditional averaging of probabilities. However, it remains unclear whether the findings of this paper would change if the given networks or the average logit were calibrated, for example, with temperature scaling <ref type="bibr" target="#b12">[13]</ref>. This suggests a new research direction.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>(a)</head><label></label><figDesc>Logit averaging (b) Probability averaging</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example showing the difference between averaging logits and averaging probabilities in an ensemble.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Example showing how averaging logits instead of probabilities increases the confidence of an ensemble of four deterministic CNNs: 𝑧 = 1 4 ∑︀ 4 𝑚=1 𝑧 𝑚 , 𝑝 𝑧 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧), and 𝑝 = 1 4 ∑︀ 4 𝑚=1 𝑝 𝑚 with 𝑝 1 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 1 ), 𝑝 2 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 2 ), 𝑝 3 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 3 ), and 𝑝 4 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧 4 ). One can see that averaging logits (𝑝 𝑧 ) results in more confident predictions than averaging probabilities (𝑝). This is attributed to averaging logits being more sensitive to the magnitude of logit values than averaging probabilities. Here, 𝑧 𝑚 with large values contributes most to 𝑧. In our example, 𝑧 is mostly influenced by the values of 𝑧 1 ; that is, the contributions of 𝑧 2 , 𝑧 3 , and 𝑧 4 to 𝑧 are minor. However, 𝑝 is influenced by the values of all probability vectors 𝑝 𝑚 and is therefore less sensitive to the magnitude of individual logits.</figDesc></figure>
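The behaviour described in the Figure 2 caption can be reproduced with a small numerical sketch; the logit values below are hypothetical, not the ones from the figure:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logit vectors of an ensemble of four CNNs for one input
# (3 classes). Member 1 produces large-magnitude logits; the others are
# close to uniform. Illustrative values only.
logits = np.array([
    [8.0, 1.0, 0.5],
    [1.2, 1.0, 0.9],
    [1.1, 0.8, 1.0],
    [1.0, 0.9, 0.8],
])

p_avg_logits = softmax(logits.mean(axis=0))  # average logits, then softmax
p_avg_probs = softmax(logits).mean(axis=0)   # softmax each member, then average

# Both approaches keep the position of the max element (same prediction) ...
assert p_avg_logits.argmax() == p_avg_probs.argmax()
# ... but averaging logits is markedly more confident (about 0.78 vs 0.53
# here), because member 1's large-magnitude logits dominate the average.
print(p_avg_logits.max(), p_avg_probs.max())
```

The same mechanism explains both the reduced calibration error for underconfident ensembles and the inflated confidence on FPs driven by large-magnitude logits.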
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>(a)</head><label></label><figDesc>Test data (b) Swapped data (c) Noisy data</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Examples of evaluation data for experiments conducted on CIFAR10.</figDesc><graphic coords="6,89.29,94.15,119.50,118.71" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>We used five evaluation data sets for different purposes, namely test data, subsets of the correctly classified test data, out-of-domain data, swapped data, and noisy data.</figDesc><table><row><cell>Test data represent the test data from the experimental data, namely, MNIST, CIFAR10, and FashionMNIST. These datasets include both correctly classified and misclassified test data. Test data are used for estimating the accuracy, NLL, ECE, and</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Comparison of accuracy[%], average confidence[%] (in brackets), NLL[10 −2 ], ECE[10 −2 ], and BS[10 −2 ] of ensembles using two approaches for averaging overconfident CNNs trained using weak regularization: average probabilities (AP) and average logits (AL). The results were obtained using the test data described in Section 5.3.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell cols="2">Accuracy (Average confidence) ↑</cell><cell cols="2">NLL ↓</cell><cell cols="2">ECE ↓</cell><cell>BS ↓</cell></row><row><cell></cell><cell>AP</cell><cell></cell><cell>AL</cell><cell></cell><cell>AP</cell><cell>AL</cell><cell>AP</cell><cell>AL</cell><cell>AP</cell><cell>AL</cell></row><row><cell cols="3">CIFAR10 (DenseNets)</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>Ensemble</cell><cell cols="2">89.52 (84.31)</cell><cell cols="2">89.60 (87.97)</cell><cell>34.66</cell><cell>32.81</cell><cell>5.23</cell><cell>2.47</cell><cell>16.13</cell><cell>15.38</cell></row><row><cell>MCD</cell><cell cols="2">85.36 (73.35)</cell><cell cols="2">85.37 (80.04)</cell><cell>52.13</cell><cell>46.55</cell><cell>12.04</cell><cell>5.40</cell><cell>23.32</cell><cell>21.57</cell></row><row><cell>MMCD</cell><cell cols="2">88.75 (70.51)</cell><cell cols="2">88.94 (79.42)</cell><cell>50.83</cell><cell>40.99</cell><cell>18.24</cell><cell>9.55</cell><cell>21.82</cell><cell>18.34</cell></row><row><cell cols="3">FashionMNIST (ResNets)</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>Ensemble</cell><cell cols="2">92.70 (87.86)</cell><cell cols="2">92.58 (90.16)</cell><cell>22.57</cell><cell>20.99</cell><cell>5.15</cell><cell>2.86</cell><cell>11.37</cell><cell>10.87</cell></row><row><cell>MCD</cell><cell cols="2">90.56 (79.22)</cell><cell cols="2">90.56 (83.95)</cell><cell>35.45</cell><cell>30.18</cell><cell>11.47</cell><cell>6.85</cell><cell>15.82</cell><cell>14.57</cell></row><row><cell>MMCD</cell><cell cols="2">92.65 (76.37)</cell><cell cols="2">92.73 (83.78)</cell><cell>35.57</cell><cell>26.96</cell><cell>16.31</cell><cell>9.10</cell><cell>14.87</cell><cell>12.47</cell></row><row><cell cols="2">MNIST (VGGNets)</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>Ensemble</cell><cell cols="2">99.04 (98.24)</cell><cell cols="2">99.04 (98.89)</cell><cell>3.25</cell><cell>2.90</cell><cell>1.03</cell><cell>0.52</cell><cell>1.52</cell><cell>1.41</cell></row><row><cell>MCD</cell><cell cols="2">98.16 (94.53)</cell><cell cols="2">98.16 (96.48)</cell><cell>8.73</cell><cell>6.87</cell><cell>3.81</cell><cell>1.98</cell><cell>2.99</cell><cell>2.79</cell></row><row><cell>MMCD</cell><cell cols="2">99.03 (94.67)</cell><cell cols="2">99.04 (97.46)</cell><cell>6.91</cell><cell>4.13</cell><cell>4.49</cell><cell>1.75</cell><cell>1.89</cell><cell>1.52</cell></row><row><cell></cell><cell></cell><cell cols="4">Accuracy (Average confidence) ↑</cell><cell cols="2">NLL ↓</cell><cell cols="2">ECE ↓</cell><cell>BS ↓</cell></row><row><cell></cell><cell></cell><cell>AP</cell><cell></cell><cell>AL</cell><cell></cell><cell>AP</cell><cell>AL</cell><cell>AP</cell><cell>AL</cell><cell>AP</cell><cell>AL</cell></row><row><cell cols="2">CIFAR10 (DenseNets)</cell><cell cols="2">88.67 (89.43)</cell><cell cols="2">88.88 (96.17)</cell><cell>40.69</cell><cell>54.23</cell><cell>3.03</cell><cell>7.40</cell><cell>16.69</cell><cell>18.07</cell></row><row><cell cols="2">FashionMNIST (ResNets)</cell><cell cols="2">94.49 (95.86)</cell><cell cols="2">94.58 (98.43)</cell><cell>20.20</cell><cell>28.00</cell><cell>1.98</cell><cell>4.11</cell><cell>8.36</cell><cell>9.32</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Very deep convolutional networks for large-scale image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/1409.1556" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Gawlikowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">R N</forename><surname>Tassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Humt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kruspe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Triebel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Roscher</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.03342</idno>
		<title level="m">A survey of uncertainty in deep neural networks</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Simple and scalable predictive uncertainty estimation using deep ensembles</title>
		<author>
			<persName><forename type="first">B</forename><surname>Lakshminarayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pritzel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Blundell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">On mixup training: Improved calibration and predictive uncertainty for deep neural networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Thulasidasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chennupati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Bilmes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bhattacharya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Michalak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Improving calibration through the relationship with adversarial robustness</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Beutel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chi</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=NJex-5TZIQa" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Beygelzimer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Dauphin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Vaughan</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Jerfel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Muller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Dusenberry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Snoek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lakshminarayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tran</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.09875</idno>
		<title level="m">Combining ensembles and data augmentation can harm your calibration</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Uncertainty quantification and deep ensembles</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rahaman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">H</forename><surname>Thiery</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">mixup: Beyond empirical risk minimization</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cisse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">N</forename><surname>Dauphin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lopez-Paz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Rethinking the inception architecture for computer vision</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wojna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2818" to="2826" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">When Does Label Smoothing Help?</title>
		<author>
			<persName><forename type="first">R</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kornblith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>Curran Associates Inc</publisher>
			<pubPlace>Red Hook, NY, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gales</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2101.05397</idno>
		<title level="m">Should ensemble members be calibrated?</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">On calibration of modern neural networks</title>
		<author>
			<persName><forename type="first">C</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pleiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1321" to="1330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V</forename><surname>Dalca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Sabuncu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1906.09551</idno>
		<title level="m">Confidence calibration for convolutional neural networks using structured dropout</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Evidential deep learning to quantify classification uncertainty</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sensoy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kandemir</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The relative performance of ensemble methods with deep convolutional neural networks for image classification</title>
		<author>
			<persName><forename type="first">C</forename><surname>Ju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bibaut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Der Laan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Applied Statistics</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="2800" to="2818" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Combining pattern classifiers: methods and algorithms</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">I</forename><surname>Kuncheva</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014">2014</date>
			<publisher>John Wiley &amp; Sons</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">An overview and comparison of voting methods for pattern recognition</title>
		<author>
			<persName><forename type="first">M</forename><surname>Van Erp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Vuurpijl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schomaker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings Eighth International Workshop on Frontiers in Handwriting Recognition</title>
				<meeting>Eighth International Workshop on Frontiers in Handwriting Recognition</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="195" to="200" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">New voting functions for neural network algorithms</title>
		<author>
			<persName><forename type="first">T</forename><surname>Tajti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Annales Mathematicae et Informaticae</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page" from="229" to="242" />
			<date type="published" when="2020">2020</date>
			<publisher>Eszterházy Károly Egyetem Líceum Kiadó</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Machine-learning research</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">G</forename><surname>Dietterich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AI Magazine</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="97" to="97" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Review of classifier combination methods, Machine learning in document analysis and recognition</title>
		<author>
			<persName><forename type="first">S</forename><surname>Tulyakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jaeger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Govindaraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Doermann</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="361" to="386" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Bayesian convolutional neural network: Robustly quantify uncertainty for misclassifications detection</title>
		<author>
			<persName><forename type="first">N</forename><surname>Tassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Rovile</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Mediterranean Conference on Pattern Recognition and Artificial Intelligence</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="118" to="132" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Combining forecasts: A review and annotated bibliography</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T</forename><surname>Clemen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Forecasting</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="559" to="583" />
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">On combining classifiers</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kittler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hatef</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">P</forename><surname>Duin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Matas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="226" to="239" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Is it better to average probabilities or quantiles?</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">C</forename><surname>Lichtendahl</surname><genName>Jr</genName></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Grushka-Cockayne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L</forename><surname>Winkler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Management Science</title>
		<imprint>
			<biblScope unit="volume">59</biblScope>
			<biblScope unit="page" from="1594" to="1611" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Dropout as a Bayesian approximation: Representing model uncertainty in deep learning</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Gal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ghahramani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1050" to="1059" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">The power of ensembles for active learning in image classification</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">H</forename><surname>Beluch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Genewein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nürnberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Köhler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="9368" to="9377" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Evaluating scalable Bayesian deep learning methods for robust computer vision</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">K</forename><surname>Gustafsson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Danelljan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Schön</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="318" to="319" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Kahn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Villaflor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Pong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abbeel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1702.01182</idno>
		<title level="m">Uncertainty-aware reinforcement learning for collision avoidance</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Safe reinforcement learning with model uncertainty estimates</title>
		<author>
			<persName><forename type="first">B</forename><surname>Lütjens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Everett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>How</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2019 International Conference on Robotics and Automation (ICRA)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="8662" to="8668" />
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Bayesian deep learning and a probabilistic perspective of generalization</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Izmailov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="4697" to="4708" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Gradient-based learning applied to document recognition</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Haffner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE</title>
				<meeting>the IEEE</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="volume">86</biblScope>
			<biblScope unit="page" from="2278" to="2324" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Rasul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vollgraf</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1708.07747</idno>
		<title level="m">Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<title level="m">Learning multiple layers of features from tiny images</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Densely connected convolutional networks</title>
		<author>
			<persName><forename type="first">G</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>van der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="4700" to="4708" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Batch normalization: Accelerating deep network training by reducing internal covariate shift</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="448" to="456" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">A baseline for detecting misclassified and out-of-distribution examples in neural networks</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gimpel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of International Conference on Learning Representations</title>
				<meeting>International Conference on Learning Representations</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
