<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">On the Validity of Bayesian Neural Networks for Uncertainty Estimation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">John</forename><surname>Mitros</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Computer Science</orgName>
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<settlement>Dublin</settlement>
									<region>IR</region>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Brian</forename><forename type="middle">Mac</forename><surname>Namee</surname></persName>
							<email>brian.macnamee@insight-centre.org</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computer Science</orgName>
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<settlement>Dublin</settlement>
									<region>IR</region>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">On the Validity of Bayesian Neural Networks for Uncertainty Estimation</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">FB55EE5CF0D4288A5F5C047203C5FF67</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T23:13+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Bayesian Neural Networks</term>
					<term>Uncertainty Quantification</term>
					<term>OoD</term>
					<term>Robustness</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Deep neural networks (DNNs) are versatile parametric models utilised successfully in a diverse range of tasks and domains. However, they have limitations, particularly a lack of robustness and an over-sensitivity to out of distribution samples. Bayesian neural networks, due to their formulation under the Bayesian framework, provide a principled approach to building neural networks that addresses these limitations. This work provides an empirical study evaluating and comparing Bayesian neural networks to their equivalent point estimate deep neural networks, quantifying the predictive uncertainty induced by their parameters as well as their performance in view of this uncertainty. Specifically, we evaluated and compared three point estimate deep neural networks against their comparable Bayesian alternatives utilising well-known benchmark image classification datasets.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>With the advancement of technology and the abundance of data, our society has been transformed beyond recognition. From smart home assistance technologies, to self-driving cars, to smart mobile phones, a multitude of connected devices now assist us in our daily routines.</p><p>One thing that is common among these devices is the exponential explosion of data generated as a consequence of our activities. Predictive models rely on this data to capture patterns in our daily routines, from which they can offer us assistance tailored to our individual needs. Many of these predictive models are based on deep neural networks (DNNs).</p><p>The machine learning community, however, is becoming increasingly aware of issues associated with DNNs, ranging from fairness and bias to robustness and uncertainty estimation. Motivated by this, we set out to investigate the reliability and trustworthiness of the prediction confidence estimates produced by DNNs. We first assess the capability of current DNN models to provide confident (i.e. low calibration error) and reliable (i.e. low noise sensitivity error, or the ability to predict out of sample instances with high uncertainty) predictions. Second, we compare this to the capability of equivalent recent Bayesian formulations (i.e. Bayesian neural networks (BNNs)), in terms of accuracy, calibration error, and the ability to recognise and indicate out of sample instances.</p><p>There exist two types of uncertainty related to predictive models: aleatoric and epistemic uncertainty <ref type="bibr" target="#b3">[4]</ref>. Aleatoric uncertainty is usually attributed to stochasticity inherent in the related task or experiment to be performed. It can therefore be considered an irreducible error. Epistemic uncertainty is usually attributed to uncertainty induced by the model parameters. 
It can therefore be considered a reducible error, as it can be reduced by obtaining more data. The question under investigation in this work is whether BNNs can provide better calibrated and more reliable estimates for out of sample instances compared to point estimate DNNs, and it therefore relates to epistemic uncertainty.</p><p>The remaining sections of this paper are organised as follows. Section 2 outlines related work in the area of confidence estimation and Bayesian neural networks. In Section 3, we provide information on the datasets used throughout the experiments, including their respective sizes and types. In Section 4, we describe the metrics used to evaluate whether a classifier is calibrated (i.e. expected calibration error and reliability figures), as well as its ability to identify and predict out of sample instances (i.e. symmetric KL divergence and distributional entropy figures), along with their respective explanations. In Section 5, we introduce the three BNN approaches utilised in the experiments, providing detailed explanations of how they work. Finally, in Sections 6 and 7 we present the results related to confidence calibration (i.e. Table <ref type="table" target="#tab_0">1</ref> and Figure <ref type="figure" target="#fig_1">1</ref>) and reliability prediction estimates for out of sample instances (i.e. Table <ref type="table" target="#tab_2">3 and Figure 2</ref>), along with the concluding remarks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Earlier findings <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b0">1]</ref> have demonstrated the incapacity of point estimate deep neural networks (DNNs) to provide confident and calibrated <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b8">9]</ref> uncertainty estimates in their predictions. Motivated by recent work in this area we strive to demonstrate, first, that this is indeed a serious problem currently under investigation in the machine learning community, and second, to provide a viable alternative (i.e. BNNs) which combines the best of both worlds: a principled and elegant formulation due to the Bayesian inference framework, and powerful and expressive models thanks to DNNs.</p><p>To help the reader understand the terminology and semantics of uncertainty quantification in predictive models, it is helpful to view the variance present in the model predictions as the sum of both aleatoric and epistemic uncertainty. Additionally, whenever the term "point estimate DNN" appears in this document, it simply refers to a DNN model coupled with a softmax function in the final layer. For instance, suppose our DNN is defined by ŷ = f(x), where ŷ denotes the vector of predictions over K possible classes. A "point estimate DNN" then exponentiates each prediction and normalises it by the sum of all exponentiated predictions:</p><formula xml:id="formula_0">p(y = i|x) = e^{ŷ_i} / Σ_{j=1}^{K} e^{ŷ_j} for i = 1, . . . , K and ŷ = (ŷ_1, . . . , ŷ_K) ∈ R^K<label>(1)</label></formula><p>We identify such models as point estimate DNNs because they are often misinterpreted as probabilistic models, owing to the fact that they provide predictions which resemble probabilities (i.e. estimates in [0, 1]). Furthermore, Eq. 1 is often mistaken for a categorical distribution, a view we disagree with, since a Dirichlet prior would be required for it to be treated as one. Our view is that Eq. 1 is more a mathematical convenience that allows a DNN model to emit predictions than a well defined probability distribution.</p><p>As previously stated, there are two main problems investigated in this work. The first is the inability of DNNs to produce probability estimates representative of the true correct likelihood (i.e. confidence calibration). For instance, in a binary classification task a classifier is considered uncalibrated when its predictions do not match the empirical proportion of the positive class among the instances it is asked to predict. Poor confidence calibration in DNNs can be affected by different choices made while constructing the DNN architecture <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b8">9]</ref> (e.g. depth, width, regularisation or batch-normalisation). The second problem is the incapacity of DNNs to identify and reliably predict out of sample instances (i.e. noise sensitivity), which can be a consequence of noise in the data, noise in the model parameters, or noise constructed by an adversary in order to manipulate the models' predictions <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b2">3]</ref>.</p></div>
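For readers who prefer code, the normalisation in Eq. 1 can be sketched in a few lines of Python; the logits below are hypothetical values, not the output of any model used in this study.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D vector of scores (cf. Eq. 1)."""
    shifted = logits - np.max(logits)  # subtract the max to avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical raw outputs of a DNN for K = 3 classes
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
```

The outputs lie in [0, 1] and sum to one, which is exactly why they are often mistaken for a calibrated probability distribution.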
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Data</head><p>The data used in this empirical study comprise two well-established datasets in the machine learning literature, CIFAR-10 [8] and SVHN <ref type="bibr" target="#b13">[14]</ref>. Both datasets consist of colour images of dimensionality 32x32 and include 10 distinct categories. In addition, both are considered to represent real world datasets, with CIFAR-10 being collected over the Internet while SVHN is a result of the Google Street View project, representing house numbers. Further details regarding the number of instances in each dataset and their categories are given below. -The CIFAR-10 [8] dataset consists of 60,000 colour images of dimensionality 32x32 with 10 classes. Each class contains 6,000 images. In total there are 50,000 training images and 10,000 test images. -The SVHN <ref type="bibr" target="#b13">[14]</ref> dataset consists of 99,289 colour images of dimensionality 32x32 representing digits of house numbers. There are 10 categories, one for each digit; in total there are 73,257 colour images for training and 26,032 for testing.</p><p>The evaluation metrics chosen for this empirical study were:</p><formula xml:id="formula_1">-Accuracy -Expected calibration error -Entropy -Symmetric D_KL divergence</formula><p>Particularly, we consider a given neural network model ŷ = f(x; θ) of depth L defined as</p><formula xml:id="formula_2">f(x; θ) = W_L σ_{L−1}(W_{L−1} . . . σ_1(W_1 x)),</formula><p>with parameters θ = {W_1, . . . , W_L} and σ(·) a nonlinear function. The accuracy on a given output ŷ_n is measured by the indicator function acc = (1/N) Σ_{n=1}^{N} 1(y_n = ŷ_n), averaged over the total number of instances N in the dataset. This metric is predominantly used in the machine learning community to evaluate the generalisation ability of a predictive model on a hold-out test set.</p><p>In order to capture whether a model is calibrated we utilised the expected calibration error</p><formula xml:id="formula_3">ECE = Σ_{m=1}^{M} (|B_m| / N) |acc(B_m) − conf(B_m)|</formula><p>in combination with the equivalent reliability plots shown in Figure <ref type="figure" target="#fig_1">1</ref>, similar to <ref type="bibr" target="#b4">[5]</ref>. ECE is expressed as a weighted average of the gap between the accuracy and the confidence of a model across M bins for N samples. This metric captures any disagreement between the classifier's predictions and the true empirical proportion of instances in each class category for every mini-batch of instances presented to the classifier.</p><p>Furthermore, to assess a model's ability to characterise out of sample data with a high degree of uncertainty we followed the work of <ref type="bibr" target="#b10">[11]</ref>, utilising the information entropy H(Y) = −Σ_{k=1}^{K} p(ŷ_k) log p(ŷ_k) of the final predictions of a model in order to derive the uncertainty plots depicted in Figure <ref type="figure" target="#fig_2">2</ref>. Essentially, for every input x we have a corresponding vector of predictions ŷ = (0.86, 0.23, . . .) of length K, where each entry denotes the prediction of the classifier for class k ∈ {1, . . . , K}. We split each dataset randomly into two halves, the first half containing K/2 of the classes and the other half the remaining classes. One half is utilised to train the classifier (denoting in-sample instances) and the remaining half (denoting out-of-sample instances) is utilised only during the testing phase of the classifier. Therefore, after the classifier has been trained on one half (hence the 5+5 categories in Figure 2) we evaluate its generalisation ability on the remaining half, where for every input we obtain a corresponding entropy over the K classes. 
This provides a distribution over the total number of N inputs, allowing us to distinguish and evaluate the classifier's entropy on in-sample vs out-of-sample instances.</p><p>Finally, in order to conveniently compare and summarise a model's performance at detecting out of sample instances with a single scalar statistic, the symmetric KL-divergence D_KL(p‖q) + D_KL(q‖p) <ref type="bibr" target="#b10">[11]</ref> between two distributions p and q was selected as a sensible candidate. The KL-divergence allows us to evaluate how similar two distributions p and q are: the larger the KL-divergence, the more distinct the distributions. Since we want to evaluate the ability of the classifier to recognise out of sample instances, we measure the KL-divergence between the classifier's entropy distribution for in-sample instances p and that for out of sample instances q. Larger values indicate that the classifier is better able to recognise out of sample instances.</p></div>
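The three metrics above can be sketched as follows. This is a minimal illustration with hypothetical function names (`expected_calibration_error`, `predictive_entropy`, `symmetric_kl`), not the exact implementation used in the experiments.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE = sum_m (|B_m| / N) * |acc(B_m) - conf(B_m)| over equal-width bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()    # empirical accuracy in bin B_m
            conf = confidences[mask].mean()  # average confidence in bin B_m
            ece += mask.sum() / n * abs(acc - conf)
    return ece

def predictive_entropy(probs):
    """H(Y) = -sum_k p(y_k) log p(y_k) over the final prediction vector."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def symmetric_kl(p, q, eps=1e-12):
    """D_KL(p||q) + D_KL(q||p) between two discrete distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    kl = lambda a, b: (a * np.log(a / b)).sum()
    return kl(p, q) + kl(q, p)
```

A classifier whose average confidence in every bin matches its bin accuracy scores an ECE of zero, and a maximally uncertain prediction (uniform over K classes) attains the maximum entropy log K.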
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Methods</head><p>This section provides the details of the empirical evaluation, which comprises the following three components: (i) models, (ii) calibration and (iii) uncertainty. For (i), six models were selected, of which three represent point estimate deep neural networks (DNNs) and the remaining three their equivalent Bayesian neural networks (BNNs).</p><p>-Point estimate deep neural networks</p><formula xml:id="formula_4">• VGG16 [19] • PreResNet164 [6] • WideResnet28x10 [22] -Bayesian neural networks • VGG16 -Monte Carlo Dropout [4]</formula><p>• VGG16 -Stochastic Weight Averaging of Gaussian samples <ref type="bibr" target="#b10">[11]</ref> • Stochastic Variational -Deep Kernel Learning <ref type="bibr" target="#b20">[21]</ref> (ii) The calibration of each model was evaluated using the expected calibration error introduced in Section 4 in combination with the reliability plots shown in Figure <ref type="figure" target="#fig_1">1</ref>. Each model was trained on 5 categories from CIFAR-10 and, correspondingly, SVHN, with the remaining 5 categories being withheld in order to evaluate the models' ability to associate out of sample instances with high uncertainty, as these were not introduced to the model at any step. Each model was trained for 300 epochs, with the best performing model on the validation set being selected as the final model for each architecture.</p><p>(iii) As already stated, in order to evaluate a model's ability to detect out of sample instances with high uncertainty we utilised the entropy of the predictions of a model to derive Figure 2, for each dataset and model combination accordingly. 
In addition, the symmetric KL-divergence described in Section 4 was introduced in order to provide a comparable scalar summary statistic of the overall content of Figure 2.</p><p>In the remainder of this section we introduce the three Bayesian neural network approaches utilised in the experimental study:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Dropout as Bayesian approximation:</head><p>Provides a view of dropout at test time as approximate Bayesian inference <ref type="bibr" target="#b3">[4]</ref>.</p><p>It is based on prior work of <ref type="bibr" target="#b1">[2]</ref> which established a relationship between neural networks with dropout and Gaussian processes (GPs). Given a dataset (X, Y) the posterior over the GP is formulated as</p><formula xml:id="formula_5">F|X ∼ N(0, K(X, X)) Y|F ∼ N(F, 0 · I) ŷ|Y ∼ Categorical(·)</formula><p>where ŷ denotes a class label. An integral part of the GP is the choice of the covariance matrix K, representing the similarity between two inputs as a scalar value. The key insight for drawing connections between neural networks and Gaussian processes is to consider choosing the kernel to represent a non-linear function; for instance, for the rectified linear (ReLU) function the kernel would be expressed as ∫ p(w) σ(wᵀx) σ(wᵀx′) dw with p(w) ∼ N(µ, Σ). Because this integral is usually intractable, a conventional approach is to use Monte Carlo integration to approximate it, k̂ = (1/T) Σ_{t=1}^{T} σ(w_tᵀx) σ(w_tᵀx′), hence the name Monte Carlo Dropout. Let us now consider a one hidden layer neural network with dropout, ŷ = (β_2 W_2) σ(x(β_1 W_1)), where</p><formula xml:id="formula_6">β_1, β_2 ∼ Bernoulli(p_{1,2}). Utilising the approximate kernel k̂ one can express the parameters W_{1,2} as W_{1,2} = β_{1,2}(A_{1,2} + σ_{1,2} ε_{1,2}) + (1 − β_{1,2}) σ_{1,2} ε_{1,2} with A_{1,2}, ε_{1,2} ∼ N(0, I) and β_{1,2} ∼ Bernoulli(p_{1,2})</formula><p>closely resembling the NN formulation. Therefore, to establish the final connection between NNs trained with stochastic gradient descent (SGD) and dropout on the one hand and GPs on the other, one simulates Monte Carlo sampling by drawing samples from the trained model at test time, [ŷ_n = f(x_n; β_n θ_n)]_{n=1}^{N} with β_n ∼ Bernoulli(p_n). 
The samples ŷ_n resulting from the different dropout masks β_n are averaged over the N different models in order to approximate and retrieve the posterior distribution.</p></div>
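The test-time procedure above can be sketched as follows. This is a minimal illustration assuming a toy one-hidden-layer network with random, untrained weights rather than the paper's VGG16; it shows only the mechanism of keeping the Bernoulli dropout masks active at test time and averaging T stochastic forward passes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical one-hidden-layer network weights (untrained, illustration only)
W1, W2 = rng.normal(size=(10, 64)), rng.normal(size=(64, 5))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_predict(x, T=100, p=0.5):
    """Average T stochastic forward passes, keeping dropout active at test time."""
    samples = []
    for _ in range(T):
        mask = rng.binomial(1, 1 - p, size=64) / (1 - p)  # inverted-dropout scaling
        h = np.maximum(x @ W1, 0.0) * mask                # ReLU hidden layer, then dropout
        samples.append(softmax(h @ W2))
    samples = np.stack(samples)
    # Predictive mean and variance across the sampled dropout masks
    return samples.mean(axis=0), samples.var(axis=0)

mean, var = mc_dropout_predict(rng.normal(size=(3, 10)))
```

The per-class variance across the T masks gives a rough handle on the epistemic uncertainty of the prediction.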
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Stochastic weight averaging of Gaussian samples</head><p>Stochastic weight averaging of Gaussian samples (SWAG) <ref type="bibr" target="#b10">[11]</ref> is an extension of stochastic weight averaging (SWA) <ref type="bibr" target="#b6">[7]</ref>, in which the weights of a NN are averaged over different SGD iterates, which in itself can be viewed as approximate Bayesian inference <ref type="bibr" target="#b11">[12]</ref>, with ideas traced back to <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b14">15]</ref>. In order to understand SWAG we first need to explain SWA. At a high level SWA can be viewed as averaged SGD <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b14">15]</ref>. The main difference between SWA and averaged SGD is that SWA utilises a simple moving average instead of an exponential one, in conjunction with a high constant learning rate instead of a decaying one. In essence, SWA maintains a running average θ̄ over the weights of a NN during the last 25% of the training process, which is then used to update the first and second moments of batch-normalisation. This leads to better generalisation, since the SGD iterates are smoothed out by the averaging process, leading to wider optima in the optimisation landscape of the NN. Now that we have established what SWA is, let us introduce SWAG. SWAG is an approximate Bayesian inference technique for estimating a covariance from the weight parameters of a NN. SWAG additionally maintains a running average of the second moment, θ²‾ = (1/T) Σ_{t=1}^{T} θ_t², in order to compute the covariance Σ = diag(θ²‾ − θ̄²), which yields the approximate Gaussian posterior N(θ̄, Σ). At test time the weights of the NN are drawn from this posterior, θ_n ∼ N(θ̄, Σ), in order to perform Bayesian model averaging, retrieving the final posterior of the model as well as uncertainty estimates from the first and second moments.</p></div>
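The moment bookkeeping and posterior sampling of SWAG (in its diagonal form) can be sketched as follows. The SGD iterates here are synthetic random draws standing in for weight snapshots collected during training, so only the mechanics of the recipe are illustrated, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for SGD iterates of a flattened 100-dimensional weight vector;
# in practice these would be snapshots collected late in training.
iterates = [rng.normal(loc=0.5, scale=0.1, size=100) for _ in range(30)]

theta_bar = np.mean(iterates, axis=0)                  # first moment (the SWA solution)
theta_sq = np.mean([t ** 2 for t in iterates], axis=0) # running second moment
# Sigma = diag(second moment - squared mean); clamp tiny negatives from round-off
sigma_diag = np.maximum(theta_sq - theta_bar ** 2, 0.0)

def sample_posterior():
    """Draw a weight vector from the approximate Gaussian posterior N(theta_bar, Sigma)."""
    return theta_bar + rng.normal(size=theta_bar.shape) * np.sqrt(sigma_diag)

# Bayesian model averaging would evaluate the network once per posterior sample
samples = [sample_posterior() for _ in range(10)]
```

Each sampled weight vector would be loaded into the network for one forward pass, and the resulting predictions averaged.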
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Deep kernel learning</head><p>The deep kernel learning method <ref type="bibr" target="#b20">[21]</ref> combines NN architectures and GPs, trained jointly, in order to derive kernels with GP properties, overcoming the need to perform approximate Bayesian inference. The first part is composed of any NN (i.e. task dependent) whose output is utilised in the second part to approximate the covariance of the GP in the additive layer. As explained earlier for the MC-Dropout approach, a kernel between inputs x and x′ can be expressed via a non-linear mapping function thanks to the kernel trick, k(x, x′) → k(f(x, w), f(x′, w)|w); combining NNs with GPs therefore seems like a natural evolution, permitting scalable and flexible kernels, represented as neural networks, to be utilised directly in Gaussian processes. Finally, given that the formulation of GPs allows us to represent a distribution over a function space, it is possible to derive uncertainty estimates from the moments of this distribution in order to inform the models about the uncertainty in their parameters, which has an impact on the final posterior distribution.</p></div>
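The kernel trick k(x, x′) → k(f(x, w), f(x′, w)) can be sketched as follows: an RBF kernel evaluated on features produced by a stand-in network. The feature map and its weights are hypothetical, and the joint training of the NN and GP described in the method is omitted; only the construction of a valid covariance from learned features is shown.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(10, 4))  # hypothetical weights of the NN feature extractor

def features(x):
    """Stand-in for the NN part f(x, w); any task-dependent architecture could go here."""
    return np.tanh(x @ W)

def deep_rbf_kernel(X1, X2, length_scale=1.0):
    """RBF kernel on learned features: k(x, x') = exp(-||f(x) - f(x')||^2 / (2 l^2))."""
    Z1, Z2 = features(X1), features(X2)
    d2 = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return np.exp(-0.5 * d2 / length_scale ** 2)

X = rng.normal(size=(5, 10))
K = deep_rbf_kernel(X, X)  # symmetric positive semi-definite GP covariance
```

Because the RBF kernel is positive semi-definite for any feature map, the result remains a valid GP covariance while the representation itself is learned.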
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Results</head><p>In this section we describe the results from our experiments and the findings that arise from them for our two initial questions. Let us recall these here for clarity:</p><p>-Do point estimate deep neural networks suffer from pathologies of poor calibration and an inability to identify out of sample instances? -Are Bayesian neural networks better calibrated and more resilient to out of sample instances?</p><p>To answer the first question we draw the attention of the reader to Figure <ref type="figure" target="#fig_1">1</ref> and equivalently Table <ref type="table" target="#tab_0">1</ref>. Figure <ref type="figure" target="#fig_1">1</ref> shows the reliability plots for all models and datasets. In these plots a perfectly calibrated model is indicated by the diagonal line. Anything below the diagonal represents an over-confident model, while anything above the diagonal represents an under-confident model. The expected calibration errors (ECE) in Table <ref type="table" target="#tab_0">1</ref> (which measure the degree of miscalibration present) are in accordance with the results from <ref type="bibr" target="#b4">[5]</ref>. All of the models are somewhat miscalibrated. Some of the Bayesian approaches, however, in particular the models based on MC-Dropout and SWAG, are better calibrated than their point estimate DNN counterparts.</p><p>Notice that all models exhibit high accuracy on the final test set (shown in Table <ref type="table" target="#tab_1">2</ref>). This illustrates that a model can be very accurate but miscalibrated, or equivalently a model can be very well calibrated but inaccurate. There is no real correlation between the calibration and the accuracy of a model. 
It is also known <ref type="bibr" target="#b4">[5]</ref> that as the complexity of the model increases the calibration error increases as well.</p><p>In order to evaluate and demonstrate the ability of the models to handle out of sample instances we divided each of the CIFAR-10 and SVHN datasets into two halves containing 5 categories each. These partitions represent in and out of distribution samples. In the discussion that follows this is indicated with the parenthesis (5 + 5) next to the dataset name, denoting that the model was trained on only 5 categories representing in distribution samples and at test time was evaluated on the other 5 categories to simulate out of sample instances. The results are illustrated in Figure 2 for CIFAR-10 and SVHN respectively, and summarised in Table <ref type="table" target="#tab_2">3</ref>, which measures the symmetric KL divergence between the distributions of class confidence entropies of each model for the in and out of sample instances. Together these results suggest that the Bayesian methods are better at identifying out of sample instances. Although the result is not clear cut, as in some cases the point estimate networks achieve higher divergence scores than the Bayesian ones, overall the results point in the Bayesian direction. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>In conclusion, we have shown that point estimate deep neural networks indeed suffer from poor calibration and an inability to identify out of sample instances with high uncertainty. Bayesian deep neural networks provide a principled and viable alternative that allows models to be informed about the uncertainty in their parameters while exhibiting a lower degree of sensitivity to noisy samples compared to their point estimate counterparts. This suggests that this is a promising research direction for improving the performance of deep neural networks.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Reliability plots across all models on CIFAR-10 [8] and SVHN [14] datasets.</figDesc><graphic coords="8,134.88,115.84,345.59,198.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 2 :</head><label>2</label><figDesc>Fig. 2: Out of sample distributional entropy plots for all models on CIFAR-10 (5 + 5) categories.</figDesc><graphic coords="10,134.77,115.84,360.00,432.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Expected calibration errors (ECE) for CIFAR-10 and SVHN equivalently. Lower values indicate better calibrated models.</figDesc><table><row><cell>Models</cell><cell>CIFAR10</cell><cell>SVHN</cell></row><row><cell>VGG16-SGD</cell><cell cols="2">0.0677236 0.0307552</cell></row><row><cell>VGG16-MC Dropout</cell><cell cols="2">0.0423307 0.0155248</cell></row><row><cell>VGG16-SWAG</cell><cell cols="2">0.0499478 0.0204564</cell></row><row><cell>PreResNet164</cell><cell cols="2">0.0309783 0.0238022</cell></row><row><cell>PreResNet164-MC Dropout</cell><cell cols="2">0.0338277 0.0161251</cell></row><row><cell>PreResNet164-SWAG</cell><cell cols="2">0.049338 0.0089618</cell></row><row><cell>WideResNet28x10</cell><cell cols="2">0.0200257 0.0210420</cell></row><row><cell cols="3">WideResNet28x10-MC Dropout 0.0255283 0.0147181</cell></row><row><cell>WideResNet28x10-Swag</cell><cell cols="2">0.0097967 0.0082371</cell></row><row><cell>DeepGaussProcess</cell><cell cols="2">0.1418236 0.3275412</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Accuracy of all the models on both datasets CIFAR-10 and SVHN.</figDesc><table><row><cell>Models</cell><cell>CIFAR10 SVHN</cell></row><row><cell>VGG16-SGD</cell><cell>94.40 97.10</cell></row><row><cell>VGG16-MC Dropout</cell><cell>93.26 96.87</cell></row><row><cell>VGG16-SWAG</cell><cell>93.80 96.83</cell></row><row><cell>PreResNet164</cell><cell>93.56 97.90</cell></row><row><cell>PreResNet164-MC Dropout</cell><cell>94.68 97.73</cell></row><row><cell>PreResNet164-SWAG</cell><cell>93.14 97.69</cell></row><row><cell>WideResNet28x10</cell><cell>94.04 97.44</cell></row><row><cell>WideResNet28x10-MC Dropout</cell><cell>95.54 97.63</cell></row><row><cell>WideResNet28x10-SWAG</cell><cell>95.12 97.95</cell></row><row><cell>DeepGaussProcess</cell><cell>91.00 93.00</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Symmetric DKL divergence between in and out of distribution splits of CIFAR-10 (5 + 5) and SVHN (5 + 5). Higher values indicate the ability of a model to flag out of sample instances with high uncertainty.</figDesc><table><row><cell>Models</cell><cell>CIFAR10</cell><cell>SVHN</cell></row><row><cell>VGG16-SGD</cell><cell cols="2">2.952740 5.638210</cell></row><row><cell>VGG16-MC Dropout</cell><cell cols="2">3.849440 6.273212</cell></row><row><cell>VGG16-SWAG</cell><cell cols="2">2.375000 5.058158</cell></row><row><cell>PreResNet164</cell><cell cols="2">4.244287 3.375254</cell></row><row><cell>PreResNet164-MC Dropout</cell><cell cols="2">2.879347 2.705263</cell></row><row><cell>PreResNet164-SWAG</cell><cell cols="2">1.810131 3.344560</cell></row><row><cell>WideResNet28x10</cell><cell cols="2">2.181033 3.051318</cell></row><row><cell cols="3">WideResNet28x10-MC Dropout 2.929135 2.995543</cell></row><row><cell>WideResNet28x10-SWAG</cell><cell cols="2">2.780160 3.646878</cell></row><row><cell>DeepGaussProcess</cell><cell cols="2">0.801246 0.153648</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgement. This work was supported by Science Foundation Ireland under Grant No.15/CDA/3520 and Grant No. 12/RC/2289.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">WAIC, but Why? Generative Ensembles for Robust Anomaly Detection</title>
		<author>
			<persName><forename type="first">H</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Jang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Alemi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.01392</idno>
		<imprint>
			<date type="published" when="2018-10">Oct 2018</date>
		</imprint>
	</monogr>
	<note>cs, stat</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Damianou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">D</forename><surname>Lawrence</surname></persName>
		</author>
		<title level="m">Deep Gaussian Processes</title>
				<imprint>
			<date type="published" when="2012-11">Nov 2012</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv e-prints</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Adversarial vulnerability for any classifier</title>
		<author>
			<persName><forename type="first">A</forename><surname>Fawzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Fawzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Fawzi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2018-02">Feb 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Gal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ghahramani</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015-06">Jun 2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv e-prints</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">On Calibration of Modern Neural Networks</title>
		<author>
			<persName><forename type="first">C</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pleiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2017-06">Jun 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Identity Mappings in Deep Residual Networks</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016-03">Mar 2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv e-prints</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Averaging Weights Leads to Wider Optima and Better Generalization</title>
		<author>
			<persName><forename type="first">P</forename><surname>Izmailov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Podoprikhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Garipov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Vetrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Wilson</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1803.05407</idno>
		<imprint>
			<date type="published" when="2018-03">Mar 2018</date>
		</imprint>
	</monogr>
	<note>cs, stat</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Learning multiple layers of features from tiny images</title>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<ptr target="https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf" />
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="32" to="33" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Verified Uncertainty Calibration</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ma</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1909.10155</idno>
		<imprint>
			<date type="published" when="2019-09">Sep 2019</date>
			<biblScope unit="volume">33</biblScope>
			<pubPlace>Vancouver, Canada</pubPlace>
		</imprint>
	</monogr>
	<note>cs, stat</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Training Confidence-calibrated Classifiers for Detecting Out-of-Distribution Samples</title>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1711.09325</idno>
		<imprint>
			<date type="published" when="2017-11">Nov 2017</date>
		</imprint>
	</monogr>
	<note>cs, stat</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">A Simple Baseline for Bayesian Uncertainty in Deep Learning</title>
		<author>
			<persName><forename type="first">W</forename><surname>Maddox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Garipov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Izmailov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Vetrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Wilson</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1902.02476</idno>
		<imprint>
			<date type="published" when="2019-02">Feb 2019</date>
		</imprint>
	</monogr>
	<note>cs, stat</note>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Stochastic Gradient Descent as Approximate Bayesian Inference</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mandt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Hoffman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017-04">Apr 2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv e-prints</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Do Deep Generative Models Know What They Don&apos;t Know?</title>
		<author>
			<persName><forename type="first">E</forename><surname>Nalisnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Matsukawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">W</forename><surname>Teh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gorur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lakshminarayanan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Netzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Coates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bissacco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
		<ptr target="http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf" />
		<title level="m">Reading digits in natural images with unsupervised feature learning</title>
				<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Acceleration of Stochastic Approximation by Averaging</title>
		<author>
			<persName><forename type="first">B</forename><surname>Polyak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Juditsky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIAM Journal on Control and Optimization</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="838" to="855" />
			<date type="published" when="1992-07">Jul 1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Efficient Estimations from a Slowly Convergent Robbins-Monro Process</title>
		<author>
			<persName><forename type="first">D</forename><surname>Ruppert</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1988-02">Feb 1988</date>
		</imprint>
		<respStmt>
			<orgName>Cornell</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report TR000781</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Can You Trust This Prediction? Auditing Pointwise Reliability After Learning</title>
		<author>
			<persName><forename type="first">P</forename><surname>Schulam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Saria</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of Artificial Intelligence and Statistics</title>
				<meeting>of Artificial Intelligence and Statistics</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page">10</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of &quot;Outlier&quot; Detectors</title>
		<author>
			<persName><forename type="first">A</forename><surname>Shafaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Little</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1809.04729</idno>
		<imprint>
			<date type="published" when="2018-09">Sep 2018</date>
		</imprint>
	</monogr>
	<note>cs, stat</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Very Deep Convolutional Networks for Large-Scale Image Recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014-09">Sep 2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv e-prints</note>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Quantifying Uncertainty to Improve Decision Making in Machine Learning</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Stracuzzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Darling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Peterson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Chen</surname></persName>
		</author>
		<idno>SAND2018-11166</idno>
		<imprint>
			<date type="published" when="2018-10">Oct 2018</date>
			<biblScope unit="page">1481629</biblScope>
		</imprint>
		<respStmt>
			<orgName>Sandia National Laboratories</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Tech. Rep.</note>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Stochastic Variational Deep Kernel Learning</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">P</forename><surname>Xing</surname></persName>
		</author>
		<idno>arXiv e-prints</idno>
		<imprint>
			<date type="published" when="2016-11">Nov 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Zagoruyko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Komodakis</surname></persName>
		</author>
		<idno>arXiv e-prints</idno>
		<title level="m">Wide Residual Networks</title>
				<imprint>
			<date type="published" when="2016-05">May 2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
