<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring the Potential of Bilevel Optimization for Calibrating Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriele Sanguin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arjun Pakrashi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Viola</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Rinaldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Matematica, Università degli Studi di Padova</institution>
          ,
          <addr-line>Via 8 Febbraio 2, 35122 Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science, University College Dublin</institution>
          ,
          <addr-line>Belfield, Dublin 4</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Mathematical Sciences, Dublin City University</institution>
          ,
          <addr-line>Collins Avenue Ext, Dublin 9</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Handling uncertainty is critical for ensuring reliable decision-making in intelligent systems. Modern neural networks are known to be poorly calibrated, resulting in predicted confidence scores that are difficult to use. This article explores improving confidence estimation and calibration through the application of bilevel optimization, a framework designed to solve hierarchical problems with interdependent optimization levels. A self-calibrating bilevel neural-network training approach is introduced to improve a model's predicted confidence scores. The effectiveness of the proposed framework is analyzed using toy datasets, such as Blobs and Spirals, as well as more practical simulated datasets, such as Blood Alcohol Concentration (BAC). It is compared with a well-known and widely used calibration strategy, isotonic regression. The reported experimental results reveal that the proposed bilevel optimization approach reduces the calibration error while preserving accuracy.</p>
      </abstract>
      <kwd-group>
        <kwd>Bilevel optimization</kwd>
        <kwd>confidence</kwd>
        <kwd>calibration</kwd>
        <kwd>neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Machine learning classification algorithms with increasingly higher discriminative power, especially
neural networks, have rapidly developed in the last decade. Such advanced models are generally
supposed to help humans make decisions by assisting. Although these models have a high discriminative
power, sometimes they will predict something completely incorrect with a fairly high confidence score.
This creates a huge problem in highly regulated and sensitive real-world applications (e.g. medical field,
autonomous vehicles, healthcare diagnostics, financial forecasting, etc.) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Therefore, it is important
for these models to provide a meaningful confidence, based on which it can say "I don’t know", when
they are not confident enough, so that the human expert can inspect, and make further decisions. This
is sometimes called learning to reject or abstention [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>
        Confidence is a probabilistic score that a model assigns to each prediction of a data point, and it
determines how certain the model is about the prediction. One straightforward approach to use such a
confidence score to understand if the model is confident enough or not is to define an interval, a rejection
window, within which if the prediction falls, it will be marked as reject. It is however hard to fix such a
confidence window to decide the rejection window for such models, especially modern neural network
models, because such models are known to have poor model confidence calibration [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ].
      </p>
      <p>
        Confidence calibration is the process of adjusting the confidence scores to better align with the actual
likelihood of correctness. Well-calibrated confidence is crucial for efective decision making and is
important for the interpretation of the model, since humans have a natural cognitive intuition for
probabilities [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Accurate confidence scores make it easier for users to comprehend how confident the
model was during prediction and to establish trust on the decisions being made. Moreover, accurate
confidence scores are essential if such a rejection window needs to be defined by a human expert.
      </p>
      <p>
        There are two types of confidence calibration. Firstly, post-calibration, where the output
scores/probabilities of the main model are re-adjusted by another external calibration model. For such methods,
the main model does not need to be modified. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors have demonstrated how well-known
post-calibration methods can be used to calibrate existing models. On the other hand, self-calibration
method are algorithms which integrate the confidence calibration process into the model training itself.
These methods aim to ensure that the model’s probabilities are calibrated during the training phase,
without the need for a separate calibration step.
      </p>
      <p>
        The objective of this article is to present an initial study on whether it is possible to self-calibrate
neural network models by exploiting bilevel optimization (BO), a mathematical framework specifically
designed to solve hierarchical two-level decision-making optimization problems. BO has recently
gained importance in machine learning, particularly in hyperparameter optimization and metalearning
[
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. The choice of such a framework is obvious where the inner-level optimization problem tackles
the training of the model, while the outer-level optimization problem addresses the model confidence
calibration.
      </p>
      <p>
        To the best of our knowledge, in the context of uncertainty scores, the only work in the literature
that uses BO is [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Here, we define a BO framework to train two diferent architectures, one for
classification and one for the uncertainty score (which is then tested as a rejection function), thus
leading to a significant increase of the parameters to be trained. Then we focus on the study on the
selective potential of such a score. The work presented in the current article is the first of its kind, which
is quite distinct from the one in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The objective of the current work is to propose a BO framework to
train a single self-calibrated deep neural network, BO4SC, and provide an initial, but crucial analysis of
the applicability of the approach.
      </p>
      <p>The article is organized as follows. Section 2 discusses the mathematical foundations of confidence
estimation, calibration methods, and bilevel optimization (BO). Section 3 introduces BO4SC, a BO
framework for confidence estimation in neural networks. In Section 4 the initial experiments and
analysis are shown and discussed. Finally, Section 5 concludes the article.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Theoretical Background</title>
      <p>This section will briefly introduce and discuss the relevant parts of confidence estimation, model
calibration evaluation, calibration methods, and bilevel optimization.</p>
      <sec id="sec-2-1">
        <title>2.1. Confidence Estimation</title>
        <p>For a machine learning model, the standard process (referred to as the standard method later in Section 4)
to accurately predict an output involves learning a mapping function that generalizes well from
training data to unseen samples. Let $\{X, Y\} = \{(x_i, y_i)\}_{i=1}^{n}$ be a labeled dataset, where $x_i$ represents
a data point and $y_i \in \{1, 2, \ldots, C\}$ is its corresponding class label, with $C$ being the total number of
classes and $n$ the number of data points. The objective is usually to learn a function $f$ that maps an
unseen data point $x_t$ to a predicted output $\hat{y}_t = f(x_t)$. This function $f$ can be efficiently and effectively
trained by minimizing the empirical loss over all training data.</p>
        <p>
          Confidence estimation involves assigning a probabilistic score to each prediction which reflects the
model’s certainty about the predicted output. Confidence is a score function pˆi = g(f, xi, yˆi), which
measures the likelihood of the prediction yˆi being correct given the features xi and the classifier f .
Ideally, the confidence score should be continuous and fall within the range [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ].
        </p>
        <p>
          A variety of methods have been developed to enable confidence estimation across diferent model
types. Among these approaches, we find distance-based methods, which use the distance of a data point
from other points, decision boundaries, or centroids of classes to estimate confidence [
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14">10, 11, 12, 13, 14</xref>
          ].
Bayesian uncertainty methods use Bayesian principles to model uncertainty, providing a probabilistic
interpretation of confidence [
          <xref ref-type="bibr" rid="ref15 ref16 ref17 ref18">15, 16, 17, 18</xref>
          ]. Reconstruction error techniques rely on the error of
reconstructing input data to obtain a confidence score, often used in models with an encoder-decoder
framework, such as autoencoders. The idea is that a high reconstruction error indicates a lower
confidence in the prediction of the model, [
          <xref ref-type="bibr" rid="ref19">19, 20</xref>
          ]. Ensemble methods utilize the variance among
predictions from multiple models to estimate confidence [ 21, 22]. Extreme value theory (EVT) approaches
are based on EVT that assess confidence by modeling the tail distributions of prediction scores [ 23].
Finally, logits-based techniques involve the use of logits, which include probabilistic outputs and other
mechanisms derived from the raw scores produced especially by neural networks. These scores can be
transformed or analyzed to estimate the confidence of the predictions [24].
        </p>
        <p>Logits-based methods are the ones gaining the most attention, owing to the recent extensive use
of neural networks. The experiments in the current work use a smooth version of the maximum class
probability (MCP), which is a common approach in many classification tasks. We define the MCP as
$$\hat{p}(x) = \max_{c} P(y = c \mid x), \tag{1}$$
where $P(y = c \mid x)$ represents the predicted probability of class $c$ for input $x$ after applying a softmax
function to the logits.</p>
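        <p>As a concrete illustration of Eq. (1), the following is a minimal PyTorch sketch of the (hard) MCP; the function name is ours, and the smooth variant actually used in the experiments, based on the Boltzmann operator, is sketched in Section 4.</p>
        <preformat>
import torch

def mcp(logits: torch.Tensor) -> torch.Tensor:
    """Maximum class probability, Eq. (1): softmax over the logits, then the maximum."""
    return torch.softmax(logits, dim=-1).max(dim=-1).values

# Three samples over C = 4 classes: peaked rows give a high MCP, flat rows give values near 1/C.
logits = torch.tensor([[4.0, 0.1, -1.0, 0.3],
                       [0.5, 0.4, 0.6, 0.5],
                       [-0.2, 3.0, 0.0, 0.1]])
print(mcp(logits))
        </preformat>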
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evaluating Model Calibration</title>
        <p>
          Unlike classification functions, which can be eficiently learned from labeled data, there is no available
supervisory information for directly learning a confidence function and the first challenge is how to
estimate it from pre-trained models (e.g. with MCP). Unfortunately, in many cases, especially in modern
deep neural networks, the calculated confidences tend to be overestimated, meaning that the models
are over-confident (see [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]). This can be formally described as
        </p>
        <p>P (yˆ = y | pˆ = p) &lt; p,
where p represents the true probability.</p>
        <p>To address this issue, it is necessary to employ methods to calibrate the confidence. A model is
considered calibrated if $\hat{p}_i$ accurately reflects the true likelihood of correctness:
$$P(\hat{y} = y \mid \hat{p} = p) = p, \quad \forall p \in [0, 1].$$</p>
        <p>To understand whether a model is well-calibrated, one can exploit metrics quantifying the degree to
which the model's predicted probabilities align with the actual outcomes. It is important
to note that there is no single, universally accepted metric for assessing calibration.</p>
        <p>In this work, we will use two of the most common calibration metrics in the literature, namely:
reliability diagrams and expected calibration error (ECE).</p>
        <p>Reliability diagrams are a visual tool used to assess model calibration [25, 26] by plotting the expected
accuracy of samples against their predicted confidence levels. The predictions are grouped into $M$
interval bins, each of size $1/M$. Letting $B_m$ represent the set of indices of samples whose predicted
confidence falls within the interval $I_m = \left(\frac{m-1}{M}, \frac{m}{M}\right]$, the accuracy for bin $B_m$ is calculated as
$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i), \tag{2}$$
where $\hat{y}_i$ and $y_i$ are the predicted and true class labels for data point $x_i$, respectively. According to basic
probability theory, $\mathrm{acc}(B_m)$ serves as an unbiased and consistent estimator of $P(\hat{y} = y \mid \hat{p} \in I_m)$.</p>
        <p>The average confidence within the bin $B_m$ is given by
$$\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i, \tag{3}$$
where $\hat{p}_i$ represents the confidence of sample $i$. For a perfectly calibrated model, the relationship
$\mathrm{acc}(B_m) = \mathrm{conf}(B_m)$ should hold for all $m \in \{1, \ldots, M\}$, i.e., the plot should follow the identity line.
It is important to note that reliability diagrams do not display the proportion of samples in each bin.
This is why they are often paired with a density plot of the predicted confidences, called a confidence
histogram.</p>
        <p>The ECE represents the weighted average of the absolute difference between accuracy and confidence
over all prediction bins. Formally, it is defined as
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|, \tag{4}$$
where $n$ is the total number of samples. Although the ECE is widely adopted due to its simplicity and
interpretability, it is sensitive to the choice of the number of bins $M$, which can affect the accuracy of
the measurement.</p>
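        <p>To make Eqs. (2)-(4) concrete, a minimal NumPy sketch of the ECE computation follows; the bin count, the synthetic over-confident scores, and the function name are our own illustrative choices.</p>
        <preformat>
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE, Eq. (4): weighted average of |acc(B_m) - conf(B_m)| over M equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for m in range(n_bins):
        # B_m collects the samples whose confidence falls in ((m-1)/M, m/M].
        in_bin = (confidences > edges[m]) &amp; (confidences &lt;= edges[m + 1])
        if in_bin.any():
            acc = correct[in_bin].mean()         # acc(B_m), Eq. (2)
            conf = confidences[in_bin].mean()    # conf(B_m), Eq. (3)
            total += in_bin.mean() * abs(acc - conf)   # |B_m|/n times the gap
    return total

# Example: an over-confident model, ~90% average confidence but only ~70% accuracy.
rng = np.random.default_rng(0)
conf = rng.uniform(0.8, 1.0, 1000)
correct = (rng.random(1000) &lt; 0.7).astype(float)
print(expected_calibration_error(conf, correct))   # roughly |0.9 - 0.7| = 0.2
        </preformat>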
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Calibration Methods</title>
        <p>Calibration methods can be categorized into two types, namely post-calibration and self-calibration.</p>
        <p>Post-calibration: These methods involve adjusting the output probabilities of a pretrained model
using a separate calibration model, applied after the initial model has been trained. This adjustment
aims to align the predicted probabilities with the true likelihood of events.</p>
        <p>
          Among the most common approaches we find histogram binning [27], Bayesian binning into
quantiles (BBQ) [28], Platt scaling [29, 26] and its derivatives matrix scaling, vector scaling, and temperature
scaling [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Other recent methods include beta calibration [30], shape-restricted polynomial regression
[31], and neural calibration [32].
        </p>
        <p>The experiments in the current work make use of isotonic regression [33], because of its simplicity
and effectiveness. Isotonic regression learns a piecewise constant function f to transform uncalibrated
outputs into calibrated ones, by minimizing the squared loss subject to the constraint that f is a
non-decreasing function.</p>
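        <p>As an illustration, such a post-calibration step can be written in a few lines with scikit-learn; the synthetic scores below are a stand-in for a real model's held-out validation outputs.</p>
        <preformat>
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Uncalibrated confidences on a held-out set, with 0/1 correctness targets.
rng = np.random.default_rng(1)
conf_val = rng.uniform(0.5, 1.0, 500)
correct_val = (rng.random(500) &lt; conf_val ** 2).astype(float)   # over-confident scores

# Learn a non-decreasing, piecewise-constant map from raw to calibrated confidence.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(conf_val, correct_val)

# At test time the main model is untouched; only its scores are remapped.
print(iso.predict(np.array([0.55, 0.75, 0.95])))
        </preformat>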
        <p>
          Self-calibration: These methods integrate the calibration process into the model training itself.
These methods aim to ensure that the model’s probabilities are calibrated during the training phase,
without the need for a separate calibration step. Self-calibration often requires modifying the loss
function or the training procedure to directly incorporate the calibration objectives. Techniques such as
Bayesian neural networks, which incorporate uncertainty directly into the model predictions through
probabilistic inference, inherently produce better-calibrated probabilities [
          <xref ref-type="bibr" rid="ref16">16, 34</xref>
          ].
        </p>
        <p>The main objective of this article is to explore a new self-calibration strategy for neural networks
that makes use of a bilevel optimization framework.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Bilevel Optimization</title>
        <p>Bilevel optimization (BO) is a mathematical approach designed to address hierarchical decision-making
processes, where decisions made at an outer level influence the outcomes of an inner level, which in
turn affects the outer level. This hierarchical structure is prevalent in many real-world scenarios, such as
economics, engineering, management, and various public and private sector operations. The distinctive
feature of bilevel optimization lies in its two interconnected levels of optimization. Each level has its
own objectives and constraints, and there are two classes of decision vectors: the leader’s (outer-level)
decision vectors and the follower’s (inner-level) decision vectors. The inner-level optimization is a
parametric optimization problem solved with respect to the inner-level decision vectors, while the
outer-level decision vectors act as parameters. The inner-level optimization problem acts as a constraint
to the outer-level optimization problem, such that only those solutions are considered feasible that are
optimal for the inner level.</p>
        <p>By denoting the outer and inner parameters as $w$ and $\theta$, respectively, we can define an unconstrained
BO problem as
$$\min_{w} f(w, \theta^*) \quad \text{s.t.} \quad \theta^* \in \arg\min_{\theta} g(w, \theta), \tag{5}$$
where $\theta^*$ is one of the minimizers of $g$.</p>
        <p>
          Gradient-based approaches are now the most commonly used methods for solving bilevel optimization
problems. The most compelling approach to gradient-based bilevel optimization is to replace the inner
problem with a dynamical system. This idea, discussed, e.g., in [
          <xref ref-type="bibr" rid="ref7 ref35 ref36">7, 35, 36</xref>
          ], involves approximating the
bilevel problem with a sequence of optimization steps, which allows for efficient gradient computation.
        </p>
        <p>Specifically, consider a prescribed positive integer $T$ and let $[T] = \{1, 2, \ldots, T\}$. We now rewrite
the bilevel problem Eq. (5) with the following approximation:
$$\min_{w} f(w, \theta^T(w)) \quad \text{s.t.} \quad \theta^0(w) = \Phi^0(w), \quad \theta^t(w) = \Phi^t(\theta^{t-1}(w), w), \; t \in [T], \tag{6}$$
where $\Phi^0 : \mathbb{R}^n \to \mathbb{R}^m$ is a smooth initialization mapping and, for each $t \in [T]$, $\Phi^t : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}^m$
represents the operation performed by the $t$-th step of an optimization algorithm. For example, if the
optimization dynamics is gradient descent, we might have
$$\Phi^t(\theta^{t-1}, w) = \theta^{t-1} - \eta^t \nabla_\theta g(\theta^{t-1}, w), \tag{7}$$
where $(\eta^t)_{t \in [T]}$ is a sequence of step sizes.</p>
        <p>This approach approximates the bilevel problem and makes it possible to use gradient descent also
on the outer objective. To this end, one has to compute a hypergradient, which is the gradient of
the outer objective $f(w, \theta^T(w))$ with respect to the hyperparameters $w$, i.e.,
$$\nabla_w f(w, \theta^T(w)) = \nabla_w f(w, \theta^T) + [J_{\theta^T(w)}(w)]^\top \nabla_\theta f(w, \theta^T), \tag{8}$$
where the rows of the Jacobian matrix $J_{\theta^T(w)}(w)$ contain the gradients of the entries of $\theta^T$ with respect to $w$.</p>
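        <p>The following toy sketch illustrates Eqs. (6)-(8) on a one-dimensional problem whose exact hypergradient is known in closed form; the specific inner and outer losses are our own illustrative choices, and reverse-mode differentiation through the unrolled inner loop recovers the expected value.</p>
        <preformat>
import torch

# Toy instance: inner loss g(w, θ) = 0.5 (θ - w)^2, so θ*(w) = w;
# outer loss f(w, θ) = 0.5 (θ - 1)^2, hence the exact hypergradient is w - 1.
w = torch.tensor(0.3, requires_grad=True)
theta = torch.zeros(())            # θ^0(w) = Φ^0(w): constant initialization
eta, T = 0.5, 20

for t in range(T):                 # θ^t = Φ^t(θ^{t-1}, w): unrolled steps, Eq. (7)
    theta = theta - eta * (theta - w)     # ∇_θ g(θ, w) = θ - w; graph kept w.r.t. w

outer = 0.5 * (theta - 1.0) ** 2
hypergrad, = torch.autograd.grad(outer, w)   # reverse-mode AD through the unroll, Eq. (8)
print(hypergrad.item())            # ≈ 0.3 - 1 = -0.7 once θ^T ≈ θ*(w) = w
        </preformat>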
        <p>The reformulation (6) allows for efficient computation of the hypergradient using reverse- or forward-mode
algorithmic differentiation.</p>
      </sec>
    </sec>
    <sec id="sec-bo4sc">
      <title>3. BO4SC: A Bilevel Optimization Framework for Self-Calibration</title>
        <p>We introduce here the bilevel optimization framework we designed to enhance confidence estimation,
which we will name BO4SC.</p>
        <p>We assume here that the prediction models are characterized by a dual-output structure: one output
provides the prediction for the data point, the other estimates the confidence of that prediction. This
is essential because we want both the class predictions and the confidence estimation to depend
on the same model parameters. For the model $m$, parametrized by $\theta$, we denote the output for
the sample $x_i$ by
$$m(x_i, \theta) = (\hat{y}(x_i, \theta), \hat{p}(x_i, \theta)) = (\hat{y}_i, \hat{p}_i),$$
where $\hat{y}_i$ is the class prediction and $\hat{p}_i$ is its confidence estimate.</p>
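        <p>A minimal PyTorch sketch of such a dual-output model follows; only the dual-output contract is taken from the text, while the backbone architecture and the β value of the smooth maximum are illustrative assumptions.</p>
        <preformat>
import torch
import torch.nn as nn

class DualOutputNet(nn.Module):
    """m(x, θ) = (ŷ, p̂): class probabilities and a confidence score sharing the same θ."""

    def __init__(self, in_dim: int, n_classes: int, hidden: int = 64, beta: float = 10.0):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )
        self.beta = beta

    def forward(self, x):
        probs = torch.softmax(self.backbone(x), dim=-1)
        # Differentiable MCP via the Boltzmann operator (see Section 4) instead of a hard max.
        conf = (probs * torch.softmax(self.beta * probs, dim=-1)).sum(dim=-1)
        return probs, conf
        </preformat>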
        <p>Now consider the optimization problem in Eq. (5), where the outer parameters are the sample weights $w$ and the
inner parameters are the parameters $\theta$ of the model $m_\theta$. The inner loss function $g$ is evaluated on the training set ($D_{\mathrm{train}}$) and
minimizes a weighted cross-entropy (CE) loss over the model's prediction output with
respect to the model's parameters $\theta$:
$$g(w, \theta) = \frac{1}{|D_{\mathrm{train}}|} \sum_{i \in D_{\mathrm{train}}} w_i \cdot \mathrm{CE}(\hat{y}(x_i, \theta), y_i), \tag{9}$$
supposing $\theta^*$ to be unique and where the CE loss is defined as
$$\mathrm{CE}(\hat{y}(x_i, \theta), y_i) = -\sum_{c=1}^{C} y_{i,c} \log(\hat{y}(x_i, \theta)_c). \tag{10}$$
Here, $C$ represents the number of classes, $y_{i,c}$ is the binary indicator (0 or 1) of whether class label $c$ is the
correct classification for input $x_i$, and $\hat{y}(x_i, \theta)_c$ is the softmax output for class $c$ given input $x_i$ according to
the model $\hat{y}(\cdot, \theta)$.</p>
        <p>The outer loss function $f$, on the other hand, is evaluated on the validation set ($D_{\mathrm{val}}$), where it aims
to minimize a binary cross-entropy (BCE) loss on the model's confidence output $\hat{p}(\cdot, \theta)$. The objective is
to learn weights for each sample in the training set that can effectively balance the trade-off between
prediction accuracy and confidence calibration:
$$f(w, \theta^*) = \frac{1}{|D_{\mathrm{val}}|} \sum_{j \in D_{\mathrm{val}}} \mathrm{BCE}(\hat{p}(x_j, \theta^*_w), y_j), \tag{11}$$
where $\theta^*_w$ are the model parameters found by the inner problem, which depend on the weights $w$
assigned to the training samples, and $\hat{p}(\cdot, \theta)$ is the confidence output of the model.</p>
        <p>The binary cross-entropy (BCE) loss is defined as
$$\mathrm{BCE}(\hat{p}(x_j, \theta^*), y_j) = -\left[ y_j^B \log(\hat{p}(x_j, \theta^*)) + (1 - y_j^B) \log(1 - \hat{p}(x_j, \theta^*)) \right].$$
In this equation, $y_j^B$ is the true binary label (0 or 1) for the sample $x_j$, indicating whether $x_j$ has
been correctly classified (i.e. $\hat{y}_j = y_j$), and $\hat{p}(x_j, \theta^*)$ represents the predicted confidence (probability) that
$x_j$ belongs to the positive class according to the model.</p>
        <p>The difficulty in solving this bilevel optimization problem usually lies in the accurate computation of
the hypergradient $\nabla_w L_{\mathrm{outer}}(w) = \nabla_w f(w, \theta^*_w)$, which necessitates sophisticated approaches
with a large cost in time and memory.</p>
        <p>We schematize as Algorithm 1 the approximate hypergradient descent algorithm we implemented to
solve the BO4SC problem.</p>
        <p><bold>Algorithm 1</bold> BO4SC via Approximate Hypergradient Descent</p>
        <preformat>
Initialize: set initial outer weights w^0 and model parameters θ^0.
for j = 0, 1, ... do
    for k = 0 to T - 1 do                 {inner loop: gradient descent on the inner loss}
        Compute the gradient of the inner loss w.r.t. θ^k:
            ∇_θ g(w^j, θ^k) = (1 / |D_train|) Σ_{i ∈ D_train} w_i^j · ∇_θ CE(ŷ(x_i, θ^k), y_i)   (12)
        Update the model parameters θ^k using gradient descent:
            θ^{k+1} = θ^k - η_θ · ∇_θ g(w^j, θ^k)                                                (13)
    end for
    Set θ_{w^j} = θ^T                     {final inner solution after T iterations, as a function of w}
    Compute the hypergradient, i.e. the gradient of the outer loss w.r.t. w, using the approximate θ_{w^j}:
            ∇_w f(w^j, θ_{w^j}) = (1 / |D_val|) Σ_{l ∈ D_val} ∇_w BCE(p̂(x_l, θ_{w^j}), y_l)      (14)
    Update the outer parameters w^j using gradient descent:
            w^{j+1} = w^j - η_w · ∇_w f(w^j, θ_{w^j})
end for
        </preformat>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments and Results</title>
      <p>In this section we present our experiment process and the results. First, we give an overview on the
training approaches we compared and the datasets we used. Then we analyse the experiment results.</p>
      <p>This work mainly focuses on the proposed method along with two others, which are described as
follows:
• Standard: the standard training procedure, in which the model's parameters are
updated using backpropagation based on a single loss function.
• Isotonic Regression (IsoReg): the non-parametric method used to calibrate confidence scores
after the initial training phase of a model with the Standard method (see Section 2.3).
• BO4SC: the method proposed in this work (Algorithm 1).</p>
      <p>In the implementation of the BO4SC algorithm, particularly for the explicit calculation of the outer
loss gradient with respect to $w$ (that is, the gradient of $\theta^T_w$ with respect to $w$), the Python package
torchopt [37] was used. torchopt is a library that extends PyTorch [38] by providing tools for
higher-order optimization, specifically tailored for problems involving complex optimization hierarchies such
as bilevel optimization, and it enables efficient computation of hypergradients. By leveraging torchopt,
we can accurately and efficiently compute the required gradients in Eq. (14), thereby facilitating the
optimization process in our experiments.</p>
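      <p>The sketch below shows how Algorithm 1 can be assembled with torchopt's differentiable optimizers, reusing the DualOutputNet sketched in Section 3. The random data, the sigmoid reparameterization keeping the sample weights in [0, 1], the cold restart of the inner problem, and all hyperparameter values are our assumptions, not the paper's exact configuration.</p>
      <preformat>
import torch
import torch.nn.functional as F
import torchopt   # differentiable optimizers for PyTorch [37]

net = DualOutputNet(in_dim=2, n_classes=5)                 # from the sketch in Section 3
x_tr, y_tr = torch.randn(700, 2), torch.randint(0, 5, (700,))
x_val, y_val = torch.randn(300, 2), torch.randint(0, 5, (300,))

w_raw = torch.zeros(len(x_tr), requires_grad=True)         # outer parameters w
outer_opt = torch.optim.SGD([w_raw], lr=0.1)               # outer step size η_w

for j in range(100):                                       # outer iterations
    state = torchopt.extract_state_dict(net)               # snapshot θ^0
    inner_opt = torchopt.MetaAdam(net, lr=1e-2)            # differentiable inner Adam, η_θ
    for k in range(5):                                     # T inner steps, Eqs. (12)-(13)
        probs, _ = net(x_tr)
        ce = F.nll_loss(torch.log(probs + 1e-12), y_tr, reduction="none")
        inner_loss = (torch.sigmoid(w_raw) * ce).mean()    # weighted CE, Eq. (9)
        inner_opt.step(inner_loss)                         # update θ, keeping the graph w.r.t. w
    probs_val, conf_val = net(x_val)
    correct = (probs_val.argmax(dim=-1) == y_val).float()  # binary labels y^B, Section 3
    outer_loss = F.binary_cross_entropy(conf_val.clamp(1e-6, 1 - 1e-6), correct)  # Eq. (11)
    outer_opt.zero_grad()
    outer_loss.backward()                                  # hypergradient, Eq. (14)
    outer_opt.step()
    torchopt.recover_state_dict(net, state)                # cold-start the next inner problem
      </preformat>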
      <p>To facilitate the initial investigation, some toy datasets were built and used as a diagnostic
tool to understand the behavior of BO4SC, as well as how it compares with the other methods. These datasets have two
features to facilitate visual inspection.</p>
      <p>The first two datasets are Blobs 1.3 and Blobs 1.7, each of which has two dimensions and five classes,
where the blobs are generated from a normal distribution with standard deviations of 1.3 and 1.7,
respectively. The third and fourth are two-class datasets named Spiral 2.5 and Spiral 3.5, consisting of
two interlocking spiral-shaped regions, each corresponding to one class, with the values 2.5 and 3.5
indicating the standard deviation from the center of the spiral, thus controlling the amount of overlap
between the regions. These datasets are used for diagnostic purposes to understand the behaviour of
the algorithm. Finally, we used the Blood Alcohol Concentration (BAC) dataset, which is commonly
utilized in decision-making and confidence estimation tasks. The data were first collected by Nugent and
Cunningham [39] and can be used for regression or binary classification, depending on whether a
threshold is set on the BAC level to distinguish between classes. Both the toy datasets (Blobs and Spirals)
and BAC consist of 2000 samples in total: 700 are used for training, 300 for validation, and 1000 for
the test set.</p>
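      <p>Toy data in this spirit can be generated as below; the exact generators used in the paper are not specified, so the spiral parameterization and the use of scikit-learn's make_blobs are assumptions, with only the dimensionality, class counts, noise levels, and split sizes taken from the text.</p>
      <preformat>
import numpy as np
from sklearn.datasets import make_blobs

def blobs(std, n=2000, seed=0):
    # Five isotropic Gaussian blobs in 2-D, e.g. std = 1.3 or 1.7.
    return make_blobs(n_samples=n, centers=5, n_features=2, cluster_std=std, random_state=seed)

def spirals(noise_std, n=2000, seed=0):
    # Two interlocking spiral arms, one per class, with Gaussian noise around each arm.
    rng = np.random.default_rng(seed)
    m = n // 2
    t = np.sqrt(rng.uniform(0.0, 1.0, m)) * 4 * np.pi
    arm = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
    X = np.vstack([arm, -arm]) + rng.normal(0.0, noise_std, (n, 2))
    y = np.repeat([0, 1], m)
    return X, y

X, y = blobs(1.7)
X_tr, y_tr = X[:700], y[:700]            # 700 train / 300 validation / 1000 test
X_va, y_va = X[700:1000], y[700:1000]
X_te, y_te = X[1000:], y[1000:]
      </preformat>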
      <p>For each dataset a feed-forward neural network has been implemented, with a softmax function
applied to the final logits. The MCP is extracted with a smooth maximum function, namely the Boltzmann
operator [40], to keep the confidence score differentiable with respect to the model parameters. The
Adam [41] optimizer was used in the standard training and in the inner loop of BO4SC (to optimize the
model parameters $\theta$). All hyperparameters have been selected through a grid search. Besides the number
of epochs, in the bilevel approach it is important to tune the number of inner iterations $T$ and the
learning rate $\eta_w$ for the update of the outer parameters.</p>
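      <p>The Boltzmann operator of [40] can be sketched as follows; β controls the sharpness of the soft maximum, and the value used here is an arbitrary illustration rather than the tuned setting.</p>
      <preformat>
import torch

def boltzmann_max(p: torch.Tensor, beta: float = 10.0) -> torch.Tensor:
    """boltz_β(p) = Σ_i p_i · exp(β p_i) / Σ_j exp(β p_j): a differentiable soft maximum
    that approaches max_i p_i as β → ∞ (Asadi and Littman [40])."""
    return (torch.softmax(beta * p, dim=-1) * p).sum(dim=-1)

probs = torch.tensor([[0.70, 0.20, 0.10]])
print(boltzmann_max(probs))   # ≈ 0.696: close to, but smoothly below, the hard max 0.70
      </preformat>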
      <sec id="sec-3-1">
        <title>4.1. Confidence Estimation and Calibration</title>
        <p>What interests us in our experiments is assessing how well the methods predict calibrated confidence
estimates. We begin with an analysis using one of the toy datasets, where we can observe how well
the models differentiate between high-confidence and low-confidence regions.</p>
        <p>The toy dataset Blobs 1.7 provides an excellent case for this analysis. In Figure 1, we present
three different plots, each representing the confidence estimation results of one method. These
plots visually demonstrate the predicted confidence levels across the entire input space, highlighting
areas where each model is more or less confident in its predictions.</p>
        <p>The standard model is highly confident in most regions, as indicated by the yellow areas. These
regions reflect the areas where the model predicts class membership with high certainty (confidence
value in (0.9, 1]). However, this confidence is sharply reduced in very narrow areas corresponding to the
decision boundaries, represented by the green regions. These ‘lines’ of uncertainty appear consistently
thin across different parts of the dataset, irrespective of the degree of overlap between classes. In
contrast, the BO4SC model's confidence regions show a different pattern. Here, the uncertainty regions
are considerably broader, especially in areas where the classes overlap more. This broader distribution
of uncertainty better reflects the true complexity and intersections within the data, suggesting that the
BO4SC approach is more sensitive to the nuances of the dataset's distribution. This ability, which also
characterizes the isotonic regression post-calibration method, represents a significant improvement
over the Standard model, highlighting the advantages of a post- or self-calibration technique in addressing
the confidence estimation challenge.</p>
        <p>A more detailed examination using quantitative metrics is essential to rigorously evaluate the
effectiveness of these methods and of bilevel optimization in producing well-calibrated models.</p>
        <p>The first step is to examine the confidence calibration through reliability diagrams (Section 2.2) and
confidence histograms. These visual tools provide a direct representation of the relationship between
predicted probabilities and actual observed frequencies, allowing for a straightforward assessment of a
model's calibration. The plots are consistently similar across all datasets, and we present in Figure 2 the
reliability diagrams for the Spiral 3.5 dataset, which emphasize a drawback of isotonic regression. A
critical aspect to consider is the gap between the two dashed vertical lines in the confidence histogram:
the darker line represents the average accuracy, while the lighter grey line indicates the average confidence.
For a model to be considered well-calibrated, these two lines should ideally overlap, or at least be very
close to each other. The closer these lines are, the more aligned the model's predicted confidence is with
its actual performance. When we examine the toy datasets, the gap between these two lines becomes
particularly noticeable. The Standard model consistently displays the largest gap between the average
accuracy and the average confidence across all datasets. This wide gap, with the darker line staying
on the left side, implies that the model's confidence scores are overly optimistic and do not accurately
reflect its true performance.</p>
        <p>On the other hand, the BO4SC model shows the smallest gap, indicating that it has a more accurate
alignment between confidence and accuracy. The IsoReg method also achieves a relatively close
alignment between these two metrics. However, there is a nuanced difference between the confidence
distribution obtained through bilevel optimization and the distribution achieved by post-calibration
methods like IsoReg. Although IsoReg effectively narrows the gap between accuracy and confidence,
it does not always adjust confidence predictions appropriately. In the spiral datasets, for example, the
IsoReg model produces confidence scores that fall within the (0, 0.5] range. Since these datasets are
binary classification tasks, the minimum reasonable confidence score should be around 0.5, reflecting
the baseline probability of a random guess. The presence of lower confidence scores indicates an
improper adjustment by the IsoReg model, which underestimates the confidence,
thereby deviating from a reasonable calibration.</p>
        <p>The reliability diagrams further reinforce the conclusions drawn from the confidence histograms.
The Standard model demonstrates a clear tendency toward overconfidence. This is evident from the
prevalence of orange gaps, especially in the higher confidence bins. In contrast, the bilevel optimization
approach exhibits much better calibration, with visually more balanced reliability diagrams. Interestingly,
while IsoReg effectively reduces the overconfidence seen in the Standard model, it introduces occasional
calibration issues of its own. In particular, it may undercorrect or overcorrect certain confidence levels,
leading to gaps that are not entirely aligned with the model's true accuracy.</p>
        <p>With regard to confidence calibration metrics, we report in Table 1 the results for the expected
calibration error (ECE) and the accuracy of the models. The ECE shows
that the bilevel optimization method generally achieves lower values compared to the traditional
Standard and IsoReg methods, indicating better calibration and more reliable confidence scores that are
closer to the true probabilities, while keeping good accuracy overall.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Training Weights</title>
        <p>We can make some additional comments regarding the training approach that exploits bilevel
optimization. One of these relates to the role of the weights assigned to each training sample. The weighted
approach allows the model to prioritize certain samples over others during training, potentially
leading to better calibration and improved performance on more challenging or ambiguous
classifications. By studying the evolution of these weights in BO4SC, we can better understand how the method
operates.</p>
        <p>In Figure 3 the history of the weight values (left panel) and their final distribution (right panel)
are reported for the Blobs 1.7 dataset. In the left panel, the red lines indicate the weights associated
with the samples that end up misclassified. One can clearly see that the weights often
move in groups, creating bundles of lines that follow the same trend. These bundles might correspond to groups of
samples that are close to each other in the feature space or share the same characteristics. The main
observation is that most of the red lines end between 0 and 0.5, while the darker lines mostly end above
the middle value.</p>
        <p>Looking at the right panel of Figure 3, one can observe that the BO4SC approach assigns a weight
value of 1.0 to samples that are clearly and confidently classified into a single class, typically those
located near the center of each cluster, far from the decision boundaries. As samples approach these
boundaries, their weights decrease, converging towards 0.5 or even lower. This trend reflects BO4SC's
strategy of diminishing the influence of samples that are ambiguous or more likely to be misclassified.
This is visually evident, as many of these samples are marked with a red contour to indicate their
misclassification (they belong to a cluster of a different class) and appear as dark-colored (black) points,
indicative of their low weight.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <p>In this article, we explored a novel bilevel optimization approach to address the challenge of self-calibrating
a neural network in classification tasks. The objective was to improve the confidence predicted by a
model in such a way that it better reflects the actual accuracy and is more meaningful in
ambiguous scenarios. We carried out experiments and analysis across a variety of datasets, ranging from
toy datasets like Blobs and Spirals to more complex ones like BAC, and demonstrated the effectiveness
of bilevel methods, particularly in their ability to refine confidence by dynamically adjusting sample
weights during training.</p>
      <p>We used the expected calibration error (ECE) to quantitatively assess the models' performance. The
consistent superiority of the bilevel approach over traditional methods highlights its ability to enhance
classifier reliability while maintaining good accuracy overall.</p>
      <p>The bilevel approach also behaves well when compared with post-calibration techniques. It
achieves better results and, more importantly, it does not suffer from the typical issues that show up when
post-calibrating the confidence. In fact, we found that fine-tuning with post-calibration methods, like
isotonic regression, occasionally leads to over-adjustments, resulting in overly cautious confidence
estimates. For this reason, the confidence produced by the bilevel optimization method would be more
trustworthy in a real-world scenario.</p>
      <p>While the results are promising, future research should focus on further refining these techniques.
There is indeed still room for improvement on the computational side, i.e., the execution time and
memory footprint of our bilevel approach are not always competitive with traditional training.
Another future research direction lies towards reject-option classification, which allows models to refrain
from making uncertain predictions.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work was supported by the STEM Challenge Fund 2023, University College Dublin.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Djolonga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Romijnders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hubis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lucic</surname>
          </string-name>
          ,
          <article-title>Revisiting the calibration of modern neural networks</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>34</volume>
          , Curran Associates, Inc.,
          <year>2021</year>
          , pp.
          <fpage>15682</fpage>
          -
          <lpage>15694</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.-Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , G.-S. Xie,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-L. Liu</surname>
          </string-name>
          ,
          <article-title>A survey on learning to reject</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          <volume>111</volume>
          (
          <year>2023</year>
          )
          <fpage>185</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hendrickx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Perini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Van der Plas</surname>
          </string-name>
          , W. Meert,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <article-title>Machine learning with a reject option: A survey</article-title>
          ,
          <source>Machine Learning</source>
          <volume>113</volume>
          (
          <year>2024</year>
          )
          <fpage>3073</fpage>
          -
          <lpage>3110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Guo</surname>
          </string-name>
          , G. Pleiss,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <article-title>On calibration of modern neural networks</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1321</fpage>
          -
          <lpage>1330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cosmides</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tooby</surname>
          </string-name>
          ,
          <article-title>Are humans good intuitive statisticians after all? rethinking some conclusions from the literature on judgment under uncertainty</article-title>
          ,
          <source>Cognition</source>
          <volume>58</volume>
          (
          <year>1996</year>
          )
          <fpage>1</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <article-title>Hyperparameter optimization with approximate gradient</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>737</fpage>
          -
          <lpage>746</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Franceschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Donini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Frasconi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pontil</surname>
          </string-name>
          ,
          <article-title>Forward and reverse gradient-based hyperparameter optimization</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1165</fpage>
          -
          <lpage>1173</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Franceschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Frasconi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Salzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pontil</surname>
          </string-name>
          ,
          <article-title>Bilevel programming for hyperparameter optimization and meta-learning</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1568</fpage>
          -
          <lpage>1577</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shenoy</surname>
          </string-name>
          ,
          <article-title>Selective classification using a robust meta-learning approach</article-title>
          ,
          <source>arXiv preprint arXiv:2212.05987</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. K.</given-names>
            <surname>Saul</surname>
          </string-name>
          ,
          <article-title>Distance metric learning for large margin nearest neighbor classification</article-title>
          .,
          <source>Journal of machine learning research 10</source>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>P. R.</given-names><surname>Mendes Júnior</surname></string-name>
          ,
          <string-name><given-names>R. M.</given-names><surname>De Souza</surname></string-name>
          ,
          <string-name><given-names>R. d. O.</given-names><surname>Werneck</surname></string-name>
          ,
          <string-name><given-names>B. V.</given-names><surname>Stein</surname></string-name>
          ,
          <string-name><given-names>D. V.</given-names><surname>Pazinato</surname></string-name>
          ,
          <string-name><given-names>W. R.</given-names><surname>De Almeida</surname></string-name>
          ,
          <string-name><given-names>O. A.</given-names><surname>Penatti</surname></string-name>
          ,
          <string-name><given-names>R. d. S.</given-names><surname>Torres</surname></string-name>
          ,
          <string-name><given-names>A.</given-names><surname>Rocha</surname></string-name>
          ,
          <article-title>Nearest neighbors distance ratio open-set classifier</article-title>
          ,
          <source>Machine Learning</source>
          <volume>106</volume>
          (
          <year>2017</year>
          )
          <fpage>359</fpage>
          -
          <lpage>386</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>To trust or not to trust a classifier</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>31</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mandelbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weinshall</surname>
          </string-name>
          ,
          <article-title>Distance-based confidence score for neural network classifiers</article-title>
          ,
          <source>arXiv preprint arXiv:1709.09844</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Papernot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McDaniel</surname>
          </string-name>
          ,
          <article-title>Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning</article-title>
          ,
          <source>arXiv preprint arXiv:1803.04765</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          ,
          <article-title>Dropout as a bayesian approximation: Representing model uncertainty in deep learning</article-title>
          ,
          <source>in: international conference on machine learning, PMLR</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1050</fpage>
          -
          <lpage>1059</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Blundell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cornebise</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <article-title>Weight uncertainty in neural network</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1613</fpage>
          -
          <lpage>1622</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kristiadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hennig</surname>
          </string-name>
          ,
          <article-title>Being bayesian, even just a bit, fixes overconfidence in relu networks</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>5436</fpage>
          -
          <lpage>5446</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Riquelme</surname>
          </string-name>
          , G. Tucker,
          <string-name>
            <given-names>J.</given-names>
            <surname>Snoek</surname>
          </string-name>
          ,
          <article-title>Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling</article-title>
          ,
          <source>arXiv preprint arXiv:1802.09127</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wen</surname>
          </string-name>
          , G. Hua,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Learning discriminative reconstructions for unsupervised outlier removal</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1511</fpage>
          -
          <lpage>1519</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] R. Yoshihashi, W. Shao, R. Kawakami, S. You, M. Iida, T. Naemura, Classification-reconstruction learning for open-set recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4016–4025.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (2014) 1929–1958.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] A. Bendale, T. E. Boult, Towards open set deep networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1563–1572.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] C. De Stefano, C. Sansone, M. Vento, To reject or not to reject: that is the question - an answer in case of neural classifiers, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 30 (2000) 84–94.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] M. H. DeGroot, S. E. Fienberg, The comparison and evaluation of forecasters, Journal of the Royal Statistical Society: Series D (The Statistician) 32 (1983) 12–22.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] A. Niculescu-Mizil, R. Caruana, Predicting good probabilities with supervised learning, in: Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 625–632.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] B. Zadrozny, C. Elkan, Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers, in: ICML, volume 1, 2001, pp. 609–616.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] M. P. Naeini, G. Cooper, M. Hauskrecht, Obtaining well calibrated probabilities using Bayesian binning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] J. Platt, et al., Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, Advances in Large Margin Classifiers 10 (1999) 61–74.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] M. Kull, T. Silva Filho, P. Flach, Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers, in: Artificial Intelligence and Statistics, PMLR, 2017, pp. 623–631.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] Y. Wang, L. Li, C. Dang, Calibrating classification probabilities with shape-restricted polynomial regression, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2019) 1813–1827.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] F. Pan, X. Ao, P. Tang, M. Lu, D. Liu, L. Xiao, Q. He, Field-aware calibration: a simple and empirically strong method for reliable probabilistic predictions, in: Proceedings of The Web Conference 2020, 2020, pp. 729–739.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] B. Zadrozny, C. Elkan, Transforming classifier scores into accurate multiclass probability estimates, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 694–699.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] Y. Kwon, J.-H. Won, B. J. Kim, M. C. Paik, Uncertainty quantification using Bayesian neural networks in classification: Application to biomedical image segmentation, Computational Statistics &amp; Data Analysis 142 (2020) 106816.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] J. Domke, Generic methods for optimization-based modeling, in: Artificial Intelligence and Statistics, PMLR, 2012, pp. 318–326.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] D. Maclaurin, D. Duvenaud, R. Adams, Gradient-based hyperparameter optimization through reversible learning, in: International Conference on Machine Learning, PMLR, 2015, pp. 2113–2122.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] J. Ren, X. Feng, B. Liu, X. Pan, Y. Fu, L. Mai, Y. Yang, TorchOpt: An efficient library for differentiable optimization, Journal of Machine Learning Research 24 (2023) 1–14.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch, in: NIPS-W, 2017.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] C. Nugent, P. Cunningham, A case-based explanation system for black-box systems, Artificial Intelligence Review 24 (2005) 163–178.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[40] K. Asadi, M. L. Littman, An alternative softmax operator for reinforcement learning, in: International Conference on Machine Learning, PMLR, 2017, pp. 243–252.</mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>[41] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2017).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>