<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Entropic Metric for Measuring Calibration of Machine Learning Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel James Sumler</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lee Devlin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Maskell</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Oliver Lane</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>QinetiQ</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Liverpool</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Understanding the confidence with which a machine learning (ML) model classifies an input datum is an important, and often overlooked, concept. We propose a new probability calibration metric, the Entropic Calibration Difference (ECD). Inspired by existing research in the field of state estimation, specifically target tracking, we show how ECD may be applied to binary and multi-class classification ML models. We describe the relative importance of under- and over-confidence and how they are not conflated in the tracking literature. Indeed, our metric naturally distinguishes under- from over-confidence. We consider this important given that algorithms that are under-confident are likely to be “safer” or more trustworthy than algorithms that are over-confident, albeit at the expense of also being over-cautious and so statistically inefficient. We demonstrate how this new metric performs on real data and present a theoretical analysis to prove its properties. We also compare with other metrics for ML model probability calibration, including the Expected Calibration Error (ECE).</p>
      </abstract>
      <kwd-group>
        <kwd>calibration</kwd>
        <kwd>over-confidence</kwd>
        <kwd>miscalibration</kwd>
        <kwd>safety</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Calibration of probabilities is an important and often overlooked concept when developing machine
learning (ML) models. Usually, accuracy is the main metric used to calculate how well an ML model
performs in terms of predicting a class for unseen data. Generally speaking, the closer the accuracy is
to 100%, the better the model is deemed to be. However, accuracy does not take into account the probabilities
that the model outputs, which can be just as important as the accuracy, if not more so.</p>
      <p>In binary classification, a probability greater than a threshold, typically 0.5, is enough to decide
whether an input belongs to one of two classes. While accuracy informs whether a classification is
correct, a probability calibration metric informs how well the confidence probabilities match the true
proportions of correct decisions. For example, a model that always outputs a probability of 0.6 for class
label 1, but always classifies correctly, should produce a poor calibration score, as even though the
model classifies the input correctly, it has low confidence in that decision.</p>
      <p>
        Calibration has recently become even more important, as the research of Guo et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] reveals that
while modern neural networks are more accurate than ever, they are also badly calibrated. This could
be attributed to over-confidence of the networks due to the large amount of data they are trained on.
This over-confidence can lead to a loss of trust of users in the models.
      </p>
      <p>A well-calibrated model is defined as one that outputs probabilities that are representative of the
real-life occurrences from the unseen data. For example, if, on average, 70% of people are correctly
predicted to contract a certain disease, then one would expect the average probability outputted by a
diagnosis model to be 0.7. A mathematical representation of calibration can be seen in (1), where $Y$ is
the true label, $k$ represents a class from the total number of classes $K$, $\hat{P}$ is the predicted probability
distribution, $\mathbf{p}$ is the vector of class confidences with elements $p_k$, and $\mathbb{P}$ is the true probability.</p>
      <p>\[ \mathbb{P}(Y = k \mid \hat{P} = \mathbf{p}) = p_k \quad \forall k \in \{1, \ldots, K\} \tag{1} \]</p>
      <p>In this paper, we present a novel calibration metric that addresses some weaknesses of some of the
most commonly-used existing metrics. In section 2, we discuss existing calibration metrics that are
widely discussed in the literature. In section 3, we detail our motivations for ‘safe’ or trustworthy
calibration and why we feel our metric is necessary. Section 4 defines our new metric and details how
the results can be interpreted. In section 5, we use hypothesis tests for our metric on pre-trained models
using calibrated and uncalibrated probabilities and make a comparison with other metrics. Results for
binary classifiers are included in the main paper, while multiclass results are found in Appendix A.
Finally, section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Existing metrics for model calibration</title>
      <p>In this section, we explore the literature and highlight some important calibration metrics, from early
pioneers of the field to widely-used modern standards.</p>
      <p>
        Various methods exist that attempt to calibrate the probabilities outputted by badly calibrated models.
Some attempt to do this by altering their training process, such as using a loss function that addresses
neural network calibration as it is being trained [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], while other methods attempt to change a model’s
output probabilities [
        <xref ref-type="bibr" rid="ref1 ref3 ref4 ref5">1, 3, 4, 5</xref>
        ]. The former method is useful if we want to train a new model; however,
this is not always required. Occasionally, extant trained models need their existing probabilities
calibrated, which is where the latter method is used. However, before calibrating a model’s output
probabilities, it must first be determined whether the model is already calibrated or not, and to what
extent, using a calibration metric. The remainder of this section defines and discusses various metric
benefits and drawbacks.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Reliability Diagram</title>
        <p>
          Methods have been developed to allow one visually to assess the calibration of a model. First brought
to the limelight by DeGroot and Fienberg [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], and then further expanded upon by Niculescu-Mizil and
Caruana [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], reliability diagrams are a widely-used method to display calibration. This visual technique
works by partitioning the probabilities that a model outputs into a number of bins, where each bin
represents a probability interval. It is possible to use any number of bins, but one must keep in mind the
bias-variance trade-off, as highlighted by Nixon et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. They specify that when the number of bins is
increased, this results in a lower population per bin, and, in turn, a larger variance. This will, however,
also result in a lower bias. Therefore, the number of bins should be fine-tuned for each problem.
        </p>
        <p>
          Once every probability has been assigned to the appropriate bin, the average predicted probability
and the fraction of class 1 labels are calculated for each bin. For multiclass reliability diagrams, a common
method is to take the top-label, also known as most confident, prediction [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The fraction of positives
is plotted against the average predicted probability, with a reference line $y = x$
showing what a perfectly calibrated model would look like. Points above the line are under-confident,
while points below it are over-confident. A typical reliability diagram is shown in Figure 1.
        </p>
        <p>Reliability diagrams can be a good tool for visualising a model’s calibration. However, they do not, in
themselves, return a calibration score, and the diagram can be misleading when some bins are sparsely
populated.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Expected Calibration Error</title>
        <p>
          Initially proposed by Naeini et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the Expected Calibration Error (ECE) is one of the most widely
used calibration metrics. It is a simple, yet effective, formula that uses the same method of binning as
reliability diagrams, but produces a normalised calibration score between 0 and 1, similar to methods
such as the Brier Score [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          ECE works by calculating the accuracy and confidence of each bin in a reliability diagram. The
accuracy is the traditional definition, where the number of correct predictions is divided by the total
number of predictions in the bin. The confidence refers to the average probability in a bin. It is not
explicitly stated in the original ECE paper, but Guo et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] use the maximum class probability to
calculate the confidence of a model output in multiclass classification scenarios. This means that, for
a multiclass classifier, the minimum probability in all bins will be $1/K$, and some of the bins may be
unpopulated. In binary classification, the estimated probability, $\hat{p}_i$, of the positive class is used when
calculating the mean confidence for bin $B_m$, as seen in (2), and the ratio of class 1 labels is taken instead
of the accuracy [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ], as shown in (3), where $y_i$ is the true label. These are the same values used in
the reliability diagram. The ECE of each bin is then calculated by obtaining the absolute difference
between (3) and (2). This ECE value is weighted depending on how populated each bin is, as shown in
(4), where $N$ is the total number of probabilities across all bins, and $M$ is the number of bins.
        </p>
        <p>\[ \mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i=1}^{|B_m|} \hat{p}_i \tag{2} \]</p>
        <p>\[ \mathrm{pos}(B_m) = \frac{1}{|B_m|} \sum_{i=1}^{|B_m|} \mathbb{1}(y_i = 1) \tag{3} \]</p>
        <p>\[ \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{pos}(B_m) - \mathrm{conf}(B_m) \right| \tag{4} \]</p>
        <p>
While ECE is widely-used in the literature, it has some shortcomings. One of the most noticeable is
the use of bins, and the fact that the user needs to decide on an optimal binning strategy, or rely on
an adaptive algorithm [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Because the positive fraction and confidence terms in (3) and
(2) are computed per bin, changing the number of bins can change the final ECE score. This also raises the bias-variance
trade-off mentioned previously [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>Additionally, the ECE equation deals with averages within bins rather than individual sample
probabilities and their respective true labels. Due to this, outliers (such as a wildly incorrect prediction) may
not have a large impact on the final calibration score. While it may be the case that this is a good thing,
since a model should not be heavily penalised for a single mistake, it could be considered much more
important in models for sensitive applications, such as predicting the probability of a medical patient
having a certain disease.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Maximum Calibration Error</title>
        <p>
          The Maximum Calibration Error (MCE) [
          <xref ref-type="bibr" rid="ref9">9</xref>
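          ], seen in (5), is a metric similar to ECE; however, it finds the
largest difference between the average confidence and average accuracy across all $M$ bins $B_m$, using
the per-bin terms $\mathrm{conf}(B_m)$ and $\mathrm{pos}(B_m)$ from (2) and (3).
        </p>
        <p>\[ \mathrm{MCE} = \max_{m \in \{1, \ldots, M\}} \left| \mathrm{pos}(B_m) - \mathrm{conf}(B_m) \right| \tag{5} \]</p>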
        <p>Theoretically, it is a miscalibration metric that checks the worst-case scenario, which is especially
useful in applications where the largest error should be minimised. This makes it useful in safety-critical
applications. The closer the metric is to 0, the less deviation there is between accuracy and confidence.
A value closer to 1 shows that there is a large discrepancy present.</p>
        <p>
          This metric suffers from the same drawback as ECE due to its utilisation of bins. To ensure that a
valid MCE value is received, it is good practice to use an adaptive binning scheme, such as the one
used by Lee et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
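        <p>As a rough illustration of (2) to (5), the following Python sketch computes ECE and MCE for a binary classifier. It assumes equal-width bins rather than the adaptive binning discussed above, and the function name binned_calibration and its parameters are hypothetical, chosen only for this example.</p>
        <code language="python"><![CDATA[
```python
def binned_calibration(p_hat, labels, n_bins=10):
    """Return (ECE, MCE) for binary predictions using equal-width bins.

    p_hat  : predicted probabilities of the positive class
    labels : true labels, each 0 or 1
    n_bins : number of equal-width bins on [0, 1] (an assumption of this sketch)
    """
    n = len(p_hat)
    # Per bin: [sum of probabilities, sum of labels, number of samples].
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, label in zip(p_hat, labels):
        m = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        bins[m][0] += p
        bins[m][1] += label
        bins[m][2] += 1

    ece, mce = 0.0, 0.0
    for p_sum, y_sum, count in bins:
        if count == 0:
            continue  # unpopulated bins contribute nothing
        conf = p_sum / count      # mean confidence of the bin, as in (2)
        pos = y_sum / count       # fraction of class 1 labels, as in (3)
        gap = abs(pos - conf)
        ece += (count / n) * gap  # population-weighted gap, as in (4)
        mce = max(mce, gap)       # largest per-bin gap, as in (5)
    return ece, mce


# A model that always outputs 0.6 for class 1 and is always correct.
ece, mce = binned_calibration([0.6] * 10, [1] * 10)
```
]]></code>
        <p>In this toy case every prediction falls into a single bin whose mean confidence is 0.6 and whose positive fraction is 1, so both ECE and MCE are approximately 0.4. Neither value indicates whether the gap comes from under- or over-confidence.</p>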
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Other Works</title>
        <p>
          There exist numerous other calibration metrics outside of the ones we mentioned in this section. A
comprehensive review of classifier probability calibration metrics is given in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Honourable mentions
include the Maximum Mean Calibration Error (MMCE), which was first proposed by Kumar, Sarawagi
and Jain [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. It is used to measure calibration error, and can also be used to train network parameters,
which fixes poor calibration, while maintaining high accuracy.
        </p>
        <p>
          Luo et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] proposed the Local Calibration Error (LCE), a kernel-based metric which is used to
bridge the gap between metrics that look at the average reliability across a population of probabilities
(such as ECE and MCE) and reliability calculation for individual data points.
        </p>
        <p>
          Verhaeghe et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] propose the Expected Signed Calibration Error (ESCE), a signed counterpart of
the ECE metric, with its main difference being the removal of the absolute value in the equation. Ao
et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] proposed the Miscalibration Score (MCS) which, like ESCE, attempts to present over- and
under-confidence by altering the ECE equation. They also propose a modified version of temperature
scaling [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which is a post-hoc re-calibration technique.
        </p>
        <p>The above metrics attempt to find how well-calibrated a model is according to the definition in (1).
In the following section, we detail our proposal – the concept of ‘safe’ calibration. This not only looks
for perfect calibration, but also strongly penalises over-confidence, which is considered unsafe in some
models.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Motivation</title>
      <p>In this section, we present background knowledge that frames the metric we propose in this paper, and
what we call ‘safe’ calibration, a property that makes models trustworthy. In the following subsections,
we talk about the target tracking (TT) literature which is the inspiration for safe calibration. We also
propose our metric, the Entropic Calibration Difference, show how this fits into the TT field, and
demonstrate how it can be adapted for general use in ML model calibration.</p>
      <sec id="sec-3-1">
        <title>3.1. Target Tracking</title>
        <p>A fundamental goal of target tracking is to derive the state (e.g. position and velocity) of an object over
time through noisy measurements. This is accomplished by using algorithms called target trackers.</p>
        <p>
          Tracking systems generally consist of multiple components; however, one of the core algorithms
is track filtering. Commonly used methods include Kalman Filters and Particle Filters [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. Target
trackers are often used to process 2D polar measurements. When the range to a target is relatively well
estimated, contours of the 2D probability distributions involved each resemble a banana. It transpires
that the Taylor series used by the Extended Kalman Filter (EKF) [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] fails to represent the uncertainty
caused by the curvature of the distribution adequately. This can lead an EKF to diverge over time since
it attributes excessive confidence to the output of processing previous data relative to a newly acquired
datum. Techniques such as the Unscented Kalman Filter (UKF) [
          <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
          ] attempt to address this using a
form of quasi-Monte Carlo integration. In the TT literature, there is therefore a need to quantify the
extent to which a technique consistently under-estimates the uncertainty.
        </p>
        <p>One of the most popular consistency metrics is the Normalised Estimation Error Squared (NEES),
shown in (6), where $\hat{x}_i$ represents the estimated value, $\sigma_i^2$ is the variance associated with the $i$-th
estimation error, and $N$ is the number of estimates. Note that this equation assumes that the true value, $x_i$, is 1D. This metric works by
calculating the ratio between the actual squared estimation error (the error being the difference between the predicted
and true states) and the predicted error variance.</p>
        <p>\[ \mathrm{NEES} = \frac{1}{N} \sum_{i=1}^{N} \frac{(x_i - \hat{x}_i)^2}{\sigma_i^2} \tag{6} \]</p>
        <p>
          NEES is a good consistency metric for single-target tracking. In this paper, we propose the Entropic
Calibration Difference metric, which is inspired by NEES. One of the main aspects that the metric
borrows from NEES is the higher penalisation of over-confidence compared to under-confidence [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
        <p>A NEES value of 1, when the state is 1D, is expected on average and shows perfect consistency, as it reflects a balance between the squared
prediction error and the predicted uncertainty. Values greater than 1, or greater than the dimensionality $d$ in general, show over-confidence,
as the prediction error is large relative to the predicted uncertainty. This suggests that the model
underestimated its uncertainty, which led it to be too confident in its predictions. A value smaller than
1 shows under-confidence, as the prediction error is small relative to the predicted uncertainty. This
suggests that the model overestimated its uncertainty, making it overly cautious about its predictions.
The ability to find over- and under-confident predictions using only a single equation is desirable for
calibration, as it gives good context in understanding the model’s predictions. Treating over- and
under-confidence differently would also give us an idea of whether a model is safely calibrated.</p>
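        <p>The interpretation of NEES values can be illustrated with a minimal Python sketch of the 1D form in (6). The function name nees_1d and the toy numbers are hypothetical and are not taken from the experiments in this paper.</p>
        <code language="python"><![CDATA[
```python
def nees_1d(x_true, x_est, variances):
    """Average NEES for scalar states: mean of squared error over variance."""
    terms = [
        (x - x_hat) ** 2 / var
        for x, x_hat, var in zip(x_true, x_est, variances)
    ]
    return sum(terms) / len(terms)


# Two estimates of 0.0 whose true values are +1 and -1 (squared error 1 each).
truth, estimates = [1.0, -1.0], [0.0, 0.0]

consistent = nees_1d(truth, estimates, [1.0, 1.0])        # variance matches the error
over_confident = nees_1d(truth, estimates, [0.25, 0.25])  # variance reported too small
under_confident = nees_1d(truth, estimates, [4.0, 4.0])   # variance reported too large
```
]]></code>
        <p>With these toy inputs the three scores are 1.0, 4.0, and 0.25 respectively, matching the reading above: a value of 1 indicates consistency, values above 1 indicate over-confidence, and values below 1 indicate under-confidence.</p>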
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Safe Calibration</title>
        <p>The aim of safe calibration is to determine whether an ML model can be deemed safe to use and is thus
trustworthy. A non-calibrated model runs the risk of being over-confident in incorrect answers, or
under-confident in the correct answers. We bring the TT way of thinking, where over-confidence is
considered much worse than uncertainty or under-confidence, into ML. The theory behind this is that
we would prefer a model be uncertain in the correct class rather than confidently choose an incorrect
class. To this end, we penalise over-confidence more than under-confidence and uncertainty.</p>
        <p>
          The Entropic Calibration Difference can determine whether a model is well-calibrated, which can be
interpreted as whether the model is safe to use. For example, if a binary classification model was
determining whether it is safe for an aircraft to land, over-confidence in the incorrect class could
potentially be fatal. Therefore, we would prefer to have under-confidence in the correct class or
uncertainty, rather than random confident guessing [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. ISO 26262 also defines ‘safety’ as a lack of
‘unacceptable risk’ when referring to modern road vehicles [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Entropic Calibration Difference</title>
      <sec id="sec-4-1">
        <title>4.1. Background and Definition</title>
        <p>The Entropic Calibration Difference (ECD) is applicable to both TT and ML calibration, or any other
probabilistic model, and does not require any parameters other than the true label and prediction.
Equation (7) shows the general definition of the ECD metric.</p>
        <p>\[ \mathrm{ECD} = \frac{1}{N} \sum_{i=1}^{N} \left[ \int p(x \mid z_i) \log p(x \mid z_i) \, dx - \log p(x_i \mid z_i) \right] \tag{7} \]</p>
        <p>where $N$ is the total number of data points, $z_i$ are the measured data for data point $i$, $p(x \mid z_i)$ is the
algorithm’s estimate of the probability density or probability mass of true state $x$ given the measurement,
and $x_i$ are the known true states for a test set.</p>
        <p>The first term in the summand in (7), containing the integral, is the negative entropy of the predicted
probability distribution, or expected log likelihood, for a particular data point. This is used to represent
under-confidence. The second term of the summand is the negative log likelihood, which is used to represent
over-confidence. It should be noted that if the entropy term were zero, the overall expression would
be the negative log-likelihood (NLL), which is a commonly used metric to measure the calibration of
classifiers. In general, ECD measures the difference between expected and actual log likelihoods. As a
result, the metric produces negative scores for under-confident values and positive
scores for over-confident values. Unlike other calibration metrics such as ECE, under- and
over-confidence are not treated the same, as the metric follows safe calibration scoring. Some metrics
require the use of optimal binning methods or other computationally expensive calculations, whereas
ECE, MCE, and ECD all run in linear time in the number of data points.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Proving Relation with NEES</title>
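        <p>NEES can be proven to be a special case of ECD by substituting the Gaussian distribution into both
equations. Equation (8) is the Gaussian formula, (9) is the Gaussian ECD, and (10) is the Gaussian NEES.</p>
        <p>\[ p(x \mid z) = \frac{\exp\left( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)}{\sqrt{(2\pi)^{d} |\Sigma|}} \tag{8} \]</p>
        <p>\[ \mathrm{ECD} = \frac{1}{2N} \sum_{i=1}^{N} \left[ (x_i - \mu_i)^{\top} \Sigma_i^{-1} (x_i - \mu_i) - d \right] \tag{9} \]</p>
        <p>\[ \mathrm{NEES} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_i)^{\top} \Sigma_i^{-1} (x_i - \mu_i) \tag{10} \]</p>
        <p>In the above, $x$ is the true state vector, $z$ is the observation, and $\mu$ and $\Sigma$ are the mean vector and
covariance matrix of the Gaussian uncertainty given the observation. NEES should be equal to the
dimensionality $d$ in (9) for a system to be consistent, in which case the ECD score is zero.</p>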
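        <p>The link between the two metrics can be made explicit with a short worked derivation. The sketch below assumes a $d$-dimensional Gaussian with mean $\mu_i$ and covariance $\Sigma_i$ for data point $i$, true state $x_i$, and the general ECD definition in (7); it is offered as a plausible reconstruction of the intermediate steps rather than as the exact derivation used by the authors.</p>
        <code language="latex"><![CDATA[
```latex
\begin{aligned}
\int p(x \mid z_i) \log p(x \mid z_i)\, dx
  &= -\tfrac{1}{2} \log\!\left( (2\pi)^{d} |\Sigma_i| \right)
     - \tfrac{1}{2}\, \mathbb{E}\!\left[ (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right] \\
  &= -\tfrac{1}{2} \log\!\left( (2\pi)^{d} |\Sigma_i| \right) - \tfrac{d}{2}, \\[4pt]
-\log p(x_i \mid z_i)
  &= \tfrac{1}{2} \log\!\left( (2\pi)^{d} |\Sigma_i| \right)
     + \tfrac{1}{2} (x_i - \mu_i)^{\top} \Sigma_i^{-1} (x_i - \mu_i), \\[4pt]
\text{summand of ECD}
  &= \tfrac{1}{2} \left[ (x_i - \mu_i)^{\top} \Sigma_i^{-1} (x_i - \mu_i) - d \right], \\[4pt]
\mathrm{ECD}
  &= \tfrac{1}{2} \left( \mathrm{NEES} - d \right).
\end{aligned}
```
]]></code>
        <p>The normalisation terms cancel, so the Gaussian ECD is positive when NEES exceeds $d$ (over-confidence), negative when NEES is below $d$ (under-confidence), and zero when NEES equals $d$.</p>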
      </sec>
      <sec id="sec-4-3">
        <title>4.3. ECD for Discrete Variables</title>
        <p>To apply the ECD formula to classification problems, it is necessary to allow it to be used with discrete
variables. This is a simple change, as noted in (11).</p>
        <p>\[ \mathrm{ECD} = \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{k=1}^{K} p(x = k \mid z_i) \log p(x = k \mid z_i) - \log p(x_i \mid z_i) \right] \tag{11} \]</p>
        <p>In this modified equation, we predict the probability that $x$ is of class $k$ given a piece of observed
data $z_i$, where $N$ and $K$ are the number of data points and number of classes, respectively.</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. ECD for Binary Classification</title>
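          <p>Equation (11) allows for an easy transition to an ECD formula for binary calibration. By substituting
(12) and (13) into (11), we create the binary classification ECD formula in (14).</p>
          <p>\[ \hat{p}_i = p(x = 1 \mid z_i) \tag{12} \]</p>
          <p>\[ \log p(x_i \mid z_i) = x_i \log \hat{p}_i + (1 - x_i) \log(1 - \hat{p}_i) \tag{13} \]</p>
          <p>\[ \mathrm{ECD} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - x_i) \log\left( \frac{\hat{p}_i}{1 - \hat{p}_i} \right) \tag{14} \]</p>
          <p>[...] based on its calibration score, as described in the following section.</p>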
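          <p>A minimal Python sketch of the binary ECD may help to show how the sign of the score behaves. It assumes the form obtained by substituting the binary log likelihood and the Bernoulli entropy into the discrete ECD, namely the mean of $(\hat{p}_i - x_i) \log(\hat{p}_i / (1 - \hat{p}_i))$ with labels $x_i \in \{0, 1\}$. The function name binary_ecd and the clipping constant eps are assumptions of this example, not part of the metric's definition.</p>
          <code language="python"><![CDATA[
```python
import math


def binary_ecd(p_hat, labels, eps=1e-12):
    """Binary ECD: mean of (p - x) * log(p / (1 - p)) over all samples.

    p_hat  : predicted probabilities of class 1
    labels : true labels, each 0 or 1
    eps    : clipping constant to avoid log(0) (an assumption of this sketch)
    """
    total = 0.0
    for p, x in zip(p_hat, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += (p - x) * math.log(p / (1.0 - p))
    return total / len(p_hat)


labels = [1] * 7 + [0] * 3                    # 70% of the labels are class 1

under = binary_ecd([0.6] * 10, [1] * 10)      # always right, but only 0.6 confident
over = binary_ecd([0.9] * 10, labels)         # 0.9 confident, 70% positives
matched = binary_ecd([0.7] * 10, labels)      # confidence equals the positive rate
```
]]></code>
          <p>With these toy inputs, under is negative, over is positive, and matched is zero up to floating-point rounding, which agrees with the sign convention described for ECD.</p>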
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Simplified ECD for Multi-class Classification</title>
          <p>When attempting to transition the binary logic of ECD to a multiclass scenario, to preserve the original
logic behind the equation, we suggest using a ‘True Class vs. Rest’ approach, which converts the
problem into a binary one.</p>
          <p>This, as outlined in (15), uses the probability of the true class, $\hat{p}_{x_i} = p(x = x_i \mid z_i)$, with the true class label permanently
set to 1 in this binary representation.</p>
          <p>\[ \mathrm{ECD} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_{x_i} - 1) \log\left( \frac{\hat{p}_{x_i}}{1 - \hat{p}_{x_i}} \right) \tag{15} \]</p>
          <p>The logic remains the same as the binary case, where the ECD calculates whether the model is
under-confident in the correct class or over-confident in all other classes.</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.3. ECD Score Interpretation</title>
          <p>Because the ECD score is theoretically unbounded, it is not as easily interpretable as ECE,
which is bounded within [0, 1]. Therefore, we advocate for the use of a null hypothesis testing framework,
which is highlighted in Section 5.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Theoretical Analysis</title>
        <p>To test whether our metric works correctly, we present a theoretical analysis of its properties. The
definition of calibration is in (1), while the definition of ECD is in (7). The expected population metric
for ECD can be re-written as in (16). Here $\hat{p}_Y$ refers to the probability assigned to the true class $Y$ by the
model, while $\hat{p}_k$ is the probability assigned to class $k$ of $K$.</p>
        <p>\[ \mathrm{ECD} = \mathbb{E}\left[ \left( \sum_{k=1}^{K} \hat{p}_k \log \hat{p}_k \right) - \log \hat{p}_Y \right] \tag{16} \]</p>
        <p>By using the law of total expectation, conditioning on the predicted probability $\hat{P} = \mathbf{p}$ (with elements $p_k$), and the
linearity of expectation, we obtain (17).</p>
        <p>\[ \mathrm{ECD} = \mathbb{E}_{\hat{P}}\left[ \sum_{k=1}^{K} p_k \log p_k - \mathbb{E}\left[ \log p_Y \mid \hat{P} = \mathbf{p} \right] \right] \tag{17} \]</p>
        <p>When evaluating the second term of the equation, since $Y$ is the true class, we obtain (18).
Under the perfect calibration assumption in (1), this results in (19).</p>
        <p>\[ \mathbb{E}\left[ \log p_Y \mid \hat{P} = \mathbf{p} \right] = \sum_{k=1}^{K} \mathbb{P}(Y = k \mid \hat{P} = \mathbf{p}) \log p_k \tag{18} \]</p>
        <p>\[ \mathbb{E}\left[ \log p_Y \mid \hat{P} = \mathbf{p} \right] = \sum_{k=1}^{K} p_k \log p_k \tag{19} \]</p>
        <p>When (19) is substituted back into (17), the two terms cancel. Therefore, under the
assumption of perfect calibration, the ECD metric will return a value of zero. Similarly, if the probability
of the most likely class is 0.5, the ECD will compute to zero, indicating safely calibrated probabilities
that exhibit neither under- nor over-confidence.</p>
      </sec>
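      <p>The discrete ECD and its simplified ‘True Class vs. Rest’ variant can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: class labels are zero-based indices into each probability vector, the names discrete_ecd and simplified_ecd are hypothetical, and eps is a clipping constant added here only to avoid taking the logarithm of zero.</p>
      <code language="python"><![CDATA[
```python
import math


def discrete_ecd(probs, labels, eps=1e-12):
    """Discrete ECD: negative entropy minus log likelihood, averaged."""
    total = 0.0
    for p, x in zip(probs, labels):
        # Negative entropy of the predicted distribution (0 * log 0 taken as 0).
        neg_entropy = sum(q * math.log(q) for q in p if q > 0.0)
        # Log likelihood of the true class x.
        total += neg_entropy - math.log(max(p[x], eps))
    return total / len(probs)


def simplified_ecd(probs, labels, eps=1e-12):
    """'True Class vs. Rest' ECD: true-class probability with label fixed to 1."""
    total = 0.0
    for p, x in zip(probs, labels):
        q = min(max(p[x], eps), 1.0 - eps)
        total += (q - 1.0) * math.log(q / (1.0 - q))
    return total / len(probs)


# Ten samples whose class frequencies (5, 3, 2) match the prediction exactly.
labels = [0] * 5 + [1] * 3 + [2] * 2
matched = discrete_ecd([[0.5, 0.3, 0.2]] * 10, labels)

# The same labels scored by a model that is almost certain of class 0.
over = discrete_ecd([[0.98, 0.01, 0.01]] * 10, labels)
```
]]></code>
      <p>In the first case the entropy and log-likelihood terms cancel, so matched is zero up to rounding, which mirrors the cancellation obtained under perfect calibration. In the second case over is positive. For the simplified form, a true-class probability of exactly 0.5 gives a score of zero.</p>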
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>
        To measure and compare ECD values with other metrics, we utilise a hypothesis testing model. Widmann,
Lindsten, and Zachariah [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] argue that calibration scores, on their own, lack meaning and are difficult
to interpret. Vaicenavicius et al. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] support this claim by stating that miscalibration results may lack
much meaning when compared directly with other results. Lee et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] created T-Cal, a framework
that uses an adaptive binning technique to determine whether ECE accepts or rejects the null hypothesis
of perfect calibration. In this scenario, accepting the null hypothesis refers to perfect calibration, and
rejecting it means that there was a statistically significant amount of miscalibration in the probabilities.
For the following experiments, this adaptive binning strategy is also adapted for use with MCE.
      </p>
      <p>
        Consistency resampling, a type of bootstrapping, is used for all metrics. This method helps combat
the unfairness of judging a model’s calibration score based on a single set of predictions. Rather, the
model’s confidence scores are re-sampled with replacement to create multiple new datasets, so that
calibration metrics provide a range of scores over the multiple datasets. A threshold is set based on the
distribution of metric values. If the original score is statistically larger than the re-sampled scores, the
null hypothesis is rejected [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
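      <p>The resampling procedure can be sketched as follows. This is an illustrative Python outline, not the code used for the experiments: it assumes a binary classifier, it assumes that labels for each resampled dataset are drawn from the resampled probabilities (so that every synthetic dataset is calibrated by construction), and it uses a one-sided threshold at the (1 - alpha) empirical quantile. The names consistency_resampling_test, n_resamples, alpha, and seed are hypothetical, and the binary ECD helper is included only to keep the example self-contained.</p>
      <code language="python"><![CDATA[
```python
import math
import random


def binary_ecd(p_hat, labels, eps=1e-12):
    """Binary ECD: mean of (p - x) * log(p / (1 - p)) over all samples."""
    total = 0.0
    for p, x in zip(p_hat, labels):
        p = min(max(p, eps), 1.0 - eps)  # clipping is an assumption of this sketch
        total += (p - x) * math.log(p / (1.0 - p))
    return total / len(p_hat)


def consistency_resampling_test(p_hat, labels, metric, n_resamples=500,
                                alpha=0.05, seed=0):
    """Return (observed score, threshold, reject flag) for a one-sided test."""
    rng = random.Random(seed)
    n = len(p_hat)
    observed = metric(p_hat, labels)
    null_scores = []
    for _ in range(n_resamples):
        # Resample the predicted probabilities with replacement.
        sample = [p_hat[rng.randrange(n)] for _ in range(n)]
        # Assumption: labels are drawn from the resampled probabilities, so
        # each synthetic dataset is calibrated by construction.
        synthetic = [1 if rng.random() < p else 0 for p in sample]
        null_scores.append(metric(sample, synthetic))
    null_scores.sort()
    # Threshold at the (1 - alpha) empirical quantile of the null scores.
    index = min(math.ceil((1.0 - alpha) * n_resamples), n_resamples) - 1
    threshold = null_scores[index]
    return observed, threshold, observed > threshold


# Over-confident toy model: predicts 0.95 but only 60% of labels are positive.
over = consistency_resampling_test([0.95] * 200, [1] * 120 + [0] * 80, binary_ecd)
# Toy model whose confidence matches the observed frequency of positives.
matched = consistency_resampling_test([0.7] * 200, [1] * 140 + [0] * 60, binary_ecd)
```
]]></code>
      <p>In this sketch the null hypothesis is rejected for the over-confident toy model, whose observed score lies far above the threshold, and accepted for the model whose confidence matches the observed frequency. Any metric with the same call signature could be passed in place of binary_ecd.</p>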
      <sec id="sec-5-1">
        <title>5.1. Hypothesis Test Setup</title>
        <p>The null hypothesis, $H_0$, is the case of perfect calibration for ECE and MCE. However, since ECD can
reach a value of zero due to uncertainty, $H_0$ for ECD is the case of safe calibration.</p>
        <p>
          The alternative hypothesis, $H_1$, is the case of statistically significant miscalibration for ECE and
MCE. For ECD, $H_1$ refers to statistically significant unsafe calibration. As Lee et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] point out, it
is not fair to expect a model trained on finite data to be perfectly calibrated. Therefore, accepting the
Null Hypothesis does not mean that a model is perfectly calibrated or perfectly safe, but rather that no
statistically significant miscalibration or unsafety is detected.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Binary Tests</title>
        <p>
          Binary tests were conducted on two datasets: the Adult Income dataset [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] and the Telco Customer
Churn dataset [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. These were chosen due to their class imbalances; therefore, models trained on these
datasets may be less calibrated. Separate simple neural net models consisting of two fully connected
hidden layers containing 64 and 32 nodes respectively were trained on each of these datasets. Each
model was trained for 100 epochs using the Adam optimiser. A model’s probabilities are fed into a
metric in three different ways – uncalibrated, with Platt Scaling [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], and with Isotonic Regression [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
This gives an idea of how the metrics change after post-hoc calibration. Table 1 shows these results. The
‘Dataset’ row shows which dataset is being used. This is split into three columns under the ‘Calibration’
row, which details which post-hoc calibration technique was applied to the model’s probabilities before
they were analysed by each calibration metric. Each other row shows the calibration metric’s final score,
and whether the null hypothesis is accepted or rejected for each test. The ‘Threshold’ row shows the
threshold generated during consistency resampling. When a metric’s score is larger than this threshold,
$H_0$ is rejected, due to evidence of miscalibration. Figure 1 shows the reliability diagrams for both
binary models, plotted using the standard binary method.
The T-Cal [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] adaptive binning strategy is used for both ECE and MCE scores.
        </p>
        <p>Table 1 shows the metric scores and hypothesis test results for each dataset and calibration technique.
Interestingly, the ECD score is considered safe in the Adult Income dataset’s Platt Scaling results, whereas the
ECE does not consider it to be within range of perfect calibration. Figure 1a shows that
there is some under-confidence present, and a well-calibrated uncertainty. The MCE hypothesis test
accepts a higher score than the other metrics; this is because the score is below the expected value from the
consistency resampling technique.</p>
        <p>The Telco Customer Churn values show the most interesting results. The ECE metric accepts the
null hypothesis for all calibration techniques, whereas the ECD rejects it for all of them.</p>
        <p>This is evidence that a model judged well calibrated by ECE may still suffer from over-confident
predictions.</p>
        <p>Figure 1b shows that although most bins are well calibrated, there is some over-confidence in the
higher bins, reinforcing the reason for the ECD rejection.</p>
        <p>Results for the multiclass classification experiments are included in Appendix A, in Tables 2 and 3.
</p>
        <p>Figure 1: Reliability diagrams for the binary models. (a) Adult Income Dataset; (b) Telco Customer Churn Dataset. Vertical axis: Fraction of Class 1 Labels; horizontal axis: Average Predicted Probability.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion &amp; Future Work</title>
      <p>We have introduced a novel calibration metric, named the Entropic Calibration Difference (ECD) because it
considers the entropy of probabilities. The metric is influenced by the Normalised Estimation
Error Squared (NEES) metric, which is used to determine the consistency of a state estimator within the
target-tracking field of research. The ECD metric is equivalent to applying a generalised version of NEES
to the ML problem domain. This new metric also brings a new perspective to the probability calibration
literature, namely the concept of safe calibration, which is commonly found in the target-tracking
literature. We define safe calibration as a property of a metric that prefers under-confidence to over-confidence, due
to the belief that an under-confident score in the correct class is safer than an over-confident score in the
incorrect class. Over-confidence is therefore penalised more heavily than under-confidence, rather than receiving the
equal penalty applied by other metrics. In terms of future directions, ECD could be used
as an objective function or integrated into re-calibration techniques, with the aim of making
probabilities safer as well as better calibrated.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Multiclass Tests</title>
      <p>
        Multiclass classification hypothesis tests were carried out on the CIFAR-10, CIFAR-100 [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], and
ImageNet [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] datasets. Pre-trained ResNet32, ResNet50 [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] and VGG-19 [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]
models were used.
      </p>
      <p>
        Unlike the previous binary tests, only one calibration method was used on the data. This is because the
consistency resampling technique in the ECD tests requires the full probability vector, whereas the
ECE requires only the maximum probability. To make the tests fair, a calibration method that
takes the whole vector into account was used, namely Temperature Scaling [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To calibrate the CIFAR-10 and CIFAR-100
probabilities, 20% of the test set was partitioned off as a validation set, because no dedicated
validation set was available. For ImageNet, ImageNetV2 [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ] was used as a validation set for Temperature Scaling. The
T-Cal [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] framework was used for the adaptive binning strategy for both ECE and MCE. Consistency
resampling was used for all metrics, generating a threshold below which the null hypothesis is
accepted.
      </p>
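      <p>Temperature Scaling divides every logit by a single scalar temperature fitted on the validation set, which is why it acts on the whole probability vector. The Python sketch below illustrates the idea under stated assumptions: it selects the temperature by a simple grid search over the validation negative log-likelihood instead of a gradient-based optimiser, and the function names log_softmax, mean_nll, fit_temperature and calibrate, along with the candidate grid, are hypothetical rather than taken from the experiments reported here.</p>
      <preformat><![CDATA[
```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= log_softmax(logits, temperature)[y]
    return total / len(labels)


def fit_temperature(val_logits, val_labels, candidates=None):
    """Pick the candidate temperature with the lowest validation NLL."""
    if candidates is None:
        candidates = [0.05 * k for k in range(1, 201)]  # 0.05, 0.10, ..., 10.0
    return min(candidates, key=lambda t: mean_nll(val_logits, val_labels, t))


def calibrate(logits, temperature):
    """Full temperature-scaled probability vector for one input."""
    return [math.exp(lp) for lp in log_softmax(logits, temperature)]
```
]]></preformat>
      <p>A fitted temperature above 1 softens the probability vector, which reduces over-confidence, while a temperature below 1 sharpens it. The predicted class is unchanged because dividing by a positive scalar preserves the ordering of the logits.</p>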
      <p>The results tables are split into two: Table 2 results use uncalibrated model probabilities, and Table 3
results are based on post-hoc temperature scaling. In both tables, the ‘Dataset’ row details which dataset
the tests are conducted with. The ‘Calibration’ row either specifies ‘None’ or ‘Temperature Scaling’.
The ‘Model’ row specifies which model’s probabilities are being used with the results in its respective
column. Finally, each other row specifies the calibration metric and whether the null hypothesis is
accepted or rejected based on the calibration score and consistency resampling simulations. Figures 2, 3,
and 4 show the reliability diagrams for each model, both calibrated and uncalibrated. These reliability
diagrams use a top-label prediction approach, where the most confident prediction is chosen.</p>
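      <p>The points of a top-label reliability diagram can be computed as in the following Python sketch. It is illustrative only and assumes equal-width confidence bins; the function name top_label_reliability and the parameter n_bins are hypothetical. For each non-empty bin it returns the average top-label predicted probability, the accuracy, and their difference, where a negative difference indicates over-confidence and a positive one under-confidence.</p>
      <preformat><![CDATA[
```python
def top_label_reliability(prob_rows, labels, n_bins=10):
    """Return (avg confidence, accuracy, accuracy - avg confidence) per non-empty bin."""
    bins = [{"count": 0, "conf": 0.0, "correct": 0} for _ in range(n_bins)]
    for probs, y in zip(prob_rows, labels):
        # Top-label approach: keep only the most confident prediction
        top = max(range(len(probs)), key=lambda k: probs[k])
        conf = probs[top]
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b]["count"] += 1
        bins[b]["conf"] += conf
        bins[b]["correct"] += int(top == y)
    points = []
    for b in bins:
        if b["count"] > 0:
            avg_conf = b["conf"] / b["count"]
            accuracy = b["correct"] / b["count"]
            points.append((avg_conf, accuracy, accuracy - avg_conf))
    return points
```
]]></preformat>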
      <p>Table 2 shows the results for the uncalibrated probabilities for the CIFAR-10, CIFAR-100 and ImageNet
datasets. The ECE metric rejected the null hypothesis on every test, even though some of the scores
were quite low. On the other hand, the MCE and ECD metrics accepted some null hypotheses, with the
most surprising being the ECD ImageNet ResNet50 result, which has a high threshold value. However,
Figure 4a shows that the majority of probabilities are actually under-confident, and therefore the
acceptance of the null hypothesis makes sense. Similarly, VGG-19 ImageNet was rejected because of its
over-confidence as seen in Figure 4b.</p>
      <p>Table 3 shows the results for the datasets after post-hoc recalibration with temperature scaling.
The results remain broadly similar, except that ECD has accepted the null
hypothesis for both ImageNet scores, while ECE has continued to reject them. The ECE rejections indicate
statistically significant miscalibration in the models’ probabilities, but there is not enough
evidence for the ECD to deem the probabilities over-confident. In fact,
Figure 4b shows under-confidence, explaining why the null hypothesis was accepted. This supports the
view that safe calibration and perfect calibration should not be assumed to be one and the same.
      </p>
      <p>Reliability diagrams for CIFAR-10: (a) ResNet32; (b) VGG-19. Vertical axis: Accuracy; horizontal axis: Average Top-Label Predicted Probability.</p>
      <p>Reliability diagrams for CIFAR-100: (a) ResNet32; (b) VGG-19. Horizontal axis: Average Top-Label Predicted Probability.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Guo</surname>
          </string-name>
          , G. Pleiss,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <article-title>On calibration of modern neural networks</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1321</fpage>
          -
          <lpage>1330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Neo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Winkler</surname>
          </string-name>
          , T. Chen,
          <article-title>Maxent loss: Constrained maximum entropy for calibration under out-of-distribution shift</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>38</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>21463</fpage>
          -
          <lpage>21472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Böken</surname>
          </string-name>
          ,
          <article-title>On the appropriateness of Platt scaling in classifier calibration</article-title>
          ,
          <source>Information Systems</source>
          <volume>95</volume>
          (
          <year>2020</year>
          )
          <fpage>101641</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Platt</surname>
          </string-name>
          , et al.,
          <article-title>Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods</article-title>
          ,
          <source>Advances in large margin classifiers</source>
          <volume>10</volume>
          (
          <year>1999</year>
          )
          <fpage>61</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zadrozny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Elkan</surname>
          </string-name>
          ,
          <article-title>Transforming classifier scores into accurate multiclass probability estimates</article-title>
          ,
          <source>in: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>694</fpage>
          -
          <lpage>699</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>DeGroot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Fienberg</surname>
          </string-name>
          ,
          <article-title>The comparison and evaluation of forecasters</article-title>
          ,
          <source>Journal of the Royal Statistical Society: Series D (The Statistician)</source>
          <volume>32</volume>
          (
          <year>1983</year>
          )
          <fpage>12</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Niculescu-Mizil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          ,
          <article-title>Obtaining calibrated probabilities from boosting</article-title>
          .,
          <source>in: UAI</source>
          , volume
          <volume>5</volume>
          ,
          <year>2005</year>
          , pp.
          <fpage>413</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Nixon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Dusenberry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , G. Jerfel,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <article-title>Measuring calibration in deep learning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Naeini</surname>
          </string-name>
          , G. Cooper,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hauskrecht</surname>
          </string-name>
          ,
          <article-title>Obtaining well calibrated probabilities using Bayesian binning</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>29</volume>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Brier</surname>
          </string-name>
          ,
          <article-title>Verification of forecasts expressed in terms of probability</article-title>
          ,
          <source>Monthly weather review</source>
          <volume>78</volume>
          (
          <year>1950</year>
          )
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Silva Filho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perello-Nieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Santos-Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Flach</surname>
          </string-name>
          ,
          <article-title>Classifier calibration: a survey on how to assess and improve predicted class probabilities</article-title>
          ,
          <source>Mach. Learn</source>
          .
          <volume>112</volume>
          (
          <year>2023</year>
          )
          <fpage>3211</fpage>
          -
          <lpage>3260</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Guilbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Caelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chirita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saerens</surname>
          </string-name>
          ,
          <article-title>Calibration methods in imbalanced binary classification</article-title>
          , Ann. Math. Artif. Intell.
          <volume>92</volume>
          (
          <year>2024</year>
          )
          <fpage>1319</fpage>
          -
          <lpage>1352</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hassani</surname>
          </string-name>
          , E. Dobriban,
          <article-title>T-Cal: An optimal test for the calibration of predictive models</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R. O.</given-names>
            <surname>Lane</surname>
          </string-name>
          ,
          <article-title>A comprehensive review of classifier probability calibration metrics</article-title>
          ,
          <source>arXiv preprint arXiv:2504.18278</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          , U. Jain,
          <article-title>Trainable calibration measures for neural networks from kernel mean embeddings</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>2805</fpage>
          -
          <lpage>2814</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhatnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          , E. Schmerling,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pavone</surname>
          </string-name>
          ,
          <article-title>Local calibration: metrics and recalibration</article-title>
          ,
          <source>in: Uncertainty in Artificial Intelligence, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1286</fpage>
          -
          <lpage>1295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Verhaeghe</surname>
          </string-name>
          , T. De Corte,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Sauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hendriks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. W.</given-names>
            <surname>Thijssens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ongenae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Elbers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>De Waele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Van Hoecke</surname>
          </string-name>
          ,
          <article-title>Generalizable calibrated machine learning models for real-time atrial fibrillation risk prediction in ICU patients</article-title>
          ,
          <source>International Journal of Medical Informatics</source>
          <volume>175</volume>
          (
          <year>2023</year>
          )
          <fpage>105086</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddharthan</surname>
          </string-name>
          ,
          <article-title>Two sides of miscalibration: identifying over and under-confidence prediction for network calibration</article-title>
          ,
          <source>in: Uncertainty in artificial intelligence, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>87</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A multiple object tracking method using Kalman filter</article-title>
          ,
          <source>in: The 2010 IEEE International Conference on Information and Automation</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>1862</fpage>
          -
          <lpage>1866</lpage>
          . doi:10.1109/ICINFA.2010.5512258.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Einicke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. B.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <article-title>Robust extended Kalman filtering</article-title>
          ,
          <source>IEEE transactions on signal processing</source>
          <volume>47</volume>
          (
          <year>1999</year>
          )
          <fpage>2596</fpage>
          -
          <lpage>2599</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Julier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Uhlmann</surname>
          </string-name>
          ,
          <article-title>New extension of the Kalman filter to nonlinear systems</article-title>
          , in: I. Kadar (Ed.),
          <source>Signal Processing, Sensor Fusion, and Target Recognition VI</source>
          , volume
          <volume>3068</volume>
          , International Society for Optics and Photonics, SPIE
          ,
          <year>1997</year>
          , pp.
          <fpage>182</fpage>
          -
          <lpage>193</lpage>
          . URL: https://doi.org/10.1117/12.280797. doi:10.1117/12.280797.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>E.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Van Der Merwe</surname>
          </string-name>
          ,
          <article-title>The unscented Kalman filter for nonlinear estimation</article-title>
          ,
          <source>in: Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No.00EX373)</source>
          ,
          <year>2000</year>
          , pp.
          <fpage>153</fpage>
          -
          <lpage>158</lpage>
          . doi:10.1109/ASSPCC.2000.882463.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>X. R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. P.</given-names>
            <surname>Jilkov</surname>
          </string-name>
          ,
          <article-title>Estimator's credibility and its measures</article-title>
          ,
          <source>in: Proc. IFAC 15th World Congress</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pitale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abbaspour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Upadhyay</surname>
          </string-name>
          ,
          <article-title>Inherent diverse redundant safety mechanisms for AI-based software elements in automotive applications</article-title>
          ,
          <source>arXiv preprint arXiv:2402.08208</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] ROHM,
          <article-title>ISO 26262: Functional safety standard for modern road vehicles</article-title>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D.</given-names>
            <surname>Widmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lindsten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zachariah</surname>
          </string-name>
          ,
          <article-title>Calibration tests in multi-class classification: A unifying framework</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vaicenavicius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Widmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Andersson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lindsten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Roll</surname>
          </string-name>
          , T. Schön,
          <article-title>Evaluating model calibration in classification</article-title>
          ,
          <source>in: The 22nd international conference on artificial intelligence and statistics</source>
          , PMLR,
          <year>2019</year>
          , pp.
          <fpage>3459</fpage>
          -
          <lpage>3467</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>B.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kohavi</surname>
          </string-name>
          , Adult,
          <source>UCI Machine Learning Repository</source>
          ,
          <year>1996</year>
          . DOI: https://doi.org/10.24432/C5XW20.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>IBM</surname>
          </string-name>
          , Telco Customer Churn, https://www.kaggle.com/datasets/blastchar/telco-customer-churn,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          , et al.,
          <article-title>Learning multiple layers of features from tiny images</article-title>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          ,
          <source>in: 2009 IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          . doi:10.1109/CVPR.2009.5206848.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . doi:10.1109/CVPR.2016.90.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>B.</given-names>
            <surname>Recht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Roelofs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <article-title>Do ImageNet classifiers generalize to ImageNet?</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          , PMLR,
          <year>2019</year>
          , pp.
          <fpage>5389</fpage>
          -
          <lpage>5400</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>