<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Model agnostic calibration of image classifiers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Giudici</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Vilone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tobii AC</institution>
          ,
          <addr-line>Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pavia</institution>
          ,
          <addr-line>Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Predictions obtained from deep neural networks can be extremely accurate but not very robust, leading to uncertainty in their predictions. This crucial problem is getting growing attention from the Machine Learning (ML) community. A pragmatic solution that is increasingly applied is computing confidence bounds of the predictions of ML models. Most confidence bounds in the literature are theoretically sound but unfeasible from a practical perspective. This paper contributes to the literature by proposing probabilistic confidence bounds based on conditional probabilities. It demonstrates their operational validity using a real-world application: predicting car drivers' sleeping states.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Confidence measure</kwd>
        <kwd>Confidence calibration</kwd>
        <kwd>Driver's drowsiness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>is alert or has a microsleep. The proposed calibration method’s advantage is that confidence
bounds can be calculated for individual predictions without requiring heavy computations, and
it is based on a transparent and interpretable process.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <p>
        Confidence calibration aims to estimate uncertainty via matching the confidence level of a
set of samples with their prediction accuracy [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ]. For instance, a model should correctly
classify 90 out of 100 samples if its confidence level on such predictions is 0.9. More formally,
given the input  ∈  and label  ∈  = {1, · · · , } both random variables following a
ground truth joint distribution  (,  ) =  ( |) (), a DNN  with  () = (ˆ , ˆ ) where
ˆ ∈ {1, · · · , } is a predicted class and ˆ is its associated confidence level, perfect calibration
can be defined as [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]  (ˆ =  |ˆ = ) = , ∀ ∈ [
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ].
      </p>
      <p>
        The most recent DNNs are poorly calibrated [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Depth, width, weight decay, and batch
normalisation influence calibration. Consequently, achieving perfect calibration in practical,
realworld settings is impossible. Scholars have proposed diferent solutions to improve calibration
that can be clustered into “scaling-based”, “binning-based”, “similarity-based”, and
“Bayesianbased” methods. Scaling-based methods adjust the probability returned by a model that an input
belongs to an output class by learning one or more scalar parameters so that this probability
accurately represents the likelihood of that particular class. Standard methods for confidence
calibration in the classification domain are Platt scaling [ 5], Beta calibration [6] and temperature
scaling methods [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Binning-based methods divide samples into multiple bins based on samples’
confidence and calibrate each bin. Popular binning-based methods include Bayesian Binning
into Quantiles (BBQ) [7], histogram binning [8] and Ensemble of Near Isotonic Regression [9].
However, existing binning-based calibration methods fail to see the proximity bias issue, which is
the tendency of models to be overly confident in low-proximity samples (samples lying in sparse
density regions of the input space) rather than high-proximity ones. Thus, models sufer from
inconsistent miscalibration, limiting the capabilities of calibration methods to deliver reliable
and interpretable uncertainty estimates [10]. Similarity-based methods estimate the confidence
level based on the distance of the instances in the input dataset that are closer (or more similar)
to the test sample and their output class. For instance, [11] proposed to estimate confidence
levels using a non-conformity measure, calculated as the average k-neighbour proximity for
all the samples in the same predicted class for a given sample, to indicate how ‘atypical’ this
sample is relative to the other samples. Bayesian-based methods quantify the uncertainty related
to inputs and parameters’ calibration via a posterior distribution of the model’s parameters,
which balances the prior probability of the parameters with the likelihood function learned
from the available data [12, 13]. However, exact Bayesian inference is not tractable in DNNs
due to its sophisticated implementation and high computational cost [14]. Furthermore, these
methods are often harder to scale and can sufer from sub-optimal performance [ 15]. Scholars
have proposed many techniques to approximate the intractable posterior distributions derived
by Bayesian inference for DNNs [16, 17], a popular one being Markov Chain Monte Carlo [18].
      </p>
      <p>Finally, Conformal Prediction (CP) is a framework for assessing the uncertainties of AI
systems. Given a sample, CP returns a prediction interval in regression problems and a set
of classes in classification problems guaranteed to cover the true value with high probability.
However, CP is computationally ineficient as it requires retraining a model over a calibration
set containing  + 1 samples w.r.t. the previous iteration [19]. Some real-world applications,
like autonomous cars, need lightweight DNNs as the computational resources are required to
process the signals received from various devices, such as sensors and cameras. Furthermore,
most of these methods are complex and thus hard to understand, explain, and debug.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The proposed calibration method</title>
      <p>The proposed calibration method was inspired by the request of a car manufacturer to show an
indicator of a driver’s state (either alert or microsleep) augmented by confidence levels in the
predictions made by a DNN. The methodology to calculate this indicator must a) calculate and
display in real-time; hence, its computation cannot be resource greedy; b) be understandable,
especially in incremental innovations concerning existing practices, to ensure meeting safety
and industry quality standards; and 3) return two confidence levels, depending on whether or
not the driver is in a microsleep state. These conditions led us to develop a methodology that,
while mathematically sound, is also simple to understand and implement.</p>
      <p>The DNN, based on a Shuflenet backbone, returns a probability score on whether the eyes of
the driver are open or closed. The eyes are considered open if the probability score is lower
than 80%; above this threshold, eyes are considered closed. This threshold was determined by
maximising the prediction accuracy of the network on a Tobii proprietary dataset containing
several videos of people driving at a simulator at diferent times of the day and night. When the
eyes are predicted as open, the driver is in a no-microsleep, or alert, state. A microsleep starts
after 15 consecutive frames (the car manufacturer required this) are labelled as “eyes closed”
and ends when there occurs a frame labelled as “open eyes” (see Figure 1).</p>
      <p>The proposed calibration method considers the confusion matrices to calculate the DNN’s
confidence levels. The false positive and negative rates reported in the confusion matrices
provide insight into how often the network confuses the two output classes. The positive cases
correspond to eyes open and no microsleep, and an example of the two confusion matrices
calculated over the network’s predictions is reported in Table 1. The number of instances can
be transformed into probabilities of correctly or wrongly classifying frames by dividing each
value by the sum of its column. For example, the values in the first row of the open/closed
eyes confusion matrix are divided by 19, 501 + 580 = 20, 081. The resulting probabilities are
reported in brackets in the same table.</p>
      <p>When predicting a microsleep state, the DNN’s confidence levels are based on the conditional
probabilities of open or closed eyes. The prior conditions are the true and false positive/negative
rates derived from the two confusion matrices and the probability scores returned by the
DNN. When the DNN’s prediction is “eyes open”, its confidence level  () is the sum of the
probability  (ˆ) that the eyes are closed returned by the DNN multiplied by the true and false
positive rates, respectively (see Eq. 1).</p>
      <p>() =  (ˆ) (|ˆ) +  (ˆ) (|ˆ)
where  (ˆ) is the DNN’s probability score that the eyes are open and is computed as
 (ˆ) = 1 −  (ˆ).  (|ˆ) and  (|ˆ) are the true and false positive rates of the
eyesopen/close confusion matrix, respectively. Similarly, the confidence level that the eyes are truly
closed  () can be calculated per Eq. 2.</p>
      <p>() =  (ˆ) (|ˆ) +  (ˆ) (|ˆ)
where  (|ˆ) and  (|ˆ) are the true and false negative rates of the eyes-open/close
confusion matrix, respectively.</p>
      <p>The confidence level that the driver is truly alert (  ()) follows the same logic (see Eq. 3.
The probability scores correspond to the probability that the eyes are open calculated per Eq. 1.</p>
      <p>() =  () (|ˆ) + (1 −  ()) (|ˆ )
where  (|ˆ) and  (|ˆ ) are the true and false positive rates as per the microsleep confusion
matrix, respectively.</p>
      <p>The confidence level of microsleeps difers from the previous cases because the microsleep
probability scores correspond to the eyes-closed probability  () raised to the power of the
number of frames that are missing to reach the microsleep state (see Eq. 4 and Eq. 5). For
instance, if the DNN assigned the label “eyes closed” to three consecutive frames,  () must
be raised to the power of 12. This corresponds to the probability of independently randomly
sampling 12 frames predicted as “eyes closed”.  () of the last frame is the best estimate of the
(1)
(2)
(3)
probability that the following frames will be predicted as “eyes closed” because it is impossible
to know what probability scores will be returned by the DNN for future frames.
 ( ) =  (ˆ ) ( |ˆ ) + (1 −  (ˆ )) ( |ˆ)
(4)
where  ( |ˆ ) and  ( |ˆ) are the true and false negative rates as per the microsleep
confusion matrix, respectively.</p>
      <p>(ˆ ) =  ()(15− )
(5)
where  = ∑︀0≤ ≤ 14 ⊮ represents the number of consecutive frames (up to 15)
labelled as “eyes closed” by the DNN.</p>
    </sec>
    <sec id="sec-4">
      <title>4. The experiment</title>
      <p>The proposed method was applied to the public dataset Night-Time
Yawning-MicrosleepEyeblink-driver Distraction (NITYMED)1 [20]. NITYMED contains 21 videos with 25 frames
per second, each lasting approximately 2 minutes, of 11 male and eight female drivers in real
cars under nighttime conditions. The drivers talk, look around, and have microsleeps. The
NITYMED videos are not labelled. The ground truth labels were created by applying a key point
detector to extract the facial landmarks of each frame and calculate the Eye Aspect Ratio (EAR).
It was decided that the driver’s eyes are closed when EAR is below 20%. The frames classified as
“eyes closed” were visually inspected to ensure this threshold was not too high, thus labelling
as “eyes closed” frames with open eyes. We remark that the described labelling procedure
provides an “expert-based ground truth,” which is not objective. This is the case in many other
machine learning applications, where the model is assessed not against an “objective” truth but
a subjective one. This does not alter the generality of the proposed method.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The model reaches a prediction accuracy of 88.1% in classifying the frames of the NITYMED
videos as eyes open/closed, and 95.2% prediction accuracy of the alert or microsleep states with
the EAR considered as ground truth (see table 2). Noticeably, the true negative rates of the
DNN are 78% and 74% in predicting eyes open/close and the alert/microsleep stats, which are
quite low and are expected to significantly impact the resulting confidence levels on the alert
or microsleep states. The DNN predicted slightly less than 50% of the microsleeps detected
with the EAR (87 out of 185). This is due to the high EAR threshold (20%) used to determine
when the driver’s eyes are closed. This gap can be easily closed by reducing this threshold. A
further inspection of the frames classified as “eyes closed” with EAR highlighted a few where
the eyes are still partially open, but it is possible to see most of the eyelids. This threshold
allowed testing of the proposed calibration method under suboptimal conditions where the
network’s accuracy is not high. This is the typical situation where a DNN should not always
be trusted, and confidence levels can support and improve decision-making. A video 2 shows
examples of microsleeps detected by the model and its alert/microsleep levels of confidence.</p>
      <p>Figure 2 shows an example of
the model’s confidence levels of
the alert and microsleep statuses Figure 2: Alert/microsleep confidence levels computed on
of a NITYMED driver. The mi- a microsleep event in a NITYMED video.
crosleep confidence level remains
constantly low until the DNN
returns a frame with closed eyes.</p>
      <p>Then, it quickly increases as the
number of consecutive eyes-closed
frames increases. However, this
confidence level never goes above
60% even when the number of
consecutive eyes-closed frames is far
higher than 15, and the DNN assigns
high eye-closed probability scores
to these frames, meaning that the
chances that the driver is truly
having a microsleep are pretty high.</p>
      <p>This is, as expected, due to the combined efect of the low eyes-open/close and microsleep
true negative rates. One desired requirement for a confidence level assessment method like
the proposed one is to compute calibrated uncertainty estimates. However, this method was
not expected to meet this requirement as the confidence levels are computed using confusion
matrices that consider the DNN’s errors throughout the entire dataset. This assumption was
tested by binning the frames of the NITYMED videos according to their microsleep confidence
levels and checking whether these levels match the prediction accuracy. The results are reported
in Table 3 confirm this assumption. When the confidence level for the microsleep state is below
50%, only one frame out of 57,000 was correctly labelled as microsleep by the model. Conversely,
the prediction accuracy is higher than the confidence levels when they are in the range 50-58%.
This issue could be easily overcome by extracting other confusion matrices for the frames
with mid-range confidence levels (the frames where the eyes are not fully open or closed) and
calculating the confidence levels with these matrices. Confidence levels and prediction accuracy
match the two extreme tails of the data distribution. The confidence levels cannot be higher than
59%, and correspondingly 61% of the frames are correctly labelled as showing a microsleep event.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>Revisiting the calibration of modern neural networks, Advances in Neural Information
Processing Systems 34 (2021) 15682–15694.
[5] J. Platt, et al., Probabilistic outputs for support vector machines and comparisons to
regularized likelihood methods, Advances in large margin classifiers 10 (1999) 61–74.
[6] M. Kull, T. Silva Filho, P. Flach, Beta calibration: a well-founded and easily implemented
improvement on logistic calibration for binary classifiers, in: Artificial Intelligence and
Statistics, PMLR, 2017, pp. 623–631.
[7] M. P. Naeini, G. Cooper, M. Hauskrecht, Obtaining well calibrated probabilities using
bayesian binning, in: Proceedings of the AAAI conference on artificial intelligence, volume
29(1), 2015.
[8] B. Zadrozny, C. Elkan, Obtaining calibrated probability estimates from decision trees and
naive bayesian classifiers, in: Icml, volume 1, 2001, pp. 609–616.
[9] M. P. Naeini, G. F. Cooper, Binary classifier calibration using an ensemble of near isotonic
regression models, in: 16th International Conference on Data Mining, IEEE, 2016, pp.
360–369.
[10] M. Xiong, A. Deng, P. W. W. Koh, J. Wu, S. Li, J. Xu, B. Hooi, Proximity-informed calibration
for deep neural networks, Advances in Neural Information Processing Systems 36 (2024)
68511–68538.
[11] S. Bhattacharyya, Confidence in predictions from random tree ensembles, in: 2011 IEEE
11th International Conference on Data Mining, IEEE, 2011, pp. 71–80.
[12] Y. Gal, R. Islam, Z. Ghahramani, Deep bayesian active learning with image data, in:</p>
      <p>International conference on machine learning, PMLR, 2017, pp. 1183–1192.
[13] A. Kendall, Y. Gal, What uncertainties do we need in bayesian deep learning for computer
vision?, Advances in neural information processing systems 30 (2017).
[14] S. Seo, P. H. Seo, B. Han, Learning for single-shot confidence calibration in deep neural
networks through stochastic inferences, in: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, 2019, pp. 9030–9038.
[15] L. Blier, Y. Ollivier, The description length of deep learning models, Advances in Neural</p>
      <p>Information Processing Systems 31 (2018).
[16] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty
estimation using deep ensembles, Advances in neural information processing systems 30
(2017).
[17] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep
neural networks, in: Proceedings of the IEEE conference on computer vision and pattern
recognition, 2017, pp. 1492–1500.
[18] R. M. Neal, Bayesian learning for neural networks, volume 118, Springer Science &amp; Business</p>
      <p>Media, 2012.
[19] H. Papadopoulos, V. Vovk, A. Gammerman, Conformal prediction with neural networks,
in: 19th International Conference on Tools with Artificial Intelligence, volume 2, IEEE,
2007, pp. 388–395.
[20] N. Petrellis, S. Zogas, P. Christakos, P. Mousouliotis, G. Keramidas, N. Voros, C.
Antonopoulos, Software acceleration of the deformable shape tracking application: How to eliminate
the eigen library overhead, in: Proceedings of the European Symposium on Software
Engineering, 2021, pp. 51–57.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Guo</surname>
          </string-name>
          , G. Pleiss,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <article-title>On calibration of modern neural networks</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1321</fpage>
          -
          <lpage>1330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Cortés-Ciriano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <article-title>Deep confidence: a computationally eficient framework for calculating reliable prediction errors for deep neural networks</article-title>
          ,
          <source>Journal of chemical information and modeling</source>
          <volume>59</volume>
          (
          <year>2018</year>
          )
          <fpage>1269</fpage>
          -
          <lpage>1281</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kim</surname>
          </string-name>
          , et al.,
          <article-title>Bin-wise temperature scaling (bts): Improvement in confidence calibration performance through simple scaling techniques</article-title>
          , in: International Conference on Computer Vision Workshop, IEEE/CVF,
          <year>2019</year>
          , pp.
          <fpage>4190</fpage>
          -
          <lpage>4196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Djolonga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Romijnders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hubis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tran</surname>
          </string-name>
          , M. Lucic,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>