<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Empirical Optimal Risk to Quantify Model Trustworthiness for Failure Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shuang Ao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Rueger</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Advaith Siddharthan</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Knowledge Media Institute, The Open University</institution>
          ,
          <addr-line>Walton Hall, Kents Hill, Milton Keynes MK7 6AA</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Failure detection (FD) in AI systems is a crucial safeguard for deployment in safety-critical tasks. The common evaluation method of FD performance is the Risk-coverage (RC) curve, which reveals the trade-off between the data coverage rate and the performance on accepted data. One common way to quantify the RC curve is to calculate the area under it. However, this metric does not indicate how well suited a method is for FD, or what the optimal coverage rate should be. As FD aims to achieve higher performance with fewer data discarded, evaluating with partial coverage that excludes the most uncertain samples is more intuitive and meaningful than evaluating with full coverage. In addition, there is an optimal point in the coverage where the model could theoretically achieve ideal performance. We propose the Excess Area Under the Optimal RC Curve (E-AUoptRC), which measures the area over the coverage range from the optimal point to full coverage. Further, the model performance at this optimal point can represent both model learning ability and calibration. We propose it as the Trust Index (TI), a complementary evaluation metric to the overall model accuracy. We report extensive experiments on three benchmark image datasets with ten variants of transformer and CNN models. Our results show that our proposed methods reflect model trustworthiness better than existing evaluation metrics. We further observe that a model with high overall accuracy does not always yield a high TI, which indicates the necessity of the proposed Trust Index as a complementary metric to overall model accuracy. The code is available at https://github.com/AoShuang92/optimal_risk.</p>
      </abstract>
      <kwd-group>
        <kwd>Failure Detection</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Trustworthiness</kwd>
        <kwd>Risk-Coverage Curve</kwd>
        <kwd>Model Calibration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The deployment of deep neural networks (DNNs) in safety-critical applications such as autonomous driving [1] and medical diagnosis [2, 3] requires high trustworthiness and reliability, as mistakes can be expensive and raise serious concerns. To reduce mispredictions, a model should be equipped with a safeguard for automatic failure detection [4, 5, 6] or a reject option [7], so that samples with high uncertainty or low confidence can be discarded or sent to an expert or a third system. Specifically, failure detection (FD) determines the portion of the dataset (the coverage) that is deemed to contain safe predictions, and discards the remaining data using a threshold on model confidence or uncertainty. If the confidence falls below the threshold (or the uncertainty rises above it), the model rejects the sample and defers it to human experts or third systems for re-evaluation. Otherwise, the model treats the sample as lying in the coverage range of safe and trusted predictions. FD is beneficial for gaining higher trust from users and for saving time and cost, since human intervention is required for only a small percentage of the data.</p>
      <p>One of the criteria for FD is for the model to achieve better performance with fewer instances removed; hence the evaluation concerns the trade-off between the coverage of data and model accuracy or risk (error). Popular visualisation methods for FD performance, such as the risk-coverage (RC) curve [8] and accuracy-rejection curves (ARCs) [9, 10], plot model risk or accuracy against the coverage of data. However, the quantification of FD performance is a less explored domain. Recent studies attempt to quantify FD by using the area under the RC curve (AURC) [11] and the area under the ARCs [10]. Nevertheless, both methods cover the full range of data, ignoring the selection of thresholds and the FD performance below and above them.</p>
      <p>Theoretically, a perfectly calibrated model should achieve the ideal performance (i.e., an accuracy of 1) after removing the most uncertain samples, in a number equal to the error percentage. In other words, the perfect performance hypothetically occurs when covering the portion of samples equal to the model accuracy. Therefore, the model risk is supposed to be 0 at this coverage point, which is denoted as the optimal point in work on uncertainty estimation [12], as shown in Figure 1. A perfectly calibrated model should not contain any risk before the optimal point, whereas after the optimal point the risk increases monotonically until it reaches the model error. This risk is naturally inherited from the model, as DNNs cannot obtain perfect performance in practice, and thus should perhaps be discounted in FD evaluations. Based on this hypothesis, Geifman et al. [12] exclude the area under the optimal risk (grey part in Figure 1) from the AURC and propose the metric of Excess-AURC (E-AURC) (yellow part in Figure 1). However, this still evaluates FD on the whole dataset, even though some data are supposed to be safe and trusted predictions.</p>
      <p>As the percentage of rejected samples is generally customised during deployment of a model, there is a lack of common ground for a fair comparison of failure detection among models with varying accuracies. In addition, most of the existing evaluation metrics (e.g., AURC and E-AURC) measure the entire area under the curve, which cannot reveal the FD performance for a specific coverage. For example, the performance of a model at very low coverage is not of interest to real applications. To address the above issues, we propose the Excess area under the optimal RC curve (E-AUoptRC) as an alternative metric for failure detection that considers the risk in the range from the optimal point to the full coverage (shown as the pink area in Figure 2). We emphasise this area for the following reasons: (1) with a perfectly calibrated model, samples falling into the coverage from 0 to the optimal point (yellow area in Figure 2) are already highly trusted ones; (2) we argue that it is more important to compare models in the region where errors are made: the samples in the E-AUoptRC region include the high-uncertainty ones, and the corresponding risk should be the primary basis for determining the trustworthiness of the model; (3) with our precise method of FD quantification, a model with lower accuracy may yield higher trustworthiness and vice versa, capturing the intuition that a model with higher accuracy may not be the most trusted one.</p>
      <p>Finally, we propose a Trust Index (TI) as a novel evaluation metric, which measures the accuracy of the model at the optimal point, mimics the behaviour of E-AUoptRC, and is easier to compute. The Trust Index combines the performance and calibration of the model into a single metric. A higher TI suggests better model performance and calibration, and higher trust and reliability of the model predictions.</p>
      <p>Our contributions and findings are summarized below:</p>
      <p>1. We propose the E-AUoptRC to quantify the RC curve over the coverage from the optimal point to the full coverage.</p>
      <p>2. We propose the Trust Index as an evaluation metric.</p>
      <p>3. With extensive experiments and observations we find that: (i) a model with higher AURC or E-AURC can obtain lower E-AUoptRC; (ii) a model with a high overall accuracy does not necessarily yield a higher Trust Index; (iii) our proposed methods can better evaluate failure detection for model trustworthiness.</p>
      <p>[Figure: risk-coverage curve; horizontal axis: coverage; vertical axis: risk.]</p>
      <p>AISafety-SafeRL 2023 Workshop (IJCAI), August 19–21, 2023, Macao, SAR, China. * Corresponding author. Email: shuang.ao@open.ac.uk (S. Ao); stefan.rueger@open.ac.uk (S. Rueger); advaith.siddharthan@open.ac.uk (A. Siddharthan). URL: https://github.com/AoShuang92 (S. Ao); https://kmi.open.ac.uk/people/member/stefan-rueger (S. Rueger); https://people.kmi.open.ac.uk/advaith-siddharthan/ (A. Siddharthan). ORCID: 0000-0003-2648-3082 (S. Ao); 0000-0003-0796-8826 (A. Siddharthan). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.</p>
      <p><bold>2. Related Work</bold></p>
      <p><bold>2.1. Failure Detection</bold></p>
      <p>When deployed in safety-critical scenarios, DNNs tend to fail silently by assigning high confidence to woefully incorrect predictions, which makes uncertainty estimation a great concern for AI safety [13, 14]. These high-confidence predictions are often produced by the softmax function, as it is computed with a fast-growing exponential function. It is clearly necessary to identify potentially wrong predictions. Hendrycks et al. [4] proposed to detect misclassified samples by exploiting the difference in maximum softmax probabilities between correctly and incorrectly classified samples. Meanwhile, utilizing the true class probability instead of the maximum class probability has been shown to be more reliable in the context of failure detection [5]. In addition, training the model with data that reflect the complexity of real-world scenarios can improve the reliability of predictions, such as curating diabetic retinopathy data for training Bayesian DNNs [6].</p>
      <p>To make the model more cautious when it is uncertain, a rejection option allows it to abstain from making a prediction that is likely to be a mistake. Geifman and El-Yaniv [15] designed a selective classifier that allows users to set a desired risk level. They further proposed a selective network with a shared classifier of dedicated prediction and ambiguity rejection layers [16]. Moreover, Geifman et al. [12] developed a selective mechanism that uses early snapshots for samples with high confidence in model training.</p>
      <p>Besides training classifiers with a rejection option, studies also shed light on post-hoc approaches for failure detection. Setting thresholds based on the confidence or uncertainty ranking of samples is widely used to distinguish correct from incorrect predictions, for example in AI for breast cancer screening [17] and in decision-making models for low-power Internet of Things (IoT) devices [18]. The threshold needs to be tuned, as its value trades off the predictor's coverage rate against the performance on accepted examples [8, 7]. In our work, we provide an insightful reference for such threshold selection.</p>
      <p><bold>2.2. Evaluation Metrics</bold></p>
      <p>The quantification of failure detection (FD) performance shares the same characteristics as selective prediction (SP). FD focuses on the model performance after rejecting the worst predicted samples under a given coverage, while SP highlights the model accuracy or error with partial input.</p>
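      <p>More broadly, both are techniques for uncertainty estimation [11]. Therefore, the evaluation metrics for SP should also be applicable to FD, such as the Area Under the Receiver Operating Characteristic curve (AUROC) [19] and the Area Under the Precision-Recall Curve (AUPR) [20].</p>
      <p>Despite the wide use of these metrics for such threshold-independent performance evaluation [21, 22, 17], Ding et al. [11] point out that AUROC and AUPR can give misleading and meaningless results for classification tasks with a softmax function. The main reason lies in the assumption that the numbers of correct and wrong predictions are the same. To mitigate this issue, the Risk-Coverage (RC) curve is applied to SP for multi-class classification tasks [12, 11, 15, 23]. Hence, this paper utilises the RC curve for the following experiments and analysis.</p>
      <sec id="sec-1-1">
        <title>2.3. Model Calibration</title>
        <p>To measure the performance of calibration methods, the Expected Calibration Error (ECE) [24] was proposed and is widely applied in various tasks, such as image classification [12, 23] and sentiment analysis [25, 26]. ECE splits the data into <inline-formula><tex-math>M</tex-math></inline-formula> bins, calculates for each bin the average confidence and the average accuracy, and takes the weighted average of their absolute difference over all bins. To alleviate the miscalibration issue for DNNs, calibration techniques have been proposed and widely applied. Label Smoothing (LS) [27] reduces over-confidence by computing the cross-entropy loss with uniformly squeezed labels instead of one-hot labels. Extensions of LS such as Margin-based Label Smoothing (MBLS) [28] further provide a unifying constrained-optimization perspective of calibration losses. Focal Loss (FL) [29] adds a focusing factor to the standard cross-entropy loss to deal with an imbalanced dataset. Recent work on sample-dependent focal loss (FLSD) [30] investigated the effect of the loss on the training data and achieved impressive performance in calibration. However, it is arguable to what extent calibration techniques can improve model trustworthiness [23]. Our work provides a more comprehensive evaluation method regarding this issue.</p>
      </sec>
      <sec id="sec-1-1-1">
        <title>3. Methodology</title>
        <p>The issue we address in this paper is the quantification of the failure detection performance for supervised classification models that use the softmax function.</p>
        <p>Let <inline-formula><tex-math>\mathcal{X}</tex-math></inline-formula> be the input space and <inline-formula><tex-math>\mathcal{Y} = \{1, 2, 3, \dots, k\}</tex-math></inline-formula> be the set of class labels. Given <inline-formula><tex-math>P(\mathcal{X}, \mathcal{Y})</tex-math></inline-formula> as the data distribution over <inline-formula><tex-math>\mathcal{X} \times \mathcal{Y}</tex-math></inline-formula>, a classifier is the function <inline-formula><tex-math>f : \mathcal{X} \to \mathcal{Y}</tex-math></inline-formula>, from which the error (true risk) <inline-formula><tex-math>\mathit{err}</tex-math></inline-formula> and the accuracy <inline-formula><tex-math>\mathit{acc}</tex-math></inline-formula> are obtained. For each input <inline-formula><tex-math>x \in \mathcal{X}</tex-math></inline-formula> and its corresponding true label <inline-formula><tex-math>y</tex-math></inline-formula>, the probability distribution of the model prediction is <inline-formula><tex-math>P(y \mid x)</tex-math></inline-formula>, and the predicted label is <inline-formula><tex-math>\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} P(y \mid x)</tex-math></inline-formula>.</p>
        <sec id="sec-1-1-1-1">
          <title>3.1. Problem Setting</title>
          <p>In the Risk-Coverage (RC) curve, the coverage <inline-formula><tex-math>c</tex-math></inline-formula> is the fraction of the entire data that is covered, written as <inline-formula><tex-math>c = |\mathcal{X}_c| / |\mathcal{X}|</tex-math></inline-formula>. For each coverage, the risk is the corresponding error in model prediction. A model with better FD performance should obtain lower risk (higher accuracy) with fewer samples rejected.</p>
          <p>To efficiently quantify the FD performance of a model, we first need to construct the reject function <inline-formula><tex-math>\mathcal{R}</tex-math></inline-formula>, which decides whether to reject samples under different thresholds. Adopting the settings in [31, 5, 12], we utilize the predictive uncertainty <inline-formula><tex-math>u</tex-math></inline-formula> to rank samples. A sample with low uncertainty indicates high confidence and better reliability of the model prediction, whereas a sample with high <inline-formula><tex-math>u</tex-math></inline-formula> is more likely to be rejected when narrowing the coverage. Given a fixed or adaptive threshold <inline-formula><tex-math>\theta</tex-math></inline-formula>, the reject function <inline-formula><tex-math>\mathcal{R}</tex-math></inline-formula> is written as follows:</p>
          <disp-formula id="eq-1">
            <label>(1)</label>
            <tex-math><![CDATA[\mathcal{R}(x) = \begin{cases} \text{cover},\; x \in \mathcal{X}_c, & \text{if } u \le \theta \\ \text{reject},\; x \in \mathcal{X}_r, & \text{if } u > \theta \end{cases}]]></tex-math>
          </disp-formula>
          <p>where <inline-formula><tex-math>\mathcal{X}_c</tex-math></inline-formula> is the covered input set and <inline-formula><tex-math>\mathcal{X}_r</tex-math></inline-formula> is the reject set.</p>
          <p>There are two types of risk, namely empirical risk and optimal risk [12]. The empirical risk <inline-formula><tex-math>r_{\mathrm{emp}}</tex-math></inline-formula> is the prediction error of the model under different coverage, shown as the solid green line in Figure 1. As aleatoric uncertainty is inherent in the data, some risk inevitably exists at certain coverage regardless of the model performance. For a model with perfect uncertainty estimation, if we discard the high-uncertainty samples in a proportion equal to the error percentage, the risk on the remaining covered input should be zero. This specific coverage point of <inline-formula><tex-math>1 - \mathit{err}</tex-math></inline-formula> (or <inline-formula><tex-math>\mathit{acc}</tex-math></inline-formula>) was proposed by [12] as the optimal point <inline-formula><tex-math>c_{\mathrm{opt}}</tex-math></inline-formula> and is shown as the red star in Figure 1. Specifically, the risk between coverage <inline-formula><tex-math>c_{\mathrm{opt}}</tex-math></inline-formula> and 1 increases monotonically until it reaches the error of the model. For optimal calibration, these risks are called the optimal risk <inline-formula><tex-math>r_{\mathrm{opt}}</tex-math></inline-formula>, illustrated as the blue dotted line in the figure. For example, the model error in the figure is 0.16 and <inline-formula><tex-math>c_{\mathrm{opt}}</tex-math></inline-formula> is 0.84. Therefore, the optimal risk <inline-formula><tex-math>r_{\mathrm{opt}}</tex-math></inline-formula> is supposed to be 0 for coverage from 0 to <inline-formula><tex-math>c_{\mathrm{opt}}</tex-math></inline-formula>, while it increases from 0 to 0.16 between <inline-formula><tex-math>c_{\mathrm{opt}}</tex-math></inline-formula> and full coverage. It is worth noting that the monotonic increase of <inline-formula><tex-math>r_{\mathrm{opt}}</tex-math></inline-formula> is not exactly linear.</p>
          <p>Both <inline-formula><tex-math>r_{\mathrm{emp}}</tex-math></inline-formula> and <inline-formula><tex-math>r_{\mathrm{opt}}</tex-math></inline-formula> can be summarised by the Area Under the RC-curve (AURC) [12, 11], named <inline-formula><tex-math>\mathrm{AURC}_{\mathrm{emp}}</tex-math></inline-formula> (yellow plus grey area in Figure 1) and <inline-formula><tex-math>\mathrm{AURC}_{\mathrm{opt}}</tex-math></inline-formula> (grey area in Figure 1) respectively. The difference between <inline-formula><tex-math>\mathrm{AURC}_{\mathrm{emp}}</tex-math></inline-formula> and <inline-formula><tex-math>\mathrm{AURC}_{\mathrm{opt}}</tex-math></inline-formula> is the real FD area, shown as the yellow area in Figure 1. Geifman et al. [12] propose this specific area as the Excess-AURC (E-AURC), where <inline-formula><tex-math>\text{E-AURC} = \mathrm{AURC}_{\mathrm{emp}} - \mathrm{AURC}_{\mathrm{opt}}</tex-math></inline-formula>.</p>
          <p>[Figure: risk-coverage curve; horizontal axis: coverage.]</p>
        </sec>
        <sec id="sec-1-1-1-2">
          <title>3.2. E-AUoptRC</title>
          <p>The E-AURC reveals the total risk over the coverage range from 0 to 1. However, in real-world applications, the coverage is mainly customised according to specific deployment requirements, making it challenging to compare the failure detection (FD) performance of various models. In addition, the E-AURC cannot reveal the FD performance in a specific coverage range. To mitigate the above issues, we propose the E-AUoptRC, which covers the coverage range from <inline-formula><tex-math>c_{\mathrm{opt}}</tex-math></inline-formula> to 1 (shown as pink in Figure 2). We emphasise the E-AUoptRC for the following reasons: (1) it is more practical for deployment, as it is unlikely that more than half of the data would be discarded in applications; (2) a smaller E-AUoptRC indicates that more samples with high uncertainty are successfully removed, so that the model prediction on the remaining data is more reliable.</p>
        </sec>
        <sec id="sec-1-1-1-3">
          <title>3.3. Trust Index</title>
          <p>Model accuracy <inline-formula><tex-math>\mathit{acc}</tex-math></inline-formula> should track the confidence of the model prediction. For example, a model with 80% accuracy suggests 80% confidence in its own predictions, which also defines the perfect confidence score in calibration. As the risk at the optimal point (<inline-formula><tex-math>c_{\mathrm{opt}}</tex-math></inline-formula>) is supposed to be 0, the accuracy at <inline-formula><tex-math>c_{\mathrm{opt}}</tex-math></inline-formula> should be 1, indicating the highest model confidence and trustworthiness of the prediction. In other words, after removing the <inline-formula><tex-math>\mathit{err}</tex-math></inline-formula>% of data with the highest uncertainty, the correctly predicted samples in the remaining data are the most trusted. The accuracy at <inline-formula><tex-math>c_{\mathrm{opt}}</tex-math></inline-formula> also reveals the model calibration, as the discarded <inline-formula><tex-math>\mathit{err}</tex-math></inline-formula>% of data may include correctly classified as well as misclassified samples. To represent the model performance in terms of both accuracy and calibration, we propose the accuracy at <inline-formula><tex-math>c_{\mathrm{opt}}</tex-math></inline-formula> as a Trust Index (TI), an evaluation complementary to the accuracy metric that indicates the model's trustworthiness. For example, in Figure 2, with a model accuracy of 84%, the model places a trust of 0.84 in its predictions.</p>
          <p>After removing the 16% of samples with the highest uncertainty (<inline-formula><tex-math>c_{\mathrm{opt}}</tex-math></inline-formula> is 0.84), the risk is approximately 0.08. The TI, i.e. the accuracy over the most confident 84% of samples, is therefore 0.92.</p>
          <p>A higher TI suggests better trustworthiness of the model predictions, and we next present empirical data to substantiate this.</p>
        </sec>
      </sec>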
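      <p>The quantities defined in Sections 3.1 to 3.3 can be sketched in a few lines of NumPy. The block below is a minimal illustration under stated assumptions, not the authors' released implementation: the function names rc_curve and fd_metrics are hypothetical, confidence is used as the ranking score in place of uncertainty (higher confidence standing for lower uncertainty), areas are approximated by averaging over the discrete coverage points, and the E-AUoptRC is assumed to be the excess of the empirical risk over the optimal risk accumulated from the optimal point to full coverage.</p>
      <preformat>
```python
import numpy as np


def rc_curve(confidence, correct):
    """Return (coverage, risk) with samples admitted from most to least confident."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    order = np.argsort(-confidence, kind="stable")  # most confident first
    errors = (~correct[order]).astype(float)
    kept = np.arange(1, errors.size + 1)            # number of samples covered
    return kept / errors.size, np.cumsum(errors) / kept


def fd_metrics(confidence, correct):
    """Illustrative AURC, E-AURC, E-AUoptRC and Trust Index (assumed discretisation)."""
    coverage, risk = rc_curve(confidence, correct)
    n = coverage.size
    n_correct = int(np.sum(np.asarray(correct, dtype=bool)))
    if n_correct == 0:
        raise ValueError("the optimal point is undefined when every prediction is wrong")
    kept = np.arange(1, n + 1)
    # Optimal risk: an ideal ranking admits every correct sample before any error,
    # so the risk is zero up to c_opt and (kept - n_correct) / kept afterwards.
    risk_opt = np.maximum(kept - n_correct, 0) / kept
    excess = risk - risk_opt
    start = n_correct - 1                           # index of the optimal point
    return {
        "c_opt": n_correct / n,
        "aurc": float(risk.mean()),
        "e_aurc": float(excess.mean()),
        "e_auoptrc": float(excess[start:].sum() / n),   # assumed normalisation
        "trust_index": float(1.0 - risk[start]),
    }


scores = fd_metrics([0.9, 0.8, 0.7, 0.6, 0.5], [1, 1, 0, 1, 0])
print(scores)
```
      </preformat>
      <p>For the five toy samples above, three predictions are correct, so the sketch places the optimal point at a coverage of 0.6; one of the three most confident predictions is wrong, which gives a Trust Index of 2/3 rather than the ideal value of 1.</p>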
    </sec>
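    <p>Section 2.3 summarises the Expected Calibration Error (ECE) as a binned comparison of confidence and accuracy. As a minimal sketch of the common formulation (the sample-weighted mean absolute gap between per-bin accuracy and per-bin confidence), and not necessarily the exact variant used in the experiments, the hypothetical helper below uses equal-width bins with a default of 15, matching the bin count reported in Section 4.2.</p>
    <preformat>
```python
import numpy as np


def expected_calibration_error(confidence, correct, n_bins=15):
    """Sample-weighted mean of |accuracy - confidence| over equal-width bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each sample to a right-closed bin using the inner edges only,
    # so the resulting indices run from 0 to n_bins - 1.
    bin_index = np.digitize(confidence, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_index == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by the fraction of samples in the bin
    return ece


print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0]))
```
    </preformat>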
    <sec id="sec-2">
      <title>4. Experimental Setup</title>
      <sec id="sec-2-1">
        <title>4.1. Datasets and Baselines</title>
        <p>We validate the proposed method with three benchmark image datasets: ImageNet 2012 (IN) [32], CIFAR100 (C100) [33] and Tiny-ImageNet [34]. For baselines, we use the state-of-the-art (SOTA) Vision Transformer (ViT) [35] and its variants such as Swin-Transformer (SwinT) [36], Class-Attention in Image Transformers (CaiT) [37], Cross-Attention Multi-Scale Vision Transformer (CrossViT) [38] and ConvNext [39], with the ImageNet pretrained weights from the TIMM library (https://github.com/rwightman/pytorch-image-models). To report comprehensive results on various model architectures, we also use convolutional neural networks (CNNs) in our experiments, namely DenseNet121 [40], ResNet56 [41], variants of VGG [42] and MobileNetV2 [43]. All models use pretrained weights from the ImageNet dataset. For the recent SOTA calibration techniques label smoothing (LS) [27], focal loss (FL) [29], MBLS [28] and FLSD [30], we utilize the pre-trained model and official implementation from the MbLS repository (https://github.com/by-liu/MbLS).</p>
        <p>As the evaluation of failure detection is a post-processing approach, we primarily utilize each dataset's test set. For the ImageNet dataset, we equally divide its original test set of 50,000 images into validation and test sets for a fair comparison. For the Tiny-ImageNet and CIFAR100 datasets, an 80/10/10 training/validation/test split is applied.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Implementation Details</title>
        <p>For a fair comparison and replicability of experimentation, we utilized publicly available pre-trained weights for our investigation and experimentation. An Nvidia Tesla P40 GPU was used for all experiments. The number of bins for ECE was set to <inline-formula><tex-math>M = 15</tex-math></inline-formula>.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Results</title>
      <p>We conducted extensive experiments on the benchmark datasets ImageNet and CIFAR100 with various CNNs and variants of transformers to compare the AURC, the E-AURC and our proposed E-AUoptRC. We further observed the limitation of the conventional overall model accuracy and how our proposed Trust Index (TI) mitigates it. Finally, to validate the efficacy of our method, we applied it to SOTA calibration techniques with ResNet50 on the Tiny-ImageNet dataset. All the experiments and results are shown in Tables 1 and 2, and Figure 3.</p>
      <p>Table 1 shows the results for image classification with the benchmark datasets. For AURC, E-AURC and E-AUoptRC on the ImageNet dataset, the transformer variants outperform the CNN models. The E-AURC for ViT is about half of the E-AURC of SwinT, CaiT and ConvNext, indicating that ViT greatly outperforms the other three models in failure detection. However, regarding the E-AUoptRC, the difference is almost negligible and ConvNext is slightly better than the other three models. The risk-coverage (RC) curve (left plot in Figure 3) also shows that from a coverage of 0.84 (near the optimal point) to 1, the risk curves of ViT and ConvNext nearly overlap. The lower risk for ViT occurs at very low coverage levels, which are not of interest for most real-world applications. For the CIFAR100 dataset with CNNs, VGG13_bn substantially outperforms the other models in terms of AURC and E-AURC. However, the difference in E-AUoptRC between VGG13_bn and VGG19_bn is much smaller. This can be understood from the middle plot in Figure 3, where the curves for VGG13_bn and VGG19_bn overlap at coverage between 0.74 (near the optimal point) and 0.9. These differences in the metrics provide empirical evidence that our proposed E-AUoptRC reflects real differences in failure detection performance more accurately than other methods.</p>
      <p>Similar to the results of the AURC-related evaluation, the transformer variants also outperform the CNNs in terms of overall model accuracy and Trust Index (TI). SwinT obtains the highest overall model accuracy on the ImageNet dataset, but it does not yield the highest TI. For the CIFAR100 dataset, VGG13_bn achieves the highest overall model accuracy, whereas VGG19_bn obtains the best TI. This indicates that the model with the highest overall accuracy is not guaranteed to have the highest TI, which shows that our proposed TI is necessary for evaluating model trustworthiness.</p>
      <p>In Table 2, the baseline (CE) obtains better AURC and E-AURC, but label smoothing outperforms the other methods and CE in terms of overall accuracy (an improvement of 0.6%) and TI. MBLS nearly halves the overall ECE of the baseline and achieves the best ECE at the optimal point. In the right RC curve in Figure 3, LS has the lowest risk at coverage from 0.65 to 1 (the likely operating range when the model is deployed), and our proposed E-AUoptRC and TI metrics are the only ones that capture this. Failure detection performance should be a significant evaluation criterion for calibration techniques, and our methods provide a more insightful view of model trustworthiness.</p>
    </sec>
    <sec id="sec-4">
      <title>6. Discussion &amp; Conclusion</title>
      <p>In this paper, we proposed the E-AUoptRC to quantify failure detection performance more precisely in the key region of interest, and the Trust Index (TI), which measures model accuracy at its optimal point. The empirical results show that our methods can better reveal model trustworthiness under a fair comparison. In real-world deployment, a fixed threshold is often used because of specific task requirements and simplicity of implementation. Our proposed TI can be utilized as a reference for threshold selection for the following reasons: (1) the accuracy should indicate the model confidence in its prediction, suggesting that the TI can interpret the confidence; (2) TI is obtained at the optimal point, where the model is supposed to achieve the ideal performance, which makes it an objective method for the fair comparison of models with different accuracy and calibration (as shown in Tables 1 and 2); (3) TI is easy to calculate, which saves time and computational cost. We have shown several benefits of our proposed metrics over existing ones, and in future work we will further investigate the role of TI in improving failure detection.</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
      <p>[1] S. Atakishiyev, M. Salameh, H. Yao, R. Goebel, Explainable artificial intelligence for autonomous driving: a comprehensive overview and field guide for future research directions, arXiv preprint arXiv:2112.11561 (2021).</p>
      <p>[2] M. Raghu, K. Blumer, R. Sayres, Z. Obermeyer, B. Kleinberg, S. Mullainathan, J. Kleinberg, Direct uncertainty prediction for medical second opinions, in: International Conference on Machine Learning, PMLR, 2019, pp. 5281–5290.</p>
      <p>[3] M. W. Dusenberry, D. Tran, E. Choi, J. Kemp, J. Nixon, G. Jerfel, K. Heller, A. M. Dai, Analyzing the role of model uncertainty for electronic health records, in: Proceedings of the ACM Conference on Health, Inference, and Learning, 2020, pp. 204–213.</p>
      <p>[4] D. Hendrycks, K. Gimpel, A baseline for detecting misclassified and out-of-distribution examples in neural networks, ICLR (2017).</p>
      <p>[5] C. Corbière, N. Thome, A. Bar-Hen, M. Cord, P. Pérez, Addressing failure prediction by learning model confidence, Advances in Neural Information Processing Systems 32 (2019).</p>
      <p>[6] N. Band, T. G. Rudner, Q. Feng, A. Filos, Z. Nado, M. W. Dusenberry, G. Jerfel, D. Tran, Y. Gal, Benchmarking Bayesian deep learning on diabetic retinopathy detection tasks, in: NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications, 2021.</p>
      <p>[7] K. Hendrickx, L. Perini, D. Van der Plas, W. Meert, J. Davis, Machine learning with a reject option: A survey, arXiv preprint arXiv:2107.11277 (2021).</p>
      <p>[8] R. El-Yaniv, et al., On the foundations of noise-free selective classification, Journal of Machine Learning Research 11 (2010).</p>
      <p>[9] C. Ferri, J. Hernández-Orallo, Cautious classifiers, ROCAI 4 (2004) 27–36.</p>
      <p>[10] M. S. A. Nadeem, J.-D. Zucker, B. Hanczar, Accuracy-rejection curves (ARCs) for comparing classification methods with a reject option, in: Machine Learning in Systems Biology, PMLR, 2009, pp. 65–81.</p>
      <p>[11] Y. Ding, J. Liu, J. Xiong, Y. Shi, Revisiting the evaluation of uncertainty estimation and its application to explore model complexity-uncertainty trade-off, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 4–5.</p>
      <p>[12] Y. Geifman, G. Uziel, R. El-Yaniv, Bias-reduced uncertainty estimation for deep neural classifiers, in: International Conference on Learning Representations, 2019.</p>
      <p>[13] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572 (2014).</p>
      <p>[14] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, D. Mané, Concrete problems in AI safety, arXiv preprint arXiv:1606.06565 (2016).</p>
      <p>[15] Y. Geifman, R. El-Yaniv, Selective classification for deep neural networks, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[16] Y. Geifman, R. El-Yaniv, SelectiveNet: A deep neural network with an integrated reject option, in: International Conference on Machine Learning, PMLR, 2019, pp. 2151–2159.</p>
      <p>[17] C. Leibig, M. Brehmer, S. Bunk, D. Byng, K. Pinker, L. Umutlu, Combining the strengths of radiologists and AI for breast cancer screening: a retrospective analysis, The Lancet Digital Health 4 (2022) e507–e519.</p>
      <p>[18] C. Cho, W. Choi, T. Kim, Leveraging uncertainties in softmax decision-making models for low-power IoT devices, Sensors 20 (2020) 4603.</p>
      <p>[19] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (2006) 861–874.</p>
      <p>[20] C. Manning, H. Schutze, Foundations of statistical natural language processing, MIT Press, 1999.</p>
      <p>[21] D. Hendrycks, K. Gimpel, A baseline for detecting misclassified and out-of-distribution examples in neural networks, arXiv preprint arXiv:1610.02136 (2016).</p>
      <p>[22] A. Malinin, M. Gales, Predictive uncertainty estimation via prior networks, Advances in Neural Information Processing Systems 31 (2018).</p>
      <p>[23] F. Zhu, Z. Cheng, X.-Y. Zhang, C.-L. Liu, Rethinking confidence calibration for failure prediction, in: European Conference on Computer Vision, Springer, 2022, pp. 518–536.</p>
      <p>[24] M. P. Naeini, G. Cooper, M. Hauskrecht, Obtaining well calibrated probabilities using Bayesian binning, in: Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.</p>
      <p>[25] R. Müller, S. Kornblith, G. E. Hinton, When does label smoothing help?, Advances in Neural Information Processing Systems 32 (2019).</p>
      <p>[26] S. Obadinma, H. Guo, X. Zhu, Class-wise calibration: A case study on COVID-19 hate speech, in: Canadian Conference on AI, 2021.</p>
      <p>[27] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.</p>
      <p>[28] B. Liu, I. Ben Ayed, A. Galdran, J. Dolz, The devil is in the margin: Margin-based label smoothing for network calibration, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 80–88.</p>
      <p>[29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.</p>
      <p>[30] J. Mukhoti, V. Kulharia, A. Sanyal, S. Golodetz, P. Torr, P. Dokania, Calibrating deep neural networks using focal loss, Advances in Neural Information Processing Systems 33 (2020) 15288–15299.</p>
      <p>[31] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., ImageNet large scale visual recognition challenge, International Journal of Computer Vision 115 (2015) 211–252.</p>
      <p>[33] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, Technical Report, University of Toronto, 2009.</p>
      <p>[34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.</p>
      <p>[35] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).</p>
      <p>[36] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.</p>
      <p>[37] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, H. Jégou, Going deeper with image transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 32–42.</p>
      <p>[38] C.-F. R. Chen, Q. Fan, R. Panda, CrossViT: Cross-attention multi-scale vision transformer for image classification, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 357–366.</p>
      <p>[39] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A ConvNet for the 2020s, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11976–11986.</p>
      <p>[40] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.</p>
      <p>[41] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.</p>
      <p>[42] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).</p>
      <p>[43] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, MobileNetV2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>