
Empirical Optimal Risk to Quantify Model Trustworthiness for Failure Detection

Shuang Ao

Stefan Rueger

Advaith Siddharthan

Knowledge Media Institute, The Open University, Walton Hall, Kents Hill, Milton Keynes MK7 6AA, UK

Failure detection (FD) in AI systems is a crucial safeguard for deployment in safety-critical tasks. The common evaluation method for FD performance is the risk-coverage (RC) curve, which reveals the trade-off between the data coverage rate and the performance on accepted data. One common way to quantify the RC curve is to calculate the area under it. However, this metric does not indicate how well suited a method is for FD, or what the optimal coverage rate should be. As FD aims to achieve higher performance with fewer data discarded, evaluating with partial coverage that excludes the most uncertain samples is more intuitive and meaningful than evaluating with full coverage. In addition, there is an optimal point in the coverage where the model could theoretically achieve ideal performance. We propose the Excess Area Under the Optimal RC Curve (E-AUoptRC), which measures the area over the coverage range from the optimal point to full coverage. Further, the model performance at this optimal point can represent both model learning ability and calibration. We propose it as the Trust Index (TI), a complementary evaluation metric to overall model accuracy. We report extensive experiments on three benchmark image datasets with ten variants of transformer and CNN models. Our results show that the proposed methods reflect model trustworthiness better than existing evaluation metrics. We further observe that a model with high overall accuracy does not always yield a high TI, which indicates the necessity of the proposed Trust Index as a complementary metric to overall model accuracy. The code is available at https://github.com/AoShuang92/optimal_risk.

Keywords: Failure Detection, Evaluation, Trustworthiness, Risk-Coverage Curve, Model Calibration

1. Introduction

The deployment of deep neural networks (DNNs) in safety-critical applications such as autonomous driving [1] and medical diagnosis [2, 3] requires high trustworthiness and reliability, as mistakes can be expensive and raise serious concerns. To reduce mispredictions, a model should be equipped with a safeguard for automatic failure detection [4, 5, 6] or a reject option [7], where samples with high uncertainty or low confidence can be discarded or sent to an expert or a third system. Specifically, failure detection (FD) determines the portion of coverage over the entire dataset deemed to be safe predictions and discards data using a threshold on model confidence or uncertainty. If the confidence is below the threshold, or the uncertainty is above it, the model rejects the samples and defers them to human experts or third systems for re-evaluation. Otherwise, the model considers these samples to be in the coverage range for safe and trusted prediction. FD is beneficial for gaining higher trust from users and for saving time and cost by requiring human intervention for only a small percentage of data.

One of the criteria for FD is for the model to achieve better performance with fewer instances removed; hence the evaluation is about the trade-off between the coverage of data and model accuracy or risk (error). Popular visualisation methods for FD performance, such as the risk-coverage (RC) curve [8] and accuracy-rejection curves (ARCs) [9, 10], plot model risk or accuracy against coverage of data. However, the quantification of FD performance is a less explored domain. Recent studies attempt to quantify FD by using the area under the RC curve (AURC) [11] and the area under the ARCs [10]. Nevertheless, both methods include the full coverage of data, ignoring the selection of thresholds and the FD performance below and above those thresholds.
Theoretically, a perfectly calibrated model should achieve the ideal performance (i.e., an accuracy of 1) after removing the most uncertain samples in numbers equal to the error percentage. In other words, the perfect performance takes place hypothetically by covering the portion of samples equivalent to model accuracy. Therefore, the model risk is supposed to be 0 at this very coverage point, which is denoted as the optimal point in work on uncertainty estimation [12], as shown in Figure 1. A perfectly calibrated model should not contain any risk before the optimal point, whereas after the optimal point the risk increases monotonically until it reaches the model error. This risk is naturally inherited from the model, as DNNs cannot obtain perfect performance in practice, and thus should perhaps be discounted in FD evaluations. Based on this hypothesis, Geifman et al. [12] exclude the area under the optimal risk (grey part in Figure 1) from the AURC and propose the metric of Excess-AURC (E-AURC) (yellow part in Figure 1). However, this still evaluates FD based on the whole dataset even though some data are supposed to be safe and trusted predictions.

As the percentage of rejected samples is generally customised during deployment of a model, there is a lack of common ground for a fair comparison of failure detection among models with varying accuracies. In addition, most of the existing evaluation metrics (e.g., AURC, E-AURC) measure the entire area under the curve, which cannot reveal the FD performance for a specific coverage. For example, the performance of a model at very low coverage is not of interest to real applications. To address the above issues, we propose the Excess Area Under the Optimal RC Curve (E-AUoptRC) as an alternative metric for failure detection that considers the risk in the range from the optimal point to full coverage (shown as the pink area in Figure 2). We emphasise this area for the following reasons: (1) with a perfectly calibrated model, samples falling into the coverage from 0 to the optimal point (yellow area in Figure 2) are already highly trusted ones; (2) we argue that it is more important to compare models in the region where errors are made: samples in the E-AUoptRC region include the high-uncertainty ones, and the corresponding risk should be the primary basis for determining the trustworthiness of the model; (3) with our precise method of FD quantification, a model with lower accuracy may yield higher trustworthiness and vice versa, capturing the intuition that a model with higher accuracy may not be the most trusted one.

Finally, we propose a Trust Index (TI) as a novel evaluation metric, which measures the accuracy of the model at the optimal point, mimics the behaviour of E-AUoptRC, and is easier to compute. The Trust Index combines the performance and calibration of the model into a single metric. A higher TI suggests better model performance and calibration and higher trust and reliability of the model predictions.

Our contributions and findings are summarized below:

1. We propose the E-AUoptRC to quantify the RC curve with the coverage from the optimal point to the full coverage.
2. We propose the Trust Index as an evaluation metric.
3. With extensive experiments and observations we find that: (i) a model with higher AURC or E-AURC can obtain lower E-AUoptRC; (ii) a model with a high overall accuracy does not necessarily yield a higher Trust Index; (iii) our proposed methods can better evaluate failure detection for model trustworthiness.

AISafety-SafeRL 2023 Workshop (IJCAI), August 19–21, 2023, Macao, SAR, China
* Corresponding author.
Email: shuang.ao@open.ac.uk (S. Ao); stefan.rueger@open.ac.uk (S. Rueger); advaith.siddharthan@open.ac.uk (A. Siddharthan)
URL: https://github.com/AoShuang92 (S. Ao); https://kmi.open.ac.uk/people/member/stefan-rueger (S. Rueger); https://people.kmi.open.ac.uk/advaith-siddharthan/ (A. Siddharthan)
ORCID: 0000-0003-2648-3082 (S. Ao); 0000-0003-0796-8826 (A. Siddharthan)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

2. Related Work

2.1. Failure Detection

In the deployment of safety-critical scenarios, DNNs tend to fail silently by providing high confidence in woefully incorrect predictions, which makes uncertainty estimation a great concern for AI safety [13, 14]. These high-confidence predictions are often produced by the softmax function, as it is computed with a fast-growing exponential function. It is clearly necessary to identify potentially wrong predictions. Hendrycks et al. [4] proposed to detect misclassified samples by enlarging the gap in softmax probabilities between correct and incorrect samples. Meanwhile, utilizing the true class probability instead of the maximum class probability has been shown to be more reliable in the context of failure detection [5]. In addition, training the model with data that reflect the complexity of real-world scenarios can improve the reliability of predictions, such as curating diabetic retinopathy data for training Bayesian DNNs [6].

To make the model more cautious when it is uncertain, a rejection option allows it to abstain from making a prediction when it is likely to be a mistake. Geifman and El-Yaniv [15] designed a selective classifier that allows users to set a desired risk level. They further proposed a selective network with a shared classifier of dedicated prediction and ambiguity rejection layer [16]. Moreover, Geifman et al. [12] developed a selective mechanism by using early snapshots for samples with high confidence in model training.

Besides training classifiers with a rejection option, studies also shed light on post-hoc approaches for failure detection. Setting thresholds based on the confidence or uncertainty ranking of samples is widely used to distinguish correct and incorrect predictions, for example in AI for breast cancer screening [17] and decision-making models for low-power Internet of Things (IoT) devices [18]. The threshold needs to be tuned, as its value trades off the predictor's coverage rate and the performance on accepted examples [8, 7]. In our work, we will provide an insightful reference for such threshold selection.

2.2. Evaluation Metrics

The quantification of failure detection (FD) performance shares the same characteristic as selective prediction (SP). FD focuses on the model performance after rejecting the worst predicted samples under a given coverage, while SP highlights the model accuracy or error with partial input.

More broadly, they are techniques for uncertainty estimation [11]. Therefore, the evaluation metrics for SP should also be applicable to FD, such as the Area Under the Receiver Operating Characteristic curve (AUROC) [19] and the Area Under the Precision-Recall Curve (AUPR) [20].

Despite the wide use of these metrics for such threshold-independent performance evaluation [21, 22, 17], Ding et al. [11] point out that AUROC and AUPR can cause misleading and meaningless results for classification tasks with a softmax function. The main reason lies in the assumption that the numbers of correct and wrong predictions are the same. To mitigate this issue, the Risk-Coverage (RC) curve is applied for SP in multi-class classification tasks [12, 11, 15, 23]. Hence, this paper utilises the RC curve for the following experiments and analysis.

2.3. Model Calibration

To measure the performance of calibration methods, the Expected Calibration Error (ECE) [24] was proposed and is widely applied in various tasks, such as image classification [12, 23] and sentiment analysis [25, 26]. ECE splits the data into bins, calculates for each bin the average confidence and average accuracy, and averages over all bins. To alleviate the miscalibration issue for DNNs, calibration techniques have been proposed and then widely applied. Label Smoothing (LS) [27] reduces over-confidence by computing the cross-entropy loss with uniformly squeezed labels instead of one-hot labels. Extensions of LS such as Margin-based Label Smoothing (MBLS) [28] further provide a unifying constrained-optimization perspective of calibration losses. Focal Loss (FL) [29] adds a focusing factor to the standard cross-entropy loss to deal with an imbalanced dataset. Recent work on sample-dependent focal loss (FLSD) [30] investigated the effect of the loss on the training data and achieved impressive performance in calibration. However, it is arguable to what extent calibration techniques can improve the model trustworthiness [23]. Our work will provide a more comprehensive evaluation method regarding this issue.

3. Methodology

The issue we address in this paper is the quantification of the failure detection performance for supervised classification models that use the softmax function.

Let $\mathcal{X}$ be the input space and $\mathcal{Y} = \{1, 2, 3, \ldots, K\}$ be the set of class labels. Given $P(X, Y)$ as the data distribution over $\mathcal{X} \times \mathcal{Y}$, a classifier is the function $f : \mathcal{X} \to \mathcal{Y}$, from which the error (true risk) and accuracy are obtained. For each input $x \in \mathcal{X}$ and its corresponding true label $y$, the probability distribution of the model prediction is $p(y \mid x)$, and the predicted label is $\hat{y} = \arg\max_{y \in \mathcal{Y}} p(y \mid x)$.

3.1. Problem Setting

In the Risk-Coverage (RC) curve, the coverage is the percentage of the covered set over the entire data, which is written as $c = |\mathcal{X}_c| / |\mathcal{X}|$. For each coverage, the risk is the corresponding error in model prediction. A model with better FD performance should obtain lower risk (higher accuracy) with fewer samples rejected.

To efficiently quantify the FD performance of a model, we first need to construct the reject function $\mathcal{R}$ to decide whether to reject samples or not under different thresholds. By adopting the settings in [31, 5, 12], we utilize the predictive uncertainty to rank samples. A sample with low uncertainty indicates high confidence and better reliability of the model prediction, whereas a sample with high uncertainty is more likely to be rejected when narrowing the coverage. Given a fixed or adaptive threshold $\theta$, the reject function $\mathcal{R}$ is written as follows:

$$\mathcal{R}(x) = \begin{cases} \text{cover}, \; x \in \mathcal{X}_c, & \text{if } u(x) \le \theta \\ \text{reject}, \; x \in \mathcal{X}_r, & \text{if } u(x) > \theta \end{cases} \tag{1}$$

where $u(x)$ is the predictive uncertainty of $x$, $\mathcal{X}_c$ is the covered input set and $\mathcal{X}_r$ is the reject set.

There are two types of risk, namely the empirical risk $r_{\mathrm{emp}}$ and the optimal risk $r_{\mathrm{opt}}$ [12]. The empirical risk $r_{\mathrm{emp}}$ is the predicted error of the model under different coverage, shown as the solid green line in Figure 1. As the aleatory uncertainty is inherited from the data, some risk inevitably exists in certain coverage regardless of the model performance. For a model with perfect uncertainty estimation, if we discard the error percentage $\hat{r}$ of high-uncertainty samples, the risk on the remaining covered input should be zero. This specific coverage point of $1 - \hat{r}$ (or $c^*$) was proposed by [12] as the optimal point and is shown as the red star in Figure 1. Specifically, the risk between coverage $c^*$ and 1 increases monotonically until it reaches the error of the model. For optimal calibration, the above risks are called the optimal risk $r_{\mathrm{opt}}$, illustrated as the blue dotted line in the figure. For example, the model error in the figure is 0.16 and $c^*$ is 0.84. Therefore, the optimal risk under coverage 0 to $c^*$ is supposed to be 0, while it increases from 0 to 0.16 between $c^*$ and full coverage. It is worth noting that the monotonic increase of $r_{\mathrm{opt}}$ is not exactly linear.

[Figure 1: risk-coverage curve (x-axis: coverage; y-axis: risk).]

Both $r_{\mathrm{emp}}$ and $r_{\mathrm{opt}}$ can be quantified by the Area Under the RC curve (AURC) [12, 11], named $\mathrm{AURC}_{\mathrm{emp}}$ (yellow plus grey area in Figure 1) and $\mathrm{AURC}_{\mathrm{opt}}$ (grey area in Figure 1), respectively. The difference between $\mathrm{AURC}_{\mathrm{emp}}$ and $\mathrm{AURC}_{\mathrm{opt}}$ is the real FD area, shown as the yellow area in Figure 1. Geifman et al. [12] propose this specific area as the Excess-AURC (E-AURC), where $\text{E-AURC} = \mathrm{AURC}_{\mathrm{emp}} - \mathrm{AURC}_{\mathrm{opt}}$.

3.2. E-AUoptRC

The E-AURC reveals the total risk in the coverage range from 0 to 1. However, in real-world applications, the coverage is mainly customised due to specific deployment requirements, making it challenging to compare the failure detection (FD) performance of various models. In addition, the E-AURC cannot reveal the FD performance in a specific coverage range. To mitigate the above issues, we propose the E-AUoptRC over the coverage range from $c^*$ to 1 (shown as the pink area in Figure 2). We emphasise the E-AUoptRC for the following reasons: (1) it is more practical for deployment, as it is unlikely that more than half of the data would be discarded in applications; (2) a smaller E-AUoptRC indicates that more samples with high uncertainty are successfully removed, so that the model prediction on the remaining data will be more reliable.

[Figure 2: risk-coverage curve (x-axis: coverage; y-axis: risk).]

3.3. Trust Index

Model accuracy should track the confidence of the model prediction. For example, a model with 80% accuracy suggests 80% confidence in its own predictions, which also defines the perfect confidence score in calibration. As the risk at the optimal point ($c^*$) is supposed to be 0, the accuracy at $c^*$ should be 1, indicating the highest model confidence and trustworthiness of the prediction. In other words, after removing the fraction $\hat{r}$ of the data with the highest uncertainty, the correctly predicted samples in the remaining data are the most trusted. The accuracy at $c^*$ also reveals the model calibration, as the discarded fraction $\hat{r}$ of the data can be misclassified. To represent the model performance in terms of accuracy and calibration, we propose the accuracy at $c^*$ as a Trust Index (TI), a complementary evaluation to the accuracy metric to indicate the model's trustworthiness. For example, in Figure 2, a model accuracy of 84% corresponds to a trust of 0.84 in the predictions.

After removing the 16% of samples with the highest uncertainty ($c^*$ is 0.84), the risk is approximately 0.08. The TI, the accuracy over the most confident 84% of samples, is 0.92.
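As an illustration of the Trust Index, the following minimal Python sketch ranks samples by confidence, keeps the fraction of samples equal to the overall accuracy (the optimal point), and reports the accuracy on that kept subset. The function name `trust_index`, the use of one scalar confidence score per sample (for example the maximum softmax probability) and the toy data are assumptions made for illustration; they are not taken from the authors' released code.

```python
import numpy as np


def trust_index(confidence, correct):
    """Return (overall accuracy, accuracy at the optimal point).

    confidence: one score per sample, higher means more certain (assumed
        here to be, e.g., the maximum softmax probability).
    correct: 1/True where the prediction matches the label, else 0/False.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    n_total = correct.size
    n_correct = int(correct.sum())
    accuracy = n_correct / n_total

    # Optimal point: coverage equal to the accuracy, i.e. keep as many
    # samples as there are correct predictions and reject the rest.
    order = np.argsort(-confidence, kind="stable")  # most confident first
    kept = correct[order[:n_correct]]

    # Accuracy over the kept samples; undefined if nothing is kept.
    ti = float(kept.mean()) if n_correct > 0 else float("nan")
    return accuracy, ti


# Toy data (illustrative only): 10 samples, 2 errors, one of them ranked
# among the confident samples.
toy_confidence = np.linspace(1.0, 0.1, 10)
toy_correct = np.array([1, 1, 1, 1, 0, 1, 1, 1, 1, 0])
toy_accuracy, toy_ti = trust_index(toy_confidence, toy_correct)
print(f"accuracy={toy_accuracy:.3f}, TI={toy_ti:.3f}")
```

On this toy input the overall accuracy is 0.8 and the TI is 0.875: one of the two errors is ranked among the eight most confident samples, so the kept subset is not error-free. With a ranking that places every error last, the TI would be 1.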

A higher TI suggests more trustworthy model predictions; we next present empirical data to substantiate this.
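The area-based quantities of Sections 3.1 and 3.2 can be sketched as follows. The code is a minimal illustration under stated assumptions: coverage is swept over the n ranked samples, areas are approximated by a Riemann sum over those n coverage levels, the optimal risk is that of an ideal ranking that places every error last, and the E-AUoptRC is taken to be the excess of the empirical over the optimal risk restricted to coverage from the optimal point to 1. The function names and the discretisation are hypothetical and may differ from the authors' implementation.

```python
import numpy as np


def rc_curve(confidence, correct):
    """Empirical risk at coverage k/n after ranking by confidence."""
    confidence = np.asarray(confidence, dtype=float)
    errors = 1.0 - np.asarray(correct, dtype=float)
    order = np.argsort(-confidence, kind="stable")  # most confident first
    k = np.arange(1, errors.size + 1)
    coverage = k / errors.size
    risk = np.cumsum(errors[order]) / k
    return coverage, risk


def area_metrics(confidence, correct):
    """AURC, E-AURC and an assumed definition of E-AUoptRC."""
    coverage, risk = rc_curve(confidence, correct)
    n_total = risk.size
    n_correct = int(np.asarray(correct, dtype=bool).sum())
    k = np.arange(1, n_total + 1)

    # Optimal risk: an ideal ranking puts all errors last, so the risk is
    # 0 up to the optimal point k = n_correct and (k - n_correct) / k after.
    optimal = np.maximum(0, k - n_correct) / k
    excess = risk - optimal

    tail = k >= n_correct  # coverage from the optimal point to 1
    return {
        "optimal_point": n_correct / n_total,
        "AURC": float(risk.sum() / n_total),
        "E-AURC": float(excess.sum() / n_total),
        "E-AUoptRC": float(excess[tail].sum() / n_total),
    }


# Toy data (illustrative only): 10 samples, 2 errors.
toy_confidence = np.linspace(1.0, 0.1, 10)
toy_correct = np.array([1, 1, 1, 1, 0, 1, 1, 1, 1, 0])
for name, value in area_metrics(toy_confidence, toy_correct).items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the E-AUoptRC can never exceed the E-AURC, because it sums the same non-negative excess over a sub-range of the coverage. In the toy data the misranked error sits at low coverage, so most of the E-AURC is accumulated before the optimal point and the E-AUoptRC is considerably smaller.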

4. Experimental Setup

4.1. Datasets and Baselines

We validate the proposed method with three benchmark image datasets: ImageNet 2012 (IN) [32], CIFAR100 (C100) [33] and Tiny-ImageNet [34]. For baselines, we use the state-of-the-art (SOTA) Vision Transformer (ViT) [35] and its variants such as Swin-Transformer (SwinT) [36], Class-Attention in Image Transformers (CaiT) [37], Cross-Attention Multi-Scale Vision Transformer (CrossViT) [38] and ConvNext [39], with the ImageNet pretrained weights from the TIMM¹ library. To report comprehensive results on various model architectures, we also use convolutional neural networks (CNNs) in our experiments, namely DenseNet121 [40], ResNet56 [41], variants of VGG [42] and MobileNetV2 [43]. All models use pretrained weights from the ImageNet dataset. For the recent SOTA calibration techniques label smoothing (LS) [27], focal loss (FL) [29], MBLS [28] and FLSD [30], we utilize the pre-trained model and official implementation from the repository².

As the evaluation of failure detection is a post-processing approach, we primarily utilize each dataset's test set. For the ImageNet dataset, we equally divide its original test set of 50,000 images into validation and test sets for a fair comparison. For the Tiny-ImageNet and CIFAR100 datasets, an 80/10/10 training/validation/test split is applied.

¹ https://github.com/rwightman/pytorch-image-models
² https://github.com/by-liu/MbLS

4.2. Implementation Details

For a fair comparison and replicability of experimentation, we utilized publicly available pre-trained weights for our investigation and experimentation. An Nvidia Tesla P40 GPU was used for all experiments. The number of bins for ECE was set to 15.

5. Results

We conducted extensive experiments on the benchmark datasets ImageNet and CIFAR100 with various CNNs and variants of transformers to compare the AURC, E-AURC and our proposed E-AUoptRC. We further observed the limitation of the conventional overall model accuracy and how our proposed Trust Index (TI) mitigates it. Finally, to validate the efficacy of our method, we applied it to SOTA calibration techniques with ResNet50 on the Tiny-ImageNet dataset. All the experiments and results are shown in Tables 1 and 2, and Figure 3.

Table 1 shows the results for image classification with the benchmark datasets. For AURC, E-AURC and E-AUoptRC on the ImageNet dataset, the transformer variants outperform the CNN models. The E-AURC for ViT is about half of the E-AURC of SwinTran, CaiT and ConvNext, indicating that ViT greatly outperforms the other three models in failure detection. However, regarding the E-AUoptRC, the difference is almost negligible and ConvNext is slightly better than the other three models. The risk-coverage (RC) curve (left in Figure 3) also shows that from a coverage of 0.84 (near the optimal point) to 1, the risk curves of ViT and ConvNext nearly overlap. The lower risk for ViT occurs at very low coverage levels, which are not of interest for most real-world applications. For the CIFAR100 dataset with CNNs, VGG13_bn substantially outperforms other models in terms of AURC and E-AURC. However, the difference in E-AUoptRC between VGG13_bn and VGG19_bn is much smaller. This can be understood from the middle plot in Figure 3, where the curves for VGG13_bn and VGG19_bn overlap at coverage between 0.74 (near the optimal point) and 0.9. These differences in the metrics provide empirical evidence that our proposed E-AUoptRC more accurately reflects real differences in failure detection performance than other methods.

Similar to the results of the AURC-related evaluation, the transformer variants also outperform CNNs in terms of overall model accuracy and Trust Index (TI). SwinTran obtains the highest overall model accuracy for the ImageNet dataset, but it does not yield the highest TI. For the CIFAR100 dataset, VGG13_bn achieves the highest overall model accuracy, whereas VGG19_bn obtains the best TI. This indicates that the model with the highest overall accuracy is not guaranteed to have the highest TI, which shows that our proposed TI is necessary for model trustworthiness evaluation.

In Table 2, the baseline (CE) obtains better AURC and E-AURC, but label smoothing outperforms the other methods and CE in terms of overall accuracy (an improvement of 0.6%) and TI. MBLS nearly halves the overall ECE of the baseline and achieves the best ECE at the optimal point. In the right RC curve in Figure 3, LS has the lowest risk from a coverage of 0.65 to 1 (the likely operating range when the model is deployed), and our proposed E-AUoptRC and TI metrics are the only ones that capture this. Failure detection performance should be a significant evaluation criterion for calibration techniques, and our methods provide a more insightful view of the model trustworthiness.

6. Discussion & Conclusion

In this paper, we proposed the E-AUoptRC to more precisely quantify the failure detection performance in the key region of interest, and the Trust Index (TI) that measures model accuracy at its optimal point. The empirical results show that our methods can better reveal the model trustworthiness under a fair comparison. In real-world deployment, a fixed threshold is often used due to specific task requirements and simplicity of implementation. Our proposed TI can be utilized as the reference for threshold selection for the following reasons: (1) the accuracy should indicate the model confidence in its prediction, suggesting that the TI can interpret the confidence; (2) TI is obtained at the optimal point, where the model is supposed to achieve the ideal performance. This is an objective method for the fair comparison of models with different accuracy and calibration (as shown in Tables 1 and 2); (3) TI is easy to calculate, which saves time and computational cost. We have shown several benefits of our proposed metrics over existing ones and, in our future work, we will further investigate the role of TI in improving failure detection.

References

[1] S. Atakishiyev, M. Salameh, H. Yao, R. Goebel, Explainable artificial intelligence for autonomous driving: a comprehensive overview and field guide for future research directions, arXiv preprint arXiv:2112.11561 (2021).
[2] M. Raghu, K. Blumer, R. Sayres, Z. Obermeyer, B. Kleinberg, S. Mullainathan, J. Kleinberg, Direct uncertainty prediction for medical second opinions, in: International Conference on Machine Learning, PMLR, 2019, pp. 5281–5290.
[3] M. W. Dusenberry, D. Tran, E. Choi, J. Kemp, J. Nixon, G. Jerfel, K. Heller, A. M. Dai, Analyzing the role of model uncertainty for electronic health records, in: Proceedings of the ACM Conference on Health, Inference, and Learning, 2020, pp. 204–213.
[4] D. Hendrycks, K. Gimpel, A baseline for detecting misclassified and out-of-distribution examples in neural networks, ICLR (2017).
[5] C. Corbière, N. Thome, A. Bar-Hen, M. Cord, P. Pérez, Addressing failure prediction by learning model confidence, Advances in Neural Information Processing Systems 32 (2019).
[6] N. Band, T. G. Rudner, Q. Feng, A. Filos, Z. Nado, M. W. Dusenberry, G. Jerfel, D. Tran, Y. Gal, Benchmarking bayesian deep learning on diabetic retinopathy detection tasks, in: NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications, 2021.
[7] K. Hendrickx, L. Perini, D. Van der Plas, W. Meert, J. Davis, Machine learning with a reject option: A survey, arXiv preprint arXiv:2107.11277 (2021).
[8] R. El-Yaniv, et al., On the foundations of noise-free selective classification, Journal of Machine Learning Research 11 (2010).
[9] C. Ferri, J. Hernández-Orallo, Cautious classifiers, ROCAI 4 (2004) 27–36.
[10] M. S. A. Nadeem, J.-D. Zucker, B. Hanczar, Accuracy-rejection curves (arcs) for comparing classification methods with a reject option, in: Machine Learning in Systems Biology, PMLR, 2009, pp. 65–81.
[11] Y. Ding, J. Liu, J. Xiong, Y. Shi, Revisiting the evaluation of uncertainty estimation and its application to explore model complexity-uncertainty trade-off, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 4–5.
[12] Y. Geifman, G. Uziel, R. El-Yaniv, Bias-reduced uncertainty estimation for deep neural classifiers, in: International Conference on Learning Representations, 2019.
[13] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572 (2014).
[14] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, D. Mané, Concrete problems in ai safety, arXiv preprint arXiv:1606.06565 (2016).
[15] Y. Geifman, R. El-Yaniv, Selective classification for deep neural networks, Advances in neural information processing systems 30 (2017).
[16] Y. Geifman, R. El-Yaniv, Selectivenet: A deep neural network with an integrated reject option, in: International conference on machine learning, PMLR, 2019, pp. 2151–2159.
[17] C. Leibig, M. Brehmer, S. Bunk, D. Byng, K. Pinker, L. Umutlu, Combining the strengths of radiologists and ai for breast cancer screening: a retrospective analysis, The Lancet Digital Health 4 (2022) e507–e519.
[18] C. Cho, W. Choi, T. Kim, Leveraging uncertainties in softmax decision-making models for low-power iot devices, Sensors 20 (2020) 4603.
[19] T. Fawcett, An introduction to roc analysis, Pattern recognition letters 27 (2006) 861–874.
[20] C. Manning, H. Schutze, Foundations of statistical natural language processing, MIT press, 1999.
[21] D. Hendrycks, K. Gimpel, A baseline for detecting misclassified and out-of-distribution examples in neural networks, arXiv preprint arXiv:1610.02136 (2016).
[22] A. Malinin, M. Gales, Predictive uncertainty estimation via prior networks, Advances in neural information processing systems 31 (2018).
[23] F. Zhu, Z. Cheng, X.-Y. Zhang, C.-L. Liu, Rethinking confidence calibration for failure prediction, in: European Conference on Computer Vision, Springer, 2022, pp. 518–536.
[24] M. P. Naeini, G. Cooper, M. Hauskrecht, Obtaining well calibrated probabilities using bayesian binning, in: Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[25] R. Müller, S. Kornblith, G. E. Hinton, When does label smoothing help?, Advances in neural information processing systems 32 (2019).
[26] S. Obadinma, H. Guo, X. Zhu, Class-wise calibration: A case study on covid-19 hate speech, in: Canadian Conference on AI, 2021.
[27] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
[28] B. Liu, I. Ben Ayed, A. Galdran, J. Dolz, The devil is in the margin: Margin-based label smoothing for network calibration, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 80–88.
[29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[30] J. Mukhoti, V. Kulharia, A. Sanyal, S. Golodetz, P. Torr, P. Dokania, Calibrating deep neural networks using focal loss, Advances in Neural Information Processing Systems 33 (2020) 15288–15299.
[31] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in neural information processing systems 30 (2017).
[32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International journal of computer vision 115 (2015) 211–252.
[33] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, Technical Report, University of Toronto, 2009.
[34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE conference on computer vision and pattern recognition, IEEE, 2009, pp. 248–255.
[35] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[36] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[37] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, H. Jégou, Going deeper with image transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 32–42.
[38] C.-F. R. Chen, Q. Fan, R. Panda, Crossvit: Cross-attention multi-scale vision transformer for image classification, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 357–366.
[39] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A convnet for the 2020s, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11976–11986.
[40] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
[41] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[42] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[43] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, Mobilenetv2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.