<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improvement of Rejection for AI Safety through Loss-Based Monitoring</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Scholz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florian Hauer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Klaus Knobloch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Mayr</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Infineon Technologies Dresden</institution>
          ,
          <addr-line>Königsbrücker Str. 180, 01099 Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Infineon Technologies München</institution>
          ,
          <addr-line>Am Campeon 1-15, 85579 Neubiberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Technische Universität Dresden, Chair of Highly-Parallel VLSI Systems and Neuro-Microelectronics</institution>
          ,
          <addr-line>Mommsenstr. 12, 01069 Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>There are numerous promising applications for AI which are safety-critical, e.g. computer vision for automated driving. This requires safety measures for the underlying algorithm. Typically, the validity of a classification is judged solely on the output probability of a network. Literature suggests that by rejecting classifications below an a-priori set probability threshold, the error rate of the network can be reduced. This inherently does not catch those errors where the output probability of wrong classifications exceeds such a threshold. However, these are the most critical errors, since the system is erroneously overconfident. To solve this problem and close the gap, we present how this rejection idea can be improved by performing loss-based rejection. Our approach takes data as well as the pre-trained base-model as input and yields a monitoring model as output. For training of the monitoring model, the data samples are labeled based on the loss resulting from the base-model. This way, overconfident misclassifications can be avoided and the overall error rate reduced. As evaluation, we applied the approach to two datasets, one of which is the German Traffic Sign Recognition Benchmark (GTSRB) that is used to train safety-critical traffic sign classifiers. The experiments show that this approach yields results that improve the error-rate by up to an order of magnitude, while a portion of inputs is rejected as trade-off.</p>
      </abstract>
      <kwd-group>
<kwd>Rejection</kwd>
        <kwd>AI Safety</kwd>
        <kwd>Robustness</kwd>
        <kwd>Classification</kwd>
        <kwd>Neural Networks</kwd>
        <kwd>Representation Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Artificial neural networks (ANN) are deployed for a variety of tasks. If the present trend persists, they will be more frequently included in safety-critical decisions in fields such as medical diagnosis or automated driving. Therefore, the safety of AI is important and already broadly discussed [1, 2, 3].</p>
      <p>For safety-critical domains, e.g. the automotive domain, there exists no suitable standard for the safety assessment of ANNs yet [4, 5]. Future standardized safety assessments might aim for a high test set accuracy or ensure a low error-rate. The latter is especially important for safety-critical applications. Error-rates directly result from the accuracy if and only if a prediction is forced in all cases and no reject option exists. For this work, a model is considered safer than another model for the same task if the relative error-rate is reduced using the same evaluation method in the same testing context.</p>
      <p>Upon integration in a running system, possibly combined with multiple sensors and algorithms, it must be judged whether single predictions of an ANN are trustworthy. This is mandatory to decide whether actions are performed based on those predictions or a verified safety path shall be used as a fallback solution [6]. Especially when the softmax activation function is applied at the output, the resulting values are interpreted as probabilities and might be mistakenly used as a confidence measure of the given prediction [7]. It has been shown that when those networks are trained with multinomial cross-entropy loss, a tendency towards over-confident decisions exists [8]. Usually the term &#8220;over-confident predictions&#8221; implies that the error-rate does not match the reported output probability for a given prediction. In the following, &#8220;over-confident predictions&#8221; includes cases with relatively high output probabilities that might reflect the true error-rate but are incorrectly predicted. Such errors are the worst kind of failures that an ANN may produce. A deployed system might rely on the reported high confidence of the model and will not be able to maintain a safe state without any additional mechanism.</p>
      <p>Problem: A decision mechanism is required to decide whether or not a prediction should be forced on an input, when it is necessary to reduce the error-rate beyond a model&#8217;s performance.</p>
      <p>Different methods are present to approach these problems. Proper calibration [8, 9] enables reducing the impact of over-confident errors. However, noting the extended definition of over-confident errors in this work, it does not address the actual issue. Other works [10, 11, 12, 13] address the issue directly by rejecting samples. Some use a selective prediction score that is built-in and improved during the initial training. In addition, there exist approaches that result in mathematical heuristics upon which the decisions to abstain are made. However, output probabilities for which rejections are expected are rarely reported. Decreasing the error-rate solely by discarding decisions which report a low output probability is less noteworthy for safety-critical domains, since one would not rely on such decisions in the first place. Summarized, there exists no approach that reduces the error rate of a present blackbox model by rejection based on a trained representation of the model&#8217;s weak points that can detect over-confident errors.</p>
      <p>In the following, the model which is monitored is referred to as the &#8220;base-model&#8221; while the additional one is called the &#8220;monitoring model&#8221;. We close this gap with the following contribution: We present how the well-known rejection procedure can be improved by proposing a loss-based rejection. Our approach yields a monitoring model as output via training centered on the base-model&#8217;s loss. This way, over-confident misclassifications can be avoided and the overall error rate reduced.</p>
      <p>The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022), July 24&#8211;25, 2022, Vienna, Austria. * Corresponding author: daniel.scholz@infineon.com (D. Scholz); florian.hauer@infineon.com (F. Hauer); klaus.knobloch@infineon.com (K. Knobloch); christian.mayr@tu-dresden.de (C. Mayr). &#169; 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>There are multiple approaches which can be considered to have the same goal of improving safety in AI. A trust score was proposed that is supposed to correlate with whether the classification is actually correct. It measures the consensus between the base classifier and a modified nearest-neighbor classifier at test time [14]. For the Digits dataset it was possible to detect trustworthy and suspicious predictions. The authors stated that for higher-dimensional datasets like MNIST the trust score provides only little or no improvement in detecting wrong decisions better than the base model&#8217;s confidence itself.</p>
        <p>Another line of research suggests using the data&#8217;s distribution: present approaches perform anomaly detection on a dataset [15]. This flags specific samples but is purely based on the data and does not include any information about the model or the training process. More specifically, for a fixed dataset but models with different weak points, the same samples would be identified as possible failures, since the models are not part of the evaluation. Similar applies when out-of-distribution detection [16] is performed. It can distinguish between data near and far away from the training distribution, typically corresponding to a different dataset. However, this does not prove an improvement for test data that lies inside the training distribution.</p>
        <p>Bayesian neural networks (BNN) [17] or the application of Monte Carlo Dropout [7] change the classical deterministic fashion of ANNs to a probabilistic nature, therefore equipping the model itself with a built-in robust nature. Both methods increase generalization but do not necessarily perform better on crucial samples. Additionally, there is a computational overhead due to multiple network evaluations to calculate the output for a single input. Intuitively, these works are important, but simply address a different problem.</p>
        <p>Approaches that include rejection [10, 18, 19, 11, 12, 13] show the same underlying principle as in this work. Rejection is commonly trained in combination with the classifier. However, it is advantageous if the monitoring approach that is used to abstain does not necessarily have to be trained together with a base-model, as shown in this work. Additionally, these works commonly focus on the reduction of the error-rate without differentiating between rejected predictions with a low and a high output probability.</p>
        <p>More straightforward methods, like a probability or confidence score threshold under which a prediction is not trusted [18, 20, 21, 19], will be effective in decreasing the overall error rate. However, since such a threshold ideally divides confident and uncertain cases, per definition it will fail to catch over-confident predictions. Considering the above, there exists no model-specific decision boundary for choosing which prediction to trust that also excludes over-confident decisions, leaving the risk of possibly fatal situations for high output probabilities unchanged.</p>
      </sec>
      <sec id="sec-1-1a">
        <title>3. Preliminaries and Formalism</title>
        <p>Selective Prediction. When rejection, also known as selective prediction [22], is performed for a classification problem, it can be formulated as follows. Let X be a feature space with its corresponding class label space Y that enables supervised training. In this work X consists of images. Predictions are obtained by a model f : X &#8594; Y that is trained by minimizing a loss function &#8467; : Y &#215; Y &#8594; R+. A labeled set S_m = {(x_i, y_i)}_{i=1}^{m} &#8838; (X &#215; Y)^m is sampled i.i.d. from P(X, Y), which is the distribution over X &#215; Y. The empirical risk of classifier f is given by r&#770;(f | S_m) &#8796; (1/m) &#8721;_{i=1}^{m} &#8467;(f(x_i), y_i) [22, 12].</p>
        <p>A selective model is the pair of the already defined prediction function f and a selection function g* : X &#8594; {0, 1}, which performs the binary task of abstaining for f. In this work a sample shall be rejected when it is predicted as &#8220;positive&#8221; for a possible fault. To be in accordance with [22], g* is inverted to g:
g(x) = 1 &#8722; g*(x) (<xref ref-type="bibr" rid="ref1">1</xref>)
Therefore, an input x is rejected as follows:
(f, g)(x) &#8796; f(x), if g*(x) = 0; reject, if g*(x) = 1. (<xref ref-type="bibr" rid="ref2">2</xref>)</p>
        <p>Selective prediction can be evaluated by coverage and risk. The empirical coverage, i.e. the ratio of data which is kept, is defined as
c&#770;(g | S_m) &#8796; (1/m) &#8721;_{i=1}^{m} g(x_i). (<xref ref-type="bibr" rid="ref3">3</xref>)
The empirical risk is given by
r&#770;(f, g | S_m) &#8796; [(1/m) &#8721;_{i=1}^{m} &#8467;(f(x_i), y_i) g(x_i)] / c&#770;(g | S_m) (<xref ref-type="bibr" rid="ref4">4</xref>)
which results in the relative error-rate on the covered data when the 0-1 loss function is applied.</p>
        <p>Loss Theory. Loss functions are used to give a metric for the performance of a machine learning model. They are the basis of AI training, since the gradient of those functions dictates the direction of change to the network in every training step. For classification tasks the cross-entropy loss is often used, which is given by
L = &#8722; &#8721;_{i=1}^{M} y_i log(p_i) (<xref ref-type="bibr" rid="ref5">5</xref>)
where L is the resulting loss, i is the index of a class with M being the total number of present classes, y_i is the target value and p_i is the predicted value of the i-th class [23]. When the target value of the correct class c is 1 while all others are 0, as is the case for one-hot encoding, this collapses into the negative log-likelihood and is only dependent on the output value p_c of the correct class:
L = &#8722; log(p_c) (<xref ref-type="bibr" rid="ref6">6</xref>)</p>
      </sec>
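<p>To make the coverage and risk definitions concrete, the following minimal Python sketch (our illustration, not code from the paper) computes the empirical coverage of Eq. (3) and the empirical risk of Eq. (4); with a 0-1 loss the risk equals the relative error-rate on the covered data.</p>
<preformat>
```python
def empirical_coverage(g):
    # g: binary selection per sample, 1 = keep, 0 = reject (Eq. 3)
    return sum(g) / len(g)

def empirical_risk(losses, g):
    # Eq. 4: mean of loss * selection, normalized by coverage.
    # With the 0-1 loss this is the error-rate on the covered data.
    kept_loss = sum(l * gi for l, gi in zip(losses, g)) / len(g)
    cov = empirical_coverage(g)
    return kept_loss / cov if cov > 0 else 0.0

# toy example: 0-1 losses of f on ten samples; reject the last two
loss01 = [0, 0, 1, 0, 0, 0, 1, 0, 1, 1]
keep   = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(empirical_coverage(keep))      # 0.8
print(empirical_risk(loss01, keep))  # 2 errors / 8 kept = 0.25
```
</preformat>
<p>Rejecting two of the four errors here lowers the covered error-rate from 0.4 to 0.25 at a coverage of 0.8, which is exactly the trade-off studied in the remainder of the paper.</p>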
<sec id="sec-1-2">
        <p>From the highest possible p_c &lt; 0.5 in the case of a wrong prediction and the softmax activation function, it directly follows that the lower bound of the loss L_f of a falsely predicted sample is
L_f &gt; &#8722; log(0.5) &#8776; 0.693. (<xref ref-type="bibr" rid="ref7">7</xref>)
Moreover, the upper bound of the loss L_c of a sample which is predicted correctly is given by
L_c &lt; &#8722; log(1/n) (8)
when n classes are present. The upper and lower bound for false and correct predicted cases is infinity and zero, respectively. For n &gt; 2 there exists an overlap of samples that are correctly predicted with low confidence and samples that are incorrectly predicted. Discarding samples whose loss exceeds a set threshold t &lt; L_f evades the overlap.</p>
      </sec>
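<p>The bounds of Eq. (7) and Eq. (8) can be checked numerically. The short sketch below is our own illustration (not the paper's code): it evaluates the negative log-likelihood of Eq. (6) at the boundary cases p_c = 0.5 and p_c = 1/n.</p>
<preformat>
```python
import math

def nll(p_correct):
    # cross-entropy with a one-hot target collapses to -log(p_c) (Eq. 6)
    return -math.log(p_correct)

# Eq. (7): a wrong prediction implies p_c below 0.5, so its loss exceeds
lower_bound_false = nll(0.5)
print(round(lower_bound_false, 3))  # 0.693

# Eq. (8): a correct prediction implies p_c above 1/n, so its loss stays below
for n in (2, 10, 43):               # 43 classes as in GTSRB, 10 as in F-MNIST
    print(n, round(nll(1.0 / n), 3))
# for n above 2 the upper bound -log(1/n) exceeds -log(0.5),
# so correct low-confidence and wrong predictions overlap in loss
```
</preformat>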
    </sec>
    <sec id="sec-2">
      <title>4. Approach</title>
<p>The goal is to detect samples which are being classified incorrectly. It is desired that a pre-trained blackbox model is monitored and that a distinction is possible for the whole range of reported output probabilities. The solution is to distinguish between samples of data that induce a high and a low loss in the given base-model and to abstain from the former cases.</p>
<p>The proposed methodology expects an already trained base-model and additional data which is i.i.d. with the train and test data. The paradigm by which the trained monitoring model is obtained is shown in Fig. 1. The loss-based labeling is depicted in Fig. 2. It is suitable to separate samples by a loss threshold t &lt; L_f as derived from Eq. (<xref ref-type="bibr" rid="ref7">7</xref>), leading to a division between correct and incorrect (over-confident) decisions. When a bigger ratio of samples leading to incorrect compared to correct predictions is rejected, the safety of the AI algorithm is improved.</p>
<p>Contribution. The loss-based labeling provides a dataset in which the weak points of the base-model are embedded, and it enables intercepting incorrect predictions for all output probabilities reported by a blackbox model.</p>
<p>Note. The suggestion to use rejection is not our contribution; it was already proposed in the past [10, 18]. Our focus lies on intercepting over-confident errors and evaluating efficiency for the whole range of output probabilities.</p>
<p>The dataset consists of the unaltered base samples with replaced labels, corresponding to the two classes &#8220;negative&#8221; and &#8220;positive&#8221;. The monitoring network is trained on the mentioned dataset. Upon deployment, the monitoring model performs the binary decision prior to the base-model&#8217;s prediction as depicted in Fig. 3, reducing the error-rate.</p>
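<p>The loss-based labeling step can be sketched as follows. This is a minimal illustration assuming the base-model is available as a function returning output distributions; the function names and the example threshold value are ours, not prescribed by the paper.</p>
<preformat>
```python
import math

def loss_based_labels(probs, targets, t=0.6):
    """Relabel samples for training the monitoring model.

    probs:   per-sample output distributions of the pre-trained base-model
    targets: integer class labels of the same samples
    t:       loss threshold, chosen below -log(0.5) = 0.693 (Eq. 7)

    Returns "positive" (high loss, reject-worthy) or "negative" per sample.
    """
    labels = []
    for p, y in zip(probs, targets):
        loss = -math.log(max(p[y], 1e-12))   # Eq. 6, clipped for stability
        labels.append("positive" if loss > t else "negative")
    return labels

# toy example: three samples, 3-class base-model outputs, true class 0
probs   = [[0.9, 0.05, 0.05],   # confident and correct  -> low loss
           [0.4, 0.3, 0.3],     # unsure, still correct  -> high loss
           [0.2, 0.7, 0.1]]     # over-confident error   -> high loss
targets = [0, 0, 0]
print(loss_based_labels(probs, targets))
# ['negative', 'positive', 'positive']
```
</preformat>
<p>Note that the over-confident error receives the same &#8220;positive&#8221; label as the uncertain case, which is exactly what lets the monitoring model catch errors across the whole range of output probabilities.</p>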
<p>The approach is especially helpful when considering that a trained classifier will have certain latent weak points. Uncovering those would be ideal but might be impossible for blackbox approaches. Even without exactly defining such weak points, the monitoring model can be able to learn a pattern which is present in critical input data. To best acquire the performance of the base-model by the monitoring model, data that the base-model has not seen during training is necessary. Since ANN training aims to reduce errors on the training set, weak points which the model includes may not be detectable at this stage. However, the monitoring model can extract further information on additional data.</p>
      <p>[Figure 2: Loss-based labeling. Correctly predicted samples are labeled &#8220;negative&#8221;, while high-loss samples (partially wrong and wrong predictions) are labeled &#8220;positive&#8221;.]</p>
      <p>[Figure 3: Ideal behaviour of the monitoring model and base-model stack. Light and black boxes depict samples which will be predicted correctly and incorrectly, respectively; incorrect samples are caught by the monitoring model.]</p>
      <p>One has to consider that the monitoring network does not specifically discriminate between correct and incorrect predictions but rejects based on the set loss threshold. Therefore, a performance metric for the monitoring network alone will not give sufficient information. The investigation includes which images are ultimately assigned to an incorrect class. The error-rate is solely reduced by rejecting such samples.</p>
      <sec id="sec-2-1">
        <title>5. Evaluation and Experiments</title>
        <sec id="sec-2-1-1">
          <title>5.1. Evaluation Goals</title>
          <p>This work performs a study of the proposed methodology and compares it to the performance of a decision threshold based on the softmax output, which is known to yield good results for selective prediction [22] but comes with the discussed shortcomings that the presented approach aims to solve. The objective is to answer the following research questions (RQs):</p>
          <p>RQ1. Can rejected inputs be assigned to &#8220;weak points&#8221; of the base-model? Answering this question will not explain the exact features on which the rejection is based. However, it helps to interpret whether the obtained monitoring is acting on a meaningful basis.</p>
          <p>RQ2. Is the reduction of the error-rate by rejection based on the monitoring model better than pure chance? Since even random rejection will improve the error-rate by a factor of the rejection rate, it is important to investigate whether the monitoring results in an improvement higher than this baseline.</p>
          <p>RQ3. Are incorrect decisions caught for the whole range of output probabilities? This work is motivated by catching over-confident decisions, therefore answering this question is a key aspect of the evaluation. Incorrect decisions with high output probabilities might be challenging but are the most important inputs to reject.</p>
          <p>RQ4. What are the differences between both datasets? It is important to point out where differences occur, since this can indicate specific limitations of the method.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>5.2. Datasets</title>
          <p>To evaluate whether the approach can achieve improved safety of a model, the GTSRB [24] is chosen. Since it is an automotive-related dataset, incorrect classifications may lead to fatal decisions. For the sake of minimizing threats to validity, the approach is evaluated on an additional dataset. Fashion-MNIST (F-MNIST) [25] is chosen for this purpose for multiple reasons. First, the classification is based on pictures of real-world objects, which is comparable to GTSRB. Secondly, there are fewer different classes, which allows for a more detailed analysis. Moreover, although F-MNIST is relatively low-dimensional, the SOTA error-rate of over 3% [26] is comparably high. Since [14] showed that detecting wrong predictions may work on simpler datasets but fail on more complex ones, it was decided against classical MNIST [27] for this work. The evaluation is only shown for those two datasets due to space limitations.</p>
          <p>German Traffic Sign Recognition Benchmark. The GTSRB dataset is intended as an automotive-related large multi-category classification benchmark. It consists of 39209 train and 12630 test samples corresponding to 43 different categories. The distribution of classes is highly unbalanced, such that some classes are almost ten times as frequently present as other classes [24]. The images were extracted from video sequences and are supplied as RGB color images of different sizes between 15 &#215; 15 and 222 &#215; 193 pixels. In this work, color channels are kept but normalized to one, and images are resized to 32 &#215; 32 pixels. No action is performed to tackle the unbalanced distribution of classes.</p>
          <p>Fashion-MNIST. The F-MNIST dataset consists of grayscale images based on Zalando&#8217;s article images. There are 60,000 train and 10,000 test samples that are grouped into ten different classes. Each image has a dimension of 28 &#215; 28 pixels. In this work, the images were preprocessed by normalizing the pixel values to one.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>5.3. Experimental Setup</title>
        <p>For the GTSRB dataset a convolutional neural network (CNN) with the LeNet-5 architecture [28] is used, while for the F-MNIST dataset a fully connected (FC) feed-forward neural network is applied, both implemented with the TensorFlow library [29], as listed in Tab. 1.</p>
        <p>Table 1: Parameters of the networks used in the experiments.
Dataset: GTSRB / F-MNIST.
Base-model architecture: LeNet-5 / FC, 1 hidden layer, 128 neurons.
Base-model parameters: 85 k / 100 k.
Base accuracy: 90.48% / 88.04%.
Monitoring architecture: LeNet-5 / FC, 2 hidden layers, 32 neurons each.
Monitoring parameters: 85 k / 26 k.
Training samples (portion of training data): 7,840 (20%) / 10,000 (17%).</p>
        <p>Ratios of training data for the monitoring model were kept to a minimum to have minor influence on the base-model training. However, a sufficient absolute number of images is needed to enable successful training convergence. To decide when to stop the monitoring training and to choose which loss threshold performs best, 10% of the training data used for the base-model is adapted as validation set. The result of the classification task performed by the monitoring model shows whether the data induces a higher loss than the given threshold on the base-model. Chosen threshold values are based on the distribution of the loss for predictions by the base-model.</p>
        <p>The overall performance of the base- and monitoring-model stack was evaluated on the official test set, which was not used in any of the training procedures. This is finally compared to the performance of the base-model on the same test data.</p>
      </sec>
      <sec id="sec-2-3">
        <title>5.4. Results</title>
        <sec id="sec-2-3-1">
          <title>Loss-Based Monitoring</title>
          <p>The lowest loss for a sample that is incorrectly predicted is 0.699 for GTSRB and 0.696 for F-MNIST, which places both near the theoretical minimum given by Eq. (<xref ref-type="bibr" rid="ref7">7</xref>). The maximum loss of correctly predicted samples is determined to be 1.757 for GTSRB and 1.353 for F-MNIST, which is smaller than the given upper bound from Eq. (8). The confusion matrix of the trained monitoring model for F-MNIST is shown in Fig. 4 (a) in combination with confusion matrices of the base-model, due to the limited amount of classes. Data which is predicted &#8220;negative&#8221; is passed from the monitoring to the base-model for inference. Images that are flagged as &#8220;positive&#8221; are discarded, due to them being considered unsafe inputs.</p>
          <p>[Figure 4: (a) Confusion matrix of the monitoring model for F-MNIST (actual vs. predicted &#8220;negative&#8221; and &#8220;positive&#8221;). (b) Class-wise confusion over the ten F-MNIST classes (T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot).]</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>Confusion Matrices (RQ1)</title>
          <p>Figure 4 shows that for F-MNIST the error-rates of all classes except the &#8220;coat&#8221; class improved. The classes &#8220;t-shirt/top&#8221;, &#8220;pullover&#8221;, &#8220;coat&#8221; and &#8220;shirt&#8221; are discarded the most. Additionally, those are the classes with the lowest accuracy in the base-model. Since there are 43 classes for the GTSRB, no numbers are given, but both the base and the monitored confusion matrix are color coded as shown in Fig. 7. Individual classes can be identified by Fig. 6. Comparing both matrices shows that less misclassification occurred. While three classes were fully rejected, the error-rate was improved for all remaining classes except for one.</p>
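<p>At inference time the monitoring model gates the base-model as formalized in Eq. (2). A sketch of this stack with toy stand-ins for both networks (all names and the toy decision rules are our own illustration, not the trained models from the experiments) could look as follows.</p>
<preformat>
```python
REJECT = object()  # sentinel for abstained predictions

def monitored_predict(x, base_model, monitor, boundary=0.5):
    """Reject x if the monitoring model flags it as "positive" (Eq. 2),
    otherwise force a prediction from the base-model."""
    p_positive = monitor(x)          # probability that x induces a high loss
    if p_positive >= boundary:       # g*(x) = 1 -> abstain
        return REJECT
    return base_model(x)             # g*(x) = 0 -> forced prediction

# toy stand-ins for the two networks
base_model = lambda x: "stop_sign" if x > 0 else "speed_limit"
monitor    = lambda x: 0.05 if abs(x) > 0.1 else 0.9  # inputs near 0 = unsafe

print(monitored_predict(1.0, base_model, monitor))             # stop_sign
print(monitored_predict(0.01, base_model, monitor) is REJECT)  # True
```
</preformat>
<p>Varying the binary decision boundary of the monitoring model traces out the operation points discussed below.</p>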
          <p>Softmax
0 0.02 0.04 0.06 0.08 0.1 0.12</p>
          <p>
            Remaining error rate
(a)
isno 1
it
c
red 0.8
p
t
c
rre 0.6
o
c
n
i
lla 0.4
r
e
fov 0.2
o
o
tiaR 0
0.2
0.4 0.6 0.8
Output probability
(b)
1
0.2
0.4 0.6 0.8
Output probability
(c)
1
does not reflect the efectiveness of the method. Instead,
two metrics are adapted from [6]. The remaining error
rate (rer) and remaining accuracy rate (rar) give the
errorrate and correctly classified rate, respectively, relative to
the overall input data including the rejected portion. The
rer can be expressed with risk when calculated by the
0-1 loss and coverage via Eq. (
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) and Eq. (
            <xref ref-type="bibr" rid="ref4">4</xref>
            ) as
          </p>
          <p>= ˆ · ˆ
while the rar can be expressed as
 = ˆ · (1 − ˆ) = ˆ − .</p>
          <p>(9)
(10)</p>
          <p>RQ2. Figure 5 (a) and Fig. 8 (a) represent the rar against
rer for both datasets and multiple monitoring networks
where the decision boundary  of the binary
classification was varied. Rejection based on softmax values are
given as comparison. Each point on a function
represents a possible operation point and corresponds to a
diferent binary decision boundary value. Due to the
(a) (b) counter-intuitive behaviour it has to be mentioned that
Figure 7: (a) Base confusion matrix. (b) Monitored confusion while the decision boundary of the monitoring model is
matrix with t=0.005. Columns are actual while rows are pre- increased, the rer and rar increases since less samples
dicted classes where values are given in relative color code, are rejected. When the decision boundary is  ≥ 1, all
increasing from white (equal to zero) to red to blue (equal samples are kept which is analog to not applying the
to 100% of the given class). Lines are separating the classes monitoring network resulting in rer beeing equal to the
as grouped in Fig. 6 which does not match with the oficial base error-rate and  +  = 1. However, for the
numbering of classes. Best viewed in color. Illustration style softmax activation function it is vice versa, increasing
adapted to own data from [24] the decision boundary rejects more samples with higher
output probability, leading to a decrease in rer and rar.</p>
          <p>RQ3. For the marked operation points in Fig. 5 (a) and
remaining classes except for one. Fig. 8 (a), Fig. 5 (b) and Fig. 8 (b) depict the gap between</p>
          <p>Coverage and Error Trade-Of. To evaluate the se- incorrect predictions and caught ones for all output
problective prediction it was decided against ROC-curves abilities. In contrast, Fig. 5 (c) and Fig. 8 (c) show the
since the separation of “positive” and “negative” samples ratio of rejected but correct classifications compared to
all correct predictions.
0.2
0.4 0.6 0.8
Output probability
(b)
1
0.2
0.4 0.6 0.8
Output probability
(c)
1
Table 2 in a lower error-rate by rejecting less samples this
sugResults for the GTSRB and F-MNIST with  = 0.005 and gests that it is hard to classify those classes correctly and
 = 0.1 for the monitoring while  = 0.8 and  = 0.9 for the base-model fails there in a diferent, more
fundamenrejection based on softmax, respectively. The binary decision tal, way.
boundary is  = 0.5 for the monitoring models. All values For the GTSRB absolute values of the confusion
magiven in %. trices will not be discussed in detail since they cannot
Dataset GTSRB F-MNIST be interpreted from given Fig. 7 due to too many classes.
rMero(nbiatsoer-imngodMelo)del 2.11 (9.52) 2.08 (11.96) However, it is clearly visible that in the monitored case
rCaaru(gbhastee-rmroordse(lr)ejection rate) 7573..8271 ((4940..6498)) 8620..6813 ((3878..0094)) less misclassification occur overall. Confusion between
rrSHeaoirrgftm((hbbaaapxssreeoT--bmhmarbooeiddlsieehtlly)o)eldrrors caught 822..8279 ✓((99.05.24)8) 618.9.948✓((1818.9.064)) “csopnefeudsiloimnibtestigwnese”nshoothwesrosnulbyglriottulepsimisprdoevcermeaesnetd.wThhilee
[Table 2 excerpt: caught errors (rejection rate): 75.95 (14.81) vs. 83.77 (29.04); high-probability errors caught: yes, for both compared operation points.]</p>
          <p>RQ4. Table 2 summarizes the results, where operation points of similar rer values are compared. Rejection occurred over the whole range of output probabilities when the monitoring-model was applied.</p>
          <p>5.5. Discussion</p>
          <p>Loss-Based Monitoring. The separation of samples depends on the chosen loss threshold. Setting the threshold just below the lowest loss among the incorrect predictions would guarantee separating a maximum of correct predictions from all incorrect ones. While this is true in theory, it was determined that the monitoring-model does not properly learn a separation of the two classes when the threshold is set near this value, as seen for the operation points in Fig. 5 (a) and Fig. 8 (a) for a threshold of 0.6. In this work it was determined that the threshold needs to be set depending on the loss distribution of the base-model, such that a sufficient number of samples corresponding to the positive class is present.</p>
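          <p>As an illustration, the loss-based labeling and the distribution-dependent threshold choice described above can be sketched as follows. The percentile heuristic and all numeric values are illustrative assumptions, not the exact procedure of this work:</p>

```python
import numpy as np

def label_for_monitoring(per_sample_loss, threshold):
    """Label samples for training the monitoring-model:
    1 (positive, i.e. reject) if the base-model loss exceeds the threshold,
    0 (negative, i.e. accept) otherwise."""
    return (per_sample_loss > threshold).astype(int)

def pick_threshold(per_sample_loss, positive_fraction=0.2):
    # Hypothetical heuristic: derive the threshold from the base-model's
    # loss distribution so that a sufficient share of positive samples
    # is present, rather than placing it at the extreme of the losses.
    return np.quantile(per_sample_loss, 1.0 - positive_fraction)

# Toy per-sample losses of a base-model on held-out data.
losses = np.array([0.05, 0.10, 0.20, 0.90, 1.50])
thr = pick_threshold(losses, positive_fraction=0.4)
labels = label_for_monitoring(losses, thr)
```

          <p>With the percentile-based choice, the two high-loss samples become positives while the low-loss samples stay negatives; placing the threshold at the very edge of the loss distribution would instead leave too few positives to learn from.</p>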
          <p>Confusion Matrices (RQ1). Analyzing the confusion matrices for F-MNIST reveals that the classes “coat” and “shirt” are discarded the most. By this, the error-rate for “coat” even increases, while all other classes resulted in a reduced error-rate. For GTSRB, the classes that are rejected completely are among the few with a relative class frequency of less than 1.0% [24]. Such problems could be eliminated by tackling the unbalanced distribution problem; however, for this work the approach shall be analyzed without heavy interventions. Why two classes in the “danger signs” subclass clearly increase in error-rate is unknown, since the base-model showed relatively good performance for those.</p>
          <p>RQ1. For both datasets, classes with poor base performance were rejected, which shows that the monitoring detects weak points. However, for GTSRB this led to a slight deterioration of two individual classes that the base-model was able to classify with a low error-rate.</p>
          <p>Coverage and Error Trade-Off (RQ4). When analyzing the rar vs. rer trade-off, it is visible that for GTSRB it is more challenging to discriminate between positive and negative samples than for F-MNIST, since the graph shows a faster-increasing gradient while approaching lower rer. This can be explained by GTSRB consisting of higher-dimensional images. Additionally, there are 43 classes where a high loss can be present for relatively clearly separated, correct decisions, meaning predictions with a maximum probability far away from the second-highest score.</p>
          <p>RQ2. Overall, one can state that the monitoring efficiently improves safety as long as the relative reduction of the rer is greater than the relative reduction of the rar. This holds for the investigated operation points. If it does not hold, the result of rejection is only as efficient as discarding samples by pure chance.</p>
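          <p>The RQ2 criterion, that rejection only pays off when the relative rer reduction exceeds the relative rar reduction, can be checked with a few lines of arithmetic. The numbers below are illustrative and not taken from Table 2:</p>

```python
def rejection_pays_off(rar_before, rar_after, rer_before, rer_after):
    """True when the relative reduction of the remaining error rate (rer)
    exceeds the relative reduction of the remaining accuracy rate (rar);
    otherwise rejection is no better than discarding samples at random."""
    rel_rer_reduction = (rer_before - rer_after) / rer_before
    rel_rar_reduction = (rar_before - rar_after) / rar_before
    return rel_rer_reduction > rel_rar_reduction

# Illustrative operation point: rer halves while rar drops only ~4%.
beneficial = rejection_pays_off(0.95, 0.91, 0.08, 0.04)
```

          <p>In the illustrative case the rer is halved at a rar cost of roughly four percent relative, so the criterion is satisfied; if both rates shrank by the same relative amount, the rejection would be no better than chance.</p>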
          <p>RQ3. The ratio of caught samples against output probability, as depicted in Fig. 5 (b) and Fig. 8 (b), confirms that the goal of discarding over-confident predictions is reached. The graphs are continuously increasing and show no area with zero gradient. This implies that cases with various output probabilities were caught. In contrast, rejection based on the softmax threshold inherently missed over-confident errors but did not discard any high-probability correct decisions. The latter explains its better rar trade-off.</p>
          <p>RQ4. While for F-MNIST the gradient of the ratio of correct samples grows much faster than that of discarded samples, the GTSRB dataset revealed a bigger portion of correct samples being rejected. This is in accordance with the harsher rar trade-off. Overall, the monitoring is less prone to rejecting highly confident correct cases.</p>
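          <p>The rar and rer quantities used throughout this discussion can be computed from a reject mask. The definitions below, accepted-correct and accepted-incorrect counts normalized by the total number of samples, are one common convention and may differ in detail from the exact formulas of this work:</p>

```python
import numpy as np

def remaining_rates(correct, rejected):
    """Remaining accuracy rate (rar) and remaining error rate (rer).

    correct:  boolean array, True where the base-model prediction is right.
    rejected: boolean array, True where the monitoring discards the sample.
    Both rates are normalized by the total sample count (assumed convention).
    """
    accepted = ~rejected
    n = len(correct)
    rar = np.sum(correct & accepted) / n
    rer = np.sum(~correct & accepted) / n
    return rar, rer

# Toy run: five samples, two of them rejected by the monitoring.
correct = np.array([True, True, True, False, False])
rejected = np.array([False, False, True, True, False])
rar, rer = remaining_rates(correct, rejected)
```

          <p>Here rejecting one error and one correct sample leaves two correct and one erroneous accepted prediction, so both rates drop relative to the unfiltered model; the trade-off analysis compares these drops.</p>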
        </sec>
        <sec id="sec-2-3-3">
          <title>Acknowledgments</title>
          <p>This work was funded by the German Federal Ministry of Education and Research (BMBF) within the KI-ASIC project (16ES0993). We thank Infineon Technologies AG for supporting this research.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>5.6. Limitations</title>
        <sec id="sec-2-4-1">
          <p>A key aspect in the training of the monitoring is that it shall be based on data that the base-model was not trained on, in order to uncover weak points. In this work, the latter is accomplished with a portion of the base training data. One can argue that the base-model would show higher performance if this withheld data were available during training. In the following, this is discussed separately for both datasets. The error-rate of state-of-the-art networks for F-MNIST is reported as 3.09% with 3.2 M parameters [26], orders of magnitude more parameters than the model investigated here, which shows that reducing the error-rate of the base-model to such levels is not an easy task. Given the size of the used model, obtaining such improvements through training alone is excluded, but they can be obtained by applying the monitoring. For GTSRB, other works [30] report an error-rate smaller than 1.0% using the LeNet architecture. However, their focus is an enhanced architecture and a pre-processing of the data that tackles the unbalanced class distribution and image quality. A common method is data augmentation [31] to alter the class distribution. The gained performance is expected to rely heavily on such pre-processing, which was consciously excluded in this work in order not to rely on balanced classes or other modifications that may induce bias.</p>
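          <p>For reference, class-rebalancing by augmentation, the pre-processing deliberately omitted here, can be sketched as follows. The shift-based augmentation and the oversample-to-largest-class policy are hypothetical illustrations, not the procedure of [31]:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    # Hypothetical minimal augmentation: shift the image by one pixel
    # in a random direction, with wrap-around for simplicity.
    shift = tuple(rng.integers(-1, 2, size=2))
    return np.roll(image, shift, axis=(0, 1))

def oversample_to_balance(images, labels):
    """Duplicate-and-augment minority classes until every class reaches
    the size of the largest class, altering the class distribution."""
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    out_imgs, out_lbls = [images], [labels]
    for cls, cnt in zip(classes, counts):
        if cnt == target:
            continue  # class is already at the target size
        idx = np.where(labels == cls)[0]
        extra = rng.choice(idx, size=target - cnt, replace=True)
        out_imgs.append(np.stack([augment(images[i]) for i in extra]))
        out_lbls.append(np.full(target - cnt, cls))
    return np.concatenate(out_imgs), np.concatenate(out_lbls)

# Toy unbalanced set: two samples of class 0, one of class 1.
images = np.zeros((3, 4, 4))
labels = np.array([0, 0, 1])
balanced_images, balanced_labels = oversample_to_balance(images, labels)
```

          <p>After rebalancing, both classes contribute equally many samples, which is exactly the kind of distribution shift that was avoided in this work to prevent induced bias.</p>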
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion</title>
      <sec id="sec-3-1">
        <p>We motivated the need to reduce the error-rate of a base-model by catching over-confident errors due to their safety-critical nature. We contributed a loss-based labeling that reflects the weak points of the base-model and proposed to train a monitoring on the generated dataset to improve the well-known rejection procedure. We applied this approach to the GTSRB and F-MNIST datasets and compared it to rejection based on the softmax activation function. The presented empirical results showed that the error-rate was improved and over-confident predictions were successfully caught. We discussed that while rejection based on a softmax threshold shows a better remaining accuracy rate trade-off, the range of output probabilities for caught samples is bigger for the monitoring approach. The shown results serve as a proof-of-concept for the approach, which targets safety-critical domains. We believe that the methodology can be used for a variety of models and datasets.</p>
        <p>[8] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1321–1330.</p>
        <p>[9] J. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, D. Tran, Measuring calibration in deep learning, in: CVPR Workshops, volume 2, 2019.</p>
        <p>[10] C.-K. Chow, An optimum character recognition system using decision functions, IRE Transactions on Electronic Computers EC-6 (1957) 247–254.</p>
        <p>[11] P. L. Bartlett, M. H. Wegkamp, Classification with a reject option using a hinge loss, Journal of Machine Learning Research 9 (2008).</p>
        <p>[12] Y. Geifman, R. El-Yaniv, SelectiveNet: A deep neural network with an integrated reject option, in: International Conference on Machine Learning, PMLR, 2019, pp. 2151–2159.</p>
        <p>[13] N. Manwani, K. Desai, S. Sasidharan, R. Sundararajan, Double ramp loss based reject option classifier, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2015, pp. 151–163.</p>
        <p>[14] H. Jiang, B. Kim, M. Y. Guan, M. Gupta, To trust or not to trust a classifier, arXiv preprint arXiv:1805.11783 (2018).</p>
        <p>[15] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: A survey, ACM Computing Surveys (CSUR) 41 (2009) 1–58.</p>
        <p>[16] A. Meinke, M. Hein, Towards neural networks that provably know when they don't know, 2020. arXiv:1909.12180.</p>
        <p>[17] I. Kononenko, Bayesian neural networks, Biological Cybernetics 61 (1989) 361–370.</p>
        <p>[18] C. Chow, On optimum recognition error and reject tradeoff, IEEE Transactions on Information Theory 16 (1970) 41–46.</p>
        <p>[19] C. M. Santos-Pereira, A. M. Pires, On optimal reject rules and ROC curves, Pattern Recognition Letters 26 (2005) 943–952.</p>
        <p>[20] L. P. Cordella, C. De Stefano, F. Tortorella, M. Vento, A method for improving classification reliability of multilayer perceptrons, IEEE Transactions on Neural Networks 6 (1995) 1140–1147.</p>
        <p>[21] C. De Stefano, C. Sansone, M. Vento, To reject or not to reject: that is the question - an answer in case of neural classifiers, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 30 (2000) 84–94.</p>
        <p>[22] Y. Geifman, R. El-Yaniv, Selective classification for deep neural networks, arXiv preprint arXiv:1705.08500 (2017).</p>
        <p>[23] C. M. Bishop, N. M. Nasrabadi, Pattern Recognition and Machine Learning, volume 4, Springer, 2006.</p>
        <p>[24] J. Stallkamp, M. Schlipsing, J. Salmen, C. Igel, The German Traffic Sign Recognition Benchmark: a multi-class classification competition, in: The 2011 International Joint Conference on Neural Networks, IEEE, 2011, pp. 1453–1460.</p>
        <p>[25] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747 (2017).</p>
        <p>[26] M. S. Tanveer, M. U. K. Khan, C.-M. Kyung, Fine-tuning DARTS for image classification, in: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 4789–4796.</p>
        <p>[27] Y. LeCun, The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/ (1998).</p>
        <p>[28] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998) 2278–2324.</p>
        <p>[29] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.</p>
        <p>[30] A. Zaibi, A. Ladgham, A. Sakly, A lightweight model for traffic sign classification based on enhanced LeNet-5 network, Journal of Sensors 2021 (2021).</p>
        <p>[31] L. Perez, J. Wang, The effectiveness of data augmentation in image classification using deep learning, arXiv preprint arXiv:1712.04621 (2017).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mané</surname>
          </string-name>
          ,
          <article-title>Concrete problems in ai safety</article-title>
          ,
          <source>arXiv preprint arXiv:1606.06565</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Burton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gauerhof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Heinzemann</surname>
          </string-name>
          ,
          <article-title>Making the case for safety of machine learning in highly automated driving</article-title>
          , in: International Conference on Computer Safety, Reliability, and Security, Springer,
          <year>2017</year>
          , pp.
          <fpage>5</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Koopman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <article-title>Autonomous vehicle safety: An interdisciplinary challenge</article-title>
          ,
          <source>IEEE Intelligent Transportation Systems Magazine</source>
          <volume>9</volume>
          (
          <year>2017</year>
          )
          <fpage>90</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Salay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Queiroz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Czarnecki</surname>
          </string-name>
          ,
          <article-title>An analysis of iso 26262: Using machine learning safely in automotive software</article-title>
          ,
          <source>arXiv preprint arXiv:1709.02435</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gabreau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pesquet-Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kaakai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lefèvre</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence for future skies: On-going standardization activities to build the next certification/approval framework for airborne and ground aeronautic products</article-title>
          ., in: AISafety@ IJCAI,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Henne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwaiger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roscher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <article-title>Benchmarking uncertainty estimation methods for deep learning with safety-related metrics</article-title>
          ., in: SafeAI@ AAAI,
          <year>2020</year>
          , pp.
          <fpage>83</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          ,
          <article-title>Dropout as a bayesian approximation: Representing model uncertainty in deep learning</article-title>
          ,
          <source>in: international conference on machine learning, PMLR</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1050</fpage>
          -
          <lpage>1059</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>