<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improvement of Rejection for AI Safety through Loss-Based Monitoring</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Scholz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florian Hauer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Klaus Knobloch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Mayr</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Infineon Technologies Dresden</institution>
          ,
          <addr-line>Königsbrücker Str. 180, 01099 Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Infineon Technologies München</institution>
          ,
          <addr-line>Am Campeon 1-15, 85579 Neubiberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Technische Universität Dresden, Chair of Highly-Parallel VLSI Systems and Neuro-Microelectronics</institution>
          ,
          <addr-line>Mommsenstr. 12, 01069 Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>There are numerous promising applications for AI which are safety-critical, e.g. computer vision for automated driving. This requires safety measures for the underlying algorithm. Typically, the validity of a classification is judged solely on the output probability of a network. Literature suggests that by rejecting classifications below an a-priori set probability threshold, the error rate of the network can be reduced. This inherently does not catch those errors where the output probability of wrong classifications exceeds such a threshold. However, these are the most critical errors, since the system is erroneously overconfident. To solve this problem and close the gap, we present how this rejection idea can be improved by performing loss-based rejection. Our approach takes data as well as the pre-trained base-model as input and yields a monitoring model as output. For training of the monitoring model, the data samples are labeled based on the loss resulting from the base-model. This way, overconfident misclassifications can be avoided and the overall error rate reduced. As evaluation, we applied the approach to two datasets, one of which is the German Traffic Sign Recognition Benchmark (GTSRB) that is used to train safety-critical traffic sign classifiers. The experiments show that this approach yields results that improve the error-rate by up to an order of magnitude, while a portion of inputs is rejected as trade-off.</p>
      </abstract>
      <kwd-group>
<kwd>Rejection</kwd>
        <kwd>AI Safety</kwd>
        <kwd>Robustness</kwd>
        <kwd>Classification</kwd>
        <kwd>Neural Networks</kwd>
        <kwd>Representation Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Artificial neural networks (ANN) are deployed for a variety of tasks. If the present trend persists, they will be more frequently included in safety-critical decisions in fields such as medical diagnosis or automated driving. Therefore, the safety of AI is important and already broadly discussed [1, 2, 3].</p>
      <p>For safety-critical domains, e.g. the automotive domain, there exists no suitable standard for the safety assessment of ANNs yet [4, 5]. Future standardized safety assessments might aim for a high test set accuracy or ensure a low error-rate. The latter is especially important for safety-critical applications. Error-rates directly result from the accuracy if and only if a prediction is forced in all cases and no reject option exists. For this work, a model is considered safer than another model for the same task if the relative error-rate is reduced using the same evaluation method in the same testing context.</p>
      <p>Upon integration in a running system, possibly combined with multiple sensors and algorithms, it must be judged whether single predictions of an ANN are trustworthy. This is mandatory to decide whether actions are performed based on those predictions or a verified safety path shall be used as a fallback solution [6]. Especially when the softmax activation function is applied at the output, the resulting values are interpreted as probabilities and might be mistakenly used as a confidence measure of the given prediction [7]. It has been shown that when those networks are trained with multinomial cross-entropy loss, a tendency towards over-confident decisions exists [8]. Usually the term &#8220;over-confident predictions&#8221; implies that the error-rate does not match the reported output probability for a given prediction. In the following, &#8220;over-confident predictions&#8221; includes cases with relatively high output probabilities that might reflect the true error-rate but are incorrectly predicted. Such errors are the worst kind of failures that an ANN may produce. A deployed system might rely on the reported high confidence of the model and will not be able to maintain a safe state without any additional mechanism.</p>
      <p>Problem: A decision mechanism is required to decide whether or not a prediction should be forced on an input, when it is necessary to reduce the error-rate beyond a model&#8217;s performance.</p>
      <p>Different methods are present to approach these problems. Proper calibration [8, 9] enables reducing the impact of over-confident errors. However, noting the extended definition of over-confident errors in this work, it does not address the actual issue. Other works [10, 11, 12, 13] address the issue directly by rejecting samples. Some use a selective prediction score that is built-in and improved during the initial training. In addition, there exist approaches that result in mathematical heuristics upon which the decisions to abstain are made. However, output probabilities for which rejections are expected are rarely reported. Decreasing the error-rate solely by discarding decisions which report a low output probability is less noteworthy for safety-critical domains, since one would not rely on such decisions in the first place. Summarized, there exists no approach that reduces the error rate of a present blackbox model by rejection based on a trained representation of the model&#8217;s weak points that can detect over-confident errors.</p>
      <p>In the following, the model which is monitored is referred to as the &#8220;base-model&#8221; while the additional one is called the &#8220;monitoring model&#8221;. We close this gap with the following contribution: We present how the well-known rejection procedure can be improved by proposing a loss-based rejection. Our approach yields a monitoring model as output via training centered on the base-model&#8217;s loss. This way, over-confident misclassifications can be avoided and the overall error rate reduced.</p>
      <p>The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022), July 24&#8211;25, 2022, Vienna, Austria. * Corresponding author: daniel.scholz@infineon.com (D. Scholz); florian.hauer@infineon.com (F. Hauer); klaus.knobloch@infineon.com (K. Knobloch); christian.mayr@tu-dresden.de (C. Mayr). &#169; 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>There are multiple approaches which can be considered to have the same goal of improving safety in AI. A trust score was proposed that is supposed to correlate with whether the classification is actually correct. It measures the consensus between the base classifier and a modified nearest-neighbor classifier at test time [14]. For the Digits dataset it was possible to detect trustworthy and suspicious predictions. The authors stated that for higher-dimensional datasets like MNIST the trust score provides only little or no improvement in detecting wrong decisions better than the base model&#8217;s confidence itself.</p>
        <p>Another line of research suggests using the data&#8217;s distribution: present approaches perform anomaly detection on a dataset [15]. This flags specific samples but is purely based on the data and does not include any information about the model or the training process. More specifically, for a fixed dataset but models with different weak points, the same samples would be identified as possible failures, since the models are not part of the evaluation. Similar applies when out-of-distribution detection [16] is performed. It can distinguish between data near and far away from the training distribution, typically corresponding to a different dataset. However, this does not prove an improvement for test data that lies inside the training distribution.</p>
        <p>Bayesian neural networks (BNN) [17] or the application of Monte Carlo Dropout [7] change the classical deterministic fashion of ANNs to a probabilistic nature, therefore equipping the model itself with a built-in robust nature. Both methods increase generalization but do not necessarily perform better on crucial samples. Additionally, there is a computational overhead due to multiple network evaluations to calculate the output for a single input. Intuitively, these works are important, but simply address a different problem.</p>
        <p>Approaches that include rejection [10, 18, 19, 11, 12, 13] show the same underlying principle as in this work. Rejection is commonly trained in combination with the classifier. However, it is advantageous if the monitoring approach that is used to abstain does not necessarily have to be trained together with a base-model, as shown in this work. Additionally, these works commonly focus on the reduction of the error-rate without differentiating between rejected predictions with a low and a high output probability.</p>
        <p>More straightforward methods, like a probability or confidence score threshold under which a prediction is not trusted [18, 20, 21, 19], will be effective in decreasing the overall error rate. However, since such a threshold ideally divides confident and uncertain cases, per definition it will fail to catch over-confident predictions. Considering the above, there exists no model-specific decision boundary for choosing which prediction to trust that also excludes over-confident decisions, leaving the risk of possibly fatal situations for high output probabilities unchanged.</p>
      </sec>
      <sec id="sec-1-1a">
        <title>3. Preliminaries and Formalism</title>
        <p>Selective Prediction. When rejection, also known as selective prediction [22], is performed for a classification problem, it can be formulated as follows. Let X be a feature space with its corresponding class label space Y that enables supervised training. In this work X consists of images. Predictions are obtained by a model f : X &#8594; Y that is trained by minimizing a loss function &#8467; : Y &#215; Y &#8594; R+. A labeled set S_m = {(x_i, y_i)}_{i=1}^{m} &#8838; (X &#215; Y)^m is sampled i.i.d. from P(X, Y), which is the distribution over X &#215; Y. The empirical risk of classifier f is given by r&#770;(f | S_m) &#8796; (1/m) &#8721;_{i=1}^{m} &#8467;(f(x_i), y_i) [22, 12].</p>
        <p>A selective model is the pair of the already defined prediction function f and a selection function g* : X &#8594; {0, 1}, which performs the binary task of abstaining for f. In this work a sample shall be rejected when it is predicted as &#8220;positive&#8221; for a possible fault. To be in accordance with [22], g* is inverted to g:
g(x) = 1 &#8722; g*(x) (<xref ref-type="bibr" rid="ref1">1</xref>)
Therefore, an input x is rejected as follows:
(f, g)(x) &#8796; f(x), if g*(x) = 0; reject, if g*(x) = 1. (<xref ref-type="bibr" rid="ref2">2</xref>)</p>
        <p>Selective prediction can be evaluated by coverage and risk. The empirical coverage, i.e. the ratio of data which is kept, is defined as
c&#770;(g | S_m) &#8796; (1/m) &#8721;_{i=1}^{m} g(x_i). (<xref ref-type="bibr" rid="ref3">3</xref>)
The empirical risk is given by
r&#770;(f, g | S_m) &#8796; [(1/m) &#8721;_{i=1}^{m} &#8467;(f(x_i), y_i) g(x_i)] / c&#770;(g | S_m) (<xref ref-type="bibr" rid="ref4">4</xref>)
which results in the relative error-rate on the covered data when the 0-1 loss function is applied.</p>
        <p>Loss Theory. Loss functions are used to give a metric for the performance of a machine learning model. They are the basis of AI training, since the gradient of those functions dictates the direction of change to the network in every training step. For classification tasks the cross-entropy loss is often used, which is given by
L = &#8722; &#8721;_{i=1}^{M} y_i log(p_i) (<xref ref-type="bibr" rid="ref5">5</xref>)
where L is the resulting loss, i is the index of a class with M being the total number of present classes, y_i is the target value and p_i is the predicted value of the i-th class [23]. When the target value of the correct class c is 1 while all others are 0, as is the case for one-hot encoding, this collapses into the negative log-likelihood and is only dependent on the output value p_c of the correct class:
L = &#8722; log(p_c) (<xref ref-type="bibr" rid="ref6">6</xref>)</p>
      </sec>
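<p>To make the coverage and risk definitions concrete, the following minimal Python sketch (our illustration, not code from the paper) computes the empirical coverage of Eq. (3) and the empirical risk of Eq. (4); with a 0-1 loss the risk equals the relative error-rate on the covered data.</p>
<preformat>
```python
def empirical_coverage(g):
    # g: binary selection per sample, 1 = keep, 0 = reject (Eq. 3)
    return sum(g) / len(g)

def empirical_risk(losses, g):
    # Eq. 4: mean of loss * selection, normalized by coverage.
    # With the 0-1 loss this is the error-rate on the covered data.
    kept_loss = sum(l * gi for l, gi in zip(losses, g)) / len(g)
    cov = empirical_coverage(g)
    return kept_loss / cov if cov > 0 else 0.0

# toy example: 0-1 losses of f on ten samples; reject the last two
loss01 = [0, 0, 1, 0, 0, 0, 1, 0, 1, 1]
keep   = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(empirical_coverage(keep))      # 0.8
print(empirical_risk(loss01, keep))  # 2 errors / 8 kept = 0.25
```
</preformat>
<p>Rejecting two of the four errors here lowers the covered error-rate from 0.4 to 0.25 at a coverage of 0.8, which is exactly the trade-off studied in the remainder of the paper.</p>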
<sec id="sec-1-2">
        <p>From the highest possible p_c &lt; 0.5 in the case of a wrong prediction and the softmax activation function, it directly follows that the lower bound of the loss L_f of a falsely predicted sample is
L_f &gt; &#8722; log(0.5) &#8776; 0.693. (<xref ref-type="bibr" rid="ref7">7</xref>)
Moreover, the upper bound of the loss L_c of a sample which is predicted correctly is given by
L_c &lt; &#8722; log(1/n) (8)
when n classes are present. The upper and lower bound for false and correct predicted cases is infinity and zero, respectively. For n &gt; 2 there exists an overlap of samples that are correctly predicted with low confidence and samples that are incorrectly predicted. Discarding samples whose loss exceeds a set threshold t &lt; L_f evades the overlap.</p>
      </sec>
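<p>The bounds of Eq. (7) and Eq. (8) can be checked numerically. The short sketch below is our own illustration (not the paper's code): it evaluates the negative log-likelihood of Eq. (6) at the boundary cases p_c = 0.5 and p_c = 1/n.</p>
<preformat>
```python
import math

def nll(p_correct):
    # cross-entropy with a one-hot target collapses to -log(p_c) (Eq. 6)
    return -math.log(p_correct)

# Eq. (7): a wrong prediction implies p_c below 0.5, so its loss exceeds
lower_bound_false = nll(0.5)
print(round(lower_bound_false, 3))  # 0.693

# Eq. (8): a correct prediction implies p_c above 1/n, so its loss stays below
for n in (2, 10, 43):               # 43 classes as in GTSRB, 10 as in F-MNIST
    print(n, round(nll(1.0 / n), 3))
# for n above 2 the upper bound -log(1/n) exceeds -log(0.5),
# so correct low-confidence and wrong predictions overlap in loss
```
</preformat>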
    </sec>
    <sec id="sec-2">
      <title>4. Approach</title>
<p>The goal is to detect samples which are being classified incorrectly. It is desired that a pre-trained blackbox model is monitored and that a distinction is possible for the whole range of reported output probabilities. The solution is to distinguish between samples of data that induce a high and a low loss in the given base-model and to abstain from the former cases.</p>
<p>The proposed methodology expects an already trained base-model and additional data which is i.i.d. with the train and test data. The paradigm by which the trained monitoring model is obtained is shown in Fig. 1. The loss-based labeling is depicted in Fig. 2. It is suitable to separate samples by a loss threshold t &lt; L_f as derived from Eq. (<xref ref-type="bibr" rid="ref7">7</xref>), leading to a division between correct and incorrect (over-confident) decisions. When a bigger ratio of samples leading to incorrect compared to correct predictions is rejected, the safety of the AI algorithm is improved.</p>
<p>Contribution. The loss-based labeling provides a dataset in which the weak points of the base-model are embedded, and it enables intercepting incorrect predictions for all output probabilities reported by a blackbox model.</p>
<p>Note. The suggestion to use rejection is not our contribution; it was already proposed in the past [10, 18]. Our focus lies on intercepting over-confident errors and evaluating efficiency for the whole range of output probabilities.</p>
<p>The dataset consists of the unaltered base samples with replaced labels, corresponding to the two classes &#8220;negative&#8221; and &#8220;positive&#8221;. The monitoring network is trained on the mentioned dataset. Upon deployment, the monitoring model performs the binary decision prior to the base-model&#8217;s prediction as depicted in Fig. 3, reducing the error-rate.</p>
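<p>The loss-based labeling step can be sketched as follows. This is a minimal illustration assuming the base-model is available as a function returning output distributions; the function names and the example threshold value are ours, not prescribed by the paper.</p>
<preformat>
```python
import math

def loss_based_labels(probs, targets, t=0.6):
    """Relabel samples for training the monitoring model.

    probs:   per-sample output distributions of the pre-trained base-model
    targets: integer class labels of the same samples
    t:       loss threshold, chosen below -log(0.5) = 0.693 (Eq. 7)

    Returns "positive" (high loss, reject-worthy) or "negative" per sample.
    """
    labels = []
    for p, y in zip(probs, targets):
        loss = -math.log(max(p[y], 1e-12))   # Eq. 6, clipped for stability
        labels.append("positive" if loss > t else "negative")
    return labels

# toy example: three samples, 3-class base-model outputs, true class 0
probs   = [[0.9, 0.05, 0.05],   # confident and correct  -> low loss
           [0.4, 0.3, 0.3],     # unsure, still correct  -> high loss
           [0.2, 0.7, 0.1]]     # over-confident error   -> high loss
targets = [0, 0, 0]
print(loss_based_labels(probs, targets))
# ['negative', 'positive', 'positive']
```
</preformat>
<p>Note that the over-confident error receives the same &#8220;positive&#8221; label as the uncertain case, which is exactly what lets the monitoring model catch errors across the whole range of output probabilities.</p>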
<p>The approach is especially helpful when considering that a trained classifier will have certain latent weak points. Uncovering those would be ideal but might be impossible for blackbox approaches. Even without exactly defining such weak points, the monitoring model can be able to learn a pattern which is present in critical input data. To best acquire the performance of the base-model by the monitoring model, data that the base-model has not seen during training is necessary. Since ANN training aims to reduce errors on the training set, weak points which the model includes may not be detectable at this stage. However, the monitoring model can extract further information on additional data.</p>
      <p>[Figure 2: Loss-based labeling. Correctly predicted samples are labeled &#8220;negative&#8221;, while high-loss samples (partially wrong and wrong predictions) are labeled &#8220;positive&#8221;.]</p>
      <p>[Figure 3: Ideal behaviour of the monitoring model and base-model stack. Light and black boxes depict samples which will be predicted correctly and incorrectly, respectively; incorrect samples are caught by the monitoring model.]</p>
      <p>One has to consider that the monitoring network does not specifically discriminate between correct and incorrect predictions but rejects based on the set loss threshold. Therefore, a performance metric for the monitoring network alone will not give sufficient information. The investigation includes which images are ultimately assigned to an incorrect class. The error-rate is solely reduced by rejecting such samples.</p>
      <sec id="sec-2-1">
        <title>5. Evaluation and Experiments</title>
        <sec id="sec-2-1-1">
          <title>5.1. Evaluation Goals</title>
          <p>This work performs a study of the proposed methodology and compares it to the performance of a decision threshold based on the softmax output, which is known to yield good results for selective prediction [22] but comes with the discussed shortcomings that the presented approach aims to solve. The objective is to answer the following research questions (RQs):</p>
          <p>RQ1. Can rejected inputs be assigned to &#8220;weak points&#8221; of the base-model? Answering this question will not explain the exact features on which the rejection is based. However, it helps to interpret whether the obtained monitoring is acting on a meaningful basis.</p>
          <p>RQ2. Is the reduction of the error-rate by rejection based on the monitoring model better than pure chance? Since even random rejection will improve the error-rate by a factor of the rejection rate, it is important to investigate whether the monitoring results in an improvement higher than this baseline.</p>
          <p>RQ3. Are incorrect decisions caught for the whole range of output probabilities? This work is motivated by catching over-confident decisions, therefore answering this question is a key aspect of the evaluation. Incorrect decisions with high output probabilities might be challenging but are the most important inputs to reject.</p>
          <p>RQ4. What are the differences between both datasets? It is important to point out where differences occur, since this can indicate specific limitations of the method.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>5.2. Datasets</title>
          <p>To evaluate whether the approach can achieve improved safety of a model, the GTSRB [24] is chosen. Since it is an automotive-related dataset, incorrect classifications may lead to fatal decisions. For the sake of minimizing threats to validity, the approach is evaluated on an additional dataset. Fashion-MNIST (F-MNIST) [25] is chosen for this purpose for multiple reasons. First, the classification is based on pictures of real-world objects, which is comparable to GTSRB. Secondly, there are fewer different classes, which allows for a more detailed analysis. Moreover, although F-MNIST is relatively low-dimensional, the SOTA error-rate of over 3% [26] is comparably high. Since [14] showed that detecting wrong predictions may work on simpler datasets but fail on more complex ones, it was decided against classical MNIST [27] for this work. The evaluation is only shown for those two datasets due to space limitations.</p>
          <p>German Traffic Sign Recognition Benchmark. The GTSRB dataset is intended as an automotive-related large multi-category classification benchmark. It consists of 39209 train and 12630 test samples corresponding to 43 different categories. The distribution of classes is highly unbalanced, such that some classes are almost ten times as frequently present as other classes [24]. The images were extracted from video sequences and are supplied as RGB color images of different sizes between 15 &#215; 15 and 222 &#215; 193 pixels. In this work, color channels are kept but normalized to one, and images are resized to 32 &#215; 32 pixels. No action is performed to tackle the unbalanced distribution of classes.</p>
          <p>Fashion-MNIST. The F-MNIST dataset consists of grayscale images based on Zalando&#8217;s article images. There are 60,000 train and 10,000 test samples that are grouped into ten different classes. Each image has a dimension of 28 &#215; 28 pixels. In this work, the images were preprocessed by normalizing the pixel values to one.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>5.3. Experimental Setup</title>
        <p>For the GTSRB dataset a convolutional neural network (CNN) with the LeNet-5 architecture [28] is used, while for the F-MNIST dataset a fully connected (FC) feed-forward neural network is applied, both implemented with the TensorFlow library [29], as listed in Tab. 1.</p>
        <p>Table 1: Parameters of the networks used in the experiments.
Dataset: GTSRB / F-MNIST.
Base-model architecture: LeNet-5 / FC, 1 hidden layer, 128 neurons.
Base-model parameters: 85 k / 100 k.
Base accuracy: 90.48% / 88.04%.
Monitoring architecture: LeNet-5 / FC, 2 hidden layers, 32 neurons each.
Monitoring parameters: 85 k / 26 k.
Training samples (portion of training data): 7,840 (20%) / 10,000 (17%).</p>
        <p>Ratios of training data for the monitoring model were kept to a minimum to have minor influence on the base-model training. However, a sufficient absolute number of images is needed to enable successful training convergence. To decide when to stop the monitoring training and to choose which loss threshold performs best, 10% of the training data used for the base-model is adapted as validation set. The result of the classification task performed by the monitoring model shows whether the data induces a higher loss than the given threshold on the base-model. Chosen threshold values are based on the distribution of the loss for predictions by the base-model.</p>
        <p>The overall performance of the base- and monitoring-model stack was evaluated on the official test set, which was not used in any of the training procedures. This is finally compared to the performance of the base-model on the same test data.</p>
      </sec>
      <sec id="sec-2-3">
        <title>5.4. Results</title>
        <sec id="sec-2-3-1">
          <title>Loss-Based Monitoring</title>
          <p>The lowest loss for a sample that is incorrectly predicted is 0.699 for GTSRB and 0.696 for F-MNIST, which places both near the theoretical minimum given by Eq. (<xref ref-type="bibr" rid="ref7">7</xref>). The maximum loss of correctly predicted samples is determined to be 1.757 for GTSRB and 1.353 for F-MNIST, which is smaller than the given upper bound from Eq. (8). The confusion matrix of the trained monitoring model for F-MNIST is shown in Fig. 4 (a) in combination with confusion matrices of the base-model, due to the limited amount of classes. Data which is predicted &#8220;negative&#8221; is passed from the monitoring to the base-model for inference. Images that are flagged as &#8220;positive&#8221; are discarded, due to them being considered unsafe inputs.</p>
          <p>[Figure 4: (a) Confusion matrix of the monitoring model for F-MNIST (actual vs. predicted &#8220;negative&#8221; and &#8220;positive&#8221;). (b) Class-wise confusion over the ten F-MNIST classes (T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot).]</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>Confusion Matrices (RQ1)</title>
          <p>Figure 4 shows that for F-MNIST the error-rates of all classes except the &#8220;coat&#8221; class improved. The classes &#8220;t-shirt/top&#8221;, &#8220;pullover&#8221;, &#8220;coat&#8221; and &#8220;shirt&#8221; are discarded the most. Additionally, those are the classes with the lowest accuracy in the base-model. Since there are 43 classes for the GTSRB, no numbers are given, but both the base and the monitored confusion matrix are color coded as shown in Fig. 7. Individual classes can be identified by Fig. 6. Comparing both matrices shows that less misclassification occurred. While three classes were fully rejected, the error-rate was improved for all remaining classes except for one.</p>
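<p>At inference time the monitoring model gates the base-model as formalized in Eq. (2). A sketch of this stack with toy stand-ins for both networks (all names and the toy decision rules are our own illustration, not the trained models from the experiments) could look as follows.</p>
<preformat>
```python
REJECT = object()  # sentinel for abstained predictions

def monitored_predict(x, base_model, monitor, boundary=0.5):
    """Reject x if the monitoring model flags it as "positive" (Eq. 2),
    otherwise force a prediction from the base-model."""
    p_positive = monitor(x)          # probability that x induces a high loss
    if p_positive >= boundary:       # g*(x) = 1 -> abstain
        return REJECT
    return base_model(x)             # g*(x) = 0 -> forced prediction

# toy stand-ins for the two networks
base_model = lambda x: "stop_sign" if x > 0 else "speed_limit"
monitor    = lambda x: 0.05 if abs(x) > 0.1 else 0.9  # inputs near 0 = unsafe

print(monitored_predict(1.0, base_model, monitor))             # stop_sign
print(monitored_predict(0.01, base_model, monitor) is REJECT)  # True
```
</preformat>
<p>Varying the binary decision boundary of the monitoring model traces out the operation points discussed below.</p>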
          <p>Softmax
0 0.02 0.04 0.06 0.08 0.1 0.12</p>
          <p>
            Remaining error rate
(a)
isno 1
it
c
red 0.8
p
t
c
rre 0.6
o
c
n
i
lla 0.4
r
e
fov 0.2
o
o
tiaR 0
0.2
0.4 0.6 0.8
Output probability
(b)
1
0.2
0.4 0.6 0.8
Output probability
(c)
1
does not reflect the efectiveness of the method. Instead,
two metrics are adapted from [6]. The remaining error
rate (rer) and remaining accuracy rate (rar) give the
errorrate and correctly classified rate, respectively, relative to
the overall input data including the rejected portion. The
rer can be expressed with risk when calculated by the
0-1 loss and coverage via Eq. (
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) and Eq. (
            <xref ref-type="bibr" rid="ref4">4</xref>
            ) as
          </p>
          <p>= ˆ · ˆ
while the rar can be expressed as
 = ˆ · (1 − ˆ) = ˆ − .</p>
          <p>(9)
(10)</p>
          <p>RQ2. Figure 5 (a) and Fig. 8 (a) represent the rar against
rer for both datasets and multiple monitoring networks
where the decision boundary  of the binary
classification was varied. Rejection based on softmax values are
given as comparison. Each point on a function
represents a possible operation point and corresponds to a
diferent binary decision boundary value. Due to the
(a) (b) counter-intuitive behaviour it has to be mentioned that
Figure 7: (a) Base confusion matrix. (b) Monitored confusion while the decision boundary of the monitoring model is
matrix with t=0.005. Columns are actual while rows are pre- increased, the rer and rar increases since less samples
dicted classes where values are given in relative color code, are rejected. When the decision boundary is  ≥ 1, all
increasing from white (equal to zero) to red to blue (equal samples are kept which is analog to not applying the
to 100% of the given class). Lines are separating the classes monitoring network resulting in rer beeing equal to the
as grouped in Fig. 6 which does not match with the oficial base error-rate and  +  = 1. However, for the
numbering of classes. Best viewed in color. Illustration style softmax activation function it is vice versa, increasing
adapted to own data from [24] the decision boundary rejects more samples with higher
output probability, leading to a decrease in rer and rar.</p>
          <p>RQ3. For the marked operation points in Fig. 5 (a) and
remaining classes except for one. Fig. 8 (a), Fig. 5 (b) and Fig. 8 (b) depict the gap between</p>
          <p>Coverage and Error Trade-Of. To evaluate the se- incorrect predictions and caught ones for all output
problective prediction it was decided against ROC-curves abilities. In contrast, Fig. 5 (c) and Fig. 8 (c) show the
since the separation of “positive” and “negative” samples ratio of rejected but correct classifications compared to
all correct predictions.
0.2
0.4 0.6 0.8
Output probability
(b)
1
0.2
0.4 0.6 0.8
Output probability
(c)
1
Table 2 in a lower error-rate by rejecting less samples this
sugResults for the GTSRB and F-MNIST with  = 0.005 and gests that it is hard to classify those classes correctly and
 = 0.1 for the monitoring while  = 0.8 and  = 0.9 for the base-model fails there in a diferent, more
fundamenrejection based on softmax, respectively. The binary decision tal, way.
boundary is  = 0.5 for the monitoring models. All values For the GTSRB absolute values of the confusion
magiven in %. trices will not be discussed in detail since they cannot
Dataset GTSRB F-MNIST be interpreted from given Fig. 7 due to too many classes.
rMero(nbiatsoer-imngodMelo)del 2.11 (9.52) 2.08 (11.96) However, it is clearly visible that in the monitored case
rCaaru(gbhastee-rmroordse(lr)ejection rate) 7573..8271 ((4940..6498)) 8620..6813 ((3878..0094)) less misclassification occur overall. Confusion between
rrSHeaoirrgftm((hbbaaapxssreeoT--bmhmarbooeiddlsieehtlly)o)eldrrors caught 822..8279 ✓((99.05.24)8) 618.9.948✓((1818.9.064)) “csopnefeudsiloimnibtestigwnese”nshoothwesrosnulbyglriottulepsimisprdoevcermeaesnetd.wThhilee
[Table 2 excerpt: caught errors (rejection rate): 75.95 (14.81) vs. 83.77 (29.04); high-probability errors caught: yes, for both compared operation points.]</p>
          <p>RQ4. Table 2 summarizes the results, where operation points of similar rer values are compared. Rejection occurred over the whole range of output probabilities when the monitoring-model was applied.</p>
          <p>5.5. Discussion</p>
          <p>Loss-Based Monitoring. The separation of samples depends on the chosen loss threshold. Setting the threshold just below the lowest loss among the incorrect predictions would guarantee separating a maximum of correct predictions from all incorrect ones. While this is true in theory, it was determined that the monitoring-model does not properly learn a separation of the two classes when the threshold is set near this value, as seen for the operation points in Fig. 5 (a) and Fig. 8 (a) for a threshold of 0.6. In this work it was determined that the threshold needs to be set depending on the loss distribution of the base-model, such that a sufficient number of samples corresponding to the positive class is present.</p>
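          <p>As an illustration, the loss-based labeling and the distribution-dependent threshold choice described above can be sketched as follows. The percentile heuristic and all numeric values are illustrative assumptions, not the exact procedure of this work:</p>

```python
import numpy as np

def label_for_monitoring(per_sample_loss, threshold):
    """Label samples for training the monitoring-model:
    1 (positive, i.e. reject) if the base-model loss exceeds the threshold,
    0 (negative, i.e. accept) otherwise."""
    return (per_sample_loss > threshold).astype(int)

def pick_threshold(per_sample_loss, positive_fraction=0.2):
    # Hypothetical heuristic: derive the threshold from the base-model's
    # loss distribution so that a sufficient share of positive samples
    # is present, rather than placing it at the extreme of the losses.
    return np.quantile(per_sample_loss, 1.0 - positive_fraction)

# Toy per-sample losses of a base-model on held-out data.
losses = np.array([0.05, 0.10, 0.20, 0.90, 1.50])
thr = pick_threshold(losses, positive_fraction=0.4)
labels = label_for_monitoring(losses, thr)
```

          <p>With the percentile-based choice, the two high-loss samples become positives while the low-loss samples stay negatives; placing the threshold at the very edge of the loss distribution would instead leave too few positives to learn from.</p>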
          <p>Confusion Matrices (RQ1). Analyzing the confusion matrices for F-MNIST reveals that the classes “coat” and “shirt” are discarded the most. By this, the error-rate for “coat” even increases, while all other classes resulted in a reduced error-rate. For GTSRB, the classes that are rejected completely are among the few with a relative class frequency of less than 1.0% [24]. Such problems could be eliminated by tackling the unbalanced distribution problem; however, for this work the approach shall be analyzed without heavy interventions. Why two classes in the “danger signs” subclass clearly increase in error-rate is unknown, since the base-model showed relatively good performance for those.</p>
          <p>RQ1. For both datasets, classes with poor base performance were rejected, which shows that the monitoring detects weak points. However, for GTSRB this led to a slight deterioration of two individual classes that the base-model was able to classify with a low error-rate.</p>
          <p>Coverage and Error Trade-Off (RQ4). When analyzing the rar vs. rer trade-off, it is visible that for GTSRB it is more challenging to discriminate between positive and negative samples than for F-MNIST, since the graph shows a faster-increasing gradient while approaching lower rer. This can be explained by GTSRB consisting of higher-dimensional images. Additionally, there are 43 classes where a high loss can be present for relatively clearly separated, correct decisions, meaning predictions with a maximum probability far away from the second-highest score.</p>
          <p>RQ2. Overall, one can state that the monitoring efficiently improves safety as long as the relative reduction of the rer is greater than the relative reduction of the rar. This holds for the investigated operation points. If it does not hold, the result of rejection is only as efficient as discarding samples by pure chance.</p>
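          <p>The RQ2 criterion, that rejection only pays off when the relative rer reduction exceeds the relative rar reduction, can be checked with a few lines of arithmetic. The numbers below are illustrative and not taken from Table 2:</p>

```python
def rejection_pays_off(rar_before, rar_after, rer_before, rer_after):
    """True when the relative reduction of the remaining error rate (rer)
    exceeds the relative reduction of the remaining accuracy rate (rar);
    otherwise rejection is no better than discarding samples at random."""
    rel_rer_reduction = (rer_before - rer_after) / rer_before
    rel_rar_reduction = (rar_before - rar_after) / rar_before
    return rel_rer_reduction > rel_rar_reduction

# Illustrative operation point: rer halves while rar drops only ~4%.
beneficial = rejection_pays_off(0.95, 0.91, 0.08, 0.04)
```

          <p>In the illustrative case the rer is halved at a rar cost of roughly four percent relative, so the criterion is satisfied; if both rates shrank by the same relative amount, the rejection would be no better than chance.</p>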
          <p>RQ3. The ratio of caught samples against output probability, as depicted in Fig. 5 (b) and Fig. 8 (b), confirms that the goal of discarding over-confident predictions is reached. The graphs are continuously increasing and show no area with zero gradient. This implies that cases with various output probabilities were caught. In contrast, rejection based on the softmax threshold inherently missed over-confident errors but did not discard any high-probability correct decisions. The latter explains its better rar trade-off.</p>
          <p>RQ4. While for F-MNIST the gradient of the ratio of correct samples grows much faster than that of discarded samples, the GTSRB dataset revealed a bigger portion of correct samples being rejected. This is in accordance with the harsher rar trade-off. Overall, the monitoring is less prone to rejecting highly confident correct cases.</p>
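          <p>The rar and rer quantities used throughout this discussion can be computed from a reject mask. The definitions below, accepted-correct and accepted-incorrect counts normalized by the total number of samples, are one common convention and may differ in detail from the exact formulas of this work:</p>

```python
import numpy as np

def remaining_rates(correct, rejected):
    """Remaining accuracy rate (rar) and remaining error rate (rer).

    correct:  boolean array, True where the base-model prediction is right.
    rejected: boolean array, True where the monitoring discards the sample.
    Both rates are normalized by the total sample count (assumed convention).
    """
    accepted = ~rejected
    n = len(correct)
    rar = np.sum(correct & accepted) / n
    rer = np.sum(~correct & accepted) / n
    return rar, rer

# Toy run: five samples, two of them rejected by the monitoring.
correct = np.array([True, True, True, False, False])
rejected = np.array([False, False, True, True, False])
rar, rer = remaining_rates(correct, rejected)
```

          <p>Here rejecting one error and one correct sample leaves two correct and one erroneous accepted prediction, so both rates drop relative to the unfiltered model; the trade-off analysis compares these drops.</p>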
        </sec>
        <sec id="sec-2-3-3">
          <title>Acknowledgments</title>
          <p>This work was funded by the German Federal Ministry of Education and Research (BMBF) within the KI-ASIC project (16ES0993). We thank Infineon Technologies AG for supporting this research.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>5.6. Limitations</title>
        <sec id="sec-2-4-1">
          <p>A key aspect in the training of the monitoring is that it shall be based on data that the base-model was not trained on, in order to uncover weak points. In this work, the latter is accomplished with a portion of the base training data. One can argue that the base-model would show higher performance if this withheld data were available during training. In the following, this is discussed separately for both datasets. The error-rate of state-of-the-art networks for F-MNIST is reported as 3.09% with 3.2 M parameters [26], orders of magnitude more parameters than the model investigated here, which shows that reducing the error-rate of the base-model to such levels is not an easy task. Given the size of the used model, obtaining such improvements through training alone is excluded, but they can be obtained by applying the monitoring. For GTSRB, other works [30] report an error-rate smaller than 1.0% using the LeNet architecture. However, their focus is an enhanced architecture and a pre-processing of the data that tackles the unbalanced class distribution and image quality. A common method is data augmentation [31] to alter the class distribution. The gained performance is expected to rely heavily on such pre-processing, which was consciously excluded in this work in order not to rely on balanced classes or other modifications that may induce bias.</p>
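          <p>For reference, class-rebalancing by augmentation, the pre-processing deliberately omitted here, can be sketched as follows. The shift-based augmentation and the oversample-to-largest-class policy are hypothetical illustrations, not the procedure of [31]:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    # Hypothetical minimal augmentation: shift the image by one pixel
    # in a random direction, with wrap-around for simplicity.
    shift = tuple(rng.integers(-1, 2, size=2))
    return np.roll(image, shift, axis=(0, 1))

def oversample_to_balance(images, labels):
    """Duplicate-and-augment minority classes until every class reaches
    the size of the largest class, altering the class distribution."""
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    out_imgs, out_lbls = [images], [labels]
    for cls, cnt in zip(classes, counts):
        if cnt == target:
            continue  # class is already at the target size
        idx = np.where(labels == cls)[0]
        extra = rng.choice(idx, size=target - cnt, replace=True)
        out_imgs.append(np.stack([augment(images[i]) for i in extra]))
        out_lbls.append(np.full(target - cnt, cls))
    return np.concatenate(out_imgs), np.concatenate(out_lbls)

# Toy unbalanced set: two samples of class 0, one of class 1.
images = np.zeros((3, 4, 4))
labels = np.array([0, 0, 1])
balanced_images, balanced_labels = oversample_to_balance(images, labels)
```

          <p>After rebalancing, both classes contribute equally many samples, which is exactly the kind of distribution shift that was avoided in this work to prevent induced bias.</p>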
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion</title>
      <sec id="sec-3-1">
        <p>We motivated the need to reduce the error-rate of a base-model by catching over-confident errors due to their safety-critical nature. We contributed a loss-based labeling that reflects the weak points of the base-model and proposed to train a monitoring on the generated dataset to improve the well-known rejection procedure. We applied this approach to the GTSRB and F-MNIST datasets and compared it to rejection based on the softmax activation function. The presented empirical results showed that the error-rate was improved and over-confident predictions were successfully caught. We discussed that while rejection based on a softmax threshold shows a better remaining accuracy rate trade-off, the range of output probabilities for caught samples is bigger for the monitoring approach. The shown results serve as a proof-of-concept for the approach, which targets safety-critical domains. We believe that the methodology can be used for a variety of models and datasets.</p>
        <p>[8] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1321–1330.</p>
        <p>[9] J. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, D. Tran, Measuring calibration in deep learning, in: CVPR Workshops, volume 2, 2019.</p>
        <p>[10] C.-K. Chow, An optimum character recognition system using decision functions, IRE Transactions on Electronic Computers EC-6 (1957) 247–254.</p>
        <p>[11] P. L. Bartlett, M. H. Wegkamp, Classification with a reject option using a hinge loss, Journal of Machine Learning Research 9 (2008).</p>
        <p>[12] Y. Geifman, R. El-Yaniv, SelectiveNet: A deep neural network with an integrated reject option, in: International Conference on Machine Learning, PMLR, 2019, pp. 2151–2159.</p>
        <p>[13] N. Manwani, K. Desai, S. Sasidharan, R. Sundararajan, Double ramp loss based reject option classifier, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2015, pp. 151–163.</p>
        <p>[14] H. Jiang, B. Kim, M. Y. Guan, M. Gupta, To trust or not to trust a classifier, arXiv preprint arXiv:1805.11783 (2018).</p>
        <p>[15] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: A survey, ACM Computing Surveys (CSUR) 41 (2009) 1–58.</p>
        <p>[16] A. Meinke, M. Hein, Towards neural networks that provably know when they don't know, 2020. arXiv:1909.12180.</p>
        <p>[17] I. Kononenko, Bayesian neural networks, Biological Cybernetics 61 (1989) 361–370.</p>
        <p>[18] C. Chow, On optimum recognition error and reject tradeoff, IEEE Transactions on Information Theory 16 (1970) 41–46.</p>
        <p>[19] C. M. Santos-Pereira, A. M. Pires, On optimal reject rules and ROC curves, Pattern Recognition Letters 26 (2005) 943–952.</p>
        <p>[20] L. P. Cordella, C. De Stefano, F. Tortorella, M. Vento, A method for improving classification reliability of multilayer perceptrons, IEEE Transactions on Neural Networks 6 (1995) 1140–1147.</p>
        <p>[21] C. De Stefano, C. Sansone, M. Vento, To reject or not to reject: that is the question - an answer in case of neural classifiers, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 30 (2000) 84–94.</p>
        <p>[22] Y. Geifman, R. El-Yaniv, Selective classification for deep neural networks, arXiv preprint arXiv:1705.08500 (2017).</p>
        <p>[23] C. M. Bishop, N. M. Nasrabadi, Pattern Recognition and Machine Learning, volume 4, Springer, 2006.</p>
        <p>[24] J. Stallkamp, M. Schlipsing, J. Salmen, C. Igel, The German Traffic Sign Recognition Benchmark: a multi-class classification competition, in: The 2011 International Joint Conference on Neural Networks, IEEE, 2011, pp. 1453–1460.</p>
        <p>[25] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747 (2017).</p>
        <p>[26] M. S. Tanveer, M. U. K. Khan, C.-M. Kyung, Fine-tuning DARTS for image classification, in: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 4789–4796.</p>
        <p>[27] Y. LeCun, The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/ (1998).</p>
        <p>[28] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998) 2278–2324.</p>
        <p>[29] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.</p>
        <p>[30] A. Zaibi, A. Ladgham, A. Sakly, A lightweight model for traffic sign classification based on enhanced LeNet-5 network, Journal of Sensors 2021 (2021).</p>
        <p>[31] L. Perez, J. Wang, The effectiveness of data augmentation in image classification using deep learning, arXiv preprint arXiv:1712.04621 (2017).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mané</surname>
          </string-name>
          ,
          <article-title>Concrete problems in ai safety</article-title>
          ,
          <source>arXiv preprint arXiv:1606.06565</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Burton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gauerhof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Heinzemann</surname>
          </string-name>
          ,
          <article-title>Making the case for safety of machine learning in highly automated driving</article-title>
          , in: International Conference on Computer Safety, Reliability, and Security, Springer,
          <year>2017</year>
          , pp.
          <fpage>5</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Koopman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <article-title>Autonomous vehicle safety: An interdisciplinary challenge</article-title>
          ,
          <source>IEEE Intelligent Transportation Systems Magazine</source>
          <volume>9</volume>
          (
          <year>2017</year>
          )
          <fpage>90</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Salay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Queiroz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Czarnecki</surname>
          </string-name>
          ,
          <article-title>An analysis of iso 26262: Using machine learning safely in automotive software</article-title>
          ,
          <source>arXiv preprint arXiv:1709.02435</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gabreau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pesquet-Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kaakai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lefèvre</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence for future skies: On-going standardization activities to build the next certification/approval framework for airborne and ground aeronautic products</article-title>
          ., in: AISafety@ IJCAI,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Henne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwaiger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roscher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <article-title>Benchmarking uncertainty estimation methods for deep learning with safety-related metrics</article-title>
          ., in: SafeAI@ AAAI,
          <year>2020</year>
          , pp.
          <fpage>83</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          ,
          <article-title>Dropout as a bayesian approximation: Representing model uncertainty in deep learning</article-title>
          ,
          <source>in: international conference on machine learning, PMLR</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1050</fpage>
          -
          <lpage>1059</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>