<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How to build trust in AI systems with misclassification detectors and local misclassification explorations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pål Vegard Bun Johnsen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Milan De Cauwer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joel Bjervig</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brian Elvesaeter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SINTEF Digital</institution>
          ,
          <addr-line>Forskningsveien 1, 0373 Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In the context of an AI system consisting of a machine learning model for classification, we present a framework denoted SafetyCage for systematically detecting and explaining misclassifications. We show how the framework can be used during deployment of the AI system, when true labels are unknown. Specifically, a misclassification detector measures the reliability of one particular model prediction and flags the prediction as either trustworthy or not. Unfortunately, most existing misclassification detectors are not easily interpretable for the purpose of finding the root cause of a misclassification. Hence, if the prediction is deemed untrustworthy, our approach provides additional so-called local misclassification explorations to further assess the trustworthiness of the prediction. The purpose of the framework is to make it possible to systematically explore the root cause of a particular misclassification, and thereby to motivate procedures that improve the AI system further. We showcase our framework with three ML models of different model architectures trained on images, tabular data and text respectively, and present three generic suggestions of local misclassification explorations, and how they can be adapted for each use case.</p>
      </abstract>
      <kwd-group>
        <kwd>AI systems</kwd>
        <kwd>Trustworthy AI</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The integration and use of AI systems has been influential in modern society. The widespread availability
of so-called large language models (LLMs) is a good example of this. Another example is within the
health sector where for instance radiologists can get assistance from an AI model to detect bone fractures
based on X-ray images of patients [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. As defined in the memorandum published by the Organisation
for Economic Co-operation and Development (OECD) in 2024 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: "An AI system is a machine-based
system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such
as predictions, content, recommendations, or decisions that can influence physical or virtual environments.
Different AI systems vary in their levels of autonomy and adaptiveness after deployment". The backbone
of any AI system is the underlying AI model that provides the predictions. Most modern AI models are
so-called machine learning (ML) models — a subset of AI models that are trained on historical data.
This type of AI model has gained particular traction due to its superior performance within imaging,
text and speech tasks. However, this often comes at the cost of reduced explainability - the degree to
which one can understand the basis for a model’s predictions.
      </p>
      <p>
        Following the definition above, an AI system is not limited to providing predictions, but may also
provide recommendations for decisions or even make decisions on behalf of a human. We emphasize
the difference between the raw predictions, and the final recommendations and decisions that will in
some way make use of the raw predictions in a decision-making process. Under the EU AI Act [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
which entered into force on 1 August 2024, Article 15 requires high-risk AI systems, among other things,
to be resilient to errors that occur within the system. Moreover, under Article
14, natural persons must be able to effectively oversee the high-risk AI system during use. To ensure
human oversight and control of the AI system, and to avoid negative consequences, it is important
that the decision maker can efficiently assess the trustworthiness of a particular prediction. As
described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], several aspects must be considered when assessing the trustworthiness of an AI system.
In this work, we will focus on the aspects of accuracy and human oversight, and the extent to which
we can assess and explore whether a particular prediction from the AI system is correct. We present
the framework SafetyCage for detecting and exploring incorrect predictions using misclassification
detectors. The advantage of having a misclassification detector together with the ML classification
model is that the accuracy of each prediction can be assessed in contrast to an overall model-based
accuracy assessment. In an AI system, this is a useful tool when human intervention is relevant or
required, for instance for quality assurance, spot checks or when a particular prediction is suspicious.
If both the ML model and the misclassification detector are sufficiently accurate, it can be relevant to
use the detector actively, for instance by presenting its output together with the
particular ML model prediction. In this way, the ideally small proportion of suspicious predictions
flagged by the misclassification detector can be inspected in detail by a human. To build trust in the
detection procedure, the framework includes analyses that help explore why a particular prediction is
correct or not. We limit ourselves to classification models where the task is to predict whether an input
sample (such as an X-ray image) belongs to a certain class (such as representing the presence of a bone
fracture).
      </p>
      <p>The remainder of the paper is structured as follows. In Section 2 we include related work, and present
challenges of using misclassification detectors in AI systems, and how we address them in this work.
In Section 3, we present our SafetyCage framework for misclassification detection and exploration.
Section 4 demonstrates the application of the framework in three different use cases involving ML
models for images, tabular data and text. In Section 5 we discuss the results and limitations of the
framework. Finally, Section 6 concludes the paper and presents directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work - Misclassification detectors</title>
      <p>Consider a pre-trained (already trained) ML model. Given an input sample $\mathbf{x}$, the ML model outputs a
prediction $\hat{y}$ through $\hat{f} : \mathbf{x} \mapsto \hat{y}$, satisfying $\hat{y} = \hat{f}(\mathbf{x})$. The ML model is trained by fitting the parameters $\theta$
so as to mimic the true target function $f^{*} : \mathbf{x} \mapsto y$ satisfying $y = f^{*}(\mathbf{x})$. This branch of machine
learning is called supervised learning, where the ML model is trained based on true labels $y_1, \ldots, y_n$ for
every input $\mathbf{x}_1, \ldots, \mathbf{x}_n$ with training size $n$, denoted the training data $\mathcal{D}_{\mathrm{train}} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$.
In classification problems the labels are distinct and disjoint categories or classes. We assume only one
class can be present for each input sample. In this study, we deal with ML models where the output
is a vector in which each element, representing class $k$, is a value that predicts the probability that class
$k$ is present, and where the final prediction is the class with the largest probability. We
further only explore ML models where the softmax activation function is applied to the output layer,
which ensures that all elements sum to one, following the law of total probability.</p>
      <p>The goal of a misclassification detector (MD) is in general to detect whenever the ML model prediction
is wrong, that is, for a given input $\mathbf{x}$ and prediction $\hat{y}$, whenever $\hat{y} \neq y$. We define the MD to be a function
$g : (\mathbf{x}, \hat{y}) \mapsto \mathbb{B}$ which, given $\mathbf{x}$ and $\hat{y}$, outputs a boolean value, where the output 1 means that the MD
predicts the ML model prediction to be wrong, while the output 0 means that the MD predicts the ML model prediction to be
correct. Ultimately, the MD is a binary classifier. When the output is 1 we say that the MD model flags
the prediction as being wrong. Note that this problem is only relevant during so-called deployment of
the ML model, that is, whenever the ML model makes a prediction and we do not yet know
the true label $y$. The MD is itself a prediction model, which can flag wrongly, and hence we
will throughout the paper refer to an MD model.</p>
      <p>
        At the time of writing, there exist several misclassification detection models that have been published.
What can be considered the baseline methods are the maximum softmax probability (MSP) method [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and the DOCTOR method [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] respectively which can be applied to every ML model with a softmax
output layer. For the MSP method, a prediction is considered incorrect whenever the maximum softmax
value is less than a predefined threshold. For the DOCTOR method, a prediction is considered incorrect
when the estimated odds of a misclassification is larger than a predefined threshold. Another MD
model called RED is described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] where the uncertainty in the maximum softmax value is
estimated using Gaussian processes. There also exist detectors that are made for particular model
architectures, such as in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] where the ML model is a feedforward neural network model. In
these cases, the presence of an incorrect prediction is inferred by using the empirical distribution
functions of correct and incorrect predictions on historical data across different layers of the neural
network, and a hypothesis test is then constructed to infer the likelihood that the prediction is wrong.
All of the mentioned methods share the property that, for each input $\mathbf{x}$ and prediction $\hat{y}$, the raw
output of the detector is a quantitative measure of the reliability or trustworthiness of the prediction.
For instance, under the MSP method, the larger the maximum softmax value for a prediction, the more
trustworthy the prediction is considered.
      </p>
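      <p>As a minimal sketch of the thresholding idea behind the MSP method, the following Python snippet flags a prediction when its maximum softmax value falls below a threshold. The function names, the example scores and the threshold value 0.7 are illustrative assumptions and are not taken from the cited work.</p>

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw output scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_flag(probs, threshold):
    """Return 1 (flag as possibly wrong) when the maximum softmax
    probability is below the threshold, otherwise 0."""
    return 1 if threshold > max(probs) else 0


# Hypothetical raw outputs for a three-class model.
confident = softmax([4.0, 0.5, 0.1])  # one class clearly dominates
uncertain = softmax([1.0, 0.9, 0.8])  # classes are nearly tied

# The threshold 0.7 is an arbitrary illustrative choice.
flags = [msp_flag(confident, 0.7), msp_flag(uncertain, 0.7)]
```

      <p>In this toy setting only the nearly tied output is flagged, which mirrors the reading that a larger maximum softmax value indicates a more trustworthy prediction.</p>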
      <sec id="sec-2-1">
        <title>2.1. The challenge of using MD models in AI systems</title>
        <p>
          This raw output from the misclassification detectors is in itself informative, and can ultimately be used
to declare a prediction as trustworthy or not. In practice, this will require a threshold with respect to
the quantitative reliability measure, which defines the border between correct and wrong predictions.
However, in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], the performance of the MSP method is evaluated using the AUROC and AUPR metrics.
These metrics aggregate the model performance across several thresholds, and they are therefore
threshold-independent performance metrics. While these are informative metrics to quantify general
performance, they are not helpful for flagging a particular prediction as trustworthy or not. In [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ],
an MD model is evaluated by computing the proportion of correct classifications that are flagged by
the MD model (Type I error) given that the proportion of all misclassifications that are flagged by the
MD model (recall) is required to be greater than 95%. This requirement automatically determines the
threshold. However, this evaluation metric does not account for other important performance metrics,
such as the proportion of flags that are correct (precision), or the proportion of non-flagged samples
that are correctly classified (negative predictive value). Importantly, for an MD model to be practical in
an AI system, only a small proportion of flagged instances should be false alarms. Hence, the precision
should be sufficiently high. For a threshold-based MD approach, this raises the question of how one
should evaluate an MD model in a way that accounts for all aspects of MD model performance, and
how one should estimate the threshold.
        </p>
        <p>What the MD models mentioned above also have in common is that while their mutual goal is to
capture incorrect predictions, they do not provide any explanations as to why the prediction is incorrect.
This lack of interpretability is problematic, and limits the trustworthiness of the MD model itself.</p>
        <p>Our contributions in this work are the following:
1. We propose to use the Matthews correlation coefficient (MCC) as a suitable evaluation metric for
threshold-based MD models
2. We propose a way to estimate the threshold which decides when to declare a prediction as
trustworthy or not
3. We introduce the concept of local misclassification explorations with the aim of exploring
predictions flagged by an MD model in order to monitor an AI system and to validate the flagging by
the MD model</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodological approach and framework</title>
      <p>This section details contributions 1, 2 and 3 from Section 2.1 in
Sections 3.1, 3.2 and 3.3-3.4, respectively.</p>
      <sec id="sec-3-1">
        <title>3.1. Evaluation of misclassification detectors</title>
        <p>
          As noted earlier, misclassification detectors (MDs) can themselves produce inaccurate outputs, so their
trustworthiness must also be quantified. The underlying ML model in an operative AI system must
be sufficiently accurate, so incorrect predictions are rare. It is therefore essential that the MD correctly identifies the small portion of
incorrect predictions, while minimizing false alarms on correct ones. This imbalance between correct
and incorrect predictions must be accounted for when evaluating MD performance. Moreover, the MD
model performance should be evaluated based on all aspects covered by the performance metrics recall,
precision, specificity (1 - Type I error) and negative predictive value. We therefore recommend the use
of the Matthews correlation coefficient (MCC), which is specifically designed to provide a reliable
performance measure for binary classifiers under class imbalance [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and which is also constructed
such that a large MCC value is only possible if the recall, precision, specificity and
negative predictive value are all large. The MCC ranges from -1 to 1, where -1 means the classifier is
wrong in all cases, 0 means the classifier performs no better than a coin-tossing classifier, and 1
means the classifier predicts correctly in all cases. Another favourable feature is that MCC = 0 for
classifiers that predict only one class in every case, whereas the classification accuracy, for instance,
would equal 0.99 when the majority class covers 99% of all samples. For the same
reason, we follow the procedure presented by [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], where the optimal threshold of an MD model is estimated from data
so as to maximize the Matthews correlation coefficient
(MCC).
        </p>
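        <p>The following Python sketch illustrates how the MCC can be computed from flags and how a threshold can be chosen to maximize it. It assumes an MSP-style detector that flags a prediction when its reliability score is below the threshold; the scores, labels and candidate thresholds are made-up illustrative values, and the data are assumed to be held out from the ML model training data.</p>

```python
import math


def mcc(is_wrong, flags):
    """Matthews correlation coefficient for the MD as a binary classifier.
    is_wrong[i] = 1 if the ML prediction was wrong, flags[i] = 1 if flagged."""
    pairs = list(zip(is_wrong, flags))
    tp = sum(1 for w, f in pairs if w == 1 and f == 1)
    tn = sum(1 for w, f in pairs if w == 0 and f == 0)
    fp = sum(1 for w, f in pairs if w == 0 and f == 1)
    fn = sum(1 for w, f in pairs if w == 1 and f == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # A zero denominator (for example when nothing is flagged) is mapped to 0.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_threshold(scores, is_wrong, candidates):
    """Return the candidate threshold with the largest MCC and that MCC.
    A sample is flagged when its reliability score is below the threshold."""
    best_t, best_value = None, -2.0
    for t in candidates:
        flags = [1 if t > s else 0 for s in scores]
        value = mcc(is_wrong, flags)
        if value > best_value:
            best_t, best_value = t, value
    return best_t, best_value


# Hypothetical held-out data: reliability scores and whether the ML model erred.
scores = [0.99, 0.95, 0.90, 0.85, 0.60, 0.55, 0.97, 0.45]
is_wrong = [0, 0, 0, 0, 1, 0, 0, 1]
threshold, value = best_threshold(scores, is_wrong, [0.5, 0.58, 0.7, 0.9])

# A detector that never flags anything has MCC 0 despite 99% accuracy.
never_flags = mcc([0] * 99 + [1], [0] * 100)
```

        <p>In this toy data the threshold 0.7 catches both misclassifications at the cost of one false alarm and obtains the largest MCC among the candidates.</p>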
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Misclassification detectors and how to construct them</title>
        <p>
          To avoid optimism bias due to potential model overfitting, the threshold should be estimated on data
independent of the data used to train the ML model [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. For methods such as in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], parameter
fitting is required as well, and these parameters should also be fitted on data independent of the training data.
        </p>
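        <p>We denote this data as $\mathcal{D}_{\mathrm{MD}} = \{(\mathbf{x}_j, y_j)\}_{j=1}^{m}$, where $\mathcal{D}_{\mathrm{train}} \cap \mathcal{D}_{\mathrm{MD}} = \emptyset$ and $\mathcal{D}_{\mathrm{train}}$ is the training data of the ML model. When $\mathcal{D}_{\mathrm{MD}}$ is applied on the ML
model, we can acquire the training data for the MD model, which we denote $\mathcal{D}^{*}_{\mathrm{MD}} = \{(\mathbf{x}_j, z_j)\}_{j=1}^{m}$
with $z_j = 1(\hat{y}_j = y_j)$. In summary, the construction of a misclassification detector is visualized in
Figure 1.</p>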
        <p>[Figure 1: flow diagram with the elements ML model architecture, Train ML model, Trained ML model, MD model architecture, Train MD model, and Trained misclassification detector.]</p>
        <p>Once an MD model is trained, we can use it during deployment of the ML model to flag potential
misclassifications. As shown in Figure 2, the MD model takes as input the input sample $\mathbf{x}$, the ML
model prediction $\hat{y}$, as well as information about the ML model architecture if needed.</p>
        <p>[Figure 2: flow diagram with the elements Trained ML model, Trained MD model, Flag, and Local misclassification exploration.]</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. The SafetyCage framework for monitoring AI systems during deployment</title>
        <p>The performance of an ML model can be evaluated based on historical data where we know the incidents
where the predictions were correct or wrong. This can be achieved by computing accuracy metrics
such as classification accuracy or the F1 score, and the performance can be visualized for instance with a
confusion matrix. These metrics can also be calculated across different subsets of the data that share
similar characteristics, helping to identify specific conditions under which the model performs better or
worse. These results can give an overall impression of when the ML model fails, but only to a limited degree
of why it fails.</p>
        <p>
          Recent work has emphasized the value of creating local explanations — that is, explanations for
individual predictions. This includes identifying which input variables the ML model found most
important for a given prediction [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Such explanations are valuable because they incentivize questions
such as "Should this variable in the input sample be this important?" or "Is it reasonable that such
a small change in the input sample flips the prediction?". If the answer to any of these questions is
"no", it automatically lowers the confidence in the prediction. However, less effort has been devoted to
locally exploring misclassifications. Such an inspection may lead to hypotheses as
to why something went wrong in particular cases, and altogether this insight may lead to concrete
actions for enhancing the ML model when it is retrained. During deployment of the AI system, and hence during
decision-making, we do not have true labels, and the decision-maker must choose the best action to take
given the available information. In this process, only the single ML model prediction is available, so the
evaluation on historical data described above cannot be applied directly. To address this limitation, we present a framework
for using a misclassification detector during the deployment of an AI system. As the misclassification
detector typically does not reveal why a prediction is flagged as being wrong, we additionally provide
local misclassification explorations. In summary, the SafetyCage framework is given in Algorithm 1.
        </p>
        <p>Algorithm 1: Framework for use of misclassification detector together with local misclassification
explorations during deployment of an AI system.</p>
        <preformat>Input: Input sample $\mathbf{x}$, trained ML model $\hat{f}$, misclassification detector $g$
Output: Trust decision on prediction $\hat{y}$
$\hat{y} \leftarrow \hat{f}(\mathbf{x})$;        // Obtain ML model prediction
flag $\leftarrow g(\mathbf{x}, \hat{y})$;        // Detect if prediction may be incorrect
if flag = 1 then
    Generate local misclassification explorations;
end
Make trust decision based on flag and local misclassification explorations;</preformat>
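        <p>The control flow of Algorithm 1 can be sketched in Python as follows. The toy model, toy detector and toy exploration are hypothetical stand-ins introduced only to make the sketch executable; the final trust decision is left to the human decision maker and is therefore not implemented.</p>

```python
def safety_cage(x, model, detector, explorations):
    """One deployment step: predict, flag, and explore only if flagged."""
    y_hat = model(x)            # obtain ML model prediction
    flag = detector(x, y_hat)   # 1 means the prediction may be incorrect
    reports = []
    if flag == 1:
        # Generate local misclassification explorations for the flagged sample.
        reports = [explore(x, y_hat) for explore in explorations]
    # The trust decision itself is made by a human using this information.
    return {"prediction": y_hat, "flag": flag, "explorations": reports}


# Hypothetical stand-ins for a trained ML model, an MD model and an exploration.
def toy_model(x):
    return "positive" if x >= 0 else "negative"


def toy_detector(x, y_hat):
    # Flags inputs close to the decision boundary at zero.
    return 1 if 0.5 > abs(x) else 0


def toy_exploration(x, y_hat):
    return "input %.2f is close to the decision boundary" % x


flagged = safety_cage(0.2, toy_model, toy_detector, [toy_exploration])
trusted = safety_cage(3.0, toy_model, toy_detector, [toy_exploration])
```

        <p>Keeping the explorations as a list of callables reflects that the framework does not prescribe a fixed set of explorations; they are chosen per use case.</p>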
        <p>
          Note that the term SafetyCage has been previously used as a particular type of misclassification
detection model first introduced in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], authored among others by the main author of this
workshop paper. The meaning of SafetyCage is from here on extended to the framework in
which a misclassification detector acts within an AI system.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Generating local misclassification explorations</title>
        <p>As the previous section suggests, we are interested in generating local misclassification explorations for
a sample $\mathbf{x}$ whenever the MD model flags the prediction, that is, whenever $g(\mathbf{x}, \hat{y}) = 1$.</p>
        <p>In this section, we present three generic open-ended misclassification exploration methods. We
approach this as a hypothesis-generating study rather than a hypothesis-testing one.
• Exploration 1: Explore similar historical samples with the same misclassified prediction
• Exploration 2: Evaluate whether the sample is an outlier, and in what way
• Exploration 3: Perturb the input sample, without changing the true label, and explore the effect
it has on the corresponding prediction</p>
        <p>Exploration 1 requires a definition of what similar samples are, which will depend on the
use case. In Section 4.1 we show a particular example of this. The purpose of Exploration 2 is
to explore whether the sample deviates significantly from previous observations in the training
dataset, and hence is an outlier. This is motivated by the fact that ML models typically perform poorly in
these circumstances. The presence of an outlier may be due to measurement errors in the sample, but
it may also be due to limited representativeness of the training data. In Section 4.2 we
show a practical example. Exploration 3 explores how the ML model behaves if we modify the original
input sample without losing its essence. If the corresponding prediction changes, we
lose trust in the original prediction. The challenge is to know how to perturb the input sample
without changing its true label. In Section 4.3 we show one example of how to deal with this.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Use cases and experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Image classification with the MNIST database</title>
        <p>
          We demonstrate Exploration 1 for a flagged sample on the MNIST
dataset [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The dataset consists of 70 000 grayscale images (28×28 pixels) representing the digits 0-9.
Given an unseen image of a digit, the classification task is to predict which class the image belongs to.
        </p>
        <p>We trained a simple feed-forward neural network designed to solve the MNIST classification problem.
As illustrated by Figure 3, the model consists of an input layer, two hidden layers with sigmoid activation
functions, and an output layer of size 10 with a softmax activation function. This model achieves an
overall accuracy of 97% on unseen test data of size 1000.</p>
        <p>We trained a SPARDACUS detector on the MD training data (Section 3.2) using the Dense 2 layer as a source of
information. The training procedure led to an associated MCC = 0.48 on testing data. Specifically, the
trained SafetyCage flagged 2.61% of the 1000 test samples as not trustworthy.</p>
        <p>[Figure 4: (a) the flagged test sample; (b)-(f) its five nearest neighbors among previously misclassified samples.]</p>
        <p>
          Figure 4a shows a sample from the testing set, wrongly predicted to be the digit 5, that was correctly flagged
by the SPARDACUS detector. We show an example of applying Exploration 1, namely by looking at
previous samples that received the same prediction, were misclassified, and are similar to the input sample (4a). The
notion of similarity must be explicitly defined, which is not straightforward for raw images.
However, with a neural network, one approach is to instead base the similarity between samples on
the corresponding representations (activation values) at a particular layer. Specifically, given a new
input sample $\mathbf{x}$ flagged by SafetyCage, we can search for previous samples, having the same class
prediction and known to be misclassified, that are closest to $\mathbf{x}$ with respect to a distance metric in a
particular layer. Here, we use the penultimate layer of the neural network model (Dense 2 layer), and
apply the KNN distance metric introduced by [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Figures 4b-f depict the 5 nearest neighbors of the
flagged sample under scrutiny. These samples suggest to the end-user in what way the digit 9 can be
misclassified as the digit 5.
        </p>
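        <p>A minimal Python sketch of this neighbor search is given below. It assumes that layer representations of historical samples have already been extracted and stored, and it uses plain Euclidean distance as a stand-in for the KNN distance metric cited above. The record layout and the two-dimensional toy representations are illustrative assumptions.</p>

```python
import math


def nearest_misclassified(query_repr, query_pred, history, k=5):
    """Return the k historical samples closest to the query in representation
    space, restricted to samples that received the same prediction as the
    query and are known to be misclassified."""
    candidates = [
        h for h in history
        if h["pred"] == query_pred and h["pred"] != h["label"]
    ]
    ranked = sorted(candidates, key=lambda h: math.dist(query_repr, h["repr"]))
    return ranked[:k]


# Hypothetical history with tiny two-dimensional layer representations.
history = [
    {"id": "a", "repr": [0.1, 0.9], "pred": 5, "label": 9},
    {"id": "b", "repr": [0.8, 0.2], "pred": 5, "label": 3},
    {"id": "c", "repr": [0.1, 0.8], "pred": 5, "label": 5},  # correct, excluded
    {"id": "d", "repr": [0.2, 0.9], "pred": 4, "label": 9},  # other prediction
    {"id": "e", "repr": [0.3, 0.7], "pred": 5, "label": 9},
]

neighbors = nearest_misclassified([0.15, 0.85], 5, history, k=2)
neighbor_ids = [h["id"] for h in neighbors]
```

        <p>Filtering before ranking keeps the result focused on how the predicted class has been confused in the past, which is the information presented to the end-user in Figure 4.</p>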
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Heart disease prediction based on clinical variables</title>
        <p>
          We now discuss another possible approach to explaining misclassified samples. Consider a tabular
dataset of 918 rows linking 11 patient features (age, sex, resting blood pressure, ...) to a 0/1 variable
stating whether the patient had a cardiovascular incident. An efficient method to model the binary
classification problem $\hat{y} = \hat{f}(\mathbf{x})$ with $\hat{y} \in \{0, 1\}$ is to use a decision-tree-based algorithm such as
light gradient boosting machines (LGBM). We relied on the open source implementation of LGBM:
LightGBM [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Binary and categorical features from the dataset were encoded using an ordinal
encoding strategy. Some model hyperparameters were set to explicit values: the maximum depth of
trees (max depth) was set to 5, the bagging fraction to 0.5 and the number of training rounds to 100.
Hyperparameters that were not set explicitly took their default values. With this setting the ML
model achieved an overall accuracy of 0.85 on test data with precision and recall equal to 0.84 and 0.90
respectively.
        </p>
        <p>In this use case, the DOCTOR misclassification detector is used to flag test set samples. The method
flagged 30 samples (13%) as potentially wrongly classified, achieving an overall accuracy of 87% at
correctly flagging samples and MCC = 0.45 on the testing set.</p>
        <p>
          Figure 5 (Left) shows test set samples as flagged by the detector and ECOD [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] (Empirical-Cumulative-distribution-based
Outlier Detection), a simple, computationally efficient and explainable outlier detection method. The
figure suggests that sample 22 is detected as an outlier in the testing dataset while also having been
successfully flagged as wrongly predicted by SafetyCage. The ECOD outlier detection algorithm
allows us to measure each feature’s individual contribution to the sample’s overall outlier score, as seen
in Figure 5 (Middle). As can be seen in the figure, the outlier detection method suggests that the age
feature contributes significantly to the overall score and is also above the 99% cutoff band, making
the sample very likely to be an outlier.
        </p>
        <p>To conclude Exploration 2, we can present the age distribution (Figure 5 (Right)) to the ML model
end-user and propose that the extreme value of the age can explain why the ML model misclassified the
sample.</p>
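        <p>To illustrate how a per-feature outlier contribution can be read off empirical distributions, the Python sketch below scores each feature by the negative logarithm of its smaller empirical tail probability. This is a simplified stand-in and not the ECOD algorithm itself, which among other things handles skewness and aggregates the scores differently. The feature values are made up for illustration.</p>

```python
import math


def tail_scores(sample, reference):
    """Per-feature outlier contribution of a sample relative to reference data.
    For each feature, use the smaller of the two empirical tail probabilities
    and return its negative logarithm (larger means more extreme)."""
    scores = {}
    for name, value in sample.items():
        column = reference[name]
        n = len(column)
        left = sum(1 for v in column if value >= v) / n   # share of values at or below
        right = sum(1 for v in column if v >= value) / n  # share of values at or above
        # Floor the tail probability so values outside the observed range stay finite.
        tail = max(min(left, right), 1.0 / (n + 1))
        scores[name] = -math.log(tail)
    return scores


# Hypothetical reference data for two of the clinical features.
reference = {
    "age": [40, 45, 50, 52, 55, 58, 60, 61, 63, 65],
    "resting_bp": [110, 115, 120, 125, 128, 130, 132, 135, 140, 150],
}
sample = {"age": 77, "resting_bp": 130}

scores = tail_scores(sample, reference)
most_extreme = max(scores, key=scores.get)
```

        <p>In this toy example the age value lies outside the observed range and therefore dominates the per-feature scores, analogous to the age contribution highlighted in Figure 5 (Middle).</p>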
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Review sentiment prediction based on the IMDb Dataset</title>
        <p>
          We apply a pre-trained Roberta transformer model [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] to predict sentiment (either negative or positive)
of movie reviews from the IMDb dataset [18]. We collect 1000 samples and train an MSP misclassification
detector on half of them, leaving the remaining 500 as test data. The MSP detector achieves an MCC = 0.38
on the test data. The left column of Table 1 shows a movie review that was wrongly classified
as having a positive sentiment by the Roberta transformer model.
        </p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr>
                <th>Original Review (shortened)</th>
                <th>Paraphrased Review</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>This looks like one of these Australian movies done
by “talented” students and funded by the government.
It is chock full of smart shots of colors and shapes
and verbal excursions into Freudian psychology to be
appreciated by art students and teachers alike, but
in general it is perceived a stupid mockery of good
cinema, good storytelling and generally good taste...</td>
                <td>This movie is thought to be a stupid mockery of good
cinema because it is full of smart shots of colors and
shapes and verbal excursions into Freudian psychology
to be appreciated by art students and teachers alike.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We showcase how Exploration 3 can be utilized for the movie review in Table 1. We
employ the PEGASUS transformer model fine-tuned for paraphrasing [19, 20], which rephrases a given
input while preserving its original meaning. We generate ten paraphrases of the original review;
the right column of Table 1 shows one of them. We then recompute the sentiment prediction of the Roberta model for
each paraphrase. The prediction flips from positive to negative sentiment for seven
out of ten paraphrases. Provided that the paraphrases are of sufficient quality, this shows a lack of
robustness of the sentiment prediction model for this particular input sample, and hence reduces the
trustworthiness of the prediction.</p>
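        <p>The robustness check in Exploration 3 reduces to counting how often the prediction changes under label-preserving perturbations. The Python sketch below computes this flip rate from a list of predictions; the function name is an illustrative assumption, and the predictions are hard-coded to mirror the seven out of ten flips reported above rather than produced by an actual model.</p>

```python
def flip_rate(original_prediction, perturbed_predictions):
    """Share of perturbed inputs whose prediction differs from the original."""
    flips = sum(1 for p in perturbed_predictions if p != original_prediction)
    return flips / len(perturbed_predictions)


# Predictions for ten paraphrases, hard-coded for illustration.
paraphrase_predictions = ["negative"] * 7 + ["positive"] * 3
rate = flip_rate("positive", paraphrase_predictions)
```

        <p>A high flip rate only indicates low robustness if the perturbations preserve the true label, so the quality of the paraphrases should be checked before the rate is interpreted.</p>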
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>We have presented a framework, denoted SafetyCage, for detecting and exploring misclassifications in an
AI system for the purpose of general human oversight and finding the root cause of a misclassification.
We showed three generic examples. Note the important distinction between exploration and explanation:
an exploration generates hypotheses about a root cause, but does not by itself establish it.
A next step would be to construct hypotheses of root causes, and further to evaluate them. With this
in mind, we regard the quality of a local misclassification exploration as the degree to which it
enables formal hypothesis tests that infer the root cause. Misclassifications for which the root cause has
been found are valuable information for the future use of an AI system. How to use this information
to further improve the AI system is yet another natural step.</p>
      <p>The SafetyCage framework may be a compliance enabler for high-risk AI systems with respect to
the EU AI Act, for instance with respect to Articles 14 and 15, which among other things require that
high-risk AI systems are resilient to errors, and that natural persons can efficiently oversee the system
during use. The quantification of the prediction uncertainty and the flagging procedure can make the AI
system more resilient to prediction errors, and the local misclassification explorations can contribute
to human oversight. As SafetyCage assesses the likelihood that an ML model produces a prediction
error, the framework can be integrated into a risk management system. This is especially relevant
given that Article 9 of the EU AI Act requires a risk management system for high-risk AI systems. A
potential integration of SafetyCage in a high-risk AI system also creates new obligations regarding
the technical documentation (Article 11). This should for instance include detailed information about
the flagging procedure of SafetyCage, and its performance quantified with metrics (such as the MCC).</p>
      <p>The examples provided in Section 4 serve as illustrative cases intended to showcase the practical
potential of the framework, and should not be interpreted as the default behaviour of samples flagged by
a misclassification detector. The success of the framework depends heavily on the performance of
the ML model, the MD model and the local explorations. The purpose of this work is to introduce
the framework and to show how its different components fit together.</p>
      <p>From the use cases in Section 4, we see that the MCC on the test data ranges from 0.38 to 0.48 across
different MD models. While this means the MD models are informative, there is still room for
improvement by catching more misclassifications and reducing false flags, which would effectively
increase the MCC towards 1.</p>
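      <p>To make the link between caught misclassifications, false flags and the MCC concrete, the sketch below computes the MCC from confusion-matrix counts. It assumes the convention that a positive is a truly misclassified prediction, so that a true positive is a misclassification that was flagged and a false positive is a false flag. The function name mcc and all counts are hypothetical and are not results from this work.</p>
<preformat>
```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical detector: 40 misclassifications caught, 40 missed, 60 false flags.
baseline = mcc(tp=40, fp=60, fn=40, tn=860)

# Catching more misclassifications and raising fewer false flags increases the MCC.
improved = mcc(tp=60, fp=30, fn=20, tn=890)

# A detector with no missed misclassifications and no false flags reaches 1.
perfect = mcc(tp=80, fp=0, fn=0, tn=920)

print(f"baseline={baseline:.2f} improved={improved:.2f} perfect={perfect:.2f}")
```
</preformat>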
      <p>A premise for the motivation of local misclassification explorations is the limited interpretability
of existing misclassification detectors. One alternative way forward is to construct interpretable
misclassification detectors, for example using linear models or decision trees, where it is also
possible to see the reason why a prediction is flagged as a misclassification. However, increased
interpretability often comes at the cost of reduced accuracy of the misclassification detector.</p>
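      <p>As a minimal sketch of what an interpretable detector could look like, the example below fits a decision tree of depth one (a decision stump) that flags a prediction based on a single feature and threshold, so the reason for a flag can be read directly from the rule. The features max_prob and margin, the data and the function fit_stump are hypothetical and chosen purely for illustration; they are not the detectors evaluated in this work.</p>
<preformat>
```python
from typing import Sequence, Tuple


def fit_stump(X: Sequence[Sequence[float]], y: Sequence[int],
              names: Sequence[str]) -> Tuple[str, float, bool]:
    """Fit a depth-1 decision tree by exhaustive search over features and thresholds.

    Returns (feature name, threshold, flag_above). A sample is flagged as
    misclassified when its feature value exceeds the threshold if flag_above
    is True, and when it does not exceed the threshold otherwise.
    """
    best = None  # (training errors, feature name, threshold, flag_above)
    for j, name in enumerate(names):
        for t in sorted({row[j] for row in X}):
            for flag_above in (True, False):
                preds = [int((row[j] > t) == flag_above) for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or best[0] > errors:
                    best = (errors, name, t, flag_above)
    return best[1], best[2], best[3]


# Hypothetical features per prediction: top softmax probability and its margin
# to the runner-up class. Label 1 means the prediction was a misclassification.
names = ["max_prob", "margin"]
X = [[0.99, 0.98], [0.95, 0.90], [0.55, 0.10],
     [0.60, 0.25], [0.97, 0.95], [0.52, 0.05]]
y = [0, 0, 1, 1, 0, 1]

feature, threshold, flag_above = fit_stump(X, y, names)
side = "above" if flag_above else "at or below"
print(f"flag as misclassified when {feature} is {side} {threshold}")
```
</preformat>
      <p>A single-rule detector like this is easy to inspect but will typically separate misclassifications less well than a more flexible model, which illustrates the trade-off noted above.</p>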
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and future work</title>
      <p>In this work, we proposed the framework SafetyCage for how to actively use misclassification detectors
in AI systems. SafetyCage acts as a guardrail mechanism during decision-making, aiming to reduce
the risk of poor decisions due to incorrect predictions from AI classification models. We recommend the
use of local misclassification explorations, which can help uncover the underlying causes of erroneous
predictions. These insights can support the development of hypothesis-driven tests and guide AI model
retraining or modification to improve the overall quality and reliability of the decision-making.</p>
      <p>We have identified several directions for future research and development to enhance the SafetyCage
framework:
• Improving the misclassification detectors: Even though the MDs covered in this work are
informative, there is still room for improvement. How to construct MD models that can achieve
higher MCC values will be important future research.
• Improving local explorations: Enhancing the quality and interpretability of local explorations
is a key challenge. In particular, identifying root causes of misclassifications and bridging the gap
between exploratory insights and actionable explanations is essential.
• Interpretable misclassification detectors: By using an interpretable model architecture, one
can not only predict the presence of a misclassification, but also be able to interpret the reason
why.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has received funding from the THEMIS 5.0 project under the European Union’s Horizon
Europe research and innovation programme (grant agreement No 101121042). THEMIS 5.0 aims to
develop an AI-driven, human-centered trustworthiness optimization ecosystem that enables users to
assess, influence, and enhance the fairness, transparency, and accountability of AI systems.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT for grammar and spelling check,
paraphrasing and rewording. After using this tool, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
      <p>[18] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Portland, Oregon, USA, 2011, pp. 142–150. URL: http://www.aclweb.org/anthology/P11-1015.</p>
      <p>[19] J. Zhang, Y. Zhao, M. Saleh, P. J. Liu, PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization, 2019. arXiv:1912.08777.</p>
      <p>[20] Tuner007, tuner007/pegasus_paraphrase, https://huggingface.co/tuner007/pegasus_paraphrase, 2021. Fine-tuned PEGASUS model for paraphrasing, hosted on Hugging Face.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pettersen-Johansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Faxvaag</surname>
          </string-name>
          ,
          <article-title>Bruk av kunstig intelligens i bildediagnostikk - Hvordan påvirker en kunstig intelligens (KI) applikasjon for tolkning av beinbrudd, arbeidsflyt og oppgaveløsning for helsepersonell i klinisk praksis?</article-title>
          ,
          <source>Master's thesis</source>
          , Norwegian University of Science and Technology, Trondheim, Norway,
          <year>2024</year>
          . URL: https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/3190984?show=full.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <article-title>Explanatory memorandum on the updated OECD definition of an AI system</article-title>
          ,
          <source>Technical Report, Organisation for Economic Co-Operation and Development (OECD)</source>
          ,
          <year>2024</year>
          . doi:10.1787/623da898-en.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <article-title>Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act)</article-title>
          ,
          <year>2024</year>
          . URL: https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Díaz-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Del Ser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coeckelbergh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>López de Prado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Herrera-Viedma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <article-title>Connecting the dots in trustworthy Artificial Intelligence: From AI principles, ethics, and key requirements to responsible AI systems and regulation</article-title>
          ,
          <source>Information Fusion</source>
          <volume>99</volume>
          (
          <year>2023</year>
          )
          101896. doi:10.1016/j.inffus.2023.101896.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <article-title>A baseline for detecting misclassified and out-of-distribution examples in neural networks</article-title>
          ,
          <year>2018</year>
          . URL: https://arxiv.org/abs/1610.02136. arXiv:1610.02136.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Granese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romanelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gorla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Palamidessi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Piantanida</surname>
          </string-name>
          ,
          <article-title>Doctor: A simple method for detecting misclassification errors</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2106.02395. arXiv:2106.02395.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Miikkulainen</surname>
          </string-name>
          ,
          <article-title>Detecting Misclassification Errors in Neural Networks with a Gaussian Process Model</article-title>
          ,
          <year>2022</year>
          . URL: http://arxiv.org/abs/2010.02065. doi:10.48550/arXiv.2010.02065.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Johnsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Remonato</surname>
          </string-name>
          ,
          <article-title>Safetycage: A misclassification detector for feed-forward neural networks</article-title>
          , in: T. Lutchyn, A. Ramírez Rivera, B. Ricaud (Eds.),
          <source>Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL)</source>
          , volume
          <volume>233</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR
          ,
          <year>2024</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>119</lpage>
          . URL: https://proceedings.mlr.press/v233/johnsen24a.html.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Johnsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Remonato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Benedict</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ndur-Osei</surname>
          </string-name>
          ,
          <article-title>SPARDACUS SafetyCage: A new misclassification detector</article-title>
          , in:
          <source>Proceedings of the 6th Northern Lights Deep Learning Conference (NLDL)</source>
          ,
          <source>PMLR</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>140</lpage>
          . URL: https://proceedings.mlr.press/v265/johnsen25a.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chicco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jurman</surname>
          </string-name>
          ,
          <article-title>The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation</article-title>
          ,
          <source>BMC Genomics</source>
          <volume>21</volume>
          (
          <year>2020</year>
          ) 6. doi:10.1186/s12864-019-6413-7.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hastie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <article-title>The elements of statistical learning: data mining, inference and prediction</article-title>
          , 2 ed., Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>"Why should I trust you?": Explaining the predictions of any classifier</article-title>
          ,
          <year>2016</year>
          . URL: https://arxiv.org/abs/1602.04938. arXiv:1602.04938.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>The MNIST database of handwritten digit images for machine learning research</article-title>
          ,
          <source>IEEE Signal Processing Magazine</source>
          <volume>29</volume>
          (
          <year>2012</year>
          )
          <fpage>141</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ming</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Out-of-Distribution Detection with Deep Nearest Neighbors</article-title>
          ,
          <year>2022</year>
          . doi:10.48550/arXiv.2204.06507.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Finley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , W. Ma,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          , T.-Y. Liu,
          <article-title>LightGBM: A highly efficient gradient boosting decision tree</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          )
          <fpage>3146</fpage>
          -
          <lpage>3154</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Botta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>ECOD: Unsupervised outlier detection using empirical cumulative distribution functions</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>35</volume>
          (
          <year>2023</year>
          )
          <fpage>12181</fpage>
          -
          <lpage>12193</lpage>
          . URL: http://dx.doi.org/10.1109/TKDE.2022.3159580. doi:10.1109/tkde.2022.3159580.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Aychang</surname>
          </string-name>
          , aychang/roberta-base-imdb, https://huggingface.co/aychang/roberta-base-imdb,
          <year>2020</year>
          .
          <article-title>Fine-tuned RoBERTa model on the IMDb sentiment classification task</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>