<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MRxaI: Black-Box Explainability for Image Classifiers in a Medical Setting</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nathan Blake</string-name>
          <email>nathan.blake@kcl.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David A. Kelly</string-name>
          <email>david.a.kelly@kcl.ac.uk</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Santiago Calderón Peña</string-name>
          <email>santiago.calderon@kcl.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akchunya Chanchal</string-name>
          <email>akchunya.chanchal@kcl.ac.uk</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hana Chockler</string-name>
          <email>hana.chockler@kcl.ac.uk</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>King's College London</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Explainable Artificial Intelligence (XAI) tools have become common in elucidating the decision-making processes of machine learning models in medical imaging. Despite their increasing use, these tools often lack rigorous validation against clinical standards, and there is no single measure which can comprehensively evaluate their performance. Traditional quantitative metrics, including the Sørensen-Dice Coefficient, Jaccard Index, and Hausdorff Distance, individually capture overlap or spatial characteristics but fail to account simultaneously for key clinical requirements, including size, location, and continuity of explanations. To address these shortcomings, we propose the Penalized Dice Coefficient (PDC), a novel quantitative measure integrating spatial alignment, area similarity, and explanation fragmentation into a single, clinically relevant measure. We demonstrate the utility of the PDC through simulations, comparing it against established metrics under varied conditions, including shifts in location, changes in size, and combined transformations. We apply the PDC to a realistic medical research task to evaluate and compare popular black-box XAI tools using a publicly available brain MRI dataset for tumor classification. Results reveal clear differences in tool performance, with one tool, ReX, consistently outperforming others, highlighting the utility of the PDC in discriminating between clinically useful explanations. Our findings underscore the importance of tailored evaluation metrics in medical XAI applications and establish the PDC as a viable addition.</p>
      </abstract>
      <kwd-group>
        <kwd>XAI</kwd>
        <kwd>MRI</kwd>
        <kwd>Glioma</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Post-hoc Explainable AI (XAI) is the process whereby a technique is applied to a machine learning
(ML) model in order to discover which features that model used to make a particular classification.
These techniques are sometimes referred to as local, referencing the fact that they seek to explain a
given instance, rather than the general workings of the model. There are several popular post-hoc XAI
tools available such as SHAP [1], LIME [2] and Grad-CAM [3], as well as more recent additions such as
ReX [4, 5].</p>
      <p>These methods have largely been developed in the context of general computer vision, but have been
increasingly applied to medical imaging (ref). However, their use in medical domains is not universally
accepted, for several reasons. One of these reasons is that XAI techniques are not validated to the rigor
required of medical applications [6]. Indeed, for this reason the DECIDE-AI guidelines, which set
minimum standards for the publication of clinical studies involving AI [7], do not recommend the use
of XAI. However, more recent medical guidelines for AI implementation, FUTURE-AI, do include XAI
as one of six essential pillars [8]. This debate highlights that medical AI tasks require a much higher
burden of proof, and what may be sufficient for general computer vision tasks is often insufficient for
ostensibly similar medical vision applications.</p>
      <p>This has been acknowledged and several frameworks for a more rigorous assessment of medically
deployed XAI tools have been created. For instance, Jin et al. [9] describe four criteria, including
assessment against simulations or human annotations, in both cases using some sense of a ground truth
by which to measure the ‘faithfulness’ and ‘plausibility’ of an XAI tool. Ali et al. [10] provide a more
general framework, highlighting often conflicting desiderata between those developing a model and
the clinicians using it [10]. This requires that explanations are able to satisfy criteria from two distinct,
if overlapping, domains, with clinicians being particularly interested in post-hoc explainability and
assessing how such an explanation fits into the existing medical evidence base. These are not mutually
exclusive goals, and clinicians also seek more interpretable models [11], but it is clear that this must be
framed in terms of clinical relevance. This divergence in expectations, with developers seeking to
make the model more interpretable and clinicians seeking clinical plausibility [12], also influences how
XAI outputs themselves (as distinct from the model) are assessed.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work: what is a good explanation?</title>
      <p>In general, there is an absence of assessment of XAI tools in the medical domain. Part of this stems
from the lack of an agreed definition of what an explanation is, with relatively few XAI tools defining
explanations formally. With no consensus on what an explanation is, a definition of a ‘good’ explanation
is even more fraught.</p>
      <p>However, at a high level, we can consider several factors which contribute to what makes for a
good explanation. Perhaps foremost of these is the nature of the model under consideration:
the explanation for a Large Language Model (LLM) output will look very different to that for an image-based
classifier. Explanations for the latter often take the form of a heat map, or saliency map, which in some
way highlights the importance of the pixels of an image to a model's classification.</p>
      <p>The gold-standard for assessing an explanation for a medical image classifier is a qualitative
assessment by the healthcare professionals expected to use the model during deployment. However, this
method does not scale well as the time required of such healthcare professionals is a scarce resource.
Therefore, there is a need to supplement qualitative assessments with quantitative approaches.</p>
      <p>There are many proposed methods to quantify the quality of an explanation: insertion/deletion
curves [13] measure changes in model confidence as pixels are added and removed. Various fidelity
measures have also been proposed [14], all based on the assumption that an explanation which is in
some way resistant to perturbation is a good explanation. It is unclear how useful this is in a medical
setting, as fidelity may just be a measure of a model’s stubbornness rather than accuracy.</p>
      <p>There exist several measures to quantify the degree to which a model identifies known clinical
features. These are applicable when some kind of ground truth is available, usually domain expert
annotation. This is predicated upon the extent to which the model is identifying the same features a
clinician would use to make a diagnosis. This may not be true, as an AI model may learn to use other
features, which may be spurious correlations or hitherto unknown true biomarkers of disease (Figure 8).
However, given that clinicians want models which do align with known clinical features, such measures
are useful to some applications.</p>
      <p>Common measures in this domain include, but are not limited to: the Sørensen–Dice coefficient (DC), the
Jaccard Index (JI) and the Hausdorff Distance (HD). Common to all is that they, in some sense, measure
the degree of overlap between two images. The DC (Equation (1)) measures the similarity of any two
sets [15, 16]. It is a common metric to assess the similarity of medical images in segmentation tasks,
where one image is the result of a segmentation model and the other is a ground truth mask.</p>
      <p>Given two sets of pixels X and Y, we have</p>
      <p>DC(X, Y) = 2 |X ∩ Y| / (|X| + |Y|)   (1)</p>
      <p>The Jaccard Index (Equation (2)), JI, takes the ratio of the size of the intersection of two sets to the size
of their union, and is perhaps more commonly used in the computer science literature than the DC.</p>
      <p>JI(X, Y) = |X ∩ Y| / |X ∪ Y|   (2)</p>
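      <p>For binary masks both measures are straightforward to compute. The following sketch is our own illustration, assuming the ground truth and explanation are boolean NumPy arrays of equal shape; it is not code from any of the tools discussed.</p>
      <preformat>
import numpy as np

def dice_coefficient(x, y):
    """Sørensen–Dice coefficient (Equation (1)) for two boolean masks."""
    x, y = x.astype(bool), y.astype(bool)
    intersection = np.logical_and(x, y).sum()
    denominator = x.sum() + y.sum()
    return 2.0 * intersection / denominator if denominator else 1.0

def jaccard_index(x, y):
    """Jaccard Index (Equation (2)) for two boolean masks."""
    x, y = x.astype(bool), y.astype(bool)
    intersection = np.logical_and(x, y).sum()
    union = np.logical_or(x, y).sum()
    return intersection / union if union else 1.0
      </preformat>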
      <p>Unfortunately neither the DC nor the JI accounts for the distance between an explanation and the
ground truth, nor the size or the number of explanation segments. These are important, as the farther
an explanation is from the ground truth, the less clinically useful it is, as it takes human attention
away from pertinent areas. Similarly, an explanation that is too large can distract attention, and one
too small can lead to missing clinical information. An explanation erroneously consisting of multiple
non-contiguous areas is also distracting.</p>
      <p>HD(X, Y) = max { sup_{x ∈ X} inf_{y ∈ Y} d(x, y), sup_{y ∈ Y} inf_{x ∈ X} d(x, y) }   (3)</p>
      <p>More able to capture these features, but also much more computationally expensive, the Hausdorff
Distance (Equation (3)), HD, measures the greatest distance from any point in one set to the nearest
point in the other set. It captures the notion of the maximum minimal separation between two sets
and is commonly used in computer vision and medical imaging to quantify how closely two shapes
resemble each other in terms of spatial location and shape alignment. However, it is sensitive to outliers,
especially for the particularly irregular shapes which are more common in medical imaging, and unlike the
DC and JI it does not give any indication of overlap between shapes.</p>
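      <p>The HD can be computed from the coordinates of the foreground pixels of each mask. The sketch below is our illustration, using SciPy's directed Hausdorff routine on boolean NumPy masks.</p>
      <preformat>
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(x, y):
    """Symmetric Hausdorff Distance (Equation (3)) between two boolean masks."""
    # Coordinates (row, column) of the foreground pixels of each mask.
    px = np.argwhere(x)
    py = np.argwhere(y)
    forward, _, _ = directed_hausdorff(px, py)
    backward, _, _ = directed_hausdorff(py, px)
    return max(forward, backward)
      </preformat>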
    </sec>
    <sec id="sec-3">
      <title>3. Penalized Dice Coefficient (PDC)</title>
      <p>We propose a new measure, the Penalized Dice Coefficient (Equation (5)), or PDC, which combines
the strengths of these disparate measures. It captures the difference between a ground truth (HPE) and
an explanation, E, taking the comparative areas, the distance between the areas, and the number of
areas into account.</p>
      <p>First, we need a notion of distance. Let d be the standard Euclidean distance, measured between
the centers of the HPE and E, normalized against the maximum possible distance from the center of
the HPE (d_max). We capture this as the location term p = 1 − d/d_max, where d/d_max = 0 means that
the centers are in the same location, and d/d_max = 1 that the centers are maximally far from each other.</p>
      <p>We also calculate the ratio between the HPE area and the E area. As we want the PDC to always be in
the range (0, 1) for easy comparison, we define the ratio r as:</p>
      <p>r(E, HPE, α, β) = (size(E) / size(HPE))^α if size(E) &lt; size(HPE); (size(HPE) / size(E))^β if size(E) &gt; size(HPE); 1 if size(E) = size(HPE)   (4)</p>
      <p>where α and β are parameters, between (0, 1), which allow users to define how much they wish to
penalize explanations that are too small or too big, respectively. The ratio r is thus bounded in
the range (0, 1).</p>
      <p>We combine the location term p and the area ratio r with the DC (the latter ensuring that explanations which
overlap with the HPE score higher than non-overlapping explanations) to form the summary statistic
of the PDC (Equation (5)).</p>
      <p>As each component p, r and DC is in [0, 1], the PDC is also in this range. 1 indicates a perfect
alignment of the HPE and explanation area and 0 indicates an empty explanation. Unlike the DC, which
simply returns 0 if two masks do not overlap to any degree, the PDC is always a strictly positive number,
assuming the explanation is not empty, but approaches 0 as the different penalties for size and location
come into effect.</p>
      <p>In many cases XAI tools give an explanation with multiple non-contiguous areas. In this case the
PDC for the explanation mask as a whole is given as the mean of the PDCs of its non-contiguous areas.</p>
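      <p>For illustration only, the sketch below implements the components described above in Python and combines them with an unweighted mean, one plausible combination that is consistent with the properties stated (bounded in [0, 1] and strictly positive for non-empty explanations). The unweighted mean, the exponent form of the size penalty, and the symbol names are our assumptions, not the definitive form of Equation (5).</p>
      <preformat>
import numpy as np
from scipy import ndimage

def _pdc_single(expl, hpe, alpha=0.5, beta=0.5):
    """PDC for one contiguous explanation region against the HPE mask (a sketch)."""
    size_e, size_h = expl.sum(), hpe.sum()
    if size_e == 0:
        return 0.0
    # Location term p = 1 - d/d_max: center-to-center distance, normalized by the
    # maximum possible distance from the HPE center (here, to the farthest corner).
    center_e = np.array(ndimage.center_of_mass(expl))
    center_h = np.array(ndimage.center_of_mass(hpe))
    h, w = hpe.shape
    corners = np.array([[0, 0], [0, w - 1], [h - 1, 0], [h - 1, w - 1]])
    d_max = np.linalg.norm(corners - center_h, axis=1).max()
    p = 1.0 - np.linalg.norm(center_e - center_h) / d_max
    # Size ratio r, penalizing explanations that are too small (alpha) or too big (beta).
    ratio = min(size_e, size_h) / max(size_e, size_h)
    r = ratio ** (alpha if size_e &lt; size_h else beta)
    dc = 2.0 * np.logical_and(expl, hpe).sum() / (size_e + size_h)
    return (dc + p + r) / 3.0  # assumed unweighted mean of the three components

def pdc(explanation, hpe, alpha=0.5, beta=0.5):
    """Mean PDC over the non-contiguous regions of an explanation mask."""
    labels, n = ndimage.label(explanation)
    if n == 0:
        return 0.0
    return float(np.mean([_pdc_single(labels == k, hpe, alpha, beta)
                          for k in range(1, n + 1)]))
      </preformat>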
    </sec>
    <sec id="sec-4">
      <title>4. Methods</title>
      <p>We performed a number of experiments, both simulated and involving a brain tumor dataset to
demonstrate the performance of the various measures discussed above.</p>
      <p>Simulations We use simulations to compare the behavior of these quantitative measures under a
number of conditions. In all simulations a fixed ’ground truth’ mask was generated on a 256 × 256 pixel
grid using a procedurally defined irregular shape based on a polar coordinate perturbation of a radial
contour. This base shape was created by sampling 10-15 points at equiangular intervals around a circle
and applying random radial deviations, followed by morphological closing, opening, and hole filling.
Distance An explanation mask was created, identical in shape to the ground truth, but its location
translated across various areas of the grid. The translation followed a linear diagonal path, moving the
mask from the upper-left toward the bottom-right of the image domain in nine uniform steps. The
central mask in the sequence aligned exactly with the ground truth (Figure 1).</p>
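      <p>A minimal sketch of this shape-generation procedure is given below, assuming NumPy, scikit-image and SciPy; the radii, deviation scale and structuring-element sizes are illustrative choices rather than the exact values used in our simulations.</p>
      <preformat>
import numpy as np
from scipy.ndimage import binary_fill_holes
from skimage.draw import polygon
from skimage.morphology import binary_closing, binary_opening, disk

def irregular_mask(size=256, seed=None):
    """Procedurally generated irregular binary mask on a size x size grid."""
    rng = np.random.default_rng(seed)
    n_points = rng.integers(10, 16)  # 10-15 contour points at equiangular intervals
    angles = np.linspace(0, 2 * np.pi, n_points, endpoint=False)
    radii = np.clip(60 * (1 + 0.4 * rng.standard_normal(n_points)), 20, 110)
    rows = size / 2 + radii * np.sin(angles)
    cols = size / 2 + radii * np.cos(angles)
    mask = np.zeros((size, size), dtype=bool)
    rr, cc = polygon(rows, cols, shape=mask.shape)
    mask[rr, cc] = True
    # Morphological closing, opening and hole filling, as described in the text.
    mask = binary_closing(mask, disk(5))
    mask = binary_opening(mask, disk(5))
    return binary_fill_holes(mask)
      </preformat>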
      <p>Figure 2 shows the DC, JI, HD and PDC scores for each plot given in Figure 1. This shows
that measures such as DC and JI contain no additional information beyond some degree of overlap. The
PDC and HD scale more gradually, but the PDC does have a ‘bump’ when it starts to overlap with the
ground truth.</p>
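      <p>The translation experiment can be reproduced along the following lines; this is a sketch under our own assumptions about the step size, reusing the measure sketches given earlier.</p>
      <preformat>
import numpy as np
from scipy.ndimage import shift

def diagonal_sweep(ground_truth, n_steps=9, step=40):
    """Translate copies of the ground truth along the image diagonal.

    The middle mask in the sequence coincides exactly with the ground truth,
    matching the setup described for Figure 1.
    """
    half = n_steps // 2
    masks = []
    for i in range(-half, half + 1):
        moved = shift(ground_truth.astype(float), (i * step, i * step), order=0)
        masks.append(moved.astype(bool))
    return masks

# Example (using the earlier sketches): score each translated mask against the truth.
# gt = irregular_mask(256, seed=0)
# scores = [dice_coefficient(m, gt) for m in diagonal_sweep(gt)]
      </preformat>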
      <p>Size An irregular shape was rescaled around its center by factors ranging from 0.6 to 1.7 in 9 uniform
steps. Scaling was applied to the contour points relative to the centroid of the image, and each scaled
contour was rasterized to generate the corresponding binary mask. The center of the mask remained
fixed at the center of the image. The ground truth was the version with a scale factor of 1.0. All other
masks preserved contour shape but differed in size (Figure 3).</p>
      <p>Figure 4 shows the DC, JI, HD and PDC scores for each plot given in Figure 3.
In all rescalings there is some degree of overlap and so all measures return a non-zero score. The PDC remains more
symmetric than the HD around the inflection point of unitary scale.</p>
      <p>Size and Distance The shape generation and scaling were performed as above, but each scaled
contour was also shifted along a diagonal path from the upper-left to the bottom-right corner. The
shift and scale were both applied in 9 uniform steps such that the middle mask (5th index) matched the
ground truth in both size and position (Figure 5).</p>
      <p>Figure 6 shows the DC, JI, HD and PDC scores for each plot given in Figure 5. Again, the DC
and JI give no information outside of a degree of overlap. The PDC and HD are strictly more informative
in the sense that they give information even when no overlap occurs, capturing the intuition that an
explanation closer to a ground truth - even if not overlapping with it - is better than a more distant
explanation. Whether this intuition holds for a given problem is one which the domain expert should
closely consider. If it holds, then measures such as the PDC and HD are strictly more informative.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Design</title>
      <p>The most commonly used XAI tools in medical diagnostic imaging are Gradient-weighted Class
Activation Mapping (Grad-CAM) [3], Local Interpretable Model-agnostic Explanations (LIME) [2] and
SHapley Additive exPlanations (SHAP) [1]. Randomized Input Sampling for Explanation (RISE) [13]
and Integrated Gradients (IG) [17] are less popular in the medical literature but have attractive
features, discussed below, making them worth considering. ReX [4], a tool based on the theory of actual
causality [18], has not previously been used in medical explainability.</p>
      <p>Most of the XAI tools described above, and indeed many similar tools, occlude, or otherwise perturb,
various pixels in an input image to determine the difference such perturbations make to its classification.
The nature of the perturbations might include setting pixel values to a baseline value such as 0 or to
the mean pixel value, with or without blurring at the edges.</p>
      <sec id="sec-5-1">
        <title>5.1. Model and dataset</title>
        <p>The model used is a pre-trained CNN based on the ResNet50 architecture [19]. Brain MRI data was
obtained from The Cancer Imaging Archive, as published by Buda et al. [20], and made publicly available
on Kaggle [21]. This curated dataset of 110 pre-operative patients with low-grade gliomas (LGG) was
gathered from five US institutions. All of the images are axial slices of the brain and include a
fluid-attenuated inversion recovery (FLAIR) sequence, while 101 also have pre-contrast sequence and 104
have post-contrast sequence images. Each patient had between 20 and 88 slices taken, with a total of
3,929 images. Each image contains 3 channels, one for each of the three sequences. All images are
256 × 256 × 3. A radiologist annotated the FLAIR images into binary masks of “tumor” or “no tumor”.
This provides a proxy for the ground-truth, a HPE, by which to quantify the performance of the XAI
tools.</p>
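        <p>For concreteness, a hedged sketch of how such a setup might be assembled in PyTorch is given below; the weights file, its path and the binary classification head are hypothetical placeholders, not the released model.</p>
        <preformat>
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Hypothetical setup: a ResNet50 backbone with a binary (tumor / no tumor) head.
# "lgg_resnet50.pt" is a placeholder path, not an artifact of this study.
model = resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, 2)
model.load_state_dict(torch.load("lgg_resnet50.pt", map_location="cpu"))
model.eval()

def classify(image_chw):
    """Classify one 3 x 256 x 256 tensor (one channel per MRI sequence)."""
    with torch.no_grad():
        logits = model(image_chw.unsqueeze(0))
    return int(logits.argmax(dim=1))
        </preformat>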
        <p>As there is no clear definition of an explanation of a negative classification, we have retained only
the positive (diseased) images for assessment, leaving 1370 images. Of these 1370, 118 (≈ 9% of the
data) are false negatives, where the model reports no tumor even though the GT indicates a tumor. As
the goal is to explain positive classifications, we also exclude the false negatives. The failure modes of
each tool are interesting in and of themselves and merit further study.</p>
        <p>All XAI tools tested provide explanations in the sense that the pixels indicated by the tool are sufficient
to generate a positive classification by the model (with the possible exception of IG, discussed later). It
is categorically not the case, however, that each explanation is equally good.</p>
        <p>It is desirable for an XAI tool to have few, if any, parameters which require fine-tuning [22]. We
use all tools with their default settings for the experiments, except for the number of mutant images
generated. To put each tool on a more level playing field, we try to allow them the same computational
work budget. This is not entirely possible, as ReX, for example, uses an iterative refinement procedure
which is non-deterministic in its total mutant production.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>We evaluated the XAI tools against the HPE. We also inspected tool performance in order to understand
why they succeed or fail. None of the XAI tools have been optimized for this particular task. Such
an optimization for every tool would be impractical. Moreover, parameter fine-tuning suggests prior
knowledge on the part of the user as to appropriate values. Our goal was to determine the effectiveness
of a tool without such knowledge.</p>
      <p>We therefore proceeded with the most parsimonious case of using each tool’s default parameters,
except for the number of mutants generated which was set at 2000. We set the total work budget
for ReX at 2000 as an upper limit. We present the relative performance of each tool in Figure 7 and
Table 1. For simplicity we did not include size penalization for the PDC in this experiment.</p>
      <p>Across all measures of performance, ReX is the best performing tool, surpassing even Grad-CAM, a
white-box XAI tool, which is usually in second place, followed closely by SHAP.</p>
      <p>However, the differences between these three XAI tools are more prominent in the PDC plots, compared
even to the HD. For instance, there is no overlap in the 75th percentile box between SHAP and ReX,
unlike any other measure.</p>
      <p>For the purposes of visualization we show four rows of results corresponding to the worst, median,
mean and best ReX PDC respectively in Figure 8 (panel labels: Grad-CAM, LIME, RISE, IG, SHAP, ReX, PDC).
Notably, there is significant diversity in explanations,
even if the GT is a part of the explanation. This, despite being the same input and model, highlights the
need to assess XAI tools themselves as they do not necessarily accord.</p>
      <sec id="sec-6-1">
        <title>6.1. Tool analysis</title>
        <p>In this section we inspect the explanations provided by the XAI tools, and investigate their performance
and their failure modes. XAI tools are frequently applied to general datasets of everyday objects, such as
ImageNet [23]. These images exhibit a relatively high degree of diversity, so that a dog, for example, may
appear against many different backgrounds, in different sections of an image. MRIs almost completely
lack this diversity, with brains at relatively consistent locations, sizes and pixel densities. Occlusion
techniques and defaults developed for models trained on diverse data may be inappropriate for models
trained on low diversity data. The model may be overly sensitive to occlusions, or fail entirely.
Grad-CAM Grad-CAM is the only white-box XAI tool we investigated, and is the second best
performing tool in terms of PDC and HD. These measures both incorporate the distance of a GT
to an explanation, whereas DC and JI only consider the degree to which pixels overlap, indicating
that it is relatively well able to get close to the GT. However, it also often includes unassociated
regions of the head (though it seldom shows areas outside of the head, which helps imbue trust with
clinicians), as demonstrated in Figure 9a. These partitions tend to separate the brain and the skull via
the subarachnoid space. This separation is perhaps related to a similar phenomenon observed in brain
imaging of explanations localizing to eyeballs – the sharp gradients of such anatomical features confuse
the model, as sharp gradients are also a common feature of tumors.</p>
        <p>From a qualitative perspective, even though Grad-CAM has a tendency to return multiple areas of
explanation, these areas are often confined within a contained region. This would draw a clinician’s
eye to that region (see top row of Figure 8), thus could be said to be performing well, as captured by the
PDC and HD.</p>
        <p>RISE creates random rectangular masks which it overlays on an image. The number and size are
controlled by parameters. We generate 2,000 mutants, in keeping with the maximum amount of work
performed by ReX. We set the other parameters to their default values. It is the least complex of the
black-box tools we investigated, and performs generally poorly across all measures. The tool has a
tendency to return explanations at the front of the brain or explanations that fill the majority of the
image (Figure 9c).</p>
        <p>To examine this, we calculated the average DC of the random masks produced by RISE against a
mask containing the front region of the head. As we used the same seed for all experiments, the mask
production is the same for all images. The average DC of 0.002 indicates that occlusions almost never
covered these areas, hence their over-representation in passing mutants.</p>
        <p>There were no duplicates in the mutants produced by RISE with our chosen seed. The area a mutant
covers with default parameters is approximately 10% of the image. As the model is overly generous
in what it accepts, the mutants that RISE produces do not cover enough of the image to force failing
cases. Passing cases constitute 77% of the mutants on our dataset. It has long been known in the testing
community that a balance of passing and failing cases is required for a good-quality output [24] and it
is also the reason for outputting wrong regions of the image as the explanation here. The result is a
heatmap which is too hot in general. The concentration of explanations towards the front of the head is
due to the particular set of mutants generated. This indicates that RISE is overly sensitive to initial
conditions (i.e. the random seed).</p>
        <p>LIME is unique among the XAI tools we tested in that it uses a segmentation algorithm, rather than
rectilinear partitioning, to generate occluded mutants (Figure 9b). The resulting segments are generally
larger than pertinent to neuro-anatomical features. Nor are they anatomically meaningful, though this
could be achieved by segmentation algorithms dedicated to brain images [25]. In our dataset, there
was an average of 40 segments per image. The probability of mutants sharing at least one segment
approaches 1 after just 19 mutants. As LIME with default parameters generates 1,000 mutants, the
best case scenario is that there exists a subset of ≈ 25 mutants that share a segment. In practice, this is
likely to be much higher. Further, the default use of the mean segment value to generate mutants is not
clinically meaningful. This results in mutants which are not diverse, likely a consequence of the relative
homogeneity of MRI images compared to standard images for which LIME was developed. Additionally,
as has been noted in the context of histopathological images [26], LIME is non-deterministic: given
the same image and random seed it can produce different explanations. This is an undesirable trait,
especially in a medical context in which consistency is paramount.</p>
        <p>SHAP is the second best performing in terms of DC and JI. As seen in Figure 9e it tends to return multiple
segments, many of which are distant to the HPE. This is ameliorated by the fact that among these is the
tumor. However, from a clinical perspective, having so many disparate areas in an explanation is highly
undesirable, as each area requires clinical scrutiny for confirmation. From a qualitative perspective,
the performance of Grad-CAM is preferable to SHAP. Although both produce multiple regions of
explanation, in the former these more often cluster in the clinically relevant region. This demonstrates
the strength of using distance based measures, such as HD and PDC, for evaluating medical images.</p>
        <p>The computation of Shapley values requires considering all possible mutants to quantify the
contribution of each pixel. To avoid this computational overhead, SHAP relies on Owen values to estimate
the contribution of a collection of pixels, which is a valid approximation of Shapley values [27]. As
a consequence, the granularity of the explanations is constrained by the minimal grouping of pixels
utilized.</p>
        <p>Additionally, prior research has demonstrated that reliance on a single iteration can introduce bias
into the explanations [28]. In the context of this experiment, this bias manifests in the form of certain
images consistently yielding explanations that are empty over the whole image: no mutant was able
to induce a change in the image’s classification. SHAP requires the specification of a blur window in
order to generate mutants of images. A small-scale investigation suggested to us that there was no one
ideal value for this setting. We utilized a 64 × 64 window in our final experiments, noting that different
window sizes can produce different, including empty, results.</p>
        <p>IG presents an interesting case. The rationale behind IG is different from that of the other tools we
evaluated, in the sense that IG is most similar to gradual exposure of an image from the background. IG
does not isolate specific areas of the image, but rather gradually decreases the transparency of all pixels
of the image. The output of IG is a matrix of pixel gradients. Our assessment method assumes locality
of explanations, hence when assessed using the same method we apply to the other tools, it is the worst
performing tool across all measures. As can be seen from Figure 9d its explanation is often composed of
many disjoint small groups of pixels, but often spatially overlaps with the HPE. For a human observer,
this can be helpful. If we apply the same pixel threshold as applied to SHAP, RISE and Grad-CAM, IG
exhibits a particularly poor performance, so we instead calculated the measures on IG’s original output
in keeping with its intended mode of use.</p>
        <p>ReX is the best performing XAI tool by all measures. It is comparatively fast, on the order of seconds,
due to its dynamic mutant generation, which stops when no new mutants increase the measure of causal
responsibility. Its explanations usually at least partially coincide with the tumor, thus producing a
higher DC and JI relative to the other tools. Additionally, it seldom returns spurious explanations with
multiple areas distant from the HPE, resulting in a good PDC and HD. However, as with other tools, it
is not immune to the effect of the model treating eyeballs as signs of cancer.</p>
        <p>It is perhaps surprising that ReX outperforms Grad-CAM, but on this particular task it is highly likely
that the model is suboptimal, despite a high test accuracy (≈ 91%). This is a very common feature of
deep learning models in a medical context [29, 30, 31]. As Grad-CAM is a white-box method, it is more
sensitive to over-fitting accrued in the training process [32], and is likely to improve significantly on a
thoroughly curated and validated dataset and model.</p>
        <p>We occasionally see straight edges in ReX explanations (Figure 9f). Straight edges are extremely
unlikely in any natural phenomenon, especially tumors. This appears to be a byproduct of the explanation
extraction mechanism. SHAP exhibits similar rectilinearity. As these are explanation tools, and not
segmentation tools, we do not penalize explanations which have these features.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <p>Much of the focus of XAI in the medical domain is on the need for it and its required characteristics,
how it should be incorporated into medical workflows and how to evaluate its outputs. There are also
many studies which employ XAI in a medical context. Far fewer publications have focused on directly
comparing XAI tools as we have here.</p>
      <p>In one study comparing several white-box XAI tools for chest x-ray images, Grad-CAM was found
to be the best performing, supporting our decision to use it as a comparator [33]. This study also
used HPEs to compare XAI tool outputs, but additionally had blinded clinicians annotate the images
again in order to create a baseline explanation. By doing so they found that Grad-CAM struggled with
pathologies that were small or irregular in shape. We have also noted that all the XAI tools struggle to
match the contours of irregular shaped tumors. The very occasional exception to this is LIME, which at
times provided at least partial tracing of complex shapes. This is likely due to its segmentation step
sometimes successfully delineating the complex contours of a tumor.</p>
      <p>Arun et al. [34] similarly compared a number of white-box XAI tools and found that none performed
as well as dedicated segmentation DNNs [34]. They stress that the tools failed on a number of clinical
benchmarks, highlighting the performance level required of clinically orientated XAI tools over and
above normal imaging standards. As noted above, explanations are not segmentations, and a dedicated
segmentation tool does not explain what a DL model is doing. Such segmentation tools themselves are,
of course, in need of explanation.</p>
      <p>That different XAI tools will give different explanations with the same model and dataset has been
noted in various medical contexts including: histopathology (CAM variants vs LIME variants) [26],
blood test results (interpretable models vs SHAP) [35] and electronic healthcare records (SHAP vs
LIME) [36]. We have similarly noted that the XAI tools often give conflicting information to one another.
As they are evaluating the same model, this can only be an artifact of the XAI tools themselves. For this
reason, nascent guidelines for publishing clinically orientated AI studies in premier medical journals do
not include a recommendation that they include explainability, highlighting the need for biomedical
specific XAI research [7].</p>
      <p>Although several papers have compared various XAI tools, as far as we can discern this is the first
paper comparing a suite of black-box XAI tools against each other and the white-box method Grad-CAM
on MRI data.</p>
      <p>All of the XAI tools were used with their default settings, with the exception of mutant budget. This
was set to 2000 where applicable. It is likely that each could be optimized for clinical applications to
improve the performances reported here. However, it would be impractical to exhaustively search the
entire parameter space for each tool. Even if possible, great care would be required to avoid over-fitting
the tool parameters to perform well on a particular model and dataset, but being unable to generalize to
new datasets. Therefore, assessing each tool with its default settings provides the most parsimonious
comparison.</p>
      <p>Although the PDC has prima facie validity in that it rewards explanations closer, and of a similar
size, to the HPE, it requires validation in the clinical setting. To this end we plan to correlate PDC with
clinicians’ assessments to develop the first clinically validated measure of XAI tools. These results are
based upon a single clinical dataset and would need to be repeated on several such datasets in order to
confirm the generalisability of our findings.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Limitations</title>
      <p>Due to the parameters associated with penalizing size, direct quantitative comparison across different
studies becomes challenging. This is a result of the measure attempting to quantify what are currently
subjective intuitions (i.e. that size should be penalized). Hence the PDC is only meaningful within
the same study, and does not easily lend itself to cross-study comparison. Although this limits its
applicability, it can still be useful for teams developing medical AI models who need to assess the quality
of explanations at scale.</p>
      <p>The XAI tools considered in this paper all have their limitations when applied to tumor detection in
MRIs. ReX performs the best across all measures of performance, at least comparable to Grad-CAM
which is significant given that the latter is a white-box method.</p>
      <p>If XAI tools are to find a place in nascent clinical AI workflows, then their performance must be
demonstrable to a degree commensurate to the task. ReX is under active development and assessment.
Discussions with neuroimaging experts will guide the tool’s development in the clinical setting. Future
work will include assessing tools with clinically validated DL models, trained on large, multi-site
datasets with annotations performed by leading cancer imaging experts. Access to domain experts
will also allow us to conduct clinically orientated qualitative assessments of XAI outputs, alongside
quantitative measures.</p>
      <p>Future work will focus on creating guidelines for parameter choice for different XAI tools. Clearly,
default parameters are insufficient when applied to specialized datasets and models. This should not
come as a surprise, but does leave the task of finding domain-appropriate defaults open. This is a large
task, requiring high quality datasets, well trained models, and clinicians who can assess the clinical
relevance of XAI output.</p>
      <p>Although ReX outperformed its competitors here, it is possible that the other XAI tools could be
improved to make them more amenable to clinical tasks. We will explore several modifications that
might make these general XAI tools more suitable to medical images.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusion</title>
      <p>We have presented a study of XAI tools on an MRI dataset and introduced a new measure for assessing
explanation quality against a provided segmentation mask. We have shown that the PDC is more
expressive than other spatial measures and that the tool ReX produces the highest quality explanations
across all measures.</p>
      <p>There exist benchmarks to assess general XAI tools in the general computer vision domain. We
recommend the development of a similar medically orientated benchmark. This would constitute a suite
of datasets and quantitative measures explicitly designed to rigorously assess XAI tools for medical
tasks.</p>
      <p>Acknowledgments The authors were supported in part by CHAI—the EPSRC Hub for Causality in
Healthcare AI with Real Data (EP/Y028856/1).</p>
    </sec>
    <sec id="sec-10">
      <title>Appendix A</title>
    </sec>
    <sec id="sec-11">
      <title>Comparing tool output</title>
      <p>Unfortunately, the different XAI tools do not produce outputs which are directly comparable or
immediately amenable to the DC or PDC. Both LIME and ReX provide boolean-valued masks in addition
to heatmaps. Comparing these masks is much easier than comparing heatmaps. We use these masks
directly when comparing against HPE.</p>
      <p>SHAP, RISE and Grad-CAM do not provide these masks directly, so we need to extract them from
the heatmaps. We do this via a method similar to that used by ReX to produce its minimal passing
explanations. We choose those pixels which are most highly ranked in the heatmap and query the
model on these pixels only. If this satisfies the classification requirement, we stop there; otherwise we
add in the next set of pixels, and so on, until we have the necessary classification. This gives us a direct
assessment of the quality of the heatmap, as sketched below.</p>
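      <p>The following is our reconstruction of that thresholding loop, not the tools' own code; it assumes a saliency heatmap, the original image, and a classifier callback that returns the predicted class for a partially occluded image.</p>
      <preformat>
import numpy as np

def extract_mask(heatmap, image, predict, target_class, step_fraction=0.01):
    """Grow a boolean mask from the highest-ranked heatmap pixels until the model,
    queried on those pixels only, reproduces the target classification."""
    order = np.argsort(heatmap, axis=None)[::-1]  # pixel indices, most salient first
    step = max(1, int(step_fraction * heatmap.size))
    mask = np.zeros(heatmap.shape, dtype=bool)
    for start in range(0, heatmap.size, step):
        rows, cols = np.unravel_index(order[start:start + step], heatmap.shape)
        mask[rows, cols] = True
        occluded = np.where(mask[..., None], image, 0)  # keep only selected pixels
        if predict(occluded) == target_class:
            return mask
    return mask  # fall back to the full image if the class is never recovered
      </preformat>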
      <p>We then extract the DC, PDC, HD and JI of every positively labeled image, using each XAI tool.</p>
    </sec>
    <sec id="sec-12">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
      <p>[5] H. Chockler, D. A. Kelly, D. Kroening, Multiple different explanations for image classifiers, in:</p>
      <p>ECAI European Conference on Artificial Intelligence, 2025.
[6] M. Ghassemi, T. Naumann, P. Schulam, A. L. Beam, I. Y. Chen, R. Ranganath, A review of
challenges and opportunities in machine learning for health, AMIA Summits on Translational
Science Proceedings 2020 (2020) 191.
[7] B. Vasey, M. Nagendran, B. Campbell, D. A. Clifton, G. S. Collins, S. Denaxas, A. K. Denniston,
L. Faes, B. Geerts, M. Ibrahim, et al., Reporting guideline for the early-stage clinical evaluation of
decision support systems driven by artificial intelligence: Decide-ai, Nature medicine 28 (2022)
924–933.
[8] K. Lekadir, A. F. Frangi, A. R. Porras, B. Glocker, C. Cintas, C. P. Langlotz, E. Weicken, F. W.</p>
      <p>Asselbergs, F. Prior, G. S. Collins, et al., Future-ai: International consensus guideline for trustworthy
and deployable artificial intelligence in healthcare, bmj 388 (2025).
[9] W. Jin, X. Li, G. Hamarneh, Evaluating explainable ai on a multi-modal medical imaging task:
Can existing algorithms fulfill clinical requirements?, in: Proceedings of the AAAI Conference on
Artificial Intelligence, volume 36, 2022, pp. 11945–11953.
[10] S. Ali, T. Abuhmed, S. El-Sappagh, K. Muhammad, J. M. Alonso-Moral, R. Confalonieri, R. Guidotti,
J. Del Ser, N. Díaz-Rodríguez, F. Herrera, Explainable artificial intelligence (xai): What we know
and what is left to attain trustworthy artificial intelligence, Information fusion 99 (2023) 101805.
[11] M. Din, K. Daga, J. Saoud, D. Wood, P. Kierkegaard, P. Brex, T. C. Booth, Clinicians’ perspectives
on the use of artificial intelligence to triage mri brain scans, European journal of radiology 183
(2025) 111921.
[12] N. Bienefeld, J. M. Boss, R. Lüthy, D. Brodbeck, J. Azzati, M. Blaser, J. Willms, E. Keller, Solving
the explainable ai conundrum by bridging clinicians’ needs and developers’ goals, NPJ Digital
Medicine 6 (2023) 94.
[13] V. Petsiuk, A. Das, K. Saenko, RISE: randomized input sampling for explanation of black-box
models, in: British Machine Vision Conference (BMVC), BMVA Press, 2018.
[14] M. Miró-Nicolau, A. J. i Capó, G. Moyà-Alcover, A comprehensive study on fidelity
metrics for xai, Information Processing and Management 62 (2025) 103900. URL: https://www.
sciencedirect.com/science/article/pii/S0306457324002590. doi:https://doi.org/10.1016/j.
ipm.2024.103900.
[15] L. R. Dice, Measures of the amount of ecologic association between species, Ecology 26 (1945)
297–302.
[16] T. Sørensen, A method of establishing groups of equal amplitude in plant sociology based on
similarity of species and its application to analyses of the vegetation on danish commons, Kongelige
Danske Videnskabernes Selskab. 5 (1948) 1–34.
[17] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: International
conference on machine learning, PMLR, 2017, pp. 3319–3328.
[18] J. Y. Halpern, Actual Causality, MIT Press, Cambridge, MA, 2016.
[19] B. Legastelois, A. Rafferty, P. Brennan, H. Chockler, A. Rajan, V. Belle, Challenges in explaining
brain tumor detection, in: Proceedings of the First International Symposium on Trustworthy
Autonomous Systems, 2023, pp. 1–8.
[20] M. Buda, A. Saha, M. A. Mazurowski, Association of genomic subtypes of lower-grade gliomas
with shape features automatically extracted by a deep learning algorithm, Computers in biology
and medicine 109 (2019) 218–225.
[21] M. Buda, Brain mri segmentation, 2017. https://www.kaggle.com/datasets/mateuszbuda/
lgg-mri-segmentation.
[22] B. H. Van der Velden, H. J. Kuijf, K. G. Gilhuijs, M. A. Viergever, Explainable artificial intelligence
(xai) in deep learning-based medical image analysis, Medical Image Analysis 79 (2022) 102470.
[23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition
Challenge, International Journal of Computer Vision (IJCV) 115 (2015) 211–252. doi:10.1007/
s11263-015-0816-y.
[24] R. Abreu, P. Zoeteweij, A. J. Van Gemund, On the accuracy of spectrum-based fault localization,
in: Testing: Academic and industrial conference practice and research techniques-MUTATION
(TAICPART-MUTATION 2007), IEEE, 2007, pp. 89–98.
[25] M. M. Ghazi, M. Nielsen, Fast-aid brain: Fast and accurate segmentation tool using artificial
intelligence developed for brain, arXiv preprint arXiv:2208.14360 (2022).
[26] M. Graziani, T. Lompech, H. Müller, V. Andrearczyk, Evaluation and comparison of cnn visual
explanations for histopathology, in: Proceedings of the AAAI Conference on Artificial Intelligence
Workshops (XAI-AAAI-21), Virtual Event, 2021, pp. 8–9.
[27] R. Okhrati, A. Lipani, A Multilinear Sampling Algorithm to Estimate Shapley Values, in: 2020 25th
International Conference on Pattern Recognition (ICPR), IEEE Computer Society, Los Alamitos,
CA, USA, 2021, pp. 7992–7999. URL: https://doi.ieeecomputersociety.org/10.1109/ICPR48806.2021.
9412511. doi:10.1109/ICPR48806.2021.9412511.
[28] H. Chen, S. M. Lundberg, S. I. Lee, Explaining a series of models by propagating Shapley values,</p>
      <p>Nature Communications 13 (2022). doi:10.1038/s41467-022-31384-3.
[29] M. Hutson, Artificial intelligence faces reproducibility crisis, 2018.
[30] M. B. McDermott, S. Wang, N. Marinsek, R. Ranganath, L. Foschini, M. Ghassemi, Reproducibility
in machine learning for health research: Still a ways to go, Science Translational Medicine 13
(2021) eabb1655.
[31] V. Volovici, N. L. Syn, A. Ercole, J. J. Zhao, N. Liu, Steps to avoid overuse and misuse of machine
learning in clinical research, Nature Medicine 28 (2022) 1996–1999.
[32] A. Ghorbani, A. Abid, J. Zou, Interpretation of neural networks is fragile, in: Proceedings of the</p>
      <p>AAAI conference on artificial intelligence, volume 33, 2019, pp. 3681–3688.
[33] A. Saporta, X. Gui, A. Agrawal, A. Pareek, S. Q. Truong, C. D. Nguyen, V.-D. Ngo, J. Seekins, F. G.</p>
      <p>Blankenberg, A. Y. Ng, et al., Benchmarking saliency methods for chest x-ray interpretation,
Nature Machine Intelligence 4 (2022) 867–878.
[34] N. Arun, N. Gaw, P. Singh, K. Chang, M. Aggarwal, B. Chen, K. Hoebel, S. Gupta, J. Patel, M. Gidwani,
et al., Assessing the trustworthiness of saliency maps for localizing abnormalities in medical
imaging, Radiology: Artificial Intelligence 3 (2021) e200267.
[35] M. A. Onari, M. S. Nobile, I. Grau, C. Fuchs, Y. Zhang, A.-K. Boer, V. Scharnhorst, Comparing
interpretable ai approaches for the clinical environment: an application to covid-19, in: 2022 IEEE
Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB),
IEEE, 2022, pp. 1–8.
[36] J. Duell, X. Fan, B. Burnett, G. Aarts, S.-M. Zhou, A comparison of explanations given by
explainable artificial intelligence methods on analysing electronic health records, in: 2021 IEEE EMBS
International Conference on Biomedical and Health Informatics (BHI), IEEE, 2021, pp. 1–4.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A unified approach to interpreting model predictions</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          <year>2017</year>
          , pp.
          <fpage>4765</fpage>
          -
          <lpage>4774</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>“Why should I trust you?” Explaining the predictions of any classifier</article-title>
          ,
          <source>in: Knowledge Discovery and Data Mining (KDD)</source>
          , ACM,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          , Grad-CAM:
          <article-title>Visual explanations from deep networks via gradient-based localization</article-title>
          , in: International Conference on Computer Vision (ICCV), IEEE,
          <year>2017</year>
          , pp.
          <fpage>618</fpage>
          -
          <lpage>626</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chockler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kroening</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Causal explanations for image classifiers</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2411.08875. arXiv:2411.08875.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>