<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of Explainable AI by Medical Experts: a Survey of the Existing Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nikolay Babakov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Rezgova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ehud Reiter</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Bugarín</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela</institution>
          ,
          <addr-line>Santiago de Compostela, Galicia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Health Technologies LLC</institution>
          ,
          <addr-line>Saint Petersburg, Russian Federation</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Aberdeen</institution>
          ,
          <addr-line>Aberdeen</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this survey, we examine the landscape of the evaluation of Explainable Artificial Intelligence (XAI) techniques within the medical domain, focusing on studies in which XAI is evaluated by medical practitioners. Our analysis delves into the prevailing trends and identifies notable deficiencies in the current evaluation methodologies. Notably, we uncover that significant details of evaluation studies, such as the user study interface, study location, and participant remuneration, are often disregarded in the final reports describing the evaluation studies. Furthermore, our findings reveal a concerning scarcity of statistical significance testing in evaluation results, leading to overly optimistic conclusions regarding the applicability of XAI. Additionally, we highlight a prevalent trend of assessing XAI in isolation, devoid of comparative analysis, and an unbalanced emphasis on XAI perception attributes such as usefulness, human-AI performance, and clinical relevance, overshadowing other crucial properties. Another issue identified in our survey is the imprecise formulation of participant queries, resulting in an excessive number of similarly purposed questions. The culmination of our study is not only an exposition of these findings but also a curated set of recommendations covering the main steps to be taken before, during, and after an evaluation study, aimed at researchers endeavoring to deploy XAI techniques in real medical applications. These guidelines are designed to enhance the genuine usability evaluation of XAI tools by medical professionals, ensuring a robust and meaningful application of XAI in healthcare.</p>
      </abstract>
      <kwd-group>
        <kwd>explainable artificial intelligence</kwd>
        <kwd>evaluation by medical practitioners</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Artificial intelligence (AI) is rapidly transforming the landscape of medical applications, offering
unprecedented advancements in disease diagnosis, patient care, and treatment methodologies. The integration
of AI within the healthcare sector promises to enhance the efficiency, accuracy, and accessibility of
medical services, thereby improving patient outcomes across a variety of disease areas, including cancer,
neurology, and cardiology [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, the opaque nature of most AI algorithms poses a significant
challenge, particularly in the critical domain of medicine, where decisions directly impact patient health
and lives. This opacity has ignited a growing necessity for explainable AI (XAI) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] that can demystify AI
decision-making processes, ensuring transparency and accountability, and facilitating trust in AI-driven
medical interventions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As the demand for explainability in AI increases, so does the importance of
developing AI systems that not only predict and diagnose with high precision but also articulate their
reasoning in ways that medical practitioners can understand and trust, thereby safeguarding patient
welfare [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        XAI can greatly aid medical practitioners by clarifying AI decisions, supporting tasks from diagnosis
to treatment planning [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In diagnostic imaging, XAI helps specialists interpret complex medical
images, leading to more accurate diagnoses [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It also advances personalized medicine by analyzing
patient data to recommend tailored treatments [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and supports patient monitoring and predictive
healthcare through explainable risk assessments, enabling early interventions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Overall, XAI enhances
practitioner capabilities and promotes a more transparent and patient-centric healthcare system.
      </p>
      <p>
        Although many works discuss XAI evaluation properties [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ], there are no generally established
practices for evaluating XAI techniques in the medical domain. This creates a significant gap in assessing
how medical professionals perceive and understand XAI in real-world tasks.
      </p>
      <p>
        The lack of standardized evaluation frameworks limits assessment of how well medical practitioners
interpret, trust, and use XAI in clinical decision-making. The variety of medical applications and
data types complicates unified evaluation, while XAI’s interdisciplinary nature further challenges the
development of metrics that capture both understanding and practical usability [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Without such
standards, refining XAI models to meet practitioners’ needs is difficult, hindering their integration and
acceptance in healthcare. Addressing this gap is crucial to realizing XAI’s potential to improve patient
care and outcomes.
      </p>
      <p>While a universal approach to XAI evaluation in medicine is challenging, aggregating current practices
can be valuable. In this paper, we present a structured survey of XAI evaluation methods used by medical
practitioners, analyzing methodologies to reveal trends and gaps. Our synthesis highlights the range,
effectiveness, and applicability of evaluation strategies across medical applications, identifying strengths
and weaknesses both within and across studies. Based on these findings, we propose recommendations
to help standardize and improve XAI evaluation, aiming to better align future XAI tools with the needs
of healthcare settings.</p>
      <p>
        The main differences between our survey and the existing surveys discussing XAI evaluation in
general [
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15">12, 13, 14, 15</xref>
        ] and in the medical domain [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ] are as follows:
• Our paper addresses the evaluation of XAI without narrowing the scope of the studied papers to
the particular disease or data type (unlike, e.g., [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] considering only cardiological diseases or
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] considering only tabular and time series data)
• The methodology of information extraction and analysis differs significantly from all papers
related to similar topics, which allows us to obtain valuable insights and propose original
recommendations to the medical XAI community
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related literature</title>
      <sec id="sec-2-1">
        <title>2.1. Explainable AI</title>
        <p>
          XAI techniques gain insight into an AI model’s decision-making process in several ways, which
can be divided into two groups: post-hoc explanations aiming to explain the decisions of “black
box” models and ante-hoc interpretability utilizing transparent models to obtain explanations [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
Each approach provides a different pathway to achieve transparency and understanding in AI models,
catering to the nuanced requirements of various applications.
        </p>
        <p>
          Post-hoc explanation methods are designed to elucidate the decision-making of already trained
predictive models. This category includes tools like SHAP (SHapley Additive exPlanations) [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], LIME
(Local Interpretable Model-agnostic Explanations) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], and GRAD-CAM (Gradient-weighted Class
Activation Mapping) [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. SHAP and LIME offer insights into the contribution of each feature to the
prediction of a model on a case-by-case basis, allowing users to understand the reasoning behind
specific decisions. GRAD-CAM, on the other hand, provides visual explanations for decisions made by
convolutional neural networks, particularly useful in tasks like image classification, by highlighting the
areas of the image most influential to the model’s prediction. These post-hoc methods are invaluable for
dissecting complex models after training, offering a lens through which the inner workings of opaque
models can be examined and interpreted.
        </p>
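        <p>
          For illustration only, a minimal sketch of generating a post-hoc, feature-importance explanation with SHAP is shown below; the dataset and model are placeholders and are not taken from any of the surveyed papers.
        </p>
        <preformat>
# Minimal sketch: per-feature SHAP contributions for a single prediction
# (placeholder dataset and model, for illustration only).
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)            # explainer specialised for tree ensembles
shap_values = explainer.shap_values(X.iloc[:1])  # contributions for the first case only
for name, value in zip(X.columns, shap_values[0]):
    print(f"{name}: {value:+.3f}")               # signed contribution to this prediction
        </preformat>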
        <p>
          Ante-hoc interpretability involves integrating interpretability by design directly into the architecture
of the predictive model. This can be achieved through models inherently interpretable, such as decision
trees [
          <xref ref-type="bibr" rid="ref23 ref24">23, 24</xref>
          ], or by incorporating interpretability constraints such as sparsity [25] or attention
mechanisms [26] within the training process. Decision trees provide a straightforward, rule-based framework
for decision-making, making their logic easily traceable. Attention mechanisms, commonly used in
neural network architectures, highlight the importance of different parts of input data, such as words in
a sentence or pixels in an image, directly linking model outputs to specific input features. By embedding
interpretability into the model design, ante-hoc methods ensure transparency and ease of understanding
from the outset, making these models particularly suited for applications where explainability is as
critical as accuracy.
        </p>
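        <p>
          As a complementary illustration (again with a placeholder dataset, not one from the surveyed papers), a shallow decision tree is an example of ante-hoc interpretability: its learned rules can be printed and read directly.
        </p>
        <preformat>
# Minimal sketch of ante-hoc interpretability: a shallow decision tree whose rules
# are directly readable (placeholder data; depth limited for readability).
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(export_text(model, feature_names=list(X.columns)))  # human-readable decision rules
        </preformat>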
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Existing surveys about explainable AI</title>
        <p>
          The growing demand for explainability in various fields yields a vast number of papers related to novel
XAI techniques and their application. This results in many surveys about XAI in general, covering
many facets of XAI such as problem definitions, goals, etc. [27, 28, 29, 30]. [31] discussed various aspects
related to the deployment of XAI-based models. [32] concentrated on the diverse existing literature on
counterfactuals and causability in XAI. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] presented a field guide for AI novices to effectively use XAI
techniques.
        </p>
        <p>
          The importance of explainability and interpretability in AI-driven decision-making in medicine also
results in numerous corresponding surveys [
          <xref ref-type="bibr" rid="ref5">5, 33</xref>
          ]. [34] discussed recent trends in medical diagnosis
and surgical applications using XAI. [35] identified nine different types of interpretability methods
that have been used for understanding deep learning models for medical image analysis applications
based on the type of generated explanations and technical similarities. [36] discussed the problem of
the trustworthiness of AI applications in medicine. The systematic reviews performed in [37] resulted
in structured recommendations for XAI applications in the medical domain.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Evaluation of explainable AI</title>
        <p>
          While established metrics are available to assess the performance of predictive models, a consensus on
an evaluation framework for XAI remains elusive. This gap underscores the complexity of quantifying
the effectiveness of explanations in AI systems [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] highlighted such important directions of
XAI evaluation as the goodness of explanations, user satisfaction, trust, and mental model, the role of
curiosity in the search for explanations, and the performance of collaboration between the user and the XAI
system. [38] identified 12 conceptual properties of XAI evaluation, performed a structured survey,
and analyzed what properties are scored more frequently in the papers engaging XAI evaluation. [39]
proposed to adapt the user-centric evaluation framework used in recommender systems to contribute
to the human-centered standardization of XAI evaluation. [40] presented a taxonomy that provides
guidance for researchers and practitioners on the design and execution of XAI evaluations. [41] showed
that the XAI evaluation setup must be based on the real decision task rather than on artificial proxy
tasks. [42] discussed that, apart from collecting feedback from experts, there are still several ways to
evaluate the XAI techniques objectively in a user-agnostic manner. [43] summarized a human-centered
demand framework to categorize different groups of XAI users into five key roles with specific demands
by reviewing existing research and then extracting six commonly used human-centered XAI evaluation
measures that help validate the effect of XAI. [44] proposed a set of recommendations on designing user
evaluations in the field of XAI and performed an extensive user evaluation of the effects of rule-based
and example-based contrastive explanations. There are also numerous surveys aimed at collecting
existing practices in XAI evaluation from different points of view [
          <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15">12, 13, 14, 15</xref>
          ].
        </p>
        <p>XAI evaluation in medicine has also been studied from different perspectives [45]. [46] proposed to
design and evaluate clinical XAI systems based on five principal guidelines: understandability, clinical
relevance, truthfulness, informative plausibility, and computational efficiency. [47] delved into the
critical consideration of various XAI failures and their potential implications on decision-making for
individual patients. [48] discussed a use case of a decision support system integrated into a platform for
healthcare professionals and demonstrated the importance of human-centered evaluation methods and
potential struggles with mixed methods, as detected by differences between qualitative and quantitative
approaches.</p>
        <p>
          There are also several existing surveys dedicated to collecting and analyzing the ways of XAI
evaluation in medicine. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] analyzed both human-dependent and human-agnostic methods of XAI
technique evaluation. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] reviewed the existing XAI evaluation practices in cardiology. [49] studied
essential properties of XAI and an evaluation of explanation effectiveness in the healthcare field in a
systematic review setup, which resulted in a very limited number of papers (only six were
analyzed). [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] performed a survey analyzing different aspects of XAI evaluation by clinicians, but it was
dedicated only to tabular and time series data. In our paper, we perform a structured survey of the
papers engaging the evaluation of XAI by medical practitioners, not limiting the scope of the papers to
any specific data type or disease. Moreover, the methodology of feature extraction and analysis of the
retrieved papers differs from the aforementioned papers in many regards (e.g., certain parts of the XAI
taxonomy were inherited from [38], more details are available in Section 3), which allows us to obtain
valuable insights and propose original recommendations to the medical XAI community.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Survey methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Literature search</title>
        <p>We conduct the survey following the standards described in the Preferred Reporting Items for
Systematic Reviews and Meta-Analyses guidelines for Scoping Reviews (PRISMA-ScR) [50]. We used the
following scientific paper databases: Scopus [51], Web of Science [52], IEEE Xplore [53], ACM [54],
and PubMed [55].</p>
        <p>We aim to find papers dedicated to the application of XAI techniques to the medical domain that
involve the evaluation of the XAI technology by medical practitioners. To find such papers, we run
the database search by the papers’ abstracts using the following query: (XAI OR (explain* OR
interpret*) AND (artificial AND intelligence OR (deep OR machine) AND learning
OR neural AND network*)) AND ((*medic* OR health* OR clinic* OR hospital) OR
(disease OR ill* OR disorder OR infection OR sick*)) AND (user* OR expert* OR
practitioner* OR physician*) AND (eval* OR asses* OR validat*). The “*” symbol
(also referred to as a “wildcard” operator) is used to substitute for one or more unknown characters in
a search term.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Study selection</title>
        <p>We define the following inclusion and exclusion criteria for the collected papers. We consider papers
written in English that either introduce the original XAI technique or use existing ones in the medical
domain and evaluate the performance of the XAI with medical professionals. We disregard a paper if it is
not related to medicine, if it does not use an XAI technique, or if the XAI technique is used without
verification of its effectiveness by medical professionals.</p>
        <p>
          The time range of the paper selection is between 2018 and 2023, inclusive. We decided to use 2018
as a starting year because, from numerous studies, it was clear that the trend of XAI research started in
2018 [
          <xref ref-type="bibr" rid="ref5">5, 38, 56</xref>
          ], which is probably a consequence of the 2016 DARPA XAI challenge [57]. 2023 was the
last year fully available for the literature extraction (we queried the databases in February 2024).
        </p>
        <p>The papers retrieved from the database using the aforementioned query were screened for inclusion
by title and abstract by two authors of the paper (one with experience in XAI, the other in medicine). The
first 300 titles and abstracts were reviewed by both authors, and the resulting Cohen’s Kappa score for
inter-annotator agreement was 0.81. The remaining papers were divided equally and screened independently. In
cases of doubt, the inclusion or exclusion of certain papers was discussed collectively. The full content of
the papers considered after screening was analyzed and, if considered relevant to the inclusion criteria,
the paper was included in the final scope of the survey’s papers. Figure 1 shows the literature selection
process corresponding to the aforementioned setup.</p>
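        <p>
          As a small illustrative sketch (the include/exclude labels below are hypothetical), the inter-annotator agreement on the jointly screened titles and abstracts can be computed as follows:
        </p>
        <preformat>
# Sketch of the inter-annotator agreement check (hypothetical screening labels;
# 1 = include, 0 = exclude). The paper reports kappa = 0.81 on the first 300 items.
from sklearn.metrics import cohen_kappa_score

annotator_xai = [1, 0, 0, 1, 0, 1, 0, 0]   # decisions of the author with XAI expertise
annotator_med = [1, 0, 1, 1, 0, 1, 0, 0]   # decisions of the author with medical expertise

print(f"Cohen's kappa: {cohen_kappa_score(annotator_xai, annotator_med):.2f}")
        </preformat>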
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Data synthesis and summarization</title>
        <p>We extract the following data from the papers considered for analysis (a schematic record of these fields is sketched after the list):
• General information: Year of publication, Medical domain: disease or specific task
• AI model information: Type of predictive model (Deep Neural Network, Bayesian or
Hierarchical Network, Support Vector Machine, Tree Ensemble, other), Type of data (graph, image, tabular,
text, time series, user-item matrix, video, other), Type of AI model task (classification, regression,
policy learning, other) [38]
• XAI technique information: Type of method used to explain (post-hoc explanation, built-in or
ante-hoc interpretability), Type of explanation (decision tree, feature importance, heatmap, text,
etc), Type of XAI task (model explanation, model inspection, outcome explanation, transparent
box design) [38]
• Medical professionals evaluation design: Number of participants, Number of cases shown to
each participant, Place of study (online, offline), Remuneration, Type of evaluation task (proxy or
real), Is the XAI technique interface clearly defined?, Is the user study interface clearly defined?,
Was evaluation pre-registered (e.g., with AsPredicted service [58])?, Experimental design (between,
within, mixed) [59], General aim of the evaluation (Compare original XAI technique with existing
ones, Compare different existing XAI techniques on a certain task, Compare explained and
non-explained AI techniques, Compare the performance of medical professionals completing tasks
on their own with the performance aided by XAI, Direct assessment of XAI technique without
alternatives), Explanation effect (positive, mixed, non-positive) [59], Statistical verification of
evaluation results
• Information about each task delegated to medical professionals: Formulation of the task
(e.g., the text of the question), Way of answer collection (e.g., Likert scale or observing the
accuracy of diagnosis), Task frequency (e.g., after each case, after all cases)</p>
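        <p>
          For illustration only, the extracted fields could be organized into a record similar to the sketch below; the field names and the example values are ours and do not come from any surveyed paper.
        </p>
        <preformat>
# Hypothetical record structure for one surveyed paper (illustrative field names/values).
from dataclasses import dataclass, field
from typing import List

@dataclass
class SurveyedPaper:
    year: int                        # General information
    medical_domain: str
    model_type: str                  # AI model information
    data_type: str
    ai_task: str
    xai_method_type: str             # "post-hoc" or "ante-hoc"
    explanation_type: str            # e.g., heatmap, feature importance, text
    n_participants: int              # Evaluation design
    n_cases_per_participant: int
    evaluation_task: str             # "proxy" or "real"
    experimental_design: str         # "between", "within", or "mixed"
    statistical_verification: bool
    tasks: List[dict] = field(default_factory=list)  # formulation, answer collection, frequency

example = SurveyedPaper(
    year=2022, medical_domain="diabetic retinopathy", model_type="Deep Neural Network",
    data_type="image", ai_task="classification", xai_method_type="post-hoc",
    explanation_type="heatmap", n_participants=8, n_cases_per_participant=10,
    evaluation_task="real", experimental_design="within", statistical_verification=False,
)
        </preformat>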
        <sec id="sec-3-3-1">
          <title>AI model information and XAI technique information are re-used directly from [38] as far as they</title>
          <p>provide a comprehensive understanding of the AI models and XAI techniques used in each paper.</p>
          <p>Medical professionals’ evaluation design contains the core information representing the main
peculiarities of the evaluation studies. While most features in this group are self-explanatory, others may deserve
additional clarification. We consider two principal types of evaluation tasks: proxy and real. The real
task assumes that a medical professional makes a decision on certain patient data relying on the assistance of
the XAI technique (or another technique proposed in the experiment). A proxy task does not require
a medical professional to make a decision but asks them to assess the perceived quality of the system’s
performance from different perspectives. There are three common experimental design setups when
conducting user evaluation: between-subjects (or between-groups) designs, within-subjects designs, and
mixed designs that combine elements of both. The between-subjects setup assumes that one subject (i.e.,
participant) is only exposed to one condition of the experiment. In the within-subjects setup, each
participant sequentially passes through all conditions. The explanation effect may generally be classified as
positive, mixed, or non-positive, and a certain value of this parameter is extracted from the self-reported
analysis of the performed XAI evaluation experiments in the studied papers.</p>
          <p>Information about each task delegated to medical professionals involves individual analysis of each task
delegated to the medical professionals in terms of an evaluation study. Normally, such tasks are either
answering questions about the perceived quality of the XAI technique or deciding a certain patient’s
case with the assistance of the XAI technique. In most cases, the formulation of the task corresponds
to the formulation of the question asked to a participant. The way of answer collection may be either
direct collecting of the responses to the questions or observation of the participant’s performance by
checking the quality of performing tasks or analyzing the logs of interaction with the proposed XAI
system. Task frequency analyzes the number of times certain tasks are demonstrated to the participants,
which in most cases may be either after each demonstrated case or after using a system with all cases.</p>
          <p>
            During the analysis of each task, we also aim to assign the task to one of the groups of XAI quality
properties: Trust [
            <xref ref-type="bibr" rid="ref10 ref16">10, 16, 60, 49</xref>
            ], Mental model (how users understand a system) [
            <xref ref-type="bibr" rid="ref10">10, 60</xref>
            ], Agreement with
AI decision, Human-AI performance [
            <xref ref-type="bibr" rid="ref13">60, 49, 13</xref>
            ], Plausibility of explanation, Clinical relevance [
            <xref ref-type="bibr" rid="ref13 ref17">17, 13</xref>
            ],
Usefulness [
            <xref ref-type="bibr" rid="ref16">60, 16, 39</xref>
            ], User confidence in explanation [38, 61], Complexity [
            <xref ref-type="bibr" rid="ref17">17, 38, 39</xref>
            ], Correctness [38,
39], Completeness [
            <xref ref-type="bibr" rid="ref14 ref17">17, 38, 14</xref>
            ], Satisfaction [
            <xref ref-type="bibr" rid="ref13">60, 49, 39, 13</xref>
            ]
          </p>
          <p>
            This list of the XAI perceived quality properties is not comprehensive. It is formed dynamically
during the analysis of the survey papers, relying both on the existing literature on XAI (the cases with
the corresponding citations) and on the questions that are not frequently discussed in such literature
but naturally form question clusters of considerable size (e.g., agreement with AI decision). We
acknowledge the existence of some known taxonomies of XAI quality properties, such as a fundamental
list proposed in [
            <xref ref-type="bibr" rid="ref10">10</xref>
             ] (trust, goodness, etc.) and the 12-Co list proposed in [38]; however, for the aims
of this survey we found it more useful to create a list that corresponds to the XAI evaluation
properties actually occurring in the papers included in the survey.
          </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Literature selection</title>
        <p>The keyword-based query to the databases returned 5295 papers. After deduplication, we drop 1251
papers and screen the titles and abstracts of the remaining 4044. This results in the exclusion of 3942 papers
due to inconsistency with the stated inclusion and exclusion criteria (see Section 3.2). We read the full
text of 102 papers and exclude 60 more from the survey, as they also did not meet the criteria.
Thus, we consider 42 papers for further analysis. Refer to Figure 1 for the literature selection
diagram.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. General statistics of AI and XAI techniques</title>
        <p>The table with the details of the analyzed papers is available in our repository.1 We present the main statistics
corresponding to publication year, AI model, and XAI techniques in Figure 2. Figure 2a confirms the
idea that the trend in XAI research started in 2018 and within the last 4 years (2020-2023), interest in
XAI applications in the medical domain has been growing. Figure 2b shows that images, tabular data,
and time series are frequently used for XAI applications. Deep Neural Networks are the most frequent
choice of the models to be explained (see Figure 2c), which is natural due to the large share of image data
indicated in Figure 2b and also to the general interest in Neural Networks and, in most cases, their
better performance than tree ensembles or other AI model types. Figure 2d shows that XAI techniques
are most frequently used for classification tasks. In some cases, classification is accompanied by a
segmentation task [62, 63]. The combination of DNNs and image data classification naturally yields frequent
usage of post-hoc explanations generating heatmaps (e.g., GRAD-CAM) [64, 65]. Even though post-hoc
methods seem more popular, ante-hoc methods are also used quite often (e.g., DNN architectures that
combine classification and segmentation prediction [62, 63]). Apart from explanations in the form of
heatmaps, textual and feature importance explanations are also used quite often, especially with tabular
data [66, 67]. In this survey, all papers’ XAI tasks were outcome explanations, which is natural in terms
of the scope of the studied papers (i.e. when the medical professional makes the XAI-assisted decision,
it is normally related to a certain case rather than, e.g., the whole model).
1 https://gitlab.nl4xai.eu/nikolay.babakov/xai_evaluation_survey</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Statistics of evaluation studies</title>
        <p>Figure 3 shows the general statistics of evaluation studies performed in the papers included in the
survey. From Figures 3a,b we can see that most evaluation studies are prone to minimizing the number
of participants and total cases shown to them: most papers involve up to 10 specialists, each of whom is
shown up to 10 cases. Figure 3c shows that the most prevalent evaluation setup is
within-subjects, which assumes demonstrating all conditions (e.g., all XAI techniques) to each participant.
Overall, these insights correspond to the general knowledge of the limited availability of medical
professionals due to their high workload.</p>
        <p>Figures 3d-i analyze the clarity of the evaluation study setup reported in the papers. Figure 3d shows
that in the majority of cases, the interface of the explanation demonstrated to the medical practitioners
is clearly defined, normally by means of screenshots (37 papers), but sometimes it is described either
textually or shown only partially (2 papers for each case). However, the exact interface of human
studies is normally described just textually (21 papers). Figure 3f shows that in most papers, the place of
evaluation study performance is unclear (29 papers), whereas online and offline studies take place with
almost equal frequency (6 and 8 papers, respectively). Figure 3g shows that involving real decisions
about certain patient data cases occurs less frequently (18 papers) than evaluation by proxy tasks, such
as asking the participants about the perceived quality of the explanation (24 papers). Figure 3h shows
that in most cases, the remuneration of participants is not discussed directly (38 papers). In the rare cases
when it is discussed, it may be clearly volunteer participation (2 papers), co-authorship in the paper, or a
clearly specified sum (one paper each). Figure 3i shows that the practice of
pre-registering evaluation studies seems to be unknown in the medical XAI community, as none of
the papers performed it.</p>
        <p>Figures 3j,k,l show the core properties of the analyzed evaluation studies. Figure 3j shows that
many papers (16) analyze the XAI technique in isolation without having any alternatives (e.g., other
multiple XAI techniques). However, alternative evaluation options, even though they are used a bit less
frequently, also take place (e.g., a comparison of the performance of medical professionals completing
tasks on their own with the performance aided by XAI is performed in 14 papers, and a comparison of
the original XAI technique with existing ones is performed in 10). Figure 3k shows the self-reported
effect of the application of XAI, and in most cases, it is either positive (21 papers) or mixed (14 papers).
Finally, Figure 3l shows that in most cases, the statistical verification of evaluation study results is not
performed (in 29 papers).</p>
        <p>Figure 4 relates the self-reported effect of the application of XAI technologies to
certain details of the evaluation setup. Figure 4a shows that assessing XAI technologies alone results in
positive evaluation results in almost half of the cases (47%). Comparison of human-only and
human-XAI performance evenly results in positive and mixed effects, and comparison of the original XAI
technique in most cases is reported to have a positive effect (78%). Figure 4b visualizes the relation
between the number of participants and the reported effect of the XAI application. The tendency to report a
positive effect is prevalent for all numbers of participants (i.e., for three of the four groups it is more
than 50%). Figure 4c shows that an increase in the number of cases demonstrated to each participant
increases the likelihood of a positive effect of XAI being reported by the authors, as it increases
from 38% to 56%. Still, if the number of cases demonstrated to each evaluation study participant is
unclear, the results are reported as positive with the highest frequency (62%). Figure 4d shows that
evaluation studies without statistical analysis are more likely to report positive effects of XAI technique
application than those with statistical analysis (54% and 46%, respectively).</p>
        <p>Figure 5 shows the statistics of the tasks the evaluation studies consist of. In particular, in Figures 5a,b,c
we can see that in most cases, the evaluation studies involve a small number of questions (up to 5)
corresponding to each patient’s case, asked using the Likert scale. Another relatively frequent choice
of task answer collection is observation (17% of questions), which implies collecting the information
through observing the participant performing the task (e.g., comparing the participant’s performance
with and without the assistance of XAI [68, 69, 70]). Whereas tasks are in most cases shown after each
particular case, sometimes (23% of questions) they are shown after all cases. Figures 5d,e show the typical
frequency of the questions depending on whether the questions are asked after each demonstrated
case or after all cases. Figure 5f shows the distribution of the number of answer options for Likert scale-based
questions. From these three Figures, we can see that in most studies, the participants get a relatively
small number of questions, and if the questions are based on a Likert scale, the most typical number of
options is 5.</p>
        <p>We also analyze the exact properties of XAI technology perception by medical professionals included
in the evaluation. For further analysis, we rely on the taxonomy discussed in Section 3.3; that is why,
first of all, we aim to analyze the coverage of this taxonomy w.r.t. the properties occurring in the papers
(i.e., how many questions used in the papers’ evaluation study can be related to one of the properties
within this taxonomy and how many questions fall out of it). Figure 6 shows that in most cases (21
papers) 100% of questions included in evaluation studies of the papers included in our survey can
be described with the proposed taxonomy. Note that we consider only the papers that provide clear
questions. This means that the papers using, e.g., unstructured interviews of medical professionals are
not considered in this analysis.</p>
        <p>Figure 7 shows the detailed statistics of questions used in the evaluation studies w.r.t. our taxonomy
of XAI perception properties. Figure 7a analyzes the number of evaluated properties per paper. Most
papers evaluate from one to three properties. Figure 7b shows the statistics of the number of questions
related to a similar XAI perception property asked within one evaluation study. In particular, this plot
aims to show how often the medical practitioners participating in evaluation studies have to answer
very similar or almost duplicated questions. The leftmost bar shows that in most cases, the duplicates do
not take place, i.e., there is one question related to a certain property. However, in some cases, similar
questions occur in the studies, resulting in two to four similar questions, which is particularly visible
for usefulness (in the bars corresponding to two and four similar questions). Finally, Figure 7c shows
that the most frequently evaluated properties are usefulness (14 papers), human-AI performance (14
papers), the clinical relevance of the provided explanation (13 papers), and mental model (11 papers).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In this section, we discuss the general insights obtained from the studied papers.</p>
      <sec id="sec-5-1">
        <title>5.1. Insufficient details of evaluation study setups</title>
        <p>Our analysis in Figure 3 shows that most researchers endeavor to conscientiously report on the number
of participants and the explanation interface utilized in their studies. This practice facilitates a basic
understanding of the user study framework and aids in assessing the study’s scope and relevance to
real-world applications. Despite this positive trend, a recurring issue is the lack of comprehensive
details that are crucial for fully comprehending the user studies’ methodology and, if needed, replicating
them. Notably, the user study interface, particularly the description of tasks assigned to participants, is
often articulated merely in verbal terms without any visual or demonstrative supplementation. This
absence of a concrete demonstration may obscure the study’s operational dynamics and the participant’s
interaction with the XAI system.</p>
        <p>Furthermore, some studies occasionally omit the exact number of cases reviewed by each participant,
a detail that is pivotal for evaluating the study’s robustness and the reliability of its findings [63, 71].
Equally concerning is that some studies inadvertently impose a heavy burden on participants, presenting
them with a rather large number of cases (e.g., over fifty patient cases). This could potentially lead to
fatigue, affecting the quality of the feedback or decisions made by the medical practitioners involved [70,
72, 73]. Additionally, the rarity of testing the statistical significance of the results in these studies casts
doubt on their validity and the generalizability of the conclusions drawn. Moreover, we also find that
more than half of the evaluation studies do not involve making real decisions about the patient data,
but instead rely only on proxy tasks (such as collecting the subjective feedback of medical practitioners
about the explanation quality). Such tasks are valuable, but they do not always accurately reflect the
complexities and pressures inherent in real clinical decision-making processes.</p>
        <p>Moreover, finer details such as the place of the user study (whether conducted online or offline) and
participant remuneration are seldom disclosed, despite their potential impact on participant engagement,
fairness, and the overall outcome of the study. The practice of pre-registration of evaluation studies
(with online platforms like AsPredicted [58]), which can lend additional credibility and transparency
to the research, appears to be largely overlooked in the field. This oversight suggests a gap in the
adherence to rigorous scientific protocols, which, if addressed, could significantly enhance the quality
and reproducibility of research in XAI applications in healthcare.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Poorly justified positive results</title>
        <p>We also find a predominant trend among researchers to report positive outcomes from the application of
XAI technologies in medical tasks (Figure 3k). While these optimistic conclusions might initially seem
encouraging, they underscore the necessity for rigorous validation. Specifically, a common approach
in evaluation studies is to assess the effectiveness of an XAI technology in isolation rather than in
comparison with other XAI solutions or even against the baseline performance of medical practitioners
without XAI assistance (Figure 3j). This methodology, despite yielding favorable results in nearly half of
the cases (Figure 4a), raises concerns regarding the reliability of such outcomes. Without a benchmark or
competitive framework, gauging the true impact and advancement offered by a specific XAI application
becomes challenging, potentially overestimating its value in the absence of a comparative analysis.</p>
        <p>Moreover, the rarity of testing for statistical significance further complicates the assessment of
these studies’ reliability. In instances where feedback from medical practitioners is garnered through
unstructured or semi-structured formats, such as interviews [64, 74] or group discussions [75], the
feasibility of applying statistical tests to validate the results diminishes. However, the absence of
statistical verification, regardless of the reason, is noteworthy. Our analysis identified that over half
of the studies reporting positive impacts of XAI evaluations failed to substantiate their claims with
statistical significance tests. This gap not only casts doubt on the asserted benefits of XAI technologies
but also highlights a critical need for more stringent methodological standards in research. The practice
of affirming positive findings without robust statistical evidence can lead to a skewed understanding of
XAI’s efficacy, underscoring the imperative for more rigorous and quantitatively verified research to
truly ascertain the effectiveness of XAI interventions in healthcare.</p>
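        <p>
          As an illustration of the missing verification step, the sketch below applies a paired, non-parametric significance test to hypothetical per-clinician diagnostic accuracies collected without and with XAI assistance; the numbers are invented, and the choice of test is only one reasonable option.
        </p>
        <preformat>
# Hypothetical example of statistically verifying a within-subjects XAI evaluation result.
from scipy.stats import wilcoxon

accuracy_without_xai = [0.70, 0.65, 0.80, 0.75, 0.60, 0.72, 0.68, 0.74]  # invented data
accuracy_with_xai    = [0.78, 0.70, 0.82, 0.74, 0.71, 0.80, 0.69, 0.79]

# Paired, non-parametric test suited to small within-subjects samples.
statistic, p_value = wilcoxon(accuracy_with_xai, accuracy_without_xai)
print(f"Wilcoxon statistic={statistic:.2f}, p={p_value:.3f}")
# A positive effect should only be claimed if p falls below a pre-specified threshold (e.g., 0.05).
        </preformat>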
      </sec>
      <sec id="sec-5-3">
        <title>5.3. XAI perception properties</title>
        <p>
          The analysis performed in Figure 7 showcases a broad spectrum of properties related to XAI perception
among medical practitioners addressed in the evaluation studies. While there is a notable inclination
towards assessing universally intuitive properties like usefulness, human-AI performance, clinical
relevance, and mental model or understandability, it is imperative to recognize the importance of a more
extensive array of perception properties. This includes, but is not limited to, the medical practitioner’s
confidence in the decision made with the help of XAI, the complexity of understanding the delivered
explanation, and overall satisfaction with the XAI technology. The existing skew might limit our
understanding of XAI’s multifaceted impact on healthcare professionals. Addressing this issue requires
a more structured approach, potentially through the adoption of existing taxonomies of XAI perception
properties [
          <xref ref-type="bibr" rid="ref10">38, 10</xref>
          ]. These frameworks can guide researchers in comprehensively evaluating XAI tools,
ensuring that a wider range of significant attributes is considered during user studies with medical
practitioners. This not only enriches the depth of the evaluation but also aligns the development of XAI
technologies with the nuanced needs of healthcare settings.
        </p>
        <p>Moreover, while the formulation of questions in most evaluation studies is generally precise—aiming
to avoid redundancy and ensure clarity—instances of question duplication are not unheard of. Figure 7b
shows that in some cases, similar properties may be addressed with two to four different questions.
In some cases, several questions related to a similar property may be justified by the particular setup
of the task. For example, in [75] two statements related to the clinical relevance of the explanations
are shown for scoring: “pertinency of the map with respect to the diagnosis considered correct” and
“coherence between the map and the suggestion provided by the machine”. In [63] the user satisfaction
with different parts of the proposed system is analyzed with two different questions: “I enjoyed being in
the virtual environment?” and “I enjoyed using the COVIR platform?”.</p>
        <p>However, there still are numerous cases when multiple questions about similar properties seem
redundant. For example, [76] asks about the clinical relevance of the explanation with two pretty similar
questions: “The predicted result is reasonable” and “The explanation of the prediction is clear and
reasonable”. The same paper involves two questions about complexity which also seem pretty similar:
“The application is user-friendly and easy to navigate” and “I believe I would require technical assistance
to use this system effectively”. [63] asked participants about the complexity of the system addressing
three questions: “Is it easy to learn to use it?”, “I learned to use it quickly?”, and “I easily remember
how to use it?”.</p>
        <p>The redundancy of these and some other questions not only affects the efficiency of data collection
but also unnecessarily increases the cognitive load on participants. Ensuring the uniqueness and
relevance of each question asked during a user study is crucial for maximizing the utility of the feedback
collected and enhancing the overall efficiency of the study. As such, closer attention to the design of
the evaluation study setup is warranted. By meticulously crafting the questions and avoiding overlaps,
researchers can not only streamline the evaluation process but also foster a more engaging and less
burdensome experience for medical practitioners participating in the study. This careful approach to
study design underscores the importance of thorough preparation in achieving meaningful and reliable
insights into the perception of XAI applications in healthcare.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Recommendations</title>
      <p>Building on the existing recommendations for XAI user evaluation studies [59, 44], on our findings from this survey,
and on our experience in running similar evaluation studies, in this section we provide guidelines for
XAI specialists aiming to evaluate XAI techniques with medical practitioners.</p>
      <sec id="sec-6-1">
        <title>6.1. Before the user study</title>
        <sec id="sec-6-1-1">
          <title>Set a reasonable number of cases and questions</title>
          <p>
            The questions asked to participants (normally related to the subjective perception of XAI) should be
optimally selected. In particular, it is recommended to maximize the range of various XAI perception
properties addressed by these questions, avoiding asking multiple questions related to a similar property.
It may be particularly useful to learn about the commonly known XAI perception properties from the
existing literature [
            <xref ref-type="bibr" rid="ref10 ref13 ref17">38, 10, 39, 17, 13, 40</xref>
            ].
          </p>
          <p>At the same time, the cognitive load on the participant and the potential loss of interest if the number
of tasks included in the evaluation study is unreasonably high must be taken into account. Therefore,
it is necessary to select only those XAI perception properties that seem most relevant to the
application in the particular study and also to include a limited number of patient cases that, at the same time,
cover a maximal range of potential states (e.g., all possible diagnoses within the framework of the
disease being studied are represented evenly in the sample of demonstrated patients).</p>
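          <p>
            As a purely illustrative sketch (the diagnosis labels, case pool, and counts are hypothetical), an even, stratified selection of patient cases per diagnosis could look as follows:
          </p>
          <preformat>
# Illustrative sketch: select a small, evenly stratified set of cases per diagnosis.
import random
from collections import defaultdict

case_pool = [{"case_id": i, "diagnosis": d}                    # hypothetical case pool
             for i, d in enumerate(["benign", "malignant", "uncertain"] * 20)]

def stratified_sample(cases, per_group, seed=42):
    rng = random.Random(seed)                                  # reproducible selection
    by_diagnosis = defaultdict(list)
    for case in cases:
        by_diagnosis[case["diagnosis"]].append(case)
    selected = []
    for group in by_diagnosis.values():
        selected.extend(rng.sample(group, per_group))          # equal count per diagnosis
    return selected

study_cases = stratified_sample(case_pool, per_group=3)        # 9 cases, 3 per diagnosis
print([c["case_id"] for c in study_cases])
          </preformat>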
        </sec>
        <sec id="sec-6-1-2">
          <title>Evaluate objective and subjective properties</title>
          <p>The applicability of XAI in the medical field may equally depend on both subjective and objective
factors (they can also be referred to as self-reported measures and behavioral measures, respectively [44]).
Whereas subjective factors are normally studied by asking questions regarding personal perceptions of
various aspects of XAI quality, the most frequent choice for measuring the objective quality of XAI,
which still requires medical practitioners, is evaluating human-AI performance by asking to diagnose the
patient first without and then with the help of the XAI technique. The cases demonstrated within these
two conditions could be either different or the same. Whereas with different cases (i.e., a medical practitioner
diagnoses one group of patients without the help of XAI and a different group with it), the evaluation study may be
performed within a single session, in the case of the same patients it is recommended to make a break
between evaluation sessions to prevent the doctor from sticking to a previously made decision
about the patient.</p>
          <p>Objective properties may also be human-agnostic, and they are also worth being reported. For
example, [38] discusses such objective properties as compactness (the size of the explanation) and
confidence (the presence and accuracy of probability information in the explanation), and [77] proposes
an objective metric named Degree of Explainability, which is directly proportional to the number of
relevant questions that a piece of information can correctly answer.</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>Balance between real and proxy tasks</title>
          <p>While subjective metrics of XAI quality perception provide valuable information, it should not be
neglected that proxy tasks and subjective measures can be misleading in evaluating XAI systems [41].
Therefore, it is recommended to include in the study design the real action normally performed by
medical practitioners for the task the XAI technology is aimed to assist. In most cases, such a
task is making the diagnosis regarding the particular patient.</p>
        </sec>
        <sec id="sec-6-1-4">
          <title>Select the experimental settings proportional to the number of participants</title>
          <p>It is generally expected that it would be infeasible to engage numerous medical practitioners with
sufficient experience in the specific medical field to participate in the evaluation studies. That is why
the most frequent experimental setting chosen for such an evaluation study is within-subjects, where
each participant sequentially passes through all conditions and provides feedback. For example, in such
a setup, the practitioners may be first exposed to the patients’ cases without explanations and then with
explanations.</p>
          <p>However, if it is possible to engage a number of evaluation study participants sufficient to form at
least two groups of reasonable size, one can use a between-subjects setting in which each subject is only
exposed to one condition, e.g., one group gives feedback about unexplained AI predictions, and another
group gives such feedback about the predictions accompanied by the explanations.</p>
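          <p>
            For illustration only (the participant identifiers and group labels are hypothetical), a reproducible balanced assignment for such a between-subjects setting could be sketched as follows:
          </p>
          <preformat>
# Illustrative balanced random assignment for a between-subjects setting.
import random

participants = [f"clinician_{i:02d}" for i in range(1, 13)]   # hypothetical identifiers
random.seed(42)                                               # fixed seed for reproducibility
random.shuffle(participants)

half = len(participants) // 2
groups = {
    "AI_only": participants[:half],        # sees unexplained AI predictions
    "AI_plus_XAI": participants[half:],    # sees predictions with explanations
}
print(groups)
          </preformat>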
        </sec>
        <sec id="sec-6-1-5">
          <title>Avoid positive bias</title>
          <p>It is necessary to foresee the potential positive bias caused by experimental design. One of the “risky”
experimental designs that is likely to yield positive bias is when the explanation is evaluated in isolation
without any alternatives, such as a demonstration of the patients’ cases without any explanation or with
an explanation generated with alternative XAI techniques. Such a setup may be particularly critical if
the participants are asked about a single XAI technique knowingly developed by the authors of a paper
in the interview format (when the authors are actually the interviewers).</p>
          <p>Overall, it is recommended to measure explanation efects implicitly rather than explicitly. When
participants are not aware of the evaluation’s purpose, their responses may be more genuine. When
measuring understanding or similar constructs, the participant’s explicit focus on the explanations may
cause skewed results not present in a more realistic scenario when a medical practitioner is concentrated
on solving the actual task rather than on assessing the quality of the explanation.</p>
          <p>In general, the phenomenon of positive bias is known in the literature as a significant challenge
across various domains of research, including clinical trials and expert evaluations [78]. Studies have
shown that the lack of blinding and the absence of alternative options can lead to overly optimistic
outcomes due to cognitive biases. For instance, in clinical trials, both patient and expert evaluations
are subject to this bias, highlighting the necessity of implementing blinding techniques to mitigate
its effects [78, 79, 80]. Overall, regardless of the nature of the study (whether it is a real clinical study
involving patients taking some medicines or medical experts assessing the quality of XAI technology),
the objective remains to ensure the integrity and objectivity of research findings. These strategies are
crucial for minimizing performance and detection biases, thereby preventing the overestimation of
treatment effects and maintaining the credibility of scientific inquiry across different fields of study.</p>
        </sec>
        <sec id="sec-6-1-6">
          <title>Prepare a definite plan for the core process of the user study</title>
          <p>While the preparation of the content of the evaluation study is important, the preparation of the
process of the study itself is also crucial. It may slightly vary depending on the place of study.</p>
          <p>In the case of ofline participation, it may be helpful to provide information about meeting points,
things necessary to bring, ways of preparation for the study, and to also send a reminder one day before.
The plans during the evaluation study may also include finer details like where the participants are
supposed to leave their belongings and plans for unexpected situations (e.g., uncooperative participants
or malfunction of the demonstrated system).</p>
          <p>In the case of online participation [81, 82], especially if the link for the study is distributed publicly
(e.g., in professional communities), it may be good to design a basic questionnaire that will ensure
that the participant has the relevant experience for the particular medical task. This may involve asking
about the years of experience dealing with certain diseases and about the standard diagnostic
procedure this practitioner follows for the disease. Moreover, given that it is difficult to foresee
how much time a certain participant is ready to devote to the user study, it is recommended to use
tools that allow for objectively controlling real engagement with the tasks (e.g.,
measuring the time spent on each page).</p>
          <p>If it is initially clear that the study could take a long time, it is also recommended to give an explicit
option to leave the study at a certain moment (e.g., when half of the cases are shown). This may improve
the quality of the collected feedback by avoiding answers collected from participants who
became tired or lost interest in the study.</p>
          <p>For any setup, it is also necessary to prepare a proper consent form for participation in the study that
will thoroughly discuss the whole process of the anticipated study and remind participants of their right to leave
the study at any moment. The tasks and questions delegated to the participants should be clear
and should also avoid inadvertent cues.</p>
        </sec>
        <sec id="sec-6-1-7">
          <title>Run a pilot study with a small group</title>
          <p>Before running the study with all available participants, a good practice could be to engage a
small pilot group of participants ready to test the proposed XAI system in a more thorough way by
not only answering the pre-defined questions and performing fixed tasks but also by participating in
a post-study interview and/or performing the proposed tasks following a think-aloud protocol [83].
This preliminary feedback can potentially improve not only the user study but also the evaluated XAI
technology. However, the participation of such medical experts in the final evaluation studies is not
recommended, because they may have a positive bias toward the system.</p>
        </sec>
        <sec id="sec-6-1-8">
          <title>Pre-register the evaluation study</title>
          <p>Pre-registration of evaluation studies with online platforms like AsPredicted [58] has recently become
a common practice [84]. The pre-registration involves submitting a document detailing the planned
study (e.g., measured variables and hypotheses, data exclusion criteria, and the number of samples
that will be collected) online before initiating the data collection. Pre-registration can serve as evidence
against the findings being a result of selective reporting or p-hacking [85] and can thus strengthen the
credibility of a study.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. During the user study</title>
        <sec id="sec-6-2-1">
          <title>Collect intermediate feedback</title>
          <p>If the evaluation study has been prepared correctly, no specific actions are expected from the study
organizers. However, unexpected problems in the study design may still arise, so it is important to let
participants leave a free-form comment (in a text field or verbally, depending on the study setting). Such
comments should be analyzed quickly so that an error can be corrected promptly or, in the worst case,
the study can be interrupted if the identified error is critical.</p>
        </sec>
        <sec id="sec-6-2-2">
          <title>Exclude irrelevant or poorly-performing participants</title>
          <p>It may also be useful to automatically exclude participants from the study under certain conditions,
for example, inconsistent medical qualifications (e.g., the participant indicates that they never or rarely
work with the studied disease) or an average time spent per page below some manually selected minimal
threshold. Another criterion for automatic termination of participation is failure on attention-checking
tasks: such tasks are relatively easy to answer, so incorrect answers suggest that the participant can no
longer provide valuable responses.</p>
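          <p>A minimal sketch of such automatic exclusion rules is given below (Python/pandas; the column names "participant", "page_seconds", and "attention_correct", as well as the thresholds, are hypothetical and would be chosen per study):</p>
          <preformat>
import pandas as pd

MIN_SECONDS_PER_PAGE = 15        # hypothetical minimal average time per page
MIN_ATTENTION_ACCURACY = 0.8     # hypothetical share of correct attention checks

def participants_to_exclude(log: pd.DataFrame) -> set:
    """Flag participants who rush through pages or fail attention-checking tasks."""
    per_participant = log.groupby("participant").agg(
        mean_seconds=("page_seconds", "mean"),
        attention_accuracy=("attention_correct", "mean"),
    )
    too_fast = per_participant["mean_seconds"].lt(MIN_SECONDS_PER_PAGE)
    inattentive = per_participant["attention_accuracy"].lt(MIN_ATTENTION_ACCURACY)
    return set(per_participant.index[too_fast | inattentive])
          </preformat>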
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. After the user study</title>
        <sec id="sec-6-3-1">
          <title>Filter irrelevant answers</title>
          <p>Each participant’s entry should be analyzed in terms of relevance. If the preliminary information
about the participant’s experience with the studied disease was collected as free-form text, the analysis of
such answers may give serious grounds to believe that the participant has insufficient experience in
diagnosing the disease being studied. Moreover, when answers are collected through multiple Likert
scale-based questions, identical answers to all such questions (i.e., every question on the page receiving
the same response value) may also be a reason for excluding specific responses, or all responses, from a
participant. The excluded answers, however, should be released openly alongside the retained answers to
increase the credibility of the study, making it possible to verify that answers were excluded for
data-cleanliness reasons rather than to fit the results to the initial hypothesis.</p>
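          <p>The straight-lining check described above can be automated; the following sketch (Python/pandas, with hypothetical column names) flags pages on which a participant gave the same Likert value to every question:</p>
          <preformat>
import pandas as pd

def straight_lined_pages(answers: pd.DataFrame) -> pd.DataFrame:
    """Return (participant, page) pairs whose Likert answers are all identical.

    `answers` is expected in long format, one row per answered question, with
    hypothetical columns "participant", "page", and "likert_value".
    """
    n_distinct = answers.groupby(["participant", "page"])["likert_value"].nunique()
    flagged = n_distinct.eq(1)   # only one distinct value used on the whole page
    return flagged[flagged].reset_index()[["participant", "page"]]
          </preformat>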
        </sec>
        <sec id="sec-6-3-2">
          <title>Run the statistical verification of the collected results</title>
          <p>If the evaluation setup involves a comparison of various XAI techniques with each other or a
comparison of human performance with and without XAI aid (which is in general a recommended
setup), the statistical verification of such a comparison is crucial to making a scientifically correct
conclusion based on the results of such a study. Typical solutions for comparing distributions obtained
under different conditions are ANOVA tests and t-tests. The analysis of Likert-scale data may instead rely
on nonparametric tests, such as the paired Wilcoxon signed-rank test or the Kruskal-Wallis H test, to
avoid normality assumptions.</p>
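          <p>As a minimal illustration of these tests (using SciPy and synthetic data that merely stands in for real study results), the comparison of paired conditions and of several independent XAI variants could be run as follows:</p>
          <preformat>
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-clinician accuracy with and without XAI aid (illustrative only).
without_xai = rng.normal(0.70, 0.10, size=30)
with_xai = without_xai + rng.normal(0.05, 0.05, size=30)

print(stats.ttest_rel(with_xai, without_xai))   # paired t-test
print(stats.wilcoxon(with_xai, without_xai))    # nonparametric paired alternative

# Likert ratings of three XAI techniques rated by independent groups.
ratings = [rng.integers(1, 6, size=25) for _ in range(3)]
print(stats.kruskal(*ratings))                  # Kruskal-Wallis H test
          </preformat>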
          <p>Sometimes, the analysis of the data may involve aggregating one XAI perception property obtained
from different sources (e.g., objective and subjective measures of understanding combined into one
distribution). In such cases, it is necessary to verify that there is sufficient agreement between the
aggregated distributions; the validity of such aggregation can be assessed with reliability measures such
as tau-equivalent reliability (also known as Cronbach’s α).</p>
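          <p>A minimal sketch of such a reliability check is shown below: Cronbach’s α computed directly from its definition for a small, purely illustrative participants-by-measures matrix:</p>
          <preformat>
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Tau-equivalent reliability (Cronbach's alpha) for a participants x measures matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Example: objective and subjective understanding scores for five participants.
print(cronbach_alpha(np.array([[4, 5], [3, 3], [5, 4], [2, 2], [4, 4]])))
          </preformat>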
          <p>Moreover, the agreement between the numerous participants providing feedback about similar items
can also be valuable to analyze. It is normally quantified with Cohen’s kappa or Fleiss’ kappa [86].</p>
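          <p>For illustration, inter-rater agreement could be computed as in the following sketch (using scikit-learn for Cohen’s kappa and statsmodels for Fleiss’ kappa; the small rating matrices are purely illustrative):</p>
          <preformat>
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Cohen's kappa: agreement between exactly two raters on the same items.
rater_1 = [1, 0, 1, 1, 0, 1]
rater_2 = [1, 0, 0, 1, 0, 1]
print(cohen_kappa_score(rater_1, rater_2))

# Fleiss' kappa: agreement among many raters (rows = items, columns = raters).
ratings = np.array([[1, 1, 0], [0, 0, 0], [1, 0, 1], [1, 1, 1]])
table, _ = aggregate_raters(ratings)   # convert to item x category counts
print(fleiss_kappa(table))
          </preformat>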
        </sec>
        <sec id="sec-6-3-3">
          <title>Report all essential details of the study</title>
          <p>When preparing the report about the evaluation study, it is important to disclose all details, including,
but not limited to, the ones studied in our survey (see Section 3.3 and Figures 2 and 3): comprehensive
information about the backbone AI model and the corresponding XAI technique, and all details about the
user study, such as the number of participants and their background, the number of tasks delegated to
each participant, the precise formulation of the tasks, etc. The most important findings should also be
reported visually to make the general results of the study easier to grasp.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Drawing from the survey of XAI techniques evaluated by medical practitioners in the medical domain,
this paper has illuminated several critical areas for improvement within current evaluation practices.
The investigation underscored the frequent omission of essential evaluation study details, such as user
study interfaces, locations, and participant remuneration, in published reports. A significant finding of
our survey is the pervasive lack of statistical significance tests in the evaluation of XAI applications,
which has contributed to unwarranted optimism about the efficacy of XAI in medical contexts. Moreover,
the analysis highlighted the tendency to evaluate XAI systems in isolation without benchmarking them
against other methods or technologies. This approach risks overstating the utility and clinical relevance
of XAI without a thorough comparative analysis.</p>
      <p>A notable concern identified is the skewed focus on certain perception attributes of XAI, such as its
supposed usefulness and performance in conjunction with human operators, at the expense of exploring
a broader spectrum of important attributes. Additionally, the survey pointed out the issue of poorly
constructed queries for study participants, which complicates the evaluation process with redundant or
overly similar questions.</p>
      <p>In response to these findings, the paper presents a set of actionable recommendations for future
researchers. These recommendations emphasize the importance of comprehensive and transparent
reporting of evaluation studies, the necessity for statistical rigor in assessing the results, and the value of
comparative analysis to truly gauge the effectiveness of XAI systems. Furthermore, the recommendations
advocate for a more balanced evaluation of XAI attributes and call for clearer, more distinct questioning
techniques in participant studies.</p>
      <p>The ultimate goal of these recommendations is to refine the approach to evaluating XAI applications
in healthcare, ensuring that future studies can provide a more accurate, meaningful assessment of
these technologies’ real-world utility. By addressing the identified deficiencies, the research community
can move toward a more nuanced and effective integration of XAI in medical practice, enhancing the
potential for these technologies to contribute positively to patient care and medical outcomes.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This paper is part of the R+D+i project TED2021-130295B-C33, funded by
MCIN/AEI/10.13039/501100011033/ and by the “European Union NextGenerationEU/PRTR”.
This research also contributes to the projects PID2020-112623GB-I00 and PID2023-149549NB-I00
funded by MCIN/AEI/10.13039/501100011033/ and by ERDF A way of making Europe. The support of
the Galician Ministry for Education, Universities and Professional Training and the "ERDF A way of
making Europe" is also acknowledged through grants "Centro de investigación de Galicia accreditation
2024-2027 ED431G-2023/04" and "Reference Competitive Group accreditation 2022-2025 ED431C
2022/19"</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-5 to check grammar and spelling.</p>
    </sec>
    <sec id="sec-10">
      <title>References</title>
      <p>
[25] L. Schneider, B. Bischl, J. Thomas, Multi-objective optimization of performance and interpretability
of tabular supervised machine learning models, Proceedings of the Genetic and Evolutionary
Computation Conference (2023). doi:10.1145/3583131.3590380.
[26] M. Tutek, J. Šnajder, Toward practical usage of the attention mechanism as a tool for interpretability, IEEE Access 10 (2022) 47011–47030.
[27] S. Ali, T. Abuhmed, S. El-Sappagh, K. Muhammad, J. M. Alonso-Moral, R. Confalonieri, R. Guidotti,
J. Del Ser, N. Díaz-Rodríguez, F. Herrera, Explainable artificial intelligence (XAI): What we know
and what is left to attain trustworthy artificial intelligence, Information Fusion 99 (2023) 101805.
doi:https://doi.org/10.1016/j.inffus.2023.101805.
[28] V. Hassija, V. Chamola, A. Mahapatra, A. Singal, D. Goel, K. Huang, S. Scardapane, I. Spinelli,
M. Mahmud, A. Hussain, Interpreting black-box models: a review on explainable artificial
intelligence, Cognitive Computation 16 (2024) 45–74.
[29] W. Saeed, C. Omlin, Explainable AI (XAI): A systematic meta-survey of current challenges and
future opportunities, Knowledge-Based Systems 263 (2023) 110273.
[30] A. Saranya, R. Subhashini, A systematic review of explainable artificial intelligence models and
applications: Recent developments and future trends, Decision analytics journal (2023) 100230.
[31] B. Li, P. Qi, B. Liu, S. Di, J. Liu, J. Pei, J. Yi, B. Zhou, Trustworthy AI: From principles to practices, ACM Computing Surveys 55 (2023) 1–46.
[32] Y.-L. Chou, C. Moreira, P. Bruza, C. Ouyang, J. Jorge, Counterfactuals and causability in explainable
artificial intelligence: Theory, algorithms, and applications, Information Fusion 81 (2022) 59–83.
doi:https://doi.org/10.1016/j.inffus.2021.11.003.
[33] R. González-Alday, E. García-Cuesta, C. A. Kulikowski, V. Maojo, A scoping review on the progress,
applicability, and future of explainable artificial intelligence in medicine, Applied Sciences 13
(2023) 10778.
[34] Y. Zhang, Y. Weng, J. Lund, Applications of explainable artificial intelligence in diagnosis and
surgery, Diagnostics 12 (2022) 237.
[35] Z. Salahuddin, H. C. Woodruff, A. Chatterjee, P. Lambin, Transparency of deep neural networks
for medical image analysis: A review of interpretability methods, Computers in Biology and
Medicine 140 (2022) 105111. doi:https://doi.org/10.1016/j.compbiomed.2021.105111.
[36] M. Kim, H. Sohn, S. Choi, S. Kim, Requirements for trustworthy artificial intelligence and its
application in healthcare, Healthcare Informatics Research 29 (2023) 315.
[37] H. Chen, C. Gomez, C.-M. Huang, M. Unberath, Explainable medical imaging AI needs
human-centered design: guidelines and evidence from a systematic review, NPJ digital medicine 5 (2022)
156.
[38] M. Nauta, J. Trienes, S. Pathak, E. Nguyen, M. Peters, Y. Schmitt, J. Schlötterer, M. van Keulen,
C. Seifert, From anecdotal evidence to quantitative evaluation methods: A systematic review on
evaluating explainable AI, ACM Computing Surveys 55 (2023) 1–42.
[39] I. Donoso-Guzmán, J. Ooge, D. Parra, K. Verbert, Towards a comprehensive human-centred
evaluation framework for explainable AI, in: World Conference on Explainable Artificial Intelligence,
Springer, 2023, pp. 183–204.
[40] M. Chromik, M. Schuessler, A taxonomy for human subject evaluation of black-box explanations
in XAI., Exss-atec@ iui 1 (2020).
[41] Z. Buçinca, P. Lin, K. Z. Gajos, E. L. Glassman, Proxy tasks and subjective measures can be
misleading in evaluating explainable AI systems, in: Proceedings of the 25th international
conference on intelligent user interfaces, 2020, pp. 454–464.
[42] A. Rosenfeld, Better metrics for evaluating explainable artificial intelligence, in: Proceedings of
the 20th international conference on autonomous agents and multiagent systems, 2021, pp. 45–50.
[43] X. Kong, S. Liu, L. Zhu, Toward human-centered XAI in practice: A survey, Machine Intelligence Research (2024) 1–31.
[44] J. van der Waa, E. Nieuwburg, A. Cremers, M. Neerincx, Evaluating XAI: A comparison of
rule-based and example-based explanations, Artificial Intelligence 291 (2021) 103404. doi: https:
//doi.org/10.1016/j.artint.2020.103404.
[45] G. Yang, Q. Ye, J. Xia, Unbox the black-box for the medical explainable AI via multi-modal and
multi-centre data fusion: A mini-review, two showcases and beyond, Information Fusion 77 (2022)
29–52.
[46] W. Jin, X. Li, M. Fatehi, G. Hamarneh, Guidelines and evaluation of clinical explainable AI in
medical image analysis, Medical Image Analysis 84 (2023) 102684. doi:https://doi.org/10.
1016/j.media.2022.102684.
[47] M. Ghassemi, L. Oakden-Rayner, A. L. Beam, The false hope of current approaches to explainable
artificial intelligence in health care, The Lancet Digital Health 3 (2021) e745–e750.
[48] K. S. Kacafírková, S. Polak, M. S. Smitt, S. A. Elprama, A. Jacobs, Trustworthy enough? evaluation
of an AI decision support system for healthcare professionals, in: The World Conference on
eXplainable Artificial Intelligence, CEUR Workshop Proceedings, 2023, pp. 7–11.
[49] J. Jung, H. Lee, H. Jung, H. Kim, Essential properties and explanation effectiveness of explainable
artificial intelligence in healthcare: A systematic review, Heliyon 9 (2023) e16110. doi: https:
//doi.org/10.1016/j.heliyon.2023.e16110.
[50] A. C. Tricco, E. Lillie, W. Zarin, K. K. O’Brien, H. Colquhoun, D. Levac, D. Moher, M. D. Peters,
T. Horsley, L. Weeks, et al., PRISMA extension for scoping reviews (prisma-scr): checklist and
explanation, Annals of internal medicine 169 (2018) 467–473.
[51] Scopus, https://www.scopus.com/search/form.uri?display=basic#basic, accessed: 2024-02-15.
[52] Web of Science, https://www.webofscience.com/wos, accessed: 2024-02-15.
[53] IEEE Xplore, https://ieeexplore.ieee.org/Xplore/home.jsp, accessed: 2024-02-15.
[54] ACM Digital Library, https://dl.acm.org/, accessed: 2024-02-15.
[55] PubMed, https://pubmed.ncbi.nlm.nih.gov/, accessed: 2024-02-15.
[56] M. A. Kadir, A. Mosavi, D. Sonntag, Assessing XAI: Unveiling evaluation metrics for local
explanation, taxonomies, key concepts, and practical applications (2023).
[57] D. Gunning, E. Vorm, Y. Wang, M. Turek, DARPA's explainable AI (XAI) program: A retrospective, Authorea Preprints (2021).
[58] AsPredicted, https://aspredicted.org/, accessed: 2024-02-15.
[59] Y. Rong, T. Leemann, T.-T. Nguyen, L. Fiedler, P. Qian, V. Unhelkar, T. Seidel, G. Kasneci, E. Kasneci,
Towards human-centered explainable AI: A survey of user studies for model explanations, IEEE
Transactions on Pattern Analysis and Machine Intelligence (2023).
[60] S. Mohseni, N. Zarei, E. D. Ragan, A multidisciplinary survey and framework for design and
evaluation of explainable AI systems, ACM Transactions on Interactive Intelligent Systems (TiiS)
11 (2021) 1–45.
[61] U. Ehsan, S. Passi, Q. V. Liao, L. Chan, I. Lee, M. Muller, M. O. Riedl, et al., The who in explainable
AI: How AI background shapes perceptions of AI explanations, arXiv preprint arXiv:2107.13509
(2021).
[62] R. Lindsey, A. Daluiski, S. Chopra, A. Lachapelle, M. Mozer, S. Sicular, D. Hanel, M. Gardner,
A. Gupta, R. Hotchkiss, et al., Deep neural network improves fracture detection by clinicians,
Proceedings of the National Academy of Sciences 115 (2018) 11591–11596.
[63] K. Amara, A. Aouf, H. Kennouche, A. O. Djekoune, N. Zenati, O. Kerdjidj, F. Ferguene, COVIR:
A virtual rendering of a novel nn architecture o-net for covid-19 ct-scan automatic lung lesions
segmentation, Computers &amp; Graphics 104 (2022) 11–23.
[64] S. Jadhav, G. Deng, M. Zawin, A. E. Kaufman, COVID-view: Diagnosis of COVID-19 using chest CT, IEEE Transactions on Visualization and Computer Graphics 28 (2021) 227–237.
[65] T. Chanda, K. Hauser, S. Hobelsberger, T.-C. Bucher, C. N. Garcia, C. Wies, H. Kittler, P. Tschandl,
C. Navarrete-Dechent, S. Podlipnik, et al., Dermatologist-like explainable AI enhances trust and
confidence in diagnosing melanoma, Nature Communications 15 (2024) 524.
[66] S. Chari, P. Acharya, D. M. Gruen, O. Zhang, E. K. Eyigoz, M. Ghalwash, O. Seneviratne, F. S. Saiz, P. Meyer, P. Chakraborty, et al., Informing clinical assessment by contextualizing post-hoc
explanations of risk prediction models in type-2 diabetes, Artificial Intelligence in Medicine 137
(2023) 102498.
[67] K. Aliyeva, N. Mehdiyev, Uncertainty-aware multi-criteria decision analysis for evaluation of
explainable artificial intelligence methods: A use case from the healthcare domain, Information
Sciences 657 (2024) 119987.
[68] S. Domínguez-Rodríguez, H. Liz-López, A. Panizo-LLedot, Á. Ballesteros, R. Dagan, D. Greenberg,
L. Gutiérrez, P. Rojo, E. Otheo, J. C. Galán, et al., Testing the performance, adequacy, and
applicability of an artificial intelligence model for pediatric pneumonia diagnosis, Computer
Methods and Programs in Biomedicine 242 (2023) 107765.
[69] I. Neves, D. Folgado, S. Santos, M. Barandas, A. Campagner, L. Ronzio, F. Cabitza, H. Gamboa,
Interpretable heartbeat classification using local model-agnostic explanations on ecgs, Computers
in Biology and Medicine 133 (2021) 104393.
[70] A. Kumar, R. Manikandan, U. Kose, D. Gupta, S. C. Satapathy, Doctor’s dilemma: evaluating an
explainable subtractive spatial lightweight convolutional neural network for brain tumor diagnosis,
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 17
(2021) 1–26.
[71] M. Dabass, S. Vashisth, R. Vig, MTU: A multi-tasking u-net with hybrid convolutional learning
and attention modules for cancer classification and gland segmentation in colon histopathological
images, Computers in Biology and Medicine 150 (2022) 106095.
[72] J. Pedrosa, P. Sousa, J. Silva, A. M. Mendonça, A. Campilho, Lesion-based chest radiography image
retrieval for explainability in pathology detection, in: Iberian Conference on Pattern Recognition
and Image Analysis, Springer, 2022, pp. 81–94.
[73] Y. Xu, M. Hu, H. Liu, H. Yang, H. Wang, S. Lu, T. Liang, X. Li, M. Xu, L. Li, et al., A hierarchical deep
learning approach with transparency and interpretability based on small samples for glaucoma
diagnosis, NPJ digital medicine 4 (2021) 48.
[74] A. Pirovano, H. Heuberger, S. Berlemont, S. Ladjal, I. Bloch, Improving interpretability for
computer-aided diagnosis tools on whole slide imaging with multiple instance learning and gradient-based
explanations, in: Interpretable and Annotation-Efficient Learning for Medical Image Computing:
Third International Workshop, iMIMIC 2020, Second International Workshop, MIL3ID 2020, and
5th International Workshop, LABELS 2020, Held in Conjunction with MICCAI 2020, Lima, Peru,
October 4–8, 2020, Proceedings 3, Springer, 2020, pp. 43–53.
[75] D. Slijepcevic, F. Horst, S. Lapuschkin, B. Horsak, A.-M. Raberger, A. Kranzl, W. Samek, C.
Breiteneder, W. I. Schöllhorn, M. Zeppelzauer, Explaining machine learning models for clinical gait
analysis, ACM Transactions on Computing for Healthcare (HEALTH) 3 (2021) 1–27.
[76] R. Hendawi, J. Li, S. Roy, et al., A mobile app that addresses interpretability challenges in machine
learning–based diabetes predictions: Survey-based user study, JMIR Formative Research 7 (2023)
e50328.
[77] F. Sovrano, F. Vitali, An objective metric for explainable AI: how and why to estimate the degree
of explainability, Knowledge-Based Systems 278 (2023) 110866.
[78] P. Probst, S. Zaschke, P. Heger, J. C. Harnoss, F. J. Hüttner, A. L. Mihaljevic, P. Knebel, M. K. Diener,
Evidence-based recommendations for blinding in surgical trials, Langenbeck’s archives of surgery
404 (2019) 273–284.
[79] A. Hróbjartsson, F. Emanuelsson, A. S. Skou Thomsen, J. Hilden, S. Brorson, Bias due to lack of
patient blinding in clinical trials. A systematic review of trials randomizing patients to blind and
nonblind sub-studies, International Journal of Epidemiology 43 (2014) 1272–1283. doi:10.1093/
ije/dyu115.
[80] I. Boutron, C. Estellat, L. Guittet, A. Dechartres, D. L. Sackett, A. Hróbjartsson, P. Ravaud, Methods
of blinding in reports of randomized controlled trials assessing pharmacologic treatments: a
systematic review, PLoS medicine 3 (2006) e425.
[81] B.-Y. Mo, S. Nuannimnoi, A. Baskoro, A. Khan, J. Ariesta Dwi Pratiwi, C.-Y. Huang, ClusteredSHAP:
Faster GradientExplainer based on k-means clustering and selections of gradients in explaining
12-lead ECG classification model, in: Proceedings of the 13th International Conference on Advances
in Information Technology, 2023, pp. 1–8.
[82] A. Katzmann, O. Taubmann, S. Ahmad, A. Mühlberg, M. Sühling, H.-M. Groß, Explaining clinical
decision support systems in medical imaging using cycle-consistent activation maximization,
Neurocomputing 458 (2021) 141–156.
[83] P. Afflerbach, Verbal reports and protocol analysis, in: Methods of literacy research, Routledge,
2001, pp. 97–114.
[84] J. P. Simmons, L. D. Nelson, U. Simonsohn, Pre-registration: Why and how, Journal of Consumer Psychology 31 (2021) 151–162. doi:https://doi.org/10.1002/jcpy.1208.
[85] U. Simonsohn, L. D. Nelson, J. P. Simmons, P-curve: a key to the file-drawer., Journal of
experimental psychology: General 143 (2014) 534.
[86] J. L. Fleiss, Measuring nominal scale agreement among many raters., Psychological bulletin 76
(1971) 378.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence in healthcare: past, present and future</article-title>
          ,
          <source>Stroke and Vascular Neurology</source>
          <volume>2</volume>
          (
          <year>2017</year>
          )
          <fpage>230</fpage>
          -
          <lpage>243</lpage>
          . doi:
          <volume>10</volume>
          .1136/svn-2017-000101.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Explainable</surname>
            <given-names>AI</given-names>
          </string-name>
          :
          <article-title>A brief survey on history, research areas, approaches and challenges</article-title>
          ,
          <source>in: Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC</source>
          <year>2019</year>
          , Dunhuang, China, October 9-
          <issue>14</issue>
          ,
          <year>2019</year>
          , Proceedings,
          <source>Part II 8</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>563</fpage>
          -
          <lpage>574</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>U.</given-names>
            <surname>Pawar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>O'Shea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>O'Reilly</surname>
          </string-name>
          , Explainable AI in healthcare,
          <source>2020 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA)</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          . doi:
          <volume>10</volume>
          . 1109/CyberSA49311.
          <year>2020</year>
          .
          <volume>9139655</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Holzinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Pattichis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Kell</surname>
          </string-name>
          ,
          <article-title>What do we need to build explainable AI systems for the medical domain?</article-title>
          ,
          <source>arXiv preprint arXiv:1712.09923</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Loh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Ooi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Barua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Molinari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. R.</given-names>
            <surname>Acharya</surname>
          </string-name>
          ,
          <article-title>Application of explainable artificial intelligence for healthcare: A systematic review of the last decade (</article-title>
          <year>2011</year>
          -2022),
          <source>Computer Methods and Programs in Biomedicine</source>
          <volume>226</volume>
          (
          <year>2022</year>
          )
          <fpage>107161</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Borys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. A.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nauta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Seifert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Krämer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <article-title>Explainable AI in medical imaging: An overview for clinical practitioners-saliency-based XAI approaches</article-title>
          ,
          <source>European journal of radiology</source>
          (
          <year>2023</year>
          )
          <fpage>110787</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ivanovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Autexier</surname>
          </string-name>
          , M. Kokkonidis,
          <article-title>AI approaches in processing and using data in personalized medicine</article-title>
          ,
          <source>in: European Conference on Advances in Databases and Information Systems</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Saraswat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Prasad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tanwar</surname>
          </string-name>
          , G. Sharma,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Bokoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <article-title>Explainable AI for healthcare 5.0: opportunities and challenges</article-title>
          ,
          <source>IEEE Access 10</source>
          (
          <year>2022</year>
          )
          <fpage>84486</fpage>
          -
          <lpage>84517</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Doshi-Velez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Towards a rigorous science of interpretable machine learning</article-title>
          ,
          <source>arXiv preprint arXiv:1702.08608</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Mueller</surname>
          </string-name>
          , G. Klein,
          <string-name>
            <given-names>J.</given-names>
            <surname>Litman</surname>
          </string-name>
          ,
          <article-title>Metrics for explainable AI: Challenges and prospects</article-title>
          , arXiv preprint arXiv:
          <year>1812</year>
          .
          <volume>04608</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Essential properties and explanation efectiveness of explainable artificial intelligence in healthcare: A systematic review</article-title>
          ,
          <source>Heliyon</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohseni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zarei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Ragan</surname>
          </string-name>
          ,
          <article-title>A multidisciplinary survey and framework for design and evaluation of explainable AI systems</article-title>
          ,
          <source>ACM Trans. Interact. Intell. Syst</source>
          .
          <volume>11</volume>
          (
          <year>2021</year>
          ). URL: https: //doi.org/10.1145/3387166. doi:
          <volume>10</volume>
          .1145/3387166.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Visser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Scharlau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hammer</surname>
          </string-name>
          , Trust, distrust, and
          <article-title>appropriate reliance in (x) AI: a survey of empirical evaluation of user trust</article-title>
          ,
          <source>arXiv preprint arXiv:2312</source>
          .
          <year>02034</year>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Gandomi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holzinger</surname>
          </string-name>
          ,
          <article-title>Evaluating the quality of machine learning explanations: A survey on methods and metrics</article-title>
          ,
          <source>Electronics</source>
          <volume>10</volume>
          (
          <year>2021</year>
          )
          <fpage>593</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cardoso</surname>
          </string-name>
          ,
          <article-title>Machine learning interpretability: A survey on methods and metrics</article-title>
          ,
          <source>Electronics</source>
          <volume>8</volume>
          (
          <year>2019</year>
          )
          <fpage>832</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pietilä</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Moreno-Sánchez</surname>
          </string-name>
          ,
          <article-title>When an explanation is not enough: An overview of evaluation metrics of explainable AI systems in the healthcare domain</article-title>
          ,
          <source>in: Mediterranean Conference on Medical and Biological Engineering and Computing</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>573</fpage>
          -
          <lpage>584</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Salih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. B.</given-names>
            <surname>Galazzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gkontra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rauseo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lekadir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Radeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Menegaz</surname>
          </string-name>
          ,
          <article-title>A review of evaluation approaches for explainable AI with applications in cardiology</article-title>
          , Authorea
          <string-name>
            <surname>Preprints</surname>
          </string-name>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Di Martino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Delmastro</surname>
          </string-name>
          ,
          <article-title>Explainable AI for clinical and remote health applications: a survey on tabular and time series data</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>56</volume>
          (
          <year>2023</year>
          )
          <fpage>5261</fpage>
          -
          <lpage>5315</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xie</surname>
          </string-name>
          , M. van
          <string-name>
            <surname>Gerven</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Doran</surname>
          </string-name>
          ,
          <article-title>Explainable deep learning: A field guide for the uninitiated</article-title>
          ,
          <source>J. Artif. Int. Res</source>
          .
          <volume>73</volume>
          (
          <year>2022</year>
          ). doi:
          <volume>10</volume>
          .1613/jair.1.13200.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A unified approach to interpreting model predictions</article-title>
          , in: I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2017</year>
          . URL: https://proceedings. neurips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          , “
          <article-title>why should i trust you?”: Explaining the predictions of any classifier</article-title>
          ,
          <source>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          (
          <year>2016</year>
          ). doi:
          <volume>10</volume>
          .1145/2939672.2939778.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Vedantam</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Batra</surname>
          </string-name>
          , Grad-cam:
          <article-title>Visual explanations from deep networks via gradient-based localization</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>618</fpage>
          -
          <lpage>626</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sivaprasad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tintarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Oren</surname>
          </string-name>
          ,
          <article-title>Evaluation of human-understandability of global model explanations using decision tree, 2023</article-title>
          . arXiv:
          <volume>2309</volume>
          .
          <fpage>09917</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Maruf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Zukerman</surname>
          </string-name>
          , E. Reiter, G. Hafari,
          <article-title>Explaining decision-tree predictions by addressing potential conflicts between predictions and plausible expectations</article-title>
          , in: A.
          <string-name>
            <surname>Belz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
          </string-name>
          , E. Reiter, Y. Sripada (Eds.),
          <source>Proceedings of the 14th International Conference on Natural Language Generation</source>
          , Association for Computational Linguistics, Aberdeen, Scotland, UK,
          <year>2021</year>
          , pp.
          <fpage>114</fpage>
          -
          <lpage>127</lpage>
          . URL: https://aclanthology.org/2021.inlg-1.12.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>