<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Biomedical Informatics</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/J.JBI.2020.103655</article-id>
      <title-group>
        <article-title>Designing an evaluation framework for eXplainable AI in the Healthcare domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ivania Donoso-Guzmán</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KU Leuven, Department of Computer Science</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pontificia Universidad Católica de Chile</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>113</volume>
      <issue>2021</issue>
      <fpage>2376</fpage>
      <lpage>2387</lpage>
      <abstract>
        <p>The rapid adoption of Artificial Intelligence (AI) has brought automation and problem-solving capabilities to various fields, including healthcare. However, a significant challenge lies in the lack of explanation for AI predictions, particularly in healthcare, where transparency is crucial. This issue has led to the development of eXplainable AI (XAI), which focuses on constructing explanations for AI systems. However, the evaluation of these explanations lacks a standardized user-centric approach. This research proposes an evaluation framework for XAI methods to address this gap. The project involves four stages: conducting a systematic review of current evaluation methods, assessing the appropriateness of automatic evaluation of explanations, conducting user studies to gauge the framework's effectiveness in capturing the complexity of the user experience, and proposing evaluation guidelines. The desired outcome is a user-centric evaluation framework and guidelines, enhancing the scalability of XAI research and fostering confidence in adopting AI systems in the healthcare domain.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable AI</kwd>
        <kwd>User-Centric evaluation</kwd>
        <kwd>Human-Centered AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Motivation</title>
      <p>
        Research on artificial intelligence systems that seek to support medical teams in
decision-making has shown outstanding progress in recent years. These systems aim to automate the
greatest number of tasks and/or provide summarized and selected information to those who
make decisions in a hospital environment [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This allows healthcare professionals to spend
more time with patients on clinical tasks. However, these systems can make mistakes, have
significant degrees of uncertainty [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], or even have important biases [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Since clinicians are
responsible for making decisions that impact the well-being or life of people, this current state
of technology makes it difficult to adopt these systems confidently [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. In this context, it
has been proposed that providing explanations about AI models or single predictions could
potentially increase clinicians’ trust [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and ultimately boost adoption in healthcare settings [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        eXplainable Artificial Intelligence (XAI) is the sub-area of Artificial Intelligence that aims
to develop systems humans can understand, with the ultimate goal of increasing trust in AI
systems. Research in XAI has mainly focused on developing algorithms that explain predictions.
Different taxonomies have been defined to classify them, and methods work with different data
types (text, images, tabular data, time series). In the healthcare domain, some works have shown
the potential of using XAI to uncover models’ inner workings and explain single predictions.
Zech et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used saliency maps to show the learned patterns of a chest X-ray CNN-based
classifier. The model’s goal was to predict the risk of pneumonia. The authors found that
the CNN used non-disease-related image features to predict the risk. Kim et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] presented
an XAI system that connected visual features with appropriate semantic concepts to explain
its predictions in diabetic retinopathy. Gutiérrez et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] presented a user study on a call
recommendation system for nursing homes. In this study, the healthcare professionals had to
decide which call to attend based on the recommendation and explanations from the mobile
application.
      </p>
      <p>
        Although this area shows promising results, the value of explainable AI methods still has
to be proven in practice [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and this makes it difficult to deploy these systems in
their respective domains [
        <xref ref-type="bibr" rid="ref11 ref6">11, 12, 13, 6</xref>
        ]. Methodologies from the AI/ML community, such as
evaluation with a Ground Truth Dataset, cannot be used for these methods: the success of an
explanation depends on the user, their context, the AI model and the explanation object. Since no
clear consensus exists on properties and concepts related to explanations [14, 15], it has been
challenging to define a formal evaluation procedure for XAI methods. In addition, while many
researchers stress the importance of context, we are not aware of XAI evaluation methods that
treat explanations’ effects on humans as a complex user experience, assessing how different
explanation characteristics shape that experience. To better evaluate XAI methods,
researchers have tried to disentangle explanation characteristics into simple, measurable
properties such as completeness [14, 16], novelty [17, 18], and interactivity [14, 16]. However,
there is little evidence on how these properties relate to explanations being appropriate in
real scenarios [17]. It has been proposed that some properties relate to the user
experience; for instance, simpler explanations generate more understanding [
        <xref ref-type="bibr" rid="ref11">11, 17</xref>
        ], but many
of these relations have not been established with user studies.
      </p>
      <p>In this work, we aim to discover what characteristics of explanations of AI predictions
affect the user experience in terms of satisfaction, understanding, trust, reliability, adoption
propensity and task performance. For this, we will design an evaluation framework for
XAI-generated explanations in the context of healthcare applications with a user-centric approach.
The healthcare domain has several characteristics that make it an excellent place to design a
framework that could be later extended and generalized to other domains: AI technology has
shown promising results, there is interest in the community in developing better and more
robust tools, and there are different kinds of contexts (time to make a decision, user knowledge,
interaction object) in which the XAI systems can be applied. Our main goal is to understand how
explanations affect the user experience and, to do so, we will propose an evaluation framework
of XAI-based systems in the healthcare domain. This development could help to increase the
deployment of AI applications in the domain, with the ultimate goal of improving healthcare
practice.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. Evaluation methods in XAI</title>
        <p>Even though AI/ML models have standard evaluation metrics, there is still no consensus
on the strategy to evaluate XAI methods. According to the study by Nauta et al. [14], 58% of the
studies that present an XAI method performed a quantitative evaluation, 33% provided only
anecdotal evidence, 17% performed a proxy-task user study and only 5% performed a user study
with domain experts. Among the studies that performed quantitative evaluations, they found no
consistency in how the methods were evaluated. Seventy-three papers evaluated only one
property of explanations, and six of the 12 properties were measured by very few studies.</p>
        <p>According to Doshi-Velez and Kim [19], evaluations can be performed at three levels:
application-grounded, with real tasks and users; human-grounded, with real users and proxy
tasks; and functionality-grounded, with proxy tasks and no users. Most XAI evaluations
use an application or human-grounded approach. These user studies have been criticized for
their lack of rigor and for the use of proxy tasks [20].</p>
        <p>
          To conduct functionality-grounded evaluations, i.e. proxy tasks and no users, some studies
have focused on grouping concepts and defining properties [
          <xref ref-type="bibr" rid="ref11">11, 16, 18</xref>
          ] and their corresponding
metrics [14]. These works aggregate existing literature that defines properties or presents metrics
to assess them. The properties that have been proposed in these works try to measure the
quality of the explanations without context so that they can be used in functionality-grounded
evaluation. Recently, [21] presented a framework to benchmark different XAI methods using
automatic metrics; still, it is limited to certain methods and only works with datasets
created specifically for the benchmark.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. AI and XAI in healthcare</title>
        <p>
          The increasing availability of healthcare-related data has led to the creation of several
applications of AI in the domain. From pattern detection using wearable data to using Electronic Health
Records to improve patient care, Artificial Intelligence has started to play a role in clinical
research. However, few AI-based applications have been deployed in real clinical settings [13].
Some studies [
          <xref ref-type="bibr" rid="ref6">12, 6</xref>
          ] have suggested that clinicians’ lack of trust in AI systems has a big
impact on system deployments. There have been some [
          <xref ref-type="bibr" rid="ref1 ref7">7, 1</xref>
          ] qualitative studies to understand
the interactions between medical staff and AI/ML systems, but they have focused on system
aspects, not on medical practice.
        </p>
        <p>There is no consensus on the role of explanations in the clinical domain. Ghassemi et al. [22]
state that explanations are not the right direction to increase the deployment of such systems.
Instead, they propose using validation methods already in use in the medical community. On
the contrary, Amann et al. [23] propose to include explanations depending on the application
context, as long as they are proven to work.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Research Problem</title>
      <sec id="sec-3-1">
        <title>3.1. Problem statement</title>
        <p>The current situation of healthcare, AI, and XAI can be summarized as follows. Several AI
models have proven to work for specific medical tasks. However, these models can be biased
and always have uncertainty. In the healthcare domain, clinicians are responsible for decisions
that affect people’s lives; therefore, they need to ensure they understand what the AI models
are predicting. One way to achieve this is to generate explanations of what the model is doing
so that users can understand the logical procedure. Methods are being developed to generate
explanations, but there are many definitions of explanations; therefore, it is complicated to
describe what people could expect from them. This is because it is not well-defined what would
be a good explanation in different contexts and for different users. Accordingly, it is challenging
to establish a consistent evaluation that could increase adoption in the healthcare domain.</p>
        <p>To tackle this problem, we propose to first establish an evaluation procedure for XAI systems
and, second, use this framework to understand how explanations affect the user experience in
the healthcare domain. By deeply understanding these connections, guidelines for designing
appropriate XAI systems can be created and used to develop better systems that increase
adoption in the domain.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Objectives</title>
        <p>This project has two goals: (O1) to develop a standard evaluation procedure and (O2) to generate
guidelines for designing XAI systems. The first objective is decomposed into the following
sub-goals:
1. Define properties of explanations and their associated measurements.
2. Design an evaluation framework for XAI-generated explanations that can be applied in
the healthcare domain.
3. Find relations between properties to explain the user experience.
4. Provide researcher guidelines to help decide which properties and measurements are more
suited to the user study characteristics.
5. Provide guidelines on user study reporting to increase comparability between studies.
Once the evaluation procedure has been proposed, we will be able to use it to answer the
research questions by conducting user studies using the framework. With the outcome of these
user studies, we will be able to accomplish the second objective of generating design guidelines.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Research questions</title>
        <p>To be able to generate guidelines for XAI system design, we want to answer the following
questions:
(RQ1) What are the characteristics of explanations that relate to the user’s experience?
(RQ2) How do explanation characteristics affect the user experience in terms of satisfaction,
trust, understanding, adoption, task performance and reliability? Are there mediation
effects with other variables or characteristics that affect the user experience?
(RQ3) What measurements, or combination of those, are more effective for measuring
properties of explanations in the healthcare domain?
(RQ4) How could these metrics be operationalized to measure explanations from models of
different natures and for any dataset?</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Research plan</title>
      <p>To answer the research questions, I am working on four stages that depend on each other. The
project is divided into two parts: first, proposing an evaluation framework for XAI systems
(Stages 1 and 2) and, second, using this framework to understand how explanations affect the
user experience in the healthcare domain (Stages 3 and 4).</p>
      <sec id="sec-4-1">
        <title>4.1. Stage 1: Taxonomy of explanations evaluation (RQ1, O1)</title>
        <p>This stage will work towards answering RQ1. During this stage, works in the area of
evaluation of explanations will be reviewed to understand the following:
1. What are the properties of explanations according to the current literature?
2. What are the metrics that are used to measure said properties?
3. What is the relationship between the properties and the user experience?
4. What are the measurement models that are used to evaluate the user experience?</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Stage 2: Automation of evaluation (RQ4)</title>
        <p>This stage aims to create procedures to compute certain metrics related to explanation properties to
answer RQ4. To accomplish this, we will create a Python package that implements the metrics
that can be automated. This package could be used with any combination of XAI method, AI
model, and dataset.</p>
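        <p>As a minimal sketch of how such a model-agnostic metric could be exposed by the package, the following assumes tabular data, a scikit-learn-style prediction function, and feature-attribution explanations; the function name and the masking strategy are illustrative choices, not the final design:</p>
        <preformat>
# Hypothetical sketch of the planned package interface; the name
# `fidelity` and the masking strategy are illustrative, not final.
import numpy as np

def fidelity(predict_fn, X, attributions, top_k=5, baseline=0.0):
    """Fidelity proxy: how much the model output changes when the
    top-k features ranked by the explanation are masked out."""
    X_masked = X.copy()
    # Rank features per instance by absolute attribution value.
    top_idx = np.argsort(-np.abs(attributions), axis=1)[:, :top_k]
    for row, idx in enumerate(top_idx):
        X_masked[row, idx] = baseline
    # A larger output drop suggests the explanation highlighted
    # features the model actually relies on.
    return float(np.mean(np.abs(predict_fn(X) - predict_fn(X_masked))))
        </preformat>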
        <p>The set of metrics to be implemented is currently under evaluation. Metrics that measure
properties that have more connections with others and have been described as more relevant by
previous authors will be prioritised. Additionally, we will conduct user studies to evaluate the
alignment of these automatic metrics with the user experience. These evaluations will be conducted
in domains where users can be recruited more easily than in healthcare, such as recommender
systems, allowing for a quantitative study. This will consist of evaluating the user experience with
questionnaires and applying the metrics to the examples shown to the user. Both measurements
will be compared to understand whether the metric correctly measures the user experience
property.</p>
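        <p>As an illustration of this comparison (a sketch, not the final study protocol), the alignment between an automatic metric and the questionnaire scores could be quantified with a rank correlation; the function and variable names below are hypothetical:</p>
        <preformat>
# Illustrative alignment check: correlate an automatic metric with
# per-explanation questionnaire scores collected in the user study.
from scipy.stats import spearmanr

def metric_ux_alignment(metric_scores, ux_scores):
    """Spearman rank correlation between automatic metric values and
    aggregated user-experience ratings for the same explanations."""
    rho, p_value = spearmanr(metric_scores, ux_scores)
    return rho, p_value
        </preformat>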
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Stage 3: Evaluate alignment between theoretical properties and user experience (RQ2, O2)</title>
        <p>During this phase, we will conduct qualitative user studies in a clinical setting to evaluate
whether the properties of explanations we can measure align with the user experience. In this
stage, we also want to understand the factors that affect users in a healthcare context. We will
work on systems that are being developed by the Augment and HAIVis research groups. These
projects work with image classification and interactive dashboards that help clinicians make
decisions. The user study will consist of semi-structured interviews analysed using thematic
analysis. It considers a minimum of five participants, and we will stop recruiting when we reach
theme saturation.
        <p>The outcome of this stage will be an analysis of how the theoretical properties we found in
the previous stage work in practice in a healthcare-related setting. This will help us to answer
RQ2 and will contribute to O2 as well.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Stage 4: propose XAI evaluation guidelines (RQ2, RQ3, O2)</title>
        <p>The study in stage 3 will allow us to understand more deeply what properties give value to the
user experience. In this stage, we will test the complete evaluation framework quantitatively.
We will apply the evaluation framework to XAI applications in the healthcare
domain. The applications will be part of the current efforts of both research groups. The
questionnaires for the user experience will be validated using confirmatory factor analysis to
select the questions that more appropriately represent each property.</p>
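        <p>One possible instantiation of this validation step (a sketch assuming the open-source semopy library; the latent factors and item names below are placeholders for the actual questionnaire) is the following:</p>
        <preformat>
# Hedged sketch of the questionnaire-validation step using semopy;
# the latent factors and item names (q1..q6) are placeholders.
import semopy

# Each latent explanation property is measured by several items.
MODEL_DESC = """
trust =~ q1 + q2 + q3
understanding =~ q4 + q5 + q6
"""

def validate_questionnaire(responses):
    """Fit a confirmatory factor analysis to the questionnaire
    responses (a pandas DataFrame with one column per item)."""
    model = semopy.Model(MODEL_DESC)
    model.fit(responses)
    loadings = model.inspect()            # item loadings per factor
    fit_stats = semopy.calc_stats(model)  # CFI, RMSEA, etc.
    return loadings, fit_stats
        </preformat>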
        <p>The user studies will shed light on the relevant properties of explanation for the medical
domain, and we will be able to understand the relation between the properties of explanations
in this particular area. This new understanding will help us to answer RQ2, which refers to the
relation between the explanation properties in this particular domain. At the end of this stage,
we will be able to achieve O2; we expect to provide guidelines for the development of XAI
applications in the healthcare domain, sustained by the evidence provided by the user studies
that were conducted.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Contributions</title>
      <p>The contributions of this work are the following: we will provide the research community with
an evaluation procedure applicable to XAI systems, based on evidence from multiple works in
different domains, and by conducting user studies using this framework, we will be able to
generate guidelines for XAI design based on empirical evidence that could be used to design
more appropriate XAI systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research is supported by ANID BECAS/DOCTORADO NACIONAL 21202228, Basal Funds
for Center of Excellence FB210017 (CENIA), the Research Foundation Flanders (FWO, grant
G0A3319N) and KU Leuven (grant C14/21/072).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Smilkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viegas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Stumpe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Terry</surname>
          </string-name>
          ,
          <article-title>Human-centered tools for coping with imperfect algorithms during medical decision-making</article-title>
          ,
          <source>in: Conference on Human Factors in Computing Systems - Proceedings, Association for Computing Machinery</source>
          ,
          <year>2019</year>
          . doi:10.1145/3290605.3300234.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Teredesai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Eckert</surname>
          </string-name>
          ,
          <article-title>Interpretable Machine Learning in Healthcare</article-title>
          , in: 2018
          <source>IEEE International Conference on Healthcare Informatics (ICHI)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>447</fpage>
          -
          <lpage>447</lpage>
          . URL: https://ieeexplore.ieee.org/document/8419428/. doi:10.1109/ICHI.2018.00095.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdalla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McDermott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghassemi</surname>
          </string-name>
          ,
          <article-title>Hurtful words</article-title>
          ,
          <source>in: Proceedings of the ACM Conference on Health, Inference, and Learning</source>
          , ACM, New York, NY, USA,
          <year>2020</year>
          , pp.
          <fpage>110</fpage>
          -
          <lpage>120</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3368555.3384448. doi:10.1145/3368555.3384448.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaushal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khullar</surname>
          </string-name>
          ,
          <article-title>Should health care demand interpretable artificial intelligence or accept "black box" medicine?</article-title>
          ,
          <year>2020</year>
          . doi:10.7326/M19-2548.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Cutillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Foschini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kundu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mackintosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Collier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Colvis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gersing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shabestari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Southall</surname>
          </string-name>
          ,
          <article-title>Machine intelligence in healthcare - perspectives on trustworthiness, explainability, usability, and transparency</article-title>
          ,
          <year>2020</year>
          . doi:10.1038/s41746-020-0254-2.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gaube</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Suresh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Merritt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Berkowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lermer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Coughlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. V.</given-names>
            <surname>Guttag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Colak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghassemi</surname>
          </string-name>
          ,
          <article-title>Do as AI say: susceptibility in deployment of clinical decision-aids</article-title>
          ,
          <source>npj Digital Medicine</source>
          <volume>4</volume>
          (
          <year>2021</year>
          ). URL: http://dx.doi.org/10.1038/s41746-021-00385-9. doi:10.1038/s41746-021-00385-9.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tonekaboni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Mccradden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goldenberg</surname>
          </string-name>
          ,
          <article-title>What Clinicians Want: Contextualizing Explainable Machine Learning for Clinical End Use</article-title>
          ,
          <source>in: Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>359</fpage>
          -
          <lpage>380</lpage>
          . URL: http://proceedings.mlr.press/v106/tonekaboni19a.html.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Zech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Badgeley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Titano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. K.</given-names>
            <surname>Oermann</surname>
          </string-name>
          ,
          <article-title>Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study</article-title>
          ,
          <source>PLOS Medicine</source>
          <volume>15</volume>
          (
          <year>2018</year>
          ) e1002683. URL: https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002683. doi:10.1371/journal.pmed.1002683.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gilmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wexler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viegas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sayres</surname>
          </string-name>
          ,
          <article-title>Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)</article-title>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Gutiérrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Htun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanden Abeele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>De Croon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Verbert</surname>
          </string-name>
          ,
          <article-title>Explaining Call Recommendations in Nursing Homes: a User-Centered Design Approach for Interacting with Knowledge-Based Health Decision Support Systems</article-title>
          ,
          <source>International Conference on Intelligent User Interfaces, Proceedings IUI</source>
          (
          <year>2022</year>
          )
          <fpage>162</fpage>
          -
          <lpage>172</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3490099.3511158. doi:10.1145/3490099.3511158.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Markus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Kors</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Rijnbeek</surname>
          </string-name>
          ,
          <article-title>The role of explainability in creating trustworthy artificial intelligence for health care: A comprehensive survey of the terminology, design choices, and evaluation strategies</article-title>
          ,
          <source>Journal of Biomedical Informatics</source>
          <volume>113</volume>
          (
          <year>2021</year>
          ). doi:10.1016/j.jbi.2020.103655.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>