<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Going Beyond Explainability in Medical AI Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ulrich Reimer</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edith Maier</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Beat Tödtli</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>185</fpage>
      <lpage>191</lpage>
      <abstract>
        <p>Despite the promises of artificial intelligence (AI) technology, its adoption in medicine has met with formidable obstacles due to the inherent opaqueness of internal decision processes that are based on models which are difficult or even impossible to understand. In particular, the increasing use of (deep) neural nets and the resulting black-box algorithms has led to widespread demands for explainability. Apart from discussing how explainability might be achieved, the paper also looks at other approaches to building trust, such as certification or controlling bias. Trust is a crucial prerequisite for the use and acceptance of AI systems in the medical domain.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainability</kwd>
        <kwd>Certification</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>AI</kwd>
        <kwd>Bias</kwd>
        <kwd>Medical Device</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        AI systems are beginning to have an impact in the medical domain. They include
applications for clinicians that enable rapid and accurate image interpretation or genome
sequencing, or that recommend treatments for patients (such as IBM Watson Health’s cancer
algorithm). AI technology can also be found in applications for lay people, such as biosensors
with continuous output of physiologic metrics, which enable people to process their own data
in order to promote their health. According to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], algorithms are indispensable for these purposes
because analysing such huge amounts of data exceeds human capacities. The question is
how AI and human intelligence can best be integrated.
      </p>
      <p>The decision on a patient’s next treatment is shared between the patient, the physician and
the medical AI system². The degree of autonomy and responsibility in the decision-making
process of each role is central to the attribution of trust since, clearly, systems require less
trust if they are supervised by a human or another system.</p>
      <p>Therefore, it may be useful to distinguish between decision support systems and decision
making systems, as well as to recognise the continuum between these extremes. For example,
a skin melanoma detection system may be viewed as a decision making system if
false-negative detections are not re-examined by a physician. Medical information retrieval
systems, on the other hand, are clearly only capable of providing decision support.</p>
      <p>¹ Institute for Information &amp; Process Management, University of Applied Sciences St.Gallen, firstname.lastname@fhsg.ch</p>
      <p>² One may object to the anthropomorphic notion of a system making a decision. Alternatively, the decision of an
AI system may be viewed as a shared decision by the physician and the engineers who built the system.</p>
    </sec>
    <sec id="sec-2">
      <title>Trust-Enhancing Features of Medical AI Systems</title>
      <p>
        An analysis of trust-influencing features of automated systems in general is given by Chien
et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The authors suggest three main categories, each having several dimensions, some
of which are particularly relevant for our topic:
      </p>
      <p>The category Performance Expectancy is defined “as an individual’s belief that
applying automation will help her to enhance job performance”. One particularly
relevant dimension of this category is perceived usefulness, which we will discuss
briefly in Section 2.5.</p>
      <p>The second category Process Transparency refers to factors that influence an
individual’s perceived difficulty in using an automated system. Relevant dimensions are
understandability (see Sec. 2.1) and reliability (see Sec. 2.4) of the system.</p>
      <p>Finally, the category of Purpose Influence relates to “a person’s knowledge of what
the automation is supposed to do”, i.e. the alignment of a user’s expectations and the
system designers’ goals. An important dimension of this category is the certification
of an automated system, which will be discussed in Section 2.2.</p>
      <p>
        An additional aspect, specific to data-driven AI systems and not mentioned in [
        <xref ref-type="bibr" rid="ref1">1</xref>
          ], is
controlling the sampling bias. We will deal with this aspect in Section 2.3.
      </p>
      <p>In this position paper we focus on the following questions:</p>
      <p>What makes a user trust a (medical) AI system to provide correct and adequate advice
or decisions?</p>
      <p>Will (medical) AI systems override experience and intuition, i.e. the largely tacit
knowledge harboured by experts, when it comes to taking decisions?</p>
      <p>While the above-mentioned issues are relevant for AI systems in general, they deserve
particular attention in medicine, where lives may be at stake. The behaviour of AI systems is
primarily determined by their underlying models. We therefore argue that certain characteristics
of these models, such as interpretability and representativeness, play a crucial role in the
usage and acceptance of AI systems.</p>
      <p>In the following subsections we will have a closer look at each of the above-mentioned
dimensions and discuss how they may influence a physician’s or patient’s trust in a medical
AI system.</p>
      <sec id="sec-2-1">
        <title>Explainability</title>
        <p>
          The trust-enhancing dimension of understandability is closely related to the notion of
explainability as discussed in the machine learning community [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Gilpin et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
        </p>
        <p>
          distinguish between explainability and interpretability. According to them, the goal of
interpretability is to describe the internals of a system in a way that is understandable to
humans. Since humans are biased towards simple descriptions, an oversimplification of the
description of the actual system is unavoidable and even necessary. Explainability includes
interpretability but additionally requires the explanation to be complete, i.e. not lacking
relevant aspects. Clearly, there is a trade-off between an explanation being complete and
comprehensible. Other authors decline to make a distinction between explainability and
interpretability (e.g. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]).
        </p>
        <p>
          Explainability is typically problematic for sub-symbolic models, i.e. (deep) neural networks,
as opposed to symbolic models such as decision trees, which can in principle be inspected by
a human. Nevertheless, inspecting a complex decision tree with hundreds or even thousands
of nodes would quite probably be pointless, since reading does not automatically imply
understanding. Thus we run into the performance-explainability trade-off between having
simple (symbolic) models that facilitate explainability and complex (possibly sub-symbolic)
models that result in better performance of the AI application but cannot be understood
anymore [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>Since sub-symbolic models cannot (easily) be inspected, several researchers have come
up with the idea of using an additional, simpler (symbolic) model just for the purpose
of explainability, while the actual system performance relies on the full-fledged, complex
(sub-symbolic) model [<xref ref-type="bibr" rid="ref7">7</xref>].</p>
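        <p>The idea of pairing a complex model with a simpler explanatory one can be made concrete with a
minimal sketch (the black-box function and the data below are hypothetical illustrations, not the
specific method of [7]): a single-feature threshold rule, standing in for a symbolic surrogate, is
fitted so as to reproduce the predictions of an opaque scoring function, and its fidelity to the
black box is measured.</p>

```python
# Minimal sketch: fit a one-feature threshold rule (a "symbolic" surrogate)
# that mimics the predictions of an opaque black-box classifier.
# The black-box function and the data are hypothetical illustrations.

def black_box(x):
    # Stand-in for a complex sub-symbolic model (e.g. a neural net).
    return 1 if 0.7 * x[0] + 0.3 * x[1] > 5.0 else 0

def fit_surrogate(samples):
    """Search for the single feature/threshold rule that best
    reproduces the black-box labels on the given samples."""
    labels = [black_box(x) for x in samples]
    best = (0, 0.0, 0.0)  # (feature index, threshold, fidelity)
    for f in range(len(samples[0])):
        for x in samples:  # candidate thresholds: observed feature values
            t = x[f]
            agree = sum((1 if s[f] > t else 0) == y
                        for s, y in zip(samples, labels))
            fidelity = agree / len(samples)
            if fidelity > best[2]:
                best = (f, t, fidelity)
    return best

samples = [(a, b) for a in range(11) for b in range(11)]
feature, threshold, fidelity = fit_surrogate(samples)
print(f"surrogate rule: x[{feature}] > {threshold}, fidelity {fidelity:.2f}")
```

<p>The surrogate recovers the dominant feature and a simple, explainable cut-off, but its fidelity
stays below 1.0: exactly the trade-off between completeness and comprehensibility discussed above.</p>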
        <p>
          Other authors suggest transferring part of the input-to-output transformation complexity from
the modelling to the preprocessing stage [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This may help to improve the explainability of
a model at the expense of a more complex and opaque preprocessing procedure.
        </p>
        <p>It can be stipulated that explainability, or at least interpretability, is an essential capability
of a medical AI system. Otherwise, a physician or clinician either just has to trust the
conclusions of the system or has to go through a subsequent verification process, which
may well be costly and time-consuming and thus nullify any potential efficiency benefits
of the AI system. At the same time, he or she may not be willing to go through a lengthy
explanation to understand the decision offered by the system. Since neither approach is
desirable or practicable, the task of verification is normally taken on by officially recognised
agencies or regulatory bodies such as the Food and Drug Administration (FDA).
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Certification of Medical AI Systems</title>
        <p>Medical AI systems need to be certified as medical devices by regulatory bodies such as the
already-mentioned FDA in the US or national bodies such as Swissmedic in Switzerland
before they can be used in practice. A clinical trial is usually at the core of the certification
process. By assuming responsibility for the adequacy of the medical AI system, regulatory
bodies provide an established source of trust.</p>
        <p>However, the certification of a medical AI system requires a different approach than that for
approving, e.g., a new drug. The decisions suggested by an AI system must be compared to
the decisions of a physician, whereas clinical trials evaluate new treatments by comparing
them with traditional treatments or treatment as usual. Since the model underlying an AI system
is a generalization of a possibly large but nevertheless limited input, it will sometimes come up with
inadequate decisions. Thus, for AI systems to be certifiable at all, it is important that other
criteria are fulfilled – e.g. that a clinician can easily cross-check the system’s decisions
against his/her own expertise, by simple lab tests or via the explanation given by the system.</p>
        <p>Depending on its complexity, the certification of an AI system can require a huge effort. One
possibility to simplify the process is to narrow down the functional range of the system, e.g.
by having it diagnose only one kind of disease or a small range of diseases. Breaking up
the diagnostic scope of an AI system into smaller systems, however, poses the problem
that a disease might not be diagnosed if it comes with atypical symptoms. A system with a
broader range can more easily make a differential diagnosis between several possible diagnoses.
Another approach to narrowing down the scope of an AI system is to restrict the range of patients it can
be applied to. For example, if it was developed on data from a group of people of a specific
ethnic group and gender, the clinical trial can be narrowed down to the same kind of sample
of applicants. This approach would also help with the issue of bias, as is further discussed
in Section 2.3.</p>
        <p>
          Certification amounts to model testing. This means the absence of errors (wrong diagnoses,
wrong therapies) cannot be shown. When the certification process uses a sample with a
similar bias as the sample used for developing the AI system, there might exist fundamental
flaws that will not be uncovered during certification. Thus, instead of only doing model testing
via a clinical trial, a certification process should additionally include inspecting the model
and checking it for plausibility. Here, an explanation component could provide considerable
support and help to “ensure algorithmic fairness, identify potential bias/problems in the
training data, and to ensure that the algorithms perform as expected” [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Reducing Sampling Bias in Medical AI Systems</title>
        <p>
          Sampling bias refers to the bias incurred due to the specific data set chosen to train an
AI system. The resulting system extrapolates from the training data to the general case. If
the training set is skewed the system does not generalize well. For example, if a medical
AI system is trained on data from Asian people it might not work well for Africans or
Europeans. While the bias concerning gender and ethnic group can be controlled relatively
easily [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], the feature space is huge, so other, less obvious biases can exist that neither
the developer nor the certification agency is aware of. The problem is that it is usually
unknown what the effect of a feature on the generated model is and how its values should
be distributed to provide a representative sample.
        </p>
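        <p>Testing for a specific bias can be as simple as comparing a system’s performance across the
patient subgroups in question. The following sketch (with hypothetical groups and predictions)
computes per-group accuracy; a marked gap between groups is a warning sign that the training
sample may be skewed:</p>

```python
# Minimal sketch: test for a specific sampling bias by comparing a
# model's accuracy across patient subgroups. Data and group names
# are hypothetical illustrations.

def accuracy_by_group(records):
    """records: list of (group, predicted, actual) tuples."""
    totals, correct = {}, {}
    for group, pred, actual in records:
        totals[group] = totals.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + (pred == actual)
    return {g: correct[g] / totals[g] for g in totals}

records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 1), ("group_b", 1, 1), ("group_b", 0, 0),
]
per_group = accuracy_by_group(records)
print(per_group)  # group_a performs markedly better than group_b here
```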
        <p>
          Bias can be reduced by utilizing domain knowledge about features and their impact on the
learned model [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Another approach is testing for specific biases. When using a medical AI
system, a patient may rightfully ask how well she or he is represented in the training data.
One way to find out might be to go back to the original dataset and determine the nearest
neighbours of the data point representing the patient. Comparing the number and closeness
of the patient’s nearest neighbours to those of the average person in the sample
gives an estimate of how well that patient is covered, or whether he or she is an outlier.
        </p>
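        <p>The nearest-neighbour check just described can be sketched as follows (a pure-Python
illustration; the Euclidean metric, the choice of k and the data are assumptions made for the
sketch):</p>

```python
import math

def knn_distances(point, data, k=3):
    """Mean Euclidean distance from `point` to its k nearest neighbours."""
    dists = sorted(math.dist(point, x) for x in data if x != point)
    return sum(dists[:k]) / k

def coverage_ratio(patient, training_data, k=3):
    """Compare the patient's neighbourhood density with the average
    in-sample density: values well above 1 suggest the patient is
    poorly covered (an outlier) in the training data."""
    patient_d = knn_distances(patient, training_data, k)
    avg_d = sum(knn_distances(x, training_data, k)
                for x in training_data) / len(training_data)
    return patient_d / avg_d

# Hypothetical two-feature training sample clustered near the origin
train = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (0.1, 0.3), (0.3, 0.1), (0.0, 0.2)]
print(coverage_ratio((0.1, 0.1), train))  # well covered: ratio near or below 1
print(coverage_ratio((5.0, 5.0), train))  # outlier: ratio far above 1
```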
      </sec>
      <sec id="sec-2-4">
        <title>Reliability</title>
        <p>
          The reliability of an AI system may influence the willingness to use it [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Reliability can
be estimated by classical measures from machine learning such as precision, specificity,
sensitivity (or recall), false positive rate and false negative rate. An important aspect in this
context is whether the system is optimized to reduce false positives or false negatives.
        </p>
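        <p>These measures follow directly from the four confusion-matrix counts; a minimal sketch with
hypothetical counts for a screening system tuned to keep false negatives low:</p>

```python
# Minimal sketch: classical reliability measures computed from the four
# confusion-matrix counts (the counts used here are hypothetical).

def reliability_metrics(tp, fp, tn, fn):
    return {
        "precision": tp / (tp + fp),            # how many flagged cases are real
        "sensitivity": tp / (tp + fn),          # recall: how many real cases are found
        "specificity": tn / (tn + fp),          # how many healthy cases are cleared
        "false_positive_rate": fp / (fp + tn),  # healthy patients wrongly flagged
        "false_negative_rate": fn / (fn + tp),  # sick patients wrongly cleared
    }

# e.g. a screening system that accepts more false positives
# in exchange for missing few actual cases
m = reliability_metrics(tp=90, fp=30, tn=870, fn=10)
print(m)
```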
      </sec>
      <sec id="sec-2-5">
        <title>Perceived Usefulness</title>
        <p>
          Finally, a further criterion for deciding to use a (medical) AI system is its perceived
usefulness, which refers to a user’s belief that the system would be of assistance to achieve
a certain goal [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. This criterion is strongly influenced by reliability (see above) but also by
other factors such as the degree of support the system gives and how often it fails to come
up with a viable solution.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Points for discussion</title>
      <p>Let us summarize the main points to be addressed when discussing the prerequisites for
accepting AI technology in the medical domain:</p>
      <p>Physicians do not need an explanation component because a medical AI system
has already undergone a formal certification process, which relieves them of the
responsibility to ensure that advice obtained from the system is correct.</p>
      <p>Even if a medical AI system has obtained official certification, clinicians might at
least want to check the plausibility of a system’s decision. Therefore an explanation
component should be available that provides succinct and intuitive (but not necessarily
complete) explanations.</p>
      <p>AI systems can only be certified if the system is not too complex, i.e. has a limited
functional range and/or patient range. Otherwise the required effort would make
certification too costly and impractical.</p>
      <p>Certification should not only focus on system performance, i.e. on testing the
underlying model with concrete cases, but should also include inspection of the underlying
decision model as well as of the training and test sets used for creating the model.</p>
      <p>An AI system might be more reliable than a human expert, make decisions in less
time and at lower cost and come up with better suggestions, but it is still bound
to make errors eventually. Its error rate, however, should be lower than that of a
physician.</p>
      <p>The sample used to train the decision model of an AI system will always be biased.
To make up for this, it should be possible to determine for specific patients how well
they are covered by the training sample, to get an estimate of how much to trust the AI
system in those specific cases.</p>
      <p>Instead of applying a pure machine learning approach to generate the model underlying
an AI system, the model should also comprise (manually engineered) domain knowledge to
increase its reliability, e.g. for plausibility checking of a decision.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>As we have seen, there are approaches to building trust in medical AI systems that go beyond
explainability, such as certification, controlling the bias of the training set or checking how
well an individual patient is covered by the learned model. But even if all these are applied,
we would argue that full autonomy is unlikely to ever be attained for medical AI systems.
Humans will always be required for oversight of algorithmic interpretation of images and
data in this sensitive realm.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Shih-Yi</given-names>
            <surname>Chien</surname>
          </string-name>
          et al. “
          <article-title>An Empirical Model of Cultural Factors on Trust in Automation”</article-title>
          .
          <source>In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting</source>
          <volume>58</volume>
          (Oct.
          <year>2014</year>
          ), pp.
          <fpage>859</fpage>
          -
          <lpage>863</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Filip Karlo</given-names>
            <surname>Došilović</surname>
          </string-name>
          , Mario Brčić, and Nikica Hlupić. “
          <article-title>Explainable artificial intelligence: A survey”</article-title>
          .
          <source>In: 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)</source>
          . IEEE.
          <year>2018</year>
          , pp.
          <fpage>0210</fpage>
          -
          <lpage>0215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Gilpin</surname>
          </string-name>
          et al. “
          <article-title>Explaining Explanations: An Overview of Interpretability of Machine Learning”</article-title>
          .
          <source>In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)</source>
          . Oct.
          <year>2018</year>
          , pp.
          <fpage>80</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. Clark</given-names>
            <surname>Johnson</surname>
          </string-name>
          et al. “
          <article-title>Sampling bias and other methodological threats to the validity of health survey research”</article-title>
          .
          <source>In: International Journal of Stress Management 7.4</source>
          (
          <year>2000</year>
          ), pp.
          <fpage>247</fpage>
          -
          <lpage>267</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>David</given-names>
            <surname>Martens</surname>
          </string-name>
          et al. “
          <article-title>Performance of classification models from a user perspective”</article-title>
          .
          <source>In: Decision Support Systems 51.4</source>
          (
          <year>2011</year>
          ), pp.
          <fpage>782</fpage>
          -
          <lpage>793</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Tim</given-names>
            <surname>Miller</surname>
          </string-name>
          . “
          <article-title>Explanation in Artificial Intelligence: Insights from the Social Sciences”</article-title>
          .
          <source>In: CoRR abs/1706.07269</source>
          (
          <year>2017</year>
          ). arXiv: 1706.07269.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Tulio</surname>
          </string-name>
          <string-name>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sameer</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Carlos</given-names>
            <surname>Guestrin</surname>
          </string-name>
          . “
          <article-title>"Why Should I Trust You?": Explaining the Predictions of Any Classifier”</article-title>
          .
          <source>In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          . New York, NY, USA: ACM,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Cynthia</given-names>
            <surname>Rudin</surname>
          </string-name>
          . “
          <article-title>Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead”</article-title>
          .
          <source>In: Nature Machine Intelligence</source>
          <volume>1</volume>
          .5 (
          <year>2019</year>
          ), pp.
          <fpage>206</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Eric J.</given-names>
            <surname>Topol</surname>
          </string-name>
          . “
          <article-title>High-performance medicine: the convergence of human and artificial intelligence”</article-title>
          .
          <source>In: Nature medicine 25.1</source>
          (
          <year>2019</year>
          ), pp.
          <fpage>44</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Zafar</surname>
          </string-name>
          et al. “
          <article-title>Fairness Constraints: A Flexible Approach for Fair Classification”</article-title>
          .
          <source>In: Journal of Machine Learning Research 20.75</source>
          (
          <year>2019</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>