<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Incorporating Explainable Artificial Intelligence (XAI) to aid the Understanding of Machine Learning in the Healthcare Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Urja Pawar</string-name>
          <email>Urja.Pawar@mycit.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Donna O'Shea</string-name>
          <email>Donna.OShea@cit.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Susan Rea</string-name>
          <email>Susan.Rea@cit.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruairi O'Reilly</string-name>
          <email>Ruairi.OReilly@cit.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cork Institute of Technology</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the healthcare domain, Artificial Intelligence (AI) based systems are being increasingly adopted with applications ranging from surgical robots to automated medical diagnostics. While a Machine Learning (ML) engineer might be interested in the parameters related to the performance and accuracy of these AI-based systems, it is postulated that a medical practitioner would be more concerned with the applicability and utility of these systems in the medical setting. However, medical practitioners are unlikely to have the prerequisite skills to enable reasonable interpretation of an AI-based system. This is a concern for two reasons. Firstly, it inhibits the adoption of systems capable of automating routine analysis work and prevents the associated productivity gains. Secondly, and perhaps more importantly, it reduces the scope of expertise available to assist in the validation, iteration, and improvement of AI-based systems in providing healthcare solutions. Explainable Artificial Intelligence (XAI) is a domain focused on techniques and approaches that facilitate the understanding and interpretation of the operation of ML models. Research interest in the domain of XAI is becoming more widespread due to the increasing adoption of AI-based solutions and the associated regulatory requirements [1]. Providing an understanding of ML models is typically approached from a Computer Science (CS) perspective [2] with a limited research emphasis being placed on supporting alternate domains [3]. In this paper, a simple, yet powerful solution for increasing the explainability of AI-based solutions to individuals from non-CS domains (such as medical practitioners) is presented. The proposed solution enables the explainability of ML models and the underlying workflows to be readily integrated into a standard ML workflow. Central to this solution are feature importance techniques that measure the impact of individual features on the outcomes of AI-based systems. It is envisaged that feature importance can enable a high-level understanding of a ML model and the workflow used to train the model. This could aid medical practitioners in comprehending AI-based systems and enhance their understanding of ML models' applicability and utility.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable Artificial Intelligence</kwd>
        <kwd>Healthcare</kwd>
        <kwd>Feature Importance</kwd>
        <kwd>Decision trees</kwd>
        <kwd>Explainable Underlying Workflow</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Interpretability is the degree to which the rationale of a decision can be
observed within a system [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. If a ML model's operation is readily understood then
the model is interpretable. Explainability is the extent to which the internal
operation of a system can be explained in human terms. XAI is comprised of
methodologies for making AI systems interpretable and explainable [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The context of interpretability and explainability is generally considered
domain-specific in an applied setting. For instance, a ML engineer and a medical
practitioner would have a different perspective on what is "explainable" when
viewing the same system. Interpretability from the perspective of the ML
engineer relates to understanding the internal working of a system so that the
technical parameters can be tuned to improve the overall performance. Interpretability
from the medical practitioner's perspective would relate to a higher-level
understanding of the internal operation of a system as it relates to the medical function
it provides. Explainability for a ML engineer may relate to presenting technical
information in an understandable format that enables effective evaluation of a
system while explainability for medical practitioners may be more related to the
rationale as to why a course of action is prescribed for a patient.</p>
      <p>It is postulated that AI-based systems need to accommodate a medical
practitioner's perspective to be considered explainable in a healthcare setting. This
presents several challenges which are highlighted and addressed as part of this
work:</p>
      <p>Designing domain-agnostic systems with XAI and simultaneously
accommodating multiple perspectives is a complex problem because
explanations require a context of the domain (engineering, medicine, or healthcare)
and can be useful for a targeted perspective but trivial for others. For instance,
presenting interactive visualisations to explain layers of a neural network is
beneficial for ML engineers but of less importance to the radiologists who use the
neural network for analysing MRI scans.</p>
      <p>The scope of interpretability and explainability for AI-based
solutions is broader than the operation of a ML model. It also concerns
the workflow adopted to train these models. The workflow can provide
technical knowledge regarding the pre-processing steps, the ML models used, and the
evaluation criteria (e.g. accuracy, precision) to the ML engineer. It can
benefit medical practitioners with an overview of the underlying data, the model's
interpretation of the data, and the performance metrics pertinent to medical
diagnostics. For instance, ML models used to make predictions based on a
patient's medical record might be inappropriate if the underlying training data
does not include records from similar demographics.</p>
      <sec id="sec-1-1">
        <title>The subjective nature of XAI in medical setting presents challenges</title>
        <p>
          such as as the association of a trained model's knowledge with the medical
features, the provisioning of explanations with regard to the underlying medical
dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], and an understanding of how the presence or absence of some medical
features' information a ects a model's performance and its interpretation of
features.
        </p>
        <p>
        There are several nuanced issues related to the challenges articulated. These
include: (a) a lack of explainability in the underlying feature engineering processes
that would incorporate clinical expertise; (b) complexity in the integration of XAI
approaches with existing ML workflows [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; (c) a lack of high-level explainability
of the data and the ML model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; and (d) a lack of explainability of a model's
operation in different medical settings.
        </p>
        <p>
          A standard ML workflow consists of several stages: data collection, data
pre-processing, modelling, training, evaluation, tuning, and deployment. XAI
approaches should endeavour to integrate interpretability and explainability into the
standard ML workflow. Feature Importance (FI) is a set of techniques that assign
weightings (scores) to each feature indicating their relative importance in making
a prediction or classification by a ML model [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. FI techniques are typically used
as part of the data pre-processing to enhance feature selection.
        </p>
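        <p>A minimal sketch of such a workflow, written in Python with scikit-learn, is given below. The file name, the simplistic handling of missing values, and the model configuration are illustrative assumptions rather than the exact pipeline used in this work.</p>
        <preformat>
# Standard ML workflow with an added FI stage (illustrative sketch).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Data collection and pre-processing (file name and zero-imputation are
# assumptions; the UCI dataset encodes missing values as "?").
df = pd.read_csv("risk_factors_cervical_cancer.csv", na_values="?").fillna(0)
X, y = df.drop(columns=["Biopsy"]), df["Biopsy"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Modelling, training, and evaluation.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))

# FI stage: one score per feature, computed after training without
# modifying any prior stage of the workflow.
fi_scores = dict(zip(X.columns, model.feature_importances_))
print(sorted(fi_scores.items(), key=lambda kv: kv[1], reverse=True))
        </preformat>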
      <p>Moving towards a solution: While addressing the challenges articulated
in their totality is beyond the scope of this paper, addressing the nuanced issues
outlined will provide the initial steps for a more complete solution to be
derived and is the primary contribution of this work. In this paper, FI techniques
are utilised as a means of enabling XAI. It is envisaged that FI will provide a
simple but powerful means of integrating XAI into the standard ML workflow
in a domain-agnostic manner. The approach can enable the explainability of a
ML model as well as the underlying workflow whilst accommodating multiple
perspectives. This is realised by three proposed approaches that utilise the
associations between FI scores, FI techniques, the inclusion/exclusion of features,
data augmentation techniques, and performance metrics. In doing so, it enables
multiple levels of explainability encapsulating the operation of the ML model
with different underlying datasets in different medical settings. The
explainability derived is expected to enable the clinical validation of AI-based systems as
discussed in the following sections.</p>
      <p>The remainder of the paper is organised as follows: Section 2 presents related
work with regard to XAI, FI, and its utilisation in an applied ML setting. Section
3 outlines the proposed methodology for enabling XAI in a standard ML
workflow. Section 4 outlines the results of the experimental work of the approaches
proposed. Section 5 presents a discussion and concluding remarks arising from
the work carried out to date.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Preliminary work for making ML models used in clinical domains increasingly
interpretable and explainable has been initiated in [
        <xref ref-type="bibr" rid="ref1 ref10 ref5 ref9">1, 5, 9, 10</xref>
        ]. The interpretability
and explainability of ML models enable ML engineers to understand and
evaluate a model's parameters (weights/coefficients) and hyper-parameters
(input size, number of layers) with the model's outcomes (predictions/classifications).
It can also enable medical practitioners to effectively comprehend and validate
the output derived from ML models as per their medical expertise [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ].
      </p>
      <p>
        There exists a variety of XAI methods that are applicable to the medical
domain. Ante-hoc XAI methods achieve interpretability without an additional
step, which makes them easier to adopt in existing ML workflows. They include
inherently interpretable ML models such as Decision trees [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Random Forests
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Generalised Additive Models (GAMs) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. They are typically used to
achieve interpretability at the cost of lower performance scores as compared to
complex ML models. However, their contribution towards enabling
explainability for non-CS perspectives in different domains is not extensively discussed
in the literature [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], an XAI-enabled framework to include clinical expertise in AI-based
systems is proposed in an abstract format. This work also discusses the use of
FI to enable the inclusion of clinical expertise when building AI-based solutions.
In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] FI scores based on Decision trees were used to analyse the importance
of features in classifying cervical cancer and achieving interpretability in the
model. However, the interpretability in relation to the underlying dataset was not
discussed. Also, the utilisation of the FI scores to enable explainability from the
perspective of medical practitioners was not addressed. In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] Random forests
were used to classify arrhythmia from time-series ECG data and FI scores were
presented as a means of achieving interpretability. However, as the time-series
ECG data has numeric values for each sampled record, the FI scores assigned
to each time-stamped value were not useful as effective conclusions cannot be
drawn by associating a FI score with a single amplitude value in a time-series
ECG wave.
      </p>
      <p>
        Post-hoc XAI methods are specifically designed for explainability and are
applied after a ML model is trained. This makes post-hoc methods difficult
to adopt but they are advantageous as they typically support multiple
non-interpretable but performant classifiers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Local Interpretable Model-agnostic
Explanations (LIME) is one of the commonly used post-hoc XAI methods. It
was developed to explain the predictions of any ML classifier by calculating FI
scores based on assumptions that do not always hold true across different
types of classifiers [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Shapley values are another post-hoc XAI technique that
was initially introduced in game theory to present the average expected marginal
contribution of a player in achieving a payout when all possible combinations of
players are considered [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In XAI, Shapley values are used to assign FI scores
to features (players) in achieving predictions (payout) made by a model. In [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
LIME and Shapley's FI scores were compared and it was found that Shapley's
FI scores were more consistent when compared to LIME's. This consistency
was derived on the basis of objective criteria including similarity, identity, and
separability, which are important considerations when generating and providing
explanations in a healthcare setting.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>The workflow adopted in this work is depicted in Figure 1. It follows a standard
ML workflow with the addition of the FI stage to enable post-hoc explainability.
The FI scores are calculated without modifying prior stages of the workflow and
are utilised to enable the explainability of the model and the inherent workflow.</p>
      <p>
        Three approaches are proposed that utilise FI scores to enable explainability
and interpretability of a ML model and the underlying dataset:
A1. Relative feature ranking: There needs to be a careful validation
regarding features that are considered more or less relevant by ML models in
healthcare [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This approach derives FI scores using two distinct methods. The
first is generated using Decision tree FI scores, where the FI score of a feature is based
on its position in the conditional flow of a classification process, and the second
is generated using Shapley values, based on weighting the feature's impact on
the model's outcome. The derived FI scores are collated and sorted in
descending order. This provides a high-level understanding of how a ML model ranks
different features to be considered while deriving an outcome.
      </p>
      <p>This enables a comparison between features that are considered important
by the classification model (realised by the first method) and the features that
fluctuate the ML model's outcome (realised by the second method). This
enables explainability to be derived as it provides a relative ranking of the features
as interpreted by a ML model along with their impact on the outcome. In an
applied setting, this can be used by medical practitioners to gain an understanding
of the features that are critical in formulating a medical diagnosis, highlighting
features whose values cannot be ignored due to their high impact on the model's
output.</p>
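      <p>A sketch of this approach is given below, reusing the model and data split from the earlier workflow sketch. The use of the shap package for the Shapley-based scores is an assumption, and its return type varies across versions, which the sketch accounts for.</p>
      <preformat>
# A1: collate and sort FI scores from the two methods (illustrative).
import numpy as np
import shap  # assumed implementation of Shapley-value FI scores

# Method 1: impurity-based FI from the trained Decision tree.
tree_fi = dict(zip(X_train.columns, model.feature_importances_))

# Method 2: mean absolute Shapley value per feature.
sv = shap.TreeExplainer(model).shap_values(X_train)
sv = np.asarray(sv[1] if isinstance(sv, list) else sv)
if sv.ndim == 3:   # some shap versions return (samples, features, classes)
    sv = sv[..., 1]
shapley_fi = dict(zip(X_train.columns, np.abs(sv).mean(axis=0)))

# Collate both score sets and sort in descending order for comparison.
for name, scores in (("Decision tree FI", tree_fi), ("Shapley FI", shapley_fi)):
    ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    print(name, ranking[:5])
      </preformat>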
      <p>A2. Feature importance in different medical settings: The availability
of medical information in different medical settings is not uniform (e.g. lack of
advanced medical tests in small clinics) and therefore, approaches followed by
medical practitioners belonging to different medical settings differ. This reduces
the associated utility of AI-based solutions. A gold-standard solution should be
designed to include all the relevant data while providing multiple versions to
acknowledge that different healthcare facilities will have different levels of access
to this data. This realisation dramatically broadens the applicability and utility
of the solution as it acknowledges the inclusion and exclusion of features in
different settings.</p>
      <p>This approach demonstrates the relative change of FI scores and performance
metrics based on the inclusion/exclusion of features. This enables a broader
understanding of a ML model and highlights its suitability to different medical
settings (e.g. a general practitioner in a clinic and an emergency room doctor
in a hospital will have access to significantly different levels of data regarding
an individual's health). If a ML model is trained upon a set of n features,
explainability can be derived by training the model on all possible subsets (2<sup>n</sup>) of
features, as sketched below. This can enhance the understanding of how features are re-ranked, and
performance is affected, based on the inclusion/exclusion of features.</p>
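      <p>The following sketch illustrates this enumeration, reusing the data split from the earlier workflow sketch. Exhaustive enumeration is exponential in n, so in practice it would be restricted to the feature subsets that plausibly occur in real medical settings.</p>
      <preformat>
# A2: retrain on every feature subset, recording F-score and FI scores
# so that re-ranking under inclusion/exclusion can be inspected.
from itertools import combinations
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

features = list(X_train.columns)
results = {}
for r in range(1, len(features) + 1):
    for subset in map(list, combinations(features, r)):
        m = DecisionTreeClassifier(random_state=0)
        m.fit(X_train[subset], y_train)
        f1 = f1_score(y_test, m.predict(X_test[subset]))
        results[tuple(subset)] = (f1, dict(zip(subset, m.feature_importances_)))
      </preformat>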
      <p>In an applied setting, this approach is useful to medical practitioners as
it aids their understanding based on the inclusion/exclusion of clinical test results
or medical information with an associated performance score. This enables an
informed evaluation regarding the suitability of the AI-based solution on a
per-actor basis.</p>
      <p>
        A3. Understanding the Data: The data on which a model was trained
and how it was pre-processed can have significant consequences in a medical
setting [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. As such, this approach demonstrates the association of FI scores
and performance metrics with the data augmentation techniques that are used on
a dataset. This association provides explainability by enabling an understanding
of how differences in underlying data impact the performance metrics and the
FI scores.
      </p>
      <p>In an applied setting, medical practitioners can validate the ranking of
features as interpreted by a ML model trained on data augmented using
different techniques and be able to associate it with the corresponding performance
metrics. This furthers the interpretability of the underlying workflow used for
processing the data and can enable better selection of augmentation techniques
by incorporating clinical expertise along with the expected performance metrics.</p>
      <p>The dataset and the modelling technique utilised for experimental work are
discussed in Sections 3.1 and 3.2 respectively. In Section 3.3, the two FI
techniques used in this work, one based on Decision trees and the other based on Shapley
values, are discussed.</p>
      <sec id="sec-3-1">
        <title>3.1 Dataset</title>
        <p>
          In this work, the "Cervical Cancer Risk Factors" dataset available from the UCI
data repository is used [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. This dataset was used in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] to train different ML
models to predict the occurrence of cervical cancer based on a person's health
record. The performance of different models was compared based on accuracy,
precision, recall, and F-score (harmonic mean of precision and recall) values [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
The work did not address the interpretability and explainability of ML models
or the underlying workflow.
        </p>
        <p>
          The dataset contains 36 feature attributes representing risk factors
responsible for causing cervical cancer and the results of some preliminary and advanced
medical tests. In the dataset, 803 out of 858 records have a negative Biopsy
result while 55 have a positive result. The class-imbalance problem is addressed
using Imbalanced-learn, which offers many data sampling techniques to balance the
number of majority and minority class samples [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. Table 1 denotes the number
of records corresponding to positive and negative biopsy results after a sampling
technique is applied.
        </p>
        <table-wrap id="tbl1">
          <label>Table 1.</label>
          <caption>
            <p>Number of samples under different data sampling techniques [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]. Legend: Number of Samples (Sam.), Biopsy results ratio - Positive (0): Negative (1).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Resampling Method</th><th>Sam.</th><th>0:1 Ratio</th></tr>
            </thead>
            <tbody>
              <tr><td>Random Over-sampling (ROS)</td><td>1606</td><td>803:803</td></tr>
              <tr><td>Adaptive Synthetic Over-sampling (ASS)</td><td>1606</td><td>803:803</td></tr>
              <tr><td>Random Under-sampling (RUS)</td><td>110</td><td>55:55</td></tr>
              <tr><td>Neighbourhood Cleaning Under-sampling (NCUS)</td><td>725</td><td>670:55</td></tr>
              <tr><td>SMOTETomek Combination sampling (S-TOM)</td><td>1600</td><td>800:800</td></tr>
              <tr><td>SMOTE Edited Nearest Neighbours Combination sampling (S-ENN)</td><td>1429</td><td>652:777</td></tr>
            </tbody>
          </table>
        </table-wrap>
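        <p>A sketch of this balancing step is given below, assuming the imbalanced-learn samplers that correspond to the techniques in Table 1; the class names follow the library's public API, and the mapping of each abbreviation to a sampler (e.g. ASS to ADASYN) is an assumption.</p>
        <preformat>
# Resampling the imbalanced Biopsy labels with imbalanced-learn.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler, ADASYN
from imblearn.under_sampling import RandomUnderSampler, NeighbourhoodCleaningRule
from imblearn.combine import SMOTETomek, SMOTEENN

samplers = {
    "ROS": RandomOverSampler(random_state=0),
    "ASS": ADASYN(random_state=0),
    "RUS": RandomUnderSampler(random_state=0),
    "NCUS": NeighbourhoodCleaningRule(),
    "S-TOM": SMOTETomek(random_state=0),
    "S-ENN": SMOTEENN(random_state=0),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)  # X, y from the workflow sketch
    print(name, Counter(y_res))                # class counts as in Table 1
        </preformat>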
      </sec>
      <sec id="sec-3-2">
        <title>3.2 ML Model</title>
        <p>Decision trees are graphs where nodes represent sets of data samples and edges
represent conditions. Each node has an associated impurity factor indicating
the diversity of classes/labels in that node. A node is pure if all the data
samples present in it belong to the same class/label. In a classification problem,
the conditions on the edges of Decision trees are designed to decrease the
impurity. Therefore, from the root node to the leaf nodes, the impurity factor decreases
and each leaf node should contain data samples that are classified under a single
class/label.</p>
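        <p>As a small worked example, the impurity of a node can be quantified with the Gini index, the default impurity criterion in scikit-learn's decision trees; using it here is an illustrative assumption.</p>
        <preformat>
# Gini impurity of a node: 1 minus the sum of squared class proportions.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini([0, 0, 1, 1]))  # 0.5 -> maximally mixed (impure) node
print(gini([1, 1, 1, 1]))  # 0.0 -> pure node
        </preformat>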
        <p>
          Decision trees are more interpretable as compared to complex models such
as Support Vector Machines or Neural Networks [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. They also provide sufficient
performance scores on the given dataset [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. In this work, a Decision tree was
chosen as it achieves sufficient performance while retaining interpretability [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Feature Importance (FI)</title>
        <p>
          FI identifies the important features, as considered by a ML model, from a dataset
for making a classification or prediction. In this paper, FI scores based on Decision trees and
Shapley values were used. When a Decision tree is trained, FI scores can be calculated by
measuring how much a feature contributes towards a decrease in the impurity
[
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. The FI scores obtained represent features considered important by the
Decision tree model. Shapley values can be used to generate the FI score of a
feature by first calculating a model's output including and excluding that feature
to get the contribution of that feature alone. This contribution is then weighted
in the presence of all subsets of features. The weighted contributions are summed over all
subsets of features to get a weighted and permuted FI score. The FI scores
obtained represent the impact of different features on a model's outcome.
        </p>
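        <p>Formally, for a feature i in a feature set F, this corresponds to the classical Shapley value [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], where f_S denotes the model evaluated on the feature subset S:</p>
        <disp-formula>
          <tex-math>\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right]</tex-math>
        </disp-formula>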
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>The main challenge in achieving explainability using FI was to present the FI
scores generated after training the model with different data augmentation
techniques and different feature sets in an integrated manner such that their
association with the performance metrics (e.g. accuracy, F-scores) and the relative ranking
of the features can be effectively utilised in a domain-agnostic manner. This
integration of information enables explainability of both the ML model and the
underlying data, thereby broadening the scope of explainability.</p>
      <p>
        The three approaches outlined in Section 3 have been implemented. The
value of the approaches and the derived explainability is demonstrated in this
section. This is considered a contribution towards a long-term generic workflow
for simplifying the integration of XAI in an applied setting such as healthcare.
      </p>
      <sec id="sec-4-0">
        <title>A1: Relative Feature Ranking</title>
        <p>
          When the ML model was trained on data
sampled using random over-sampling, FI scores assigned to different features
were plotted as depicted in Figure 2. Random over-sampling provided higher
accuracy than other sampling techniques [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], as such it was selected for this
approach.
        </p>
      <p>
        In Figure 2, the feature Schiller Test was omitted due to its high
correlation with the biopsy results as it is an advanced medical test conducted to
diagnose cervical cancer [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. It can be observed that the feature Hinselmann is
the highest-ranked feature by both FI approaches. There is a similarity between
the two sets of FI scores obtained as features considered important by the model
(represented by Decision tree based FI) will automatically have a higher impact
on its outcome (represented by Shapley based FI). The value derived from these
FI scores is that a medical practitioner can understand and validate the
ranking of features. This enables the incorporation of clinical expertise to improve
feature engineering processes and achieve improved models for future use.
        </p>
      </sec>
      <sec id="sec-4-1">
        <title>A2: FI in Different Medical Settings</title>
        <p>The random over-sampled data was used to train multiple instances of the model, each time on a subset of
features that excluded the highest-ranked feature from the previous instance. The
resulting FI scores assigned to individual features are indicative of the impact
the omission of the highest-ranked feature from a prior instance has on the ML
model and is depicted in Figure 3.</p>
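        <p>A sketch of this procedure is given below, reusing the data split and metric from the earlier sketches; the over-sampling of the training data is elided for brevity.</p>
        <preformat>
# A2: repeatedly drop the highest-ranked feature, retrain, and record
# the new F-score and FI scores (one bar per iteration in Figure 3).
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

remaining = list(X_train.columns)
history = []
while len(remaining) > 1:
    m = DecisionTreeClassifier(random_state=0).fit(X_train[remaining], y_train)
    f1 = f1_score(y_test, m.predict(X_test[remaining]))
    fi = dict(zip(remaining, m.feature_importances_))
    history.append((tuple(remaining), f1, fi))
    remaining.remove(max(fi, key=fi.get))  # omit the top-ranked feature
        </preformat>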
        <p>The change in the relative ranking of different features can be observed on
the omission of the highest-ranked feature. For instance, when all the features
were present (All features), Schiller (orange segment) was given the highest
importance followed by Age (yellow segment). When the feature Schiller was omitted
(second bar), Hinselmann (grey segment) was given the
highest importance instead of Age. Thus the inclusion or exclusion of a feature
does not behave in an ordered fashion as dictated by a gold-standard approach
that includes all features.</p>
        <p>Furthermore, the compounded omission of the highest-ranked features (left
to right) significantly reduces the total sum of FI scores assigned to features in
each instance (≈0.7 to ≈0.3). This is accompanied by a reduction in performance
metrics such as F-scores (denoted at the top of each bar) and indicates less
accurate models due to the absence of more important features or the presence
of less important features. This approach warrants the derivation of multiple
instances of a single model such that the relationship among features can be fully
understood and validated with clinical expertise. Based on a threshold
value of performance metrics, a medical practitioner can select a ML model that
is trained with the features that are accessible in their medical setting and
assigns appropriate importance scores to the available features while generating
an outcome.</p>
      </sec>
      <sec id="sec-4-2">
        <title>A3: Understanding Underlying Data</title>
        <p>The model was trained on the data sampled using the different sampling techniques discussed in Section 3.1 and FI
scores were plotted corresponding to each of the sampled versions as depicted
in Figure 4. FI scores relating to a particular type of sampled data are assigned
a particular colour. Performance metrics corresponding to each of the sampling
techniques are noted in the legend.</p>
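        <p>A sketch of this experiment is given below, reusing the samplers dictionary from the sketch in Section 3.1; here the samplers are applied to the training split only so that the test split remains untouched, which is an assumption about the evaluation protocol.</p>
        <preformat>
# A3: associate FI scores and performance with each sampling technique.
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    m = DecisionTreeClassifier(random_state=0).fit(X_res, y_res)
    f1 = round(f1_score(y_test, m.predict(X_test)), 3)
    fi = dict(zip(X_train.columns, m.feature_importances_))
    print(name, f1, sorted(fi.items(), key=lambda kv: kv[1], reverse=True)[:5])
        </preformat>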
        <p>This approach enables the interpretability of the underlying dataset by
presenting the difference in FI scores when using data augmented via different
techniques. For instance, in Figure 4, there is a lack of similarity in the FI
scores associated with under-sampling techniques (e.g. NCUS, RUS) as
compared to over/combination-sampling techniques (e.g. S-TOM, ROS). The lower
the volume of data generated using under-sampling techniques, the less diverse
the values of a feature. This is evident when comparing the sorted ordering of FI
scores in under-sampling techniques to over/combination-sampling techniques.</p>
        <p>
          As depicted in Figure 4, the under-sampled data provided lower accuracy and
recall values (≈70-90%) compared to the over/combination-sampled data (≈93-97%).
The association of performance metrics aligned with FI scores enables
the explainability to validate the suitability of datasets from a domain-specific
perspective. In contrast to over/combination-sampled data, in the under-sampled
data augmented using the NCUS technique (dark-red bars), Age is assigned a
higher FI score than the Cytology test, which would be considered an invalid
approach as a Cytology test is a diagnostic aid with a high level of efficacy when
detecting cervical cancer [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. A medical practitioner should disregard the use of
NCUS data due to its invalid FI ranking along with the low performance scores.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>XAI is a crucial tool for enabling medical practitioners to understand and
evaluate AI-based solutions effectively in the healthcare domain. It provides additional
benefits in the form of increased confidence in solutions being adopted amongst
medical practitioners and increased exposure to the operation of the solutions.</p>
      <p>In this paper, an alternative perspective regarding how FI scores can be
integrated into a ML workflow is adopted. FI scores are used to surface
pertinent information relating to associations between features, models, and data to
provide explainability. This perspective is realised in three distinct approaches.</p>
      <p>A1) A model/output-based perspective with regards to the relative
ranking of a feature; this informs the medical practitioner which features the model
considers most important and which features fluctuate the model outcome. A2)
Relative feature ranking in different medical settings; this incorporates a
hierarchical perspective which considers diagnostic capacity in the form of feature
inclusion/exclusion, aligning it more closely to the real world. This informs the
medical practitioner how the model will perform and rank features in different
medical settings, enabling a more informed interpretation of a model's operation.
A3) The impact of data augmentation approaches on the performance of a model
and the validity of their FI scores in a medical setting. This informs the medical
practitioner how suitable the augmented data is and how valid it is in a medical
setting. The simple but powerful nature of FI enables the applicability of the
three approaches proposed in a domain-agnostic manner.</p>
      <p>It is intended to extend the work by developing a framework that automates
the training and validation of models appropriate to the intended level of a
hierarchy in order to enable explainability from a multi-level perspective. The
workflow comprising that hierarchy will empirically evaluate the applicability of
combining XAI and recommendations to increase operational efficacy.</p>
      <p>Acknowledgement: This publication has emanated from research co-sponsored
by McKesson and Science Foundation Ireland under Grant number SFI CRT
18/CRT/6222.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Andreas Holzinger, Chris Biemann, Constantinos S. Pattichis, and Douglas B. Kell. What do we need to build explainable AI systems for the medical domain? pages 1–28, 2017.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Benjamin P. Evans, Bing Xue, and Mengjie Zhang. What's inside the black-box? A genetic programming method for interpreting complex machine learning models. In GECCO 2019 - Proceedings of the 2019 Genetic and Evolutionary Computation Conference, pages 1012–1020, 2019.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Danding Wang, Qian Yang, Ashraf Abdul, and Brian Y. Lim. Designing theory-driven user-centric explainable AI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems - CHI '19, pages 1–15, 2019.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Tim Miller. Explanation in artificial intelligence: Insights from the social sciences, 2017.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Erico Tjoa and Cuntai Guan. A survey on explainable artificial intelligence (XAI): Towards medical XAI. 2019.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. D. Douglas Miller. The medical AI insurgency: what physicians must know about data to practice with intelligent machines. npj Digital Medicine, 2(1):62, 2019.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Namrata Vaswani, Yuejie Chi, and Thierry Bouwmans. Rethinking PCA for modern data sets: Theory, algorithms, and applications [scanning the issue]. Proceedings of the IEEE, 106(8):1274–1276, 2018.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Ben Hoyle, Markus Michael Rau, Roman Zitlau, Stella Seitz, and Jochen Weller. Feature importance for machine learning redshifts applied to SDSS galaxies. Monthly Notices of the Royal Astronomical Society, 449(2):1275–1283, 2015.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721–1730, 2015.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Devam Dave, Het Naik, Smiti Singhal, and Pankesh Patel. Explainable AI meets healthcare: A study on heart disease dataset, 2020.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. X. Deng, Y. Luo, and C. Wang. Analysis of risk factors for cervical cancer based on machine learning methods. In 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS), pages 631–635, 2018.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. U. Pawar, D. O'Shea, S. Rea, and R. O'Reilly. Explainable AI in healthcare. In 2020 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), pages 1–2, 2020.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. P. Nisha, Urja Pawar, and Ruairi O'Reilly. Interpretable machine learning models for assisting clinicians in the analysis of physiological data.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Marco Tulio Ribeiro and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Guestrin</surname>
          </string-name>
          . \
          <string-name>
            <surname>Why Should I Trust You</surname>
          </string-name>
          <article-title>?" Explaining the Predictions of Any Classi er</article-title>
          . pages
          <volume>1135</volume>
          {
          <fpage>1144</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. R. El Shawi, Y. Sherif, M. Al-Mallah, and S. Sakr. Interpretability in healthcare: A comparative study of local machine learning interpretability techniques. In 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), pages 275–280, June 2019.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Kelwin Fernandes, Jaime S. Cardoso, and Jessica Fernandes. Transfer learning with partial observability applied to cervical cancer screening. In Iberian Conference on Pattern Recognition and Image Analysis, pages 243–250. Springer, 2017.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. Sean Quinlan, Haithem Afli, and Ruairi O'Reilly. A comparative analysis of classification techniques for cervical cancer utilising at risk factors and screening test results. In AICS, pages 400–411, 2019.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1):559–563, 2017.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>21. Anita W. W. Lim, Rebecca Landy, Alejandra Castanon, Antony Hollingworth, Willie Hamilton, Nick Dudding, and Peter Sasieni. Cytology in the diagnosis of cervical cancer in symptomatic young women: a retrospective review. The British Journal of General Practice, 66(653):e871–e879, December 2016.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>