<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Machine Learning Models against Clinical Protocols for Enhanced Interpretability and Continuity of Care</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christel Sirocchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Sufian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Sabbatini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Bogliolo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Montagna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Pure and Applied Sciences, University of Urbino</institution>
          ,
          <addr-line>Piazza della Repubblica 13, 61029, Urbino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In clinical practice, decision-making relies heavily on established protocols, often formalised as rules. Concurrently, machine learning (ML) models, trained on clinical data, aspire to integrate into medical decision-making processes. However, despite the growing number of ML applications, their adoption into clinical practice remains limited. Two critical concerns arise, relevant to the notions of consistency and continuity of care: (a) accuracy - the ML model, albeit more accurate, might introduce errors that would not have occurred by applying the protocol; (b) interpretability - ML models operating as black boxes might make predictions based on relationships that contradict established clinical knowledge. In this context, the literature suggests using integrated ML models to reduce errors introduced by purely data-driven approaches and improve interpretability. However, there is a lack of appropriate metrics for comparing ML models with clinical rules in addressing these challenges. Accordingly, in this article, we first propose a metric to assess the accuracy of ML models with respect to the established protocol. Secondly, we propose an approach to measure the distance of explanations provided by two rule sets, with the goal of comparing the explanation similarity between clinical rule-based systems and rules extracted from ML models. The approach is validated by employing the Pima Indians Diabetes dataset, for which a well-grounded clinical protocol is available, by training two neural networks: one exclusively on data, and the other integrating knowledge. Our findings demonstrate that the integrated ML model achieves comparable performance to that of a fully data-driven model while exhibiting superior relative accuracy with respect to the clinical protocol, ensuring enhanced continuity of care. 
Furthermore, we show that our integrated model provides explanations for predictions that align more closely with the clinical protocol compared to the data-driven model.</p>
      </abstract>
      <kwd-group>
        <kwd>Informed AI</kwd>
        <kwd>interpretable AI</kwd>
        <kwd>clinical protocols</kwd>
        <kwd>diabetes</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Machine learning (ML) has revolutionised various industries, from manufacturing to finance,
and is now making its way into healthcare, a sector traditionally resistant to technological
disruptions. ML has achieved remarkable performance in various domains of clinical medicine,
outperforming human physicians in some cases and enabling the development of computer-aided
diagnosis systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. With thousands of studies applying ML algorithms to medical
data, only a handful have significantly contributed to clinical care, a stark contrast to the
substantial impact ML has had in other industries. Indeed, only a few of these systems have
been FDA-approved for healthcare use [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. (EXPLIMED - First Workshop on Explainable Artificial Intelligence for the medical
domain - 19-20 October 2024, Santiago de Compostela, Spain.)
      </p>
      <p>
        Resistance to embracing ML in clinical settings can be attributed to the prevailing reliance
on evidence-based clinical pathways, guidelines, and protocols as the foundation for clinical
decision-making [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Adherence to established guidelines and practices is at the core of the
consistency and continuity of care, defined as the degree to which a series of discrete healthcare
events is experienced by people as coherent and consistent over time and across different
healthcare providers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Introducing novel decision-support systems offering alternative
predictions and explanations may introduce variability among practices and practitioners,
potentially compromising the quality and efficiency of care.
      </p>
      <p>
        Novel ML models reporting superior performance compared to the current protocol might be
found unsuitable for clinical use if they (a) fail to correctly predict cases effectively managed
by the protocol in place due to potential liabilities, or (b) make predictions based on
confounding variables and erroneous relationships that contradict established clinical knowledge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Therefore, in order to foster continuity of care when developing novel decision-support systems
for healthcare, it is imperative not only to attain high overall accuracy but also to provide
predictions and explanations adhering to current clinical guidelines. To this end, approaches
integrating domain knowledge from clinical protocols into ML models have been proposed
and proved effective [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, metrics for evaluating the similarity of a novel model with
respect to established protocols in terms of predictions and explanations are still lacking.
      </p>
      <p>The main contribution of the manuscript is thus to introduce metrics capturing the adherence
of a model to established protocols in terms of accuracy and explanation of its predictions.
Specifically, we introduce the notions of relative accuracy to quantify the proportion of samples
correctly predicted by the model compared to those handled correctly by the existing protocol,
and of explanation similarity to quantify the degree of overlap between local explanations
provided by the protocol and the ML model for the dataset instances.</p>
      <p>
        Through a comparison between a neural network model, trained solely on data, and a model
incorporating domain knowledge encoded in a clinical protocol, we illustrate the potential of
these metrics using the PIMA dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. While conventional performance metrics cannot
definitively identify a superior model between the two, our proposed metrics reveal that the
integrated model introduces fewer errors into the decision-making process and provides
explanations that more closely mirror established practices. Consequently, these newly introduced
metrics serve as valuable tools for identifying the ML model that better aligns with the
protocol in place and is thus more suitable for integration into clinical practice in the prospect of
continuity and consistency of care.
      </p>
      <p>Incorporating domain knowledge from clinical protocols into ML and developing metrics to
evaluate the accuracy and interpretability of such models with respect to the protocol in place
represent a pivotal step towards overcoming the limitations of ML and facilitating its seamless
integration into medical practice.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and previous work</title>
      <p>
        As medical decision-making becomes increasingly complex due to the development of new
therapies and diagnostics, as well as the accumulation of health records, ML has emerged as a
promising tool to support medical decision-making processes for its ability to model complex
interactions between features [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, the application of ML in healthcare presents several
challenges, primarily related to the quantity, quality, and composition of clinical data, as well as
a lack of explainability and limited robustness [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. To address these limitations, the literature
reports on various integrative approaches that leverage multiple models, data sources, and
prior knowledge. A notable advancement is the paradigm of Informed Machine Learning, which
integrates data and prior knowledge derived from independent sources to strike a balance
between model complexity and effectiveness [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. This approach has gained attention in the
medical domain, where structured knowledge is abundant but data is often limited and noisy.
Recent contributions in this area have provided taxonomies of integration strategies applied to
the healthcare sector, with a focus on the integration of ML with rule-based expert systems,
highlighting that integration can be beneficial across all phases of the ML pipeline, from data
preprocessing and feature engineering, to model learning and output evaluation [
        <xref ref-type="bibr" rid="ref11 ref12 ref9">9, 12, 11</xref>
        ].
Particular emphasis is placed on strategies incorporating prior knowledge into the model’s
loss function, often through regularisation or penalty terms quantifying inconsistencies or
violations concerning the knowledge base. This approach has shown promising results in
clinical applications, enhancing model performance, robustness, and interpretability [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>Despite these advancements, there remains a critical need for metrics to evaluate the resulting
hybrid models against the knowledge base to measure adherence, and against the data-driven
counterpart to quantify knowledge injection. For a comprehensive comparative analysis, such
metrics must evaluate both accuracy and interpretability. This study aims to address this gap
by proposing metrics assessing model adherence to a knowledge base in terms of performance
and explainability, with immediate applications in evaluating hybrid models in clinical settings.</p>
      <sec id="sec-2-1">
        <title>2.1. Evaluating model performance</title>
        <p>
          A plethora of scores have been proposed to gauge the correctness of the predictions with respect
to the ground truth. In classification tasks, accuracy emerges as an intuitive metric, quantifying
the proportion of correctly classified instances. Sensitivity (recall) and specificity measure the
proportion of correctly identified true positives and true negatives, respectively, while precision
evaluates the ratio of true positives over positive predictions. F1-score, the harmonic mean of
precision and recall, balances both metrics. The area under the receiver operating characteristic
curve provides a comprehensive view of model performance across different thresholds but is
less interpretable to some stakeholders. Overall, accuracy and F1-score are the most popular
metrics but may yield overly optimistic results with imbalanced data. Recently, the Matthews
correlation coefficient has gained prominence in biomedical data analysis for yielding more
trustworthy results in imbalanced datasets [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          Determining the most suitable statistical metric remains challenging, with no consensus
reached [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Comparative analyses often leverage a diverse array of metrics for a comprehensive
evaluation, with the choice of the most appropriate metric contingent upon the specific case at
hand. In clinical contexts, for instance, recall often takes precedence, as the cost (risk) associated
with false negatives outweighs that of false positives (as a positive result typically leads to
additional, more precise tests, unlike a negative one). However, in such contexts where
longstanding guidelines are in place, it is also crucial to evaluate ML models with respect to the
protocol as well as the ground truth, as mistakes introduced by the model also carry high costs.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evaluating model explanations</title>
        <p>
          As ML architectures become increasingly complex, there arises a pressing need to bridge the
gap between the opaque nature of these models and human comprehension, especially in
domains like healthcare where transparency and interpretability are essential [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Addressing
this challenge, the field of eXplainable Artificial Intelligence (XAI) has emerged, developing
tools aimed at providing human-understandable explanations for AI-driven decisions, thereby
fostering transparency, trust, and collaboration between human expertise and computational
intelligence [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. XAI employs various techniques to provide reasoning on ML decisions, mainly
operating on two levels: local and global. In the former, individual model predictions are
analysed, while in the latter the overall behaviour of the model is analysed to identify patterns
and relationships in the data. Among XAI techniques, feature importance methods have emerged
as influential for identifying important variables. Additionally, example-based explanations
offer insights by presenting similar instances in the dataset that influenced the predictions.
        </p>
        <p>
          Rule extraction techniques translate the ML models into human-understandable rules or
decision trees which provide insights into the overall behaviour of the model across the entire
dataset [
          <xref ref-type="bibr" rid="ref16">16, 17</xref>
          ]. Moreover, a rule applicable to a given data instance indicates the conditions
that were satisfied to produce the corresponding outcome, offering explanations at the local
level. Several rule extraction algorithms exist in the literature. The Rule Extraction From Neural
Network Ensemble (REFNE) [18] was initially developed for extracting symbolic rules from
neural network ensembles. However, its accuracy decreases when the data is highly complex
or nonlinear. C4.5Rule-PANE [19] utilises the C4.5 rule induction algorithm to extract if-then
rules from neural networks and, like other tree-based algorithms, is susceptible to over-fitting.
TREPAN [20] constructs a decision tree by querying the underlying network to determine
output classes. However, it often extracts suboptimal rule sets and requires binary inputs.
        </p>
      <p>Decision trees, particularly Classification and Regression Trees (CART) [21], remain one of the
most prominent approaches in rule extraction. CART constructs a binary tree structure which is
then translated into human-readable rules by converting each possible path from the root to the
leaves into an if-then rule. Its strengths are simplicity, interpretability, and ability to handle both
categorical and numerical data effectively. Several other rule-extraction algorithms exist, as well
as software libraries dedicated to knowledge extraction, e.g., the PSyKE platform [22], providing
a unified software framework supporting various rule-extraction methods [20, 21, 23, 24, 25].
Several evaluation metrics are documented in the literature to assess the quality of extracted
rule sets. Among these metrics, the number of rules and average rule length reflect attributes of
the explainability of the rule extractor. The other metrics – completeness, correctness, fidelity,
robustness, and coverage – serve as general validation factors applicable to any rule extractor
method. These metrics primarily analyse properties of the rules as global explanations for the
model, offering a coarse-grained evaluation. Less attention is given to metrics assessing rules
as local explanations for dataset instances, which would offer a more nuanced and context-aware
evaluation, particularly relevant in the clinical setting. Furthermore, while most metrics
evaluate the properties of a single rule set, there is a noticeable scarcity of similarity measures
comparing multiple rule sets. Existing literature reports similarity metrics over dataset instances
(e.g., Jaccard [26], Cosine [27], Dice [28]) or similarities between rule-based knowledge bases
(e.g., XNOR). However, there is a lack of similarity metrics over rule-based local explanations
aggregated across data instances to provide global measures of similarity between rule sets.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>This section is structured as follows: Section 3.1 details the dataset and domain knowledge used
in our case study, Section 3.2 describes the machine learning model that integrates this domain
knowledge, while Section 3.3 introduces the metrics for accuracy and explainability used to
evaluate the integrated model against clinical knowledge and a data-driven model.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset and domain knowledge</title>
        <p>
          In this work, we present our investigations involving the Pima Indians Diabetes dataset,
originally compiled by the National Institute of Diabetes and Digestive and Kidney Diseases from a
study of the Pima Indian population, known for its notably high incidence of diabetes [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The
dataset comprises 768 medical profiles of women aged 21 and above, who underwent an Oral
Glucose Tolerance Test (OGTT) to measure their glucose and insulin levels at two hours (here
denoted G120 and I120). The target variable is binary, indicating a diabetes diagnosis within
five years. Table 1 reports the 8 input features available in the dataset. Missing values are
present in the attributes I120 (48.70%), skin thickness (29.56%), blood pressure (4.55%), BMI
(1.43%), and G120 (0.65%), and were imputed in this work with the median value of the
respective variable, as reported in the literature [29]. Further details about this dataset can
be found in Table 1.
Public health guidelines on type-2 diabetes risks report that individuals with a high BMI (≥
30) and high blood glucose level (≥ 126) are at severe risk for diabetes, while those with normal
BMI (≤ 25) and low blood glucose level (≤ 100) are less likely to develop diabetes. These
guidelines have been utilised to design rules [30] expressed as logic predicates (see Table 2).
        </p>
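        <p>The two guideline rules above can be sketched as a small predicate, e.g. in Python (an illustrative encoding of Table 2; the function name and the use of None for cases the protocol does not cover are our assumptions, not the paper's notation):</p>

```python
def protocol_prediction(bmi, glucose):
    """Apply the two clinical rules derived from public health guidelines.

    Returns 1 (severe diabetes risk), 0 (unlikely to develop diabetes),
    or None when neither rule fires (the protocol abstains).
    """
    if bmi >= 30 and glucose >= 126:
        return 1  # high BMI and high 2-hour blood glucose
    if bmi <= 25 and glucose <= 100:
        return 0  # normal BMI and low 2-hour blood glucose
    return None  # intermediate profile: no rule applies
```

Note that the protocol is partial: profiles falling between the two rules receive no prediction, which is precisely why the relative metrics defined later restrict attention to samples the protocol handles.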
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Integrated ML model</title>
        <p>The hybrid ML model examined in this study, herein denoted as KB-ML, integrates domain
knowledge in the loss function. Specifically, KB-ML is a neural network for binary classification
trained using a custom loss function that assigns greater weight to samples accurately predicted
by the clinical guidelines represented by the two logic predicates in Table 2. Formally, let D
denote a dataset comprising n instances, each represented by x_i, where i ranges from 1 to n.
Three n × 1 vectors y, p, and k can be defined. Vector y contains the ground-truth binary
labels, with each element denoted as y_i and representing the expected outcome for instance x_i.
Vector p contains the probability of the outcome belonging to the positive class predicted by
the neural network, with each element p_i corresponding to x_i. Finally, vector k contains the
predictions according to the rules in Table 2, i.e., each element k_i takes value 1 if x_i satisfies
the conditions of the first rule, 0 if it satisfies the second rule, and N/A otherwise. Then, the
Custom Total Loss (CTL) for the integrated model is computed as:</p>
        <p>CTL(y, p, k, λ) = (1/n) ∑_{i=1}^{n} CSL(y_i, p_i, k_i, λ), (1)</p>
        <p>where λ is the scaling factor controlling the influence of the additional loss term, CSL is the
custom binary cross-entropy loss for a single sample, defined as</p>
        <p>CSL(y_i, p_i, k_i, λ) = L(y_i, p_i) if y_i ≠ k_i, and (λ + 1) · L(y_i, p_i) if y_i = k_i, (2)</p>
        <p>and L is the standard binary cross-entropy loss for a single sample:
L(y_i, p_i) = − [y_i · log(p_i) + (1 − y_i) · log(1 − p_i)]. (3)</p>
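        <p>The custom loss can be sketched in pure Python as follows (a minimal illustration; encoding the N/A protocol prediction as None is our assumption):</p>

```python
import math

def bce(y, p, eps=1e-7):
    """Standard binary cross-entropy for a single sample."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def custom_total_loss(y, p, k, lam):
    """Mean per-sample BCE, up-weighted by (lam + 1) on samples where
    the protocol prediction k agrees with the ground truth y.
    Entries of k are 1, 0, or None (protocol abstains)."""
    total = 0.0
    for yi, pi, ki in zip(y, p, k):
        weight = (lam + 1.0) if ki is not None and ki == yi else 1.0
        total += weight * bce(yi, pi)
    return total / len(y)
```

In practice the paper trains neural networks with this loss, so an equivalent expression would be written against the tensors of the chosen framework; the per-sample weighting logic is the same.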
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Proposed evaluation metrics</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Relative accuracy</title>
          <p>Performance metrics can be redefined to evaluate adherence to accurate predictions set by
the rules, quantifying errors introduced by the model in comparison to the reference protocol.
As in Section 3.2, consider D as a dataset consisting of n samples represented by x_i, where
i ranges from 1 to n, and let k_i denote the prediction made by the clinical protocol for each x_i.
Additionally, let ŷ_i represent the binary prediction provided by a ML model for x_i. Relative
Accuracy (RA) can be defined as the fraction of samples correctly predicted by the protocol that
are also correctly predicted by the model:</p>
          <p>RA = |{i : x_i ∈ D ∧ y_i = k_i = ŷ_i}| / |{i : x_i ∈ D ∧ y_i = k_i}|, (4)</p>
          <p>where |·| denotes the cardinality of a set. Similarly, the relative counterparts of other
performance metrics, such as Relative Sensitivity or Recall (RR) and Relative Specificity (RS) with
respect to a given class c, can be defined as follows:</p>
          <p>RR = |{i : x_i ∈ D ∧ y_i = k_i = ŷ_i = c}| / |{i : x_i ∈ D ∧ y_i = k_i = c}|, (5)</p>
          <p>RS = |{i : x_i ∈ D ∧ y_i ≠ c ∧ k_i ≠ c ∧ ŷ_i ≠ c}| / |{i : x_i ∈ D ∧ y_i ≠ c ∧ k_i ≠ c}|. (6)</p>
          <p>This evaluation does not account for samples where the protocol makes errors or fails to provide
a prediction, requiring additional performance metrics for a comprehensive assessment.</p>
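          <p>The relative metrics amount to restricting the usual counts to the samples the protocol handles correctly, as in this Python sketch (function names are ours; protocol abstentions may be encoded as None and simply never match):</p>

```python
def relative_accuracy(y, k, y_hat):
    """Fraction of samples correctly predicted by the protocol (k == y)
    that the model also predicts correctly (y_hat == y)."""
    protocol_correct = [i for i in range(len(y)) if k[i] == y[i]]
    if not protocol_correct:
        return float("nan")
    both = sum(1 for i in protocol_correct if y_hat[i] == y[i])
    return both / len(protocol_correct)

def relative_recall(y, k, y_hat, c=1):
    """Relative sensitivity: restricted to samples of class c that the
    protocol predicts correctly."""
    idx = [i for i in range(len(y)) if y[i] == k[i] == c]
    return sum(1 for i in idx if y_hat[i] == c) / len(idx) if idx else float("nan")

def relative_specificity(y, k, y_hat, c=1):
    """Restricted to samples outside class c that the protocol also
    predicts outside class c."""
    idx = [i for i in range(len(y)) if y[i] != c and k[i] != c]
    return sum(1 for i in idx if y_hat[i] != c) / len(idx) if idx else float("nan")
```

For example, a model that reproduces every correct protocol decision scores RA = 1.0 even if its overall accuracy is lower than the protocol's.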
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Explanation similarity</title>
          <p>Applying XAI in clinical settings requires proper evaluation to ensure the explanations are both
technically sound and clinically useful. Rule sets extracted from ML models provide valuable
insights into model behaviour. Notably, rules extracted from diferent ML models can emphasise
diferent variables, even when predicting similar outcomes. Therefore, it is crucial to assess
the similarity of explanations provided by rules approximating predictors to those ofered by a
specified reference protocol. This evaluation helps determine which explanation aligns more
closely with the clinical protocol in use and better reflects clinical expertise.</p>
          <p>A novel explanation similarity strategy is here proposed to estimate the similarity of
explanations from rule-based predictors, whether extracted from black-box models or built on clinical
knowledge. This method allows for comparing explanations from integrated and data-driven
models with those provided by a clinical protocol, to verify which aligns better. A diagram
summarising the approach is shown in Figure 1. The method entails the following steps.
1. Rule extraction: symbolic knowledge is extracted from black-box predictors trained on a
given dataset and represented as rule sets that are both human- and machine-interpretable
and can provide explanations for predictions in the form of first-order logic clauses.
2. Feature discretisation: the features of the dataset are discretised according to the thresholds
found in the rules of the considered rule sets. This involves collecting all thresholds
associated with each feature and discretising the feature into intervals accordingly.
3. Rule vectorisation: each rule is assigned a vector representing the feature space, where
each element corresponds to an interval of a feature and is assigned a value of 1 if the
corresponding feature and interval satisfy the rule, and 0 otherwise.
4. Local explanation: for every rule set, and for each sample in the dataset, the rule satisfied
by the sample is identified and the corresponding vector is assigned to the sample.
5. Similarity calculation: the similarity between two rule sets is obtained by computing, for
each sample, the similarity between the vectors obtained from the two rule sets, and
averaging across all samples, while the similarity among more than two rule sets is obtained
by calculating the similarity between each pair of rule sets and averaging all scores.
Formally, let D represent a dataset comprising n samples denoted by x_i, where i ranges from 1
to n. Each sample is described by m input features, labelled f_1, f_2, . . . , f_m. Here, x_ij
represents the value of feature f_j in the instance x_i. For each input x_i, y_i denotes the
corresponding outcome. X and Y denote the domains of the inputs and outputs, respectively:
(x_i ∈ X) ∧ (y_i ∈ Y), ∀ i = 1, 2, . . . , n.</p>
          <p>Rule extraction. Let us consider a predictive function ℱ:
ℱ : X → Y, ℱ(x_i) = ŷ_i,
where ŷ_i is the value predicted by ℱ for the instance x_i. Then, a rule set ℛ mapping instances
to outputs and approximating the input-output relationship of ℱ can be obtained by analysing
ℱ. Let P be a set of q rule sets, either obtained from predictive functions by rule extraction or
available from domain knowledge, which we aim to compare:
P = {ℛ_1, ℛ_2, . . . , ℛ_q},
where
ℛ_j : X_j ⊆ X → Y, ∀ j = 1, 2, . . . , q.</p>
          <p>Each rule set consists of rules denoted by r. For instance, if rule set ℛ_j comprises t rules, then
ℛ_j = {r_1j, r_2j, . . . , r_tj}. Each rule r_ij in rule set ℛ_j is represented as a tuple (C_ij, ŷ_ij),
where C_ij constitutes a set of v conditions {c_1, c_2, . . . , c_v} and ŷ_ij represents the outcome
associated with that rule. Each condition c_h can be expressed by a tuple (f_h, l_h, u_h), where
f_h is the variable included in the condition, and l_h and u_h are the lower and upper bounds for
the condition. If a condition is defined over a discontinuous interval, it is separated into distinct
conditions. If a condition is of the type less than or greater than, the missing lower or upper
bound is replaced with the minimum or maximum value of that feature in the dataset.</p>
          <p>For instance, in the considered case study, where BMI is in the range [18, 67] and G120
in [44, 199], the rule set ℛ_1 presented in Table 2 is defined as ℛ_1 = {r_11, r_21}, where
r_11 = (C_11, diabetes) with C_11 = {(BMI, 30, 67), (G120, 126, 199)}, and r_21 = (C_21, healthy)
with C_21 = {(BMI, 18, 25), (G120, 44, 100)}.</p>
          <p>Feature discretisation. For the set of predictors P, we define the set of thresholds T as:
T = {T(f_1), T(f_2), . . . , T(f_m)}.
If feature f_j never occurs in any condition of P, then |T(f_j)| = 0. Each set T(f_j) can be
represented as an ordered set of thresholds retrieved from rule conditions as detailed above:
T(f_j) = (t_1, t_2, . . . , t_w), t_1 &lt; t_2 &lt; . . . &lt; t_w.</p>
          <p>Rule vectorisation. For each rule r, I_r(f_j) is a binary vector representing intervals for the
variable f_j. If f_j is not present in any rule, i.e., |T(f_j)| = 0, then this vector has zero length.
Otherwise, the vector has length |T(f_j)| − 1, and the z-th element of the vector corresponds to
the interval [t_z, t_{z+1}]. The z-th element of the vector is set to 1 if the values in the
corresponding interval meet all conditions on that variable for the considered rule, or if no
conditions on that variable are specified in the rule. Otherwise, the element is set to 0:
I_r(f_j)[z] = 1 if [t_z, t_{z+1}] ⊆ [l_h, u_h] ∀ (f_h, l_h, u_h) ∈ C_r : f_h = f_j;
I_r(f_j)[z] = 1 if f_h ≠ f_j ∀ (f_h, l_h, u_h) ∈ C_r;
I_r(f_j)[z] = 0 otherwise.
Then the vector V_r is obtained by concatenating all vectors I_r(f_j) into a single one:
V_r = I_r(f_1) I_r(f_2) . . . I_r(f_m).</p>
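          <p>The discretisation and vectorisation steps can be sketched in Python as follows (the rule encoding as dictionaries of per-feature bounds is our own illustrative choice, not the paper's data structure):</p>

```python
def feature_thresholds(rule_sets, features):
    """Collect, per feature, the sorted set of bounds appearing in any rule
    condition. Each rule set is a list of (conditions, outcome) pairs, with
    conditions encoded as {feature: (low, high)}."""
    thresholds = {f: set() for f in features}
    for rules in rule_sets:
        for conditions, _outcome in rules:
            for f, (low, high) in conditions.items():
                thresholds[f].update((low, high))
    return {f: sorted(v) for f, v in thresholds.items()}

def vectorise_rule(conditions, thresholds, features):
    """Binary vector over all feature intervals: 1 if the interval is fully
    compatible with the rule's conditions, or the feature is unconstrained."""
    vec = []
    for f in features:
        ts = thresholds[f]
        for lo, hi in zip(ts, ts[1:]):  # consecutive threshold pairs
            if f not in conditions:
                vec.append(1)  # feature unconstrained by this rule
            else:
                c_lo, c_hi = conditions[f]
                vec.append(1 if c_lo <= lo and hi <= c_hi else 0)
    return vec
```

Applied to the two-rule example above, each rule yields a six-element vector (three BMI intervals followed by three G120 intervals).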
          <p>Local explanation. Let D′ be the subset of instances in D for which each of the considered
rule sets can provide a prediction, i.e.,
D′ = { x_i | x_i ∈ D ∧ x_i ∈ ⋂_{j=1}^{q} X_j }.</p>
          <p>Then, R_ij is the set of rules in ℛ_j whose conditions are all satisfied by the instance x_i:
R_ij = ⋃_{r ∈ ℛ_j} { r | x_ih ∈ [l_h, u_h], ∀ (f_h, l_h, u_h) ∈ C_r }.
Here we assume, for each rule set in P, that each instance of the dataset satisfies all conditions
of exactly one rule, i.e., |R_ij| = 1 ∀ x_i ∈ D′. The vector corresponding to the rule in R_ij is
assigned to x_i and denoted as V_j(x_i). This provides a vectorised representation of the
explanation offered by rule set ℛ_j for the data instance x_i.</p>
          <p>Without loss of generality, the rule vectorisation and local explanation procedure can also
be applied to categorical variables. Instead of intervals defined by thresholds, we have vectors
representing subsets of possible categorical values, and conditions are verified by set inclusion.</p>
          <p>Similarity evaluation. Let s(V_1, V_2) be a similarity function on two binary vectors V_1 and
V_2. The similarity S(ℛ_1, ℛ_2) for two rule sets ℛ_1 and ℛ_2 in P can then be computed as
S(ℛ_1, ℛ_2, D′, s) = (1/|D′|) ∑_{x_i ∈ D′} s(V_1(x_i), V_2(x_i)). (7)</p>
          <p>The similarity among more than two rule sets is computed by calculating the pairwise
similarity between each pair of rule sets and then averaging across all pairs. For a set P of q
rule sets, the similarity is computed as:
S(P, D′, s) = (2 / (q(q − 1)|D′|)) ∑_{j=1}^{q} ∑_{z=j+1}^{q} ∑_{x_i ∈ D′} s(V_j(x_i), V_z(x_i)). (8)</p>
          <p>To compute the similarity of two binary vectors V_1 and V_2 of length L, various similarity
metrics s are available in the literature.</p>
          <p>XNOR similarity considers matching and non-matching elements:
XNOR(V_1, V_2) = (∑_{l=1}^{L} δ(V_1[l], V_2[l])) / L, (9)
where δ(V_1[l], V_2[l]) equals 1 if V_1[l] = V_2[l] and 0 otherwise.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>JACCARD similarity</title>
          <p>considers the intersection over the union of elements in both vectors:
JACCARD(V_1, V_2) = ∑_{l=1}^{L} V_1[l] · V_2[l] / (∑_{l=1}^{L} V_1[l] + ∑_{l=1}^{L} V_2[l] − ∑_{l=1}^{L} V_1[l] · V_2[l]), (10)
where ∑_{l=1}^{L} V_1[l] · V_2[l] counts the elements that are 1 in both vectors (intersection), while
the denominator counts the elements that are 1 in either vector (union).</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>COSINE similarity</title>
          <p>computes the cosine of the angle between the vectors:
COSINE(V_1, V_2) = ∑_{l=1}^{L} V_1[l] · V_2[l] / (√(∑_{l=1}^{L} V_1[l]²) · √(∑_{l=1}^{L} V_2[l]²)), (11)
where √(∑_{l=1}^{L} V_1[l]²) · √(∑_{l=1}^{L} V_2[l]²) is the product of the magnitudes of the vectors.</p>
        </sec>
        <sec id="sec-3-3-5">
          <title>DICE similarity</title>
          <p>divides twice the number of elements that are 1 in both vectors by the total count of elements
that are 1 in each vector:
DICE(V_1, V_2) = 2 · ∑_{l=1}^{L} V_1[l] · V_2[l] / (∑_{l=1}^{L} V_1[l] + ∑_{l=1}^{L} V_2[l]). (12)</p>
        </sec>
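        <p>The four similarity metrics, together with the per-sample averaging of Eq. 7, can be sketched in Python (names are ours; the edge-case conventions for all-zero vectors are our assumptions):</p>

```python
import math

def xnor(v1, v2):
    """Fraction of positions where the two binary vectors agree."""
    return sum(a == b for a, b in zip(v1, v2)) / len(v1)

def jaccard(v1, v2):
    """Intersection over union of the 1-elements."""
    inter = sum(a * b for a, b in zip(v1, v2))
    union = sum(v1) + sum(v2) - inter
    return inter / union if union else 1.0  # convention for all-zero vectors

def cosine(v1, v2):
    """Cosine of the angle between the vectors."""
    inter = sum(a * b for a, b in zip(v1, v2))
    denom = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return inter / denom if denom else 0.0

def dice(v1, v2):
    """Twice the intersection over the total number of 1-elements."""
    inter = sum(a * b for a, b in zip(v1, v2))
    total = sum(v1) + sum(v2)
    return 2 * inter / total if total else 1.0

def explanation_similarity(vecs1, vecs2, metric=xnor):
    """Average per-sample similarity between the local-explanation vectors
    produced by two rule sets over the same instances."""
    return sum(metric(a, b) for a, b in zip(vecs1, vecs2)) / len(vecs1)
```

A key practical difference is that XNOR rewards agreement on zeros as well as ones, whereas Jaccard, Cosine, and Dice only credit shared active intervals; on sparse explanation vectors XNOR therefore tends to report higher similarity.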
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Evaluation strategy</title>
        <p>The study conducted a comparison between two neural networks trained on the Pima Indians
Diabetes dataset. One model, termed the data-driven model (DD-ML), was exclusively trained
on data, while the other, referred to as the integrated or knowledge-based model (KB-ML), was
trained with a custom loss function incorporating knowledge from a knowledge base (KB), as
detailed in Section 3.2. Both neural networks were designed as feed-forward models, comprising
three fully connected layers: two hidden layers with rectified linear unit activation functions
and an output layer with a sigmoid activation function. DD-ML was trained using binary
cross-entropy loss, whereas KB-ML employed the customised loss function defined in Eq. 1, whose
tuning parameter, controlling the contribution of the KB to model learning, was varied from 0.5 to 4
in steps of 0.5. All neural networks were trained with a batch size of 20 for 25 epochs.</p>
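As an illustration, the architecture and a knowledge-augmented loss can be sketched in PyTorch as below. Since Eq. 1 is defined earlier in the paper, its exact form is not reproduced here: the masked cross-entropy against the protocol label is one plausible instantiation of the knowledge term, and the hidden-layer width is an arbitrary choice.

```python
import torch
import torch.nn as nn

class DiabetesNet(nn.Module):
    """Feed-forward net as in Sec. 3.4: two hidden ReLU layers, sigmoid output."""
    def __init__(self, n_features, hidden=16):  # hidden width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def kb_loss(pred, y, kb_label, kb_mask, lam):
    """Binary cross-entropy on the data plus a knowledge term weighted by lam.

    kb_label: label assigned by the clinical protocol (0/1 floats);
    kb_mask:  1.0 where the protocol makes a prediction, 0.0 elsewhere.
    A plausible sketch of Eq. 1, not its exact form.
    """
    bce = nn.functional.binary_cross_entropy
    data_term = bce(pred, y)
    # penalise disagreement with the protocol only where the protocol applies
    per_sample = bce(pred, kb_label, reduction="none")
    kb_term = (per_sample * kb_mask).mean()
    return data_term + lam * kb_term
```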
        <p>In all experiments, data was divided into training and testing sets using a 10-times
10-fold stratified cross-validation approach [31]. The performance and explainability metrics
computed for the integrated model were evaluated against the corresponding metrics for the
data-driven model using paired Student's t-tests with the Nadeau and Bengio correction [32].
Performance evaluation encompassed a range of metrics, including Accuracy (A), F1-score
(F1), Recall (R), Precision (P), Balanced Accuracy (BA), the Area Under the Receiver Operating
Characteristic Curve (ROC AUC), and Matthews Correlation Coefficient (MCC). Moreover, the
Relative Accuracy (RA), Relative Sensitivity (RR), and Relative Specificity (RS) metrics introduced
herein were computed for all models.</p>
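The corrected paired test is standard; a minimal sketch, assuming one score difference per fold of the 10-times 10-fold scheme (k = 100) and the usual Nadeau-Bengio variance correction (for the Pima dataset with 10 folds, n_test/n_train is roughly 77/691):

```python
import numpy as np
from scipy import stats

def nadeau_bengio_ttest(scores_a, scores_b, n_train, n_test):
    """Paired t-test over k resampled folds with the Nadeau-Bengio
    variance correction for overlapping training sets.
    Returns the t statistic and the two-sided p-value (df = k - 1)."""
    d = np.asarray(scores_a) - np.asarray(scores_b)
    k = d.size
    # inflate the variance by the train/test overlap term n_test / n_train
    corrected_var = (1.0 / k + n_test / n_train) * d.var(ddof=1)
    t = d.mean() / np.sqrt(corrected_var)
    p = 2.0 * stats.t.sf(abs(t), df=k - 1)
    return t, p
```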
        <p>Interpretable models approximating the predictions of the neural networks were obtained
by rule extraction using CART [21], available from the PSyKE library [33]. Rule sets were
extracted from DD-ML and KB-ML (trained with the tuning parameter set to 1.5) and denoted
as DD-ML<sub>X</sub> and KB-ML<sub>X</sub>, respectively. Thus, each experiment yields three rule sets: KB, which
formalises the clinical protocol; DD-ML<sub>X</sub>, which approximates the data-driven model; and
KB-ML<sub>X</sub>, which approximates the integrated model. The maximum number of leaves, and thus
of rules, in the CART rule-extraction process varied from 2 to 12. The fidelity of each obtained rule
set was evaluated in terms of accuracy and F1-score with respect to the black-box model.</p>
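A minimal sketch of this pedagogical rule-extraction step, using scikit-learn's CART in place of the PSyKE pipeline (function and variable names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

def extract_rules(black_box_predict, X, max_leaves):
    """Pedagogical rule extraction: fit a CART surrogate on the labels
    assigned by the black-box model; each leaf corresponds to one rule."""
    y_bb = black_box_predict(X)
    surrogate = DecisionTreeClassifier(max_leaf_nodes=max_leaves, random_state=0)
    surrogate.fit(X, y_bb)
    y_sur = surrogate.predict(X)
    # fidelity: agreement of the surrogate with the black box, not with ground truth
    return surrogate, accuracy_score(y_bb, y_sur), f1_score(y_bb, y_sur)
```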
        <p>The proposed explanation similarity metrics (leveraging XNOR, Dice, Jaccard, and Cosine
similarity) were computed between DD-ML<sub>X</sub> and KB, and between KB-ML<sub>X</sub> and KB, on two
subsets of the dataset. Initially, explanation similarity metrics were computed over samples
for which all considered predictors (KB, DD-ML, KB-ML) could make predictions, thus
excluding samples not handled by the protocol. Subsequently, explanation similarity metrics were
computed over samples for which all considered models made correct predictions. Finally, the
explanation similarity metrics were utilised to gauge the robustness of explanations. A
comparison was made among the rule sets extracted from the 100 instances of the KB-ML model
trained over the 10-times 10-fold cross-validation. A 100×100 similarity matrix was generated by
computing the pairwise explanation similarity with the XNOR operation between each pair of
model instances; the similarities were then averaged across all elements of the matrix. The same
process was repeated for DD-ML.</p>
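The robustness analysis over the cross-validation instances can be sketched as follows, where the input is a hypothetical matrix stacking one binary explanation vector per model instance (m = 100 in the paper's setting):

```python
import numpy as np

def mean_pairwise_xnor(explanations):
    """explanations: (m, n) binary matrix, one local-explanation vector per
    model instance. Returns the average of the full m x m pairwise
    XNOR similarity matrix (diagonal entries are 1 by construction)."""
    E = np.asarray(explanations)
    m = E.shape[0]
    # agreement between instances i and j, averaged over vector positions
    sim = np.array([[np.mean(E[i] == E[j]) for j in range(m)] for i in range(m)])
    return sim.mean()
```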
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Relative accuracy evaluation</title>
        <p>This integration of domain knowledge, modulated by a tuning parameter, influences the model’s
performance, which varies with the parameter as shown in Figure 2a. For the standard metrics,
performance increases, peaking for parameter values between 1 and 1.5, subsequently declining for A and
MCC, while stabilising for ROC. This trend suggests that while the learning bias introduced
by the protocol can be beneficial, excessive bias might impede the learning process, leading
to accuracy that falls below that of the data-driven model for parameter values greater than 2. The
proposed RA metric increases with the parameter, effectively detecting the reduction of errors introduced
by the integrated model with respect to the reference model. For values around 1.5, optimal
scores on the standard metrics are achieved, together with improved RA. This evaluation highlights the
need to tune the integration so as to maximise adherence without compromising performance.</p>
        <p>A comprehensive array of metrics comparing the data-driven model with the integrated model
(tuning parameter equal to 1.5), along with the corresponding p-values indicating statistical significance,
is reported in Table 3. The integrated model yields superior scores across all metrics except precision,
with statistical significance observed for BA, ROC, and R. Nonetheless, precision significantly
decreases, and improvements in MCC, F1, and A lack statistical significance. Therefore, it
remains challenging to conclusively state that one model is superior to the other. However,
the RA metric significantly improved from 0.90 to 0.97, driven by the increased RR (since RS
is maximal for both models). These findings highlight the greater alignment with the clinical
protocol, also seen in Figure 2b, making the integrated model preferable overall, and demonstrate
the role of the proposed metrics in facilitating this assessment.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Explanation similarity evaluation</title>
        <p>The model incorporating domain knowledge also offers explanations that better align with the
underlying reasoning of the knowledge base. Given the black-box nature of both the data-driven
and integrated neural networks, explanations for each prediction are provided via surrogate
rule sets, with the number of rules varying from 2 to 12, serving as approximations of the model’s
decision-making process. The surrogate models KB-ML<sub>X</sub> and DD-ML<sub>X</sub> closely mirror the
behaviour of the black-box models, reporting accuracy and F1 scores consistently above 0.85
across all rule set sizes, as shown in Figure 3a.</p>
        <p>Explanation similarity metrics computed over samples with a prediction from all considered
predictors (KB, DD-ML, KB-ML) reveal that the similarity of KB-ML<sub>X</sub> to the knowledge base
consistently exceeds that of DD-ML<sub>X</sub> across all similarity metrics and for every number of rules
considered (Figure 3b). These differences are statistically significant across all metrics and rule
set sizes. Notably, for the XNOR similarity, these differences maintain statistical significance at
the 0.01 level across all rule set sizes, making it the most effective approach for capturing the
impact of integration on improving explanation similarity to the established protocol. This is
unsurprising, as the other similarities emphasise the overlap of 1 values
between the two local explanation vectors, while XNOR similarity also accounts for the overlap
of 0s. This is desirable, as a 1 (i.e., a satisfied condition) is in this context as relevant
as a 0 (i.e., an unsatisfied condition). Explanation similarity metrics computed over samples for
which all considered models make a correct prediction verify that, with predictions being equal,
explanations of the integrated model remain closer to the protocol than those of the data-driven
model. In this analysis, a similar pattern is observed, with explanation similarity being greater
for the integrated model across all metrics and numbers of rules, and with differences statistically
significant at the 0.01 level for XNOR and at the 0.05 level for all other metrics.</p>
        <p>Finally, the examination of explanation similarity across 100 instances of models trained via
the 10-times 10-fold cross-validation, depicted in Figure 3c, reveals that similarity among
KB-ML<sub>X</sub> rule sets is comparable to that of DD-ML<sub>X</sub> for rule sets comprising up to 5 rules. However,
it surpasses that of DD-ML<sub>X</sub> for rule sets with more rules, which also have greater fidelity
with the black-box model.
[Figure 3: (b) explanation similarity metrics for explanation adherence; (c) explanation similarity
metrics for model explanation robustness; x-axis: number of extracted rules, from 2 to 12.]
These findings demonstrate that the integrated model generates
explanations that not only are more aligned with domain knowledge but are also more robust
compared to the fully data-driven model for larger and more accurate rule sets, and that the
proposed explanation similarity strategy is instrumental in evaluating this crucial aspect.</p>
        <p>This approach presents notable advantages compared to strategies that rely solely on rules
as global explanations for the model. Leveraging local explanations offers a more nuanced and
fine-grained evaluation of model explanations, reflecting the structure of the data and providing
more context-aware insights into the model’s inner workings, which is particularly relevant in
clinical settings. The proposed approach offers several additional benefits. It can be applied
to both numerical and categorical features. Instead of discretising data first and then building
rule sets, it uses rule thresholds for data discretisation, eliminating the need for prior knowledge
of relevant intervals. Furthermore, it provides a representation that automatically performs
feature selection, excluding variables not present in the rules from the vector representation. It
also accommodates variables included in other rule sets but not present in the knowledge base.
In this scenario, rule sets with conditions on variables not accounted for by the knowledge base
will have certain non-overlapping vector regions with the base and will likely record a lower
score. Conversely, rule sets using the same features as the base will have greater opportunities
for vector overlap and will typically yield higher scores. Lastly, it has a low computational cost,
with similarity computation growing linearly with the number of samples, unlike methods that
compute pairwise rule similarities, which grow quadratically with the number of rules.</p>
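To make the vector representation concrete, the sketch below encodes a sample as a binary vector of satisfied rule conditions, using rule thresholds for discretisation as described above. The flat (feature index, operator, threshold) encoding of rule antecedents is a hypothetical simplification, not the paper's actual data structure.

```python
import numpy as np

def explanation_vector(sample, conditions):
    """Encode a sample as a binary vector: one entry per rule condition,
    1 if the condition is satisfied and 0 otherwise.
    conditions: list of (feature_index, op, threshold), op in {"gt", "le"}."""
    out = []
    for idx, op, thr in conditions:
        above = sample[idx] > thr
        # "gt" checks that the value exceeds the threshold; "le" its complement
        out.append(int(above) if op == "gt" else int(not above))
    return np.array(out)
```

Vectors produced this way for two rule sets over the same sample can then be compared directly with the similarity metrics of Section 3.3.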
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and future work</title>
      <p>This study introduces novel metrics to evaluate the adherence of models to established protocols
in terms of accuracy and explanation of predictions. Through comparative analysis on a
benchmark dataset, we illustrate that models incorporating protocol knowledge exhibit superior
alignment with established practices, making them more suitable for integration into clinical
decision-making processes.</p>
      <p>In future research, we aim to extend this investigation to other datasets, retrieving the
corresponding domain knowledge either by translating established protocols into rules or by
consulting clinicians to encode that knowledge. Having demonstrated adherence to the clinical
protocol across different datasets and clinical applications, we also plan to consult the respective
experts to verify that the trained ML model is trustworthy also outside the domain of application
of a protocol, by evaluating whether the learning criteria align with clinicians’ judgement in
borderline cases. Additionally, we plan to validate the proposed approach using other automatic
rule extraction algorithms, including those based on fuzzy logic, such as neuro-fuzzy models.
Finally, we intend to enhance the explanation similarity metrics by scaling intervals based on
their length or the number of samples within them, rather than assigning binary values.</p>
      <p>Availability of data and code. The dataset analysed is publicly available
(https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database), and the code to replicate the
experiments can be found in the GitHub repository (https://github.com/ChristelSirocchi/XAI-similarity).</p>
      <p>[17] G. Ciatto, F. Sabbatini, A. Agiollo, M. Magnini, A. Omicini, Symbolic knowledge
extraction and injection with sub-symbolic predictors: A systematic literature review, ACM
Computing Surveys 56 (2024) 161:1–161:35.
[18] Z.-H. Zhou, Y. Jiang, S.-F. Chen, Extracting symbolic rules from trained neural network
ensembles, AI Communications 16 (2003) 3–15.
[19] G. Vilone, L. Longo, A quantitative evaluation of global, rule-based explanations of
post-hoc, model agnostic methods, Frontiers in artificial intelligence 4 (2021) 717899.
[20] M. W. Craven, J. W. Shavlik, Extracting tree-structured representations of trained
networks, in: Advances in Neural Information Processing Systems 8. Proceedings of the 1995
Conference, The MIT Press, 1996, pp. 24–30.
[21] L. Breiman, Classification and regression trees, Routledge, 2017.
[22] F. Sabbatini, G. Ciatto, R. Calegari, A. Omicini, On the design of PSyKE: A platform for
symbolic knowledge extraction, in: Proceedings of the 22nd Workshop “From Objects to
Agents”, Bologna, Italy, September 1–3, 2021, volume 2963 of CEUR Workshop Proceedings,
CEUR-WS.org, 2021, pp. 29–48.
[23] M. W. Craven, J. W. Shavlik, Using sampling and queries to extract rules from trained
neural networks, in: Machine Learning Proceedings 1994, Elsevier, 1994, pp. 37–45.
[24] J. Huysmans, B. Baesens, J. Vanthienen, ITER: An algorithm for predictive regression
rule extraction, in: Data Warehousing and Knowledge Discovery (DaWaK 2006), Springer,
2006, pp. 270–279.
[25] F. Sabbatini, G. Ciatto, A. Omicini, GridEx: An algorithm for knowledge extraction from
black-box regressors, in: Explainable and Transparent AI and Multi-Agent Systems. Third
International Workshop, EXTRAAMAS 2021, Virtual Event, May 3–7, 2021, volume 12688
of LNCS, Springer Nature, Basel, Switzerland, 2021, pp. 18–38.
[26] A. H. Murphy, The Finley affair: A signal event in the history of forecast verification,</p>
      <p>Weather and forecasting 11 (1996) 3–20.
[27] C. D. Manning, P. Raghavan, H. Schütze, Introduction to information retrieval, Cambridge</p>
      <p>University Press, 2008.
[28] L. R. Dice, Measures of the amount of ecologic association between species, Ecology 26
(1945) 297–302.
[29] H. B. Kibria, M. Nahiduzzaman, M. O. F. Goni, M. Ahsan, J. Haider, An ensemble approach
for the prediction of diabetes mellitus using a soft voting classifier with an explainable AI,
Sensors 22 (2022) 7268.
[30] G. Kunapuli, K. P. Bennett, A. Shabbeer, R. Maclin, J. Shavlik, Online knowledge-based
support vector machines, in: Machine Learning and Knowledge Discovery in Databases:
European Conference, 2010, Proceedings, Part II 21, Springer, 2010, pp. 145–161.
[31] R. R. Bouckaert, E. Frank, Evaluating the replicability of significance tests for comparing
learning algorithms, in: Pacific-Asia conference on knowledge discovery and data mining,
Springer, 2004, pp. 3–12.
[32] C. Nadeau, Y. Bengio, Inference for the generalization error, Advances in neural information
processing systems 12 (1999).
[33] F. Sabbatini, G. Ciatto, R. Calegari, A. Omicini, Symbolic knowledge extraction from
opaque ML predictors in PSyKE: Platform design &amp; experiments, Intelligenza Artificiale
16 (2022) 27–48.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Piccialli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Di Somma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giampaolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cuomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fortino</surname>
          </string-name>
          ,
          <article-title>A survey on deep learning in medicine: Why, how</article-title>
          and when?,
          <source>Information Fusion</source>
          <volume>66</volume>
          (
          <year>2021</year>
          )
          <fpage>111</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Benjamens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhunnoo</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. Meskó,</surname>
          </string-name>
          <article-title>The state of artificial intelligence-based fda-approved medical devices and algorithms: an online database</article-title>
          ,
          <source>NPJ digital medicine 3</source>
          (
          <year>2020</year>
          )
          <fpage>118</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Clinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McCormick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Besteman</surname>
          </string-name>
          ,
          <article-title>Enhancing clinical practice: The role of practice guidelines</article-title>
          .,
          <source>American Psychologist</source>
          <volume>49</volume>
          (
          <year>1994</year>
          )
          <fpage>30</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Haggerty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Freeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. H.</given-names>
            <surname>Starfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Adair</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. McKendry</surname>
          </string-name>
          ,
          <article-title>Continuity of care: a multidisciplinary review</article-title>
          ,
          <source>Bmj</source>
          <volume>327</volume>
          (
          <year>2003</year>
          )
          <fpage>1219</fpage>
          -
          <lpage>1221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fleuren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Elbers</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>van der Schaar, Integrating expert odes into neural odes: pharmacology and disease progression</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>11364</fpage>
          -
          <lpage>11383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Montagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sirocchi</surname>
          </string-name>
          ,
          <article-title>Hybrid personal medical digital assistant agents</article-title>
          ,
          <source>in: Proceedings of the 25th Workshop “</source>
          From Objects to Agents”,
          <source>Forte di Bard (AO)</source>
          ,
          <source>Italy, July</source>
          <volume>8</volume>
          -
          <issue>10</issue>
          ,
          <year>2024</year>
          , volume
          <volume>3735</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>58</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Everhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dickson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Knowler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Johannes</surname>
          </string-name>
          ,
          <article-title>Using the adap learning algorithm to forecast the onset of diabetes mellitus, in: Proceedings of the annual symposium on computer application in medical care</article-title>
          , American Medical Informatics Association,
          <year>1988</year>
          , p.
          <fpage>261</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Obermeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Lost in thought: the limits of the human mind and the future of medicine</article-title>
          ,
          <source>The New England journal of medicine 377</source>
          (
          <year>2017</year>
          )
          <fpage>1209</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Leiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmidt-Kraepelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thiebes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sunyaev</surname>
          </string-name>
          ,
          <article-title>Medical informed machine learning: A scoping review and future research directions</article-title>
          ,
          <source>Artificial Intelligence in Medicine</source>
          <volume>145</volume>
          (
          <year>2023</year>
          )
          <fpage>102676</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Von Rueden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mayer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Beckh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Georgiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Giesselbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kirsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pfrommer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramamurthy</surname>
          </string-name>
          , et al.,
          <article-title>Informed machine learning-a taxonomy and survey of integrating prior knowledge into learning systems</article-title>
          ,
          <source>IEEE Trans. on Knowledge and Data Engineering</source>
          <volume>35</volume>
          (
          <year>2021</year>
          )
          <fpage>614</fpage>
          -
          <lpage>633</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sirocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bogliolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montagna</surname>
          </string-name>
          ,
          <article-title>Medical-informed machine learning: integrating prior knowledge into medical decision systems, BMC Medical Informatics and Decision Making 24 (Suppl 4) (</article-title>
          <year>2024</year>
          )
          <fpage>186</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kierner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kucharski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kierner</surname>
          </string-name>
          ,
          <article-title>Taxonomy of hybrid architectures involving rule-based reasoning and machine learning in clinical decision systems: A scoping review</article-title>
          ,
          <source>Journal of Biomedical Informatics</source>
          (
          <year>2023</year>
          )
          <fpage>104428</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chicco</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. Jurman,</surname>
          </string-name>
            <article-title>The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation</article-title>
          ,
          <source>BMC genomics 21</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sokol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Flach</surname>
          </string-name>
          ,
          <article-title>Explainability fact sheets: A framework for systematic assessment of explainable approaches</article-title>
          ,
          <source>in: Proceedings of the 2020 conference on fairness, accountability, and transparency</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Explainable artificial intelligence for predictive modeling in healthcare</article-title>
          ,
          <source>Journal of healthcare informatics research 6</source>
          (
          <year>2022</year>
          )
          <fpage>228</fpage>
          -
          <lpage>239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Calegari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ciatto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Omicini</surname>
          </string-name>
          ,
          <article-title>On the integration of symbolic and sub-symbolic techniques for xai: A survey</article-title>
          ,
          <source>Intelligenza Artificiale</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>7</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>