<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Machine Learning Models against Clinical Protocols for Enhanced Interpretability and Continuity of Care</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christel Sirocchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Sufian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Sabbatini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Bogliolo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Montagna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Pure and Applied Sciences, University of Urbino</institution>
          ,
          <addr-line>Piazza della Repubblica 13, 61029, Urbino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In clinical practice, decision-making relies heavily on established protocols, often formalised as rules. Concurrently, machine learning (ML) models, trained on clinical data, aspire to integrate into medical decision-making processes. However, despite the growing number of ML applications, their adoption into clinical practice remains limited. Two critical concerns arise, relevant to the notions of consistency and continuity of care: (a) accuracy - the ML model, albeit more accurate, might introduce errors that would not have occurred by applying the protocol; (b) interpretability - ML models operating as black boxes might make predictions based on relationships that contradict established clinical knowledge. In this context, the literature suggests using integrated ML models to reduce errors introduced by purely data-driven approaches and improve interpretability. However, there is a lack of appropriate metrics for comparing ML models with clinical rules in addressing these challenges. Accordingly, in this article, we first propose a metric to assess the accuracy of ML models with respect to the established protocol. Secondly, we propose an approach to measure the distance of explanations provided by two rule sets, with the goal of comparing the explanation similarity between clinical rule-based systems and rules extracted from ML models. The approach is validated by employing the Pima Indians Diabetes dataset, for which a well-grounded clinical protocol is available, by training two neural networks: one exclusively on data, and the other integrating knowledge. Our findings demonstrate that the integrated ML model achieves comparable performance to that of a fully data-driven model while exhibiting superior relative accuracy with respect to the clinical protocol, ensuring enhanced continuity of care. 
Furthermore, we show that our integrated model provides explanations for predictions that align more closely with the clinical protocol compared to the data-driven model.</p>
      </abstract>
      <kwd-group>
        <kwd>Informed AI</kwd>
        <kwd>interpretable AI</kwd>
        <kwd>clinical protocols</kwd>
        <kwd>diabetes</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Machine learning (ML) has revolutionised various industries, from manufacturing to finance,
and is now making its way into healthcare, a sector traditionally resistant to technological
disruptions. ML has achieved remarkable performance in various domains of clinical medicine,
outperforming human physicians in some cases and enabling the development of computer-aided
diagnosis systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. With thousands of studies applying ML algorithms to medical
data, only a handful have significantly contributed to clinical care, a stark contrast to the
substantial impact ML has had in other industries. Indeed, only a few of these systems have
been FDA-approved for healthcare use [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. (EXPLIMED - First Workshop on Explainable Artificial Intelligence for the medical
domain - 19-20 October 2024, Santiago de Compostela, Spain.)
      </p>
      <p>
        Resistance to embracing ML in clinical settings can be attributed to the prevailing reliance
on evidence-based clinical pathways, guidelines, and protocols as the foundation for clinical
decision-making [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Adherence to established guidelines and practices is at the core of the
consistency and continuity of care, defined as the degree to which a series of discrete healthcare
events is experienced by people as coherent and consistent over time and across different
healthcare providers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Introducing novel decision-support systems offering alternative
predictions and explanations may introduce variability among practices and practitioners,
potentially compromising the quality and efficiency of care.
      </p>
      <p>
        Novel ML models reporting superior performance compared to the current protocol might be
found unsuitable for clinical use if they (a) fail to correctly predict cases effectively managed
by the protocol in place due to potential liabilities, or (b) make predictions based on
confounding variables and erroneous relationships that contradict established clinical knowledge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Therefore, in order to foster continuity of care when developing novel decision-support systems
for healthcare, it is imperative not only to attain high overall accuracy but also to provide
predictions and explanations adhering to current clinical guidelines. To this end, approaches
integrating domain knowledge from clinical protocols into ML models have been proposed
and proved effective [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, metrics for evaluating the similarity of a novel model with
respect to established protocols in terms of predictions and explanations are still lacking.
      </p>
      <p>The main contribution of the manuscript is thus to introduce metrics capturing the adherence
of a model to established protocols in terms of accuracy and explanation of its predictions.
Specifically, we introduce the notions of relative accuracy to quantify the proportion of samples
correctly predicted by the model compared to those handled correctly by the existing protocol,
and of explanation similarity to quantify the degree of overlap between local explanations
provided by the protocol and the ML model for the dataset instances.</p>
      <p>
        Through a comparison between a neural network model, trained solely on data, and a model
incorporating domain knowledge encoded in a clinical protocol, we illustrate the potential of
these metrics using the PIMA dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. While conventional performance metrics cannot
definitively identify a superior model between the two, our proposed metrics reveal that the
integrated model introduces fewer errors into the decision-making process and provides
explanations that more closely mirror established practices. Consequently, these newly introduced
metrics serve as valuable tools for identifying the ML model that better aligns with the
protocol in place and is thus more suitable for integration into clinical practice in the prospect of
continuity and consistency of care.
      </p>
      <p>Incorporating domain knowledge from clinical protocols into ML and developing metrics to
evaluate the accuracy and interpretability of such models with respect to the protocol in place
represent a pivotal step towards overcoming the limitations of ML and facilitating its seamless
integration into medical practice.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and previous work</title>
      <p>
        As medical decision-making becomes increasingly complex due to the development of new
therapies and diagnostics, as well as the accumulation of health records, ML has emerged as a
promising tool to support medical decision-making processes for its ability to model complex
interactions between features [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, the application of ML in healthcare presents several
challenges, primarily related to the quantity, quality, and composition of clinical data, as well as
a lack of explainability and limited robustness [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. To address these limitations, the literature
reports on various integrative approaches that leverage multiple models, data sources, and
prior knowledge. A notable advancement is the paradigm of Informed Machine Learning, which
integrates data and prior knowledge derived from independent sources to strike a balance
between model complexity and effectiveness [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. This approach has gained attention in the
medical domain, where structured knowledge is abundant but data is often limited and noisy.
Recent contributions in this area have provided taxonomies of integration strategies applied to
the healthcare sector, with a focus on the integration of ML with rule-based expert systems,
highlighting that integration can be beneficial across all phases of the ML pipeline, from data
preprocessing and feature engineering, to model learning and output evaluation [
        <xref ref-type="bibr" rid="ref11 ref12 ref9">9, 12, 11</xref>
        ].
Particular emphasis is placed on strategies incorporating prior knowledge into the model’s
loss function, often through regularisation or penalty terms quantifying inconsistencies or
violations concerning the knowledge base. This approach has shown promising results in
clinical applications, enhancing model performance, robustness, and interpretability [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>Despite these advancements, there remains a critical need for metrics to evaluate the resulting
hybrid models against the knowledge base to measure adherence, and against the data-driven
counterpart to quantify knowledge injection. For a comprehensive comparative analysis, such
metrics must evaluate both accuracy and interpretability. This study aims to address this gap
by proposing metrics assessing model adherence to a knowledge base in terms of performance
and explainability, with immediate applications in evaluating hybrid models in clinical settings.</p>
      <sec id="sec-2-1">
        <title>2.1. Evaluating model performance</title>
        <p>
          A plethora of scores have been proposed to gauge the correctness of the predictions with respect
to the ground truth. In classification tasks, accuracy emerges as an intuitive metric, quantifying
the proportion of correctly classified instances. Sensitivity (recall) and specificity measure the
proportion of correctly identified true positives and true negatives, respectively, while precision
evaluates the ratio of true positives over positive predictions. F1-score, the harmonic mean of
precision and recall, balances both metrics. The area under the receiver operating characteristic
curve provides a comprehensive view of model performance across different thresholds but is
less interpretable to some stakeholders. Overall, accuracy and F1-score are the most popular
metrics but may yield overly optimistic results with imbalanced data. Recently, the Matthews
correlation coefficient has gained prominence in biomedical data analysis for yielding more
trustworthy results in imbalanced datasets [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          Determining the most suitable statistical metric remains challenging, with no consensus
reached [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Comparative analyses often leverage a diverse array of metrics for a comprehensive
evaluation, with the choice of the most appropriate metric contingent upon the specific case at
hand. In clinical contexts, for instance, recall often takes precedence, as the cost (risk) associated
with false negatives outweighs that of false positives (as a positive result typically leads to
additional, more precise tests, unlike a negative one). However, in such contexts where
longstanding guidelines are in place, it is also crucial to evaluate ML models with respect to the
protocol as well as the ground truth, as mistakes introduced by the model also carry high costs.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evaluating model explanations</title>
        <p>
          As ML architectures become increasingly complex, there arises a pressing need to bridge the
gap between the opaque nature of these models and human comprehension, especially in
domains like healthcare where transparency and interpretability are essential [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Addressing
this challenge, the field of eXplainable Artificial Intelligence (XAI) has emerged, developing
tools aimed at providing human-understandable explanations for AI-driven decisions, thereby
fostering transparency, trust, and collaboration between human expertise and computational
intelligence [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. XAI employs various techniques to provide reasoning on ML decisions, mainly
operating on two levels: local and global. In the former, individual model predictions are
analysed, while in the latter the overall behaviour of the model is analysed to identify patterns
and relationships in the data. Among XAI techniques, feature importance methods have emerged
as influential for identifying important variables. Additionally, example-based explanations
offer insights by presenting similar instances in the dataset that influenced the predictions.
        </p>
        <p>
          Rule extraction techniques translate the ML models into human-understandable rules or
decision trees which provide insights into the overall behaviour of the model across the entire
dataset [
          <xref ref-type="bibr" rid="ref16">16, 17</xref>
          ]. Moreover, a rule applicable to a given data instance indicates the conditions
that were satisfied to produce the corresponding outcome, offering explanations at the local
level. Several rule extraction algorithms exist in the literature. The Rule Extraction From Neural
Network Ensemble (REFNE) [18] was initially developed for extracting symbolic rules from
neural network ensembles. However, its accuracy decreases when the data is highly complex
or nonlinear. C4.5Rule-PANE [19] utilises the C4.5 rule induction algorithm to extract if-then
rules from neural networks and, like other tree-based algorithms, is susceptible to over-fitting.
TREPAN [20] constructs a decision tree by querying the underlying network to determine
output classes. However, it often extracts suboptimal rule sets and requires binary inputs.
        </p>
      <p>Decision trees, particularly Classification and Regression Trees (CART) [21], remain one of the
most prominent approaches in rule extraction. CART constructs a binary tree structure which is
then translated into human-readable rules by converting each possible path from the root to the
leaves into an if-then rule. Its strengths are simplicity, interpretability, and ability to handle both
categorical and numerical data effectively. Several other rule-extraction algorithms exist, as well
as software libraries dedicated to knowledge extraction, e.g., the PSyKE platform [22], providing
a unified software framework supporting various rule-extraction methods [20, 21, 23, 24, 25].
Several evaluation metrics are documented in the literature to assess the quality of extracted
rule sets. Among these metrics, the number of rules and average rule length reflect attributes of
the explainability of the rule extractor. The other metrics – completeness, correctness, fidelity,
robustness, and coverage – serve as general validation factors applicable to any rule extractor
method. These metrics primarily analyse properties of the rules as global explanations for the
model, offering a coarse-grained evaluation. Less attention is given to metrics assessing rules
as local explanations for dataset instances, which would offer a more nuanced and context-aware
evaluation, particularly relevant in the clinical setting. Furthermore, while most metrics
evaluate the properties of a single rule set, there is a noticeable scarcity of similarity measures
comparing multiple rule sets. Existing literature reports similarity metrics over dataset instances
(e.g., Jaccard [26], Cosine [27], Dice [28]) or similarities between rule-based knowledge bases
(e.g., XNOR). However, there is a lack of similarity metrics over rule-based local explanations
aggregated across data instances to provide global measures of similarity between rule sets.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>This section is structured as follows: Section 3.1 details the dataset and domain knowledge used
in our case study, Section 3.2 describes the machine learning model that integrates this domain
knowledge, while Section 3.3 introduces the metrics for accuracy and explainability used to
evaluate the integrated model against clinical knowledge and a data-driven model.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset and domain knowledge</title>
        <p>
          In this work, we present our investigations involving the Pima Indians Diabetes dataset,
originally compiled by the National Institute of Diabetes and Digestive and Kidney Diseases from a
study of the Pima Indian population, known for its notably high incidence of diabetes [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The
dataset comprises 768 medical profiles of women aged 21 and above, who underwent an Oral
Glucose Tolerance Test (OGTT) to measure their glucose and insulin levels at two hours (here
denoted G120 and I120). The target variable is binary, indicating a diabetes diagnosis within
five years. Table 1 reports the 8 input features available in the dataset. Missing values are
present in the attributes I120 (48.70%), skin thickness (29.56%), blood pressure (4.55%), BMI
(1.43%), and G120 (0.65%), and were imputed in this work with the median value of the
respective variable, as reported in the literature [29]. Further details about this dataset can
be found in Table 1.
Public health guidelines on type-2 diabetes risks report that individuals with a high BMI (≥
30) and high blood glucose level (≥ 126) are at severe risk for diabetes, while those with normal
BMI (≤ 25) and low blood glucose level (≤ 100) are less likely to develop diabetes. These
guidelines have been utilised to design rules [30] expressed as logic predicates (see Table 2).
        </p>
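        <p>The two guideline rules above can be sketched as a small predicate, e.g. in Python (an illustrative encoding of Table 2; the function name and the use of None for cases the protocol does not cover are our assumptions, not the paper's notation):</p>

```python
def protocol_prediction(bmi, glucose):
    """Apply the two clinical rules derived from public health guidelines.

    Returns 1 (severe diabetes risk), 0 (unlikely to develop diabetes),
    or None when neither rule fires (the protocol abstains).
    """
    if bmi >= 30 and glucose >= 126:
        return 1  # high BMI and high 2-hour blood glucose
    if bmi <= 25 and glucose <= 100:
        return 0  # normal BMI and low 2-hour blood glucose
    return None  # intermediate profile: no rule applies
```

Note that the protocol is partial: profiles falling between the two rules receive no prediction, which is precisely why the relative metrics defined later restrict attention to samples the protocol handles.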
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Integrated ML model</title>
        <p>The hybrid ML model examined in this study, herein denoted as KB-ML, integrates domain
knowledge in the loss function. Specifically, KB-ML is a neural network for binary classification
trained using a custom loss function that assigns greater weight to samples accurately predicted
by the clinical guidelines represented by the two logic predicates in Table 2. Formally, let D
denote a dataset comprising n instances, each represented by x_i, where i ranges from 1 to n.
Three n × 1 vectors y, p, and k can be defined. Vector y contains the ground-truth binary
labels, with each element denoted as y_i and representing the expected outcome for instance x_i.
Vector p contains the probability of the outcome belonging to the positive class predicted by
the neural network, with each element p_i corresponding to x_i. Finally, vector k contains the
predictions according to the rules in Table 2, i.e., each element k_i takes value 1 if x_i satisfies
the conditions of the first rule, 0 if it satisfies the second rule, and N/A otherwise. Then, the
Custom Total Loss (CTL) for the integrated model is computed as:</p>
        <p>CTL(y, p, k, λ) = (1/n) ∑_{i=1}^{n} CSL(y_i, p_i, k_i, λ), (1)</p>
        <p>where λ is the scaling factor controlling the influence of the additional loss term, CSL is the
custom binary cross-entropy loss for a single sample, defined as</p>
        <p>CSL(y_i, p_i, k_i, λ) = L(y_i, p_i) if y_i ≠ k_i, and (λ + 1) · L(y_i, p_i) if y_i = k_i, (2)</p>
        <p>and L is the standard binary cross-entropy loss for a single sample:
L(y_i, p_i) = − [y_i · log(p_i) + (1 − y_i) · log(1 − p_i)]. (3)</p>
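        <p>The custom loss can be sketched in pure Python as follows (a minimal illustration; encoding the N/A protocol prediction as None is our assumption):</p>

```python
import math

def bce(y, p, eps=1e-7):
    """Standard binary cross-entropy for a single sample."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def custom_total_loss(y, p, k, lam):
    """Mean per-sample BCE, up-weighted by (lam + 1) on samples where
    the protocol prediction k agrees with the ground truth y.
    Entries of k are 1, 0, or None (protocol abstains)."""
    total = 0.0
    for yi, pi, ki in zip(y, p, k):
        weight = (lam + 1.0) if ki is not None and ki == yi else 1.0
        total += weight * bce(yi, pi)
    return total / len(y)
```

In practice the paper trains neural networks with this loss, so an equivalent expression would be written against the tensors of the chosen framework; the per-sample weighting logic is the same.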
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Proposed evaluation metrics</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Relative accuracy</title>
          <p>Performance metrics can be redefined to evaluate adherence to accurate predictions set by
the rules, quantifying errors introduced by the model in comparison to the reference protocol.
As in Section 3.2, consider D as a dataset consisting of n samples represented by x_i, where
i ranges from 1 to n, and let k_i denote the prediction made by the clinical protocol for each x_i.
Additionally, let ŷ_i represent the binary prediction provided by a ML model for x_i. Relative
Accuracy (RA) can be defined as the fraction of samples correctly predicted by the protocol that
are also correctly predicted by the model:</p>
          <p>RA = |{i : x_i ∈ D ∧ y_i = k_i = ŷ_i}| / |{i : x_i ∈ D ∧ y_i = k_i}|, (4)</p>
          <p>where |·| denotes the cardinality of a set. Similarly, the relative counterparts of other
performance metrics, such as Relative Sensitivity or Recall (RR) and Relative Specificity (RS) with
respect to a given class c, can be defined as follows:</p>
          <p>RR = |{i : x_i ∈ D ∧ y_i = k_i = ŷ_i = c}| / |{i : x_i ∈ D ∧ y_i = k_i = c}|, (5)</p>
          <p>RS = |{i : x_i ∈ D ∧ y_i ≠ c ∧ k_i ≠ c ∧ ŷ_i ≠ c}| / |{i : x_i ∈ D ∧ y_i ≠ c ∧ k_i ≠ c}|. (6)</p>
          <p>This evaluation does not account for samples where the protocol makes errors or fails to provide
a prediction, requiring additional performance metrics for a comprehensive assessment.</p>
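          <p>The relative metrics amount to restricting the usual counts to the samples the protocol handles correctly, as in this Python sketch (function names are ours; protocol abstentions may be encoded as None and simply never match):</p>

```python
def relative_accuracy(y, k, y_hat):
    """Fraction of samples correctly predicted by the protocol (k == y)
    that the model also predicts correctly (y_hat == y)."""
    protocol_correct = [i for i in range(len(y)) if k[i] == y[i]]
    if not protocol_correct:
        return float("nan")
    both = sum(1 for i in protocol_correct if y_hat[i] == y[i])
    return both / len(protocol_correct)

def relative_recall(y, k, y_hat, c=1):
    """Relative sensitivity: restricted to samples of class c that the
    protocol predicts correctly."""
    idx = [i for i in range(len(y)) if y[i] == k[i] == c]
    return sum(1 for i in idx if y_hat[i] == c) / len(idx) if idx else float("nan")

def relative_specificity(y, k, y_hat, c=1):
    """Restricted to samples outside class c that the protocol also
    predicts outside class c."""
    idx = [i for i in range(len(y)) if y[i] != c and k[i] != c]
    return sum(1 for i in idx if y_hat[i] != c) / len(idx) if idx else float("nan")
```

For example, a model that reproduces every correct protocol decision scores RA = 1.0 even if its overall accuracy is lower than the protocol's.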
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Explanation similarity</title>
          <p>Applying XAI in clinical settings requires proper evaluation to ensure the explanations are both
technically sound and clinically useful. Rule sets extracted from ML models provide valuable
insights into model behaviour. Notably, rules extracted from diferent ML models can emphasise
diferent variables, even when predicting similar outcomes. Therefore, it is crucial to assess
the similarity of explanations provided by rules approximating predictors to those ofered by a
specified reference protocol. This evaluation helps determine which explanation aligns more
closely with the clinical protocol in use and better reflects clinical expertise.</p>
          <p>A novel explanation similarity strategy is here proposed to estimate the similarity of
explanations from rule-based predictors, whether extracted from black-box models or built on clinical
knowledge. This method allows for comparing explanations from integrated and data-driven
models with those provided by a clinical protocol, to verify which aligns better. A diagram
summarising the approach is shown in Figure 1. The method entails the following steps.
1. Rule extraction: symbolic knowledge is extracted from black-box predictors trained on a
given dataset and represented as rule sets that are both human- and machine-interpretable
and can provide explanations for predictions in the form of first-order logic clauses.
2. Feature discretisation: the features of the dataset are discretised according to the thresholds
found in the rules of the considered rule sets. This involves collecting all thresholds
associated with each feature and discretising the feature into intervals accordingly.
3. Rule vectorisation: each rule is assigned a vector representing the feature space, where
each element corresponds to an interval of a feature and is assigned a value of 1 if the
corresponding feature and interval satisfy the rule, and 0 otherwise.
4. Local explanation: for every rule set, and for each sample in the dataset, the rule satisfied
by the sample is identified and the corresponding vector is assigned to the sample.
5. Similarity calculation: the similarity between two rule sets is obtained by computing, for
each sample, the similarity between the vectors obtained from the two rule sets, and
averaging across all samples, while the similarity among more than two rule sets is obtained
by calculating the similarity between each pair of rule sets and averaging all scores.
Formally, let D represent a dataset comprising n samples denoted by x_i, where i ranges from 1
to n. Each sample is described by m input features, labelled f_1, f_2, . . . , f_m. Here, x_ij
represents the value of feature f_j in the instance x_i. For each input x_i, y_i denotes the
corresponding outcome. X and Y denote the domains of the inputs and outputs, respectively:
(x_i ∈ X) ∧ (y_i ∈ Y), ∀ i = 1, 2, . . . , n.</p>
          <p>Rule extraction. Let us consider a predictive function ℱ:
ℱ : X → Y, ℱ(x_i) = ŷ_i,
where ŷ_i is the value predicted by ℱ for the instance x_i. Then, a rule set ℛ mapping instances
to outputs and approximating the input-output relationship of ℱ can be obtained by analysing
ℱ. Let P be a set of q rule sets, either obtained from predictive functions by rule extraction or
available from domain knowledge, which we aim to compare:
P = {ℛ_1, ℛ_2, . . . , ℛ_q},
where
ℛ_j : X_j ⊆ X → Y, ∀ j = 1, 2, . . . , q.</p>
          <p>Each rule set consists of rules denoted by r. For instance, if rule set ℛ_j comprises t rules, then
ℛ_j = {r_1j, r_2j, . . . , r_tj}. Each rule r_ij in rule set ℛ_j is represented as a tuple (C_ij, ŷ_ij),
where C_ij constitutes a set of v conditions {c_1, c_2, . . . , c_v} and ŷ_ij represents the outcome
associated with that rule. Each condition c_h can be expressed by a tuple (f_h, l_h, u_h), where
f_h is the variable included in the condition, and l_h and u_h are the lower and upper bounds for
the condition. If a condition is defined over a discontinuous interval, it is separated into distinct
conditions. If a condition is of the type less than or greater than, the missing lower or upper
bound is replaced with the minimum or maximum value of that feature in the dataset.</p>
          <p>For instance, in the considered case study, where BMI is in the range [18, 67] and G120
in [44, 199], the rule set ℛ_1 presented in Table 2 is defined as ℛ_1 = {r_11, r_21}, where
r_11 = (C_11, diabetes) with C_11 = {(BMI, 30, 67), (G120, 126, 199)}, and r_21 = (C_21, healthy)
with C_21 = {(BMI, 18, 25), (G120, 44, 100)}.</p>
          <p>Feature discretisation. For the set of predictors P, we define the set of thresholds T as:
T = {T(f_1), T(f_2), . . . , T(f_m)}.
If feature f_j never occurs in any condition of P, then |T(f_j)| = 0. Each set T(f_j) can be
represented as an ordered set of thresholds retrieved from rule conditions as detailed above:
T(f_j) = (t_1, t_2, . . . , t_w), t_1 &lt; t_2 &lt; . . . &lt; t_w.</p>
          <p>Rule vectorisation. For each rule r, I_r(f_j) is a binary vector representing intervals for the
variable f_j. If f_j is not present in any rule, i.e., |T(f_j)| = 0, then this vector has zero length.
Otherwise, the vector has length |T(f_j)| − 1, and the z-th element of the vector corresponds to
the interval [t_z, t_{z+1}]. The z-th element of the vector is set to 1 if the values in the
corresponding interval meet all conditions on that variable for the considered rule, or if no
conditions on that variable are specified in the rule. Otherwise, the element is set to 0:
I_r(f_j)[z] = 1 if [t_z, t_{z+1}] ⊆ [l_h, u_h] ∀ (f_h, l_h, u_h) ∈ C_r : f_h = f_j;
I_r(f_j)[z] = 1 if f_h ≠ f_j ∀ (f_h, l_h, u_h) ∈ C_r;
I_r(f_j)[z] = 0 otherwise.
Then the vector V_r is obtained by concatenating all vectors I_r(f_j) into a single one:
V_r = I_r(f_1) I_r(f_2) . . . I_r(f_m).</p>
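          <p>The discretisation and vectorisation steps can be sketched in Python as follows (the rule encoding as dictionaries of per-feature bounds is our own illustrative choice, not the paper's data structure):</p>

```python
def feature_thresholds(rule_sets, features):
    """Collect, per feature, the sorted set of bounds appearing in any rule
    condition. Each rule set is a list of (conditions, outcome) pairs, with
    conditions encoded as {feature: (low, high)}."""
    thresholds = {f: set() for f in features}
    for rules in rule_sets:
        for conditions, _outcome in rules:
            for f, (low, high) in conditions.items():
                thresholds[f].update((low, high))
    return {f: sorted(v) for f, v in thresholds.items()}

def vectorise_rule(conditions, thresholds, features):
    """Binary vector over all feature intervals: 1 if the interval is fully
    compatible with the rule's conditions, or the feature is unconstrained."""
    vec = []
    for f in features:
        ts = thresholds[f]
        for lo, hi in zip(ts, ts[1:]):  # consecutive threshold pairs
            if f not in conditions:
                vec.append(1)  # feature unconstrained by this rule
            else:
                c_lo, c_hi = conditions[f]
                vec.append(1 if c_lo <= lo and hi <= c_hi else 0)
    return vec
```

Applied to the two-rule example above, each rule yields a six-element vector (three BMI intervals followed by three G120 intervals).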
          <p>Local explanation. Let D′ be the subset of instances in D for which each of the considered
rule sets can provide a prediction, i.e.,
D′ = { x_i | x_i ∈ D ∧ x_i ∈ ⋂_{j=1}^{q} X_j }.</p>
          <p>Then, R_ij is the set of rules in ℛ_j whose conditions are all satisfied by the instance x_i:
R_ij = ⋃_{r ∈ ℛ_j} { r | x_ih ∈ [l_h, u_h], ∀ (f_h, l_h, u_h) ∈ C_r }.
Here we assume, for each rule set in P, that each instance of the dataset satisfies all conditions
of exactly one rule, i.e., |R_ij| = 1 ∀ x_i ∈ D′. The vector corresponding to the rule in R_ij is
assigned to x_i and denoted as V_j(x_i). This provides a vectorised representation of the
explanation offered by rule set ℛ_j for the data instance x_i.</p>
          <p>Without loss of generality, the rule vectorisation and local explanation procedure can also
be applied to categorical variables. Instead of intervals defined by thresholds, we have vectors
representing subsets of possible categorical values, and conditions are verified by set inclusion.</p>
          <p>Similarity evaluation. Let s(V_1, V_2) be a similarity function on two binary vectors V_1 and
V_2. The similarity S(ℛ_1, ℛ_2) for two rule sets ℛ_1 and ℛ_2 in P can then be computed as
S(ℛ_1, ℛ_2, D′, s) = (1/|D′|) ∑_{x_i ∈ D′} s(V_1(x_i), V_2(x_i)). (7)</p>
          <p>The similarity among more than two rule sets is computed by calculating the pairwise
similarity between each pair of rule sets and then averaging across all pairs. For a set P of q
rule sets, the similarity is computed as:
S(P, D′, s) = (2 / (q(q − 1)|D′|)) ∑_{j=1}^{q} ∑_{z=j+1}^{q} ∑_{x_i ∈ D′} s(V_j(x_i), V_z(x_i)). (8)</p>
          <p>To compute the similarity of two binary vectors V_1 and V_2 of length L, various similarity
metrics s are available in the literature.</p>
          <p>XNOR similarity considers matching and non-matching elements:
XNOR(V_1, V_2) = (∑_{l=1}^{L} δ(V_1[l], V_2[l])) / L, (9)
where δ(V_1[l], V_2[l]) equals 1 if V_1[l] = V_2[l] and 0 otherwise.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>JACCARD similarity</title>
          <p>considers the intersection over the union of elements in both vectors:
JACCARD(V_1, V_2) = ∑_{l=1}^{L} V_1[l] · V_2[l] / (∑_{l=1}^{L} V_1[l] + ∑_{l=1}^{L} V_2[l] − ∑_{l=1}^{L} V_1[l] · V_2[l]), (10)
where ∑_{l=1}^{L} V_1[l] · V_2[l] counts the elements that are 1 in both vectors (intersection), while
the denominator counts the elements that are 1 in either vector (union).</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>COSINE similarity</title>
          <p>computes the cosine of the angle between the vectors:
COSINE(V_1, V_2) = ∑_{l=1}^{L} V_1[l] · V_2[l] / (√(∑_{l=1}^{L} V_1[l]²) · √(∑_{l=1}^{L} V_2[l]²)), (11)
where √(∑_{l=1}^{L} V_1[l]²) · √(∑_{l=1}^{L} V_2[l]²) is the product of the magnitudes of the vectors.</p>
        </sec>
        <sec id="sec-3-3-5">
          <title>DICE similarity</title>
          <p>divides twice the number of elements that are 1 in both vectors by the total count of elements
that are 1 in each vector:
DICE(V_1, V_2) = 2 · ∑_{l=1}^{L} V_1[l] · V_2[l] / (∑_{l=1}^{L} V_1[l] + ∑_{l=1}^{L} V_2[l]). (12)</p>
        </sec>
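        <p>The four similarity metrics, together with the per-sample averaging of Eq. 7, can be sketched in Python (names are ours; the edge-case conventions for all-zero vectors are our assumptions):</p>

```python
import math

def xnor(v1, v2):
    """Fraction of positions where the two binary vectors agree."""
    return sum(a == b for a, b in zip(v1, v2)) / len(v1)

def jaccard(v1, v2):
    """Intersection over union of the 1-elements."""
    inter = sum(a * b for a, b in zip(v1, v2))
    union = sum(v1) + sum(v2) - inter
    return inter / union if union else 1.0  # convention for all-zero vectors

def cosine(v1, v2):
    """Cosine of the angle between the vectors."""
    inter = sum(a * b for a, b in zip(v1, v2))
    denom = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return inter / denom if denom else 0.0

def dice(v1, v2):
    """Twice the intersection over the total number of 1-elements."""
    inter = sum(a * b for a, b in zip(v1, v2))
    total = sum(v1) + sum(v2)
    return 2 * inter / total if total else 1.0

def explanation_similarity(vecs1, vecs2, metric=xnor):
    """Average per-sample similarity between the local-explanation vectors
    produced by two rule sets over the same instances."""
    return sum(metric(a, b) for a, b in zip(vecs1, vecs2)) / len(vecs1)
```

A key practical difference is that XNOR rewards agreement on zeros as well as ones, whereas Jaccard, Cosine, and Dice only credit shared active intervals; on sparse explanation vectors XNOR therefore tends to report higher similarity.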
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Evaluation strategy</title>
        <p>The study conducted a comparison between two neural networks trained on the Pima Indians
Diabetes dataset. One model, termed the data-driven model (DD-ML), was exclusively trained
on data, while the other, referred to as the integrated or knowledge-based model (KB-ML), was
trained with a custom loss function incorporating knowledge from a knowledge base (KB), as
detailed in Section 3.2. Both neural networks were designed as feed-forward models, comprising
three fully connected layers: two hidden layers with rectified linear unit activation functions
and an output layer with a sigmoid activation function. DD-ML was trained using binary
cross-entropy loss, whereas KB-ML employed the customised loss function defined in Eq. 1, whose
tuning parameter, controlling the contribution of the KB to model learning, was varied from 0.5 to 4
in steps of 0.5. All neural networks were trained with a batch size of 20 for 25 epochs.</p>
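As an illustration, the architecture and a knowledge-augmented loss can be sketched in PyTorch as below. Since Eq. 1 is defined earlier in the paper, its exact form is not reproduced here: the masked cross-entropy against the protocol label is one plausible instantiation of the knowledge term, and the hidden-layer width is an arbitrary choice.

```python
import torch
import torch.nn as nn

class DiabetesNet(nn.Module):
    """Feed-forward net as in Sec. 3.4: two hidden ReLU layers, sigmoid output."""
    def __init__(self, n_features, hidden=16):  # hidden width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def kb_loss(pred, y, kb_label, kb_mask, lam):
    """Binary cross-entropy on the data plus a knowledge term weighted by lam.

    kb_label: label assigned by the clinical protocol (0/1 floats);
    kb_mask:  1.0 where the protocol makes a prediction, 0.0 elsewhere.
    A plausible sketch of Eq. 1, not its exact form.
    """
    bce = nn.functional.binary_cross_entropy
    data_term = bce(pred, y)
    # penalise disagreement with the protocol only where the protocol applies
    per_sample = bce(pred, kb_label, reduction="none")
    kb_term = (per_sample * kb_mask).mean()
    return data_term + lam * kb_term
```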
        <p>In all experiments, data was divided into training and testing sets using a 10-times
10-fold stratified cross-validation approach [31]. The performance and explainability metrics
computed for the integrated model were evaluated against the corresponding metrics for the
data-driven model using paired Student's t-tests with the Nadeau and Bengio correction [32].
Performance evaluation encompassed a range of metrics, including Accuracy (A), F1-score
(F1), Recall (R), Precision (P), Balanced Accuracy (BA), the Area Under the Receiver Operating
Characteristic Curve (ROC AUC), and Matthews Correlation Coefficient (MCC). Moreover, the
Relative Accuracy (RA), Relative Sensitivity (RR), and Relative Specificity (RS) metrics introduced
herein were computed for all models.</p>
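The corrected paired test is standard; a minimal sketch, assuming one score difference per fold of the 10-times 10-fold scheme (k = 100) and the usual Nadeau-Bengio variance correction (for the Pima dataset with 10 folds, n_test/n_train is roughly 77/691):

```python
import numpy as np
from scipy import stats

def nadeau_bengio_ttest(scores_a, scores_b, n_train, n_test):
    """Paired t-test over k resampled folds with the Nadeau-Bengio
    variance correction for overlapping training sets.
    Returns the t statistic and the two-sided p-value (df = k - 1)."""
    d = np.asarray(scores_a) - np.asarray(scores_b)
    k = d.size
    # inflate the variance by the train/test overlap term n_test / n_train
    corrected_var = (1.0 / k + n_test / n_train) * d.var(ddof=1)
    t = d.mean() / np.sqrt(corrected_var)
    p = 2.0 * stats.t.sf(abs(t), df=k - 1)
    return t, p
```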
        <p>Interpretable models approximating the predictions of the neural networks were obtained
by rule extraction using CART [21], available from the PSyKE library [33]. Rule sets were
extracted from DD-ML and KB-ML (trained with the tuning parameter set to 1.5) and denoted
as DD-ML<sub>X</sub> and KB-ML<sub>X</sub>, respectively. Thus, each experiment yields three rule sets: KB, which
formalises the clinical protocol; DD-ML<sub>X</sub>, which approximates the data-driven model; and
KB-ML<sub>X</sub>, which approximates the integrated model. The maximum number of leaves, and thus
of rules, in the CART rule-extraction process varied from 2 to 12. The fidelity of each obtained rule
set was evaluated in terms of accuracy and F1-score with respect to the black-box model.</p>
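A minimal sketch of this pedagogical rule-extraction step, using scikit-learn's CART in place of the PSyKE pipeline (function and variable names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

def extract_rules(black_box_predict, X, max_leaves):
    """Pedagogical rule extraction: fit a CART surrogate on the labels
    assigned by the black-box model; each leaf corresponds to one rule."""
    y_bb = black_box_predict(X)
    surrogate = DecisionTreeClassifier(max_leaf_nodes=max_leaves, random_state=0)
    surrogate.fit(X, y_bb)
    y_sur = surrogate.predict(X)
    # fidelity: agreement of the surrogate with the black box, not with ground truth
    return surrogate, accuracy_score(y_bb, y_sur), f1_score(y_bb, y_sur)
```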
        <p>The proposed explanation similarity metrics (leveraging XNOR, Dice, Jaccard, and Cosine
similarity) were computed between DD-ML<sub>X</sub> and KB, and between KB-ML<sub>X</sub> and KB, on two
subsets of the dataset. Initially, explanation similarity metrics were computed over samples
for which all considered predictors (KB, DD-ML, KB-ML) could make predictions, thus
excluding samples not handled by the protocol. Subsequently, explanation similarity metrics were
computed over samples for which all considered models made correct predictions. Finally, the
explanation similarity metrics were utilised to gauge the robustness of explanations. A
comparison was made among the rule sets extracted from the 100 instances of the KB-ML model
trained over the 10-times 10-fold cross-validation. A 100×100 similarity matrix was generated by
computing the pairwise explanation similarity with the XNOR operation between each pair of
model instances; the similarities were then averaged across all elements of the matrix. The same
process was repeated for DD-ML.</p>
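The robustness analysis over the cross-validation instances can be sketched as follows, where the input is a hypothetical matrix stacking one binary explanation vector per model instance (m = 100 in the paper's setting):

```python
import numpy as np

def mean_pairwise_xnor(explanations):
    """explanations: (m, n) binary matrix, one local-explanation vector per
    model instance. Returns the average of the full m x m pairwise
    XNOR similarity matrix (diagonal entries are 1 by construction)."""
    E = np.asarray(explanations)
    m = E.shape[0]
    # agreement between instances i and j, averaged over vector positions
    sim = np.array([[np.mean(E[i] == E[j]) for j in range(m)] for i in range(m)])
    return sim.mean()
```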
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Relative accuracy evaluation</title>
        <p>This integration of domain knowledge, modulated by a tuning parameter, influences the model’s
performance, which varies with the parameter as shown in Figure 2a. For the standard metrics,
performance increases, peaking for parameter values between 1 and 1.5, subsequently declining for A and
MCC, while stabilising for ROC. This trend suggests that while the learning bias introduced
by the protocol can be beneficial, excessive bias might impede the learning process, leading
to accuracy that falls below that of the data-driven model for parameter values greater than 2. The
proposed RA metric increases with the parameter, effectively detecting the reduction of errors introduced
by the integrated model with respect to the reference model. For values around 1.5, optimal
scores on the standard metrics are achieved, together with improved RA. This evaluation highlights the
need to tune the integration so as to maximise adherence without compromising performance.</p>
        <p>A comprehensive array of metrics comparing the data-driven model with the integrated model
(tuning parameter equal to 1.5), along with the corresponding p-values indicating statistical significance,
is reported in Table 3. The integrated model yields superior scores across all metrics except precision,
with statistical significance observed for BA, ROC, and R. Nonetheless, precision significantly
decreases, and improvements in MCC, F1, and A lack statistical significance. Therefore, it
remains challenging to conclusively state that one model is superior to the other. However,
the RA metric significantly improved from 0.90 to 0.97, driven by the increased RR (since RS
is maximal for both models). These findings highlight the greater alignment with the clinical
protocol, also seen in Figure 2b, making the integrated model preferable overall, and demonstrate
the role of the proposed metrics in facilitating this assessment.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Explanation similarity evaluation</title>
        <p>The model incorporating domain knowledge also offers explanations that better align with the
underlying reasoning of the knowledge base. Given the black-box nature of both the data-driven
and integrated neural networks, explanations for each prediction are provided via surrogate
rule sets, with the number of rules varying from 2 to 12, serving as approximations of the model’s
decision-making process. The surrogate models KB-ML<sub>X</sub> and DD-ML<sub>X</sub> closely mirror the
behaviour of the black-box models, reporting accuracy and F1 scores consistently above 0.85
across all rule set sizes, as shown in Figure 3a.</p>
        <p>Explanation similarity metrics computed over samples with a prediction from all considered
predictors (KB, DD-ML, KB-ML) reveal that the similarity of KB-ML<sub>X</sub> to the knowledge base
consistently exceeds that of DD-ML<sub>X</sub> across all similarity metrics and for every number of rules
considered (Figure 3b). These differences are statistically significant across all metrics and rule
set sizes. Notably, for the XNOR similarity, these differences maintain statistical significance at
the 0.01 level across all rule set sizes, making it the most effective approach for capturing the
impact of integration on improving explanation similarity to the established protocol. This is
unsurprising, as the other similarities emphasise the overlap of 1 values
between the two local explanation vectors, while XNOR similarity also accounts for the overlap
of 0s. This is desirable, as a 1 (i.e., a satisfied condition) is in this context as relevant
as a 0 (i.e., an unsatisfied condition). Explanation similarity metrics computed over samples for
which all considered models make a correct prediction verify that, with predictions being equal,
explanations of the integrated model remain closer to the protocol than those of the data-driven
model. In this analysis, a similar pattern is observed, with explanation similarity being greater
for the integrated model across all metrics and numbers of rules, and with differences statistically
significant at the 0.01 level for XNOR and at the 0.05 level for all other metrics.</p>
        <p>Finally, the examination of explanation similarity across 100 instances of models trained via
the 10-times 10-fold cross-validation, depicted in Figure 3c, reveals that similarity among
KB-ML<sub>X</sub> rule sets is comparable to that of DD-ML<sub>X</sub> for rule sets comprising up to 5 rules. However,
it surpasses that of DD-ML<sub>X</sub> for rule sets with more rules, which also have greater fidelity
with the black-box model.
[Figure 3: (b) explanation similarity metrics for explanation adherence; (c) explanation similarity
metrics for model explanation robustness; x-axis: number of extracted rules, from 2 to 12.]
These findings demonstrate that the integrated model generates
explanations that not only are more aligned with domain knowledge but are also more robust
compared to the fully data-driven model for larger and more accurate rule sets, and that the
proposed explanation similarity strategy is instrumental in evaluating this crucial aspect.</p>
        <p>This approach presents notable advantages compared to strategies that rely solely on rules
as global explanations for the model. Leveraging local explanations offers a more nuanced and
fine-grained evaluation of model explanations, reflecting the structure of the data and providing
more context-aware insights into the model’s inner workings, which is particularly relevant in
clinical settings. The proposed approach offers several additional benefits. It can be applied
to both numerical and categorical features. Instead of discretising data first and then building
rule sets, it uses rule thresholds for data discretisation, eliminating the need for prior knowledge
of relevant intervals. Furthermore, it provides a representation that automatically performs
feature selection, excluding variables not present in the rules from the vector representation. It
also accommodates variables included in other rule sets but not present in the knowledge base.
In this scenario, rule sets with conditions on variables not accounted for by the knowledge base
will have certain non-overlapping vector regions with the base and will likely record a lower
score. Conversely, rule sets using the same features as the base will have greater opportunities
for vector overlap and will typically yield higher scores. Lastly, it has a low computational cost,
with similarity computation growing linearly with the number of samples, unlike methods that
compute pairwise rule similarities, which grow quadratically with the number of rules.</p>
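To make the vector representation concrete, the sketch below encodes a sample as a binary vector of satisfied rule conditions, using rule thresholds for discretisation as described above. The flat (feature index, operator, threshold) encoding of rule antecedents is a hypothetical simplification, not the paper's actual data structure.

```python
import numpy as np

def explanation_vector(sample, conditions):
    """Encode a sample as a binary vector: one entry per rule condition,
    1 if the condition is satisfied and 0 otherwise.
    conditions: list of (feature_index, op, threshold), op in {"gt", "le"}."""
    out = []
    for idx, op, thr in conditions:
        above = sample[idx] > thr
        # "gt" checks that the value exceeds the threshold; "le" its complement
        out.append(int(above) if op == "gt" else int(not above))
    return np.array(out)
```

Vectors produced this way for two rule sets over the same sample can then be compared directly with the similarity metrics of Section 3.3.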
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and future work</title>
      <p>This study introduces novel metrics to evaluate the adherence of models to established protocols
in terms of accuracy and explanation of predictions. Through comparative analysis on a
benchmark dataset, we illustrate that models incorporating protocol knowledge exhibit superior
alignment with established practices, making them more suitable for integration into clinical
decision-making processes.</p>
      <p>In future research, we aim to extend this investigation to other datasets, retrieving the
corresponding domain knowledge either by translating established protocols into rules or by
consulting clinicians to encode that knowledge. Having demonstrated adherence to the clinical
protocol across different datasets and clinical applications, we also plan to consult the respective
experts to verify that the trained ML model is trustworthy also outside the domain of application
of a protocol, by evaluating whether the learning criteria align with clinicians’ judgement in
borderline cases. Additionally, we plan to validate the proposed approach using other automatic
rule extraction algorithms, including those based on fuzzy logic, such as neuro-fuzzy models.
Finally, we intend to enhance the explanation similarity metrics by scaling intervals based on
their length or the number of samples within them, rather than assigning binary values.</p>
      <p>Availability of data and code. The dataset analysed is publicly available
(https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database), and the code to replicate the
experiments can be found in the GitHub repository (https://github.com/ChristelSirocchi/XAI-similarity).</p>
      <p>[17] G. Ciatto, F. Sabbatini, A. Agiollo, M. Magnini, A. Omicini, Symbolic knowledge
extraction and injection with sub-symbolic predictors: A systematic literature review, ACM
Computing Surveys 56 (2024) 161:1–161:35.
[18] Z.-H. Zhou, Y. Jiang, S.-F. Chen, Extracting symbolic rules from trained neural network
ensembles, AI Communications 16 (2003) 3–15.
[19] G. Vilone, L. Longo, A quantitative evaluation of global, rule-based explanations of
post-hoc, model agnostic methods, Frontiers in artificial intelligence 4 (2021) 717899.
[20] M. W. Craven, J. W. Shavlik, Extracting tree-structured representations of trained
networks, in: Advances in Neural Information Processing Systems 8. Proceedings of the 1995
Conference, The MIT Press, 1996, pp. 24–30.
[21] L. Breiman, Classification and regression trees, Routledge, 2017.
[22] F. Sabbatini, G. Ciatto, R. Calegari, A. Omicini, On the design of PSyKE: A platform for
symbolic knowledge extraction, in: Proceedings of the 22nd Workshop “From Objects to
Agents”, Bologna, Italy, September 1–3, 2021, volume 2963 of CEUR Workshop Proceedings,
CEUR-WS.org, 2021, pp. 29–48.
[23] M. W. Craven, J. W. Shavlik, Using sampling and queries to extract rules from trained
neural networks, in: Machine Learning Proceedings 1994, Elsevier, 1994, pp. 37–45.
[24] J. Huysmans, B. Baesens, J. Vanthienen, ITER: An algorithm for predictive regression
rule extraction, in: Data Warehousing and Knowledge Discovery (DaWaK 2006), Springer,
2006, pp. 270–279.
[25] F. Sabbatini, G. Ciatto, A. Omicini, GridEx: An algorithm for knowledge extraction from
black-box regressors, in: Explainable and Transparent AI and Multi-Agent Systems. Third
International Workshop, EXTRAAMAS 2021, Virtual Event, May 3–7, 2021, volume 12688
of LNCS, Springer Nature, Basel, Switzerland, 2021, pp. 18–38.
[26] A. H. Murphy, The Finley affair: A signal event in the history of forecast verification,</p>
      <p>Weather and forecasting 11 (1996) 3–20.
[27] C. D. Manning, P. Raghavan, H. Schütze, Introduction to information retrieval, Cambridge</p>
      <p>University Press, 2008.
[28] L. R. Dice, Measures of the amount of ecologic association between species, Ecology 26
(1945) 297–302.
[29] H. B. Kibria, M. Nahiduzzaman, M. O. F. Goni, M. Ahsan, J. Haider, An ensemble approach
for the prediction of diabetes mellitus using a soft voting classifier with an explainable AI,
Sensors 22 (2022) 7268.
[30] G. Kunapuli, K. P. Bennett, A. Shabbeer, R. Maclin, J. Shavlik, Online knowledge-based
support vector machines, in: Machine Learning and Knowledge Discovery in Databases:
European Conference, 2010, Proceedings, Part II 21, Springer, 2010, pp. 145–161.
[31] R. R. Bouckaert, E. Frank, Evaluating the replicability of significance tests for comparing
learning algorithms, in: Pacific-Asia conference on knowledge discovery and data mining,
Springer, 2004, pp. 3–12.
[32] C. Nadeau, Y. Bengio, Inference for the generalization error, Advances in neural information
processing systems 12 (1999).
[33] F. Sabbatini, G. Ciatto, R. Calegari, A. Omicini, Symbolic knowledge extraction from
opaque ML predictors in PSyKE: Platform design &amp; experiments, Intelligenza Artificiale
16 (2022) 27–48.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Piccialli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Di Somma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giampaolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cuomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fortino</surname>
          </string-name>
          ,
          <article-title>A survey on deep learning in medicine: Why, how</article-title>
          and when?,
          <source>Information Fusion</source>
          <volume>66</volume>
          (
          <year>2021</year>
          )
          <fpage>111</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Benjamens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhunnoo</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. Meskó,</surname>
          </string-name>
          <article-title>The state of artificial intelligence-based fda-approved medical devices and algorithms: an online database</article-title>
          ,
          <source>NPJ digital medicine 3</source>
          (
          <year>2020</year>
          )
          <fpage>118</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Clinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McCormick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Besteman</surname>
          </string-name>
          ,
          <article-title>Enhancing clinical practice: The role of practice guidelines</article-title>
          .,
          <source>American Psychologist</source>
          <volume>49</volume>
          (
          <year>1994</year>
          )
          <fpage>30</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Haggerty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Freeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. H.</given-names>
            <surname>Starfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Adair</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. McKendry</surname>
          </string-name>
          ,
          <article-title>Continuity of care: a multidisciplinary review</article-title>
          ,
          <source>Bmj</source>
          <volume>327</volume>
          (
          <year>2003</year>
          )
          <fpage>1219</fpage>
          -
          <lpage>1221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fleuren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Elbers</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>van der Schaar, Integrating expert odes into neural odes: pharmacology and disease progression</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>11364</fpage>
          -
          <lpage>11383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Montagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sirocchi</surname>
          </string-name>
          ,
          <article-title>Hybrid personal medical digital assistant agents</article-title>
          ,
          <source>in: Proceedings of the 25th Workshop “</source>
          From Objects to Agents”,
          <source>Forte di Bard (AO)</source>
          ,
          <source>Italy, July</source>
          <volume>8</volume>
          -
          <issue>10</issue>
          ,
          <year>2024</year>
          , volume
          <volume>3735</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>58</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Everhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dickson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Knowler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Johannes</surname>
          </string-name>
          ,
          <article-title>Using the adap learning algorithm to forecast the onset of diabetes mellitus, in: Proceedings of the annual symposium on computer application in medical care</article-title>
          , American Medical Informatics Association,
          <year>1988</year>
          , p.
          <fpage>261</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Obermeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Lost in thought: the limits of the human mind and the future of medicine</article-title>
          ,
          <source>The New England journal of medicine 377</source>
          (
          <year>2017</year>
          )
          <fpage>1209</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Leiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmidt-Kraepelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thiebes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sunyaev</surname>
          </string-name>
          ,
          <article-title>Medical informed machine learning: A scoping review and future research directions</article-title>
          ,
          <source>Artificial Intelligence in Medicine</source>
          <volume>145</volume>
          (
          <year>2023</year>
          )
          <fpage>102676</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Von Rueden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mayer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Beckh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Georgiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Giesselbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kirsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pfrommer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramamurthy</surname>
          </string-name>
          , et al.,
          <article-title>Informed machine learning-a taxonomy and survey of integrating prior knowledge into learning systems</article-title>
          ,
          <source>IEEE Trans. on Knowledge and Data Engineering</source>
          <volume>35</volume>
          (
          <year>2021</year>
          )
          <fpage>614</fpage>
          -
          <lpage>633</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sirocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bogliolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montagna</surname>
          </string-name>
          ,
          <article-title>Medical-informed machine learning: integrating prior knowledge into medical decision systems, BMC Medical Informatics and Decision Making 24 (Suppl 4) (</article-title>
          <year>2024</year>
          )
          <fpage>186</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kierner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kucharski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kierner</surname>
          </string-name>
          ,
          <article-title>Taxonomy of hybrid architectures involving rule-based reasoning and machine learning in clinical decision systems: A scoping review</article-title>
          ,
          <source>Journal of Biomedical Informatics</source>
          (
          <year>2023</year>
          )
          <fpage>104428</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chicco</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. Jurman,</surname>
          </string-name>
            <article-title>The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation</article-title>
          ,
          <source>BMC genomics 21</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sokol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Flach</surname>
          </string-name>
          ,
          <article-title>Explainability fact sheets: A framework for systematic assessment of explainable approaches</article-title>
          ,
          <source>in: Proceedings of the 2020 conference on fairness, accountability, and transparency</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Explainable artificial intelligence for predictive modeling in healthcare</article-title>
          ,
          <source>Journal of healthcare informatics research 6</source>
          (
          <year>2022</year>
          )
          <fpage>228</fpage>
          -
          <lpage>239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Calegari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ciatto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Omicini</surname>
          </string-name>
          ,
          <article-title>On the integration of symbolic and sub-symbolic techniques for xai: A survey</article-title>
          ,
          <source>Intelligenza Artificiale</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>7</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>