Integrating Symbolic Knowledge and Machine Learning in Healthcare

Christel Sirocchi 1,∗,†, Sara Montagna 1,∗,†
1 Department of Pure and Applied Sciences, University of Urbino, Piazza della Repubblica 13, 61029, Urbino, Italy

Abstract
The intersection of Artificial Intelligence and healthcare has driven advancements, particularly through machine learning, which exploits large datasets to develop predictive models and identify risk factors. Despite its success in clinical medicine, only a few models are FDA-approved, owing to issues of trustworthiness and lack of explainability that hinder adoption in clinical settings. Addressing these issues, symbolic knowledge injection and symbolic knowledge extraction have emerged. The first approach integrates domain-specific expertise encoded as rules into machine learning models, while the second extracts interpretable rules from trained models. In this study, this framework is validated on the Pima Indians diabetes dataset, a benchmark in diabetes research. By incorporating a diagnostic protocol for diabetes into machine learning models, the study demonstrates an improvement in their predictive capabilities. By extracting rules from purely data-driven trained models and integrating them with medical knowledge, we reduce false negatives while achieving a fully explainable diagnostic system. Finally, a combination of the two methods is explored, reporting higher diabetes detection rates and improved model explainability. Accordingly, this study demonstrates the potential of combining machine-learnt insights with medical guidelines to improve healthcare outcomes.

Keywords
Hybrid ML architecture, Symbolic knowledge extraction, Symbolic knowledge injection

1. Introduction

In medical settings, critical decisions often rely on clinical protocols that, while generally reliable and trustworthy, sometimes fail to correctly identify a subtle yet significant subset of patients.
These patients fall within the "grey zone", characterised by uncertainty about the appropriate course of action, as they are not clearly defined as either normal or abnormal, healthy or diseased [1]. In these cases, decisions may be more subjective or open to interpretation, challenging the accuracy of conventional protocols. In response, the literature recognises the advanced capabilities of Machine Learning (ML) models, which can uncover latent patterns and knowledge from data that extend beyond the scope of traditional medical protocols [2]. Despite advancements, significant issues persist. The accuracy of certain ML algorithms is not consistently satisfactory, and discrepancies are often observed between predictions made by these models and those derived from clinical protocols. Moreover, in most cases, these models are characterised by a level of opacity that makes it hard for humans to understand their behaviour. However, both interpreting and explaining model predictions is crucial in the medical domain, which is a safety- and ethics-critical application. Given these premises, there is a growing recognition of the need for hybrid models that integrate the robustness of medical protocols with the adaptive learning capabilities of ML. This integration aims to harness the strengths of both approaches while ensuring that decisions are both explainable and reliable.

RuleML+RR'24: Companion Proceedings of the 8th International Joint Conference on Rules and Reasoning, September 16–22, 2024, Bucharest, Romania
∗ Corresponding author.
† These authors contributed equally.
Email: c.sirocchi2@campus.uniurb.it (C. Sirocchi); sara.montagna@uniurb.it (S. Montagna)
ORCID: 0000-0002-5011-3068 (C. Sirocchi); 0000-0001-5390-4319 (S. Montagna)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
Our goal in this paper is to engineer new Artificial Intelligence (AI) solutions that address these challenges. We aim to integrate medical knowledge and ML solutions, building upon existing literature that introduces the concepts of Symbolic Knowledge Injection (SKI) and Symbolic Knowledge Extraction (SKE) [3]. Our objectives are twofold: first, to demonstrate the advantages of SKI-SKE technologies in terms of various indicators within the medical domain, showcasing how performance improves; and second, to experiment with these technologies, which are often introduced in the literature but only partially validated, especially within the medical context. This paper demonstrates how performance and explainability evolve, starting from simple knowledge bases (KB) and progressing to pure ML algorithms. Building on the two models with the highest recall (decision trees and neural networks), we applied SKI and SKE technologies and conducted novel experimentation with a SKI-SKE loop. In this loop, recently proposed in the literature and open to exploration, medical knowledge is injected into an ML model, rules are extracted from the trained model, and then re-injected into the model. The potential of this integrated approach is demonstrated using the Pima Indians Diabetes dataset for diabetes prediction [4]. Results show that applying SKI techniques to inject clinical knowledge into ML models improves performance, specifically reducing the number of false negatives in diabetes diagnosis. Additionally, SKE techniques can derive interpretable models that are further enhanced when combined with clinical knowledge. Integrating both techniques into a loop yields novel and promising results, where knowledge extracted from neural networks and re-injected can further enhance model performance and explainability.

2. Motivations and Background

The intersection of artificial intelligence and healthcare has fostered significant advancements.
ML, in particular, is the most discussed technology in this field [5, 2], as it allows for the exploitation of large datasets by discovering relationships and patterns hidden in data. Beyond developing accurate and robust clinical predictive models, ML is also extensively used to identify risk factors by detecting key features in predictions. ML has achieved remarkable performance in various domains of clinical medicine, outperforming human physicians in some cases and enabling the development of computer-aided diagnosis systems [6]. However, with thousands of studies applying ML to medical data, only a handful have significantly contributed to clinical care: indeed, only a few of these systems have been FDA-approved for healthcare use [7]. Resistance to embracing ML in clinical settings can be attributed to the prevailing reliance on evidence-based clinical guidelines as the foundation for clinical decision-making [8], while classical ML does not rely on medical knowledge but solely on data. Novel ML models, even when reporting superior performance compared to current protocols, might be unsuitable for clinical use if they (a) fail to correctly predict cases effectively managed by the protocol in place, due to potential liabilities, (b) make predictions based on confounding variables and erroneous relationships that contradict established clinical knowledge [9], or (c) make predictions that cannot be explained to the user, suffering from opacity and offering poorly interpretable solutions [10]. On the other hand, medical protocols alone can sometimes fail to detect complex patterns, correlations, causal relationships and small variations in data due to their reliance on predefined rules and thresholds, making them less effective in borderline decision cases [11].
Since healthcare is a safety- and ethics-critical application requiring humans to be in full control of the computational system supporting their decisions, the goal is to find methods that ensure the best trade-off between performance and explainability. To bridge this gap, the integration of medical knowledge with ML has emerged as a topic of ongoing debate in the literature.

2.1. Symbolic Knowledge Injection and Extraction

In the context of knowledge exploitation, with the purpose of both creating more reliable recommenders and understanding the decision process, two main methods have been defined in the literature [3]. Symbolic knowledge and methods involve the use of interpretable languages, such as logic formalisms, that are understandable by both humans and computers. In contrast, subsymbolic knowledge involves the use of numerical data processing, such as functions over fixed-sized tensors in Neural Networks (NNs), which often results in less interpretable solutions despite their high predictive performance. Additionally, the literature introduces the concepts of symbolic knowledge injection and extraction:

Symbolic knowledge injection – SKI. Particular attention is given to methods performing knowledge injection into ML models, which fall under the paradigm of informed ML [12, 13]. This approach, also referred to as symbolic knowledge injection, aims to enhance ML models by integrating data-driven learning with domain-specific expertise, typically encoded as rules. It encompasses a class of algorithms that ensure sub-symbolic predictors draw their inferences consistently with a given set of symbolic knowledge. SKI procedures of this kind influence either the structure or the training process of sub-symbolic predictors, ensuring that these predictors incorporate symbolic knowledge when making predictions. Consequently, these procedures compel sub-symbolic predictors to learn from both data and symbolic knowledge.
SKI can thus result in higher control over what the ML model is learning, ensuring more reliable and trustworthy predictors whose behaviour is consistent with domain knowledge.

Symbolic knowledge extraction – SKE. Symbolic knowledge extraction methods are also documented in the literature as a means to derive symbolic knowledge from trained ML models, which can then be used in decision support systems [14]. The goal of SKE is manifold. First, given a black-box predictor and a knowledge-extraction procedure, the extracted knowledge can be used as a basis to construct explanations for that predictor. The extracted knowledge may also serve as an interpretable replacement, referred to as a surrogate model, for the original predictor, provided that the two have a high fidelity score. Moreover, this approach facilitates the discussion of whether and how the extracted knowledge can be merged with existing domain knowledge to improve classifications based solely on domain knowledge. Finally, open research questions arise as to whether the surrogate model can truly enrich the domain knowledge or whether it presents any contradictions and, in this case, how to reconcile the two. The same considerations apply if we aim to integrate surrogate models extracted from different predictors.

SKI and SKE are thus methods devised to integrate knowledge into, and extract knowledge from, predictors. Several approaches have been developed which, according to [3], may be categorised as follows. SKI methods are classified by input knowledge form, strategy, targeted predictor type, and purpose. They accept logic formulæ or expert knowledge, including First Order Logic and Knowledge Graphs (KGs). SKI strategies include predictor structuring, knowledge embedding, and guided learning. They primarily target NN-based predictors.
Conversely, SKE methods are mainly classified by translucency (i.e., whether they rely on the inspection of the internal structure of black-box models) and by the form of the output knowledge (rule lists, graphs, decision trees, tables). A method can inspect (even partially) the internal parameters of the underlying black-box predictor, as with neural networks. The symbolic knowledge produced can take the form of propositional and fuzzy rules, decision trees, or triplets of KGs. The potential of the joint exploitation of both SKI and SKE is also recognised in the literature, specifically in the loop presented in [3] as train–extract–fix–inject. In this loop, a trained model is inspected via SKE, the extracted knowledge is verified by a domain expert, and the corrected knowledge is injected back into the trained predictor via SKI to align it with the corrected symbolic knowledge. This approach is proposed for debugging purposes but has not yet been thoroughly investigated and experimented with for improving classifier performance and explainability.

2.2. Knowledge Integration in Medicine

The literature reports different integration strategies, mainly devoted to injecting knowledge in the various stages of the ML pipeline [12, 15, 16]. A comprehensive review is out of the scope of this paper, but we report here the main methods:

Data Pre-processing. Inconsistencies and errors in datasets are mitigated by removing anomalous samples based on clinical norms. To counter insufficient or missing clinical data, virtual samples adhering to medical knowledge can be generated [17].

Feature Engineering. Novel features can be derived from existing ones using mathematical or logical models based on medical knowledge [18]. Feature selection can be strategically informed by prior knowledge [19].

Model Learning. Rules can be incorporated into the model loss function and architecture [20, 21].
Output Evaluation. ML models can be combined with rule-based systems modelling clinical guidelines, either by integrating outputs, filtering predictions in series, or verifying consistency with domain knowledge [22]. However, these attempts are sparse and do not refer to the SKI-SKE framework, where knowledge extraction also plays a crucial role, thereby losing part of the expected benefits, especially in terms of model explainability. Only recently has some work introduced a discussion on SKE, but still only in one direction and within the specific domain of diagnostic imaging [23].

3. Materials and Methods

Given the identified gaps in the literature, in this paper we explore some of the SKI-SKE methods presented above, with the goal of defining a framework that effectively leverages the advantages of data analytics and the exploitation of well-grounded medical rules. Special attention is devoted to experimenting with the loop that exploits both SKI and SKE methods, to assess the validity of this approach and evaluate improvements in model performance and explainability. To the best of our knowledge, no attempts in this direction are discussed in the literature.

3.1. Dataset and domain knowledge

The dataset analysed in this study is the Pima Indians Diabetes dataset, compiled by the National Institute of Diabetes and Digestive and Kidney Diseases. This dataset originates from a study of the Pima Indian population, known for its high incidence of diabetes. It includes 768 medical profiles of women aged 21 and older who underwent an Oral Glucose Tolerance Test (OGTT) to measure their glucose and insulin levels after two hours. The target variable is binary, indicating whether diabetes was diagnosed within five years, and is unbalanced, with diabetes diagnoses accounting for 35% of the cases. Details about the dataset features are listed in Table 1.
Missing values in the attributes 𝐼120 (48.70%), 𝑆𝑇 (29.56%), 𝐵𝑃 (4.55%), 𝐵𝑀𝐼 (1.43%), and 𝐺120 (0.65%) were imputed using the median value.

Table 1
Pima Indians Diabetes dataset

Feature name                Code    Description
Pregnancies                         Number of times pregnant
Glucose                     𝐺120    2-hour plasma glucose concentration in OGTT in 𝑚𝑔/𝑑𝐿
Blood Pressure              𝐵𝑃      Diastolic blood pressure in 𝑚𝑚𝐻𝑔
Skin Thickness              𝑆𝑇      Triceps skin-fold thickness in 𝑚𝑚
Insulin                     𝐼120    2-hour serum insulin in 𝜇𝑈/𝑚𝐿
Body mass index             𝐵𝑀𝐼     Body mass index as weight/height² in 𝑘𝑔/𝑚²
Diabetes Pedigree Function  𝐷𝑃𝐹     Likelihood function of diabetes based on family history [4]
Age                                 Age in years

Public health guidelines on type-2 diabetes risks indicate that individuals with a high 𝐵𝑀𝐼 (≥ 30) and elevated blood glucose levels (≥ 126) are at severe risk for diabetes. Conversely, those with a normal 𝐵𝑀𝐼 (≤ 25) and low blood glucose levels (≤ 100) are less likely to develop the disease. These guidelines have been used to design rules [24] expressed as logic predicates (Table 2), which form the KB for this case study.

Table 2
Knowledge base for predicting risk of type-2 diabetes as formalised by Kunapuli et al. (2010) [24].

Rule 1: (𝐵𝑀𝐼 ≥ 30) ∧ (𝐺120 ≥ 126) ⟹ diabetes
Rule 2: (𝐵𝑀𝐼 ≤ 25) ∧ (𝐺120 ≤ 100) ⟹ healthy

3.2. Machine learning models and metrics

In this study, a wide range of ML classifiers are explored, including linear models such as Logistic Regression (LR) and linear Support Vector (SV) classifiers, tree-based approaches including single learners like Decision Trees (DT) and ensemble methods such as Gradient Boosting (GB) and Random Forest (RF), as well as Neural Networks (NN). The data was normalised to a mean of 0 and a standard deviation of 1 to facilitate the learning of scale-sensitive models, such as NN.
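For concreteness, the knowledge base of Table 2 can be written as a small Python predicate. This is a sketch: the function name and the use of None to represent abstention (the protocol's grey zone) are our choices, not part of the original formalisation.

```python
def protocol(bmi, g120):
    """Apply the knowledge base of Table 2 (Kunapuli et al., 2010).

    Returns 'diabetes' if Rule 1 fires, 'healthy' if Rule 2 fires,
    and None when neither rule applies, i.e. the patient falls in
    the protocol's grey zone and is deferred to follow-up.
    """
    if bmi >= 30 and g120 >= 126:  # Rule 1: severe risk
        return "diabetes"
    if bmi <= 25 and g120 <= 100:  # Rule 2: low risk
        return "healthy"
    return None                    # no prediction
```

Note that the two rules cannot fire simultaneously, since their BMI conditions are disjoint; for example, `protocol(32, 140)` returns `"diabetes"`, while `protocol(27, 110)` returns `None`.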
Performance evaluation encompassed Accuracy (A), Precision (P), F1 score (F1), Balanced Accuracy (BA), and Matthews Correlation Coefficient (MCC), as well as True Positive Rate (TPR) or recall, True Negative Rate (TNR), False Positive Rate (FPR), and False Negative Rate (FNR). Nested cross-validation with 10 outer folds for evaluation and 5 inner folds for hyperparameter tuning was employed with an extensive parameter search. Hyperparameter optimisation was conducted by maximising accuracy, with class weights set inversely proportional to class frequency to address data imbalance. Alternative strategies, such as random oversampling of the positive class and undersampling of the negative class, were also tested but did not improve performance. For NN, the optimal number of training epochs was determined by early stopping. This method involved splitting the training set into 90% training and 10% validation subsets and monitoring the validation loss during training, for a maximum of 100 epochs. Early stopping was configured with a patience of 5 epochs, meaning training would halt if the validation loss did not improve for 5 consecutive epochs, and the best weights observed during training were restored. Performance metrics were computed for each outer fold using the model parameters optimised in the inner folds, and the average of these metrics was calculated to provide a comprehensive understanding of the models' performance. In the remainder of this paper, we focus on NN and DT along with their respective learning methods. These two families of predictors are particularly relevant as they are closely related to many surveyed SKI and SKE methods. DTs are noteworthy for their user-friendliness, making them accessible and interpretable for users. In contrast, NNs are predominantly popular due to their superior predictive performance and flexibility, allowing them to adapt to a wide range of tasks and data types.
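All the listed metrics derive from the four confusion-matrix counts; a minimal sketch (the function and dictionary keys are our naming, following the abbreviations above):

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics of Section 3.2 from confusion-matrix counts."""
    tpr = tp / (tp + fn)                    # recall / sensitivity
    tnr = tn / (tn + fp)                    # specificity
    p = tp / (tp + fp)                      # precision
    a = (tp + tn) / (tp + tn + fp + fn)     # accuracy
    ba = (tpr + tnr) / 2                    # balanced accuracy
    f1 = 2 * p * tpr / (p + tpr)            # harmonic mean of P and recall
    mcc = (tp * tn - fp * fn) / math.sqrt(  # Matthews Correlation Coefficient
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"A": a, "BA": ba, "F1": f1, "MCC": mcc, "P": p,
            "TPR": tpr, "TNR": tnr, "FNR": 1 - tpr, "FPR": 1 - tnr}
```

BA and MCC are the metrics most informative here, since plain accuracy is inflated on a dataset where 65% of cases are negative.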
Moreover, considering the clinical context, where correctly identifying positive cases is critical and recall is the key metric to minimise the risk of missing critical diagnoses, NN and DT are identified as the best-performing models according to the results presented in Table 3 and are considered for further exploration. In particular, the reference NN architecture, derived through hyperparameter optimisation, was configured as follows: an input layer of size 8; two hidden layers of size 12 and 8 with Rectified Linear Unit (ReLU) activation function; an output layer comprising a single neuron with a sigmoid activation function. The model was compiled using the Adam optimiser and binary cross-entropy with class weights as the loss function, with performance evaluation based on weighted accuracy. Models were trained with a batch size of 32 for a number of epochs determined by early stopping with patience 5 and a maximum of 100 epochs, as described. The reference DT architecture was configured with a maximum depth of 10 and Gini impurity as the split criterion. DTs were trained with class weights to account for data imbalance.

3.3. Knowledge injection and extraction: the PSyKE and PSyKI platforms

Knowledge injection and extraction in NNs leveraged two Python libraries¹: PSyKI (Platform for Symbolic Knowledge Injection) [25] and PSyKE (Platform for Symbolic Knowledge Extraction) [26]. Knowledge injection is facilitated by methods available in PSyKI [25]. This Python library primarily uses logic formulae for knowledge representation, supported by the Prolog language through integration with 2P-Kt², a multi-paradigm logic programming framework. Key components of PSyKI include Injectors, Theories, and Fuzzifiers, which represent SKI algorithms, domain-specific symbolic knowledge, and methods for translating symbolic knowledge into sub-symbolic data structures, respectively.
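The reference NN architecture of Section 3.2 can be sketched with scikit-learn's MLPClassifier as a stand-in for the original implementation. This is an approximation: MLPClassifier supports Adam, the 12-and-8 hidden layout, batch size 32, and early stopping with patience, but its early stopping monitors validation score rather than validation loss, and it has no per-class loss weights, so the class-weighted training described above is not reproduced here.

```python
from sklearn.neural_network import MLPClassifier

# Reference NN: 8 inputs -> hidden layers of 12 and 8 (ReLU) -> single
# logistic output for binary classification.
nn = MLPClassifier(
    hidden_layer_sizes=(12, 8),
    activation="relu",
    solver="adam",            # Adam optimiser on a log-loss objective
    batch_size=32,
    max_iter=100,             # at most 100 epochs
    early_stopping=True,      # hold out a validation split during fit
    validation_fraction=0.1,  # 90/10 train/validation split
    n_iter_no_change=5,       # patience of 5 epochs
    random_state=0,
)
```

The equivalent reference DT would be `DecisionTreeClassifier(max_depth=10, criterion="gini", class_weight="balanced")`.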
The available injectors include Knowledge-Based Artificial Neural Networks (KBANN) [27], one of the first injectors introduced in the literature, Knowledge Injection via Lambda Layer (KILL) [28], and Knowledge Injection via Network Structuring (KINS) [29], which structures knowledge by adding ad-hoc layers into a NN. In this work, knowledge injection in NNs, depicted in Figure 1 (a), was performed using KINS due to its several advantages: it does not constrain the NN to a specific architecture, does not require logic predicates to be grounded, and is robust to both data scarcity and imperfect or incomplete knowledge, often found in clinical scenarios. In the KINS method, a NN is first initialised with a specified architecture. The architecture is then augmented with additional neural modules specifically designed to incorporate symbolic knowledge. Each module functions as a sub-network, sharing the input layer with the original NN and producing an output that represents the continuous interpretation of a logic formula. The weights and biases within these modules can be either trainable or fixed, while the rest of the network's weights and biases remain trainable. In this study, the knowledge module weights are not trained, to ensure that all provided logic rules are given equal importance, regardless of data evidence. Knowledge extraction methods are available in PSyKE, which offers several algorithms for both classification and regression problems, allowing knowledge to be extracted in the form of a Prolog theory. PSyKE is designed around the notion of an Extractor, which is composed of a trained predictor, used as an oracle, and a set of feature descriptors. The supported extraction algorithms include those based on trees, iteratively dividing the feature space, like Classification and Regression Trees (CART) and Trepan, as well as those based on hypercubes, iteratively expanding in the input space, like ITER, GridEx, and GridREx [26].
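The continuous interpretation of a logic formula computed by a KINS knowledge module can be sketched as a fixed-weight sub-network. The sketch below fuzzifies Rule 1; the sigmoid steepness k and the product t-norm for conjunction are our assumptions for illustration, not necessarily the fuzzification PSyKI actually applies.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rule1_module(bmi, g120, k=1.0):
    """Continuous truth degree of Rule 1: (BMI >= 30) AND (G120 >= 126).

    Each threshold comparison becomes a sigmoid with fixed (non-trainable)
    slope k and bias at the clinical cut-off; the conjunction is taken as
    the product t-norm. The output in (0, 1) is what a KINS-style module
    would feed into the host network alongside its ordinary features.
    """
    t_bmi = sigmoid(k * (bmi - 30.0))    # degree to which BMI >= 30
    t_glu = sigmoid(k * (g120 - 126.0))  # degree to which G120 >= 126
    return t_bmi * t_glu
```

Keeping the module weights fixed, as done in this study, means the rule's truth degree is computed identically throughout training, regardless of data evidence.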
In this study, knowledge extraction from NNs, illustrated in Figure 1 (b), was performed using CART due to its simplicity and interpretability. CART performs rule extraction by training a decision tree on the inputs and outputs of the NN and converting the tree structure into human-readable if-then rules. The fidelity of the obtained rule set was evaluated in terms of accuracy and F1 score with respect to the black-box model. The optimal number of leaves, and thus rules, in the CART rule-extraction process was determined by varying the leaf number from 5 to 20 and selecting the value that maximised the accuracy of the rule set on a validation set. Knowledge injection and extraction in DTs was relatively straightforward, as both DTs and domain knowledge can be formalised as rules. Knowledge injection by model restructuring was achieved by modifying the structure of the DT to incorporate the two domain-specific rules as its initial split criteria. Beyond these rules, the tree expanded as a typical DT. For knowledge extraction, the DT was simply converted into a rule set by translating root-to-leaf paths into if-then rules and adding the two domain-specific rules with priority, such that, if an instance satisfies the conditions of multiple rules, priority is given to the clinical rules. The effectiveness of knowledge injection in enhancing predictive model performance was evaluated by training the reference NN and DT architectures, along with their injected counterparts, by 10-fold cross-validation. Performance metrics were averaged across folds and compared to assess improvements resulting from knowledge injection.

¹ https://github.com/psykei
² http://tuprolog.apice.unibo.it

Figure 1: Diagrams illustrating the three integrated approaches leveraging SKI-SKE technologies that were implemented and evaluated in this study: (a) knowledge injection, (b) knowledge extraction, (c) injection-extraction-injection loop.
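The priority scheme for combining extracted rules with the protocol can be sketched as an ordered rule list in which clinical rules are evaluated first. The extracted rule shown below is an invented placeholder for illustration, not one of the rules actually learned in this study.

```python
def make_rule_set(extracted_rules):
    """Prepend the two clinical rules to ML-extracted rules.

    Each rule is a (condition, label) pair; the first condition that
    matches a sample wins, so clinical knowledge takes priority over
    extracted knowledge when both apply.
    """
    clinical = [
        (lambda s: s["BMI"] >= 30 and s["G120"] >= 126, "diabetes"),  # Rule 1
        (lambda s: s["BMI"] <= 25 and s["G120"] <= 100, "healthy"),   # Rule 2
    ]
    rules = clinical + list(extracted_rules)

    def classify(sample, default="healthy"):
        for cond, label in rules:
            if cond(sample):
                return label
        return default  # uncovered samples are treated as healthy
    return classify

# Hypothetical extracted rule, for illustration only:
classify = make_rule_set([(lambda s: s["G120"] >= 155, "diabetes")])
```

This ordered-list semantics matches the text above: an instance satisfying several rules is decided by the clinical ones, and the extracted rules only extend coverage into the grey zone.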
Similarly, the same reference NN and DT architectures were trained using 10-fold cross-validation and, for each fold, converted into interpretable rule sets. The predictive performance of these extracted rule sets was averaged across all folds and compared to that of the original clinical protocol. Additionally, integrated rule sets, which combined clinical rules with ML-derived rules, were evaluated to detect any increase in predictive performance as a result of this integration.

3.4. Knowledge injection-extraction feedback loop

The potential to apply a combination of SKI and SKE strategies in a feedback loop to further enhance the predictor's performance was explored. The process, outlined in Figure 1 (c), begins with the initial injection of available domain knowledge. The model is then trained, and rules are extracted from it. The quality of these rules is evaluated, and the best rules are added to the current domain rules, which are then re-injected into a new model. In this study, the injection-extraction process was structured as follows. The dataset was divided into training, validation, and test sets in a 60:20:20 ratio. A NN injected with the two protocol rules was trained on the training set, with training parameters optimised based on performance on the validation set. Rules were then extracted from the trained injected NN, with the rule set size fine-tuned according to validation set performance. These extracted rules were evaluated using performance metrics as well as coverage, which measures the proportion of dataset samples accounted for by the rule set. Four rules predicting diabetic outcomes were identified and added to the clinical protocol, both individually and in combination, and re-injected into new NN models. Consequently, five new NN models were injected with the updated knowledge bases. Their performance was compared against the initial injected NN model and the traditional NN model to assess the impact of injecting ML-derived rules.
4. Results and discussion

4.1. ML performance

The initial performance comparison of various ML models trained on the Pima Indians diabetes dataset is summarised in Table 3. All models show moderate prediction accuracy, ranging from 0.73 to 0.78. Among these, RF stands out with the highest quality of positive predictions, evidenced by superior precision, and the highest scores for overall performance metrics, such as A, BA, F1, and MCC. SV excels in predicting the negative class (healthy individuals), with the lowest FPR and highest TNR. In contrast, NN demonstrates the best capability for predicting the positive class (diabetic individuals), achieving the highest TPR and lowest FNR. DT and RF follow closely and are notable for their diabetes prediction capabilities.

Table 3
Evaluation metrics for ML models trained on the Pima Indians diabetes dataset. The best value for each metric is highlighted in bold, corresponding to the highest value for all metrics, except for FPR and FNR, for which it is the lowest.

Model                 A      BA     F1     MCC    P      TNR    TPR    FNR    FPR
Neural Network        0.738  0.742  0.670  0.472  0.612  0.730  0.754  0.246  0.270
Decision Tree         0.738  0.741  0.667  0.468  0.604  0.730  0.753  0.247  0.270
Random Forest         0.772  0.768  0.697  0.522  0.652  0.782  0.753  0.247  0.218
Gradient Boosting     0.756  0.754  0.681  0.499  0.637  0.762  0.746  0.254  0.238
Support Vector        0.762  0.751  0.678  0.497  0.651  0.786  0.717  0.283  0.214
Logistic Regression   0.751  0.742  0.666  0.477  0.636  0.774  0.709  0.291  0.226

Figure 2: Diabetes dataset divided into five regions based on the predictions of the clinical protocol with respect to the actual outcomes. The proportion of diabetic and healthy predictions made by six ML models is shown for each region.
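The five regions of Figure 2 can be reconstructed from the protocol's prediction versus the actual outcome. This is a sketch; the region numbering follows the discussion in Section 4.1, and the observation that Rule 2 produces no false negatives on this dataset is taken from the text.

```python
def region(bmi, g120, diabetic):
    """Assign a patient to one of the five regions of Figure 2.

    1: Rule 1 fires, patient diabetic (protocol true positive)
    2: no rule fires, patient diabetic (the clinical grey zone)
    3: Rule 1 fires, patient healthy (protocol false positive)
    4: Rule 2 fires, patient healthy (protocol true negative)
    5: no rule fires, patient healthy
    """
    if bmi >= 30 and g120 >= 126:  # Rule 1 fires
        return 1 if diabetic else 3
    if bmi <= 25 and g120 <= 100:  # Rule 2 fires; the protocol makes
        return 4                   # no false negatives on this dataset
    return 2 if diabetic else 5    # protocol abstains
```

With this partition, the protocol's coverage is the share of patients falling in regions 1, 3 and 4 (about 34.5% here), while regions 2 and 5 collect the cases deferred to follow-up.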
A detailed analysis of the predictions made by each model, compared to those made by the clinical protocol and the actual outcomes, is illustrated in Figure 2. The graph is divided into regions based on whether the clinical protocol correctly predicts positive and negative instances. For each region, the proportion of healthy and diabetic predictions made by ML models is displayed. It can be observed that the coverage of the clinical protocol is relatively low, at about 34.5%, leaving many cases, primarily healthy individuals, without a diagnosis. Such cases are generally deferred to follow-up and thus treated for the time being as healthy individuals. For this reason, in performance metrics computation, these cases are considered healthy. Additionally, it can be noted that the protocol produces false positives (region 3) but no false negatives, which is highly desirable in a clinical setting where a positive outcome typically leads to specialised tests for confirmation, whereas a negative outcome usually does not prompt further examination. Examining the predictions of the ML models in detail reveals several insights. In region 1, which includes diabetic cases correctly predicted by Rule 1 of the protocol, all models make some mistakes, with NN reporting the fewest errors in this region and DT the most. In region 2, which includes diabetic cases where the protocol could not make predictions, the most crucial classification challenge arises, as these patients inhabit a clinical "grey zone" and often do not receive adequate care. All ML models struggle to classify this region. DT emerges as the best-performing model and the only one correctly identifying over 50% of the patients as diabetic. Poor performance indicates that the available features may not be sufficiently predictive for these cases. However, some patients are correctly identified by multiple models, indicating potential criteria for accurate classification.
In region 3, which includes cases incorrectly classified as diabetic by Rule 1 of the protocol, most models also classify these instances as diabetic, suggesting that the available features are not sufficiently predictive for these patients either. This misclassification needs to be addressed, as it increases over-triage for healthcare providers, but takes lower priority, as our primary focus is on reducing false negatives rather than false positives. In region 4, which includes healthy individuals correctly predicted by Rule 2 of the protocol, all models also predict these patients as healthy. In region 5, which consists of healthy individuals for whom the protocol cannot give a prediction, all models correctly predict most patients. The fraction of false positives remains below 20% for all models, demonstrating the value of ML in predicting these patients. These findings underscore the opportunities (region 5) and challenges (regions 2 and 3) in leveraging ML for clinical prediction. Combining data-driven ML with rule-based knowledge may address these challenges, forming the basis for investigating knowledge injection and extraction to enhance predictive models.

4.2. Knowledge injection and extraction

Performance evaluation of DT and NN architectures injected with clinical rules by model restructuring is presented in Table 4. Injected models were evaluated against the standard ML architectures and the clinical protocol. Despite the vastly different learning paradigms, the effect of knowledge injection on the two models was similar. The injection led to an increase in the classification of positive outcomes, with a rise in both true positives and false positives. This is due to the fact that, as discussed in the previous section, the clinical protocol does not produce false negatives but does produce false positives through Rule 1.
This increase in positive predictions yields an increase in TPR and a decrease in FNR for both injected models, a desirable outcome in clinical scenarios where the primary objective is identifying positive cases. However, this comes at the expense of P, particularly in the DT model, where the overall performance metrics—including A, BA, F1, and MCC—degraded. In contrast, the NN model showed an improvement in these metrics, indicating a more balanced trade-off between precision and recall. These results highlight the potential of augmenting ML models with available knowledge. However, in clinical settings, black-box models like NNs, and even rule-based methods like decision trees (DTs) when the elevated number of rules impacts model interpretability, are often not adopted due to their lack of transparency and trustworthiness. Therefore, working with a small set of interpretable rules that closely approximates the behaviour of trained ML models could be more useful and applicable in clinical practice. In this regard, effective knowledge integration can be achieved by combining protocol rules with rules derived from ML models through knowledge extraction methods. The performance evaluation of rule sets extracted from trained ML models, and of composite rule sets combining extracted rules with protocol rules, is presented in Table 5. As with the injected models, integration results in an increase in positive predictions, as indicated by higher TPR and lower FNR. In this case, however, P also either remains stable or improves. All global performance metrics—A, BA, F1, and MCC—also show improvement. These findings suggest that when using an interpretable surrogate model in place of a black-box model, integrating additional rules from clinical knowledge can enhance predictions, especially in areas where the protocol is effective.

Table 4
Knowledge injection. Evaluation metrics computed for the clinical protocol formalising the KB and ML models trained on the Pima Indians dataset.
ML models comprise DT and NN, trained either solely on data or injected with domain knowledge by model restructuring, denoted as DT-I and NN-I, respectively.

Metric   A      BA     F1     MCC    P      TNR    TPR    FNR    FPR
KB       0.764  0.719  0.626  0.466  0.707  0.869  0.567  0.433  0.131
NN       0.752  0.751  0.679  0.493  0.634  0.756  0.746  0.255  0.244
NN-I     0.759  0.765  0.694  0.513  0.628  0.747  0.783  0.218  0.253
DT       0.721  0.719  0.638  0.424  0.582  0.725  0.711  0.287  0.275
DT-I     0.676  0.686  0.607  0.360  0.527  0.650  0.723  0.276  0.350

Table 5
Knowledge extraction. Evaluation metrics computed for rule sets over the Pima Indians diabetes dataset, including the clinical protocol formalising the Knowledge Base (KB), the Decision Tree model trained on data (DT), the rule set extracted from the Neural Network using CART (NN-E), as well as the composite rule sets DT+KB and NN-E+KB, which integrate protocol rules with priority.

Metric    A      BA     F1     MCC    P      TNR    TPR    FNR    FPR
KB        0.764  0.719  0.626  0.466  0.707  0.869  0.567  0.433  0.131
NN-E      0.722  0.723  0.643  0.435  0.588  0.722  0.724  0.275  0.278
NN-E+KB   0.725  0.731  0.654  0.449  0.589  0.711  0.750  0.249  0.289
DT        0.721  0.719  0.638  0.424  0.582  0.725  0.711  0.287  0.275
DT+KB     0.726  0.736  0.661  0.454  0.583  0.704  0.768  0.232  0.296

4.3. Knowledge injection-extraction feedback loop

The explorations in the previous section highlight the potential of using injection and extraction techniques to incorporate symbolic knowledge into the learning process or derive symbolic knowledge from trained models. However, the combined application of these approaches is heavily understudied, and a preliminary investigation is presented here. The combination of these strategies was set up as an injection-extraction-injection loop, capitalising on the enhanced performance obtained through knowledge injection and the improved explainability afforded by knowledge extraction (due to the intrinsic interpretability of rule-based systems).
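The injection-extraction-injection loop can be sketched as a short driver. The function names (`train_injected`, `extract_rules`, `select_rules`) are hypothetical placeholders: in the paper, injection is done by model restructuring and extraction by fitting a CART surrogate that maximises fidelity with the trained model.

```python
# Skeleton of the inject -> train -> extract -> reinject loop
# (placeholder callables; not the paper's actual implementation).

def injection_extraction_loop(kb, X, y, train_injected, extract_rules,
                              select_rules, n_iters=1):
    """Run one or more rounds of the feedback loop.

    train_injected(kb, X, y) -> model trained with the KB injected;
    extract_rules(model, X)  -> surrogate rule list maximising fidelity;
    select_rules(rules, kb)  -> the few high-quality rules worth reinjecting.
    """
    model = None
    for _ in range(n_iters):
        model = train_injected(kb, X, y)        # symbolic knowledge injection
        extracted = extract_rules(model, X)     # symbolic knowledge extraction
        kb = kb + select_rules(extracted, kb)   # grow the knowledge base
    return model, kb
```

The study runs a single iteration of this loop; the driver makes explicit that the KB can, in principle, keep growing across iterations, with rule selection guarding against the KB-update-#5 effect of reinjecting too many rules at once.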
A model injected with clinical rules was trained on data, and a rule set maximising fidelity with the model was extracted. The extracted rules, along with their performance metrics, are presented in Table 6. Four rules predict diabetic outcomes and are analysed further. Extracted Rule 1 closely mirrors Rule 1 of the clinical protocol, predicting diabetic individuals with elevated glucose and BMI. The thresholds for these features are lower in the extracted rule, suggesting that individuals with glucose and BMI just below the clinical thresholds should also be considered at elevated risk. Extracted Rule 2 suggests that individuals with elevated glucose could be considered at higher risk above a certain age, even without elevated BMI, identifying age as an additional risk factor not considered in the protocol. Conversely, extracted Rule 3 indicates that even if the glucose level is not elevated, risk could still be high if BMI is very elevated, suggesting that these two features be considered not only in combination but also individually. Finally, extracted Rule 4 suggests that even if glucose and BMI are not elevated, risk might still be high above a certain age and with a family history of diabetes, quantified by the DPF, prompting consideration of these two additional factors even when the two main diabetes risk factors are in the normal range. The extracted rules can be used to augment rule-based protocols or to improve ML training.

Table 6
Performance metrics computed on the test set for the six rules extracted from a NN model injected with the rules of the diabetes clinical protocol and trained on the Pima Indians diabetes dataset.
Rule  Condition                                            Outcome   Total  #TP  #TN  #FP  #FN  A      Coverage
1     G120 > 121.5 ∧ BMI > 29.1                            diabetes  262    168  0    94   0    0.641  0.341
2     G120 > 121.5 ∧ BMI ≤ 29.1 ∧ Age > 30.5               diabetes  44     18   0    26   0    0.409  0.057
3     G120 ≤ 121.5 ∧ BMI > 40.75                           diabetes  33     12   0    21   0    0.364  0.043
4     G120 ≤ 121.5 ∧ BMI ≤ 40.75 ∧ DPF > 0.65 ∧ Age > 40   diabetes  75     23   0    52   0    0.307  0.098
5     G120 ≤ 121.5 ∧ BMI ≤ 40.75 ∧ DPF ≤ 0.65              healthy   317    0    277  0    40   0.874  0.413
6     G120 > 121.5 ∧ BMI ≤ 29.1 ∧ Age ≤ 30.5               healthy   37     0    30   0    7    0.811  0.048

Table 7
Evaluation metrics computed for NNs injected with prior domain knowledge together with rules extracted from trained NNs.

Metric           A      BA     F1     MCC    P      TNR    TPR    FNR    FPR
KB               0.764  0.719  0.627  0.463  0.700  0.869  0.567  0.433  0.131
NN               0.771  0.768  0.698  0.519  0.643  0.773  0.765  0.235  0.227
NN-I             0.741  0.752  0.680  0.481  0.598  0.716  0.787  0.212  0.284
NN-I update #1   0.760  0.769  0.699  0.516  0.622  0.740  0.799  0.201  0.260
NN-I update #2   0.768  0.772  0.702  0.523  0.636  0.760  0.784  0.218  0.240
NN-I update #3   0.754  0.770  0.700  0.516  0.609  0.716  0.825  0.175  0.284
NN-I update #4   0.758  0.761  0.690  0.503  0.623  0.750  0.772  0.226  0.250
NN-I update #5   0.771  0.770  0.701  0.523  0.643  0.773  0.769  0.232  0.227

Adding each of the four extracted rules (Rules 1 through 4 in Table 6) to the protocol yielded four updated knowledge bases, named KB update #1 through KB update #4, while adding all four rules resulted in KB update #5, depicted in Figure 3. Injecting each updated knowledge base into the NN resulted in five injected models, termed NN-I update #1 through #5. Performance evaluation of these models, compared against the first injected model (NN-I) and the standard NN model, is presented in Table 7. The NN injected with Rule 3 reported the best scores for TPR and FNR. It excelled in predicting cases in the challenging region 2, where the clinical protocol fails, achieving 62% accuracy in this region, compared to 48% for the uninjected model and 53-55% for the other injected models.
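The six extracted rules of Table 6 form a complete, ordered, and fully interpretable classifier, and can be encoded directly. The conditions and outcomes below are taken verbatim from Table 6; the dictionary-based feature encoding and the default-healthy fallback for any uncovered case (mirroring the deferred-to-follow-up convention) are implementation assumptions.

```python
# The six rules extracted from the injected NN (Table 6), encoded as an
# ordered rule list. Feature names follow the paper: G120 (2-hour glucose),
# BMI, DPF (diabetes pedigree function), Age. Outcome: 1 diabetic, 0 healthy.

EXTRACTED_RULES = [
    (lambda x: x["G120"] > 121.5 and x["BMI"] > 29.1, 1),                        # Rule 1
    (lambda x: x["G120"] > 121.5 and x["BMI"] <= 29.1 and x["Age"] > 30.5, 1),   # Rule 2
    (lambda x: x["G120"] <= 121.5 and x["BMI"] > 40.75, 1),                      # Rule 3
    (lambda x: x["G120"] <= 121.5 and x["BMI"] <= 40.75
               and x["DPF"] > 0.65 and x["Age"] > 40, 1),                        # Rule 4
    (lambda x: x["G120"] <= 121.5 and x["BMI"] <= 40.75
               and x["DPF"] <= 0.65, 0),                                         # Rule 5
    (lambda x: x["G120"] > 121.5 and x["BMI"] <= 29.1 and x["Age"] <= 30.5, 0),  # Rule 6
]

def classify(x, rule_sets):
    """Apply rule sets in priority order; the first matching rule wins."""
    for rules in rule_sets:
        for condition, outcome in rules:
            if condition(x):
                return outcome
    return 0  # uncovered cases treated as healthy (deferred to follow-up)
```

A composite rule set with protocol priority, as evaluated in Table 5, would then be obtained by passing the protocol's rule list ahead of the extracted one, e.g. `classify(x, [protocol_rules, EXTRACTED_RULES])` for some encoding of the clinical protocol.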
The NN injected with Rule 1 reported the second-best scores for these metrics, due to improved prediction in region 2 and almost perfect prediction in region 1. All models injected with updated rules reported TPR and FNR scores at least as good as those of the standard NN. However, only the injections of Rule 1 and Rule 3 improved these scores beyond those of NN-I. Notably, Rule 2 scored higher than Rule 3 in terms of accuracy and coverage but had a less beneficial effect on TPR, indicating that the available metrics for evaluating rules are not always predictive of the effect of adding a rule to the knowledge base. This suggests a need for novel metrics for evaluating new rules against existing ones. The model that performed the worst was NN-I update #5, which incorporated all four rules, resulting in a complex architecture. These findings suggest that adding a few high-quality rules is more beneficial than incorporating many rules. For this reason, only one loop of knowledge injection-extraction was applied in this study. However, this approach can potentially be repeated multiple times, allowing the rule knowledge base to grow and increasingly complex knowledge to be injected. These explorations demonstrate the potential of augmenting ML models with ML-derived rules in addition to domain knowledge. They also highlight the challenges in identifying high-quality ML-derived rules for reinjection. Further investigation is required to understand the potential of this integration architecture.

Figure 3: Clinical protocol and updated knowledge bases (KB update #1 through #5) integrating, either individually or collectively, four rules extracted from the injected neural network (NN-I rules).

5.
Conclusions and future work

To leverage the potential of ML while addressing its limitations, we experimented with SKI, SKE, and their combination on a diabetes benchmark dataset. SKI effectively improved diabetes detection by enhancing recall, albeit with a reduction in precision. To increase explainability, SKE was applied, integrating the extracted rules with domain-specific knowledge, which resulted in higher recall while preserving precision. Additionally, implementing a loop that combines rule extraction and reinjection led to further performance improvements. Future research will focus on refining integration techniques and exploring additional knowledge extraction and injection methods. This includes extending knowledge representation from propositional logic to first-order logic, Datalog-like rules, and knowledge graphs.

Availability of data and code
The dataset analysed is publicly available (https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database), and the code to replicate the experiments can be found in the GitHub repository (https://github.com/ChristelSirocchi/hybrid-ML).

References
[1] S. Montagna, C. Sirocchi, Hybrid personal medical assistant agents, in: 25th Workshop "From Objects to Agents", volume 3735 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 58–72.
[2] P. Rajpurkar, E. Chen, O. Banerjee, E. J. Topol, AI in health and medicine, Nature Medicine 28 (2022) 31–38. doi:10.1038/s41591-021-01614-0.
[3] G. Ciatto, F. Sabbatini, A. Agiollo, M. Magnini, A. Omicini, Symbolic knowledge extraction and injection with sub-symbolic predictors: A systematic literature review, ACM Computing Surveys 56 (2024). doi:10.1145/3645103.
[4] J. W. Smith, J. E. Everhart, W. Dickson, W. C. Knowler, R. S.
Johannes, Using the ADAP learning algorithm to forecast the onset of diabetes mellitus, in: Proceedings of the Annual Symposium on Computer Application in Medical Care, American Medical Informatics Association, 1988, p. 261.
[5] E. Topol, High-performance medicine: the convergence of human and artificial intelligence, Nature Medicine 25 (2019) 44–56. doi:10.1038/s41591-018-0300-7.
[6] F. Piccialli, V. Di Somma, F. Giampaolo, S. Cuomo, G. Fortino, A survey on deep learning in medicine: Why, how and when?, Information Fusion 66 (2021) 111–137.
[7] S. Benjamens, P. Dhunnoo, B. Meskó, The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database, NPJ Digital Medicine 3 (2020) 118.
[8] J. J. Clinton, K. McCormick, J. Besteman, Enhancing clinical practice: The role of practice guidelines, American Psychologist 49 (1994) 30.
[9] Z. Qian, W. Zame, L. Fleuren, P. Elbers, M. van der Schaar, Integrating expert ODEs into neural ODEs: pharmacology and disease progression, Advances in Neural Information Processing Systems 34 (2021) 11364–11383.
[10] C. C. Yang, Explainable artificial intelligence for predictive modeling in healthcare, Journal of Healthcare Informatics Research 6 (2022) 228–239.
[11] Z. Obermeyer, T. H. Lee, Lost in thought — the limits of the human mind and the future of medicine, New England Journal of Medicine 377 (2017) 1209–1211.
[12] L. Von Rueden, S. Mayer, K. Beckh, B. Georgiev, S. Giesselbach, R. Heese, B. Kirsch, J. Pfrommer, A. Pick, R. Ramamurthy, et al., Informed machine learning – a taxonomy and survey of integrating prior knowledge into learning systems, IEEE Transactions on Knowledge and Data Engineering 35 (2021) 614–633.
[13] C. Sirocchi, A. Bogliolo, S. Montagna, Medical-informed machine learning: integrating prior knowledge into medical decision systems, BMC Medical Informatics and Decision Making 24 (Suppl 4) (2024) 186. doi:10.1186/s12911-024-02582-4.
[14] M. Magnini, G.
Ciatto, F. Cantürk, R. Aydoğan, A. Omicini, Symbolic knowledge extraction for explainable nutritional recommenders, Computer Methods and Programs in Biomedicine 235 (2023) 107536. doi:10.1016/j.cmpb.2023.107536.
[15] S. Kierner, J. Kucharski, Z. Kierner, Taxonomy of hybrid architectures involving rule-based reasoning and machine learning in clinical decision systems: A scoping review, Journal of Biomedical Informatics (2023) 104428.
[16] M. van Bekkum, M. de Boer, F. van Harmelen, A. Meyer-Vitali, A. ten Teije, Modular design patterns for hybrid learning and reasoning systems: a taxonomy, patterns and use cases, Applied Intelligence 51 (2021) 6528–6546.
[17] A. Bochare, A. Gangopadhyay, Y. Yesha, A. Joshi, Y. Yesha, M. Brady, M. A. Grasso, N. Rishe, Integrating domain knowledge in supervised machine learning to assess the risk of breast cancer, International Journal of Medical Engineering and Informatics 6 (2014) 87–99.
[18] Z. H. Janjua, D. Kerins, B. O'Flynn, S. Tedesco, Knowledge-driven feature engineering to detect multiple symptoms using ambulatory blood pressure monitoring data, Computer Methods and Programs in Biomedicine 217 (2022) 106638.
[19] R. Gazzotti, C. Faron, F. Gandon, V. Lacroix-Hugues, D. Darmon, Extending electronic medical records vector models with knowledge graphs to improve hospitalization prediction, Journal of Biomedical Semantics 13 (2022) 1–20.
[20] J. Huang, H. Yan, J. Li, H. M. Stewart, F. Setzer, Combining anatomical constraints and deep learning for 3-D CBCT dental image multi-label segmentation, in: 2021 IEEE 37th International Conference on Data Engineering (ICDE), IEEE, 2021, pp. 2750–2755.
[21] S.-C. Tsai, T.-Y. Chang, Y.-N. Chen, Leveraging hierarchical category knowledge for data-imbalanced multi-label diagnostic text understanding, in: Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), 2019, pp. 39–43.
[22] L.-Y. Lee, C.-H. Yang, Y.-C. Lin, Y.-H. Hsieh, Y.-A.
Chen, M. D.-T. Chang, Y.-Y. Lin, C.-T. Liao, A domain knowledge enhanced yield based deep learning classifier identifies perineural invasion in oral cavity squamous cell carcinoma, Frontiers in Oncology 12 (2022).
[23] K. H. Ngan, E. Mansouri-Benssassi, J. Phelan, J. Townsend, A. d. Garcez, From explanation to intervention: Interactive knowledge extraction from convolutional neural networks used in radiology, PLOS ONE 19 (2024) 1–29.
[24] G. Kunapuli, K. P. Bennett, A. Shabbeer, R. Maclin, J. Shavlik, Online knowledge-based support vector machines, in: Machine Learning and Knowledge Discovery in Databases: European Conference, 2010, Proceedings, Part II 21, Springer, 2010, pp. 145–161.
[25] M. Magnini, G. Ciatto, A. Omicini, On the design of PSyKI: a platform for symbolic knowledge injection into sub-symbolic predictors, in: International Workshop on Explainable, Transparent Autonomous Agents and Multi-Agent Systems, Springer, 2022, pp. 90–108.
[26] F. Sabbatini, G. Ciatto, R. Calegari, A. Omicini, et al., On the design of PSyKE: a platform for symbolic knowledge extraction, in: CEUR Workshop Proceedings, volume 2963, Sun SITE Central Europe, RWTH Aachen University, 2021, pp. 29–48.
[27] G. G. Towell, J. W. Shavlik, M. O. Noordewier, Refinement of approximate domain theories by knowledge-based neural networks, in: Proceedings of the Eighth National Conference on Artificial Intelligence, Volume 2, 1990, pp. 861–866.
[28] M. Magnini, G. Ciatto, A. Omicini, et al., A view to a kill: knowledge injection via lambda layer, in: WOA, 2022, pp. 61–76.
[29] M. Magnini, G. Ciatto, A. Omicini, Knowledge injection of Datalog rules via neural network structuring with KINS, Journal of Logic and Computation 33 (2023) 1832–1850.