=Paper=
{{Paper
|id=Vol-3735/paper_05
|storemode=property
|title=Hybrid Personal Medical Digital Assistant Agents
|pdfUrl=https://ceur-ws.org/Vol-3735/paper_05.pdf
|volume=Vol-3735
|authors=Sara Montagna,Christel Sirocchi
|dblpUrl=https://dblp.org/rec/conf/woa/MontagnaS24
}}
==Hybrid Personal Medical Digital Assistant Agents==
Sara Montagna¹,* and Christel Sirocchi¹,*

¹ Department of Pure and Applied Sciences, University of Urbino, Piazza della Repubblica 13, 61029, Urbino, Italy

WOA 2024: 25th Workshop "From Objects to Agents", July 8-10, 2024, Forte di Bard (AO), Italy
* Corresponding authors: sara.montagna@uniurb.it (S. Montagna), c.sirocchi2@campus.uniurb.it (C. Sirocchi)
ORCID: 0000-0001-5390-4319 (S. Montagna), 0000-0002-5011-3068 (C. Sirocchi)

Abstract: Autonomous intelligent systems are beginning to impact clinical practice as personal medical assistant agents, by leveraging experts' knowledge when needed and exploiting the vast amount of patient data available to clinicians. However, these two approaches are seldom integrated. In this paper, we propose an integrated hybrid agent architecture that combines symbolic reasoning with sub-symbolic, data-driven models. Using the PIMA dataset, we demonstrate that this hybrid approach outperforms either approach used alone. Specifically, we show that integrating a logical agent, which uses predefined expert knowledge plans, with rules obtained by symbolic knowledge extraction from machine learning models trained on historical data improves system reliability and clinical decision-making, while reducing misclassified instances.

Keywords: PMDA, Hybrid agent architecture, Symbolic knowledge extraction

1. Introduction

The advent of personal medical digital assistant agents (PMDAs) marks a significant milestone in healthcare [1], aiming to provide support and recommendations to both patients and clinicians. However, in healthcare settings, it is essential for systems to be both trustworthy and explainable, as they typically handle safety-critical tasks [2]. Consequently, the design of PMDA agents that are trustworthy and reliable is pivotal for adopting these systems in clinical practice. Agents that base their recommendations on established medical protocols, usually modelled as the agent's beliefs in its knowledge base, inherently possess a degree of trustworthiness and reliability. When these agents utilise logical, rule-based systems, explaining decisions becomes relatively straightforward, as the explanation is provided by the rule whose conditions were satisfied. However, these protocols may not always deliver the performance required for effective clinical adoption, particularly in grey areas where patient cases do not conform to predefined categories or the clinical evidence is ambiguous [3].

In this context, the literature recognises the advanced capabilities of machine learning (ML) models, which have gained significant attention in recent years [4]. These models can uncover latent patterns and knowledge from data that extend beyond the scope of traditional medical protocols [5]. Thus, trained ML models can be integrated into the agent's internal cycle, delivering robust analysis of its perceptions. However, unlike purely rule-based agents, this integration complicates the explanation of the decision-making process, as black-box ML models inherently lack transparency [2].

Given these premises, there is growing recognition of the need for hybrid models that integrate the robustness of medical protocols with the adaptive learning capabilities of ML [6].
Such integration aims to harness the strengths of both approaches while ensuring that the decisions made by PMDA agents are both explainable and reliable [7, 8, 9]. In this paper, we propose a hybrid agent architecture obtained by integrating ML into the reasoning cycle of agents. In particular, the proposed solution is grounded on logical agents whose knowledge can be updated with new rules extracted from ML models. This hybrid methodology allows the PMDA to navigate the grey areas where medical protocols fall short and data-driven insights can provide additional support. Additionally, incorporating knowledge extracted from ML models in symbolic form enhances the predictive abilities of these agents while maintaining their explanatory capabilities.

The potential of this integrated approach is demonstrated using the PIMA dataset, a widely used benchmark in medical research, particularly in the study of diabetes [10]. The study focuses on reducing the number of false negatives in diabetes diagnosis, where patients at risk are not identified by the clinical protocol due to its limited coverage. By incorporating additional rules extracted from ML models, we aim to refine and improve diagnostic accuracy, particularly for these borderline cases. The integration of ML-derived rules improved the predictive capabilities of the PMDA agents, ensuring both higher diabetes detection and full coverage, while outperforming both the clinical protocol and the standalone ML models. Furthermore, since agent actions are based solely on a logical theory, they inherently retain explainability. Additionally, agents may have distinct requirements for updating their knowledge, based on factors such as prediction accuracy, explanation readability, and knowledge fragmentation, depending on their application domain. Therefore, we demonstrate how agents can update their knowledge base according to specific internal criteria. Finally, simulations conducted within this framework demonstrate that if a PMDA identifies a patient at high risk of developing diabetes, immediate interventions such as dietary adjustments can be initiated. This proactive approach underscores the significance of combining traditional medical knowledge with cutting-edge ML techniques to enhance patient outcomes and advance personalised medicine.

2. Background

The intersection of artificial intelligence and healthcare is the focus of several research efforts and has produced significant advances. ML, in particular, is the most discussed technology in this field [11, 12], as it allows for the exploitation of large datasets by discovering relationships and patterns hidden in data. Its primary objective in healthcare is to develop accurate and robust models capable of making clinical predictions. Moreover, ML models are also widely used to identify risk factors by detecting the features that are crucial to predictions.

ML has achieved remarkable performance in various domains of clinical medicine, outperforming human physicians in some cases and enabling the development of computer-aided diagnosis systems [4]. However, with thousands of studies applying ML algorithms to medical data, only a handful have significantly contributed to clinical care, with few systems receiving FDA approval for healthcare use [13]. Resistance to embracing ML in clinical settings can be attributed to the prevalent reliance on evidence-based clinical pathways, guidelines, and protocols as the foundation for clinical decision-making [14], while ML primarily relies on available data.
Novel ML models, even when reporting superior performance compared to current protocols, might be unsuitable for clinical use if they (a) fail to correctly predict cases effectively managed by the protocol in place, due to potential liabilities, (b) make predictions based on confounding variables and erroneous relationships that contradict established clinical knowledge [15], or (c) make predictions that cannot be explained to the user [16]. On the other hand, medical protocols alone can sometimes fail to detect complex patterns, correlations, causal relationships, and subtle variations in data, owing to their reliance on predefined rules and thresholds, making them less effective in borderline decision cases [17].

Bridging this gap, the integration of medical knowledge with ML has emerged as a topic of ongoing debate in the literature. Particular attention is given to methods that inject knowledge into ML models, falling under the paradigm of informed ML [6], which seeks to augment ML models by combining data-driven learning with domain-specific expertise. Furthermore, symbolic knowledge extraction methods are also documented in the literature as a means to derive rules from trained ML models, which can then be used in recommender and decision support systems [18, 19]. For instance, [20] presents a method for extracting a symbolic model from a graph neural network.

At the same time, there is a growing demand for seamless real-time interaction between human users and digital assistants to ensure continuous and effective support. In response, a dynamic and evolving research field is dedicated to the design and advancement of personal medical digital assistant agents (PMDAs) [21]. These are specialised assistant agents [22] devised for the medical domain, designed to support users in their daily activities, from simple tasks such as making calls, reading emails, sending messages, and opening web pages, to more complex tasks like scheduling appointments, interacting with physical objects in the environment, and controlling smart devices. A distinctive feature of PMDAs is their ability to support continuous bidirectional interaction with the user. This involves acquiring information about the user's state and environment and providing feedback in various forms: these agents can, for instance, engage in natural language dialogues, offering a more intuitive and effective means of interaction. While traditional recommender systems have paved the way, the future lies in developing more interactive and responsive digital assistants that, by leveraging real-time data, can provide personalised and context-aware support.

In the context of healthcare, these agents aim to provide personalised medical assistance, ranging from health monitoring to medical advice, by delivering precise, reliable, and explainable healthcare recommendations [1]. With the rise in chronic diseases and an ageing population, they have been particularly adopted for assisting patients, offering 24/7 support, thereby reducing the burden on healthcare professionals and improving patient outcomes [23]. These digital assistants can integrate with various health monitoring devices, such as wearable fitness trackers, smartwatches, and home medical equipment, to gather real-time health data. This data is then analysed to provide insights into the user's health status, offer reminders for medication, suggest lifestyle changes, and even predict potential health issues before they become critical.
Additionally, PMDAs have been used to support the work of caregivers [21]. PMDAs leverage a variety of technologies, integrating wearable devices and the Internet of Medical Things (IoMT) to acquire data from the environment and exploiting different cognitive algorithms to exhibit intelligent functionalities, thereby providing comprehensive support to users. The literature also reports several examples of embedded data-driven AI. Generally, the concept of learning and adapting plans based on experience is addressed through reinforcement learning techniques, enabling a BDI agent to have some plans explicitly programmed while others are learned during its lifecycle [24]. Another instance of embedded AI is presented in [25], where a decision tree is used to define the optimal set of plans for a BDI agent. Similarly, another study explores using decision trees to enhance BDI agents with learning capabilities for plan selection [26]. Moreover, when discussing these topics, the need for a proper reference ontology is recognised. For instance, the work presented in [27] introduces a multi-agent system architecture that integrates neural network and symbolic methods for constructing and updating ontologies, enhancing autonomous agent cooperation and mutual monitoring through semantic similarity derivation.

In this paper, we focus on the design and discussion of the internal mechanisms of these agents. By delving into the internal architecture of PMDAs, we aim to highlight a model that makes these assistants reliable and interpretable. Given the identified gaps in the literature, we propose an architecture that effectively leverages the advantages of data analytics and the inference capabilities derived from well-grounded medical knowledge. Our proposed architecture integrates real-time health data processing, advanced machine learning algorithms, and established medical guidelines to provide accurate, personalised, and reliable healthcare assistance. This comprehensive approach ensures that the digital assistant not only processes vast amounts of health data efficiently but also applies robust medical reasoning to deliver high-quality care recommendations.

3. An Integrated Architecture

The hybrid integrated agent architecture proposed in this paper builds on our previous works [28, 21], where we envisioned different possible models for letting the outcomes of data-driven models influence the agent cycle in the definition of the plans, strategies, or actions to be performed. These integrations delineate various approaches for how data-driven models may be embedded within the life-cycle of a logical agent, assuming that the agent is specifically designed as a BDI agent [29, 30], although the same discussion applies to any logical agent architecture. A simplified version, which summarises the main elements of the proposal in [28, 21], is reported in Figure 1.

Figure 1: Flows of information and control between the BDI agent and a ML model.

The first architecture (case (a)) positions the ML model as an input for the agent's knowledge base, manipulating its constructs, such as goals or beliefs. Depending on the specific design, the agent's goals define whether it accepts or rejects the KB update proposed by the ML model. This allows for adaptiveness by expanding or contracting the agent's range of activities based on data collected from the operational domain. Conversely, case (b) inverts the roles of supervision and intervention.
The agent can, for instance, operate on the inner workflow of an ML model, adjusting parameters or modifying the learning process, for example by injecting knowledge at different steps of the learning pipeline. The final architecture (case (c)) presents a collaborative model where the agent and the data-driven models are peers, providing outputs to an arbiter responsible for making final decisions. This approach facilitates adaptiveness, safety, and other desired properties through argumentation, where both parties engage in a dialogue overseen by the arbiter. Each architecture presents technical challenges, particularly regarding the impact on agent autonomy and integration with other technologies. While not entirely novel, these architectures draw from existing research in belief revision, automated planning, reinforcement learning, and explainable AI paradigms.

In this paper, we present our proposal in alignment with integration type (a) and address the various challenges it entails, with a particular emphasis on enhancing the reliability and explainability of agent decisions. Building on our previous work [21], where ML models suggested actions based on their predictive capabilities without explaining the reasoning behind those suggestions, we now focus on methods to enhance predictive capabilities while also providing explanations for those predictions. Moreover, in this improved version, our objective is not to interfere with the agent's actions by prioritising data-driven predictions, but rather to enrich the agent's knowledge, upon which the agent's strategy is defined. We achieve this by approximating the ML black-box model with an interpretable rule-based model, extracting rules from the trained ML model, and integrating these rules into the agent's knowledge base. Issues arise when the extracted rules do not align with the predefined plans, and the policy for retaining rules strongly depends on the agent's goals. More details are provided in the next section, which exemplifies possible solutions through the case study.

4. Materials and methods

4.1. Dataset and domain knowledge

The dataset analysed in this study is the Pima Indians Diabetes dataset, compiled by the National Institute of Diabetes and Digestive and Kidney Diseases from a study of the Pima Indian population, known for its high diabetes incidence [10]. The dataset comprises 768 medical profiles of women aged 21 and above, who underwent an Oral Glucose Tolerance Test (OGTT) to measure their glucose and insulin levels at two hours. The target variable is binary, indicating a diabetes diagnosis within five years. Details about the features available in the dataset can be found in Table 1. Missing values are present in the attributes 𝐼120 (48.70%), 𝑆𝑇 (29.56%), 𝐵𝑃 (4.55%), 𝐵𝑀𝐼 (1.43%), and 𝐺120 (0.65%), and were imputed in this work with the median value of the respective variable, as reported in the literature [31]; a minimal sketch of this preprocessing step is given below.
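As a concrete illustration, the following Python sketch loads the data and performs the median imputation described above. It assumes the column names and file layout of the public Kaggle CSV, where missing readings are encoded as zeros; it is a minimal reconstruction of this step, not the authors' exact preprocessing code.

```python
import pandas as pd

# Hypothetical local copy of the public Kaggle CSV.
df = pd.read_csv("diabetes.csv")

# In this dataset, missing measurements are encoded as 0 in these columns.
missing_as_zero = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
for col in missing_as_zero:
    median = df.loc[df[col] != 0, col].median()  # median of observed values only
    df[col] = df[col].replace(0, median)

X = df.drop(columns="Outcome")  # the 8 clinical features of Table 1
y = df["Outcome"]               # 1 = diabetes diagnosed within five years
```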
Table 1: Pima Indians Diabetes dataset.

Feature name                 Code    Description
Pregnancies                  —       Number of times pregnant
Glucose                      𝐺120    2-hour plasma glucose concentration in OGTT in mg/dL
Blood Pressure               𝐵𝑃      Diastolic blood pressure in mmHg
Skin Thickness               𝑆𝑇      Triceps skin-fold thickness in mm
Insulin                      𝐼120    2-hour serum insulin in μU/mL
Body mass index              𝐵𝑀𝐼     Body mass index as weight/(height)² in kg/m²
Diabetes Pedigree Function   𝐷𝑃𝐹     Likelihood function of diabetes based on family history [10]
Age                          —       Age in years

Public health guidelines on type-2 diabetes risk report that individuals with a high 𝐵𝑀𝐼 (≥ 30) and a high blood glucose level (𝐺120 ≥ 126) are at severe risk for diabetes, while those with a normal 𝐵𝑀𝐼 (≤ 25) and a low blood glucose level (𝐺120 ≤ 100) are less likely to develop diabetes. These guidelines have been used to design rules [32] expressed as logic predicates (see Table 2), which represent the Knowledge Base (KB) for the current case study.

Table 2: Knowledge base for predicting risk of type-2 diabetes as formalised by Kunapuli et al. (2010) [32].

Rule 1: (𝐵𝑀𝐼 ≥ 30) ∧ (𝐺120 ≥ 126) ⟹ diabetes
Rule 2: (𝐵𝑀𝐼 ≤ 25) ∧ (𝐺120 ≤ 100) ⟹ healthy

4.2. Machine learning models and rule extraction

In this study, six ML classifiers were used, namely Decision Tree (DT), Gradient Boosting (GB), Multi-Layer Perceptron (MLP), Logistic Regression (LR), Random Forest (RF), and K-Nearest Neighbour (KNN). Performance evaluation encompassed a range of metrics, including Accuracy (A), Precision (P), Recall (R), F1 score (F1), Balanced Accuracy (BA), and Matthews Correlation Coefficient (MCC), as well as the True Positive (TPR), True Negative (TNR), False Positive (FPR), and False Negative (FNR) rates. Cross-validation was employed with an extensive parameter search for hyperparameter optimisation of each model. In particular, a nested cross-validation approach was adopted, comprising 10 outer folds for evaluation and 5 inner folds for hyperparameter tuning. Performance metrics were computed for each outer fold using the model parameters optimised in the inner folds, and the average and standard deviation of these metrics were calculated to provide a comprehensive picture of the models' performance. The models with the highest R and the lowest FNR (GB and DT) were selected for further investigation.

The two selected models were retrained with a two-fold cross-validation procedure: one half of the dataset was used for training, incorporating nested 3-fold cross-validation for hyperparameter tuning, and the other half for testing. The training and test sets were alternated between folds to test over the entire dataset while reducing computational costs. Following training, the GB and DT models were converted into rule sets. Rule-based interpretable models approximating the predictions of GB were derived via rule extraction using CART [33], available from the PSyKE library [34], resulting in rule sets denoted as GB-CART. The maximum number of leaves, and hence rules, in the CART rule-extraction process was varied from 5 to 30 and ultimately set to 20 to maximise fidelity, which was evaluated in terms of accuracy and F1-score of the rule set with respect to the black-box model. Conversely, rule extraction was not required for DT, which can be converted into a rule set directly by translating each root-to-leaf path into an if-then rule. The rule sets representing the trained models (GB-CART and DT) were used to predict outcomes in the test set; the sketch below illustrates the core of this extraction step.
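The study uses PSyKE's CART extractor; rather than relying on that API, the sketch below (continuing the previous one) reproduces only the core idea with plain scikit-learn: a surrogate decision tree with at most 20 leaves is fitted to the black-box model's predictions, its root-to-leaf paths are printed as rules, and fidelity is measured as agreement with the black box on held-out data. Parameter choices beyond the 20-leaf limit are illustrative assumptions.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# 50-50 split, mirroring the two-fold procedure described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

black_box = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# The surrogate imitates the black box's predictions, not the true labels:
# this is what makes it a rule *extraction* rather than a new classifier.
surrogate = DecisionTreeClassifier(max_leaf_nodes=20, random_state=0)
surrogate.fit(X_train, black_box.predict(X_train))

# Fidelity = agreement between the extracted rule set and the black box.
fidelity = (surrogate.predict(X_test) == black_box.predict(X_test)).mean()
print(f"Fidelity on the test half: {fidelity:.3f}")

# Each root-to-leaf path of the printed tree reads as one if-then rule.
print(export_text(surrogate, feature_names=list(X.columns)))
```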
Additionally, modified rule sets (GB-CART + KB and DT + KB), integrating the two rules from the KB, were also used for prediction. In this integration, rules from the KB are assigned priority over those derived by ML, so that if an instance satisfies the conditions of multiple rules, priority is given to the rules from the KB. The performance metrics detailed above, as well as coverage, which measures the proportion of dataset samples accounted for by the rule set, were computed for the clinical protocol, the two ML-derived rule sets (GB-CART, DT), and the two integrated rule sets (GB-CART + KB, DT + KB).

4.3. Personal medical digital assistant with knowledge update

A PMDA was developed in Java, incorporating a knowledge base in tuProlog [35] (https://github.com/tuProlog/2p-java). The system consists of three main components: the environment, the health monitor agent, and the knowledge base. The environment simulates a patient's health data, including glucose levels, blood pressure, BMI, insulin levels, age, diabetes pedigree function, pregnancies, and skin thickness. These parameters are initialised with example values and are dynamically updated to simulate changes over time. The health monitor agent interfaces with a Prolog-based reasoning engine (tuProlog) and loads a Prolog knowledge base containing rules for assessing diabetes risk. During each reasoning cycle, the agent performs the following steps: (a) sensing, retrieving current health data from the environment and updating the Prolog knowledge base with these values; (b) reasoning, evaluating the updated knowledge base against the predefined Prolog rules to determine the next action; (c) acting, setting the risk status in the environment to either high or low based on the reasoning outcome and communicating an appropriate message to the user to help manage the risk of diabetes. Specifically, the agent communicates "Alert: Diabetes Risk is High!" if the risk is evaluated as high, "Well done! Diabetes Risk is Low." if the risk is low, and no message is given if the risk could not be assessed because the patient's parameters did not satisfy the conditions of any rule in the knowledge base (which occurs when rule coverage is incomplete). Additionally, an explanation is provided to the patient, listing the specific conditions that were met, which determined the risk prediction. In the main simulation loop, the health monitor agent runs for 1000 iterations, updating health parameters randomly within specified ranges to simulate real-world variability. Each iteration involves the agent performing its sensing, reasoning, and acting cycle.

The agent can update the knowledge base by adding rules extracted and presented by an ML model trained on data. To this aim, GB was retrained on 50% of the dataset, using nested 3-fold cross-validation for hyperparameter tuning, while the remaining 50% was used for testing. Eight rules were extracted from the trained model using CART to balance fidelity and rule set size. Each extracted rule was evaluated on the test set in terms of coverage, accuracy, number of conditions, and the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The rules, along with their corresponding metrics, were proposed to the agent; a simplified sketch of the agent's rule-based reasoning cycle is given below.
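The actual PMDA is implemented in Java with a tuProlog knowledge base; as a language-neutral illustration, the following Python sketch mimics one sensing-reasoning-acting cycle, representing each rule as a list of (feature, operator, threshold) conditions and applying rules first-match, with the protocol rules of Table 2 ahead of the ML-derived ones. The single ML rule shown is rule 1 of Table 5 (thresholds taken from Section 5.3); the environment and all names are illustrative.

```python
import operator
import random

# Condition operators the rules may use.
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

# Protocol rules (Table 2) listed first, so they take priority over ML rules.
KB_RULES = [
    ([("BMI", ">=", 30), ("G120", ">=", 126)], "diabetes"),
    ([("BMI", "<=", 25), ("G120", "<=", 100)], "healthy"),
]
ML_RULES = [
    ([("G120", ">", 143.5), ("DPF", ">", 0.32)], "diabetes"),  # rule 1, Table 5
]

def reason(patient, rules):
    """Return (outcome, firing conditions) for the first rule that applies."""
    for conditions, outcome in rules:
        if all(OPS[op](patient[feat], thr) for feat, op, thr in conditions):
            return outcome, conditions
    return None, None  # incomplete coverage: no rule fired, no message given

def cycle(sense):
    patient = sense()                                       # (a) sensing
    outcome, why = reason(patient, KB_RULES + ML_RULES)     # (b) reasoning
    if outcome == "diabetes":                               # (c) acting
        print("Alert: Diabetes Risk is High!", "Conditions met:", why)
    elif outcome == "healthy":
        print("Well done! Diabetes Risk is Low.", "Conditions met:", why)

# Toy environment: random values within plausible clinical ranges.
sense = lambda: {"BMI": random.uniform(18, 45),
                 "G120": random.uniform(70, 200),
                 "DPF": random.uniform(0.1, 2.0)}
for _ in range(5):
    cycle(sense)
```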
The agent selected rules for inclusion in the current knowledge base based on the following criteria: (i) include rules that introduce zero false negatives, to reduce cases of undiagnosed diabetes; (ii) include rules with at least 50% accuracy, to maximise correctness; (iii) include rules with at least 5% coverage, to avoid knowledge fragmentation; (iv) include rules with a maximum of three conditions, for improved readability of the explanation offered to the patient.

5. Results and discussion

5.1. ML performance

The initial performance comparison of the ML models trained on the Pima Indians Diabetes dataset is summarised in Table 3. All models report a modest accuracy, ranging from 0.706 for the MLP to 0.762 for LR. Among these models, LR stands out with the highest ability to correctly predict the negative class (healthy individuals), as evidenced by the lowest FPR, highest TNR, and highest P. This strength in evaluating negative instances, which are more abundant in the dataset, reflects on the global performance metrics, making LR also the top scorer in A and MCC. On the other hand, GB reports the best ability to correctly predict the positive class (diabetic individuals), with the highest TPR and the lowest FNR, as well as the best scores for R, F1, and BA. DT is also notable for its diabetes prediction capabilities, being the only other model besides GB to achieve a recall above 0.6. Given the considered clinical scenario, where the correct classification of positive instances is crucial and recall is typically the metric to optimise, GB and DT are identified as the best-performing models and considered for integration into personal medical assistant agents.

Table 3: Evaluation metrics for ML models trained on the Pima Indians Diabetes dataset, averaged over 10 model instances trained during 10-fold cross-validation. The best value for each metric is the highest one, except for FPR and FNR, for which it is the lowest.

Model                    A      P      R      F1     BA     MCC    TNR    FPR    FNR    TPR
Gradient Boosting        0.750  0.645  0.638  0.639  0.724  0.450  0.527  0.124  0.126  0.223
Decision Tree            0.741  0.636  0.623  0.624  0.713  0.432  0.523  0.128  0.132  0.217
K-Nearest Neighbour      0.743  0.653  0.589  0.614  0.708  0.429  0.538  0.113  0.143  0.206
Random Forest            0.759  0.686  0.582  0.626  0.718  0.456  0.556  0.095  0.146  0.203
Logistic Regression      0.762  0.714  0.560  0.619  0.715  0.464  0.566  0.085  0.154  0.195
Multi-Layer Perceptron   0.706  0.608  0.488  0.530  0.655  0.333  0.535  0.116  0.178  0.170

A deeper analysis of the predictions made by each model, compared with those made by the clinical protocol and the actual outcomes, is illustrated in Figure 2. The heatmap is divided into regions based on whether the clinical protocol correctly predicts positive and negative instances. It can be observed that the coverage of the clinical protocol is relatively low, at about 34.5%, leaving many cases, primarily healthy individuals, without a diagnosis. Additionally, the protocol produces false positives (region 3) but no false negatives, which is highly desirable in a clinical setting, where a positive outcome typically leads to further specialised tests for confirmation, whereas a negative outcome usually does not prompt further examination. This characteristic should ideally be preserved in updated models, as models that introduce undiagnosed cases are less likely to be adopted in clinical practice. Examining the predictions of the ML models in detail reveals several insights.
In region 1, which includes diabetic cases correctly predicted by the protocol, all models make some mistakes. This suggests that replacing the current protocol with any of these models could pose potential risks, as cases previously predicted correctly might become cases of undiagnosed diabetes. In region 2, which consists of diabetic cases for which the protocol could not make predictions, all models perform poorly, with only a fraction of cases being correctly identified as diabetic. This indicates that the available features are not sufficiently predictive for these cases. Nevertheless, some patients in this region are correctly identified by multiple models, suggesting the possibility of identifying criteria to correctly classify these patients and augmenting the protocol with rules that increase coverage in this region. In region 3, which includes cases incorrectly classified as diabetic by the protocol, most models also classify these instances as diabetic. This suggests that the features are not sufficiently predictive for this group of patients. However, some patients in this region are correctly identified by multiple models, indicating the potential for identifying rules to replace existing ones and reduce false positives, thereby mitigating over-triage. This update, however, takes lower priority, as we are mainly concerned with reducing false negatives, and will be addressed in future work. In region 4, which includes healthy individuals correctly predicted by the protocol, all models also predict these patients as healthy, maintaining consistency with the protocol. In region 5, which consists of healthy individuals for whom the protocol cannot give a prediction, all models correctly predict most patients. This again suggests the potential for augmenting the knowledge base with rules that address this region and improve coverage.

Figure 2: Predictions generated by six ML models trained on the Pima Indians Diabetes dataset using 10-fold nested cross-validation with hyperparameter tuning. Predictions from the clinical protocol and the actual outcomes are also reported. Five regions are highlighted based on protocol predictions.

5.2. KB integration

Interpretable rule-based models, whether inherently interpretable, such as DT, or obtained by rule extraction from black-box models, such as GB with CART, were evaluated against the protocol before (DT, GB-CART) and after (DT + KB, GB-CART + KB) integrating the rules of the clinical protocol. The performance metrics reported in Table 4 demonstrate the effectiveness of the clinical protocol, which achieves very high performance over the covered instances, including a 0% FNR and, consequently, perfect recall. However, this high performance comes with a limited coverage of only 34.5%. In contrast, all ML-derived rule sets offer full coverage. The integrated rule sets, DT + KB and GB-CART + KB, report improved performance across almost all metrics with respect to DT and GB-CART. FNR, which we want to minimise, is reduced by at least 25% for both models, and TPR is similarly increased, while R increases from 0.56 to 0.66 for GB-CART and from 0.58 to 0.71 for DT. The integration increases the overall number of patients predicted diabetic, thus reducing TNR and increasing FPR, although by a lesser amount.
Performance metrics evaluating both classes (A, F1, BA, and MCC) all report improvement as a result of the integration.

Table 4: Evaluation metrics computed for rule sets on the Pima Indians Diabetes dataset. Included rule sets are the clinical protocol formalising the Knowledge Base (KB), the Decision Tree (DT), the rules extracted from Gradient Boosting (GB) using CART (GB-CART), and the composite rule sets DT + KB and GB-CART + KB.

Rule set            A      P      R      F1     BA     MCC    TNR    FPR    FNR    TPR    Coverage
Clinical Protocol   0.755  0.700  1.000  0.824  0.712  0.545  0.181  0.245  0.000  0.574  0.345
GB-CART             0.734  0.636  0.560  0.595  0.694  0.401  0.539  0.112  0.154  0.195  1.000
GB-CART + KB        0.754  0.644  0.660  0.652  0.732  0.462  0.523  0.128  0.118  0.230  1.000
DT                  0.711  0.586  0.582  0.584  0.681  0.363  0.508  0.143  0.146  0.203  1.000
DT + KB             0.727  0.590  0.713  0.645  0.723  0.431  0.478  0.173  0.100  0.249  1.000

Figure 3 illustrates how integrating rules from the knowledge base (KB) impacts predictions. Adding Rule 1 from Table 2 to the ML-derived models corrects predictions from healthy to diabetic for patients in region 1, but also introduces false positives in region 3, which are less critical than false negatives in the considered scenario. Rule 2 has minimal impact, as all rule sets agree on this patient subgroup. By incorporating KB rules with priority, the integrated models align perfectly with the predictions of the clinical protocol. Additionally, they provide full coverage, correctly identifying a fraction of the diabetic patients in region 2 and most healthy individuals in region 5. This integrated approach combines the high recall of the original protocol with the full coverage and richer knowledge base derived from ML.

Figure 3: Predictions generated by rule sets over the Pima Indians Diabetes dataset, including the clinical protocol formalising the Knowledge Base (KB), the Decision Tree model trained on data (DT), the rule set extracted from the Gradient Boosting using CART (GB-CART), as well as the composite rule sets DT + KB and GB-CART + KB, integrating protocol rules with priority. Additionally, the actual outcomes are included. The highlighted data subsets correspond to the same regions depicted in Figure 2.

5.3. KB integration with agent requirements

The integration of a clinical protocol with additional rules extracted from ML models offers significant potential for personal medical assistant agents. However, the decision to include new rules may vary depending on the agent's role and specific criteria. For instance, agents tasked with diagnosing critical conditions may prioritise rules with a null FNR to minimise the risk of undiagnosed conditions. Agents aiming to maximise coverage while ensuring rule quality may select rules achieving a minimum level of accuracy. Conversely, agents aiming to prevent rule proliferation and knowledge fragmentation may only consider rules above a certain coverage threshold. Finally, agents providing simple explanations to users regarding prediction rationales may favour rules with a small number of conditions.

To address diverse agent requirements for potential knowledge updates based on ML recommendations, four (non-exhaustive) scenarios were explored. In each scenario, an agent predicted the patient's state based on sensed clinical parameters and an internal knowledge base, with the possibility of updating this knowledge base with rules extracted by ML according to internal quality criteria; a sketch of such a criteria-based filter is given below.
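To make the selection criteria concrete, the sketch below filters candidate rules by per-rule metrics, with one predicate per scenario mirroring criteria (i)-(iv) of Section 4.3. The metric dictionaries are transcribed from Table 5 (only the first three rules are shown for brevity); the data structure is an illustrative assumption, not the agent's actual representation.

```python
# Candidate rules with metrics from Table 5 (rules 1-3 shown for brevity).
rules = [
    {"id": 1, "conditions": 2, "fn": 0, "accuracy": 0.736, "coverage": 0.138},
    {"id": 2, "conditions": 3, "fn": 9, "accuracy": 0.357, "coverage": 0.036},
    {"id": 3, "conditions": 3, "fn": 0, "accuracy": 0.667, "coverage": 0.047},
]

# One acceptance predicate per scenario, mirroring criteria (i)-(iv).
scenarios = {
    "(i) no false negatives":   lambda r: r["fn"] == 0,
    "(ii) accuracy >= 50%":     lambda r: r["accuracy"] >= 0.5,
    "(iii) coverage >= 5%":     lambda r: r["coverage"] >= 0.05,
    "(iv) at most 3 conditions": lambda r: r["conditions"] <= 3,
}

for name, keep in scenarios.items():
    selected = [r["id"] for r in rules if keep(r)]
    print(f"{name}: add rules {selected} to the knowledge base")
```

Applied to all eight rules of Table 5, these predicates reproduce the four scenario selections reported below.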
The eight extracted rules and their performance metrics computed over the dataset are reported in Table 5. Based on the rule metrics and quality criteria, the first scenario includes rules 1, 3, 5, 6, and 8, while the second scenario incorporates rules 1, 3, 4, 5, and 7. Similarly, in the third scenario, rules 1, 4, and 7 are selected, whereas in the fourth scenario, all rules except 7 and 8 are added. Figure 4 illustrates the predictions made by the four updated knowledge bases over the original dataset. Additional scenarios can be explored, particularly by combining the criteria mentioned above or applying different criteria to rules predicting healthy and diseased outcomes. For example, by considering only rules with a null FNR and an accuracy exceeding 70%, only rule 1 would be added. This rule predicts individuals as diabetic if their 𝐺120 value is greater than 143.5 and their 𝐷𝑃𝐹 value exceeds 0.32. Remarkably, this rule correctly predicts 24 individuals, constituting 3% of the dataset and 9% of the diabetic patients in the dataset, who could not be predicted by the original protocol. This illustrates that even minor updates to the knowledge base can greatly enhance the predictive capabilities of the monitoring agent.

Table 5: Performance metrics computed on the test set for the 8 rules extracted from a Gradient Boosting model trained on the Pima Indians Diabetes dataset with a 50-50 train-test split.

Rule  #Conditions  Outcome   Total  Correct  #TP  #TN  #FN  #FP  Accuracy  Coverage
1     2            Diabetes  53     39       39   0    0    14   0.736     0.138
2     3            Healthy   14     5        0    5    9    0    0.357     0.036
3     3            Diabetes  18     12       12   0    0    6    0.667     0.047
4     3            Healthy   244    191      0    191  53   0    0.783     0.635
5     3            Diabetes  7      4        4    0    0    3    0.571     0.018
6     3            Diabetes  16     7        7    0    0    9    0.438     0.042
7     4            Healthy   30     20       0    20   10   0    0.667     0.078
8     4            Diabetes  2      0        0    0    0    2    0.000     0.005

Figure 4: Predictions generated by the clinical protocol and by the protocol enhanced with rules extracted from a ML model and selected by the agent according to four different criteria.

6. Conclusions and future work

This study demonstrates the potential of integrating clinical protocols with ML-derived rules to enhance the performance of PMDA agents. By combining the robustness and trustworthiness of established medical protocols with the adaptive learning capabilities of ML, these hybrid models can offer more comprehensive and accurate diagnostic suggestions. The approach was validated using the PIMA dataset, focusing on reducing the number of false negatives, that is, patients likely to develop diabetes but not identified by medical protocols alone. The integration of additional rules extracted from ML models improved the predictive capabilities of the PMDA agents, ensuring both higher recall and full coverage. Future research will focus on refining integration techniques by investigating more sophisticated methods for combining ML-derived rules with clinical protocols, such as ensemble techniques. We also plan to validate our approach on a broader range of medical datasets to ensure generalisability across different conditions. Additionally, we aim to develop mechanisms for dynamic rule updates in real time as new data becomes available, maintaining the accuracy of PMDA agents.
Finally, enhancing the explainability of PMDA agents by incorporating user-friendly explanations in the form of natural language and visualisations is another avenue for further investigation.

Availability of data and code

The dataset analysed is publicly available (https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database), and the code to replicate the experiments can be found in the GitHub repository (https://github.com/ChristelSirocchi/hybrid-medical).

References

[1] A. Croatti, S. Montagna, A. Ricci, E. Gamberini, V. Albarello, V. Agnoletti, BDI personal medical assistant agents: The case of trauma tracking and alerting, Artificial Intelligence in Medicine 96 (2019) 187–197. doi:10.1016/j.artmed.2018.12.002.
[2] H. Hagras, Toward human-understandable, explainable AI, Computer 51 (2018) 28–36.
[3] G.-D. Hou, Y. Zheng, W.-X. Zheng, M. Gao, L. Zhang, N.-N. Hou, J.-R. Yuan, D. Wei, D.-E. Ju, X.-L. Dun, et al., A novel nomogram predicting the risk of positive biopsy for patients in the diagnostic gray area of prostate cancer, Scientific Reports 10 (2020) 17675.
[4] F. Piccialli, V. Di Somma, F. Giampaolo, S. Cuomo, G. Fortino, A survey on deep learning in medicine: Why, how and when?, Information Fusion 66 (2021) 111–137.
[5] Z. Obermeyer, T. H. Lee, Lost in thought: the limits of the human mind and the future of medicine, The New England Journal of Medicine 377 (2017) 1209.
[6] L. Von Rueden, S. Mayer, K. Beckh, B. Georgiev, S. Giesselbach, R. Heese, B. Kirsch, J. Pfrommer, A. Pick, R. Ramamurthy, et al., Informed machine learning: a taxonomy and survey of integrating prior knowledge into learning systems, IEEE Transactions on Knowledge and Data Engineering 35 (2021) 614–633.
[7] F. Leiser, S. Rank, M. Schmidt-Kraepelin, S. Thiebes, A. Sunyaev, Medical informed machine learning: A scoping review and future research directions, Artificial Intelligence in Medicine 145 (2023) 102676.
[8] S. Kierner, J. Kucharski, Z. Kierner, Taxonomy of hybrid architectures involving rule-based reasoning and machine learning in clinical decision systems: A scoping review, Journal of Biomedical Informatics (2023) 104428.
[9] C. Sirocchi, A. Bogliolo, S. Montagna, Medical-informed machine learning: integrating prior knowledge into medical decision systems, BMC Medical Informatics and Decision Making 24 (Suppl 4) (2024) 186. doi:10.1186/s12911-024-02582-4.
[10] J. W. Smith, J. E. Everhart, W. Dickson, W. C. Knowler, R. S. Johannes, Using the ADAP learning algorithm to forecast the onset of diabetes mellitus, in: Proceedings of the Annual Symposium on Computer Application in Medical Care, American Medical Informatics Association, 1988, p. 261.
[11] E. Topol, High-performance medicine: the convergence of human and artificial intelligence, Nature Medicine 25 (2019) 44–56. doi:10.1038/s41591-018-0300-7.
[12] P. Rajpurkar, E. Chen, O. Banerjee, E. J. Topol, AI in health and medicine, Nature Medicine 28 (2022) 31–38. doi:10.1038/s41591-021-01614-0.
[13] S. Benjamens, P. Dhunnoo, B. Meskó, The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database, NPJ Digital Medicine 3 (2020) 118.
[14] J. J. Clinton, K. McCormick, J. Besteman, Enhancing clinical practice: The role of practice guidelines, American Psychologist 49 (1994) 30.
[15] Z. Qian, W. Zame, L. Fleuren, P. Elbers, M. van der Schaar, Integrating expert ODEs into neural ODEs: pharmacology and disease progression, Advances in Neural Information Processing Systems 34 (2021) 11364–11383.
[16] C. C. Yang, Explainable artificial intelligence for predictive modeling in healthcare, Journal of Healthcare Informatics Research 6 (2022) 228–239.
[17] Z. Obermeyer, T. H. Lee, Lost in thought: the limits of the human mind and the future of medicine, New England Journal of Medicine 377 (2017) 1209–1211.
[18] G. Ciatto, F. Sabbatini, A. Agiollo, M. Magnini, A. Omicini, Symbolic knowledge extraction and injection with sub-symbolic predictors: A systematic literature review, ACM Computing Surveys 56 (2024). doi:10.1145/3645103.
[19] M. Magnini, G. Ciatto, F. Cantürk, R. Aydoğan, A. Omicini, Symbolic knowledge extraction for explainable nutritional recommenders, Computer Methods and Programs in Biomedicine 235 (2023) 107536. doi:10.1016/j.cmpb.2023.107536.
[20] M. Cranmer, A. Sanchez-Gonzalez, P. Battaglia, R. Xu, K. Cranmer, D. Spergel, S. Ho, Discovering symbolic models from deep learning with inductive biases, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, Curran Associates Inc., Red Hook, NY, USA, 2020.
[21] S. Montagna, S. Mariani, E. Gamberini, Augmenting BDI agency with a cognitive service: Architecture and validation in healthcare domain, Journal of Medical Systems 45 (2021) 103. doi:10.1007/s10916-021-01780-1.
[22] P. Maes, Agents that reduce work and information overload, Communications of the ACM 37 (1994) 30–40. doi:10.1145/176789.176792.
[23] D. Calvaresi, D. Cesarini, P. Sernani, M. Marinoni, A. F. Dragoni, A. Sturm, Exploring the ambient assisted living domain: a systematic review, Journal of Ambient Intelligence and Humanized Computing 8 (2017) 239–257. doi:10.1007/s12652-016-0374-3.
[24] M. Bosello, A. Ricci, From programming agents to educating agents: a Jason-based framework for integrating learning in the development of cognitive agents, in: L. A. Dennis, R. H. Bordini, Y. Lespérance (Eds.), Engineering Multi-Agent Systems, Springer International Publishing, Cham, 2020, pp. 175–194.
[25] D. Singh, S. Sardiña, L. Padgham, G. James, Integrating learning into a BDI agent for environments with changing dynamics, in: T. Walsh (Ed.), IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011, IJCAI/AAAI, 2011, pp. 2525–2530. doi:10.5591/978-1-57735-516-8/IJCAI11-420.
[26] A. Guerra-Hernández, A. El Fallah-Seghrouchni, H. Soldano, Learning in BDI multi-agent systems, in: J. Dix, J. Leite (Eds.), Computational Logic in Multi-Agent Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 218–233.
[27] D. Rosaci, CILIOS: Connectionist inductive learning and inter-ontology similarities for recommending information agents, Information Systems 32 (2007) 793–825. doi:10.1016/j.is.2006.06.003.
[28] S. Montagna, S. Mariani, E. Gamberini, A. Ricci, F. Zambonelli, Complementing agents with cognitive services: A case study in healthcare, Journal of Medical Systems 44 (2020) 188. doi:10.1007/s10916-020-01621-7.
[29] M. E. Bratman, D. J. Israel, M. E. Pollack, Plans and resource-bounded practical reasoning, Computational Intelligence 4 (1988) 349–355.
[30] M. P. Georgeff, A. L. Lansky, Reactive reasoning and planning, in: Proceedings of the Sixth National Conference on Artificial Intelligence - Volume 2, AAAI'87, AAAI Press, 1987, pp. 677–682.
[31] H. B. Kibria, M. Nahiduzzaman, M. O. F. Goni, M. Ahsan, J. Haider, An ensemble approach for the prediction of diabetes mellitus using a soft voting classifier with an explainable AI, Sensors 22 (2022) 7268.
[32] G. Kunapuli, K. P. Bennett, A. Shabbeer, R. Maclin, J. Shavlik, Online knowledge-based support vector machines, in: Machine Learning and Knowledge Discovery in Databases: European Conference, 2010, Proceedings, Part II 21, Springer, 2010, pp. 145–161.
[33] L. Breiman, Classification and regression trees, Routledge, 2017.
[34] F. Sabbatini, G. Ciatto, R. Calegari, A. Omicini, Symbolic knowledge extraction from opaque ML predictors in PSyKE: Platform design & experiments, Intelligenza Artificiale 16 (2022) 27–48.
[35] E. Denti, A. Omicini, A. Ricci, Multi-paradigm Java–Prolog integration in tuProlog, Science of Computer Programming 57 (2005) 217–250. doi:10.1016/j.scico.2005.02.001.