Symbolic Regression for Medical Scoring Systems: a Bayesian and Multi-Objective Approach

Mattia Billa (mattia.billa@unimore.it)
Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Univ. Modena e Reggio Emilia, Modena, Italy

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy

Abstract
Scoring systems play an important role in high-stakes domains, such as medicine, by quantifying complex phenomena through the combination of data features, thereby assisting decision-making processes and clinical research. Traditional methods often rely on linear models, which may struggle to capture the complexity inherent in data. Recently, Symbolic Regression has emerged as a promising alternative, offering the ability to construct nonlinear models that are both interpretable and accurate. However, this approach faces some limitations, including a lack of uncertainty awareness and difficulties in adapting to non-IID scenarios such as those found in Federated and Continual Learning settings. We propose a novel data-driven approach that integrates Symbolic Regression with Bayesian Inference and Multi-Objective Optimization. By combining these methodologies, our approach aims to address both uncertainty quantification and adaptability in Continual and Federated Learning scenarios. Initial experiments on clinical data have shown promising results, highlighting the potential of the proposed framework for improving the reliability and applicability of scoring systems in medical contexts.

Keywords
Scoring Systems, Symbolic Regression, Federated Learning, Continual Learning, Bayesian Inference, Multi-Objective Optimization

1. Introduction
The integration of Artificial Intelligence (AI) and medicine has given rise to an innovative approach known as P4 medicine. P4 medicine combines predictive, preventive, personalized, and participatory healthcare, presenting a paradigm shift in healthcare delivery [1]. Data from Electronic Health Records (EHRs), Patient Reported Outcomes (PROs), and wearable devices, combined with statistical and machine learning methods, have significant potential in clinical research, decision support, and knowledge discovery.
However, data-driven healthcare applications present unique challenges due to the constantly evolving and often poorly controlled environment in which they are developed. Clinical data, for example, is highly sensitive, expensive to acquire and curate, and subject to complex governance policies. Moreover, medical data is often distributed across multiple healthcare institutions and facilities. In specific scenarios, combining knowledge from multiple institutions becomes necessary to improve model performance, overcome data scarcity, or validate results. Nevertheless, this should be done while safeguarding patient privacy and data security. A possible solution to address these challenges is Federated Learning [2]. This approach allows models to be trained across decentralized devices or servers without exchanging raw data, thereby preserving the confidentiality of sensitive information.
Overcoming these obstacles, while maintaining model interpretability, is crucial to ensure the successful integration of AI in healthcare and to maximize its potential for improving patient care and outcomes [3]. Recent machine learning methods for interpretable scoring systems mostly employ linear classification models [4], hence assuming a fixed index form. An alternative approach is Symbolic Regression [5], a technique aiming to discover mathematical expressions approximating a dataset without relying on predefined functional forms.
The primary objective of this contribution is to address both result interpretability and data distribution issues for the development of nonlinear scoring systems. We accomplish this by combining Symbolic Regression, Multi-Objective Optimization [6], and parametric Bayesian Inference. Since the current approach to Symbolic Regression does not take epistemic and aleatoric uncertainty into account, we propose to integrate Bayesian Inference [7] into our framework. We also plan to exploit the resulting uncertainty quantification, together with Multi-Objective Optimization, to address Continual and Federated Learning in non-IID scenarios [8].
The rest of the paper is structured as follows: Section 2 provides an overview of related and background material; Section 3 describes the proposed approach; Section 4 presents some preliminary results; lastly, Section 5 concludes with final remarks.

2. Background and related work

2.1. Scoring systems
Scoring systems are mathematical equations that combine elementary indicators to describe complex phenomena with a single value, providing decision-support tools. Examples in clinical settings include the BMI and the Charlson score [9]. Traditionally, domain experts have developed these scores using trial-and-error methods. However, current efforts concentrate on data-driven approaches, emphasizing the importance of interpretability in the generated models. Symbolic Regression is a potential solution to this end, consisting of finding a mathematical expression that best fits a given dataset without assuming a specific form beforehand [5]. This problem is usually solved using Genetic Programming (GP) [10], an evolutionary approach that encodes mathematical formulas as unary/binary trees (a minimal sketch of this encoding is given at the end of this section). By simulating a natural-selection process, GP selects the model that optimizes a particular loss function over a dataset. In the context of scoring system development, where data can be small and unbalanced, we are also interested in sample stratification and balancing, aside from accuracy. Therefore, previous work tackled these issues as a Multi-Objective Optimization (MOO) problem [11, 6], using the NSGA-II evolutionary algorithm [12].

2.2. Learning with non-IID data
Federated Learning. Growing concerns about data privacy and the need to combine knowledge from different facilities have led to the introduction of Federated Learning (FL). The goal of FL is to train a joint model in a decentralized way using data distributed across multiple devices [13]. FedAvg [14] was the first FL approach able to achieve good performance on distributed datasets assuming IID data. However, in real-world scenarios, data is usually heterogeneous, with different statistical distributions on each device. Therefore, subsequent research has investigated the convergence of FedAvg on non-IID data [15] and has proposed a regularized approach for heterogeneous networks [16].
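To make the aggregation step of FedAvg concrete, the following minimal sketch (an illustrative example, not code from any of the cited systems) shows the weighted parameter averaging performed by the server: each client trains on its local data and the server averages the resulting parameter vectors, weighted by local dataset size.

import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """Weighted average of client parameter vectors, as in FedAvg-style aggregation."""
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()                      # normalize by the total number of samples
    stacked = np.stack(client_params)             # shape: (n_clients, n_params)
    return (weights[:, None] * stacked).sum(axis=0)

# Hypothetical example: three clients with different data volumes.
local_models = [np.array([0.9, -1.2]), np.array([1.1, -0.8]), np.array([1.0, -1.0])]
local_sizes = [120, 40, 240]
global_model = fedavg_aggregate(local_models, local_sizes)

Under non-IID data, this simple weighted average can converge slowly or toward a biased model, which is precisely what motivates the regularized and personalized variants discussed next.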
Other approaches, such as client or group personalization, have been introduced. These approaches allow individual devices, or groups of similar devices, to acquire a personalized model. Additionally, techniques like Domain Transformation and Domain Adaptation aim to either transform the perceived input space into a common input space or measure dissimilarities between datasets, allowing the model being trained to be adjusted accordingly [8]. Recently, Bayesian Learning has also been considered in FL settings [17], taking advantage of its uncertainty quantification and its performance on limited and heterogeneous data. For example, pFedBayes [18] is a personalization approach that uses the global distribution as the prior of each client and minimizes the KL divergence between the local and the global distribution. In [19], online Laplace Approximation is used to approximate the local and global posteriors, reducing the aggregation error. FedBE [20], instead, relies on a Bayesian model ensemble to perform the aggregation step, achieving good performance on non-IID data.

Continual Learning. In scenarios characterized by dynamic data collection and ongoing updates to datasets, the data distribution may change over time, a phenomenon referred to as concept drift. This kind of time-wise heterogeneity is tackled by the Continual Learning (CL) paradigm, whose main challenge is so-called catastrophic forgetting, i.e., the forgetting of previously learned concepts. Two stages of learning under concept drift can be identified. The first is Drift Detection, whose goal is to determine whether a drift has occurred [8]; this can be done with Data Distribution-based methods, which rely on the statistical properties of the data distributions, or with Error Rate-based methods, which monitor the accuracy, or the uncertainty, of the model over time. The second stage is Drift Adaptation, which prevents the model from losing accuracy on new data while not forgetting the previously seen data. Two main strategies are used for Drift Adaptation: memory-based methods [21], also known as rehearsal, and regularization methods [22]. Other approaches to CL rely on Bayesian Learning. For example, [23] employs online variational inference, using the previous posterior distribution as the new prior and multiplying it with the likelihood of the new data. Similarly, [24] uses the uncertainty of the parameters, obtained through Bayesian Inference, to adjust their learning rates.
Only a few works have considered both Federated and Continual Learning settings, such as [25], which performs both Drift Detection and Adaptation using the uncertainty of the classifier over a sliding window and storing samples in a long-term memory. While most of these works rely on neural networks, little research has yet addressed Symbolic Regression in Continual and Federated Learning settings. For instance, [26] proposes a federated Genetic Programming framework based on the aggregation of the local fitness (the loss of the model), achieving better generalization performance compared to models trained only on local datasets. However, this approach does not account for the uncertainty of the model, for the relative importance that each dataset can have, or for the possibility of updating the model after receiving more data.
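Before presenting our approach, it is useful to make the object of the search concrete. In GP-based Symbolic Regression (Section 2.1), a candidate scoring formula is encoded as a unary/binary expression tree whose leaves are data features or numerical constants; these constants are exactly what the Bayesian extension of Section 3 will treat as random variables. The sketch below is purely illustrative (the Node class and the example formula are hypothetical, not the MOSR implementation):

import numpy as np

class Node:
    """A node of an expression tree: an operator, a data feature, or a numerical constant."""
    def __init__(self, op=None, children=(), feature=None, const=None):
        self.op, self.children, self.feature, self.const = op, children, feature, const

    def evaluate(self, X, consts):
        if self.feature is not None:               # leaf: column of the feature matrix
            return X[:, self.feature]
        if self.const is not None:                 # leaf: tunable numerical constant
            return np.full(X.shape[0], consts[self.const])
        args = [child.evaluate(X, consts) for child in self.children]
        return {"add": np.add, "mul": np.multiply, "exp": np.exp}[self.op](*args)

# Hypothetical candidate score: f(x, theta) = theta0 * x0 * exp(theta1 * x1)
tree = Node(op="mul", children=(
    Node(const=0),
    Node(op="mul", children=(
        Node(feature=0),
        Node(op="exp", children=(
            Node(op="mul", children=(Node(const=1), Node(feature=1))),
        )),
    )),
))

X = np.array([[70.0, 1.75], [85.0, 1.60]])         # toy feature matrix (two patients, two features)
print(tree.evaluate(X, consts=[3.0, -0.5]))        # one score per row

During the GP search, trees like this one are mutated and recombined, while the constants (consts above) are fitted to the data; in the next section these constants become random variables with prior distributions.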
3. A Bayesian and Multi-Objective approach to Symbolic Regression in non-IID scenarios
In Section 1, some of the limitations of previous work on Symbolic Regression have been introduced. The current multi-objective Symbolic Regression framework (MOSR) for scoring system development [6] does not take the uncertainty of the models into account, and it lacks a formal mechanism for incorporating prior knowledge or updating it as new data arrives. To tackle these issues, we introduce an extension of this framework based on Bayesian Inference.
The key idea behind the Bayesian extension of MOSR is to replace the numerical constants inside each model with random variables whose initial distributions encode prior knowledge, turning each model, whose constants were previously estimated through maximum likelihood estimation (MLE), into a Bayesian model. The inference process, based on a Markov Chain Monte Carlo (MCMC) algorithm [7], then returns a posterior distribution over the model's parameters that also captures the uncertainty of the model itself. For simplicity, we focus on a single model from this point onward. The standard assumption behind SR is that observations are normally distributed around their expected values, given by a nonlinear expression f(x, θ) involving the parameters θ, with standard deviation σ:

Y_i \sim \mathcal{N}(\mu_i, \sigma^2), \quad \text{with } \mu_i = f(X_i, \theta)

As new data is acquired, we can update the prior distribution of θ by iteratively using its previously approximated posterior distribution. According to Bayes' rule, a batch update of the posterior is equivalent to a sequential update:

p(\theta \mid \mathcal{D}_{1:n}) = \frac{p(\theta)\, p(\mathcal{D}_{1:n-1} \mid \theta)}{p(\mathcal{D}_{1:n-1})} \cdot \frac{p(\mathcal{D}_n \mid \theta)}{p(\mathcal{D}_n)}

where 𝒟_t is the subset of data available at time t, p(θ) is the prior of θ, p(𝒟 | θ) is the likelihood, and p(θ | 𝒟) is the posterior. The results of this approach are shown in the next section.
The proposed framework also aims to deal with Federated Learning scenarios by using Multi-Objective Optimization (MOO). The goal of MOO is to optimize problems with multiple, possibly conflicting, objective functions. In our case, each objective function is the loss, such as the BIC criterion, of the considered model on a local dataset. Therefore, the current MOSR framework is naturally extended by using the evolutionary algorithm to optimize the models over multiple datasets rather than over multiple objective functions on a single dataset. A recurring issue in MOO is the presence of non-comparable objective functions, i.e., objectives that cannot be directly compared in a meaningful way. In realistic Federated Learning settings, the datasets of the clients can differ in size, changing the magnitude of the corresponding loss functions. To make the posterior distributions comparable, we plan to use a fractional likelihood, introducing a temperature parameter.
At each generation, the server generates a population of candidate models, which are then sent to the clients. Each client computes the loss of each model on its local dataset and sends the evaluations back to the server. The server uses the received evaluations to select and evolve the population of models, optimizing the loss functions in a MOO manner. At the end of the process, the parameters of the models are inferred using the sequential procedure described above for CL, which is equivalent to estimating the parameters in a centralized scenario.
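As a concrete sketch of this sequential inference procedure, the snippet below fits the constants of a single (hypothetical) expression with MCMC and then reuses a Gaussian approximation of the resulting posterior as the prior for the next data batch. It assumes the PyMC library and synthetic data purely for illustration: the paper does not specify the probabilistic-programming tool, the sampler settings, or the expression, and carrying the posterior forward through its mean and standard deviation is just one simple choice of approximation.

import numpy as np
import pymc as pm

def fit_batch(X, y, prior_mu, prior_sd):
    """Infer theta and sigma for one data batch, given a Gaussian prior on theta."""
    with pm.Model():
        theta = pm.Normal("theta", mu=prior_mu, sigma=prior_sd, shape=2)
        sigma = pm.HalfNormal("sigma", sigma=5.0)
        # Hypothetical formula found by the GP search: f(x, theta) = theta0 * x0 * exp(theta1 * x1)
        mu = theta[0] * X[:, 0] * pm.math.exp(theta[1] * X[:, 1])
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)         # Y_i ~ N(f(X_i, theta), sigma^2)
        return pm.sample(draws=1000, chains=2, progressbar=False)  # NUTS by default

rng = np.random.default_rng(0)

def make_batch(n):                                 # synthetic stand-in for a data batch
    X = rng.uniform(0.5, 2.5, size=(n, 2))
    y = 3.0 * X[:, 0] * np.exp(-0.5 * X[:, 1]) + rng.normal(0.0, 0.2, size=n)
    return X, y

# Batch D_1: start from a weakly informative prior.
idata1 = fit_batch(*make_batch(80), prior_mu=np.zeros(2), prior_sd=np.full(2, 2.0))
post1 = idata1.posterior["theta"].stack(sample=("chain", "draw")).values   # shape (2, n_draws)

# Batch D_2: the (approximate) posterior of batch 1 becomes the new prior.
idata2 = fit_batch(*make_batch(80), prior_mu=post1.mean(axis=1), prior_sd=post1.std(axis=1))

Only the approximate posterior over θ is carried from one batch to the next.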
Importantly, no raw data is transmitted over the network, maintaining privacy and security.

4. Preliminary results
The Bayesian extension of MOSR has been tested on both real-world and synthetic datasets. The real-world data was sourced from a publicly available repository and from a private Electronic Health Record (EHR) system. To be concise while maintaining generality, we present the results only on the Body Fat Percentage dataset (https://www.kaggle.com/datasets/fedesoriano/body-fat-prediction-dataset), which contains body fat percentage estimates based on Siri's formula, along with 14 anthropometric measurements of 252 men. The density obtained from underwater weighing and the body fat percentage itself were removed from the input features: the former because it is not easy to obtain in practice, and the latter because it is the target we want to compare with the solutions produced by MOSR. The results are reported in Table 1, comparing different MCMC sampling algorithms (NUTS, Hamiltonian Monte Carlo, and Metropolis-Hastings) and sample sizes.

Table 1
Results on the Body Fat dataset.
Sampler   # samples   W_50   W_95   ESS bulk   ESS tail   R hat   Time (s)
NUTS      2000        39.1   95.6   2038.6     1722       1.00    89
NUTS      1000        36.9   95.6   1099.6     802.2      1.00    59
HMC       2000        41.3   95.6   1820.8     2002.2     1.00    39
HMC       1000        36.9   95.6   889.6      1067.8     1.00    28
MH        10000       39.1   95.6   416.0      465.4      1.012   57
MH        1000        36.9   95.6   42.0       63.4       1.084   10

To test the uncertainty quantification of the framework, we introduce two metrics: within 50 (W_50) and within 95 (W_95), which quantify the percentage of observed values that fall within their 50% and 95% posterior predictive intervals, respectively. We also report convergence diagnostics, namely the Effective Sample Size (ESS) and R̂. The results suggest that the framework captures uncertainty well at the 95% level, where coverage is close to the nominal value, but tends to be overconfident at the 50% level, where coverage falls below the nominal value. Concerning convergence, NUTS and HMC outperformed MH, even with fewer samples and comparable running times.
To assess the framework in a Continual Learning scenario, we split the Body Fat dataset into two distinct populations. The model is first trained on the first population, using the NUTS algorithm with two chains of 1000 samples each. The posterior is then sequentially updated using only the data of the second population. The results are shown in Table 2.

Table 2
Results on the Body Fat dataset, integrating the second population.
Population   W_50   W_95   ESS bulk   ESS tail   R hat   Time (s)
1            39.1   93.4   1099.6     802.2      1.00    110
1 and 2      43.4   1.00   822.4      977.8      1.00    190

Not only did the predictive performance of the model not decrease, but the integration of the second population also appears to have improved the regression performance and the posterior predictive intervals, especially for the 50% interval. This extension therefore does not seem to suffer from catastrophic forgetting.

5. Discussion and conclusion
In this contribution, we have addressed the challenge of building interpretable scoring systems in real-world scenarios characterized by distributed and evolving data. Selected works on Federated and Continual Learning have been reviewed, with a focus on Bayesian approaches, which exhibit relevant properties related to uncertainty quantification. Our proposed approach aims to integrate Symbolic Regression with parametric Bayesian Inference and MOO, ensuring both interpretability and data privacy. Initial testing on clinical data in a Continual Learning setup has shown promising results, demonstrating the potential of our framework in this kind of setting. In addition, a Federated Learning strategy that makes use of both Bayesian Learning and MOO has been introduced.
In future work, we plan to investigate the performance of this framework under both Continual and Federated Learning settings, simulating different degrees of data heterogeneity between clients.
Furthermore, scalability concerns will be addressed to ensure the efficacy of our approach as the number of clients and the amount of data increase.

References
[1] M. Flores, G. Glusman, K. Brogaard, N. D. Price, L. Hood, P4 medicine: how systems medicine will transform the healthcare sector and society, Personalized Medicine 10 (2013) 565–576. doi:10.2217/pme.13.57.
[2] J. Konečný, H. B. McMahan, D. Ramage, P. Richtárik, Federated optimization: Distributed machine learning for on-device intelligence, 2016. doi:10.48550/ARXIV.1610.02527.
[3] F. Mandreoli, D. Ferrari, V. Guidetti, F. Motta, P. Missier, Real-world data mining meets clinical practice: Research challenges and perspective, Frontiers in Big Data 5 (2022). doi:10.3389/fdata.2022.1021621.
[4] B. Ustun, C. Rudin, Supersparse linear integer models for optimized medical scoring systems, Machine Learning 102 (2016) 349–391.
[5] W. La Cava, P. Orzechowski, B. Burlacu, F. O. de França, M. Virgolin, Y. Jin, M. Kommenda, J. H. Moore, Contemporary symbolic regression methods and their relative performance, 2021. doi:10.48550/ARXIV.2107.14351.
[6] D. Ferrari, V. Guidetti, F. Mandreoli, Multi-objective symbolic regression for data-driven scoring system management, in: 2022 IEEE International Conference on Data Mining (ICDM), 2022, pp. 945–950. doi:10.1109/ICDM54844.2022.00112.
[7] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, D. B. Rubin, Bayesian data analysis, CRC Press, 2013.
[8] M. F. Criado, F. E. Casado, R. Iglesias, C. V. Regueiro, S. Barro, Non-IID data and continual learning processes in federated learning: A long road ahead, Information Fusion 88 (2022).
[9] M. E. Charlson, P. Pompei, K. L. Ales, C. MacKenzie, A new method of classifying prognostic comorbidity in longitudinal studies: Development and validation, Journal of Chronic Diseases 40 (1987) 373–383. doi:10.1016/0021-9681(87)90171-8.
[10] J. R. Koza, Genetic programming as a means for programming computers by natural selection, Statistics and Computing 4 (1994) 87–112.
[11] J. Kubalík, E. Derner, R. Babuška, Symbolic regression driven by training data and prior knowledge, in: Proc. of the 2020 Genetic and Evolutionary Computation Conference, 2020.
[12] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Transactions on Evolutionary Computation 6 (2002) 182–197.
[13] Q. Yang, Y. Liu, T. Chen, Y. Tong, Federated machine learning: Concept and applications, ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2019) 1–19.
[14] B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. y Arcas, Communication-efficient learning of deep networks from decentralized data, in: Artificial Intelligence and Statistics, PMLR, 2017, pp. 1273–1282.
[15] X. Li, K. Huang, W. Yang, S. Wang, Z. Zhang, On the convergence of FedAvg on non-IID data, arXiv preprint arXiv:1907.02189 (2019).
[16] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, V. Smith, Federated optimization in heterogeneous networks, Proceedings of Machine Learning and Systems 2 (2020) 429–450.
[17] L. Cao, H. Chen, X. Fan, J. Gama, Y.-S. Ong, V. Kumar, Bayesian federated learning: A survey, arXiv preprint arXiv:2304.13267 (2023).
[18] X. Zhang, Y. Li, W. Li, K. Guo, Y. Shao, Personalized federated learning via variational Bayesian inference, in: International Conference on Machine Learning, 2022.
[19] L. Liu, X. Jiang, F. Zheng, H. Chen, G.-J. Qi, H. Huang, L. Shao, A Bayesian federated learning framework with online Laplace approximation, IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (2024) 1–16. doi:10.1109/TPAMI.2023.3322743.
[20] H.-Y. Chen, W.-L. Chao, FedBE: Making Bayesian model ensemble applicable to federated learning, arXiv preprint arXiv:2009.01974 (2020).
[21] A. Robins, Catastrophic forgetting, rehearsal and pseudorehearsal, Connection Science 7 (1995) 123–146.
[22] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences 114 (2017) 3521–3526.
[23] C. V. Nguyen, Y. Li, T. D. Bui, R. E. Turner, Variational continual learning, arXiv preprint arXiv:1710.10628 (2017).
[24] S. Ebrahimi, M. Elhoseiny, T. Darrell, M. Rohrbach, Uncertainty-guided continual learning with Bayesian neural networks, arXiv preprint arXiv:1906.02425 (2019).
[25] F. E. Casado, D. Lema, M. F. Criado, R. Iglesias, C. V. Regueiro, S. Barro, Concept drift detection and adaptation for federated and continual learning, Multimedia Tools and Applications (2022) 1–23.
[26] J. Dong, J. Zhong, W.-N. Chen, J. Zhang, An efficient federated genetic programming framework for symbolic regression, IEEE Transactions on Emerging Topics in Computational Intelligence (2022).