=Paper=
{{Paper
|id=Vol-3741/paper57
|storemode=property
|title=Symbolic Regression for Transparent Clinical Decision Support: A Data-Centric Framework for Scoring System Development
|pdfUrl=https://ceur-ws.org/Vol-3741/paper57.pdf
|volume=Vol-3741
|authors=Veronica Guidetti,Federica Mandreoli
|dblpUrl=https://dblp.org/rec/conf/sebd/GuidettiM24
}}
==Symbolic Regression for Transparent Clinical Decision Support: A Data-Centric Framework for Scoring System Development==
Symbolic Regression for Transparent Clinical Decision Support: A Data-Centric Framework for Scoring System Development

Veronica Guidetti¹,*, Federica Mandreoli¹

¹ Department of Physics, Informatics and Mathematics, University of Modena and Reggio Emilia, Modena, Italy

Abstract
Machine learning (ML) has transformed healthcare, improving diagnostics, treatment, research, and patient care. However, clinical decision support (CDS) still relies heavily on classical statistical models and manually curated rules, which often lack transparency and accuracy. Since the mid-20th century, scoring systems have offered a transparent approach to CDS development. Nevertheless, classical methods for building scoring systems, such as logistic regression, may lack predictive accuracy and struggle with complex, high-dimensional electronic health record data, while black-box ML models pose risks due to their lack of interpretability. To address these challenges, our group focuses on developing interpretable symbolic ML approaches, leveraging multi-objective symbolic regression (MOSR) to accelerate index development, mitigate human bias, and allow for the exploration of new aggregation functions and weighting systems. MOSR optimizes multiple objectives simultaneously, distilling complex phenomena into non-linear yet understandable constructs, a crucial aspect for gaining trust from healthcare professionals. Moreover, MOSR is highly flexible and extendable to classical statistical models. This paper presents our experience in developing data-driven scoring systems, building on real-world applications such as COVID-19 mortality prediction and risk estimation after liver transplantation. Our methodology involves designing the entire data pipeline, from feature selection to scoring formula generation, highlighting the importance of developing data-centric and interpretable ML techniques for high-risk domains.

Keywords
Scoring systems, Symbolic Machine Learning, Data-Centric AI, Multi-Objective Optimization, Clinical Decision Making

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.
veronica.guidetti@unimore.it (V. Guidetti); federica.mandreoli@unimore.it (F. Mandreoli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Artificial Intelligence (AI) systems have revolutionized healthcare by offering opportunities to enhance diagnostics, treatment, research, and patient care. Among these tasks, we focus on the development of data-driven solutions for clinical decision support (CDS). Currently, CDS predominantly relies on classical statistical models and manually curated rules or heuristics designed to identify patient cohorts with specific characteristics of interest. Medical scoring systems are the most common CDS tool because they provide healthcare professionals with objective, intelligible, and quantifiable measures to aid in clinical decision-making. Indeed, by interpreting scores in conjunction with clinical expertise and patient-specific factors, healthcare professionals can make more informed decisions regarding diagnosis, treatment planning, and patient management. The history of scoring systems in medicine dates back to the mid-20th century. One of the earliest and most well-known scoring systems is the Apgar score, developed by Virginia Apgar in 1952, which assesses the health of newborns and is still widely used today for quick assessment.
Creating and validating clinical scoring systems involves several key steps. Clinicians begin by identifying a patient population with a specific disease or condition and gathering relevant data, including demographics, medical history, and laboratory results. Statistical methods are then employed to identify predictive factors for the outcome of interest, such as mortality or disease progression. Using these factors, a model is developed to generate a score predicting the likelihood of the outcome occurring. Finally, the scoring system is validated by testing its performance on a separate group of patients with the same condition to ensure its accuracy and reliability. Over the years, scoring systems continued to be developed, incorporating data from new technologies, such as imaging and laboratory tests, to improve their accuracy and reliability. However, traditional index creation methods, successful in the past, face challenges in the context of complex clinical phenotypes and the vast landscape of Electronic Health Records (EHRs). In fact, without automated pipelines to reduce variables or identify their importance, EHR-derived datasets are often high-dimensional, scarce, sparse, and unbalanced [1]. For these reasons, the emergence of AI and of sophisticated ML models able to analyze extensive datasets encompassing genetics, lifestyle factors, Patient Reported Outcomes, and EHRs has sparked a renewed emphasis on developing medical scoring systems.

This work describes the state of the art in the development of data-driven scoring systems and illustrates the progress and applications studied by our research group in recent years for the development of interpretable CDS tools. Specifically, in Section 2, we introduce the concept of scoring systems and their traditional development methods. We then delve into the challenges posed by the rise of AI in automating index creation and explain why interpretable ML techniques are the primary tools for this task. Moving forward, we present state-of-the-art methods for automatically generating scoring systems, with a focus on symbolic regression. This technique serves as the foundation for a new approach developed by our team for constructing flexible and parsimonious indices in real-world scenarios. Section 3 outlines our comprehensive data pipeline for scoring system development, highlighting key aspects and featuring real case studies from our recent research endeavors. In Section 4, we draw some concluding remarks and summarize the outstanding challenges in this field that we aim to address in the near future.

2. Preliminaries: Scoring Systems Development and Symbolic Regression

A score is a mathematical combination of a set of elementary indicators (EIs) representing the different components of a multidimensional concept to be measured (e.g., development, quality of life, wealth, risk). Hence, synthetic scores are used to measure concepts that cannot be captured by a single indicator. In general, a composite index should be based on a theoretical framework, so that the selection, combination, and weighting of the EIs reflect the size or structure of the phenomenon being measured. Building a score requires a series of design choices.
The process involves multiple steps: i) theoretical framework definition, where the concept to be measured and the EIs to be considered are identified; ii) EI selection, based on relevance, validity, availability, cost, etc.; iii) EI standardization, so as to work with dimensionless quantities; iv) aggregation function definition, where the final shape of the index combining the selected EIs is determined; v) index validation, in terms of robustness, generalization, and discriminating ability. There is no general method for the construction of synthetic indices, so each score is tied to its particular application. Classically, finding the right aggregation function and the weighting of the EIs is a highly non-trivial task [2] that by default involves human decisions, and in which datasets are mostly used to validate the score after it is built or, when using parametric statistical models, to learn its numerical coefficients (see, e.g., [3]). Despite its widespread use in high-stakes domains, such an approach may lead to scores that lack formal guarantees in terms of performance and constraint compliance [4].

2.1. Clinical Score Generation in the Era of Big Data and AI

Although the definition of the theoretical context and the identification of the relevant EIs will always require human knowledge to be reliable, the other stages of index creation can be automated. Indeed, the promise of systematic, targeted, and data-driven scores for CDS becomes apparent with recent advances in ML and the ever-increasing availability of EHR data. This would not only speed up index creation but also mitigate human bias in its construction and enable the exploration of aggregation functions and weighting systems beyond those considered in the past. However, the potential benefits are met with equally substantial challenges. Achieving systematic index creation necessitates the development and deployment of highly accurate clinical prediction models for a wide range of clinical problems. Moreover, the CDS framework should be robust enough to cope with and adapt to significant variations in clinical practices and documentation standards across different healthcare providers and systems. For these reasons, comprehensive assessments should be performed to weigh the potential advantages and risks associated with each automated CDS solution [5].

Meeting the previous requirements is made easier by interpretable ML techniques, which can be inspected and questioned by domain experts, ultimately leading to a higher level of trust and, consequently, facilitating their integration into existing decision-making frameworks. In fact, it is a view increasingly shared by the scientific community that black-box models cannot be the final answer to increasing accuracy or fixing cases where the assumptions of classical models are not met [6, 7]. Opaque models cannot be fully understood, have been shown to be dangerous in many real-life situations, and so should not be used in high-stakes domains [8]. Moreover, it has recently been shown that simpler intelligible models achieve comparable, if not better, results in several real-world cases [9], especially when dealing with tabular data such as EHRs.

2.2. State-of-the-art Approaches

State-of-the-art approaches for interpretable data-driven scoring systems formalize finding the index aggregation formula as an optimization problem. Most of them translate index creation into a classification problem, primarily focusing on binary outcomes [10, 4, 11, 12]. Most works use integer programming to build scoring systems with fixed aggregation function shapes, using training data to learn the numerical coefficients; a simplified sketch of this recipe is given below.
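To make the fixed-shape baseline concrete, here is a minimal sketch in Python: it fits a logistic model on standardized EIs and rounds the rescaled coefficients to small integer points. Rounding is only a crude surrogate for the exact integer optimization of methods such as [10, 4]; the function name and the `max_points` parameter are illustrative, not taken from those works.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def point_based_score(X, y, max_points=5):
    """Fit a linear logistic model, then rescale and round its
    coefficients to small integer 'points' (a crude stand-in for
    integer-programming approaches that optimize points directly)."""
    X_std = StandardScaler().fit_transform(X)
    clf = LogisticRegression(max_iter=1000).fit(X_std, y)
    w = clf.coef_.ravel()
    points = np.round(w * max_points / np.abs(w).max()).astype(int)
    # Final score: integer-weighted sum of standardized EIs; in practice
    # a score-to-risk table is then calibrated on training data.
    return points, X_std @ points
```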
While these works have the merit of supporting high-stakes decisions through interpretable data-driven models, they only marginally explore the space of scoring functions. In fact, most methods rely on dummy, threshold-based indicators and use decision-tree models to construct the aggregation functions [10, 11, 12, 13]. Constraining the space of aggregation functions to be searched to some linear structure may prevent the model from finding the right solution because of structural constraints introduced for the sake of simplifying the phenomenon. As a matter of fact, non-linear scoring systems are already widely used (see [2] for a technical introduction); for instance, the Body Mass Index (BMI) [14] is a simple and widely known example in clinical practice.

The problems listed above can be solved by using symbolic regression (SR), a data-driven ML method for regression analysis. In fact, as we elaborate in the next section, SR is able to search a much wider space of mathematical formulas to find the model that best fits a given dataset. Moreover, the use of SR is not limited to non-linear model fitting but can be adapted to any parametric statistical model. For the above reasons, the use of SR in scoring system development for high-stakes domains may lead to extremely important results in the coming years [5, 15].

2.3. Symbolic Regression

SR refers to data-driven ML methods for regression analysis that search the space of mathematical formulas to find the model that best fits a given dataset. The most classical approach to SR relies on genetic programming (GP) algorithms [16, 17, 15]. Standard GP approaches to SR require the selection of a predefined set of mathematical operators and variables that are used to construct candidate models (syntax trees in SR; see Figure 1 for an example). The algorithms perform optimization starting from a large set of randomly initialized models (the population), which is evolved through several generations in which individuals undergo transformations inspired by genetic mutations. Our interest in SR comes from its remarkable results in non-linear data-driven modeling, even on small training datasets [18]. Indeed, SR was shown to generalize better than standard ML methods on hundreds of tasks, providing interpretable results in most cases. When the optimization process relies on meta-heuristics such as GP, SR can easily be extended to multi-objective tasks. Multi-objective SR (MOSR) has recently been developed to address constrained minimization problems and to prevent overfitting the training data. A classical approach to the overfitting problem is a bi-objective setup that optimizes both complexity and accuracy [19, 20]. In fact, without an objective associated with complexity, models tend to become excessively verbose, losing generalization performance and transparency. Only recently has some work started combining accuracy and parsimony into a single objective through model selection principles such as the Akaike or Bayesian Information Criterion (AIC, BIC) or the minimum description length [21]. Other applications of MOSR aim to incorporate knowledge-driven (in)equalities to be satisfied [22, 23, 24].
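To ground the GP-based, bi-objective picture above, the following is a minimal, self-contained sketch, assuming a toy operator set and mean squared error as the accuracy objective; it is not the algorithm of [25], and all names are illustrative. Syntax trees are nested tuples, mutation replaces random subtrees, and selection keeps the Pareto front in (error, tree size).

```python
import random
import numpy as np

# Predefined operator set; real systems add protected division, powers, etc.
OPS = {'+': np.add, '*': np.multiply, 'min': np.minimum}

def random_tree(n_vars, depth=3):
    """Grow a random syntax tree: leaves are variables or constants."""
    if depth == 0 or random.random() < 0.3:
        if random.random() < 0.7:
            return ('x', random.randrange(n_vars))
        return ('c', round(random.uniform(-2.0, 2.0), 2))
    op = random.choice(list(OPS))
    return (op, random_tree(n_vars, depth - 1), random_tree(n_vars, depth - 1))

def evaluate(tree, X):
    """Evaluate a tree on a (n_samples, n_vars) matrix."""
    tag = tree[0]
    if tag == 'x':
        return X[:, tree[1]]
    if tag == 'c':
        return np.full(X.shape[0], tree[1])
    return OPS[tag](evaluate(tree[1], X), evaluate(tree[2], X))

def size(tree):
    """Complexity objective: number of nodes in the syntax tree."""
    return 1 if tree[0] in ('x', 'c') else 1 + size(tree[1]) + size(tree[2])

def mutate(tree, n_vars):
    """Genetic mutation: replace a random subtree with a fresh one."""
    if tree[0] in ('x', 'c') or random.random() < 0.3:
        return random_tree(n_vars, depth=2)
    branch = random.choice([1, 2])
    children = [mutate(c, n_vars) if i == branch else c
                for i, c in enumerate(tree[1:], start=1)]
    return (tree[0], *children)

def pareto_front(pop, X, y):
    """Bi-objective selection: keep trees not dominated in (MSE, size)."""
    scored = [(float(np.mean((evaluate(t, X) - y) ** 2)), size(t), t) for t in pop]
    return [s for s in scored
            if not any(o[0] <= s[0] and o[1] <= s[1] and o[:2] != s[:2]
                       for o in scored)]

# One evolutionary run: repeatedly mutate survivors and re-select the front.
# pop = [random_tree(n_vars=4) for _ in range(100)]
# for _ in range(30):
#     front = [t for _, _, t in pareto_front(pop, X, y)]
#     pop = front + [mutate(random.choice(front), 4) for _ in range(100)]
```

A full MOSR system would layer crossover, constant optimization, and further objectives (e.g., distributional constraints) on top of this loop.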
While these types of constraints can be easily incorporated in scoring system development, score construction itself can be a natively multi-dimensional problem that must satisfy or track more than one property.

3. Symbolic Data-centric AI for Scoring Systems

Our interest in developing data-driven scoring systems stems from our close collaboration with the Department of Infectious Diseases of the Modena University Hospital, particularly with members of the Modena HIV Metabolic Clinic (MHMC), a tertiary-level referral center for the diagnosis and treatment of non-infectious co-morbidities in people living with HIV (PWH).

[Figure 1. Left: pictorial view of the data pipeline construction. Right: example of a model syntax tree, encoding f(x, α) = α₁x₃ + min(x₂, x₁^α₂).]

In 2022 [25], our research group pioneered the first MOSR approach for managing data-driven, continuous, and non-linear scoring systems. The necessity for a multi-objective approach in continuous medical index creation arises from the intricate nature of medical outcomes, often too complex to predict precisely. This challenge is compounded by the characteristics of EHRs, which are typically not only small but also unbalanced. To address these challenges, unlike standard strategies that validate results post hoc using metrics like stratification power and index balancing, our approach incorporates these desirable properties directly into the optimization phase. This integration provides greater control over the behavior and complexity of the models.

Moreover, although the scientific community's main focus has been on identifying index aggregation functions, our fruitful collaboration with the MHMC highlighted that the automation of data-driven scoring system development should rely on a complete data pipeline encompassing the entire data lifecycle, from data preprocessing and feature engineering to model creation, optimization, and validation. A pictorial view of the data pipeline needed to create scoring systems with MOSR is depicted in Figure 1. While certain phases, such as the problem definition step, necessitate a synergistic approach between data scientists and clinicians, other steps can be partially optimized. In our experience, these are:

• EI engineering and selection. The construction of normalized EIs can be automated once the data type, its meaning, and the kind of non-linearities are specified in the index formulation. A major challenge in SR is the vast function space to be explored, which grows exponentially with the number of mathematical operators and variables [26]. To improve the convergence rate, we usually rely on minimum-redundancy-maximum-relevance non-linear feature selection [27, 28, 29] whenever needed (a minimal greedy variant is sketched after this list).

• Optimization objective. The ability to abstract the problem enables the identification of the appropriate class of models wherein risk assessment should be integrated, whether it is classification, regression, survival analysis, etc. By comprehending the properties essential for the index's effectiveness, one can discern additional constraints or desiderata, such as constraints/objectives on correlation, calibration, or distribution. Lastly, identifying the final application domain aids in understanding the tolerable non-linearities and the maximum acceptable complexity level of the final formulas. These considerations collectively facilitate the identification of optimal optimization goals.

• Performance metrics and comparison. Performance metrics beyond the optimization criteria should be identified to check the predictive power of the generated scores. These metrics need to be tailored to the specific problem under study. For instance, in the context of a risk score based on binary classification, it is essential to calculate metrics such as sensitivity, specificity, and the area under the ROC curve. Furthermore, it is paramount to compare the results with those obtained from classical statistical methods and other benchmark ML approaches.

• Model selection and validation. Model selection should consider performance metrics, safety, and domain expertise. In addition to standard performance evaluation, validating the score typically entails analyzing its distribution within the target population and its correlation with other clinically relevant measures. This ensures the score's consistency and compatibility with existing medical knowledge.
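As a concrete illustration of the feature selection step mentioned in the first bullet, here is a minimal greedy mRMR sketch based on mutual information; it is a lightweight stand-in for the methods of [27, 28, 29], not the exact procedure used in our pipeline, and the function name is illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mrmr_select(X, y, k):
    """Greedy minimum-redundancy-maximum-relevance selection: at each
    step pick the feature maximizing (relevance to the target) minus
    (mean mutual information with the already selected features)."""
    relevance = mutual_info_regression(X, y)            # I(x_j; y)
    selected = [int(np.argmax(relevance))]
    remaining = set(range(X.shape[1])) - set(selected)
    while len(selected) < k and remaining:
        def mrmr_score(j):
            redundancy = np.mean([mutual_info_regression(X[:, [j]], X[:, s])[0]
                                  for s in selected])   # mean I(x_j; x_s)
            return relevance[j] - redundancy
        best = max(remaining, key=mrmr_score)
        selected.append(best)
        remaining.discard(best)
    return selected  # column indices of the k retained EIs
```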
Below, we show that since the publication of the methodological work, our group has persistently advanced in developing, systematizing, and automating various aspects necessary for score development automation, also expanding the methodology to encompass different case studies. We start by briefly summarizing two situations where scores are modeled as regression and binary classification problems; afterward, we show in detail how we extended the method to survival analysis problems.

Improving Frailty Index assessment for PWH [30]. This study developed a data-driven tool for the geriatric assessment of PWH, focusing on enhancing the Frailty Index (FI), a clinical score based on 35 variables. Leveraging data from the MHMC cohort, we designed a pipeline for constructing reduced FIs. Starting from a set of 54 knowledge-driven binary EIs, we employed a non-linear feature selection method and identified the 27 EIs most relevant for predicting the FI. We modeled index simplification as a regression problem and used MOSR to replicate the values, distribution, and risk stratification ability of the original FI, minimizing the weighted mean square error and Wasserstein distance while maximizing pairwise Kendall correlation. We evaluated optimal model predictiveness through calibration, correlation with the original index, and associations with established geriatric outcomes such as age, the EQ-5D-5L score, and the SPPB index. The simplest optimal model used only 16 readily available variables, met all requirements, and was incorporated into the MHMC's automatic data collection pipeline.

Short-term mortality prediction in patients with COVID-19 [31]. This study aimed to predict short-term in-hospital mortality risk using data collected upon hospital admission. EHRs from 2400 patients with COVID-19 diagnoses were gathered. A non-linear feature selection method reduced the covariates from 25 to 10; the selection was validated by the medical team, and the covariates were standardized. Index creation involved a tailored non-linear logistic regression using formulas generated by MOSR. Experiments minimized weighted binary cross-entropy and maximized the F1 score to optimize the sensitivity-specificity tradeoff. MOSR outperformed popular machine-learning algorithms and classical human-generated indices, prioritizing false-negative reduction for timely interventions.
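To illustrate how the two objectives of the COVID-19 study can be computed for a single MOSR candidate, here is a minimal sketch; the class weight `w_pos`, the 0.5 threshold, and the example formula are hypothetical, not the values used in [31].

```python
import numpy as np
from sklearn.metrics import f1_score

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mosr_objectives(formula, X, y, w_pos=3.0):
    """Evaluate one MOSR candidate used as the non-linear predictor of a
    logistic model: weighted binary cross-entropy (to minimize) and F1
    score (to maximize). w_pos up-weights the positive class to counter
    class imbalance."""
    p = np.clip(sigmoid(formula(X)), 1e-7, 1 - 1e-7)
    w = np.where(y == 1, w_pos, 1.0)
    wbce = float(np.mean(-w * (y * np.log(p) + (1 - y) * np.log(1 - p))))
    f1 = f1_score(y, (p >= 0.5).astype(int))
    return wbce, f1

# A hypothetical MOSR-generated formula over standardized covariates:
# wbce, f1 = mosr_objectives(lambda X: 1.3 * X[:, 0] + np.minimum(X[:, 1], X[:, 2] ** 2), X, y)
```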
Table 1: Comparison between the classical Cox model and a MOSR solution, showing examples of risk factors mined by MOSR. Only statistically significant variables are reported for the Cox model.

Model | CLH | Cov. | α_i | [5%, 95%] α_i | -log2(p) | PHA
Cox model | CLH = Σ_i α_i x_i | ICU | 0.68 | [0.29, 1.08] | 10.62 | Failed
 | | cMELD | 0.67 | [0.26, 1.08] | 9.54 | Satisfied
 | | PTI | 1.50 | [0.74, 2.25] | 13.35 | Satisfied
MOSR | CLH = α_F1 F1 + α_F2 F2, with F1 = PTI^HD and F2 = max(2/3 HD, min(ICU, HD^max(MELD, PTI·Tx2015^PTI))) | F1 | 1.82 | [1.07, 2.57] | 18.89 | Satisfied
 | | F2 | 1.15 | [0.78, 1.52] | 30.11 | Satisfied

3.1. MOSR for survival: Mining post-Liver Transplantation (LT) risk factors

The study [32] aimed to quantify the interaction between risk factors in predicting death within the first four months post-LT. The data consisted of 485 EHRs of patients who underwent LT at the University Hospital of Modena between 2010 and 2020. Available exposure variables included patient admission details, preoperative conditions such as colonization by multidrug-resistant bacteria, and postoperative risk factors like bloodstream infections. Due to data scarcity and imbalance, EI selection and construction required collaboration between data scientists and clinicians. The chosen set of covariates was: hospitalization days (HD); intensive care unit days (ICU); Model for End-Stage Liver Disease (MELD); duration of surgery (DS); LT year (LTY); post-LT infection (PTI); MDR Gram-negative pre-operative colonization (GnC); on top of death observation and censoring times. Variable selection was followed by feature engineering to create comparable and informative EIs. Specifically, we discretized continuous variables based on clinically meaningful intervals.

Cox regression is the most classic and easy-to-interpret semi-parametric survival model able to estimate the effects of exposure variables while adjusting for confounding effects. Classically, Cox's hazard model is written as

    H(x, t) = H0(t) × exp{α^T x}    (1)

where x = {x_i}, i = 1...N, are the covariates and α = {α_i} are the parameters to be estimated by the model. The hazard function H is therefore given by the product of the baseline hazard H0(t) and the covariate-dependent relative risk. As the only time dependence lies in H0(t), a fundamental assumption underlying the application of the Cox model is the Proportional Hazard Assumption (PHA). The application of the classical Cox model turned out not to be suitable for the indicators created since, as shown in Table 1, some covariates did not meet the PHA. Classical methods for extending the Cox model in this situation are difficult to apply to sparse and unbalanced data, as well as being less interpretable. Therefore, we embedded SR into Cox regression by making the covariate-dependent log relative hazard function (CLH) trainable and potentially non-linear: exp{α^T x} → exp{f(α, x)}. By optimizing partial AIC (pAIC) and model complexity, parametrized as the number of nodes in the syntax tree, we maximized model likelihood while minimizing the number of numerical parameters and the degree of non-linearity for enhanced interpretability. MOSR survival models not only outperformed classical Cox regression in predictive performance but also successfully mined composite data-driven risk factors, overcoming classical model limitations such as the PHA. See the bottom panel of Table 1 for a representative optimal solution.
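As a worked illustration of the optimization target, here is a minimal sketch of the negative log partial likelihood and the resulting pAIC for a Cox model whose CLH is an arbitrary candidate function, as in Eq. (1) with exp{f(α, x)}. It uses a Breslow-type approximation with crude tie handling; names and the example at the bottom are illustrative, not the implementation of [32].

```python
import numpy as np

def neg_log_partial_likelihood(clh, X, time, event):
    """Breslow-type negative log partial likelihood for a Cox model with
    an arbitrary log relative hazard: H(x, t) = H0(t) * exp{clh(x)}."""
    scores = clh(X)                         # log relative hazard per patient
    order = np.argsort(-time)               # descending time: risk sets are prefixes
    s, e = scores[order], event[order].astype(bool)
    log_risk = np.logaddexp.accumulate(s)   # running log-sum-exp over risk sets
    return float(-np.sum(s[e] - log_risk[e]))

def partial_aic(clh, n_params, X, time, event):
    """pAIC = 2k + 2 * negative log partial likelihood; paired with syntax
    tree size, these are the two objectives optimized by the survival MOSR."""
    return 2 * n_params + 2 * neg_log_partial_likelihood(clh, X, time, event)

# Hypothetical composite factors in the spirit of Table 1:
# clh = lambda X: 1.82 * f1(X) + 1.15 * f2(X)
# score = partial_aic(clh, n_params=2, X=X, time=t, event=d)
```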
Finally, model selection was guided by differences in pAIC values for theoretical support, by calibration and predictive performance metrics for efficiency and usefulness, and by out-of-distribution prediction for safety assessment. More details on feature selection and construction, as well as on model performance and validation, can be found in the original paper.

4. Concluding Remarks, Open Challenges, and Future Research Opportunities

Scoring system development stands out as a prominent application of data-centric AI, where the fusion of data mining, big data analytics, and ML plays a pivotal role. In this work, we have presented the state of the art in automatic scoring system generation, covering data frameworks, algorithms, and select case studies successfully tackled by our research group. By distilling knowledge from retrospective clinical data, our data processing pipeline and analyses offer seamless adaptability to diverse case studies. Despite the advances of recent years, state-of-the-art data pipelines for data-driven scoring system development still lack some crucial milestones. In addition to being interpretable, CDS tools must be able to handle uncertainty. The problem of quantifying and managing epistemic uncertainty becomes paramount when the system under study evolves over time or when dealing with multicenter studies. The ML frameworks devoted to these issues are continual learning (CL) and federated learning (FL) [33]. CL focuses on enabling AI models to learn and adapt continuously over time, incorporating new information while retaining previously acquired knowledge. FL, on the other hand, aims to facilitate multicenter studies by overcoming the problems of centralized learning settings, which require transferring sensitive medical data from multiple centers to a central location. FL and CL approaches to SR have received little attention, with the literature focusing on deep learning models. Only a few works exist on federated SR algorithms [34, 35], and none of them deal with concept drift. To foster trust among domain experts, model interpretability may not suffice: experts should be fully engaged in model creation and selection. In a true human-in-the-loop scenario, physicians should be able to contribute their clinical expertise directly to ML model development and selection, facilitated by flexible graphical interfaces enabling iterative feedback. Finally, proposing SR as a method capable of automating the creation of CDS tools will require consolidating and extending it to various parametric models commonly used in clinical score development, such as sub-distribution hazard models and time series analysis.

References

[1] F. Mandreoli, D. Ferrari, V. Guidetti, F. Motta, P. Missier, Real-world data mining meets clinical practice: Research challenges and perspective, Frontiers in Big Data 5 (2022) 1021621.
[2] M. Mazziotta, A. Pareto, Methods for constructing composite indices: One for all or all for one, Rivista Italiana di Economia Demografia e Statistica 67 (2013) 67-80.
[3] M. Than, et al., Development and validation of the emergency department assessment of chest pain score and 2h accelerated diagnostic protocol, EMA 26 (2014) 34-44.
[4] B. Ustun, C. Rudin, Learning optimized risk scores, Journal of Machine Learning Research 20 (2019) 1-75.
[5] W. G. La Cava, P. C. Lee, I. Ajmal, X. Ding, P. Solanki, J. B. Cohen, J. H. Moore, D. S. Herman, A flexible symbolic regression method for constructing interpretable clinical prediction models, NPJ Digital Medicine 6 (2023) 107.
[6] G. Kantidakis, H. Putter, C. Lancia, J. d. Boer, A. E. Braat, M. Fiocco, Survival prediction models since liver transplantation - comparisons between Cox models and machine learning techniques, BMC Medical Research Methodology 20 (2020) 1-14.
[7] G. Kantidakis, E. Biganzoli, H. Putter, M. Fiocco, et al., A simulation study to compare the predictive performance of survival neural networks with Cox models for clinical trial data, Computational and Mathematical Methods in Medicine 2021 (2021).
[8] C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Nature Machine Intelligence 1 (2019) 206-215.
[9] L. Semenova, C. Rudin, R. Parr, On the existence of simpler machine learning models, in: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022, pp. 1827-1858.
[10] B. Ustun, C. Rudin, Supersparse linear integer models for optimized medical scoring systems, Machine Learning 102 (2016) 349-391.
[11] N. Sokolovska, Y. Chevaleyre, J.-D. Zucker, A provable algorithm for learning interpretable scoring systems, in: Proc. of the 21st Int'l Conf. on Artificial Intelligence and Statistics, 2018, pp. 566-574.
[12] C. Q. Zhu, M. Tian, L. Semenova, J. Liu, J. Xu, J. Scarpa, C. Rudin, Fast and interpretable mortality risk scores for critical care patients, 2023. arXiv:2311.13015.
[13] R. Zhang, R. Xin, M. Seltzer, C. Rudin, Optimal sparse survival trees, arXiv preprint arXiv:2401.15330 (2024).
[14] A. Romero-Corral, et al., Accuracy of body mass index in diagnosing obesity in the adult general population, International Journal of Obesity 32 (2008) 959-966.
[15] D. Angelis, F. Sofos, T. E. Karakasidis, Artificial intelligence in physical sciences: Symbolic regression trends and perspectives, Archives of Computational Methods in Engineering 30 (2023) 3845-3865.
[16] J. R. Koza, Genetic programming as a means for programming computers by natural selection, Statistics and Computing 4 (1994) 87-112.
[17] W. La Cava, P. Orzechowski, B. Burlacu, F. O. de França, M. Virgolin, Y. Jin, M. Kommenda, J. H. Moore, Contemporary symbolic regression methods and their relative performance, arXiv preprint arXiv:2107.14351 (2021).
[18] C. Wilstrup, J. Kasak, Symbolic regression outperforms other models for small data sets, arXiv preprint arXiv:2103.15147 (2021).
[19] B. Burlacu, G. Kronberger, M. Kommenda, M. Affenzeller, Parsimony measures in multi-objective genetic programming for symbolic regression, in: Proceedings of the Genetic and Evolutionary Computation Conference Companion, 2019, pp. 338-339.
[20] Q. Chen, B. Xue, M. Zhang, Rademacher complexity for enhancing the generalization of genetic programming for symbolic regression, IEEE Transactions on Cybernetics 52 (2020) 2382-2395.
[21] D. J. Bartlett, H. Desmond, P. G. Ferreira, Exhaustive symbolic regression, IEEE Transactions on Evolutionary Computation (2023).
[22] J. Kubalík, E. Derner, R. Babuška, Symbolic regression driven by training data and prior knowledge, in: Proceedings of the 2020 Genetic and Evolutionary Computation Conference, 2020, pp. 958-966.
[23] C. Haider, F. O. de França, B. Burlacu, G. Kronberger, Using shape constraints for improving symbolic regression models, arXiv preprint arXiv:2107.09458 (2021).
[24] J. Kubalík, E. Derner, R. Babuška, Multi-objective symbolic regression for physics-aware dynamic modeling, Expert Systems with Applications 182 (2021) 115210.
[25] D. Ferrari, V. Guidetti, F. Mandreoli, Multi-objective symbolic regression for data-driven scoring system management, in: 2022 IEEE International Conference on Data Mining (ICDM), IEEE, 2022, pp. 945-950.
[26] M. Virgolin, S. P. Pissis, Symbolic regression is NP-hard, Transactions on Machine Learning Research (2022).
[27] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 1226-1238.
[28] M. Yamada, J. Tang, J. Lugo-Martinez, E. Hodzic, R. Shrestha, A. Saha, H. Ouyang, D. Yin, H. Mamitsuka, C. Sahinalp, et al., Ultra high-dimensional nonlinear feature selection for big biological data, IEEE Transactions on Knowledge and Data Engineering 30 (2018) 1352-1365.
[29] Z. Zhao, R. Anand, M. Wang, Maximum relevance and minimum redundancy feature selection methods for a marketing machine learning platform, in: 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2019, pp. 442-452.
[30] V. Guidetti, F. Motta, J. Milic, D. Ferrari, F. Mandreoli, G. Guaraldi, Unlocking frailty index: Knowledge distillation with symbolic machine learning to simplify frailty assessment in standard HIV clinics, under review (2024).
[31] D. Ferrari, V. Guidetti, Y. Wang, V. Curcin, Multi-objective symbolic regression to generate data-driven, non-fixed structure and intelligible mortality predictors using EHR: Binary classification methodology and comparison with state-of-the-art, in: AMIA Annual Symposium Proceedings, volume 2022, American Medical Informatics Association, 2022, p. 442.
[32] V. Guidetti, G. Dolci, E. Franceschini, E. Bacca, G. J. Burastero, D. Ferrari, V. Serra, F. Di Benedetto, C. Mussini, F. Mandreoli, Death after liver transplantation: Mining interpretable risk factors for survival prediction, in: 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2023, pp. 1-10.
[33] L. Cao, H. Chen, X. Fan, J. Gama, Y.-S. Ong, V. Kumar, Bayesian federated learning: A survey, arXiv preprint arXiv:2304.13267 (2023).
[34] J. Dong, J. Zhong, W.-N. Chen, J. Zhang, An efficient federated genetic programming framework for symbolic regression, IEEE Transactions on Emerging Topics in Computational Intelligence (2022).
[35] D. Nguyen Duy, M. Affenzeller, R. Nikzad-Langerodi, Towards vertical privacy-preserving symbolic regression via secure multiparty computation, in: Proceedings of the Companion Conference on Genetic and Evolutionary Computation, 2023, pp. 2420-2428.