Data-driven vs knowledge-driven inference of health outcomes in the ageing population: a case study

Davide Ferrari, University of Modena and Reggio Emilia, Modena, Italy, 162996@studenti.unimore.it
Giovanni Guaraldi, University of Modena and Reggio Emilia, Modena, Italy, giovanni.guaraldi@unimore.it
Federica Mandreoli, University of Modena and Reggio Emilia, Modena, Italy, federica.mandreoli@unimore.it
Riccardo Martoglia, University of Modena and Reggio Emilia, Modena, Italy, riccardo.martoglia@unimore.it
Jovana Milić, University of Modena and Reggio Emilia, Modena, Italy, jovana.milic@gmail.com
Paolo Missier, Newcastle University, Newcastle upon Tyne, UK, paolo.missier@ncl.ac.uk

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Preventive, Predictive, Personalised and Participative (P4) medicine has the potential to not only vastly improve people's quality of life, but also to significantly reduce healthcare costs and improve its efficiency. Our research focuses on age-related diseases and explores the opportunities offered by a data-driven approach to predict wellness states of ageing individuals, in contrast to the commonly adopted knowledge-driven approach that relies on easy-to-interpret metrics manually introduced by clinical experts. This is done by means of machine learning models applied on the My Smart Age with HIV (MySAwH) dataset, which is collected through a relatively new approach especially for older HIV patient cohorts. This includes Patient Related Outcomes values from mobile smartphone apps and activity traces from commercial-grade activity loggers. Our results show better predictive performance for the data-driven approach. We also show that a post hoc interpretation method applied to the predictive models can provide intelligible explanations that enable new forms of personalised and preventive medicine.

1 INTRODUCTION

Medical practice is evolving rapidly, away from the traditional but inefficient detect-and-cure approach, and towards a Preventive, Predictive, Personalised and Participative (P4) vision that focuses on extending people's wellness state, with particular focus on ageing individuals [19]. This vision is increasingly data-driven, and is underpinned by many forms of "Big Health Data" including periodic clinical assessments and electronic health records, but also using new forms of self-assessment, such as mobile-based questionnaires and personal wearable devices. With these premises, P4 medicine has the potential to not only vastly improve people's quality of life, but also to significantly reduce healthcare costs and improve its efficiency.

Our research explores specific opportunities offered by data-driven approaches to predictive care, in contrast to traditional, knowledge-driven approaches that rely purely on clinical expertise. Our focus is on age-related diseases, an emerging issue for health care systems. The World Health Organisation estimates that the number of people over 60 years of age will reach 2 billion by 2050 [17]. Ageing is associated with increased prevalence of co-morbidities that accumulate in the complex of multi-morbidity in older people [1].

1.1 Background

We focus on two easy-to-interpret metrics that clinical researchers have proposed to succinctly express the health status of patients at a given point in time. The first is a measure of frailty, designed to quantify the reduction of homeostatic reserves available to an individual. High frailty is indicative of higher risk of negative health outcomes, but it is also a potentially reversible condition [21]. In practice, frailty is measured using a variety of specific Frailty Index metrics (FIs). These are calculated using at least 30 directly assessed health variables, which include signs or symptoms, biochemical parameters, various co-morbidities or socio-demographic data [6, 22]. The choice of the specific variables, as well as their sources, may vary depending on data availability, leading to different specifications for the FI. Given their complexity and multidimensionality, FIs reflect the biological age of an individual rather than their chronological age, making them reliable prognostic tools that can be used in different settings for clinical decision algorithms [6]. In particular, the dataset used in this research concerns a cohort of HIV patients. This is important, because long-lived HIV patients exhibit a form of accentuated ageing [3], such that they can successfully be used to study frailty, i.e., where the duration of the condition (number of years since infection) is used as a proxy for chronological age.

The second measure of health considered in this work directly reflects the more positive notion of healthy ageing [16] that is becoming prevalent especially in public health settings. In contrast to frailty, which is designed to measure decay, the term healthy ageing (HA) has been proposed to promote a positive approach to ageing that relies on reserves and preserved capacities in an individual, rather than accumulation of deficits. In the World Health Organization Guidelines on Integrated Care for Older People (ICOPE), HA is based on concrete measures of Intrinsic Capacity (IC). These are defined as a composite of all the physical and mental capacities of an individual, divided into five domains: locomotion, cognition, psychological, vitality and sensory capacity [16].

As suggested by Belloni and Cesari [2], frailty and IC should not be considered as two opposed constructs, but rather two constructs that share a common biological background. The IC should be considered as an evolution of the frailty concept, taking into special consideration the functional reserve expressed by the vitality domain, the need for a worldwide implementation of prevention, the continuum of the ageing process, and the opportunities offered by novel technologies [8].

Key to a successful operational definition of IC is the choice of the variables used to characterise healthy ageing. [16] argues that integrated care is crucial to incorporate intrinsic capacity assessment in new care models and encourages the adoption of wearable devices and mobile apps for longitudinal data collection.

In [9], the authors proposed a self-generated health measure called Intrinsic Capacity Index (ICI) which relies on physical function data collected through fitness tracking wearable devices and a set of electronic Patient-Related Outcomes (PRO) collected through a dedicated smartphone app (MySAwH App). In the study, these variables have been collected longitudinally over a period of 18 months for a cohort of HIV patients as part of the My Smart Age with HIV (MySAwH) study. This is a prospective multi-center international case-only study, designed to empower Older People Living With HIV (OPLWH) to achieve healthy lifestyles and improvement in quality of life.

Similarly to FI, ICI is computed from a subset of the available variables, by manually selecting a cutoff point for each variable and simply counting, for a given patient, the variables with value higher than the cutoff for that patient. We refer to this approach, which represents the common practice to assess health condition in geriatric medicine, as "knowledge-driven" (KD for short), as it relies on easy-to-interpret metrics where the choice of variables and of their cutoff points is defined manually by clinical experts. The usefulness of the ICI is demonstrated in [9] by showing experimentally that it displayed higher sensitivity than FI to predict one central indicator of "wellness state" of ageing individuals, the Quality of Life (QoL in short).

1.2 Contributions

In contrast to the dominant KD methods as described, in this paper we explore a complementary data-driven approach (DD) to predicting wellness states for long-term patients, using a combination of clinical, self-monitoring, and PRO (self-reporting) longitudinal observations that refer to the five IC domains. As we will see, while this approach removes the need for the clinical experts to directly define metrics such as the ICI, it also empowers them by deploying machine learning techniques that make the predictions easily interpretable.

Specifically, we focus on three dependent variables to characterise the "wellness state" of ageing individuals, namely, (i) Falls, indicating whether or not a person has experienced any falls within a given time period; (ii) SPPB, or Short Physical Performance Battery, measuring movement of the lower limbs, and (iii) Quality of Life (QoL). Using the MySAwH dataset for training, we can directly contrast the DD and KD approaches, and we show that we can achieve better predictive power without the need to rely on manually formulated ICI metrics. Furthermore, we show that the models' performance increases if we also include a single Frailty Index value for each patient, representing a "baseline" clinical assessment, in addition to the PRO and activity variables.

Finally, we combine well-established machine learning models with a post-hoc interpretation method and we show that satisfactory model performance can be achieved together with intelligible explanations, which provide the additional benefit of ranking the variables with respect to a prediction. Indeed, we show that the relative importance of the variables differs for each patient, indicating the important role played by intelligible models towards personalising healthcare. This capability also provides critical feedback to the clinicians, who can use this information in combination with their expertise to moderate the model's predictions.

2 RELATED WORK

The idea of exploiting machine learning toward a general "health" focus is certainly gaining popularity, also thanks to the possibility of continuously monitoring well-being through wearable activity trackers' data. For instance, passive sensing techniques have been recently exploited to assess both mental and physical health [20] or to predict weight objectives for users of smart connected devices [26]. Some researchers also successfully exploited this kind of data in more traditional disease monitoring scenarios, e.g., to track the health of Multiple Sclerosis [24], depression [29] and schizophrenia [28] patients. In most of these studies mobile phones are only exploited as a passive sensing device, while in our case we combine the use of passive sensing (wearable activity tracker) data with self-reporting EMA (Ecological Momentary Assessment) data collected through mobile phone apps.

A critical aspect for the successful application of ML to medicine is the recent increased emphasis on the need for explanations of ML systems [7, 10]. This has led to another important research trend, i.e. Interpretable Machine Learning in healthcare [14]. Even if recent research has proposed new models which exhibit high performance as well as interpretability, e.g., GA2M [15] and rule-based models [25], the utility of these models in healthcare has not been convincingly demonstrated yet, due to the rarity of their application [14]. The interpretation method used in [11, 12], instead, is designed to work with existing and well established (even if less interpretable) ML methods, such as gradient boosting or deep learning, by extracting explanations through models that are applied post-hoc using Shapley Values [23]. This is one of the most advanced interpretation methods available, allowing for both global (entire study population) and local (instance level) explanations. In our study we exploit this method together with the XGBoost [4] gradient boosting algorithm, aiming at both interpretability and performance. A similar technique has also been proposed, only at conceptual level, in the Explainable AI framework discussed in [27], which is however based on clinical-only data analysis.

3 THE HIV COHORT DATASET

The experimental dataset used in this paper was obtained from the My Smart Age with HIV (MySAwH) [18] project. MySAwH is a multi-centre prospective ongoing study aiming at empowering OPLWH, i.e. 50+ years old, to develop healthy lifestyles. The project involves 261 patients from three clinics: 128 from Modena (Italy), 100 from Sydney (Australia) and 33 from Hong Kong (China). One novelty of the approach is the combination of clinical patient data, acquired during periodic scheduled assessments in the clinic, with patient-oriented longitudinal data about patients' behavioural, physiological and environmental health status, which is collected at higher frequency through smartphones and wearable devices.

Thus, the resulting dataset is highly heterogeneous as to the data type, the geographical origin of patients and the acquisition rate. It provides a comprehensive characterization of patients from the broad determinants of health that impact aging, complemented by those more specific to OPLWH, namely:

Activity tracking variables: Step count, sleep hours and calories, which are collected daily using a commercial-grade wearable activity tracker;
PRO variables: 56 categorical questions exploring functional abilities and Quality of Life (QoL). These are collected monthly using a dedicated smartphone app;
Clinical variables: Comprehensive geriatric assessment and HIV-specific variables, which are collected by healthcare workers during study visits at time 0, 9 and 18 months. 37 of these variables were used to measure the Frailty Index (FI) as defined in [6]: 27 from blood tests, 3 about body composition, 7 HIV-related variables and patient-reported outcomes.

A preliminary analysis [8] introduced a new index for expressing IC, named ICI, and compared the performance of FI and ICI in predicting QoL. This is a quantitative measure of health as assessed by the individual respondents that is widely used on ageing populations. QoL was assessed using a standardised EQ-5D-5L questionnaire, based on the EQ Visual Analogue scale (EQ VAS) [5]. Through the MySAwH dataset, FI and ICI were shown to be effective tools that can be used in research and clinical settings to describe disease and health status in OPLWH. The ICI score, in comparison to FI, displayed higher sensitivity to predict the QoL and self-perceived health in OPLWH.

Outcomes. In this work, we focus on the task of predicting significant healthy ageing indicators, and QoL is one such indicator. In addition to QoL, two other indicators were selected: Falls, an adverse outcome included among the geriatric syndromes, and the Short Physical Performance Battery (SPPB), a group of measures about movement ability with lower limbs (Guralnik et al., 2000) that can aid in the monitoring of function in older people. These three outcomes are chosen because they widely cover all 5 domains of IC. In summary, these are as follows (Fig. 1 shows their distributions):

Quality of Life (QoL): assessed using the EQ-5D-5L standard, with values between 0 and 1;
Falls: A binary outcome that evaluates to True if a patient has fallen at least once since the previous visit, and False otherwise;
SPPB Index: a discrete index that assumes integer values between 0 and 12.

[Figure 1: Distribution of outcomes in the dataset. (a) QoL distribution; (b) SPPB distribution; (c) Falls distribution.]

Observational data and feature space. The observational dataset used to predict these outcomes covers 18 months and is broken down into two time windows, reflecting the clinical assessment schedule (months 9 and 18), when the selected clinical outcomes are assessed. For each time window, we draw two sets of samples from the related observations, which we are going to use as ground truth to train our models.

The first consists of the patient-centric longitudinal data, including the activity tracking and the PRO variables. To this end, we further aggregated the resulting PRO time series and the activity tracking time series at regular intervals of 1 month, resulting in two sets of data points, one for each of the two windows. The second sample set augments the first with the FI values computed from the clinical variables measured during hospital visits at the beginning of each window, namely at times 0 and 9. This added value is a physician's assessment that complements the patient-centric data point and can be interpreted as the baseline of the time series the data point comes from.

Formally, for each outcome $o \in \{QoL, SPPB, Falls\}$ we write $Sample_o$ to denote the sample set containing the monthly samples per patient. For each patient $p$, we denote a single sample in the set $Sample_o$ at month $m = i + (j - 1) \cdot 9$, corresponding to the $i$-th observation, $i \in [1, 8]$, in the $j$-th window, $j \in [1, 2]$, by a pair $s^p_m = (x^p_{i,j}, y^p_j) \in Sample_o$ where

• $x^p_{i,j}$ represents the feature vector for the $i$-th observation;
• the feature vector $x^p_{i,j}$ contains the values of the PRO and activity tracking variables $\mathcal{V}$ for patient $p$ at month $m$: 56 PRO answers provided by the patient during the corresponding month, plus 3 aggregated values computed as the mean of the daily wearable device data (step count, calories, number of sleep hours) collected during the same month. Given a variable $V_k$, $x^p_{i,j}[V_k]$ denotes the $V_k$ value in $x^p_{i,j}$;
• $y^p_j$ is the value of the outcome $o$ measured during the hospital visit at the end of period $j$.

The second sample set, denoted as $Sample_o^{FI}$, is built by adding to each feature vector $x^p_{i,j}$ in $Sample_o$ the FI value computed from the clinical variables measured during the hospital visit at the beginning of period $j$, i.e. at month 0 when $j = 1$ and at month 9 when $j = 2$.

Quality Assurance. Lastly, we performed a quality assurance step on the resulting sample sets. Within each time window, observations of PRO variables are sometimes incomplete, resulting in sequence gaps. The size of the gaps is 5 consecutive missing observations on average, with a max of 17, and we found 108 gaps per patient on average, with a max of 284 gaps (regardless of size). We performed imputation by interpolating missing data points in the time series, with an aim to achieve a balance between the size of the gaps and the performance of the predictive model. Clearly, interpolating very large gaps produces spurious data in the training set. We experimentally determined the max size of gaps that could be safely interpolated (five missing steps), by assessing the predictive performance of each of the models resulting from training sets obtained from more or less "aggressive" interpolation. After adjusting for missing data, the final training set contains 2,250 data points, with an average of 8 per patient. The construction of the dataset results in at most 16 samples per patient, for a total of 4,176 records, considering each month for each patient.

4 THE LEARNING FRAMEWORK

The objective of the data-driven approach is to predict each outcome at the time of a visit using the samples referring to the time window before that visit, as shown in Fig. 2.

[Figure 2: Prediction at next clinical visit using only patient reported outcomes. Months 1 to 9 form the first time window, months 10 to 18 the second.]

To this end, the data-driven learning framework we built is depicted on the left side of Fig. 3. Each outcome $o$ is predicted by two learning models $M_o$ and $M_o^{FI}$, one for each of the two sample sets $Sample_o$ and $Sample_o^{FI}$, respectively. We trained the two models separately and assessed the performance using standard KFold cross-validation (CV) on 80% of the samples ($\chi_{train}$) and a test phase on the remaining 20% of the samples ($\chi_{test}$) of the corresponding sample set.

[Figure 3: Comparison between Data-Driven and Knowledge-Driven approaches. In both pipelines, QoL and SPPB are predicted by regression models (MAE) and Falls by a classification model (Acc, Prec, Rec); the KD pipeline goes from raw data through subsetting and cutoffs to the IC Index, while the DD pipeline additionally produces a model interpretation.]

The DD approach is compared to the KD approach that represents the common practice in geriatric medicine. To this end, we built the KD learning framework depicted on the right hand side of Fig. 3. The approach aims at computing an ICI score for each observation by manually selecting a subset $V = \{V_1, \dots, V_n\} \subseteq \mathcal{V}$ of the set of PRO and activity tracking variables $\mathcal{V}$, specifying functions $s_k(x)$ to map each value $x$ for variable $V_k \in V$ to a score, and finally combining the individual scores $s_k(x)$ into a unique value. The variables are chosen to represent each of the five IC domains, namely locomotion, cognition, psychological, vitality and sensory capacity. For most of the variables $V_k$, a binary score is defined, i.e., $s_k(x) \in \{0, 1\}$, based on a single threshold: for instance, when $V_k$ = stress level (from 1 to 10) the score is mapped to 1 if the value is lower than 3 and 0 otherwise. Other variables are mapped to a score in the $[0, 1]$ range, for instance the number of steps per day.

Then, given a feature vector $x^p_{i,j}$ for patient $p$, the corresponding ICI value $ICI(i, j, p)$ is computed as the sum of the $s_k(x^p_{i,j}[V_k])$ scores, normalised by the number of variables $n = |V|$:

$$ICI(i, j, p) = \frac{\sum_{k=1}^{|V|} s_k\left(x^p_{i,j}[V_k]\right)}{n}$$

Such an index is subject to an inevitable bias: the imposition of the physician's interpretation on the choice of the variables of the subset, as well as on the thresholds and the arithmetic formula to be used.

Also for this case, a dataset consisting exclusively of the ICIs, $Sample_o^{ICI}$, and one consisting of the ICIs and the FI at the most recent visit, $Sample_o^{ICI,FI}$, were isolated. This parallelism with what was done in the data-driven approach allowed us to train 6 learning models with the datasets just described ($M_o^{ICI}$ and $M_o^{ICI,FI}$ for each of the three outcomes $o$) and to compare the predictive performances between the two different approaches.

5 EXPERIMENTAL RESULTS AND MODEL INTERPRETATION

In this section we first discuss the performance of the predictive models, and then describe in detail our approach and results regarding model interpretation, which is reported as output to the DD approach in the left side of Fig. 3. The latter is a fundamental requirement in medicine, where the ability to provide medical doctors with an easy-to-understand interpretation of the model predictions is essential. This not only conveys confidence in the predictions, but also helps to make them actionable, i.e., in the form of recommendations to patients. We present examples of such interpretations, and their practical relevance, in Sec. 5.2.

The Gradient Boosting algorithm [13] proved to offer better predictive performance than other popular intelligible learning frameworks such as GA2M [15], suggesting that separating model performance from model interpretability would better suit our needs. Thus, our results are based on a combination of Gradient Boosting (the XGBoost implementation [4]) for performance (Sec. 5.1), and Shapley Values [23] (using the SHAP implementation [11]) to generate reports on the relative importance of each individual feature, both across the entire population and for individual patients (Sec. 5.2).

5.1 Predictive Performance

In our evaluation we compare our DD approach, illustrated in the left side of Fig. 3, with the KD approach based on a regression on the IC index (right hand side in the Figure). Fig. 4 shows the predictive performance of the models we tested, namely: the DD models trained with and without using FI as a feature, earlier referred to as $M_o^{FI}$ and $M_o$, respectively; and the KD models, where again the expert may or may not consider FI. Results are presented using 1-MAPE (MAPE: Mean Absolute Percentage Error) for the numerical outcomes QoL and SPPB on the left of the figure, and accuracy, precision, recall and F1 for Falls, on the right.

Figure 4: Predictive Performance. Left: 1-MAPE for QoL and SPPB; right: classification effectiveness for Falls. Each cell reports KD / DD.

| | QoL | SPPB |
|---|---|---|
| w/o FI | 91% / 92% | 93% / 92% |
| w/ FI | 92% / 94% | 93% / 95% |

| | Acc | Prec - True | Prec - False | Rec - True | Rec - False | F1 - True | F1 - False |
|---|---|---|---|---|---|---|---|
| w/o FI | 84% / 93% | 22% / 97% | 85% / 93% | 2% / 52% | 99% / 100% | 4% / 68% | 91% / 96% |
| w/ FI | 89% / 95% | 72% / 98% | 92% / 95% | 54% / 68% | 96% / 100% | 62% / 80% | 94% / 97% |

The results indicate a higher than 90% 1-MAPE for all cases in QoL and SPPB, while classification accuracy for Falls is higher than 84%. Further, the DD approach performs generally better than KD, and both benefit from using FI, with performance reaching 94.3%, 94.9% and 95% for QoL, SPPB, and Falls, respectively.

To note, in one case the KD approach returns a very low Recall when FI is not used. This can be explained by the strong imbalance of the majority "False" class (no Falls) relative to the small minority "True".

The training sets used to generate these models combine patients from all three clinics. To account for possible differences in data collection protocols between the clinics, we also created one separate model for each. The corresponding results are presented in Table 1 and are consistent with those presented above. Some anomalies appear in the Hong Kong models, and these are probably due to the small size of the training set.

Table 1: Single-clinic models performance (QoL and SPPB: predictive performance as 1-MAPE; remaining columns: classification effectiveness for Falls). Each cell reports KD / DD.

| Clinic | FI | QoL | SPPB Index | Acc | P True | P False | R True | R False | F1 True | F1 False |
|---|---|---|---|---|---|---|---|---|---|---|
| Hong Kong | w/o FI | 93% / 93% | 91% / 92% | 84% / 96% | 0% / 0% | 87% / 100% | 0% / 0% | 97% / 97% | 0% / 0% | 92% / 98% |
| Hong Kong | w/ FI | 94% / 93% | 94% / 93% | 94% / 93% | 1% / 1% | 94% / 93% | 33% / 33% | 100% / 100% | 50% / 50% | 97% / 96% |
| Modena | w/o FI | 94% / 94% | 94% / 95% | 86% / 94% | 0% / 93% | 86% / 94% | 0% / 41% | 100% / 99% | 0% / 57% | 93% / 96% |
| Modena | w/ FI | 94% / 94% | 95% / 96% | 93% / 95% | 74% / 93% | 95% / 96% | 53% / 68% | 98% / 99% | 62% / 79% | 97% / 98% |
| Sydney | w/o FI | 88% / 90% | 91% / 93% | 81% / 87% | 68% / 76% | 84% / 89% | 38% / 57% | 95% / 95% | 49% / 65% | 89% / 92% |
| Sydney | w/ FI | 89% / 90% | 93% / 94% | 87% / 95% | 86% / 93% | 88% / 96% | 69% / 68% | 95% / 99% | 77% / 79% | 91% / 98% |

Finally, Fig. 5 shows the MAE distribution grouped per clinical center for QoL and SPPB. This helps in understanding the robustness of the models and in identifying any non-homogeneity in the data. In particular, Hong Kong exhibits a higher number of outliers compared to Modena and Sydney, probably because of the small number of cases (33, compared to 128 in Modena and 100 in Sydney), which are also more homogeneous. These results suggest that developing separate models by stratifying across clinics and data collection centres may be beneficial for future, larger scale studies.

[Figure 5: Regression MAE distribution per patient grouped by clinical center. Panels: QoL; SPPB Index.]

5.2 Model Interpretation

SHAP [11] is a framework for interpreting predictions from machine learning models. It is based on Shapley values, first introduced in 1953 in the context of cooperative game theory [23]. Briefly, the main goal of the framework is to rank the relative influence of each feature on a predictive model, both locally, that is, for a specific instance prediction, and globally, i.e., when considering the model predictions for an entire population.

In our medical setting, this means that for each patient, in addition to the predicted outcome the clinician also receives a list of features, ranked in order of their relative importance in achieving the prediction. Importantly, these orders may differ for any two patients. This means that, using SHAP, we enable forms of personalised medicine whereby similar outcomes are explained in terms of different behaviour, represented for instance by different EMA features. An example appears in Fig. 6, showing two different sets of positively contributing (green) and negatively contributing (red) features for two patients with the same SPPB index (note that, for SPPB, higher is better, as this indicates the patient's capacity of physical movement; in the case of Falls, for instance, the opposite would be true). Clearly, this added information may lead to different interventions for these two patients.

[Figure 6: Example of a local interpretation of one patient's SPPB prediction. The 5 most relevant Shapley Values are reported.]

At the same time, SHAP provides global explanations, which characterise the contribution of each feature as a function of its range of values. For instance, Fig. 7 shows how the Shapley Value (SV), indicating the overall contribution of one of these features (a PRO question), goes from negative to positive depending on the patients' responses to this question, with a definite threshold of ⩾ 3.

[Figure 7: Global distribution of one of the PRO's SVs based on the value of the possible answers.]

We note that this capability essentially mimics the KD approach in that it identifies thresholds for the variables. While these are similar to the manually selected cutoffs, in our DD approach these are automatically identified from the data, in a principled way. In the future, this explanation capability may underpin epidemiological studies where the precise characterisation of populations of individuals enables new forms of preventive medicine.

6 CONCLUSIONS AND FUTURE WORK

In this paper we have proposed a novel, data-driven approach towards the definition of Intrinsic Capacity, aimed at quantifying and predicting the wellness state of old people who live with HIV. Using a cohort from a multi-centre prospective study as training set, we have shown that a machine learning model that predicts three specific wellness metrics (Falls, SPPB, and Quality of Life) performs as well as or better than a manually-defined Intrinsic Capacity Index. At the same time, the model is interpretable, making it an ideal complement to expert-based assessment of wellness.

REFERENCES

[1] Ilaria Bellantuono, Rafael DeCabo, Dan Ehninger, et al. 2018. Find drugs that delay many diseases of old age. Nature 554 (02 2018), 293–295.
[2] Giulia Belloni and Matteo Cesari. 2019. Frailty and Intrinsic Capacity: Two Distinct but Related Constructs. Frontiers in Medicine 6 (06 2019).
[3] Thomas Brothers, Susan Kirkland, Giovanni Guaraldi, et al. 2014. Frailty in People Aging With Human Immunodeficiency Virus (HIV) Infection. The Journal of infectious diseases 210 (06 2014).
[4] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proc. of ACM SIGKDD (KDD '16). ACM, New York, NY, USA, 785–794.
[5] Nancy Devlin and Richard Brooks. 2017. EQ-5D and the EuroQol Group: Past, Present and Future. Applied Health Economics and Health Policy 15 (02 2017).
[6] Iacopo Franconi, Olga Theou, Lindsay Wallace, et al. 2018. Construct validation of a Frailty Index, an HIV Index and a Protective Index from a clinical HIV database. PLOS ONE 13 (10 2018), e0201394.
[7] Leilani H. Gilpin, David Bau, Ben Z. Yuan, et al. 2018. Explaining explanations: An overview of interpretability of machine learning. In Proc. DSAA 2018. 80–89.
[8] Giovanni Guaraldi and Jovana Milic. 2019. The Interplay Between Frailty and Intrinsic Capacity in Aging and HIV Infection. AIDS Research and Human Retroviruses 35 (08 2019).
[9] Giovanni Guaraldi, Mirko Orsini, Agnese Caselgrandi, et al. 2019. Fitness tracking wearable devices and a dedicated smart phone app (MySAwH App) to predict quality of life in PLWH: a multi-centre prospective study. In 17th European AIDS Conference (EACS) (2019-08-05).
[10] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, et al. 2018. A survey of methods for explaining black box models. Comput. Surveys (2018).
[11] Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765–4774.
[12] Scott M Lundberg, Bala Nair, Monica S Vavilala, et al. 2018. Explainable machine learning predictions to help anesthesiologists prevent hypoxemia during surgery. Nature Biomedical Engineering 2, 10 (2018), 749–760.
[13] Llew Mason, Jonathan Baxter, Peter L Bartlett, et al. 2000. Boosting algorithms as gradient descent. In Advances in neural information processing systems. 512–518.
[14] Muhammad Aurangzeb Ahmad, Carly Eckert, Ankur Teredesai, et al. 2018. Interpretable Machine Learning in Healthcare. IEEE Intelligent Informatics Bulletin 19, 1 (2018), 1–7.
[15] Harsha Nori, Samuel Jenkins, Paul Koch, et al. 2019. InterpretML: A Unified Framework for Machine Learning Interpretability. arXiv preprint arXiv:1909.09223 (2019).
[16] World Health Organization. 2015. World report on aging and health. https://www.who.int/ageing/events/world-report-2015-launch/en/
[17] World Health Organization. 2018. Ageing and health. https://www.who.int/news-room/fact-sheets/detail/ageing-and-health
[18] Mirko Orsini, Marco Pacchioni, Andrea Malagoli, et al. 2017. My smart age with HIV: An innovative mobile and IoMT framework for patient's empowerment. In 2017 IEEE 3rd International Forum on Research and Technologies for Society and Industry (RTSI). 1–6.
[19] Nathan D Price, Andrew T Magis, John C Earls, et al. 2017. A wellness study of 108 individuals using personal, dense, dynamic data clouds. Nature Biotechnology 35 (jul 2017), 747.
[20] Mashfiqui Rabbi, Shahid Ali, Tanzeem Choudhury, et al. 2011. Passive and in-situ assessment of mental and physical well-being using mobile sensors. In Proc. of UbiComp'11. 385–394.
[21] Martin Ritt, Karl Gassmann, and Cornel Sieber. 2016. Significance of frailty for predicting adverse clinical outcomes in different patient groups with specific medical conditions. Zeitschrift fur Gerontologie und Geriatrie 49 (09 2016).
[22] Samuel Searle, Arnold Mitnitski, Evelyne Gahbauer, et al. 2008. A standard procedure for creating a frailty index. BMC geriatrics 8 (10 2008), 24.
[23] Lloyd Stowell Shapley. 1953. A Value for n-Person Games. Contributions to the Theory of Games, Vol. 2. Princeton University Press, Chapter 17.
[24] Catherine Tong, Matthew Craner, Matthieu Vegreville, et al. 2019. Tracking Fatigue and Health State in Multiple Sclerosis Patients Using Connected Wellness Devices. Proc. of ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3, 3 (sep 2019), 1–19.
[25] Berk Ustun and Cynthia Rudin. 2015. Supersparse Linear Integer Models for Optimized Medical Scoring Systems. Machine Learning 102 (02 2015).
[26] Petar Veličković, Laurynas Karazija, Nicholas D. Lane, et al. 2018. Cross-modal Recurrent Models for Weight Objective Prediction from Multimodal Time-series Data. In Proc. of PervasiveHealth '18 (PervasiveHealth '18). 178–186.
[27] Danding Wang, Qian Yang, Ashraf Abdul, et al. 2019. Designing theory-driven user-centric explainable AI. In Proc. of Conference on Human Factors in Computing Systems.
[28] Rui Wang, Emily A. Scherer, Megan Walsh, et al. 2018. Predicting Symptom Trajectories of Schizophrenia Using Mobile Sensing. GetMobile: Mobile Computing and Communications (2018).
[29] Rui Wang, Weichen Wang, Alex DaSilva, et al. 2018. Tracking Depression Dynamics in College Students Using Mobile Phone and Wearable Sensing. Proc. of ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 1 (2018), 1–26.
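
As an illustration of the knowledge-driven ICI computation described in Section 4 (per-variable score functions combined into a sum normalised by the number of variables), the following is a minimal sketch in Python. Only the stress-level rule (score 1 if the value is lower than 3, else 0) comes from the text; the variable names, the dictionary layout and the 10,000-step cap used to scale the step count into [0, 1] are assumptions made for the example and are not the cutoffs used in the study.

```python
from typing import Callable, Dict

# Hypothetical score functions s_k. Only the stress rule follows the paper's
# example; the step cap is an illustrative assumption, not a study cutoff.
def stress_score(level: float) -> float:
    """Binary score: 1 if the stress level (1 to 10) is lower than 3, else 0."""
    return 1.0 if level < 3 else 0.0


def steps_score(steps: float, cap: float = 10_000.0) -> float:
    """Graded score in [0, 1]: daily steps scaled by an assumed cap."""
    return max(0.0, min(steps / cap, 1.0))


# The manually selected subset V, mapped to its score functions
# (variable names are hypothetical).
SCORERS: Dict[str, Callable[[float], float]] = {
    "stress_level": stress_score,
    "steps_per_day": steps_score,
}


def ici(features: Dict[str, float],
        scorers: Dict[str, Callable[[float], float]] = SCORERS) -> float:
    """Sum of the per-variable scores, normalised by the number of variables n."""
    total = sum(score(features[name]) for name, score in scorers.items())
    return total / len(scorers)


# One monthly feature vector for one patient (made-up values).
example = {"stress_level": 2, "steps_per_day": 5_000}
ici_value = ici(example)  # (1 + 0.5) / 2
```

Keeping the score functions in a dictionary makes the expert's choices (which variables, which cutoffs) explicit and easy to audit, which is the property the KD approach relies on.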
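
The quality assurance step in Section 3 interpolates missing observations only when a gap is at most five steps long. The text does not state which interpolation scheme was used, so the sketch below assumes plain linear interpolation between the two observed neighbours of a gap and leaves longer gaps, as well as gaps at the start or end of a series, untouched. The function name and the use of None to mark a missing observation are likewise assumptions.

```python
from typing import List, Optional


def interpolate_small_gaps(series: List[Optional[float]],
                           max_gap: int = 5) -> List[Optional[float]]:
    """Linearly fill runs of None no longer than max_gap.

    Longer runs, and runs touching either end of the series, are left as None.
    Linear interpolation is an assumption; the paper does not name the scheme.
    """
    out = list(series)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        end = i  # index of the first observed value after the gap (or n)
        gap = end - start
        if start > 0 and end < n and gap <= max_gap:
            left, right = out[start - 1], out[end]
            step = (right - left) / (gap + 1)
            for k in range(gap):
                out[start + k] = left + step * (k + 1)
    return out


filled = interpolate_small_gaps([1.0, None, None, 4.0])
too_long = interpolate_small_gaps([1.0, None, None, 4.0], max_gap=1)
```

Making max_gap a parameter mirrors the experiment described in the text, where more or less "aggressive" interpolation settings were compared by the predictive performance of the resulting models.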
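
The regression results in Section 5.1 are reported as 1-MAPE. A minimal sketch of that metric is given below, assuming the standard definition of MAPE as the mean of |true - predicted| / |true|. The function name and the choice to reject zero-valued targets (for which the percentage error is undefined, and which can occur for SPPB) are assumptions; the paper does not say how such cases were handled.

```python
from typing import Sequence


def one_minus_mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Return 1 - MAPE, with MAPE the mean absolute percentage error.

    Raises ValueError on empty input, mismatched lengths, or a zero target,
    since the percentage error is undefined there (handling is an assumption).
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined for zero-valued targets")
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 1.0 - sum(errors) / len(errors)


# Made-up QoL values in (0, 1]; each prediction is off by 10% of its target.
score = one_minus_mape([0.8, 0.5], [0.72, 0.55])
```

Under this definition a value of 94% means that predictions deviate from the measured outcome by 6% on average, relative to the measured value.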