Data-driven vs knowledge-driven inference of health
              outcomes in the ageing population: a case study
                  Davide Ferrari                                       Giovanni Guaraldi                            Federica Mandreoli
       University of Modena and Reggio Emilia                   University of Modena and Reggio Emilia         University of Modena and Reggio Emilia
                     Modena, Italy                                            Modena, Italy                                  Modena, Italy
            162996@studenti.unimore.it                              giovanni.guaraldi@unimore.it                   federica.mandreoli@unimore.it

             Riccardo Martoglia                                             Jovana Milić                                Paolo Missier
      University of Modena and Reggio Emilia                    University of Modena and Reggio Emilia                 Newcastle University
                    Modena, Italy                                             Modena, Italy                          Newcastle upon Tyne, UK
          riccardo.martoglia@unimore.it                                jovana.milic@gmail.com                         paolo.missier@ncl.ac.uk

ABSTRACT                                                                               1.1    Background
Preventive, Predictive, Personalised and Participative (P4)                            We focus on two easy-to-interpret metrics that clinical re-
medicine has the potential to not only vastly improve people’s                         searchers have proposed to succinctly express the health status
quality of life, but also to significantly reduce healthcare costs                     of patients at a given point in time. The first is a measure of
and improve its efficiency. Our research focuses on age-related                        frailty, designed to quantify the reduction of homeostatic reserves
diseases and explores the opportunities offered by a data-driven                       available to an individual. High frailty is indicative of higher
approach to predict wellness states of ageing individuals, in con-                     risk of negative health outcomes, but it is also a potentially re-
trast to the commonly adopted knowledge-driven approach that                           versible condition [21]. In practice, frailty is measured using a
relies on easy-to-interpret metrics manually introduced by clin-                       variety of specific Frailty Index metrics (FIs). These are calcu-
ical experts. We do so by applying machine learning mod-                               lated using at least 30 directly assessed health variables, which
els to the My Smart Age with HIV (MySAwH) dataset,                                     include signs or symptoms, biochemical parameters, various co-
which is collected through an approach that is relatively new,                         morbidities or socio-demographic data [6, 22]. The choice of the
especially for older HIV patient cohorts: Patient-Related Out-                         specific variables, as well as their sources, may vary depending
come (PRO) values from smartphone apps and activity traces                             on data availability, leading to different specifications for the FI.
from commercial-grade activity loggers. Our results show better                        Given their complexity and multidimensionality, FIs reflect the
predictive performance for the data-driven approach. We also                           biological age of an individual rather than their chronological age,
show that a post hoc interpretation method applied to the predic-                      making them reliable prognostic tools that can be used in differ-
tive models can provide intelligible explanations that enable new                      ent settings for clinical decision algorithms [6]. In particular, the
forms of personalised and preventive medicine.                                         dataset used in this research concerns a cohort of HIV patients.
                                                                                       This is important, because long-lived HIV patients exhibit a form
                                                                                       of accentuated ageing [3], such that they can successfully be used
1    INTRODUCTION                                                                      to study frailty, with the duration of the condition (number
Medical practice is evolving rapidly, away from the traditional but                    of years since infection) used as a proxy for chronological age.
inefficient detect-and-cure approach, and towards a Preventive,                            The second measure of health considered in this work directly
Predictive, Personalised and Participative (P4) vision that focuses                    reflects the more positive notion of healthy ageing [16] that is
on extending people’s wellness state, with particular focus on age-                    becoming prevalent especially in public health settings. In con-
ing individuals [19]. This vision is increasingly data-driven, and                     trast to frailty, which is designed to measure decay, the term
is underpinned by many forms of “Big Health Data”, including pe-                       healthy ageing (HA) has been proposed to promote a positive
riodic clinical assessments and electronic health records, but also                    approach to ageing that relies on reserves and preserved capac-
new forms of self-assessment, such as mobile-based ques-                               ities in an individual, rather than accumulation of deficits. In
tionnaires and personal wearable devices. With these premises,                         the World Health Organization Guidelines on Integrated Care
P4 medicine has the potential to not only vastly improve people’s                      for Older People (ICOPE), HA is based on concrete measures of
quality of life, but also to significantly reduce healthcare costs                     Intrinsic Capacity (IC). These are defined as a composite of all
and improve its efficiency.                                                            the physical and mental capacities of an individual, divided into
   Our research explores specific opportunities offered by data-                       five domains: locomotion, cognition, psychological, vitality and
driven approaches to predictive care, in contrast to traditional,                      sensory capacity [16].
knowledge-driven approaches that rely purely on clinical ex-                               As suggested by Belloni and Cesari [2], frailty and IC should
pertise. Our focus is on age-related diseases, an emerging issue                       not be considered as two opposed constructs, but rather two
for health care systems. The World Health Organisation esti-                           constructs that share a common biological background. The IC
mates that the number of people over 60 years of age will                              should be considered as an evolution of the frailty concept, taking
reach 2 billion by 2050 [17]. Ageing is associated with increased                      into special consideration the functional reserve expressed by
prevalence of co-morbidities that accumulate in the complex of                         the vitality domain, the need for a worldwide implementation
multi-morbidity in older people [1].                                                   of prevention, the continuum of the ageing process, and the
                                                                                       opportunities offered by novel technologies [8].
© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceed-       Key to a successful operational definition of IC is the choice
ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen,       of the variables used to characterise healthy ageing. [16] argues
Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At-
tribution 4.0 International (CC BY 4.0)                                                that integrated care is crucial to incorporate intrinsic capacity
                                                                                       assessment in new care models and encourages the adoption of
wearable devices and mobile apps for longitudinal data collection.      2   RELATED WORK
In [9], the authors proposed a self-generated health measure            The idea of exploiting machine learning toward a general “health”
called Intrinsic Capacity Index (ICI) which relies on physical          focus is certainly gaining popularity, also thanks to the possi-
function data collected through fitness tracking wearable devices       bility of continuously monitoring well-being through wearable
and a set of electronic Patient-Related Outcomes (PRO) collected        activity trackers’ data. For instance, passive sensing techniques
through a dedicated smart-phone app (MySAwH App). In the                have been recently exploited to assess both mental and physical
study, these variables have been collected longitudinally over a        health [20] or to predict weight objectives for users of smart
period of 18 months for a cohort of HIV patients as part of the My      connected devices [26]. Some researchers also successfully ex-
Smart Age with HIV (MySAwH) study. This is a prospective multi-         ploited this kind of data in more traditional disease monitoring
center international case-only study, designed to empower Older         scenarios, e.g., to track Multiple Sclerosis [24], depression [29]
People Living With HIV (OPLWH) to achieve healthy lifestyles            and schizophrenia [28] patients’ health. In most of these studies
and improvement in quality of life.                                     mobile phones are only exploited as a passive sensing device,
   Similarly to FI, ICI is computed from a subset of the available      while in our case we combine the use of passive sensing (wear-
variables, by manually selecting a cutoff point for each variable       able activity tracker) data with self-reporting EMA data collected
and simply counting, for a given patient, the variables with value      through mobile phone apps.
higher than the cutoff for that patient. We refer to this                  A critical aspect for the successful application of ML to
approach, which represents the common practice for assessing health     medicine is the recent increased emphasis on the need for ex-
conditions in geriatric medicine, as “knowledge-driven” (KD for         planations of ML systems [7, 10]. This has led to another im-
short), as it relies on easy-to-interpret metrics where the choice      portant research trend, i.e. Interpretable Machine Learning in
of variables and of their cutoff points is defined manually by          healthcare [14]. Even if recent research has proposed new mod-
clinical experts. The usefulness of the ICI is demonstrated in [9]      els which exhibit high performance as well as interpretability,
by showing experimentally that it displayed higher sensitivity          e.g., GA2M [15] and rule-based models [25], the utility of these
than FI to predict one central indicator of “wellness state” of         models in healthcare has not been convincingly demonstrated
ageing individuals, the Quality of Life (QoL in short).                 yet, due to the rarity of their application [14]. The interpretation
                                                                        method used in [11, 12], instead, is designed to work with existing
                                                                        and well established (even if less interpretable) ML methods, such
                                                                        as gradient boosting or deep learning, by extracting explanations
1.2    Contributions                                                    through models that are applied post-hoc using Shapley Values
In contrast to the dominant KD methods as described, in this            [23]. This is one of the most advanced interpretation methods
paper we explore a complementary data-driven approach (DD)              available, allowing for both global (entire study population) and
to predicting wellness states for long-term patients, using a com-      local (instance level) explanations. In our study we exploit such
bination of clinical, self-monitoring, and PRO (self-reporting)         a method together with the XGBoost [4] gradient boosting algo-
longitudinal observations that refer to the five IC domains. As         rithm, in order to achieve both interpretability and performance.
we will see, while this approach removes the need for the clinical      A similar technique has also been proposed, only at a conceptual
experts to directly define metrics such as the ICI, it also empow-      level, in the Explainable AI framework discussed in [27], which
ers them by deploying machine learning techniques that make             is however based on clinical-only data analysis.
the predictions easily interpretable.
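
To make the idea of post hoc Shapley-value explanations more concrete, the following minimal sketch computes exact Shapley values by brute force for a toy scoring function. It is an illustration only: it does not reproduce the interpretation method of [11, 12] or an XGBoost model, and the function `toy_model`, the feature names (`steps_k`, `stress`, `sleep_h`) and the baseline values are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance`, relative to `baseline`.

    Features outside a coalition are set to their baseline value.
    The cost grows as 2^n, so this only suits a handful of features.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Model output when only the coalition's features take the instance values.
        mixed = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for k in names:
        others = [f for f in names if f != k]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {k}) - value(s))
        phi[k] = total
    return phi


# Hypothetical toy model: a made-up score, NOT a model from the study.
def toy_model(x):
    return (0.5 + 0.03 * x["steps_k"] - 0.04 * x["stress"]
            + 0.01 * x["steps_k"] * x["sleep_h"])


baseline = {"steps_k": 5.0, "stress": 5.0, "sleep_h": 7.0}  # assumed reference point
patient = {"steps_k": 9.0, "stress": 2.0, "sleep_h": 6.0}   # assumed single instance

phi = shapley_values(toy_model, patient, baseline)
# Local explanation: features ranked by the magnitude of their contribution.
ranking = sorted(phi, key=lambda k: abs(phi[k]), reverse=True)
```

The contributions in `phi` sum to the difference between the model output for the patient and for the baseline, and `ranking` gives a per-instance ordering of the variables. Averaging absolute contributions over many instances would give a global view.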
   Specifically, we focus on three dependent variables to charac-       3   THE HIV COHORT DATASET
terise the “wellness state” of ageing individuals, namely, (i) Falls,
                                                                        The experimental dataset used in this paper was obtained from
indicating whether or not a person has experienced any falls
                                                                        the My Smart Age with HIV (MySAwH) [18] project. MySAwH
within a given time period; (ii) SPPB, or Short Physical Perfor-
                                                                        is an ongoing multi-centre prospective study aiming at empow-
mance Battery, measuring movement of the lower limbs, and (iii)
                                                                        ering OPLWH, i.e. those aged 50 or over, to develop healthy lifestyles.
Quality of Life (QoL). Using the MySAwH dataset for training,
                                                                        The project involves 261 patients from three clinics: 128 from
we can directly contrast the DD and KD approaches, and we
                                                                        Modena (Italy), 100 from Sydney (Australia) and 33 from Hong
show that we can achieve better predictive power without the
                                                                        Kong (China). One novelty of the approach is the combination
need to rely on manually formulated ICI metrics. Furthermore,
                                                                        of clinical patient data, acquired during periodic scheduled as-
we also show that the models’ performance increases if we also
                                                                        sessments in the clinic, with patient-oriented longitudinal data
include a single Frailty Index value for each patient, represent-
                                                                        about patients’ behavioural, physiological and environmental
ing a “baseline” clinical assessment, in addition to the PRO and
                                                                        health status, which is collected at higher frequency through
activity variables.
                                                                        smartphones and wearable devices.
   Finally, we combine well-established machine learning mod-
                                                                           Thus, the resulting dataset is highly heterogeneous as to the
els with a post-hoc interpretation method and we show that
                                                                        data type, the geographical origin of patients and the acquisition
satisfactory model performance can be achieved together with
                                                                        rate. It provides a comprehensive characterization of patients
intelligible explanations, which provide the additional benefit of
                                                                        from the broad determinants of health that impact aging, com-
ranking the variables with respect to a prediction. Indeed, we
                                                                        plemented by those more specific to OPLWH, namely:
show that the relative importance of the variables differs for
each patient, indicating the important role played by intelligible          Activity tracking variables: Step count, sleep hours and
models towards personalising healthcare. This capability also                 calories, which are collected daily using a commercial-
provides critical feedback to the clinicians, who can use this in-            grade wearable activity tracker;
formation in combination with their expertise to moderate the               PRO variables: 56 categorical questions exploring func-
model’s predictions.                                                          tional abilities and Quality of life (QoL). These are col-
                                                                              lected monthly using a dedicated smartphone app;
    Clinical variables: Comprehensive geriatric assessment                    • the feature vector $x^{p}_{i,j}$ contains the values of the PRO and
       and HIV-specific variables, which are collected by health-               activity tracking variables $\mathcal{V}$ for patient p at month m: 56
       care workers during study visits at time 0, 9 and 18 months.             PRO answers provided by the patient during the corre-
       37 of these variables were used to measure the Frailty In-               sponding month, plus 3 aggregated values computed
       dex (FI) as defined in [6]: 27 from blood tests, 3 about body            as the mean of the daily wearable device data (step count,
       composition, 7 HIV-related variables and patient-reported                calories, number of sleep hours) collected during the same
       outcomes.                                                                month. Given a variable $V_k$, $x^{p}_{i,j}[V_k]$ denotes the $V_k$ value
   A preliminary analysis [8] introduced a new index for express-               in $x^{p}_{i,j}$;
ing IC, named ICI, and compared the performance of FI and ICI                 • $y^{p}_{j}$ is the value of the outcome o measured during the
in predicting QoL. QoL is a quantitative measure of health as                   hospital visit at the end of period j.
assessed by the individual respondents, widely used in                    The second sample set, denoted as $\mathit{Sample}_o^{FI}$, is built by adding
ageing populations. QoL was assessed using a standardised EQ-5D-          to each feature vector $x^{p}_{i,j} \in \mathit{Sample}_o$ the FI value computed
5L questionnaire, based on the EQ Visual Analogue scale (EQ               from the clinical variables measured during the hospital visit at
VAS) [5]. Through the MySAwH dataset, FI and ICI were shown               the beginning of period j, i.e. at month 0 when j = 1 and 9 when
to be effective tools that can be used in research and clinical           j = 2.
settings to describe disease and health status in OPLWH. Compared
with FI, the ICI score displayed higher sensitivity in predicting
                                                                             Quality Assurance. Lastly, we performed a quality assurance
QoL and self-perceived health in OPLWH.
                                                                          step on the resulting sample sets. Within each time window, ob-
   Outcomes. In this work, we focus on the task of predicting             servations of PRO variables are sometimes incomplete, resulting
significant healthy ageing indicators, and QoL is one such in-            in sequence gaps. The gaps span 5 consecutive missing
dicator. In addition to QoL, two other indicators were selected:          observations on average, with a maximum of 17, and we found 108
Falls, an adverse outcome included among the geriatric                    gaps per patient on average, with a maximum of 284 gaps (regardless
syndromes, and the Short Physical Performance Battery (SPPB),             of size). We performed imputation by interpolating missing data
a group of measures of lower-limb movement ability                        points in the time series, aiming to achieve a balance be-
(Guralnik et al., 2000) that can aid in the monitoring of function        tween the size of the gaps and the performance of the predictive
in older people. These three outcomes were chosen because they            model. Clearly, interpolating very large gaps produces spurious
broadly cover all 5 domains of IC. In summary, they are as follows        data in the training set. We experimentally determined the maximum
(Fig. 1 shows their distributions):                                       size of gaps that could be safely interpolated (five missing steps),
    Quality of Life (QoL): assessed using the EQ-5D-5L stan-              by assessing the predictive performance of each of the models re-
        dard, with values between 0 and 1;                                sulting from training sets obtained from more or less “aggressive”
    Falls: A binary outcome that evaluates to True if a patient           interpolation. After adjusting for missing data, the final training
        has fallen at least once since the previous visit, and False      set contains 2,250 data points, with an average of 8 per patient.
        otherwise;                                                        The construction of the dataset results in at most 16 samples per
    SPPB Index: a discrete index that assumes integer values              patient, for a total of 4,176 records, considering each month for
        between 0 and 12.                                                 each patient.
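
The gap-limited imputation described in the Quality Assurance paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual procedure: the text does not specify the interpolation scheme, so linear interpolation over numerically coded values is assumed, gaps longer than the limit are assumed to be left missing in full, and the function name `interpolate_short_gaps` is hypothetical.

```python
def interpolate_short_gaps(series, max_gap=5):
    """Linearly fill runs of None that are at most `max_gap` long.

    Only interior gaps (observed values on both sides) are filled;
    longer gaps and leading/trailing gaps are left as None.
    """
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i                      # first missing index of this gap
        while i < n and filled[i] is None:
            i += 1
        end = i                        # first observed index after the gap, or n
        gap = end - start
        if start == 0 or end == n or gap > max_gap:
            continue                   # boundary gap or too long: leave untouched
        left, right = filled[start - 1], filled[end]
        step = (right - left) / (gap + 1)
        for k in range(gap):
            filled[start + k] = left + step * (k + 1)
    return filled


# Illustrative series: a gap of 2 (filled), a gap of 6 (kept), a trailing gap (kept).
raw = [2.0, None, None, 5.0, None, None, None, None, None, None, 12.0, None]
clean = interpolate_short_gaps(raw, max_gap=5)
```

Samples that still contain missing values after this step would then be dropped before training, which is one way to read the reduction from the full set of monthly records to the final training set.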

   Observational data and feature space. The observational                4   THE LEARNING FRAMEWORK
dataset used to predict these outcomes covers 18 months and is
                                                                          The objective of the data-driven approach is to predict each out-
broken down into two time windows, reflecting the clinical as-
                                                                          come at the time of a visit using the samples referring to the time
sessment schedule (months 9 and 18), when the selected clinical
                                                                          window before that visit, as shown in Fig. 2.
outcomes are assessed. For each time window, we draw two sets
                                                                             To this end, the data-driven learning framework we built is
of samples from the related observations, which we are going to
                                                                          depicted on the left side of Fig. 3. Each outcome o is predicted by
use as ground truth to train our models.
                                                                          two learning models $M_o$ and $M_o^{FI}$, one for each of the two sample
   The first consists of the patient-centric longitudinal data, in-
cluding the activity tracking and the PRO variables. To this end,         sets $\mathit{Sample}_o$ and $\mathit{Sample}_o^{FI}$, respectively. We trained the two
we further aggregated the resulting PRO time series and the activ-        models separately and assessed the performance using standard
ity tracking time series at regular intervals of 1 month, resulting       k-fold cross-validation (CV) on 80% of the samples ($\chi_{train}$)
in two sets of data points, one for each of the two windows. The          and a test phase on the remaining 20% of the samples ($\chi_{test}$) of the
second sample set augments the first with the FI values com-              corresponding sample set.
puted from the clinical variables measured during hospital visits            The DD approach is compared to the KD approach that repre-
at the beginning of each window, namely at times 0 and 9. This            sents the common practice in geriatric medicine. To this end, we
added value is a physician’s assessment that complements the              built the KD learning framework depicted on the right hand side
patient-centric data point and can be interpreted as the baseline         of Fig. 3. The approach aims at computing an ICI score for each
of the time series the data point comes from.                             observation by manually selecting a subset $V = \{V_1, \ldots, V_n\} \subseteq \mathcal{V}$
   Formally, for each outcome $o \in \{\text{QoL}, \text{SPPB}, \text{Falls}\}$ we write      of the set of PRO and activity tracking variables $\mathcal{V}$, specifying
$\mathit{Sample}_o$ to denote the sample set containing the monthly samples        functions $s_i(x)$ to map each value x for variable $V_i \in V$ to a score,
per patient. For each patient p, we denote a single sample in the         and finally combining the individual scores $s_i(x)$ into a single
set $\mathit{Sample}_o$ at month $m = i + (j - 1) \cdot 9$, corresponding to the      value. The variables are chosen to represent each of the five IC
i-th observation, $i \in [1, 8]$, in the j-th window, $j \in [1, 2]$, by a pair       domains, namely locomotion, cognition, psychological, vitality
$s^{p}_{m} = (x^{p}_{i,j}, y^{p}_{j}) \in \mathit{Sample}_o$, where                   and sensory capacity. For most of the variables $V_i$, a binary score
                                                                          is defined, i.e., $s_i(x) \in \{0, 1\}$, based on a single threshold, for
    • $x^{p}_{i,j}$ represents the feature vector for the i-th observation;          instance when $V_i$ = stress level (from 1 to 10) the score is mapped
Figure 1: Distribution of outcomes in the dataset. Panels: (a) QoL distribution; (b) SPPB distribution; (c) Falls distribution (False/True).

Figure 2: Prediction at next clinical visit using only patient reported outcomes. (Timeline in months: 1 to 9 for the 1st time window, 10 to 18 for the 2nd time window.)

[Fig. 3 diagram labels. Knowledge-Driven (KD) Approach: Raw Data, Subsetting, Cutoffs, f(x), IC Index. Data-Driven (DD) Approach. On each side: Training in CV; Regression (M_QoL_DD, M_QoL_KD); Regression (M_SPPB_DD, M_SPPB_KD); Classification (M_Falls_DD, M_Falls_KD). Performance Comparison: QoL, SPPB (MAE), Falls (Acc, Prec, Rec). Model Interpretation.]

to 1 if the value is lower than 3 and 0 otherwise. Other variables are mapped to a score in the [0, 1] range, for instance the number of steps per day.
                                   p                                                                                                                                                             (Acc, Prec, Rec)
   Then, given a feature vector x i,j for patient p, the correspond-
                                                                                                                                p
ing ICI value ICI (i, j, p) is computed as the sum of the si (x i,j [Vi ])                                                                             Model Interpretation
scores, normalised by the number of variables:
                                     Í |V |     p
                                      i:1 si (x i,j [Vi ])                                                                                           Figure 3: Comparison between                                        Data-Driven             and
                     ICI (i, j, p) =                                                                                                                 Knowledge-Driven approaches
                                             n
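   The ICI computation described above can be sketched as follows. This is a minimal illustration only: the variable
names (`pro_question`, `steps_per_day`), the cutoff of 3 and the 0 to 10000 step range are hypothetical placeholders
and not the variables or thresholds chosen by the physicians in the study.

```python
from typing import Callable, Dict

Scorer = Callable[[float], float]


def binary_below(threshold: float) -> Scorer:
    """Score 1 when the value is lower than the threshold, 0 otherwise."""
    return lambda value: 1.0 if value < threshold else 0.0


def scaled(lo: float, hi: float) -> Scorer:
    """Map a value linearly onto [0, 1], clipping values outside [lo, hi]."""
    def score(value: float) -> float:
        clipped = min(max(value, lo), hi)
        return (clipped - lo) / (hi - lo)
    return score


# Hypothetical subset of variables V with one scoring function s_i per variable.
SCORERS: Dict[str, Scorer] = {
    "pro_question": binary_below(3),    # assumed cutoff, for illustration
    "steps_per_day": scaled(0, 10000),  # assumed range, for illustration
}


def ici(features: Dict[str, float], scorers: Dict[str, Scorer] = SCORERS) -> float:
    """Sum of the per-variable scores, normalised by the number of variables."""
    total = sum(scorer(features[name]) for name, scorer in scorers.items())
    return total / len(scorers)


example = {"pro_question": 2, "steps_per_day": 5000}
print(ici(example))
```

   Keeping one scoring function per variable mirrors the $s_i$ notation in the formula and makes each
expert-chosen cutoff explicit and easy to revise.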
   Such an index is subject to an inevitable bias: the imposition of the physician’s interpretation on the choice
of the variables of the subset, as well as on the thresholds and the arithmetic formula to be used.
   Also for this case, a dataset consisting exclusively of the ICIs, $Sample_o^{ICI}$, and one consisting of the
ICIs and the FI at the most recent visit, $Sample_o^{ICI,FI}$, was isolated. This parallelism with what was done
in the data-driven approach allowed us to train six learning models with the datasets just described ($M_o^{ICI}$
and $M_o^{ICI,FI}$ for each of the three outcomes $o$) and to compare the predictive performance of the two
approaches.

5    EXPERIMENTAL RESULTS AND MODEL INTERPRETATION
In this section we first discuss the performance of the predictive models, and then describe in detail our
approach and results regarding model interpretation, which is reported as output of the DD approach in the left
side of Fig. 3. The latter is a fundamental requirement in medicine, where doctors need an easy-to-understand
interpretation of the model predictions. This not only conveys confidence in the predictions, but also helps to
make them actionable, i.e., in the form of recommendations to patients. We present examples of such
interpretations, and their practical relevance, in Sec. 5.2.
   The Gradient Boosting algorithm [13] proved to offer better predictive performance than other popular
intelligible learning frameworks such as GA$^2$M [15], suggesting that separating model performance from model
interpretability would better suit our needs. Thus, our results are based on a combination of Gradient Boosting
(the XGBoost implementation [4]) for performance (Sec. 5.1), and Shapley Values [23] (using the SHAP
implementation [11]) to generate reports on the relative importance of each individual feature, both across the
entire population and for individual patients (Sec. 5.2).

5.1      Predictive Performance
In our evaluation we compare our DD approach, illustrated in the left side of Fig. 3, with the KD approach based
on a regression on the IC index (right hand side in the Figure). Fig. 4 shows the predictive performance of the
models we tested, namely: the DD models trained with and without using FI as a feature, earlier referred to as
$M_o^{FI}$ and $M_o$, respectively; and the KD models, where again the expert may or may not consider FI. Results
are presented using 1-MAPE (where MAPE is the Mean Absolute Percentage Error) for the numerical outcomes QoL and
SPPB on the left of the figure, and accuracy, precision, recall and F1 for Falls, on the right.
   The results indicate a higher than 90% 1-MAPE for all cases in QoL and SPPB, while classification accuracy for
Falls is higher than 84%. Further, the DD approach performs generally better than KD, and both benefit from using
FI, with the DD models reaching 94.3% and 94.9% 1-MAPE for QoL and SPPB, and 95% accuracy for Falls.

[Figure 4 data, left panel: 1-MAPE]
                QoL         SPPB
                KD    DD    KD    DD
  w/o FI        91%   92%   93%   92%
  w/ FI         92%   94%   93%   95%

[Figure 4 data, right panel: Falls]
                Acc         Prec-True   Prec-False  Rec-True    Rec-False   F1-True     F1-False
                KD    DD    KD    DD    KD    DD    KD    DD    KD    DD    KD    DD    KD    DD
  w/o FI        84%   93%   22%   97%   85%   93%   2%    52%   99%   100%  4%    68%   91%   96%
  w/ FI         89%   95%   72%   98%   92%   95%   54%   68%   96%   100%  62%   80%   94%   97%

Figure 4: Predictive Performance. Left: 1-MAPE (MAPE: Mean Absolute Percentage Error) for QoL and SPPB, right:
classification effectiveness for Falls

   To note, in one case the KD approach returns a very low Recall when FI is not used. This can be explained by
the strong
imbalance of the majority “False” class (no Falls) relative to the small minority “True”.
   The training sets used to generate these models combine patients from all three clinics. To account for
possible differences in data collection protocols between the clinics, we also created one separate model for
each. The corresponding results are presented in Table 1 and are consistent with those presented above. Some
anomalies appear in the Hong Kong models, and these are probably due to the small size of the training set.
   Finally, Fig. 5 shows the MAE distribution grouped per clinical center for QoL and SPPB. This helps to
understand the robustness of the models and to identify any non-homogeneity in the data. In particular, Hong Kong
exhibits a higher number of outliers compared to Modena and Sydney, probably because of the small number of cases
(33, compared to 128 in Modena and 100 in Sydney), which are also more homogeneous. These results suggest that
developing separate models by stratifying across clinics and data collection centres may be beneficial for
future, larger scale studies.

[Figure 5 panels: QoL; SPPB Index]
Figure 5: Regression MAE distribution per patient grouped by clinical center

5.2     Model Interpretation
SHAP [11] is a framework for interpreting predictions from machine learning models. It is based on Shapley
values, first introduced in 1953 in the context of cooperative game theory [23]. Briefly, the main goal of the
framework is to rank the relative influence of each feature on a predictive model, both locally, that is, for a
specific instance prediction, and globally, i.e., when considering the model predictions for an entire
population.
   In our medical setting, this means that for each patient, in addition to the predicted outcome, the clinician
also receives a list of features, ranked in order of their relative importance in achieving the prediction.
Importantly, these orders may differ for any two patients. This means that, using SHAP, we enable forms of
personalised medicine whereby similar outcomes are explained in terms of different behaviour, represented for
instance by different EMA features. An example appears in Fig. 6, showing two different sets of positively
contributing (green) and negatively contributing (red) features for two patients with the same SPPB index (note
that, for SPPB, higher is better, as this indicates the patient’s capacity for physical movement; in the case of
Falls, for instance, the opposite would be true). Clearly, this added information may lead to different
interventions for these two patients.
   At the same time, SHAP provides global explanations, which characterise the contribution of each feature as a
function of its range of values. For instance, Fig. 7 shows how the SV, indicating the overall contribution of
one of these features (a PRO question), goes from negative to positive depending on the patients’ responses to
this question, with a definite threshold of ⩾ 3.
   We note that this capability essentially mimics the KD approach in that it identifies thresholds for the
variables. While these are similar to the manually selected cutoffs, in our DD approach they are automatically
identified from the data, in a principled way. In the future, this explanation capability may underpin
epidemiological studies where the precise characterisation of a population of individuals enables new forms of
preventive medicine.

6     CONCLUSIONS AND FUTURE WORK
In this paper we have proposed a novel, data-driven approach towards the definition of Intrinsic Capacity, aimed
at quantifying and predicting the wellness state of older people who live with HIV. Using a cohort from a
multi-centre prospective study as training set, we have shown that a machine learning model that predicts three
specific wellness metrics (Falls, SPPB, and Quality of Life) performs as well as or better than a manually-defined
Intrinsic Capacity Index. At the same time, the model is interpretable, making it an ideal complement to
expert-based assessment of wellness.

REFERENCES
[1] Ilaria Bellantuono, Rafael DeCabo, Dan Ehninger, et al. 2018. Find drugs that delay many diseases of old age.
    Nature 554 (02 2018), 293–295.
[2] Giulia Belloni and Matteo Cesari. 2019. Frailty and Intrinsic Capacity: Two Distinct but Related Constructs.
    Frontiers in Medicine 6 (06 2019).
[3] Thomas Brothers, Susan Kirkland, Giovanni Guaraldi, et al. 2014. Frailty in People Aging With Human
    Immunodeficiency Virus (HIV) Infection. The Journal of infectious diseases 210 (06 2014).
[4] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proc. of ACM SIGKDD
    (KDD ’16). ACM, New York, NY, USA, 785–794.
[5] Nancy Devlin and Richard Brooks. 2017. EQ-5D and the EuroQol Group: Past, Present and Future. Applied Health
    Economics and Health Policy 15 (02 2017).
[6] Iacopo Franconi, Olga Theou, Lindsay Wallace, et al. 2018. Construct validation of a Frailty Index, an HIV
    Index and a Protective Index from a clinical HIV database. PLOS ONE 13 (10 2018), e0201394.
                QoL         SPPB Index  Falls
                1 - MAPE    1 - MAPE    Acc         Prec True   Prec False  Rec True    Rec False   F1 True     F1 False
                KD    DD    KD    DD    KD    DD    KD    DD    KD    DD    KD    DD    KD    DD    KD    DD    KD    DD
Hong Kong
  w/o FI        93%   93%   91%   92%   84%   96%   0%    0%    87%   100%  0%    0%    97%   97%   0%    0%    92%   98%
  w/ FI         94%   93%   94%   93%   94%   93%   1%    1%    94%   93%   33%   33%   100%  100%  50%   50%   97%   96%
Modena
  w/o FI        94%   94%   94%   95%   86%   94%   0%    93%   86%   94%   0%    41%   100%  99%   0%    57%   93%   96%
  w/ FI         94%   94%   95%   96%   93%   95%   74%   93%   95%   96%   53%   68%   98%   99%   62%   79%   97%   98%
Sydney
  w/o FI        88%   90%   91%   93%   81%   87%   68%   76%   84%   89%   38%   57%   95%   95%   49%   65%   89%   92%
  w/ FI         89%   90%   93%   94%   87%   95%   86%   93%   88%   96%   69%   68%   95%   99%   77%   79%   91%   98%

Table 1: Single-clinic models performance (left: predictive performance for QoL and SPPB, right: classification
effectiveness for Falls; Prec = precision, Rec = recall, reported separately for the True and False classes)
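
The metrics reported for the single-clinic models can be computed as in the following sketch. It is a plain-Python
illustration rather than the evaluation code used in the study; the toy labels and values are invented, and the
1-MAPE helper assumes that no true value is zero (which would need separate handling for outcomes that can take
the value 0).

```python
from typing import Sequence, Tuple


def one_minus_mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 - mean absolute percentage error; assumes every true value is non-zero."""
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 1.0 - sum(errors) / len(errors)


def accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_scores(y_true: Sequence[bool], y_pred: Sequence[bool],
                     positive: bool) -> Tuple[float, float, float]:
    """Precision, recall and F1 when `positive` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented toy data: 8 patients without falls, 2 with falls (imbalanced classes).
falls_true = [False] * 8 + [True] * 2
falls_pred = [False] * 8 + [True, False]

print(one_minus_mape([10, 20], [9, 22]))
print(accuracy(falls_true, falls_pred))
print(per_class_scores(falls_true, falls_pred, positive=True))
print(per_class_scores(falls_true, falls_pred, positive=False))
```

Reporting precision and recall per class, as the table does, exposes weak recall on the minority True class that
overall accuracy alone would hide.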


Figure 6: Example of a local interpretation of one patient’s SPPB prediction. The 5 most relevant Shapley Values are
reported.
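
To make the notion of a local interpretation concrete, the sketch below computes exact Shapley values for a single
prediction by averaging each feature's marginal contribution over all feature orderings, and then ranks the
features by absolute contribution. It is an illustration under stated assumptions: the toy scoring function, the
feature names (`steps_k`, `pain`, `sleep_h`) and the baseline values are hypothetical, and the brute-force
enumeration stands in for the far more efficient estimators of the SHAP implementation [11].

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict

Features = Dict[str, float]


def shapley_values(predict: Callable[[Features], float],
                   instance: Features, baseline: Features) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    names = list(instance)
    contributions = {name: 0.0 for name in names}
    for order in permutations(names):
        current = dict(baseline)          # start from the reference patient
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # switch one feature to the patient's value
            updated = predict(current)
            contributions[name] += updated - previous
            previous = updated
    n_orders = factorial(len(names))
    return {name: total / n_orders for name, total in contributions.items()}


def toy_score(f: Features) -> float:
    """Hypothetical stand-in for a trained model (not the study's XGBoost model)."""
    return 4.0 + 0.5 * f["steps_k"] - 1.0 * f["pain"] + 0.125 * f["steps_k"] * f["sleep_h"]


baseline = {"steps_k": 4.0, "pain": 2.0, "sleep_h": 6.0}  # assumed reference values
patient = {"steps_k": 8.0, "pain": 4.0, "sleep_h": 8.0}   # invented patient

values = shapley_values(toy_score, patient, baseline)
ranked = sorted(values.items(), key=lambda item: abs(item[1]), reverse=True)
for name, value in ranked:
    print(f"{name}: {value:+.2f}")
```

The contributions sum to the difference between the patient's prediction and the baseline prediction, which is
the property that lets a clinician read each value as a share of the predicted outcome.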

Figure 7: Global distribution of one of the PRO’s SVs based on the value of the possible answers.

[7] Leilani H. Gilpin, David Bau, Ben Z. Yuan, et al. 2018. Explaining explanations: An overview of
    interpretability of machine learning. In Proc. DSAA 2018. 80–89.
[8] Giovanni Guaraldi and Jovana Milic. 2019. The Interplay Between Frailty and Intrinsic Capacity in Aging and
    HIV Infection. AIDS Research and Human Retroviruses 35 (08 2019).
[9] Giovanni Guaraldi, Mirko Orsini, Agnese Caselgrandi, et al. 2019. Fitness tracking wearable devices and a
    dedicated smart phone app (MySAwH App) to predict quality of life in PLWH: a multi-centre prospective study.
    In 17th European AIDS Conference (EACS) (2019-08-05).
[10] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, et al. 2018. A survey of methods for explaining black
     box models. Comput. Surveys (2018).
[11] Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in
     Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
     S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765–4774.
[12] Scott M Lundberg, Bala Nair, Monica S Vavilala, et al. 2018. Explainable machine learning predictions to
     help anesthesiologists prevent hypoxemia during surgery. Nature Biomedical Engineering 2, 10 (2018), 749–760.
[13] Llew Mason, Jonathan Baxter, Peter L Bartlett, et al. 2000. Boosting algorithms as gradient descent. In
     Advances in neural information processing systems. 512–518.
[14] Muhammad Aurangzeb Ahmad, Carly Eckert, Ankur Teredesai, et al. 2018. Interpretable Machine Learning in
     Healthcare. IEEE Intelligent Informatics Bulletin 19, 1 (2018), 1–7.
[15] Harsha Nori, Samuel Jenkins, Paul Koch, et al. 2019. InterpretML: A Unified Framework for Machine Learning
     Interpretability. arXiv preprint arXiv:1909.09223 (2019).
[16] World Health Organization. 2015. World report on aging and health.
     https://www.who.int/ageing/events/world-report-2015-launch/en/
[17] World Health Organization. 2018. Ageing and health.
     https://www.who.int/news-room/fact-sheets/detail/ageing-and-health
[18] Mirko Orsini, Marco Pacchioni, Andrea Malagoli, et al. 2017. My smart age with HIV: An innovative mobile and
     IoMT framework for patient’s empowerment. In 2017 IEEE 3rd International Forum on Research and Technologies
     for Society and Industry (RTSI). 1–6.
[19] Nathan D Price, Andrew T Magis, John C Earls, et al. 2017. A wellness study of 108 individuals using
     personal, dense, dynamic data clouds. Nature Biotechnology 35 (jul 2017), 747.
[20] Mashfiqui Rabbi, Shahid Ali, Tanzeem Choudhury, et al. 2011. Passive and in-situ assessment of mental and
     physical well-being using mobile sensors. In Proc. of UbiComp’11. 385–394.
[21] Martin Ritt, Karl Gassmann, and Cornel Sieber. 2016. Significance of frailty for predicting adverse clinical
     outcomes in different patient groups with specific medical conditions. Zeitschrift für Gerontologie und
     Geriatrie 49 (09 2016).
[22] Samuel Searle, Arnold Mitnitski, Evelyne Gahbauer, et al. 2008. A standard procedure for creating a frailty
     index. BMC geriatrics 8 (10 2008), 24.
[23] Lloyd Stowell Shapley. 1953. A Value for n-Person Games. Contributions to the Theory of Games, Vol. 2.
     Princeton University Press, Chapter 17.
[24] Catherine Tong, Matthew Craner, Matthieu Vegreville, et al. 2019. Tracking Fatigue and Health State in
     Multiple Sclerosis Patients Using Connected Wellness Devices. Proc. of ACM on Interactive, Mobile, Wearable
     and Ubiquitous Technologies 3, 3 (sep 2019), 1–19.
[25] Berk Ustun and Cynthia Rudin. 2015. Supersparse Linear Integer Models for Optimized Medical Scoring Systems.
     Machine Learning 102 (02 2015).
[26] Petar Veličković, Laurynas Karazija, Nicholas D. Lane, et al. 2018. Cross-modal Recurrent Models for Weight
     Objective Prediction from Multimodal Time-series Data. In Proc. of PervasiveHealth ’18. 178–186.
[27] Danding Wang, Qian Yang, Ashraf Abdul, et al. 2019. Designing theory-driven user-centric explainable AI. In
     Proc. of Conference on Human Factors in Computing Systems.
[28] Rui Wang, Emily A. Scherer, Megan Walsh, et al. 2018. Predicting Symptom Trajectories of Schizophrenia Using
     Mobile Sensing. GetMobile: Mobile Computing and Communications (2018).
[29] Rui Wang, Weichen Wang, Alex DaSilva, et al. 2018. Tracking Depression Dynamics in College Students Using
     Mobile Phone and Wearable Sensing. Proc. of ACM on Interactive, Mobile, Wearable and Ubiquitous
     Technologies 2, 1 (2018), 1–26.