Hierarchical Bayesian Survival Analysis and Projective Covariate
           Selection in Cardiovascular Event Risk Prediction


    Tomi Peltola                 Aki S. Havulinna               Veikko Salomaa                  Aki Vehtari
tomi.peltola@aalto.fi          aki.havulinna@thl.fi          veikko.salomaa@thl.fi          aki.vehtari@aalto.fi
   Aalto University,            National Institute for         National Institute for          Aalto University,
       Finland                   Health and Welfare,           Health and Welfare,                 Finland
                                       Finland                       Finland


                      Abstract                               statistical challenge is then to identify an informative
                                                             subset of the biomarkers and estimate its predictive
     Identifying biomarkers with predictive value            performance.
     for disease risk stratification is an important         Here, we describe an application of linear, hierarchi-
     task in epidemiology. This paper describes an           cal Bayesian survival regression to model cardiovascu-
     application of Bayesian linear survival regres-         lar event risk in diabetic individuals. The available
     sion to model cardiovascular event risk in di-          data consists of 7932 Finnish individuals in the FIN-
     abetic individuals with measurements avail-             RISK 1997 cohort [1], of whom 401 had diabetes at
     able on 55 candidate biomarkers. We extend              the beginning of the study. The covariates consist of
     the survival model to include data from a               a set of 55 candidate biomarkers measured from blood
     larger set of non-diabetic individuals in an            samples and 12 established risk factors (e.g., base-
     e↵ort to increase the predictive performance            line age, sex, body-mass index, lipoprotein cholesterol
     for the diabetic subpopulation. We com-                 measures, blood pressure and smoking). The length
     pare the Gaussian, Laplace and horseshoe                of the follow-up period was 15 years. We focus on
     shrinkage priors, and find that the last has            three key elements in the model construction: 1) using
     the best predictive performance and shrinks             shrinkage priors to model the assumption of possibly
     strong predictors less than the others. We              limited relevance of many biomarkers, 2) utilizing the
     implement the projection predictive covari-             large set of non-diabetic individuals in the modelling,
     ate selection approach of Dupuis and Robert             and 3) the selection of a subset of the biomarkers with
     (2003) to further search for small sets of pre-         predictive value. While the statistical approach is not
     dictive biomarkers that could provide cost-             limited to this particular application, we use the set-
     efficient prediction without significant loss in        ting to make the description of the methods concrete.
     performance. In passing, we present a deriva-
     tion of the projective covariate selection in           Shrinkage or sparsity-promoting priors for regression
     Bayesian decision theoretic framework.                  coefficients are used to shrink the e↵ects of (appar-
                                                             ently) irrelevant covariates to zero, while retaining the
                                                             e↵ects of relevant covariates. Their use has increased
                                                             with the availability of datasets with large numbers of
1    INTRODUCTION                                            features, for example, from high-throughput measure-
                                                             ment technologies, which often capture a snapshot of
Improving disease risk prediction is a major task in
                                                             a whole system (e.g., metabolome, genome) instead
epidemiological research. Non-communicable diseases,
                                                             of targeted features. The interest has spawned con-
many of which develop and progress slowly, are a ma-
                                                             siderable research e↵ort into such priors and multiple
jor cause of morbidity worldwide. Accurate risk pre-
                                                             alternatives have been proposed (see, e.g., refs [2–6]).
diction could be used to screen individuals for targeted
                                                             In this work, we chose to compare three priors: the
intervention. Advances in measurement technologies
                                                             Laplace [3], the horseshoe [5] and, as a baseline, a
allow researchers cost-efficient quantification of large
                                                             Gaussian prior. The Laplace prior corresponds to the
numbers of potentially relevant biomarkers, for exam-
                                                             popular lasso penalty [7] in non-Bayesian regularized
ple, in blood samples. However, often only a few of
                                                             regression. The horseshoe prior has been shown to
such candidate biomarkers could be expected to give
                                                             have desirable features in Bayesian analysis [5, 8]. We
practically relevant gain in risk stratification or could
                                                             briefly review these priors in Section 2.2.
be realistically used in routine health care setting. The

                                                        79
Of the 401 diabetic individuals in the study, 155 ex-        2     MODEL
perienced a cardiovascular event within the follow-up
period. This leaves a limited set of informative sam-        We first consider modelling the cardiovascular-event-
ples to perform the model fitting, covariate selection       free survival in the subset of diabetic individuals only.
and predictive performance evaluation with. Although         The model is then extended to include the data of
the risk of cardiovascular events is larger in diabetic      non-diabetic individuals, while allowing the covariate
individuals than the general population [9], we would        e↵ects and the baseline hazard to di↵er in these groups
expect that the risk factors are shared at least to some     and between men and women.
extent. Based on this assumption, we incorporate the
non-diabetic individuals (n = 7531, 1031 events) into
                                                             2.1     OBSERVATION MODEL
the analysis by constructing a hierarchical joint model,
where the submodels for diabetic and non-diabetic in-        Let the observation ti be the event time Ti or
dividuals can be correlated (akin to transfer or multi-      the censoring time Ci since the beginning of the
task learning [10]). The joint model does not place          study for ith individual and vi be the corresponding
hard constraints on the similarity of the submodels,         event/censoring indicator (1 for observed events, 0 for
but allows the models to di↵er between non-diabetic          censored). All censored cases are right censored (i.e.,
and diabetic individuals and also between men and            Ti > Ci where only Ci is observed; censoring occurs
women. Details are given in Section 2.3.                     in the data mostly because of event-free survival to
While lasso regression in the non-Bayesian context can       the end of the follow-up). Further, let xi be a column
perform hard covariate selection by estimating exact         vector of the observed covariate values for the ith sub-
zeroes for regression coefficients, the Bayesian shrink-     ject. We assume a parametric survival model, where
age priors do not lead to sparse posterior distributions     the observations follow the Weibull model2
as there will remain uncertainty after observing a fi-                                     v (↵ 1)
nite dataset. However, we are interested in finding         p(ti |xi , vi , , ↵) = ↵vi ti i          exp(vi    T
                                                                                                                   xi t ↵
                                                                                                                        i exp(
                                                                                                                               T
                                                                                                                                 xi ))
a minimal subset of predictively relevant biomarkers
as discussed above. To this end, we examine the use          with the shape ↵ and the scale defined through the
of projection predictive covariate selection1 , where the    linear combination T xi of the covariates [14]. The
full model, encompassing all the candidate biomarkers        Weibull model is a proportional hazard model with
and the uncertainties related to their e↵ects, is taken      the hazard function h(Ti ) = ↵Ti↵ 1 exp( T xi ).
as a yardstick for the smaller models. Specifically, the     We include a constant term 1 in the covariates xi
models with subsets of covariates are found by maxi-         and denote the corresponding regression coefficient 0 .
mizing the similarity of their predictions to this refer-    The intercept and the shape are given the di↵use pri-
ence as proposed by Dupuis and Robert [12]. Notably,         ors:
this approach does not require specifying priors for the
submodels and one can instead focus on building a                                      0      ⇠     N(0, 102 ),
good reference model. Dupuis and Robert [12] suggest
                                                                                  log ↵       ⇠     N(0, 102 ).
choosing the size of the covariate subset based on an
acceptable loss of explanatory power compared to the
reference model. We examine using cross-validation           The covariates are divided into a set of established
based estimates of predictive performance as an alter-       risk (or protective) factors and a set of new candidate
native.                                                      biomarkers, which are of more uncertain relevance.
                                                             The coefficients of the established predictors, j for
The structure of this article is as follows. In Section
                                                             j = 1, . . . , mbg , are given the prior [15]:
2, we describe the survival model, shrinkage priors,
and the hierarchical extension to include data of non-                                     2 2
                                                                        j   ⇠    N(0,      s j ),    for j = 1, . . . , mbg ,
diabetic individuals. The projection predictive covari-
                                                                        2                  2
ate selection is described in Section 3. The results                    j   ⇠    Inv–       (1), for j = 1, . . . , mbg ,
from the application of the methods for cardiovascular-                 s   ⇠    Half–N(0, 102 ).
event-free survival modelling in diabetic individuals
are presented in Section 4. Finally, Section 5 discusses
the modelling approach.                                      Priors for the coefficients of the candidate biomarkers
                                                             are considered below.
                                                                2
                                                                  The notation for probability distributions follows the
   1
     A comprehensive review of predictive Bayesian model     parametrizations given in ref. [13], except for the Weibull
selection approaches is given by Vehtari and Ojanen [11].    model, which is explicitly written out. Half -distributions
Our terminology follows theirs.                              refer to the restriction to the real positive axis.

                                                      80
2.2   PRIORS FOR BIOMARKER                                         2.3   HIERARCHICAL EXTENSION
      COEFFICIENTS
                                                                   Next, we consider extending the approach to jointly
Based on our prior assumption that only some of the                model the event-free survival of non-diabetic men
biomarkers are expected to be practically relevant for             (NM), non-diabetic women (NW), diabetic men (DM),
prediction, we consider the use of shrinkage priors for            and diabetic women (DW). Our aim is to increase the
the biomarker coefficients. As discussed in the intro-             predictive performance of the model specifically in the
duction, there has been a lot of recent research into              subset of diabetic individuals, but gain power by in-
these type of priors and there are multiple proposals.             cluding the larger set of observations for non-diabetic
We restrict our consideration to three alternatives: the           individuals in the model. To this end, we tie together
horseshoe prior [5], the Laplace prior [3], and, as a              the submodels of the four groups using the following
baseline approach, a Gaussian prior. Each of these                 assumptions:
can be expressed as normal scale mixtures
             2 2
                                                                    1. The relevance of a biomarker will be similar for
   j ⇠ N(0, ⌧s ⌧j ),   for j = mbg + 1, . . . , mbg + mbm ,            all the submodels.
where ⌧s is a global scale parameter (shared across j)              2. The e↵ect size of a biomarker (or other covari-
and ⌧j are local parameters. Ideally, the prior shrinks                ate) and its direction are similar between men and
the coefficients of irrelevant biomarkers to zero, but                 women, and between diabetic and non-diabetic in-
allows large coefficients for relevant biomarkers. In a                dividuals.
sparse situation, with many irrelevant biomarkers and
few relevant, this could be e↵ected by making ⌧s small,             3. The baseline hazard functions have similar shapes
but allowing some ⌧j to take on large values to escape                 for men and women, and diabetic and non-
the shrinkage [16].                                                    diabetic individuals.

The priors for ⌧j s, for j = mbg + 1, . . . , mbg + mbm , for      Let j = [ j,N M j,N W j,DM j,DW ]T be the coef-
the three alternatives are                                         ficients for the jth biomarker in the four submodels.
                                                                   We set
       ⌧j ⇠ Half–Cauchy(0, 1)             for horseshoe,                                         2    1
                                                                                       j ⇠ N(0, rj ⇤    ),
       ⌧j2 ⇠ Exponential(0.7)             for Laplace,
                                                                   where rj2 ⇤ 1 is the prior covariance matrix. Here,
       ⌧j = 1                             for the Gaussian.        rj = ⌧j ⌧s and follows one of the prior specifications
                                                                   given in the previous section. This encodes the first
A comparison of the Laplace and horseshoe prior is                 assumption above: a single rj parameter defines the
given in ref. [5]: it is noted that the Laplace prior              relevance of the jth biomarker in all the four submod-
may overshrink large coefficients in a sparse situation,           els.
while the horseshoe prior is more robust (see also ref.
[16]). Furthermore, van der Pas et al. [8] derive the-   To encode the second assumption, we specify the struc-
oretical results indicating that the posterior distribu- ture of the prior precision matrix as
tion under the horseshoe prior may be more informa-       2                                                   3
tive3 than under the Laplace prior in a sparse normal       1 + cN + sM        cN           sM          0
                                                          6       cN     1 + cN + sW        0          sW     7
means problem. The Gaussian prior does not try to ⇤=6                                                         7.
                                                          4       sM           0      1 + cD + sM       cD    5
separate between relevant and irrelevant covariates as
it depends only on the shared scale parameter ⌧s .                0            sW           cD    1 + cD + sW

The same prior is given for the global scale parameter             The corresponding graphical structure is illustrated in
in each case:                                                      Figure 1. As will be made more explicit below, the cN
                                                                   and cD control the similarity of the submodels of non-
                 ⌧s ⇠ Half–Cauchy(0, 1),                           diabetic men and women, and between the submodels
                                                                   of diabetic men and women, respectively. sM and sW
which has its (bounded) mode at zero, but is only                  control the similarity between the submodels of non-
weakly informative as it also places a substantial                 diabetic and diabetic men, and non-diabetic and di-
amount of prior mass far from zero (see refs [15–17] for           abetic women, respectively. We further simplify the
discussion on priors for global variance parameters).              model by taking cN = cD = c and sM = sW = s and
   3                                                               constrain c > 0 and s > 0. The precision matrix has
     That is, the posterior mean estimator attains a min-
imax risk, possibly up to a multiplicative constant, in a          similarity to the one used by Liu et al. [18] to learn de-
sparse setting and the posterior contracts at a similar rate       pendencies between covariates, but here ⇤ is restricted
(with conditions on ⌧s ).                                          to encode a specific prior structure.

                                                              81
                           cN                                        1
                                                                                    0.9                                          0.9
                 j,N M               j,N W
                                                                    0.8
                                                                                                                           0.5
                                                                    0.6             0.5
              sM                         sW


                                                               s0
                                                                    0.4
                                                                                                              0.1
                                                                    0.2
                 j,DM                j,DW                                           0.1
                            cD                                       0
                                                                          0   0.2 0.4 0.6 0.8       1 0    0.2 0.4 0.6 0.8         1
                                                                                    c0                                c0
Figure 1: Prior structure for the regression coefficients
of jth biomarker in the joint model.                           Figure 2: Contour plots of the correlation coefficient
                                                               between j,N M and j,DM (left) and j,N M and j,DW
                                                               (right) as a function of c0 and s0 .
We choose = (2c+1)(2s+1)(2c+2s+1)
                (1+2c+2s+2cs)(c+s+1) as this makes the
diagonal elements of ⇤ 1 equal to 1, that is, ⇤ 1
becomes a correlation matrix. The relevance of the jth
biomarker is then solely dependent on rj .                     minimum value of 0 when ⇢ = 1. In Figure 2, the im-
                                                               plied prior correlation coefficients of some interesting
For more insight, the prior for    j may be written out        pairs of X,j s are shown as functions of c0 and s0 : s0
as proportional to                                             controls almost linearly the correlation between j,N M
                                          !
                      1                                        and j,DM , whereas the correlation between j,N M
           exp           (S2 + cSc + sSs ) ,                   and j,DW is close to bilinear in c0 and s0 .
                    2rj2
                                                               To complete the prior specification c0 and s0 are given
             2       2              2            2
where S2 = j,N  M + j,N W +         j,DM +       j,DW , Sc =   prior distributions. We use di↵erent parameters for
( j,N M          2
          j,N W ) + ( j,DM            j,DW )
                                             2
                                                and Ss =       biomarkers (c0 and s0 ), other covariates (c0bg and s0bg )
( j,N M          2
          j,DM ) + ( j,N W
                                           2
                                     j,DW ) .    c controls    and the log-scale Weibull shape parameter log ↵ (c0↵
the penalization in the di↵erence between men and              and s0↵ ; this encodes the third assumption):
women, and s controls the penalization in the di↵er-
ence between non-diabetic and diabetic subjects. Tak-
ing negative logarithm of the prior shows that it corre-                                  c0 ⇠ Beta(ac , bc ),
sponds to a specific Bayesian version of the multi-task                                   s0 ⇠ Beta(as , bs ),
graph regularization penalty proposed by Evgeniou                                        c0bg ⇠ Beta(ac , bc ),
et al. [19] and further studied by Sheldon [20]. The
                                                                                         s0bg ⇠ Beta(as , bs ),
prior can also be represented in the sparse Bayesian
multi-task learning framework of Archambeau et al.                                        c0↵ ⇠ Beta(ac , bc ),
[21], where a zero-mean matrix-variate Gaussian den-                                     s0↵ ⇠ Beta(as , bs ).
sity is placed on B = [ 1 , . . . , m ] with row covariance
⌦ (over the m covariates) and column covariance ⌃
(over the tasks). Here, ⌦ is a diagonal matrix with            Finally, ac , bc , as and bs are given Gamma( 12 , 14 ) priors.
elements rj2 and ⌃ = ⇤ 1 .
                                                               We note that the eigendecomposition of ⇤ = V DV T
We use the following transformations of c and s: c =           is of simple form, with D being a diagonal matrix with
(1 c0 ) 1 1 and s = (1 s0 ) 1 1, where c0 2 [0, 1)             elements 1, 1 + 2c, 1 + 2s, 1 + 2c + 2s and
and s0 2 [0, 1). At c0 = 0, c = 0 and the corresponding
submodels are independent. As c0 ! 1, c ! 1 and the                                         2                         3
corresponding submodels are constrained to identical.                                  1            1     1         1
s0 behaves similarly.                                                               16 1            1     1         17
                                                                                 V = 6                                7.
                                                                                    2 41            1     1         15
We can also examine the implied prior distribution of                                  1            1     1         1
the di↵erence between two X,j coefficients as a func-
tion of c0 and s0 . First, note that the distribution of
                       2
  X,j    Y,j is N(0, 2rj (1    ⇢)), where ⇢ is the corre-      This can be useful in reparametrizing the model for
lation coefficient. Specifically, the variance of the dis-     Markov chain Monte Carlo sampling algorithms. It
tribution is linearly dependent on ⇢ and, for ⇢         0,     also shows that the precision matrix is positive defi-
has the maximum value of 2rj2 when ⇢ = 0 and the               nite.

                                                          82
3     METHODS FOR BIOMARKER                                    order,
      SELECTION AND PREDICTIVE                                                 Z Z
      PERFORMANCE EVALUATION                                       ū(M? ) =          p(T |✓, x) log p(T |✓? , x? )dT

                                                                                  ⇥ f (✓? |✓)p(✓|D)p(x)d(✓? , ✓, x).
The approaches used for biomarker selection and eval-
uation of predictive performance are described below.          Finally, to arrive at the same solution with Dupuis and
The model constructed in previous section is used as           Robert [12], f (✓? |✓) can be restricted to the Dirac
the reference model in the biomarker selection.
                                                               delta function (✓? ✓ˆ? ) with an o↵set ✓ˆ? that de-
                                                               pends on ✓. That is, the solution to the maximization
3.1     PROJECTION PREDICTIVE                                  of ū is defined pointwise for each ✓ as the correspond-
        COVARIATE SELECTION                                    ing optimal value of ✓ˆ? . The pointwise solution arises
                                                               from the dependence of f on ✓.
Assuming the availability of a reference model, which          As p(✓|D) is not available analytically and p(x) at all,
is a good representation of the predictive power of the        the former is approximated with Markov chain Monte
candidate biomarkers and the related uncertainty, we           Carlo methods and the latter by using xi samples
seek a subset of the biomarkers, which can be used             available in the data D [12]. The obtained estimate
for prediction without a large loss in performance rel-        is
ative to the reference model. Our prior assumption                               Z
of sparsity in the biomarker e↵ects implies that this                      1 X                                    (j)
                                                               ū(M? ) ⇡             p(T |✓ (j) , xi ) log p(T |✓ˆ? , xi,? )dT ,
goal could be achievable. We describe the approach in                     nJ i,j
two steps: 1) defining a submodel for making predic-
tions with a specific subset of the candidate biomark-         where the double sum runs over the n data points and
ers, and 2) finding submodels with good predictive per-        the J posterior samples. The optimization problems
formance.                                                                            (j)
                                                               to find the optimal ✓ˆ? s are independent over j. We
                                                               solve them using the Newton’s method.
3.1.1    Projective Submodels                                  We define the projection predictive distribution for the
                                                               submodel M? as
We use the projective approach of Dupuis and Robert
                                                                                   Z
[12], Goutis and Robert [22] to find the parameters of
the submodel, but present an alternative derivation in           p(T |x? , Mref ) = p(T |x? , ✓? )f (✓? |Mref )d✓? ,
the Bayesian decision theoretic framework reviewed in
ref. [11]. The projection is posed as a solution to an         where we explicitly emphasize the dependence on the
optimization problem with regard to a restriction of           reference model Mref and which is approximated using
                                                                                         (j)
the reference model. Let the covariates x be divided           the projected samples ✓ˆ? s. This kind of projected
into two parts x = [x? , x> ] and define a submodel            predictive distribution was also considered by Nott and
M? to be restricted to using the covariates in x? 4 with       Leng [23].
parameters ✓? = ( ? , ↵? ) in the Weibull model. We
                                                               Note that scaling the estimated ū as d(M? ) =
find the submodel by maximizing the Gibbs reference
                                                               ū(Mref ) ū(M? ) (and minimizing instead of maxi-
utility
                                                               mizing) does not change the optimal solution and gives
                  Z Z                                         otherwise the same formula as ū, except the term in
      ū(M? ) =          u(M? , x? , ✓, T )p(T |✓, x)dT        square brackets is replaced with the Kullback–Leibler
                                                               divergence between p(T |x, ✓) and p(T |x? , ✓? ). This
                   p(✓|D)p(x)d(✓, x)                           gives the approach further information theoretic justi-
                                                               fication and is the basis of the formulation in Dupuis
with respect to the unknown probability densi-                 and Robert [12]. They also suggest defining the rela-
ties  f (✓? |✓) appearing in the u(M? , x? , ✓, T ) =          tive explanatory power of the submodel as
R
  f (✓? |✓) log p(T |✓? , x? )d✓? . Here, p(✓|D) is the
                                                                                                                d(M? )
posterior distribution of the reference model given the           relative explanatory power(M? ) = 1                  ,
                                                                                                                d(M0 )
observed data D and p(x) is the distribution of the co-
variates. Writing out u and changing the integration           where M0 refers to the model without any of the can-
                                                               didate biomarkers and which transforms the d(M? )
   4
     We assume that the established risk factors are always    values to between 0 (for M? = M0 ) and 1 (for M? =
included in this set.                                          Mref ).

                                                          83
3.1.2     Submodel Search                                             Stratified ten-fold cross-validation [25] is used to ob-
                                                                      tain estimates of the generalization performance: The
ū (or equivalently d) is used to compare the submodels               full dataset is divided randomly into ten disjoint sub-
in the search for good subsets of biomarkers. However,                sets (folds), while balancing the sets to have approxi-
exhaustive search of the model space5 is not feasible,                mately similar age distributions and proportions of di-
unless the number of candidate biomarkers is small.                   abetic and non-diabetic individuals, men and women,
We choose to use the suboptimal forward selection                     and cases of cardiovascular events. Predictions for
strategy for its simplicity and its scalability to large              each fold are obtained using a posterior distribution
covariate sets:                                                       based on training data, where the particular fold has
                                                                      been left out. Given predictions obtained this way,
 1. Begin with the submodel M0 (no biomarkers) and                    the predictive performance is summarized by the mean
    set j to 0.                                                       LPD over the full set of n data points (MLPD).
 2. Repeat until all biomarkers have been added:                      To reduce variance and gauge uncertainty in model
                                                                      comparisons, we compute Bayesian bootstrap [26]
       (a) Find the projections for all submodels that                samples of the MLDP di↵erence ( MLPD) between
           are obtainable by adding one new biomarker                 model Ma and model Mb by
           to Mj . Select the one with largest ū and set
           it as Mj+1 . Set j to j + 1.                                                          n
                                                                                                 X      (j)
                                                                          MLPD(j) (Ma , Mb ) =         wi [LPDi (Ma ) LPDi (Mb )],
This defines a deterministic6 path of models from M0                                             i=1

to Mmbm and gives a ranking of the biomarkers ac-
                                                                                (j)
cording to their projection predictive value. Dupuis                  where wi , i = 1, . . . , n, are the bootstrap weights
                                                                       P (j)
and Robert [12] suggest finally choosing the small-                   ( i wi = 1) for the jth bootstrap sample generated
est submodel with an acceptable loss in the explana-                  using the Dirichlet distribution with parameters set to
tory power relative to the reference model (and use                   1 [11]. The comparison is summarized by the q-value7 :
a slightly more elaborate search). Alternatively, one
could monitor some other statistic (e.g., predictive per-                                    1X
                                                                                                 J
formance) along the search path to locate good sub-                            q(Ma , Mb ) =       I( MLPD(j)         0),
                                                                                             J j=1
models. Computing the full forward selection path
may not be necessary, if a suitable stopping criterion
is used in the step 2 above.                                          where I(·) = 1 if the given condition holds and 0 oth-
                                                                      erwise, and which is interpreted as the Bayesian pos-
3.2     PREDICTIVE PERFORMANCE                                        terior probability (under the Dirichlet model) of Ma
        EVALUATION                                                    performing better than Mb [11].

Given a model M with posterior predictive distribu-
tion p(T⇤ |x⇤ , D), where D is the observed data, we
                                                                      4     RESULTS
evaluate its predictive performance using the loga-
rithm of the predictive density (LPD) at an actual ob-                Missing values in the covariate data were multiply im-
servation (t⇤ , v⇤ , x⇤ ). This scoring rule is proper and            puted using chained linear regressions with in-house
measures the calibration and sharpness of the predic-                 scripts based on ref. [27]. The candidate biomarkers
tive distribution simultaneously [24]. As the predictive              were log-transformed and scaled to have zero mean and
densities are not available analytically for the models               unit variance. The No-U-Turn variant of the Hamil-
considered here, we estimate the LPD score from the                   tonian Monte Carlo algorithm [28], as implemented
Markov chain Monte Carlo samples of the posterior                     in Stan software [29], was used to sample from the
distribution:                                                         posterior distributions of the full models. The sam-
                                                                      pling was done independently for 5 imputed datasets
                            J
                         1X                    (j)
                                                                      (4 chains of 1000 samples after burn-in for each). The
       LPD⇤ (M ) ⇡ log       p(t⇤ |x⇤ , v⇤ ,         , ↵(j) ),        samples were then concatenated. The sampling pro-
                         J j
                                                                      cess was further performed independently for each of
                                                                      the 10 cross-validation training sets. All shown esti-
where ( (j) , ↵(j) ) are J posterior samples of the model
                                                                      mates of predictive performance were computed using
given the data D.
                                                                      cross-validation (Section 3.2).
   5
    The number of subsets for mbm covariates is 2mbm .
   6                                                                     7
    Given the stochastic samples from the posterior distri-                We use q instead of p to avoid confusion with the fre-
bution of the reference model.                                        quentist p-value.

                                                                 84
Table 1: Model comparisons on cross-validation predictions. MLPDs and q-values (Section 3.2) are shown for
predictions only on diabetic women, only on diabetic men or both. q-values are calculated against the joint
horseshoe model; color scale 0.0 ⌅⌅⌅⌅⌅⌅⌅⌅⌅⌅ 1.0.

                                                  women                     men         women & men
          model                              MLPD q-value            MLPD q-value      MLPD q-value
          joint horseshoe                    -0.581  NA              -0.716   NA       -0.652 NA
          joint Laplace                      -0.582  0.27 ⌅          -0.720   0.10 ⌅   -0.656 0.08 ⌅
          joint Gaussian                     -0.585  0.22 ⌅          -0.727   0.05 ⌅   -0.660 0.04 ⌅
          joint no-biomarkers                -0.594  0.18 ⌅          -0.758   0.03 ⌅   -0.681 0.01 ⌅
          diab women&men horseshoe           -0.606  0.03 ⌅          -0.719   0.44 ⌅   -0.666 0.13 ⌅
          diab women/men horseshoe           -0.610  0.03 ⌅          -0.721   0.45 ⌅   -0.669 0.15 ⌅
          diab women/men no-biomarkers       -0.613  0.05 ⌅          -0.765   0.04 ⌅   -0.694 0.01 ⌅


                                                                                                  horseshoe
         0.4
                                                                                                  Laplace
                                                                                                  Gaussian
         0.2


           0


         0.2
               1            10               20                 30               40               50          55
                                                         biomarker


Figure 3: Biomarker regression coefficients for the submodel of diabetic men in the joint models with the
horseshoe, Laplace and Gaussian priors (full dataset). Dot is the mean and vertical line shows the 95% credible
interval. Biomarkers are ordered according to the mean coefficients of the horseshoe model.


4.1   MODEL COMPARISONS                                    ing only the data of diabetic individuals and seems to
                                                           be greater in men. This indicates that the candidate
Table 1 presents results on comparing the mean log         biomarkers contain relevant information for predicting
predictive densities (MLPD) of the following com-          cardiovascular event risk.
binations of models: joint for the joint model of          Including the data of the non-diabetic individuals in
non-diabetic and diabetic individuals (Section 2.3),       the model seems to increase the predictive perfor-
diab women&men for a joint model of diabetic men           mance for the diabetic subpopulation, especially for
and women (two-group version of Section 2.3), diab         women. The covariate e↵ects in the joint models are
women/men for separate models of diabetic men and          very similar across the diabetic and non-diabetic sub-
women (without the extension of Section 2.3), and          models: posterior mean of s0 is 0.96 for the horseshoe
using the horseshoe, Laplace or Gaussian priors on         model. This implies that the risk factors behave sim-
the biomarker e↵ects, or using only the established        ilarly in both groups, but it is also possible that the
risk factors (no-biomarkers). The MLPDs and q-             dataset has limited information to distinguish between
values were computed separately for the predictions        them and that larger datasets could uncover more dif-
for women and men, and for pooled predictions, and,        ferences.
importantly, in each case only for the predictions on
the diabetic subpopulation.                                Finally, it seems that the horseshoe prior performs bet-
                                                           ter than the Laplace, and that the Gaussian is the
The results show that there is an increase in the pre-     worst of the three for this data. Figure 3 shows a
dictive performance when supplanting the established       comparison of the biomarker regression coefficients un-
risk factors with the candidate biomarkers. The in-        der these priors. The Laplace and the Gaussian priors
crease holds both when using the joint models or us-

                                                    85
shrink the largest coefficient more than the horseshoe
as would be expected in a sparse setting [5, 16]. Fur-
thermore, the horseshoe seems to shrink coefficients
near zero more strongly than the Laplace making the                                        1


                                                             relative explanatory power
credible intervals around zero narrower.

4.2   BIOMARKER SELECTION AND                                                             0.8
      SUBMODEL PREDICTIVE
      PERFORMANCE
                                                                                          0.6                              cross-validation sets
We applied the projection predictive covariate selec-                                                                      full data
tion (Section 3.1) with the joint horseshoe model as                                            1         10     20       30     40          50    55
the reference. The forward selection was run using                                                             number of biomarkers
only the part of the model concerning diabetic individ-
uals. We run the forward selection jointly for women
and men to get an overall biomarker ranking for the di-     Figure 4: Relative explanatory powers along the for-
abetic subpopulation. The forward selection was run         ward selection path.
also for each cross-validation training set separately
(using the reference model fitted on the corresponding
training data).
Figure 4 shows the relative explanatory power curves
along the forward selection path. In the full dataset,
                                                                                                ·10   2
the best candidate biomarker attains 61% explanatory
power relative to the reference model, five best reach                                     2
over 80% and ten biomarkers are needed to reach over
90%. The growth in the explanatory power slows with
                                                               MLPD


more biomarkers, indicating diminishing gains from
                                                                                           0
adding more candidate biomarkers (22 are needed to
reach 95% and the remaining 33 account for the last
5%).
                                                                                           2
However, choosing an acceptable loss in the explana-
tory power to select an appropriate minimal subset of                                           1         10     20       30     40          50    55
the biomarkers for use in prediction tasks seems diffi-                                                        number of biomarkers
                                                                                                ·10   2
cult. In Figure 5, we show MLPDs (normalized to the
reference model) obtained using the projection pre-                                        2
dictive covariate selection approach within the cross-
validation. Top panel shows the MLPD along the
                                                                MLPD


forward selection path and the bottom panel by the
                                                                                           0
obtained relative explanatory power (e.g., at 0.6, the
predictions in each cross-validation fold was made with
the smallest submodel reaching 60% power in that
fold). These show a mode at 2 biomarkers and at                                            2
around 0.65 relative explanatory power (which corre-                                        0.5           0.6        0.7      0.8      0.9         1
sponds to choosing two, three or four biomarkers de-                                                         relative explanatory power
pending on the fold). A second peak can be seen at
10 biomarkers or correspondingly at 0.91 power (10–16
biomarkers).                                                Figure 5:     MLPD values (in reference to the full
Unfortunately, the variance in the cross-validation es-     model) by number of biomarkers (top) or by explana-
timates is quite large for making a definite choice based   tory power threshold (bottom). Top: vertical lines
on them. Figure 6 shows the full set of pairwise com-       are 95% Bayesian bootstrap credible intervals for the
parisons between the submodels along the forward se-          MLPD. Bottom: dashed curves show the (pointwise)
lection path (by number of biomarkers; same as in Fig-      credible interval.
ure 5 top panel). This indicates that two biomarkers
is overall the best choice, but the di↵erence to the 10

                                                      86
                                                   1
                                                              not warrant any formal causal inferences. Moreover,
      0
                                                              the inclusion of the data of non-diabetic individuals
                                                   0.9        may bias the inferences on the diabetic subpopulation
     10                                            0.8        towards the general population, when the dataset has
                                                   0.7
                                                              limited information to distinguish them. Nevertheless,
     20                                                       the presented predictive comparisons, being indepen-
                                                   0.6        dent of the model assumptions, justify studying the
Mi


                                                   0.5        joint model.
     30
                                                   0.4        The submodels in projection predictive covariate se-
                                                   0.3        lection depend on the observed data only through the
     40
                                                              reference model. Thus, finding the submodel param-
                                                   0.2
                                                              eters and the covariate selection itself do not cause
     50                                            0.1        further fitting to the data, but rely on the information
                                                   0          provided by the reference model [11]. The projected
          0   10   20     30     40     50
                                                              submodels may also be able to retain some predictive
                        Mj
                                                              features of the reference model that would not be avail-
                                                              able, if the submodels were independently fitted to the
Figure 6: q-value matrix for Mi is better than Mj             data [11]: importantly, from Bayesian point of view,
with regard to MLPD (Section 3.2; M· s refer to the           the submodel may be able to account for uncertainty
submodels along the forward selection path).                  due to the omission of some covariates.
                                                              However, selecting a single submodel for future pre-
biomarker selection is not large (q-value = 0.52). How-       diction tasks may be difficult. We examined using the
ever, on comparing these to the full model or generally       projection approach within cross-validation to obtain
models with 11 or more biomarkers, the 10 biomarker           estimates of the submodel predictive performances. A
selection is more confidently better (q-values mostly         disadvantage of this procedure is that the performance
> 0.9) than the 2 biomarker selection (q-values mostly        estimates are for the selection process and not for some
within 0.7–0.8).                                              particular combination of selected biomarkers. Fur-
                                                              thermore, if selection is based on these estimates, the
Nevertheless, the analysis seems to support two clearly       performance estimate for the chosen submodel will not
predictively relevant biomarkers for the cardiovascular       anymore be unbiased for out-of-sample prediction un-
risk prediction, with further 8 possibly interesting can-     less nested cross-validation is used [11].
didate biomarkers, but with some uncertainty about
their relevance. Figure 3 also supports this conclusion       Acknowledgements
with two of the biomarkers having clearly non-zero ef-
fects.                                                        We acknowledge the computational resources provided
                                                              by Aalto Science-IT project.
5     DISCUSSION
                                                              References
This paper presented a Bayesian analysis of                    [1] Erkki Vartiainen, Tiina Laatikainen, Markku Pel-
cardiovascular-event-free survival in diabetic individ-            tonen, Anne Juolevi, Satu Männistö, Jouko Sund-
uals, with the aim of identifying biomarkers with pre-             vall, Pekka Jousilahti, Veikko Salomaa, Liisa Val-
dictive value. We presented a comparison of the horse-             sta, and Pekka Puska. Thirty-five-year trends
shoe, Laplace and Gaussian priors on the candidate                 in cardiovascular risk factors in Finland. Inter-
biomarker e↵ects and demonstrated empirically an ex-               national Journal of Epidemiology, 39(2):504–518,
pected [5, 16] di↵erence in their behaviour. We further            2010.
extended the model hierarchically to include data of
non-diabetic individuals and examined the use of pro-          [2] T. J. Mitchell and J. J. Beauchamp. Bayesian
jection predictive covariate selection to find biomarker           variable selection in linear regression. Journal
subsets with good predictive performance.                          of the American Statistical Association, 83(404):
                                                                   1023–1032, 1988.
We could also hope that the predictive biomarkers cap-
ture some part of the state of the underlying disease          [3] Trevor Park and George Casella. The Bayesian
process and as such could be used to speculate about               lasso. Journal of the American Statistical Associ-
causal disease pathways and to prioritize biomarkers               ation, 103(482):681–686, 2008.
for further study. However, the analysis approach does         [4] Jim E. Griffin and Philip J. Brown. Inference

                                                         87
    with normal-gamma prior distributions in regres-           Bayesian Statistics 9, pages 501–538. Oxford Uni-
    sion problems. Bayesian Analysis, 5(1):171–188,            versity Press, 2011.
    2010.                                                  [17] Nicholas G. Polson and James G. Scott. On the
 [5] Carlos M. Carvalho, Nicholas G. Polson, and                half-Cauchy prior for a global scale parameter.
     James G. Scott. The horseshoe estimator for                Bayesian Analysis, 7(4):887–902, 2012.
     sparse signals. Biometrika, 97(2):465–480, 2010.      [18] Fei Liu, Sounak Chakraborty, Fan Li, Yan Liu,
 [6] Zhihua Zhang, Shusen Wang, Dehua Liu, and                  and Aurelie C. Lozano. Bayesian regularization
     Michael I. Jordan. EP-GIG priors and applica-              via graph Laplacian. Bayesian Analysis, 9(2):
     tions in Bayesian sparse learning. The Journal             449–474, 2014.
     of Machine Learning Research, 13(1):2031–2061,        [19] Theodoros Evgeniou, Charles A. Micchelli, and
     2012.                                                      Massimiliano Pontil. Learning multiple tasks with
 [7] Robert Tibshirani. Regression shrinkage and se-            kernel methods. Journal of Machine Learning Re-
     lection via the lasso. Journal of the Royal Sta-           search, 6:615–637, 2005.
     tistical Society. Series B (Methodological), 58(1):   [20] Daniel Sheldon. Graphical multi-task learning. In
     267–288, 1996.                                             NIPS 2008 Workshop: “Structured Input – Struc-
 [8] S. L. van der Pas, B. J. K. Kleijn, and A. W.              tured Output”, 2008.
     van der Vaart. The horseshoe estimator: Pos-          [21] Cédric Archambeau, Shengbo Guo, and Onno
     terior concentration around nearly black vectors.          Zoeter. Sparse Bayesian multi-task learning. In
     arXiv preprint arXiv:1404.0202, 2014.                      Advances in Neural Information Processing Sys-
 [9] Emerging Risk Factors Collaboration. Diabetes              tems 24, pages 1755–1763, 2011.
     mellitus, fasting blood glucose concentration, and    [22] Constantinos Goutis and Christian P. Robert.
     risk of vascular disease: a collaborative meta-            Model choice in generalised linear models: A
     analysis of 102 prospective studies. The Lancet,           Bayesian approach via Kullback–Leibler projec-
     375(9733):2215–2222, 2010.                                 tions. Biometrika, 85(1):29–37, 1998.
[10] Sinno Jialin Pan and Qiang Yang. A survey on          [23] David J. Nott and Chenlei Leng. Bayesian pro-
     transfer learning. IEEE Transactions on Knowl-             jection approaches to variable selection in gener-
     edge and Data Engineering, 22(10):1345–1359,               alized linear models. Computational Statistics &
     2010.                                                      Data Analysis, 54(12):3227–3241, 2010.
[11] Aki Vehtari and Janne Ojanen. A survey of             [24] Tilmann Gneiting, Fadoua Balabdaoui, and
     Bayesian predictive methods for model assess-              Adrian E. Raftery. Probabilistic forecasts, cali-
     ment, selection and comparison. Statistics Sur-            bration and sharpness. Journal of the Royal Sta-
     veys, 6:142–228, 2012.                                     tistical Society: Series B (Statistical Methodol-
                                                                ogy), 69(2):243–268, 2007.
[12] Jérome A. Dupuis and Christian P. Robert. Vari-
     able selection in qualitative models via an en-       [25] Ron Kohavi. A study of cross-validation and
     tropic explanatory power. Journal of Statistical           bootstrap for accuracy estimation and model se-
     Planning and Inference, 111(1):77–94, 2003.                lection. In Proceedings of the 14th International
                                                                Joint Conference on Artificial Intelligence (IJ-
[13] Andrew Gelman, John B. Carlin, Hal S. Stern,               CAI), pages 1137–1145, 1995.
     David B. Dunson, Aki Vehtari, and Donald B.
     Rubin. Bayesian Data Analysis. CRC press, third       [26] Donald B. Rubin. The Bayesian bootstrap. The
     edition, 2014.                                             Annals of Statistics, 9(1):130–134, 1981.

[14] Joseph G. Ibrahim, Ming-Hui Chen, and Debajy-         [27] Stef van Buuren and Karin Groothuis-Oudshoorn.
     oti Sinha. Bayesian Survival Analysis. Springer,           MICE: Multivariate imputation by chained equa-
     2001.                                                      tions in R. Journal of Statistical Software, 45(3),
                                                                2011.
[15] Andrew Gelman. Prior distributions for vari-
                                                           [28] Matthew D. Ho↵man and Andrew Gelman. The
     ance parameters in hierarchical models. Bayesian
                                                                No-U-Turn Sampler: Adaptively setting path
     Analysis, 1(3):515–533, 2006.
                                                                lengths in Hamiltonian Monte Carlo. Journal of
[16] Nicholas G. Polson and James G. Scott. Shrink              Machine Learning Research, 15:1593–1623, 2014.
     globally, act locally: Sparse Bayesian regulariza-
                                                           [29] Stan Development Team. Stan: A C++ library
     tion and prediction. In J. M. Bernardo, M. J.
                                                                for probability and sampling, version 2.2, 2014.
     Bayarri, J. O. Berger, A. P. Dawid, D. Heck-
                                                                URL http://mc-stan.org/.
     erman, A. F. M. Smith, and M. West, editors,

                                                     88