<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hierarchical Bayesian Survival Analysis and Pro jective Covariate Selection in Cardiovascular Event Risk Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomi Peltola</string-name>
          <email>tomi.peltola@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aki S. Havulinna</string-name>
          <email>aki.havulinna@thl.fi</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Veikko Salomaa</string-name>
          <email>veikko.salomaa@thl.fi</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aki Vehtari</string-name>
          <email>aki.vehtari@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalto University</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute for</institution>
          ,
          <addr-line>Health and Welfare</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <fpage>79</fpage>
      <lpage>88</lpage>
      <abstract>
        <p>Identifying biomarkers with predictive value for disease risk stratification is an important task in epidemiology. This paper describes an application of Bayesian linear survival regression to model cardiovascular event risk in diabetic individuals with measurements available on 55 candidate biomarkers. We extend the survival model to include data from a larger set of non-diabetic individuals in an e↵ ort to increase the predictive performance for the diabetic subpopulation. We compare the Gaussian, Laplace and horseshoe shrinkage priors, and find that the last has the best predictive performance and shrinks strong predictors less than the others. We implement the projection predictive covariate selection approach of Dupuis and Robert (2003) to further search for small sets of predictive biomarkers that could provide coste cient prediction without significant loss in performance. In passing, we present a derivation of the projective covariate selection in Bayesian decision theoretic framework.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>INTRODUCTION
Improving disease risk prediction is a major task in
epidemiological research. Non-communicable diseases,
many of which develop and progress slowly, are a
major cause of morbidity worldwide. Accurate risk
prediction could be used to screen individuals for targeted
intervention. Advances in measurement technologies
allow researchers cost-e cient quantification of large
numbers of potentially relevant biomarkers, for
example, in blood samples. However, often only a few of
such candidate biomarkers could be expected to give
practically relevant gain in risk stratification or could
be realistically used in routine health care setting. The
statistical challenge is then to identify an informative
subset of the biomarkers and estimate its predictive
performance.</p>
      <p>
        Here, we describe an application of linear,
hierarchical Bayesian survival regression to model
cardiovascular event risk in diabetic individuals. The available
data consists of 7932 Finnish individuals in the
FINRISK 1997 cohort [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], of whom 401 had diabetes at
the beginning of the study. The covariates consist of
a set of 55 candidate biomarkers measured from blood
samples and 12 established risk factors (e.g.,
baseline age, sex, body-mass index, lipoprotein cholesterol
measures, blood pressure and smoking). The length
of the follow-up period was 15 years. We focus on
three key elements in the model construction: 1) using
shrinkage priors to model the assumption of possibly
limited relevance of many biomarkers, 2) utilizing the
large set of non-diabetic individuals in the modelling,
and 3) the selection of a subset of the biomarkers with
predictive value. While the statistical approach is not
limited to this particular application, we use the
setting to make the description of the methods concrete.
Shrinkage or sparsity-promoting priors for regression
coe cients are used to shrink the e↵ ects of
(apparently) irrelevant covariates to zero, while retaining the
e↵ ects of relevant covariates. Their use has increased
with the availability of datasets with large numbers of
features, for example, from high-throughput
measurement technologies, which often capture a snapshot of
a whole system (e.g., metabolome, genome) instead
of targeted features. The interest has spawned
considerable research e↵ ort into such priors and multiple
alternatives have been proposed (see, e.g., refs [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6">2–6</xref>
        ]).
In this work, we chose to compare three priors: the
Laplace [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the horseshoe [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and, as a baseline, a
Gaussian prior. The Laplace prior corresponds to the
popular lasso penalty [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] in non-Bayesian regularized
regression. The horseshoe prior has been shown to
have desirable features in Bayesian analysis [
        <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
        ]. We
briefly review these priors in Section 2.2.
      </p>
      <p>
        Of the 401 diabetic individuals in the study, 155
experienced a cardiovascular event within the follow-up
period. This leaves a limited set of informative
samples to perform the model fitting, covariate selection
and predictive performance evaluation with. Although
the risk of cardiovascular events is larger in diabetic
individuals than the general population [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we would
expect that the risk factors are shared at least to some
extent. Based on this assumption, we incorporate the
non-diabetic individuals (n = 7531, 1031 events) into
the analysis by constructing a hierarchical joint model,
where the submodels for diabetic and non-diabetic
individuals can be correlated (akin to transfer or
multitask learning [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). The joint model does not place
hard constraints on the similarity of the submodels,
but allows the models to di↵ er between non-diabetic
and diabetic individuals and also between men and
women. Details are given in Section 2.3.
      </p>
      <p>
        While lasso regression in the non-Bayesian context can
perform hard covariate selection by estimating exact
zeroes for regression coe cients, the Bayesian
shrinkage priors do not lead to sparse posterior distributions
as there will remain uncertainty after observing a
finite dataset. However, we are interested in finding
a minimal subset of predictively relevant biomarkers
as discussed above. To this end, we examine the use
of projection predictive covariate selection1, where the
full model, encompassing all the candidate biomarkers
and the uncertainties related to their e↵ ects, is taken
as a yardstick for the smaller models. Specifically, the
models with subsets of covariates are found by
maximizing the similarity of their predictions to this
reference as proposed by Dupuis and Robert [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Notably,
this approach does not require specifying priors for the
submodels and one can instead focus on building a
good reference model. Dupuis and Robert [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] suggest
choosing the size of the covariate subset based on an
acceptable loss of explanatory power compared to the
reference model. We examine using cross-validation
based estimates of predictive performance as an
alternative.
      </p>
      <p>The structure of this article is as follows. In Section
2, we describe the survival model, shrinkage priors,
and the hierarchical extension to include data of
nondiabetic individuals. The projection predictive
covariate selection is described in Section 3. The results
from the application of the methods for
cardiovascularevent-free survival modelling in diabetic individuals
are presented in Section 4. Finally, Section 5 discusses
the modelling approach.</p>
      <p>
        1A comprehensive review of predictive Bayesian model
selection approaches is given by Vehtari and Ojanen [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
Our terminology follows theirs.
      </p>
      <p>MODEL
We first consider modelling the
cardiovascular-eventfree survival in the subset of diabetic individuals only.
The model is then extended to include the data of
non-diabetic individuals, while allowing the covariate
e↵ ects and the baseline hazard to di↵ er in these groups
and between men and women.
2.1</p>
      <p>OBSERVATION MODEL
Let the observation ti be the event time Ti or
the censoring time Ci since the beginning of the
study for ith individual and vi be the corresponding
event/censoring indicator (1 for observed events, 0 for
censored). All censored cases are right censored (i.e.,
Ti &gt; Ci where only Ci is observed; censoring occurs
in the data mostly because of event-free survival to
the end of the follow-up). Further, let xi be a column
vector of the observed covariate values for the ith
subject. We assume a parametric survival model, where
the observations follow the Weibull model2
p(ti|xi, vi, , ↵ ) = ↵ vi tvi(↵ 1) exp(vi
i</p>
      <p>
        Txi t↵i exp( Txi))
with the shape ↵ and the scale defined through the
linear combination Txi of the covariates [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The
Weibull model is a proportional hazard model with
the hazard function h(Ti) = ↵ T i↵ 1 exp( Txi).
We include a constant term 1 in the covariates xi
and denote the corresponding regression coe cient 0.
The intercept and the shape are given the di↵ use
priors:
log ↵
0 ⇠
⇠
      </p>
      <p>N(0, 102),
N(0, 102).</p>
      <p>
        The covariates are divided into a set of established
risk (or protective) factors and a set of new candidate
biomarkers, which are of more uncertain relevance.
The coe cients of the established predictors, j for
j = 1, . . . , mbg, are given the prior [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]:
j ⇠
2
j ⇠
s ⇠
      </p>
      <p>N(0, s2 j2), for j = 1, . . . , mbg,
Inv– 2(1), for j = 1, . . . , mbg,</p>
      <p>Half–N(0, 102).</p>
      <p>Priors for the coe cients of the candidate biomarkers
are considered below.</p>
      <p>
        2The notation for probability distributions follows the
parametrizations given in ref. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], except for the Weibull
model, which is explicitly written out. Half -distributions
refer to the restriction to the real positive axis.
Based on our prior assumption that only some of the
biomarkers are expected to be practically relevant for
prediction, we consider the use of shrinkage priors for
the biomarker coe cients. As discussed in the
introduction, there has been a lot of recent research into
these type of priors and there are multiple proposals.
We restrict our consideration to three alternatives: the
horseshoe prior [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the Laplace prior [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and, as a
baseline approach, a Gaussian prior. Each of these
can be expressed as normal scale mixtures
j ⇠
      </p>
      <p>
        N(0, ⌧ s2⌧ j2), for j = mbg + 1, . . . , mbg + mbm,
where ⌧ s is a global scale parameter (shared across j)
and ⌧ j are local parameters. Ideally, the prior shrinks
the coe cients of irrelevant biomarkers to zero, but
allows large coe cients for relevant biomarkers. In a
sparse situation, with many irrelevant biomarkers and
few relevant, this could be e↵ ected by making ⌧ s small,
but allowing some ⌧ j to take on large values to escape
the shrinkage [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>The priors for ⌧ j s, for j = mbg + 1, . . . , mbg + mbm, for
the three alternatives are
⌧ j ⇠
⌧ j2 ⇠
⌧ j = 1</p>
    </sec>
    <sec id="sec-2">
      <title>Half–Cauchy(0, 1)</title>
    </sec>
    <sec id="sec-3">
      <title>Exponential(0.7)</title>
      <p>for horseshoe,
for Laplace,
for the Gaussian.</p>
      <p>
        A comparison of the Laplace and horseshoe prior is
given in ref. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: it is noted that the Laplace prior
may overshrink large coe cients in a sparse situation,
while the horseshoe prior is more robust (see also ref.
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]). Furthermore, van der Pas et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] derive
theoretical results indicating that the posterior
distribution under the horseshoe prior may be more
informative3 than under the Laplace prior in a sparse normal
means problem. The Gaussian prior does not try to ⇤ =66
separate between relevant and irrelevant covariates as 4
it depends only on the shared scale parameter ⌧ s.
The same prior is given for the global scale parameter
in each case:
⌧ s ⇠
      </p>
      <p>
        Half–Cauchy(0, 1),
which has its (bounded) mode at zero, but is only
weakly informative as it also places a substantial
amount of prior mass far from zero (see refs [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15–17</xref>
        ] for
discussion on priors for global variance parameters).
      </p>
      <p>3That is, the posterior mean estimator attains a
minimax risk, possibly up to a multiplicative constant, in a
sparse setting and the posterior contracts at a similar rate
(with conditions on ⌧ s).
2.3</p>
      <p>HIERARCHICAL EXTENSION
Next, we consider extending the approach to jointly
model the event-free survival of non-diabetic men
(NM), non-diabetic women (NW), diabetic men (DM),
and diabetic women (DW). Our aim is to increase the
predictive performance of the model specifically in the
subset of diabetic individuals, but gain power by
including the larger set of observations for non-diabetic
individuals in the model. To this end, we tie together
the submodels of the four groups using the following
assumptions:
1. The relevance of a biomarker will be similar for
all the submodels.
2. The e↵ ect size of a biomarker (or other
covariate) and its direction are similar between men and
women, and between diabetic and non-diabetic
individuals.
3. The baseline hazard functions have similar shapes
for men and women, and diabetic and
nondiabetic individuals.</p>
      <p>Let j = [ j,NM j,NW j,DM j,DW ]T be the
coefficients for the jth biomarker in the four submodels.
We set
j ⇠</p>
      <p>N(0, rj2 ⇤
1),
where rj2 ⇤ 1 is the prior covariance matrix. Here,
rj = ⌧ j ⌧ s and follows one of the prior specifications
given in the previous section. This encodes the first
assumption above: a single rj parameter defines the
relevance of the jth biomarker in all the four
submodels.</p>
      <p>
        To encode the second assumption, we specify the
structure of the prior precision matrix as
The corresponding graphical structure is illustrated in
Figure 1. As will be made more explicit below, the cN
and cD control the similarity of the submodels of
nondiabetic men and women, and between the submodels
of diabetic men and women, respectively. sM and sW
control the similarity between the submodels of
nondiabetic and diabetic men, and non-diabetic and
diabetic women, respectively. We further simplify the
model by taking cN = cD = c and sM = sW = s and
constrain c &gt; 0 and s &gt; 0. The precision matrix has
similarity to the one used by Liu et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] to learn
dependencies between covariates, but here ⇤ is restricted
to encode a specific prior structure.
sM
j,NM
j,DM
cN
cD
j,NW
      </p>
      <p>sW
j,DW</p>
      <sec id="sec-3-1">
        <title>We choose</title>
        <p>diagonal elements of ⇤ 1 equal to 1, that is, ⇤
becomes a correlation matrix. The relevance of the jth
biomarker is then solely dependent on rj .</p>
        <p>= ((21c++21c)+(22ss++12)c(s2)c(+c+2ss++11)) as this makes the
1
For more insight, the prior for j may be written out
as proportional to
exp</p>
        <p>
          1 !
2rj2 (S2 + cSc + sSs) ,
where S2 = j2,NM + j2,NW + j2,DM + j2,DW , Sc =
( j,NM j,NW )2 + ( j,DM j,DW )2 and Ss =
( j,NM j,DM )2 + ( j,NW j,DW )2. c controls
the penalization in the di↵ erence between men and
women, and s controls the penalization in the di↵
erence between non-diabetic and diabetic subjects.
Taking negative logarithm of the prior shows that it
corresponds to a specific Bayesian version of the multi-task
graph regularization penalty proposed by Evgeniou
et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and further studied by Sheldon [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. The
prior can also be represented in the sparse Bayesian
multi-task learning framework of Archambeau et al.
[
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], where a zero-mean matrix-variate Gaussian
density is placed on B = [ 1, . . . , m] with row covariance
⌦ (over the m covariates) and column covariance ⌃
(over the tasks). Here, ⌦ is a diagonal matrix with
elements rj2 and ⌃ = ⇤ 1.
        </p>
        <p>We use the following transformations of c and s: c =
(1 c0) 1 1 and s = (1 s0) 1 1, where c0 2 [0, 1)
and s0 2 [0, 1). At c0 = 0, c = 0 and the corresponding
submodels are independent. As c0 ! 1, c ! 1 and the
corresponding submodels are constrained to identical.
s0 behaves similarly.</p>
      </sec>
      <sec id="sec-3-2">
        <title>We can also examine the implied prior distribution of</title>
        <p>the di↵ erence between two X,j coe cients as a
function of c0 and s0. First, note that the distribution of</p>
        <p>X,j Y,j is N(0, 2rj2(1 ⇢ )), where ⇢ is the
correlation coe cient. Specifically, the variance of the
distribution is linearly dependent on ⇢ and, for ⇢ 0,
has the maximum value of 2rj2 when ⇢ = 0 and the
0.1
0.5
minimum value of 0 when ⇢ = 1. In Figure 2, the
implied prior correlation coe cients of some interesting
pairs of X,j s are shown as functions of c0 and s0: s0
controls almost linearly the correlation between j,NM
and j,DM , whereas the correlation between j,NM
and j,DW is close to bilinear in c0 and s0.</p>
        <p>To complete the prior specification c0 and s0 are given
prior distributions. We use di↵ erent parameters for
biomarkers (c0 and s0), other covariates (c0bg and s0bg)
and the log-scale Weibull shape parameter log ↵ (c↵0
and s↵0 ; this encodes the third assumption):
c0 ⇠
s0 ⇠
c0bg ⇠
s0bg ⇠
c↵0 ⇠
s↵0 ⇠</p>
      </sec>
      <sec id="sec-3-3">
        <title>Beta(ac, bc),</title>
      </sec>
      <sec id="sec-3-4">
        <title>Beta(as, bs),</title>
      </sec>
      <sec id="sec-3-5">
        <title>Beta(ac, bc),</title>
      </sec>
      <sec id="sec-3-6">
        <title>Beta(as, bs),</title>
      </sec>
      <sec id="sec-3-7">
        <title>Beta(ac, bc),</title>
      </sec>
      <sec id="sec-3-8">
        <title>Beta(as, bs).</title>
        <p>Finally, ac, bc, as and bs are given Gamma( 12 , 14 ) priors.
We note that the eigendecomposition of ⇤ = V DV T
is of simple form, with D being a diagonal matrix with
elements 1, 1 + 2c, 1 + 2s, 1 + 2c + 2s and</p>
        <p>2 1
V = 21 646 11
1</p>
        <p>METHODS FOR BIOMARKER
SELECTION AND PREDICTIVE
PERFORMANCE EVALUATION
The approaches used for biomarker selection and
evaluation of predictive performance are described below.
The model constructed in previous section is used as
the reference model in the biomarker selection.
3.1</p>
        <sec id="sec-3-8-1">
          <title>PROJECTION PREDICTIVE COVARIATE SELECTION</title>
          <p>Assuming the availability of a reference model, which
is a good representation of the predictive power of the
candidate biomarkers and the related uncertainty, we
seek a subset of the biomarkers, which can be used
for prediction without a large loss in performance
relative to the reference model. Our prior assumption
of sparsity in the biomarker e↵ ects implies that this
goal could be achievable. We describe the approach in
two steps: 1) defining a submodel for making
predictions with a specific subset of the candidate
biomarkers, and 2) finding submodels with good predictive
performance.
3.1.1</p>
        </sec>
        <sec id="sec-3-8-2">
          <title>Projective Submodels</title>
          <p>
            We use the projective approach of Dupuis and Robert
[
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], Goutis and Robert [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ] to find the parameters of
the submodel, but present an alternative derivation in
the Bayesian decision theoretic framework reviewed in
ref. [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. The projection is posed as a solution to an
optimization problem with regard to a restriction of
the reference model. Let the covariates x be divided
into two parts x = [x? , x&gt;] and define a submodel
M? to be restricted to using the covariates in x? 4 with
parameters ✓ ? = ( ? , ↵ ? ) in the Weibull model. We
find the submodel by maximizing the Gibbs reference
utility
          </p>
          <p>Z Z
u¯(M? ) =</p>
          <p>u(M? , x? , ✓ , T )p(T |✓ , x)dT
p(✓ |D)p(x)d(✓ , x)
with respect to the unknown probability
densities f (✓ ? |✓ ) appearing in the u(M? , x? , ✓ , T ) =
R f (✓ ? |✓ ) log p(T |✓ ? , x? )d✓ ? . Here, p(✓ |D) is the
posterior distribution of the reference model given the
observed data D and p(x) is the distribution of the
covariates. Writing out u and changing the integration
4We assume that the established risk factors are always
included in this set.
order,</p>
          <p>Z Z
u¯(M? ) =</p>
          <p>p(T |✓ , x) log p(T |✓ ? , x? )dT
⇥ f (✓ ? |✓ )p(✓ |D)p(x)d(✓ ? , ✓ , x).</p>
          <p>
            Finally, to arrive at the same solution with Dupuis and
Robert [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], f (✓ ? |✓ ) can be restricted to the Dirac
delta function (✓ ? ✓ ˆ? ) with an o↵ set ✓ ˆ? that
depends on ✓ . That is, the solution to the maximization
of u¯ is defined pointwise for each ✓ as the
corresponding optimal value of ✓ ˆ? . The pointwise solution arises
from the dependence of f on ✓ .
          </p>
          <p>
            As p(✓ |D) is not available analytically and p(x) at all,
the former is approximated with Markov chain Monte
Carlo methods and the latter by using xi samples
available in the data D [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. The obtained estimate
is
u¯(M? ) ⇡
1
nJ
          </p>
          <p>X Z
i,j</p>
          <p>p(T |✓ (j), xi) log p(T |✓ ˆ?(j), xi,? )dT ,
where the double sum runs over the n data points and
the J posterior samples. The optimization problems
to find the optimal ✓ ˆ(j)s are independent over j. We
?
solve them using the Newton’s method.</p>
          <p>We define the projection predictive distribution for the
submodel M? as
p(T |x? , Mref ) =</p>
          <p>p(T |x? , ✓ ? )f (✓ ? |Mref )d✓ ? ,</p>
          <p>
            Z
where we explicitly emphasize the dependence on the
reference model Mref and which is approximated using
the projected samples ✓ ˆ(j)s. This kind of projected
?
predictive distribution was also considered by Nott and
Leng [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ].
          </p>
          <p>
            Note that scaling the estimated u¯ as d(M? ) =
u¯(Mref ) u¯(M? ) (and minimizing instead of
maximizing) does not change the optimal solution and gives
otherwise the same formula as u¯, except the term in
square brackets is replaced with the Kullback–Leibler
divergence between p(T |x, ✓ ) and p(T |x? , ✓ ? ). This
gives the approach further information theoretic
justification and is the basis of the formulation in Dupuis
and Robert [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. They also suggest defining the
relative explanatory power of the submodel as
relative explanatory power(M? ) = 1
d(M? ) ,
d(M0)
Mref ).
where M0 refers to the model without any of the
candidate biomarkers and which transforms the d(M? )
values to between 0 (for M? = M0) and 1 (for M? =
u¯ (or equivalently d) is used to compare the submodels
in the search for good subsets of biomarkers. However,
exhaustive search of the model space5 is not feasible,
unless the number of candidate biomarkers is small.
We choose to use the suboptimal forward selection
strategy for its simplicity and its scalability to large
covariate sets:
1. Begin with the submodel M0 (no biomarkers) and
set j to 0.
2. Repeat until all biomarkers have been added:
(a) Find the projections for all submodels that
are obtainable by adding one new biomarker
to Mj . Select the one with largest u¯ and set
it as Mj+1. Set j to j + 1.
          </p>
          <p>
            This defines a deterministic6 path of models from M0
to Mmbm and gives a ranking of the biomarkers
according to their projection predictive value. Dupuis
and Robert [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] suggest finally choosing the
smallest submodel with an acceptable loss in the
explanatory power relative to the reference model (and use
a slightly more elaborate search). Alternatively, one
could monitor some other statistic (e.g., predictive
performance) along the search path to locate good
submodels. Computing the full forward selection path
may not be necessary, if a suitable stopping criterion
is used in the step 2 above.
3.2
          </p>
          <p>PREDICTIVE PERFORMANCE</p>
          <p>
            EVALUATION
Given a model M with posterior predictive
distribution p(T⇤ |x⇤ , D), where D is the observed data, we
evaluate its predictive performance using the
logarithm of the predictive density (LPD) at an actual
observation (t⇤ , v⇤ , x⇤ ). This scoring rule is proper and
measures the calibration and sharpness of the
predictive distribution simultaneously [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ]. As the predictive
densities are not available analytically for the models
considered here, we estimate the LPD score from the
Markov chain Monte Carlo samples of the posterior
distribution:
          </p>
          <p>J
LPD⇤ (M ) ⇡ log 1 X p(t⇤ |x⇤ , v⇤ , (j), ↵ (j)),</p>
          <p>J j
where ( (j), ↵ (j)) are J posterior samples of the model
given the data D.</p>
          <p>5The number of subsets for mbm covariates is 2mbm .
6Given the stochastic samples from the posterior
distribution of the reference model.</p>
          <p>
            Stratified ten-fold cross-validation [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ] is used to
obtain estimates of the generalization performance: The
full dataset is divided randomly into ten disjoint
subsets (folds), while balancing the sets to have
approximately similar age distributions and proportions of
diabetic and non-diabetic individuals, men and women,
and cases of cardiovascular events. Predictions for
each fold are obtained using a posterior distribution
based on training data, where the particular fold has
been left out. Given predictions obtained this way,
the predictive performance is summarized by the mean
LPD over the full set of n data points (MLPD).
To reduce variance and gauge uncertainty in model
comparisons, we compute Bayesian bootstrap [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ]
samples of the MLDP di↵ erence ( MLPD) between
model Ma and model Mb by
          </p>
          <p>
            MLPD(j)(Ma, Mb) =
n
X wi(j)[LPDi(Ma) LPDi(Mb)],
i=1
where wi(j), i = 1, . . . , n, are the bootstrap weights
(Pi wi(j) = 1) for the jth bootstrap sample generated
using the Dirichlet distribution with parameters set to
1 [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. The comparison is summarized by the q-value7:
          </p>
          <p>J
q(Ma, Mb) = 1 X I( MLPD(j)</p>
          <p>
            J
j=1
0),
where I(·) = 1 if the given condition holds and 0
otherwise, and which is interpreted as the Bayesian
posterior probability (under the Dirichlet model) of Ma
performing better than Mb [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
4
          </p>
          <p>
            RESULTS
Missing values in the covariate data were multiply
imputed using chained linear regressions with in-house
scripts based on ref. [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ]. The candidate biomarkers
were log-transformed and scaled to have zero mean and
unit variance. The No-U-Turn variant of the
Hamiltonian Monte Carlo algorithm [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ], as implemented
in Stan software [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ], was used to sample from the
posterior distributions of the full models. The
sampling was done independently for 5 imputed datasets
(4 chains of 1000 samples after burn-in for each). The
samples were then concatenated. The sampling
process was further performed independently for each of
the 10 cross-validation training sets. All shown
estimates of predictive performance were computed using
cross-validation (Section 3.2).
          </p>
          <p>7We use q instead of p to avoid confusion with the
frequentist p-value.
0.4
0.2</p>
          <p>0
0.2
Table 1 presents results on comparing the mean log
predictive densities (MLPD) of the following
combinations of models: joint for the joint model of
non-diabetic and diabetic individuals (Section 2.3),
diab women&amp;men for a joint model of diabetic men
and women (two-group version of Section 2.3), diab
women/men for separate models of diabetic men and
women (without the extension of Section 2.3), and
using the horseshoe, Laplace or Gaussian priors on
the biomarker e↵ ects, or using only the established
risk factors (no-biomarkers). The MLPDs and
qvalues were computed separately for the predictions
for women and men, and for pooled predictions, and,
importantly, in each case only for the predictions on
the diabetic subpopulation.</p>
          <p>The results show that there is an increase in the
predictive performance when supplanting the established
risk factors with the candidate biomarkers. The
increase holds both when using the joint models or
using only the data of diabetic individuals and seems to
be greater in men. This indicates that the candidate
biomarkers contain relevant information for predicting
cardiovascular event risk.</p>
          <p>Including the data of the non-diabetic individuals in
the model seems to increase the predictive
performance for the diabetic subpopulation, especially for
women. The covariate e↵ ects in the joint models are
very similar across the diabetic and non-diabetic
submodels: posterior mean of s0 is 0.96 for the horseshoe
model. This implies that the risk factors behave
similarly in both groups, but it is also possible that the
dataset has limited information to distinguish between
them and that larger datasets could uncover more
differences.</p>
          <p>
            Finally, it seems that the horseshoe prior performs
better than the Laplace, and that the Gaussian is the
worst of the three for this data. Figure 3 shows a
comparison of the biomarker regression coe cients
under these priors. The Laplace and the Gaussian priors
shrink the largest coe cient more than the horseshoe
as would be expected in a sparse setting [
            <xref ref-type="bibr" rid="ref16 ref5">5, 16</xref>
            ].
Furthermore, the horseshoe seems to shrink coe cients
near zero more strongly than the Laplace making the
credible intervals around zero narrower.
We applied the projection predictive covariate
selection (Section 3.1) with the joint horseshoe model as
the reference. The forward selection was run using
only the part of the model concerning diabetic
individuals. We run the forward selection jointly for women
and men to get an overall biomarker ranking for the
diabetic subpopulation. The forward selection was run
also for each cross-validation training set separately
(using the reference model fitted on the corresponding
training data).
          </p>
          <p>Figure 4 shows the relative explanatory power curves
along the forward selection path. In the full dataset,
the best candidate biomarker attains 61% explanatory
power relative to the reference model, five best reach
over 80% and ten biomarkers are needed to reach over
90%. The growth in the explanatory power slows with
more biomarkers, indicating diminishing gains from
adding more candidate biomarkers (22 are needed to
reach 95% and the remaining 33 account for the last
5%).</p>
          <p>However, choosing an acceptable loss in the
explanatory power to select an appropriate minimal subset of
the biomarkers for use in prediction tasks seems di
cult. In Figure 5, we show MLPDs (normalized to the
reference model) obtained using the projection
predictive covariate selection approach within the
crossvalidation. Top panel shows the MLPD along the
forward selection path and the bottom panel by the
obtained relative explanatory power (e.g., at 0.6, the
predictions in each cross-validation fold was made with
the smallest submodel reaching 60% power in that
fold). These show a mode at 2 biomarkers and at
around 0.65 relative explanatory power (which
corresponds to choosing two, three or four biomarkers
depending on the fold). A second peak can be seen at
10 biomarkers or correspondingly at 0.91 power (10–16
biomarkers).</p>
          <p>Unfortunately, the variance in the cross-validation
estimates is quite large for making a definite choice based
on them. Figure 6 shows the full set of pairwise
comparisons between the submodels along the forward
selection path (by number of biomarkers; same as in
Figure 5 top panel). This indicates that two biomarkers
is overall the best choice, but the di↵ erence to the 10
0.5
·10 2
1</p>
          <p>10
·10 2
cross-validation sets
full data
1
10</p>
          <p>20 30 40
number of biomarkers
0.9
0.8
biomarker selection is not large (q-value = 0.52).
However, on comparing these to the full model or generally
models with 11 or more biomarkers, the 10 biomarker
selection is more confidently better (q-values mostly
&gt; 0.9) than the 2 biomarker selection (q-values mostly
within 0.7–0.8).</p>
          <p>Nevertheless, the analysis seems to support two clearly
predictively relevant biomarkers for the cardiovascular
risk prediction, with further 8 possibly interesting
candidate biomarkers, but with some uncertainty about
their relevance. Figure 3 also supports this conclusion
with two of the biomarkers having clearly non-zero
effects.
5</p>
          <p>
            DISCUSSION
This paper presented a Bayesian analysis of
cardiovascular-event-free survival in diabetic
individuals, with the aim of identifying biomarkers with
predictive value. We presented a comparison of the
horseshoe, Laplace and Gaussian priors on the candidate
biomarker e↵ ects and demonstrated empirically an
expected [
            <xref ref-type="bibr" rid="ref16 ref5">5, 16</xref>
            ] di↵ erence in their behaviour. We further
extended the model hierarchically to include data of
non-diabetic individuals and examined the use of
projection predictive covariate selection to find biomarker
subsets with good predictive performance.
          </p>
          <p>We could also hope that the predictive biomarkers
capture some part of the state of the underlying disease
process and as such could be used to speculate about
causal disease pathways and to prioritize biomarkers
for further study. However, the analysis approach does
not warrant any formal causal inferences. Moreover,
the inclusion of the data of non-diabetic individuals
may bias the inferences on the diabetic subpopulation
towards the general population, when the dataset has
limited information to distinguish them. Nevertheless,
the presented predictive comparisons, being
independent of the model assumptions, justify studying the
joint model.</p>
          <p>
            The submodels in projection predictive covariate
selection depend on the observed data only through the
reference model. Thus, finding the submodel
parameters and the covariate selection itself do not cause
further fitting to the data, but rely on the information
provided by the reference model [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. The projected
submodels may also be able to retain some predictive
features of the reference model that would not be
available, if the submodels were independently fitted to the
data [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]: importantly, from Bayesian point of view,
the submodel may be able to account for uncertainty
due to the omission of some covariates.
          </p>
          <p>
            However, selecting a single submodel for future
prediction tasks may be di cult. We examined using the
projection approach within cross-validation to obtain
estimates of the submodel predictive performances. A
disadvantage of this procedure is that the performance
estimates are for the selection process and not for some
particular combination of selected biomarkers.
Furthermore, if selection is based on these estimates, the
performance estimate for the chosen submodel will not
anymore be unbiased for out-of-sample prediction
unless nested cross-validation is used [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
          </p>
          <p>Acknowledgements
We acknowledge the computational resources provided
by Aalto Science-IT project.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Erkki</given-names>
            <surname>Vartiainen</surname>
          </string-name>
          , Tiina Laatikainen, Markku Peltonen, Anne Juolevi, Satu M¨
          <article-title>annist¨o, Jouko Sundvall</article-title>
          , Pekka Jousilahti, Veikko Salomaa, Liisa Valsta, and
          <string-name>
            <given-names>Pekka</given-names>
            <surname>Puska</surname>
          </string-name>
          .
          <article-title>Thirty-five-year trends in cardiovascular risk factors in Finland</article-title>
          .
          <source>International Journal of Epidemiology</source>
          ,
          <volume>39</volume>
          (
          <issue>2</issue>
          ):
          <fpage>504</fpage>
          -
          <lpage>518</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Beauchamp</surname>
          </string-name>
          .
          <article-title>Bayesian variable selection in linear regression</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>83</volume>
          (
          <issue>404</issue>
          ):
          <fpage>1023</fpage>
          -
          <lpage>1032</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Park</surname>
          </string-name>
          and
          <string-name>
            <given-names>George</given-names>
            <surname>Casella</surname>
          </string-name>
          .
          <article-title>The Bayesian lasso</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>103</volume>
          (
          <issue>482</issue>
          ):
          <fpage>681</fpage>
          -
          <lpage>686</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Jim</surname>
            <given-names>E. Gri n</given-names>
          </string-name>
          and
          <string-name>
            <surname>Philip J. Brown.</surname>
          </string-name>
          <article-title>Inference with normal-gamma prior distributions in regression problems</article-title>
          .
          <source>Bayesian Analysis</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ):
          <fpage>171</fpage>
          -
          <lpage>188</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Carlos</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Carvalho</surname>
          </string-name>
          , Nicholas G. Polson, and
          <string-name>
            <surname>James</surname>
            <given-names>G. Scott.</given-names>
          </string-name>
          <article-title>The horseshoe estimator for sparse signals</article-title>
          .
          <source>Biometrika</source>
          ,
          <volume>97</volume>
          (
          <issue>2</issue>
          ):
          <fpage>465</fpage>
          -
          <lpage>480</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Zhihua</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Shusen Wang, Dehua Liu, and
          <string-name>
            <surname>Michael I. Jordan.</surname>
          </string-name>
          <article-title>EP-GIG priors and applications in Bayesian sparse learning</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>13</volume>
          (
          <issue>1</issue>
          ):
          <fpage>2031</fpage>
          -
          <lpage>2061</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          .
          <article-title>Regression shrinkage and selection via the lasso</article-title>
          .
          <source>Journal of the Royal Statistical Society. Series B (Methodological)</source>
          ,
          <volume>58</volume>
          (
          <issue>1</issue>
          ):
          <fpage>267</fpage>
          -
          <lpage>288</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. L. van der</given-names>
            <surname>Pas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. J. K.</given-names>
            <surname>Kleijn</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <surname>A. W. van der Vaart.</surname>
          </string-name>
          <article-title>The horseshoe estimator: Posterior concentration around nearly black vectors</article-title>
          .
          <source>arXiv preprint arXiv:1404.0202</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Emerging</given-names>
            <surname>Risk Factors Collaboration</surname>
          </string-name>
          .
          <article-title>Diabetes mellitus, fasting blood glucose concentration, and risk of vascular disease: a collaborative metaanalysis of 102 prospective studies</article-title>
          .
          <source>The Lancet</source>
          ,
          <volume>375</volume>
          (
          <issue>9733</issue>
          ):
          <fpage>2215</fpage>
          -
          <lpage>2222</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Sinno</given-names>
            <surname>Jialin</surname>
          </string-name>
          Pan and
          <string-name>
            <given-names>Qiang</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>A survey on transfer learning</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>22</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1345</fpage>
          -
          <lpage>1359</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Aki</given-names>
            <surname>Vehtari</surname>
          </string-name>
          and
          <string-name>
            <given-names>Janne</given-names>
            <surname>Ojanen</surname>
          </string-name>
          .
          <article-title>A survey of Bayesian predictive methods for model assessment, selection and comparison</article-title>
          .
          <source>Statistics Surveys</source>
          ,
          <volume>6</volume>
          :
          <fpage>142</fpage>
          -
          <lpage>228</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J</given-names>
            <surname>´erome</surname>
          </string-name>
          <string-name>
            <given-names>A.</given-names>
            <surname>Dupuis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian P.</given-names>
            <surname>Robert</surname>
          </string-name>
          .
          <article-title>Variable selection in qualitative models via an entropic explanatory power</article-title>
          .
          <source>Journal of Statistical Planning and Inference</source>
          ,
          <volume>111</volume>
          (
          <issue>1</issue>
          ):
          <fpage>77</fpage>
          -
          <lpage>94</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Gelman</surname>
          </string-name>
          , John B. Carlin, Hal S. Stern,
          <string-name>
            <given-names>David B.</given-names>
            <surname>Dunson</surname>
          </string-name>
          , Aki Vehtari, and
          <string-name>
            <surname>Donald</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Rubin</surname>
          </string-name>
          .
          <article-title>Bayesian Data Analysis</article-title>
          .
          <source>CRC press, third edition</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Joseph</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Ibrahim</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Hui Chen</surname>
            , and
            <given-names>Debajyoti</given-names>
          </string-name>
          <string-name>
            <surname>Sinha</surname>
          </string-name>
          .
          <source>Bayesian Survival Analysis</source>
          . Springer,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Gelman</surname>
          </string-name>
          .
          <article-title>Prior distributions for variance parameters in hierarchical models</article-title>
          .
          <source>Bayesian Analysis</source>
          ,
          <volume>1</volume>
          (
          <issue>3</issue>
          ):
          <fpage>515</fpage>
          -
          <lpage>533</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Nicholas</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Polson and James</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Scott</surname>
          </string-name>
          .
          <article-title>Shrink globally, act locally: Sparse Bayesian regularization and prediction</article-title>
          . In
          <string-name>
            <surname>J. M. Bernardo</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          <string-name>
            <surname>Bayarri</surname>
            ,
            <given-names>J. O.</given-names>
          </string-name>
          <string-name>
            <surname>Berger</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          <string-name>
            <surname>Dawid</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Heckerman</surname>
            ,
            <given-names>A. F. M.</given-names>
          </string-name>
          <string-name>
            <surname>Smith</surname>
          </string-name>
          , and M. West, editors,
          <source>Bayesian Statistics 9</source>
          , pages
          <fpage>501</fpage>
          -
          <lpage>538</lpage>
          . Oxford University Press,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Nicholas</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Polson and James</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Scott</surname>
          </string-name>
          .
          <article-title>On the half-Cauchy prior for a global scale parameter</article-title>
          .
          <source>Bayesian Analysis</source>
          ,
          <volume>7</volume>
          (
          <issue>4</issue>
          ):
          <fpage>887</fpage>
          -
          <lpage>902</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Fei</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Sounak Chakraborty,
          <string-name>
            <given-names>Fan</given-names>
            <surname>Li</surname>
          </string-name>
          , Yan Liu, and
          <string-name>
            <surname>Aurelie</surname>
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Lozano</surname>
          </string-name>
          .
          <article-title>Bayesian regularization via graph Laplacian</article-title>
          .
          <source>Bayesian Analysis</source>
          ,
          <volume>9</volume>
          (
          <issue>2</issue>
          ):
          <fpage>449</fpage>
          -
          <lpage>474</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Theodoros</surname>
            <given-names>Evgeniou</given-names>
          </string-name>
          , Charles A.
          <string-name>
            <surname>Micchelli</surname>
            , and
            <given-names>Massimiliano</given-names>
          </string-name>
          <string-name>
            <surname>Pontil</surname>
          </string-name>
          .
          <article-title>Learning multiple tasks with kernel methods</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>6</volume>
          :
          <fpage>615</fpage>
          -
          <lpage>637</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Sheldon</surname>
          </string-name>
          .
          <article-title>Graphical multi-task learning</article-title>
          .
          <source>In NIPS 2008 Workshop: “Structured Input - Structured Output”</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C</given-names>
            <surname>´edric Archambeau</surname>
          </string-name>
          , Shengbo Guo, and
          <string-name>
            <given-names>Onno</given-names>
            <surname>Zoeter</surname>
          </string-name>
          .
          <article-title>Sparse Bayesian multi-task learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>24</volume>
          , pages
          <fpage>1755</fpage>
          -
          <lpage>1763</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Constantinos</given-names>
            <surname>Goutis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian P.</given-names>
            <surname>Robert</surname>
          </string-name>
          .
          <article-title>Model choice in generalised linear models: A Bayesian approach via Kullback-Leibler projections</article-title>
          .
          <source>Biometrika</source>
          ,
          <volume>85</volume>
          (
          <issue>1</issue>
          ):
          <fpage>29</fpage>
          -
          <lpage>37</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>David</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Nott</surname>
            and
            <given-names>Chenlei</given-names>
          </string-name>
          <string-name>
            <surname>Leng</surname>
          </string-name>
          .
          <article-title>Bayesian projection approaches to variable selection in generalized linear models</article-title>
          .
          <source>Computational Statistics &amp; Data Analysis</source>
          ,
          <volume>54</volume>
          (
          <issue>12</issue>
          ):
          <fpage>3227</fpage>
          -
          <lpage>3241</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Tilmann</surname>
            <given-names>Gneiting</given-names>
          </string-name>
          , Fadoua Balabdaoui, and
          <string-name>
            <given-names>Adrian E.</given-names>
            <surname>Raftery</surname>
          </string-name>
          .
          <article-title>Probabilistic forecasts, calibration and sharpness</article-title>
          .
          <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source>
          ,
          <volume>69</volume>
          (
          <issue>2</issue>
          ):
          <fpage>243</fpage>
          -
          <lpage>268</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Ron</given-names>
            <surname>Kohavi</surname>
          </string-name>
          .
          <article-title>A study of cross-validation and bootstrap for accuracy estimation and model selection</article-title>
          .
          <source>In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI)</source>
          , pages
          <fpage>1137</fpage>
          -
          <lpage>1145</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Donald</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Rubin</surname>
          </string-name>
          .
          <article-title>The Bayesian bootstrap</article-title>
          .
          <source>The Annals of Statistics</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>130</fpage>
          -
          <lpage>134</lpage>
          ,
          <year>1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Stef</surname>
            <given-names>van Buuren</given-names>
          </string-name>
          and
          <article-title>Karin Groothuis-Oudshoorn. MICE: Multivariate imputation by chained equations in R</article-title>
          .
          <source>Journal of Statistical Software</source>
          ,
          <volume>45</volume>
          (
          <issue>3</issue>
          ),
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Matthew</surname>
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Ho</surname>
          </string-name>
          <article-title>↵ man and Andrew Gelman. The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>15</volume>
          :
          <fpage>1593</fpage>
          -
          <lpage>1623</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Stan</surname>
            <given-names>Development</given-names>
          </string-name>
          <string-name>
            <surname>Team. Stan: A C+</surname>
          </string-name>
          <article-title>+ library for probability and sampling</article-title>
          ,
          <source>version 2.2</source>
          ,
          <year>2014</year>
          . URL http://mc-stan.org/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>