Hierarchical Bayesian Survival Analysis and Projective Covariate Selection in Cardiovascular Event Risk Prediction Tomi Peltola Aki S. Havulinna Veikko Salomaa Aki Vehtari tomi.peltola@aalto.fi aki.havulinna@thl.fi veikko.salomaa@thl.fi aki.vehtari@aalto.fi Aalto University, National Institute for National Institute for Aalto University, Finland Health and Welfare, Health and Welfare, Finland Finland Finland Abstract statistical challenge is then to identify an informative subset of the biomarkers and estimate its predictive Identifying biomarkers with predictive value performance. for disease risk stratification is an important Here, we describe an application of linear, hierarchi- task in epidemiology. This paper describes an cal Bayesian survival regression to model cardiovascu- application of Bayesian linear survival regres- lar event risk in diabetic individuals. The available sion to model cardiovascular event risk in di- data consists of 7932 Finnish individuals in the FIN- abetic individuals with measurements avail- RISK 1997 cohort [1], of whom 401 had diabetes at able on 55 candidate biomarkers. We extend the beginning of the study. The covariates consist of the survival model to include data from a a set of 55 candidate biomarkers measured from blood larger set of non-diabetic individuals in an samples and 12 established risk factors (e.g., base- e↵ort to increase the predictive performance line age, sex, body-mass index, lipoprotein cholesterol for the diabetic subpopulation. We com- measures, blood pressure and smoking). The length pare the Gaussian, Laplace and horseshoe of the follow-up period was 15 years. We focus on shrinkage priors, and find that the last has three key elements in the model construction: 1) using the best predictive performance and shrinks shrinkage priors to model the assumption of possibly strong predictors less than the others. We limited relevance of many biomarkers, 2) utilizing the implement the projection predictive covari- large set of non-diabetic individuals in the modelling, ate selection approach of Dupuis and Robert and 3) the selection of a subset of the biomarkers with (2003) to further search for small sets of pre- predictive value. While the statistical approach is not dictive biomarkers that could provide cost- limited to this particular application, we use the set- efficient prediction without significant loss in ting to make the description of the methods concrete. performance. In passing, we present a deriva- tion of the projective covariate selection in Shrinkage or sparsity-promoting priors for regression Bayesian decision theoretic framework. coefficients are used to shrink the e↵ects of (appar- ently) irrelevant covariates to zero, while retaining the e↵ects of relevant covariates. Their use has increased with the availability of datasets with large numbers of 1 INTRODUCTION features, for example, from high-throughput measure- ment technologies, which often capture a snapshot of Improving disease risk prediction is a major task in a whole system (e.g., metabolome, genome) instead epidemiological research. Non-communicable diseases, of targeted features. The interest has spawned con- many of which develop and progress slowly, are a ma- siderable research e↵ort into such priors and multiple jor cause of morbidity worldwide. Accurate risk pre- alternatives have been proposed (see, e.g., refs [2–6]). diction could be used to screen individuals for targeted In this work, we chose to compare three priors: the intervention. Advances in measurement technologies Laplace [3], the horseshoe [5] and, as a baseline, a allow researchers cost-efficient quantification of large Gaussian prior. The Laplace prior corresponds to the numbers of potentially relevant biomarkers, for exam- popular lasso penalty [7] in non-Bayesian regularized ple, in blood samples. However, often only a few of regression. The horseshoe prior has been shown to such candidate biomarkers could be expected to give have desirable features in Bayesian analysis [5, 8]. We practically relevant gain in risk stratification or could briefly review these priors in Section 2.2. be realistically used in routine health care setting. The 79 Of the 401 diabetic individuals in the study, 155 ex- 2 MODEL perienced a cardiovascular event within the follow-up period. This leaves a limited set of informative sam- We first consider modelling the cardiovascular-event- ples to perform the model fitting, covariate selection free survival in the subset of diabetic individuals only. and predictive performance evaluation with. Although The model is then extended to include the data of the risk of cardiovascular events is larger in diabetic non-diabetic individuals, while allowing the covariate individuals than the general population [9], we would e↵ects and the baseline hazard to di↵er in these groups expect that the risk factors are shared at least to some and between men and women. extent. Based on this assumption, we incorporate the non-diabetic individuals (n = 7531, 1031 events) into 2.1 OBSERVATION MODEL the analysis by constructing a hierarchical joint model, where the submodels for diabetic and non-diabetic in- Let the observation ti be the event time Ti or dividuals can be correlated (akin to transfer or multi- the censoring time Ci since the beginning of the task learning [10]). The joint model does not place study for ith individual and vi be the corresponding hard constraints on the similarity of the submodels, event/censoring indicator (1 for observed events, 0 for but allows the models to di↵er between non-diabetic censored). All censored cases are right censored (i.e., and diabetic individuals and also between men and Ti > Ci where only Ci is observed; censoring occurs women. Details are given in Section 2.3. in the data mostly because of event-free survival to While lasso regression in the non-Bayesian context can the end of the follow-up). Further, let xi be a column perform hard covariate selection by estimating exact vector of the observed covariate values for the ith sub- zeroes for regression coefficients, the Bayesian shrink- ject. We assume a parametric survival model, where age priors do not lead to sparse posterior distributions the observations follow the Weibull model2 as there will remain uncertainty after observing a fi- v (↵ 1) nite dataset. However, we are interested in finding p(ti |xi , vi , , ↵) = ↵vi ti i exp(vi T xi t ↵ i exp( T xi )) a minimal subset of predictively relevant biomarkers as discussed above. To this end, we examine the use with the shape ↵ and the scale defined through the of projection predictive covariate selection1 , where the linear combination T xi of the covariates [14]. The full model, encompassing all the candidate biomarkers Weibull model is a proportional hazard model with and the uncertainties related to their e↵ects, is taken the hazard function h(Ti ) = ↵Ti↵ 1 exp( T xi ). as a yardstick for the smaller models. Specifically, the We include a constant term 1 in the covariates xi models with subsets of covariates are found by maxi- and denote the corresponding regression coefficient 0 . mizing the similarity of their predictions to this refer- The intercept and the shape are given the di↵use pri- ence as proposed by Dupuis and Robert [12]. Notably, ors: this approach does not require specifying priors for the submodels and one can instead focus on building a 0 ⇠ N(0, 102 ), good reference model. Dupuis and Robert [12] suggest log ↵ ⇠ N(0, 102 ). choosing the size of the covariate subset based on an acceptable loss of explanatory power compared to the reference model. We examine using cross-validation The covariates are divided into a set of established based estimates of predictive performance as an alter- risk (or protective) factors and a set of new candidate native. biomarkers, which are of more uncertain relevance. The coefficients of the established predictors, j for The structure of this article is as follows. In Section j = 1, . . . , mbg , are given the prior [15]: 2, we describe the survival model, shrinkage priors, and the hierarchical extension to include data of non- 2 2 j ⇠ N(0, s j ), for j = 1, . . . , mbg , diabetic individuals. The projection predictive covari- 2 2 ate selection is described in Section 3. The results j ⇠ Inv– (1), for j = 1, . . . , mbg , from the application of the methods for cardiovascular- s ⇠ Half–N(0, 102 ). event-free survival modelling in diabetic individuals are presented in Section 4. Finally, Section 5 discusses the modelling approach. Priors for the coefficients of the candidate biomarkers are considered below. 2 The notation for probability distributions follows the 1 A comprehensive review of predictive Bayesian model parametrizations given in ref. [13], except for the Weibull selection approaches is given by Vehtari and Ojanen [11]. model, which is explicitly written out. Half -distributions Our terminology follows theirs. refer to the restriction to the real positive axis. 80 2.2 PRIORS FOR BIOMARKER 2.3 HIERARCHICAL EXTENSION COEFFICIENTS Next, we consider extending the approach to jointly Based on our prior assumption that only some of the model the event-free survival of non-diabetic men biomarkers are expected to be practically relevant for (NM), non-diabetic women (NW), diabetic men (DM), prediction, we consider the use of shrinkage priors for and diabetic women (DW). Our aim is to increase the the biomarker coefficients. As discussed in the intro- predictive performance of the model specifically in the duction, there has been a lot of recent research into subset of diabetic individuals, but gain power by in- these type of priors and there are multiple proposals. cluding the larger set of observations for non-diabetic We restrict our consideration to three alternatives: the individuals in the model. To this end, we tie together horseshoe prior [5], the Laplace prior [3], and, as a the submodels of the four groups using the following baseline approach, a Gaussian prior. Each of these assumptions: can be expressed as normal scale mixtures 2 2 1. The relevance of a biomarker will be similar for j ⇠ N(0, ⌧s ⌧j ), for j = mbg + 1, . . . , mbg + mbm , all the submodels. where ⌧s is a global scale parameter (shared across j) 2. The e↵ect size of a biomarker (or other covari- and ⌧j are local parameters. Ideally, the prior shrinks ate) and its direction are similar between men and the coefficients of irrelevant biomarkers to zero, but women, and between diabetic and non-diabetic in- allows large coefficients for relevant biomarkers. In a dividuals. sparse situation, with many irrelevant biomarkers and few relevant, this could be e↵ected by making ⌧s small, 3. The baseline hazard functions have similar shapes but allowing some ⌧j to take on large values to escape for men and women, and diabetic and non- the shrinkage [16]. diabetic individuals. The priors for ⌧j s, for j = mbg + 1, . . . , mbg + mbm , for Let j = [ j,N M j,N W j,DM j,DW ]T be the coef- the three alternatives are ficients for the jth biomarker in the four submodels. We set ⌧j ⇠ Half–Cauchy(0, 1) for horseshoe, 2 1 j ⇠ N(0, rj ⇤ ), ⌧j2 ⇠ Exponential(0.7) for Laplace, where rj2 ⇤ 1 is the prior covariance matrix. Here, ⌧j = 1 for the Gaussian. rj = ⌧j ⌧s and follows one of the prior specifications given in the previous section. This encodes the first A comparison of the Laplace and horseshoe prior is assumption above: a single rj parameter defines the given in ref. [5]: it is noted that the Laplace prior relevance of the jth biomarker in all the four submod- may overshrink large coefficients in a sparse situation, els. while the horseshoe prior is more robust (see also ref. [16]). Furthermore, van der Pas et al. [8] derive the- To encode the second assumption, we specify the struc- oretical results indicating that the posterior distribu- ture of the prior precision matrix as tion under the horseshoe prior may be more informa- 2 3 tive3 than under the Laplace prior in a sparse normal 1 + cN + sM cN sM 0 6 cN 1 + cN + sW 0 sW 7 means problem. The Gaussian prior does not try to ⇤=6 7. 4 sM 0 1 + cD + sM cD 5 separate between relevant and irrelevant covariates as it depends only on the shared scale parameter ⌧s . 0 sW cD 1 + cD + sW The same prior is given for the global scale parameter The corresponding graphical structure is illustrated in in each case: Figure 1. As will be made more explicit below, the cN and cD control the similarity of the submodels of non- ⌧s ⇠ Half–Cauchy(0, 1), diabetic men and women, and between the submodels of diabetic men and women, respectively. sM and sW which has its (bounded) mode at zero, but is only control the similarity between the submodels of non- weakly informative as it also places a substantial diabetic and diabetic men, and non-diabetic and di- amount of prior mass far from zero (see refs [15–17] for abetic women, respectively. We further simplify the discussion on priors for global variance parameters). model by taking cN = cD = c and sM = sW = s and 3 constrain c > 0 and s > 0. The precision matrix has That is, the posterior mean estimator attains a min- imax risk, possibly up to a multiplicative constant, in a similarity to the one used by Liu et al. [18] to learn de- sparse setting and the posterior contracts at a similar rate pendencies between covariates, but here ⇤ is restricted (with conditions on ⌧s ). to encode a specific prior structure. 81 cN 1 0.9 0.9 j,N M j,N W 0.8 0.5 0.6 0.5 sM sW s0 0.4 0.1 0.2 j,DM j,DW 0.1 cD 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 c0 c0 Figure 1: Prior structure for the regression coefficients of jth biomarker in the joint model. Figure 2: Contour plots of the correlation coefficient between j,N M and j,DM (left) and j,N M and j,DW (right) as a function of c0 and s0 . We choose = (2c+1)(2s+1)(2c+2s+1) (1+2c+2s+2cs)(c+s+1) as this makes the diagonal elements of ⇤ 1 equal to 1, that is, ⇤ 1 becomes a correlation matrix. The relevance of the jth biomarker is then solely dependent on rj . minimum value of 0 when ⇢ = 1. In Figure 2, the im- plied prior correlation coefficients of some interesting For more insight, the prior for j may be written out pairs of X,j s are shown as functions of c0 and s0 : s0 as proportional to controls almost linearly the correlation between j,N M ! 1 and j,DM , whereas the correlation between j,N M exp (S2 + cSc + sSs ) , and j,DW is close to bilinear in c0 and s0 . 2rj2 To complete the prior specification c0 and s0 are given 2 2 2 2 where S2 = j,N M + j,N W + j,DM + j,DW , Sc = prior distributions. We use di↵erent parameters for ( j,N M 2 j,N W ) + ( j,DM j,DW ) 2 and Ss = biomarkers (c0 and s0 ), other covariates (c0bg and s0bg ) ( j,N M 2 j,DM ) + ( j,N W 2 j,DW ) . c controls and the log-scale Weibull shape parameter log ↵ (c0↵ the penalization in the di↵erence between men and and s0↵ ; this encodes the third assumption): women, and s controls the penalization in the di↵er- ence between non-diabetic and diabetic subjects. Tak- ing negative logarithm of the prior shows that it corre- c0 ⇠ Beta(ac , bc ), sponds to a specific Bayesian version of the multi-task s0 ⇠ Beta(as , bs ), graph regularization penalty proposed by Evgeniou c0bg ⇠ Beta(ac , bc ), et al. [19] and further studied by Sheldon [20]. The s0bg ⇠ Beta(as , bs ), prior can also be represented in the sparse Bayesian multi-task learning framework of Archambeau et al. c0↵ ⇠ Beta(ac , bc ), [21], where a zero-mean matrix-variate Gaussian den- s0↵ ⇠ Beta(as , bs ). sity is placed on B = [ 1 , . . . , m ] with row covariance ⌦ (over the m covariates) and column covariance ⌃ (over the tasks). Here, ⌦ is a diagonal matrix with Finally, ac , bc , as and bs are given Gamma( 12 , 14 ) priors. elements rj2 and ⌃ = ⇤ 1 . We note that the eigendecomposition of ⇤ = V DV T We use the following transformations of c and s: c = is of simple form, with D being a diagonal matrix with (1 c0 ) 1 1 and s = (1 s0 ) 1 1, where c0 2 [0, 1) elements 1, 1 + 2c, 1 + 2s, 1 + 2c + 2s and and s0 2 [0, 1). At c0 = 0, c = 0 and the corresponding submodels are independent. As c0 ! 1, c ! 1 and the 2 3 corresponding submodels are constrained to identical. 1 1 1 1 s0 behaves similarly. 16 1 1 1 17 V = 6 7. 2 41 1 1 15 We can also examine the implied prior distribution of 1 1 1 1 the di↵erence between two X,j coefficients as a func- tion of c0 and s0 . First, note that the distribution of 2 X,j Y,j is N(0, 2rj (1 ⇢)), where ⇢ is the corre- This can be useful in reparametrizing the model for lation coefficient. Specifically, the variance of the dis- Markov chain Monte Carlo sampling algorithms. It tribution is linearly dependent on ⇢ and, for ⇢ 0, also shows that the precision matrix is positive defi- has the maximum value of 2rj2 when ⇢ = 0 and the nite. 82 3 METHODS FOR BIOMARKER order, SELECTION AND PREDICTIVE Z Z PERFORMANCE EVALUATION ū(M? ) = p(T |✓, x) log p(T |✓? , x? )dT ⇥ f (✓? |✓)p(✓|D)p(x)d(✓? , ✓, x). The approaches used for biomarker selection and eval- uation of predictive performance are described below. Finally, to arrive at the same solution with Dupuis and The model constructed in previous section is used as Robert [12], f (✓? |✓) can be restricted to the Dirac the reference model in the biomarker selection. delta function (✓? ✓ˆ? ) with an o↵set ✓ˆ? that de- pends on ✓. That is, the solution to the maximization 3.1 PROJECTION PREDICTIVE of ū is defined pointwise for each ✓ as the correspond- COVARIATE SELECTION ing optimal value of ✓ˆ? . The pointwise solution arises from the dependence of f on ✓. Assuming the availability of a reference model, which As p(✓|D) is not available analytically and p(x) at all, is a good representation of the predictive power of the the former is approximated with Markov chain Monte candidate biomarkers and the related uncertainty, we Carlo methods and the latter by using xi samples seek a subset of the biomarkers, which can be used available in the data D [12]. The obtained estimate for prediction without a large loss in performance rel- is ative to the reference model. Our prior assumption Z of sparsity in the biomarker e↵ects implies that this 1 X (j) ū(M? ) ⇡ p(T |✓ (j) , xi ) log p(T |✓ˆ? , xi,? )dT , goal could be achievable. We describe the approach in nJ i,j two steps: 1) defining a submodel for making predic- tions with a specific subset of the candidate biomark- where the double sum runs over the n data points and ers, and 2) finding submodels with good predictive per- the J posterior samples. The optimization problems formance. (j) to find the optimal ✓ˆ? s are independent over j. We solve them using the Newton’s method. 3.1.1 Projective Submodels We define the projection predictive distribution for the submodel M? as We use the projective approach of Dupuis and Robert Z [12], Goutis and Robert [22] to find the parameters of the submodel, but present an alternative derivation in p(T |x? , Mref ) = p(T |x? , ✓? )f (✓? |Mref )d✓? , the Bayesian decision theoretic framework reviewed in ref. [11]. The projection is posed as a solution to an where we explicitly emphasize the dependence on the optimization problem with regard to a restriction of reference model Mref and which is approximated using (j) the reference model. Let the covariates x be divided the projected samples ✓ˆ? s. This kind of projected into two parts x = [x? , x> ] and define a submodel predictive distribution was also considered by Nott and M? to be restricted to using the covariates in x? 4 with Leng [23]. parameters ✓? = ( ? , ↵? ) in the Weibull model. We Note that scaling the estimated ū as d(M? ) = find the submodel by maximizing the Gibbs reference ū(Mref ) ū(M? ) (and minimizing instead of maxi- utility mizing) does not change the optimal solution and gives Z Z otherwise the same formula as ū, except the term in ū(M? ) = u(M? , x? , ✓, T )p(T |✓, x)dT square brackets is replaced with the Kullback–Leibler divergence between p(T |x, ✓) and p(T |x? , ✓? ). This p(✓|D)p(x)d(✓, x) gives the approach further information theoretic justi- fication and is the basis of the formulation in Dupuis with respect to the unknown probability densi- and Robert [12]. They also suggest defining the rela- ties f (✓? |✓) appearing in the u(M? , x? , ✓, T ) = tive explanatory power of the submodel as R f (✓? |✓) log p(T |✓? , x? )d✓? . Here, p(✓|D) is the d(M? ) posterior distribution of the reference model given the relative explanatory power(M? ) = 1 , d(M0 ) observed data D and p(x) is the distribution of the co- variates. Writing out u and changing the integration where M0 refers to the model without any of the can- didate biomarkers and which transforms the d(M? ) 4 We assume that the established risk factors are always values to between 0 (for M? = M0 ) and 1 (for M? = included in this set. Mref ). 83 3.1.2 Submodel Search Stratified ten-fold cross-validation [25] is used to ob- tain estimates of the generalization performance: The ū (or equivalently d) is used to compare the submodels full dataset is divided randomly into ten disjoint sub- in the search for good subsets of biomarkers. However, sets (folds), while balancing the sets to have approxi- exhaustive search of the model space5 is not feasible, mately similar age distributions and proportions of di- unless the number of candidate biomarkers is small. abetic and non-diabetic individuals, men and women, We choose to use the suboptimal forward selection and cases of cardiovascular events. Predictions for strategy for its simplicity and its scalability to large each fold are obtained using a posterior distribution covariate sets: based on training data, where the particular fold has been left out. Given predictions obtained this way, 1. Begin with the submodel M0 (no biomarkers) and the predictive performance is summarized by the mean set j to 0. LPD over the full set of n data points (MLPD). 2. Repeat until all biomarkers have been added: To reduce variance and gauge uncertainty in model comparisons, we compute Bayesian bootstrap [26] (a) Find the projections for all submodels that samples of the MLDP di↵erence ( MLPD) between are obtainable by adding one new biomarker model Ma and model Mb by to Mj . Select the one with largest ū and set it as Mj+1 . Set j to j + 1. n X (j) MLPD(j) (Ma , Mb ) = wi [LPDi (Ma ) LPDi (Mb )], This defines a deterministic6 path of models from M0 i=1 to Mmbm and gives a ranking of the biomarkers ac- (j) cording to their projection predictive value. Dupuis where wi , i = 1, . . . , n, are the bootstrap weights P (j) and Robert [12] suggest finally choosing the small- ( i wi = 1) for the jth bootstrap sample generated est submodel with an acceptable loss in the explana- using the Dirichlet distribution with parameters set to tory power relative to the reference model (and use 1 [11]. The comparison is summarized by the q-value7 : a slightly more elaborate search). Alternatively, one could monitor some other statistic (e.g., predictive per- 1X J formance) along the search path to locate good sub- q(Ma , Mb ) = I( MLPD(j) 0), J j=1 models. Computing the full forward selection path may not be necessary, if a suitable stopping criterion is used in the step 2 above. where I(·) = 1 if the given condition holds and 0 oth- erwise, and which is interpreted as the Bayesian pos- 3.2 PREDICTIVE PERFORMANCE terior probability (under the Dirichlet model) of Ma EVALUATION performing better than Mb [11]. Given a model M with posterior predictive distribu- tion p(T⇤ |x⇤ , D), where D is the observed data, we 4 RESULTS evaluate its predictive performance using the loga- rithm of the predictive density (LPD) at an actual ob- Missing values in the covariate data were multiply im- servation (t⇤ , v⇤ , x⇤ ). This scoring rule is proper and puted using chained linear regressions with in-house measures the calibration and sharpness of the predic- scripts based on ref. [27]. The candidate biomarkers tive distribution simultaneously [24]. As the predictive were log-transformed and scaled to have zero mean and densities are not available analytically for the models unit variance. The No-U-Turn variant of the Hamil- considered here, we estimate the LPD score from the tonian Monte Carlo algorithm [28], as implemented Markov chain Monte Carlo samples of the posterior in Stan software [29], was used to sample from the distribution: posterior distributions of the full models. The sam- pling was done independently for 5 imputed datasets J 1X (j) (4 chains of 1000 samples after burn-in for each). The LPD⇤ (M ) ⇡ log p(t⇤ |x⇤ , v⇤ , , ↵(j) ), samples were then concatenated. The sampling pro- J j cess was further performed independently for each of the 10 cross-validation training sets. All shown esti- where ( (j) , ↵(j) ) are J posterior samples of the model mates of predictive performance were computed using given the data D. cross-validation (Section 3.2). 5 The number of subsets for mbm covariates is 2mbm . 6 7 Given the stochastic samples from the posterior distri- We use q instead of p to avoid confusion with the fre- bution of the reference model. quentist p-value. 84 Table 1: Model comparisons on cross-validation predictions. MLPDs and q-values (Section 3.2) are shown for predictions only on diabetic women, only on diabetic men or both. q-values are calculated against the joint horseshoe model; color scale 0.0 ⌅⌅⌅⌅⌅⌅⌅⌅⌅⌅ 1.0. women men women & men model MLPD q-value MLPD q-value MLPD q-value joint horseshoe -0.581 NA -0.716 NA -0.652 NA joint Laplace -0.582 0.27 ⌅ -0.720 0.10 ⌅ -0.656 0.08 ⌅ joint Gaussian -0.585 0.22 ⌅ -0.727 0.05 ⌅ -0.660 0.04 ⌅ joint no-biomarkers -0.594 0.18 ⌅ -0.758 0.03 ⌅ -0.681 0.01 ⌅ diab women&men horseshoe -0.606 0.03 ⌅ -0.719 0.44 ⌅ -0.666 0.13 ⌅ diab women/men horseshoe -0.610 0.03 ⌅ -0.721 0.45 ⌅ -0.669 0.15 ⌅ diab women/men no-biomarkers -0.613 0.05 ⌅ -0.765 0.04 ⌅ -0.694 0.01 ⌅ horseshoe 0.4 Laplace Gaussian 0.2 0 0.2 1 10 20 30 40 50 55 biomarker Figure 3: Biomarker regression coefficients for the submodel of diabetic men in the joint models with the horseshoe, Laplace and Gaussian priors (full dataset). Dot is the mean and vertical line shows the 95% credible interval. Biomarkers are ordered according to the mean coefficients of the horseshoe model. 4.1 MODEL COMPARISONS ing only the data of diabetic individuals and seems to be greater in men. This indicates that the candidate Table 1 presents results on comparing the mean log biomarkers contain relevant information for predicting predictive densities (MLPD) of the following com- cardiovascular event risk. binations of models: joint for the joint model of Including the data of the non-diabetic individuals in non-diabetic and diabetic individuals (Section 2.3), the model seems to increase the predictive perfor- diab women&men for a joint model of diabetic men mance for the diabetic subpopulation, especially for and women (two-group version of Section 2.3), diab women. The covariate e↵ects in the joint models are women/men for separate models of diabetic men and very similar across the diabetic and non-diabetic sub- women (without the extension of Section 2.3), and models: posterior mean of s0 is 0.96 for the horseshoe using the horseshoe, Laplace or Gaussian priors on model. This implies that the risk factors behave sim- the biomarker e↵ects, or using only the established ilarly in both groups, but it is also possible that the risk factors (no-biomarkers). The MLPDs and q- dataset has limited information to distinguish between values were computed separately for the predictions them and that larger datasets could uncover more dif- for women and men, and for pooled predictions, and, ferences. importantly, in each case only for the predictions on the diabetic subpopulation. Finally, it seems that the horseshoe prior performs bet- ter than the Laplace, and that the Gaussian is the The results show that there is an increase in the pre- worst of the three for this data. Figure 3 shows a dictive performance when supplanting the established comparison of the biomarker regression coefficients un- risk factors with the candidate biomarkers. The in- der these priors. The Laplace and the Gaussian priors crease holds both when using the joint models or us- 85 shrink the largest coefficient more than the horseshoe as would be expected in a sparse setting [5, 16]. Fur- thermore, the horseshoe seems to shrink coefficients near zero more strongly than the Laplace making the 1 relative explanatory power credible intervals around zero narrower. 4.2 BIOMARKER SELECTION AND 0.8 SUBMODEL PREDICTIVE PERFORMANCE 0.6 cross-validation sets We applied the projection predictive covariate selec- full data tion (Section 3.1) with the joint horseshoe model as 1 10 20 30 40 50 55 the reference. The forward selection was run using number of biomarkers only the part of the model concerning diabetic individ- uals. We run the forward selection jointly for women and men to get an overall biomarker ranking for the di- Figure 4: Relative explanatory powers along the for- abetic subpopulation. The forward selection was run ward selection path. also for each cross-validation training set separately (using the reference model fitted on the corresponding training data). Figure 4 shows the relative explanatory power curves along the forward selection path. In the full dataset, ·10 2 the best candidate biomarker attains 61% explanatory power relative to the reference model, five best reach 2 over 80% and ten biomarkers are needed to reach over 90%. The growth in the explanatory power slows with MLPD more biomarkers, indicating diminishing gains from 0 adding more candidate biomarkers (22 are needed to reach 95% and the remaining 33 account for the last 5%). 2 However, choosing an acceptable loss in the explana- tory power to select an appropriate minimal subset of 1 10 20 30 40 50 55 the biomarkers for use in prediction tasks seems diffi- number of biomarkers ·10 2 cult. In Figure 5, we show MLPDs (normalized to the reference model) obtained using the projection pre- 2 dictive covariate selection approach within the cross- validation. Top panel shows the MLPD along the MLPD forward selection path and the bottom panel by the 0 obtained relative explanatory power (e.g., at 0.6, the predictions in each cross-validation fold was made with the smallest submodel reaching 60% power in that fold). These show a mode at 2 biomarkers and at 2 around 0.65 relative explanatory power (which corre- 0.5 0.6 0.7 0.8 0.9 1 sponds to choosing two, three or four biomarkers de- relative explanatory power pending on the fold). A second peak can be seen at 10 biomarkers or correspondingly at 0.91 power (10–16 biomarkers). Figure 5: MLPD values (in reference to the full Unfortunately, the variance in the cross-validation es- model) by number of biomarkers (top) or by explana- timates is quite large for making a definite choice based tory power threshold (bottom). Top: vertical lines on them. Figure 6 shows the full set of pairwise com- are 95% Bayesian bootstrap credible intervals for the parisons between the submodels along the forward se- MLPD. Bottom: dashed curves show the (pointwise) lection path (by number of biomarkers; same as in Fig- credible interval. ure 5 top panel). This indicates that two biomarkers is overall the best choice, but the di↵erence to the 10 86 1 not warrant any formal causal inferences. Moreover, 0 the inclusion of the data of non-diabetic individuals 0.9 may bias the inferences on the diabetic subpopulation 10 0.8 towards the general population, when the dataset has 0.7 limited information to distinguish them. Nevertheless, 20 the presented predictive comparisons, being indepen- 0.6 dent of the model assumptions, justify studying the Mi 0.5 joint model. 30 0.4 The submodels in projection predictive covariate se- 0.3 lection depend on the observed data only through the 40 reference model. Thus, finding the submodel param- 0.2 eters and the covariate selection itself do not cause 50 0.1 further fitting to the data, but rely on the information 0 provided by the reference model [11]. The projected 0 10 20 30 40 50 submodels may also be able to retain some predictive Mj features of the reference model that would not be avail- able, if the submodels were independently fitted to the Figure 6: q-value matrix for Mi is better than Mj data [11]: importantly, from Bayesian point of view, with regard to MLPD (Section 3.2; M· s refer to the the submodel may be able to account for uncertainty submodels along the forward selection path). due to the omission of some covariates. However, selecting a single submodel for future pre- biomarker selection is not large (q-value = 0.52). How- diction tasks may be difficult. We examined using the ever, on comparing these to the full model or generally projection approach within cross-validation to obtain models with 11 or more biomarkers, the 10 biomarker estimates of the submodel predictive performances. A selection is more confidently better (q-values mostly disadvantage of this procedure is that the performance > 0.9) than the 2 biomarker selection (q-values mostly estimates are for the selection process and not for some within 0.7–0.8). particular combination of selected biomarkers. Fur- thermore, if selection is based on these estimates, the Nevertheless, the analysis seems to support two clearly performance estimate for the chosen submodel will not predictively relevant biomarkers for the cardiovascular anymore be unbiased for out-of-sample prediction un- risk prediction, with further 8 possibly interesting can- less nested cross-validation is used [11]. didate biomarkers, but with some uncertainty about their relevance. Figure 3 also supports this conclusion Acknowledgements with two of the biomarkers having clearly non-zero ef- fects. We acknowledge the computational resources provided by Aalto Science-IT project. 5 DISCUSSION References This paper presented a Bayesian analysis of [1] Erkki Vartiainen, Tiina Laatikainen, Markku Pel- cardiovascular-event-free survival in diabetic individ- tonen, Anne Juolevi, Satu Männistö, Jouko Sund- uals, with the aim of identifying biomarkers with pre- vall, Pekka Jousilahti, Veikko Salomaa, Liisa Val- dictive value. We presented a comparison of the horse- sta, and Pekka Puska. Thirty-five-year trends shoe, Laplace and Gaussian priors on the candidate in cardiovascular risk factors in Finland. Inter- biomarker e↵ects and demonstrated empirically an ex- national Journal of Epidemiology, 39(2):504–518, pected [5, 16] di↵erence in their behaviour. We further 2010. extended the model hierarchically to include data of non-diabetic individuals and examined the use of pro- [2] T. J. Mitchell and J. J. Beauchamp. Bayesian jection predictive covariate selection to find biomarker variable selection in linear regression. Journal subsets with good predictive performance. of the American Statistical Association, 83(404): 1023–1032, 1988. We could also hope that the predictive biomarkers cap- ture some part of the state of the underlying disease [3] Trevor Park and George Casella. The Bayesian process and as such could be used to speculate about lasso. Journal of the American Statistical Associ- causal disease pathways and to prioritize biomarkers ation, 103(482):681–686, 2008. for further study. However, the analysis approach does [4] Jim E. Griffin and Philip J. Brown. Inference 87 with normal-gamma prior distributions in regres- Bayesian Statistics 9, pages 501–538. Oxford Uni- sion problems. Bayesian Analysis, 5(1):171–188, versity Press, 2011. 2010. [17] Nicholas G. Polson and James G. Scott. On the [5] Carlos M. Carvalho, Nicholas G. Polson, and half-Cauchy prior for a global scale parameter. James G. Scott. The horseshoe estimator for Bayesian Analysis, 7(4):887–902, 2012. sparse signals. Biometrika, 97(2):465–480, 2010. [18] Fei Liu, Sounak Chakraborty, Fan Li, Yan Liu, [6] Zhihua Zhang, Shusen Wang, Dehua Liu, and and Aurelie C. Lozano. Bayesian regularization Michael I. Jordan. EP-GIG priors and applica- via graph Laplacian. Bayesian Analysis, 9(2): tions in Bayesian sparse learning. The Journal 449–474, 2014. of Machine Learning Research, 13(1):2031–2061, [19] Theodoros Evgeniou, Charles A. Micchelli, and 2012. Massimiliano Pontil. Learning multiple tasks with [7] Robert Tibshirani. Regression shrinkage and se- kernel methods. Journal of Machine Learning Re- lection via the lasso. Journal of the Royal Sta- search, 6:615–637, 2005. tistical Society. Series B (Methodological), 58(1): [20] Daniel Sheldon. Graphical multi-task learning. In 267–288, 1996. NIPS 2008 Workshop: “Structured Input – Struc- [8] S. L. van der Pas, B. J. K. Kleijn, and A. W. tured Output”, 2008. van der Vaart. The horseshoe estimator: Pos- [21] Cédric Archambeau, Shengbo Guo, and Onno terior concentration around nearly black vectors. Zoeter. Sparse Bayesian multi-task learning. In arXiv preprint arXiv:1404.0202, 2014. Advances in Neural Information Processing Sys- [9] Emerging Risk Factors Collaboration. Diabetes tems 24, pages 1755–1763, 2011. mellitus, fasting blood glucose concentration, and [22] Constantinos Goutis and Christian P. Robert. risk of vascular disease: a collaborative meta- Model choice in generalised linear models: A analysis of 102 prospective studies. The Lancet, Bayesian approach via Kullback–Leibler projec- 375(9733):2215–2222, 2010. tions. Biometrika, 85(1):29–37, 1998. [10] Sinno Jialin Pan and Qiang Yang. A survey on [23] David J. Nott and Chenlei Leng. Bayesian pro- transfer learning. IEEE Transactions on Knowl- jection approaches to variable selection in gener- edge and Data Engineering, 22(10):1345–1359, alized linear models. Computational Statistics & 2010. Data Analysis, 54(12):3227–3241, 2010. [11] Aki Vehtari and Janne Ojanen. A survey of [24] Tilmann Gneiting, Fadoua Balabdaoui, and Bayesian predictive methods for model assess- Adrian E. Raftery. Probabilistic forecasts, cali- ment, selection and comparison. Statistics Sur- bration and sharpness. Journal of the Royal Sta- veys, 6:142–228, 2012. tistical Society: Series B (Statistical Methodol- ogy), 69(2):243–268, 2007. [12] Jérome A. Dupuis and Christian P. Robert. Vari- able selection in qualitative models via an en- [25] Ron Kohavi. A study of cross-validation and tropic explanatory power. Journal of Statistical bootstrap for accuracy estimation and model se- Planning and Inference, 111(1):77–94, 2003. lection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJ- [13] Andrew Gelman, John B. Carlin, Hal S. Stern, CAI), pages 1137–1145, 1995. David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. CRC press, third [26] Donald B. Rubin. The Bayesian bootstrap. The edition, 2014. Annals of Statistics, 9(1):130–134, 1981. [14] Joseph G. Ibrahim, Ming-Hui Chen, and Debajy- [27] Stef van Buuren and Karin Groothuis-Oudshoorn. oti Sinha. Bayesian Survival Analysis. Springer, MICE: Multivariate imputation by chained equa- 2001. tions in R. Journal of Statistical Software, 45(3), 2011. [15] Andrew Gelman. Prior distributions for vari- [28] Matthew D. Ho↵man and Andrew Gelman. The ance parameters in hierarchical models. Bayesian No-U-Turn Sampler: Adaptively setting path Analysis, 1(3):515–533, 2006. lengths in Hamiltonian Monte Carlo. Journal of [16] Nicholas G. Polson and James G. Scott. Shrink Machine Learning Research, 15:1593–1623, 2014. globally, act locally: Sparse Bayesian regulariza- [29] Stan Development Team. Stan: A C++ library tion and prediction. In J. M. Bernardo, M. J. for probability and sampling, version 2.2, 2014. Bayarri, J. O. Berger, A. P. Dawid, D. Heck- URL http://mc-stan.org/. erman, A. F. M. Smith, and M. West, editors, 88