Feature Selection for Survival Analysis in Bioinformatics

Charles Englebert 1,2, Thomas Quinn 3, Isabelle Bichindaritz 1
1 State University of New York, Oswego, NY, USA
2 Supelec, Metz, France
3 Deakin University, Geelong, Australia

Abstract: The development of microarray technology has made it possible to assemble biomedical datasets that measure the expression profiles of thousands of genes simultaneously. However, such high-dimensional datasets make computation costly and can complicate the interpretation of a predictive model. To address this, feature selection methods are used to extract biological information from a large amount of data in order to filter the expression dataset down to the smallest possible subset of accurate predictor genes. Feature selection has three main advantages: it decreases computational costs, mitigates the risk of overfitting due to high inter-variable correlations, and allows for an easier clinical interpretation of the model. In this paper we compare three methods of feature selection: iterative Bayesian Model Averaging (BMA), Random Survival Forest (RSF), and Cox Proportional Hazards (CPH), and four methods of survival analysis: Random Survival Forest (RSF), Cox Proportional Hazards (CPH), the Aalen Additive Fitter (AAF), and Deepsurv (a neural network). Feature selection methods automatically extract the features most relevant for survival analysis from the dataset; we compare them with a hand-selected set of features. Some survival analysis methods, such as RSF and CPH, can themselves perform feature selection. All the data we used came from the METABRIC breast cancer dataset. For every feature selection method, we compare varied numbers of selected features. Our results indicate that feature selection improves the performance of survival analysis methods. Overall, the best survival analysis performance was obtained by combining RSF for feature selection and Deepsurv for survival prediction.

I. PRINCIPLES OF SURVIVAL ANALYSIS

Survival analysis is about predicting the time until an event occurs [7]. In our case, we are interested in using microarray data to estimate the longevity of a patient diagnosed with breast cancer. However, the approach generalizes to any high-dimensional data used for survival analysis tasks.

A. Modeling Survival Analysis

In survival analysis, we do not necessarily know the actual longevity of every individual, since the study may have ended within their lifetime. Individuals who have not experienced the death event during the study are labeled as right censored; this includes any patient who drops out of the study early. As such, for each individual we record either the survival time (if we have the date of death) or a censored time (if we do not have the date of death but instead the date of the last visit to the doctor). An instance in the survival data is usually represented as (x_i, t_i, δ_i), where x_i is the feature vector, t_i is the observed time, and δ_i is the event indicator. An indicator of 1 denotes an uncensored instance (i.e., the death of a patient), while 0 denotes a censored instance. The survival function S(t|x) = P(O > t | x) represents the probability of being alive after time t, where O denotes the total survival time and x the data from which the prediction is inferred.

B. Concordance Index

We cannot evaluate the predictive ability of a survival model using classical loss functions such as the L2 loss, because we do not possess the true curve representing the probability of dying for our data [5]. Instead, we only know the time at which a patient actually died. One widely used measure is the concordance index, or c-index. This measure evaluates the ordering of predicted survival times: 0.5 is the expected result of random predictions, 1.0 is perfect concordance, and 0.0 is perfect anti-concordance (i.e., one must multiply the predictions by −1 to obtain a c-index of 1.0). Throughout this paper, we use the c-index to evaluate the accuracy of our predictions.
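As an illustration (not part of the original study), the c-index can be computed with the concordance_index utility from the Python lifelines package; the toy numbers below are invented purely for demonstration:

```python
from lifelines.utils import concordance_index

# Observed survival times, model scores (higher = predicted to survive longer),
# and event indicators (1 = death observed, 0 = right censored).
event_times = [5, 10, 14, 30, 42]
predicted_scores = [4, 12, 13, 25, 50]
event_observed = [1, 1, 0, 1, 1]

# Fraction of comparable pairs whose predicted ordering matches the observed one:
# 1.0 = perfect concordance, 0.5 = random, 0.0 = perfect anti-concordance.
print(concordance_index(event_times, predicted_scores, event_observed))
```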
II. METHODS OF SURVIVAL ANALYSIS

A. Cox Model

Cox hazards regression is a standard method for survival analysis. It consists in modeling the hazard function, defined by

  λ(t|x) = lim_{Δt→0} Pr(t ≤ O < t + Δt | O ≥ t, x) / Δt,   (1)

which represents how the risk of an event per unit time changes over time, as

  λ(t|x) = λ_0(t) exp(βᵀx),   (2)

where β = (β_1, ..., β_p)ᵀ is a vector of parameters, λ_0(t) is a baseline hazard function, O denotes the total survival time, and x the data from which the prediction is inferred. We also define h(x) = βᵀx as the risk function: it determines the multiplicative effect of an individual's covariates on the baseline hazard.

B. Aalen Additive Fitter

The Aalen Additive Fitter (AAF) works in the same spirit as Cox Proportional Hazards, but instead uses the additive regression function

  λ(t|x) = β_0(t) + Σ_{i=1}^{p} β_i(t) x_i.   (3)

C. Deepsurv

Deepsurv is a deep learning method for survival analysis based on the Faraggi and Simon network [2]. The implementation of the model is based on Theano [4]. The likelihood of the model is defined as

  L(θ) = Π_{k=1}^{K} exp(h_θ(x_k)) / Σ_{m=1}^{M} exp(h_θ(x_m)),   (4)

where θ represents the parameterized weights of the network on which the learning is performed, h_θ is the risk function of the Cox model as approximated by the network, M is the number of patients still at risk at time t_k, and K is the total number of patients in the dataset. The hazard function is then λ(t|x) = λ_0(t) exp(h(x)).

Since many numbers in the open interval (0, 1) are multiplied together, it is numerically preferable to sum the logarithms of those numbers instead; this is what Deepsurv does. The loss function (l) chosen for this model is therefore the negative log of this likelihood, which yields Equation 7 below. With this network and 16 hand-selected features, the team that developed Deepsurv achieved a concordance index of 0.69 [2].
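To make the loss concrete, here is a minimal NumPy sketch of the negative log partial likelihood in the common form where the outer sum runs over patients with an observed death; the function name is ours, and this is not the Theano implementation used by Deepsurv:

```python
import numpy as np

def neg_log_partial_likelihood(risk, time, event):
    """Negative log Cox partial likelihood (sketch).

    risk  : array of model risk scores h(x), one per patient
    time  : array of observed times t_i
    event : array of indicators (1 = death observed, 0 = censored)
    """
    loss = 0.0
    for k in range(len(risk)):
        if event[k] == 1:
            # Risk set: patients whose observed time is >= t_k.
            at_risk = risk[time >= time[k]]
            loss -= risk[k] - np.log(np.sum(np.exp(at_risk)))
    return loss

# Toy example (invented data): three patients, two observed deaths.
risk = np.array([0.3, 1.2, -0.5])
time = np.array([2.0, 5.0, 7.0])
event = np.array([1, 1, 0])
print(neg_log_partial_likelihood(risk, time, event))
```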
D. Random Survival Forest (RSF)

Random Forest is a method that operates by constructing a multitude of decision trees at training time and outputting the majority class (for a classification task) or the mean prediction (for a regression task). This popular machine learning method has been adapted to survival analysis. We used the R package randomForest. Random forests can also rank features, and so they provide another feature selection method.

III. FEATURE SELECTION

The aim of this paper is to demonstrate the usefulness of feature selection before survival analysis. High-dimensional data pose a great challenge for computational techniques. We aim to extract biological information from this large amount of data to filter the expression dataset down to the smallest possible subset of accurate predictor genes. Reducing the number of predictor genes has three main advantages: it decreases computational costs, mitigates the risk of overfitting due to high inter-variable correlations, and allows for an easier clinical interpretation of the model.

A. Cox Proportional Hazards for Feature Selection

Cox Proportional Hazards can also be used to perform feature selection. The method consists in ranking the features in descending order of their log likelihood.

B. Bayesian Model Averaging

For high-dimensional data, many features can represent the data equally well. Here, we use the term model to refer to any such subset of selected features. We focus our attention on one feature selection method in particular, Bayesian Model Averaging (BMA), used here to select subsets of genes from microarray data [1]. Instead of choosing a single model and proceeding as if the data were actually generated from it, BMA combines the effectiveness of multiple models by taking the weighted average of their posterior distributions [1]. The core idea behind Bayesian Model Averaging is expressed in the following equation:

  P(Δ | D) = Σ_{k=1}^{K} P(Δ | M_k, D) P(M_k | D),   (5)

where Δ is the quantity of interest we want to compute (it could be, for example, a blood pressure or a temperature), D is the data, and M_k is the k-th model. The probability of the quantity taking a certain value is thus the mean of the values given by all models, each weighted by its own probability. There are three apparent issues at this point: obtaining the subset of relevant models {M_k}, determining P(Δ | M_k, D), and determining P(M_k | D). The BMA algorithm addresses these: once we have P(M_k | D), we can deduce the predictive utility of gene x_i by

  P(x_i | D) = Σ_{M_k : x_i ∈ M_k} P(M_k | D).   (6)

Note that the number of models that exist for microarray data is usually very large: given G candidate explanatory genes in an expression set, there are 2^G possible models to consider. Meanwhile, the number of genes in microarray datasets typically ranges from hundreds to tens of thousands.

C. Bayesian Model Averaging for High-Dimensional Data

The BMA algorithm described above can only deal with data that have at most 30 features. However, the usual practice of employing stepwise backward elimination to reduce the number of genes down to 30 is not applicable in a situation where the number of predictive variables is greater than the number of samples. To address this issue, Yeung et al. developed an iterative BMA algorithm that takes a rank-ordered list of genes and successively applies the traditional BMA algorithm until all genes have been processed [3]. We apply iterative BMA to our data, using a Cox proportional hazards regression to create the rank-ordered list of genes. Genes are ranked in descending order of their log likelihood (l) based on Equation 2, defined as

  l(θ) = Σ_{k=1}^{K} [ h_θ(x_k) − log Σ_{m=1}^{M} exp(h_θ(x_m)) ],   (7)

where θ represents the parameterized weights of the network on which the learning is performed, h_θ is the risk function of the Cox model, M is the number of patients still at risk at time t_k, and K is the total number of patients in the dataset. The baseline hazard function λ_0(t) does not appear in the equation, as it is assumed to be the same for all patients.

D. Summary of the BMA Algorithm

The iterative BMA algorithm iterates through the user-specified list of p top-ranked genes, applying the traditional BMA algorithm for survival analysis to each group of variables in a BMA window of size W = 25. Genes with high posterior probabilities are retained, while genes with low posterior probabilities are eliminated. This entails (see the sketch after this list):

1. Define parameters:
   (a) D, the training set
   (b) G, the total number of genes in the dataset
   (c) n, the number of samples in the training set
   (d) p, the number of top-ranked genes to process
   (e) W, the size of the BMA window (i.e., the number of genes processed using traditional BMA)
2. Import the training set D.
3. Rank-order all G genes by applying Cox proportional hazards regression to each individual gene.
4. Apply BMA to X = (x_1, x_2, ..., x_W).
5. toBeProcessed ← (x_{W+1}, ..., x_p).
6. Repeat until all p genes are processed:
   (a) Remove from X any gene i with Pr(x_i | D) < 1%.
   (b) If all genes have Pr(x_i | D) < 1%, determine the minimum Pr(x_i | D), minProb_ne0, in the current BMA window. Next, remove from X any gene i with Pr(x_i | D) < minProb_ne0 + 1%.
   (c) Replace the genes removed from X with the top-ranked genes from toBeProcessed, removing them from the toBeProcessed set.
   (d) Apply BMA to X.
7. Output the selected genes in X and their corresponding posterior probabilities.
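A minimal Python sketch of this windowing loop follows. The callable run_bma, standing in for one application of traditional BMA (returning a posterior inclusion probability per gene), is hypothetical; the published implementation is the iterative BMA algorithm of Yeung et al. [3], not this code:

```python
def iterative_bma(genes, run_bma, W=25, p=1000):
    """Iterative BMA window loop (sketch).

    genes   : list of gene names, already rank-ordered (step 3)
    run_bma : hypothetical callable mapping a list of <= W genes to a
              dict {gene: posterior inclusion probability}  (steps 4, 6d)
    """
    window = genes[:W]                  # step 4
    to_be_processed = genes[W:p]        # step 5
    posteriors = run_bma(window)
    while to_be_processed:              # step 6
        keep = [g for g in window if posteriors[g] >= 0.01]       # step 6a
        if not keep:                    # step 6b: all genes below 1%
            min_prob = min(posteriors[g] for g in window)
            keep = [g for g in window if posteriors[g] >= min_prob + 0.01]
        n_new = min(W - len(keep), len(to_be_processed))
        if n_new == 0:
            break  # sketch-only guard: no gene was eliminated, stop early
        window = keep + to_be_processed[:n_new]                   # step 6c
        to_be_processed = to_be_processed[n_new:]
        posteriors = run_bma(window)                              # step 6d
    return {g: posteriors[g] for g in window}                     # step 7
```

In practice, step 3 corresponds to fitting one univariate Cox model per gene, and run_bma would wrap a traditional BMA routine for survival data.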
IV. RESULTS

A. Dataset

We applied our methods to the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) dataset, which originally used expression profiles to identify new breast cancer subgroups in an effort to help physicians provide better treatment recommendations. This dataset consists of gene expression data and clinical features for 1,981 patients, 43.85% of whom have had an observed death due to breast cancer (with a median survival time of 1,907 days).

We applied our three feature selection algorithms to this dataset. We also reproduced the hand-selected feature set presented by the Deepsurv team, which contains: four prognostic meta-genes (CIN, MES, LYM, and FGD3-SUSD3), the age at diagnosis, the number of positive lymph nodes, the tumor size, the ER status, the HER2 status, four indicators known to be predictive of breast cancer (ERBB2, MKI67, PGR, and ESR1), and the prescribed treatment (i.e., chemotherapy, radiotherapy, or hormone therapy) [2]. Meta-genes are sets of genes coexpressed across multiple cancer types; these features are calculated from gene expression. The four prognostic meta-genes were previously found to accurately predict the survival time of patients by the winners of the Sage Bionetworks-DREAM Breast Cancer Prognosis Challenge [6].

B. Selected Features

We were interested in knowing which automatically selected features the methods have in common with the hand-selected features. The table below summarizes the common features:

  BMA               RSF            COX
  ---------------   ------------   ------------------
  CHEMOTHERAPY      ER STATUS      3 CIN genes
  RADIOTHERAPY      HER2 STATUS    1 FGD3-SUSD3 gene
  HORMONE THERAPY

It is interesting to note that each method seems specialized in a particular type of feature. BMA selected only clinical features in common with the hand-selected features; more precisely, it selected only clinical features related to treatment. RSF includes the two selected hormone-related features, ER (estrogen receptor) and HER2 (human epidermal growth factor receptor 2), which are known to be related to breast cancer. COX selected four genes related to the meta-genes CIN and FGD3-SUSD3 [6].

C. Influence of the Number of Features

We aim to identify the influence of the number of selected features. To do so, we selected varying numbers of features with each feature selection method; the subset sizes used are 12, 14, 16, 18, 20, 22, 32, 64, and 128. We then tested each survival method on every subset generated (a sketch of this protocol follows).
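The sketch below ranks features with a random survival forest and scores a model retrained on each top-k subset. It uses the Python scikit-survival package and permutation importance rather than the R randomForest package used in this paper, so it approximates the procedure rather than reproducing the authors' code; X is assumed to be a numeric covariate DataFrame and y a structured survival array:

```python
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def top_k_subsets(X, y, sizes=(12, 14, 16, 18, 20, 22, 32, 64, 128)):
    """Rank features with an RSF, then score an RSF retrained on each top-k subset."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    rsf = RandomSurvivalForest(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    # Permutation importance: drop in c-index when each feature is shuffled.
    imp = permutation_importance(rsf, X_te, y_te, n_repeats=5, random_state=0)
    ranked = X.columns[np.argsort(imp.importances_mean)[::-1]]
    scores = {}
    for k in sizes:
        cols = list(ranked[:k])
        model = RandomSurvivalForest(n_estimators=200, random_state=0)
        model.fit(X_tr[cols], y_tr)
        scores[k] = model.score(X_te[cols], y_te)  # Harrell's c-index
    return scores
```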
The Deepsurv method requires a lot of fine-tuning and takes very long to train. For this reason, we do not have as many results with Deepsurv as with the three other methods. Deepsurv is also extremely sensitive to the dataset used: it needs to be fine-tuned differently whenever the feature selection method changes, which makes it inappropriate for comparing feature selection methods. We therefore present the results obtained with RSF, COX, and AAF here, and discuss Deepsurv in a later section.

FIG. 1. Aalen Additive Fitter, Random Survival Forest, and Cox Proportional Hazards on subsets of features of various sizes generated by Bayesian Model Averaging.

FIG. 2. Aalen Additive Fitter, Random Survival Forest, and Cox Proportional Hazards on subsets of features of various sizes generated by Cox Proportional Hazards.

FIG. 3. Aalen Additive Fitter, Random Survival Forest, and Cox Proportional Hazards on subsets of features of various sizes generated by Random Survival Forest.

It is interesting to notice in Fig. 1, Fig. 2, and Fig. 3 that the number of features seems to have a negative impact on the predictions of CPH and AAF. This is counter-intuitive, as a dataset with more features carries more information and would therefore be expected to yield more accurate predictions. We also note that RSF clearly outperforms the two other methods presented here on every dataset, and that it becomes more accurate as the number of features increases.

D. Influence of Prevalent Features

The hand-selected feature set of 16 features was used by the Deepsurv team to reach its best concordance index of 0.695. This feature set yields a concordance index of 0.7024 with RSF, 0.6635 with CPH, and 0.6630 with AAF. None of our automatic feature selection methods could match these results with the same number of features. We therefore tried to evaluate the influence of particular features present in the hand-selected set, using two methods.

First, we replaced subsets of features in the hand-selected set with features chosen by the automatic feature selection methods. For this first method, we expect a negative effect on the concordance index compared to the results given by the full hand-selected set. We reproduced the hand-selected feature set without the indicators (ERBB2, MKI67, PGR, and ESR1); without the treatment and hormonal information (chemotherapy, radiotherapy, hormone therapy, ER, and HER2); and without the meta-genes (CIN, MES, LYM, and FGD3-SUSD3).

FIG. 4. Aalen Additive Fitter, Random Survival Forest, and Cox Proportional Hazards on subsets of features generated from the hand-selected feature set, from which certain subsets of features were removed and replaced by others obtained by automatic feature selection methods.

One can see in Fig. 4 that the effect of removing features is very slight, but a negative effect is still noticeable. These experiments do not suffice to explain the difference in accuracy with respect to the automatic feature selection methods.

The second method consisted in adding those same features to the automatically selected feature sets. We chose the RSF method, as it provided the best results. In this case we expect a positive effect on the concordance index, as can be verified in Fig. 5, although the impact is slight.

FIG. 5. Aalen Additive Fitter, Random Survival Forest, and Cox Proportional Hazards on subsets of features generated from the RSF feature selection set, from which certain subsets of features were removed and replaced by others taken from the hand-selected features.
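For reference, c-index comparisons like those in this section can be reproduced in outline with the Python lifelines package. The sketch below is our illustration, not the authors' pipeline; the bundled Rossi dataset merely stands in for a DataFrame of selected METABRIC features with duration and event columns:

```python
from lifelines import CoxPHFitter, AalenAdditiveFitter
from lifelines.utils import concordance_index
from lifelines.datasets import load_rossi

# Stand-in dataset; replace with the selected METABRIC features.
df = load_rossi()  # duration column 'week', event column 'arrest'

cph = CoxPHFitter().fit(df, duration_col="week", event_col="arrest")
aaf = AalenAdditiveFitter(coef_penalizer=1.0).fit(df, duration_col="week", event_col="arrest")

# Higher predicted expected survival should match longer observed survival.
for name, model in [("CPH", cph), ("AAF", aaf)]:
    pred = model.predict_expectation(df).squeeze()
    print(name, concordance_index(df["week"], pred, df["arrest"]))
```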
E. Prediction with Deepsurv

A deep neural network needs fine-tuning to find the appropriate parameters that provide the best predictions. If the size of the input changes, the architecture of the network changes, and the fine-tuning must change as well. The analysis proposed here therefore focused on four subsets of the METABRIC dataset: one with the 12 most likely features, one with the 14 most likely features, one with the 16 most likely features, and one with the features selected by hand by the Deepsurv team. The concordance indices obtained with Deepsurv are:

  Feature set   12 BMA   14 hand-selected   14 RSF   64 RSF
  c-index       0.5388   0.6676             0.6720   0.7173

These results show that the 14 features automatically selected by RSF perform at least as well as the hand-selected features with Deepsurv, which is an important result. Another important result is that a moderate number of additional features improves Deepsurv's performance even further, although only with the RSF feature selection method.

V. CONCLUSION

Although more work is needed to reach a definitive conclusion, our results indicate that feature selection can play a helpful role when performing survival analysis on high-dimensional data. In addition to its uncontested role in decreasing computational costs, it improves the concordance index of the survival predictions. RSF, moreover, takes advantage of the information contained in a moderate number of additional features. Overall, the best survival analysis performance was obtained by combining RSF for feature selection and Deepsurv for survival prediction.

[1] Amalia Annest. Iterative Bayesian Model Averaging: A Method for the Application of Survival Analysis to High-Dimensional Microarray Data. University of Washington, 2008.
[2] Jared L. Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deep Survival: A Deep Cox Proportional Hazards Network. BioMed Central, 2016.
[3] Ka Yee Yeung, Roger E. Bumgarner, and Adrian E. Raftery. Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Oxford Academic, 2005.
[4] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
[5] Vikas C. Raykar, Harald Steck, and Balaji Krishnapuram. On ranking in survival analysis: Bounds on the concordance index.
[6] Wei-Yi Cheng, Tai-Hsien Ou Yang, and Dimitris Anastassiou. Biomolecular events in cancer revealed by attractor metagenes. PLoS Comput Biol, 2013.
[7] Xinliang Zhu, Jiawen Yao, and Junzhou Huang. Deep Multi-Instance Learning for Survival Prediction from Whole Slide Pathological Images. BioMed Central, 2016.