-

Charles Englebert

1 2

Thomas Quinn

Isabelle Bichindaritz

1 0 Deakin University , Geelong , Australia 1 State University of New York , Oswego, NY , USA 2 Supelec , Metz , France

The development of microarray technology has made it possible to assemble biomedical datasets that measure the expression pro le of thousands of genes simultaneously. However, such high-dimensional datasets make computation costly and can complicate the interpretation of a predictive model. To address this, feature selection methods are used to extract biological information from a large amount of data in order to lter the expression dataset down to the smallest possible subset of accurate predictor genes. Feature selection has three main advantages: it decreases computational costs, mitigates the possibility of over tting due to high intervariable correlations, and allows for an easier clinical interpretation of the model. In this paper we compare three methods of feature selection : iterative Bayesian Model Averaging (BMA), Random Survival Forest (RSF) and Cox Proportional Hazard (CPH) and four methods of survival analysis: Analysis Random Survival Forest (RSF), Cox Proportional Hazard (CPH), Alan Additive Filter (AAF) and Deepsurv (neural network). Feature selection methods permit to extract the most relevant features for survival analysis from the dataset automatically and are compared with a hand selected set of features. Certain survival analysis methods also permit to perform feature selection such as RSF and CPH. All the data we used came from the Metabric breast cancer dataset. For every feature selection method, we compare varied numbers of selected features. Our results indicate that feature selection improves the performance of survival analysis methods. Overall, the best survival analysis performance was obtained by combining RSF for feature selection and Deepsurv for survival prediction.

PRINCIPLES OF SURVIVAL ANALYSIS

In survival analysis, we do not necessarily know the actual longevity of every individual since the experiment may have stopped within their lifetime. Those individuals who have not been subject to the death event during the study are labeled as right censored. This includes any patient who drops out of the study early. As such, for each individual, we consider either the survival time (if we have the death date) or a censored time (if we do not have the date of the death but instead the date of the last visit to the doctor). An instance in the survival data is usually represented as (xi; ti; i) where xi is the feature vector, ti is the observed time, and i is the indicator. An indicator of 1 is used to indicate an uncensored instance (i.e., the death of a patient) while 0 is used to indicate a censored instance. The survival function S(tjx) = P (O > tjx) represents the probability of being alive after time t, where O represent the total survival time and x the data from which is inferred the prediction.

We cannot evaluate the predictive ability of a survival model using classical loss functions such as L2 because we do not possess the curve which represents the probability of dying for our data [5]. Instead, we only know the time at which a patient actually died. One measure used is the concordance index, or c-index. This measure evaluates the ordering of predicted survival times: 0:5 is the expected result from random predictions, 1:0 is perfect concordance, and 0:0 is perfect anti-concordance (i.e., one must multiply predictions by 1 to get a cindex of 1:0). Throughout this paper, we use the c-index to evaluate the accuracy of our predictions.

II.

METHODS OF SURVIVAL ANALYSIS

Cox hazard regression is a standard method for survival analysis. It consists in modeling the hazard function de ned by: (tjx) = lim t!0

P r(t <= O <= t + t tjO >= tjx) which represents how the risk of an event per unit time changes over time, as: (tjx) =

0(t)exp( T x) where = ( 1; ::; 2)T is a vector of parameters, 0(t) is a baseline hazard function, O represent the total survival time, and x the data from which is inferred the prediction. We also de ne h(x) = T x as the risk function: it (1) (2) represents how the risk of an event per time unit changes over time.

Alan Additive Filter

Alan Additive lter works the same way as Cox Proportional Hazard but instead it uses : as a regression function.

(t) =

T 0(t) + X i(t):xi

i=1

Deepsurv is a deep learning method for survival analysis based on the Faraggi and Simon network [2]. The implementation of the model is based on Theano [4], which de nes the likelihood of the model as:

K L( ) = Y k=1

exp(h (xk)) PM

m=1 exp(h (xm) where represents the parameterized weights of the network on which the learning is made, h is the risk function of the Cox model, m is the number of patients still at risk at time ti, and K is the total number of patients in the dataset. The hazard function is then (tjx) = 0(t)exp(h(x)).

As many numbers included in ]0; 1[ are multiplied it is relevant to sum the logarithm of those numbers instead. This is what is done in Deepsurv. The loss function (l) that was chosen for this model is then de ned as the negative log of this likelihood. This method yields equation 7.

With this network and 16 hand-selected features, the team that developed Deepsurv achieved a concordance index of 0.69 [2].

Random Forest is a method that operates by constructing a multitude of decision trees at training time and outputting the class (in case of a classi cation analysis) or the mean regression (in case of a regression analysis). This popular machine learning method was adapted to survival analysis. We used the R package randomF orest. Random forest also permits to rank features and so it provides another features selection method.

The aim of this paper is to prove the usefulness of feature selection before survival analysis. High-dimensional (3) (4) data pose a great challenge for computational techniques. We aim to extract biological information from this large amount of data to lter the expression dataset down to the smallest possible subset of accurate predictor genes. Reducing the number of predictor genes has three main advantages: it decreases computational costs, mitigates the possibility of over tting due to high inter-variable correlations, and allows for an easier clinical interpretation of the model.

Cox Proportional Hazard for feature selection COX Proportional Hazard permits to perform feature selection. The method consists in ranking the features in descending order of their log likelihood.

Bayesian Model Averaging

For high-dimensional data, many features can represent the data equally well. Here, we use the term model to refer to any such subset of selected features. We focus our attention on one feature selection method in particular, Bayesian Model Averaging (BMA), used here to select subsets of genes from microarray data [1]. Instead of choosing a single model and proceeding as if the data was actually generated from it, BMA combines the e ectiveness of multiple models by taking the weighted average of their posterior distributions [1]. The core idea behind Bayesian Model Averaging is expressed in the following equation:

P ( jMk; D) =

K X P ( jMk; D)P (MkjD) k=1 (5) where is the quantity of interest we want to compute, it can be be a blood pressure or a temperature, D is the data, and Mk is the kth model. The probability of the quantities to take a certain value is assumed to be the mean given by all models weighted by their own probability. There are three apparent issues at this point: obtaining the subsets of relevant models fMkg, determining P ( jMk; D), and determining P (MkjD). The BMA algorithm addresses these: once we have P (MkjD), we can deduce the predictive utility of gene xi by:

Mk=xi2Mk P (xijD) =

P (MkjD) (6)

Note that the number of models that exist for microarray data is usually very large. Given G candidate explanatory genes in an expression set, then there are 2G possible models to consider. Meanwhile, the number of genes in microarray datasets typically varies from 102s to 104s.

C. Bayesian Model Averaging for

High-Dimensional Data

The BMA algorithm described above can only deal with data that have at most 30 features. However, the usual practice of employing stepwise backward elimination to reduce the number of genes down to 30 is not applicable in a situation where the number of predictive variables is greater than the number of samples. To address this issue, Yeung et al. developed an iterative BMA algorithm that takes a rank-ordered list of genes and successively applies the traditional BMA algorithm until all genes have been processed [3]. We apply iterative BMA to our data, using a Cox proportional hazards regression to create the rank-ordered list of genes. Genes are ranked in descending order of their log likelihood (l) based on 2, de ned as:

K M l( ) = X h (xk) log X exp(h (xm)) k=1 m=1 (7) where represents the parameterized weights of the network on which the learning is made, h is the risk function of the Cox model, M is the number of patients still at risk at time ti, and K is the total number of patients in the dataset. The baseline hazard function, 0(t), does not appear in the equation as it is assumed to be the same for all patients.

D. Summary of BMA Algorithm

The iterative BMA algorithm iterates through the user-speci ed list of p top-ranked genes, applying the traditional BMA algorithm for survival analysis to each group of variables in a given BMA window of size W = 25. Then, genes with high posterior probabilities are retained while genes with low posterior probabilities are eliminated. This entails: 1. Define parameters: (a) D training set (b) G total number of genes in the dataset (c) n number of samples in the

training set (d) p number of top-ranked genes to process (e) W the size of the BMA window (i.e., the number of genes processed using traditional BMA)

2. Import training set D

3. Rank-order all G genes by applying Cox proportional hazard regression to each individual gene 4. Apply BMA to X = (x1; x2; :::; xW ) 5. toBeP rocessed

(xW +1; :::; xp)

6. Repeat until all p genes are

processed: (a) Remove from X any gene i with

P r(xijD) < 1% (b) If all genes have P r(xijD) < then determine the minimum P r(xijD), minP robne0, in the current BMA window. Next, remove from X any gene i with

P r(xijD) < (minP robne0 + 1%) (c) Replace genes removed from

X with top-ranked genes from toBeP rocessed, removing them from the toBeP rocessed set (d) Apply BMA to X 7. Output selected genes in X and their corresponding posterior probability. 1%, IV. RESULTS

A. Dataset

We applied our methods to the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) dataset which originally used expression pro les to identify new breast cancer subgroups in an e ort to help physicians provide better treatment recommendations. This dataset consists of gene expression data and clinical features for 1,981 patients, 43.85% of which have had an observed death due to breast cancer (with a median survival time of 1,907 days).

We applied our three algorithms of feature selection on this dataset. We also reproduced the feature set of size 14 presented by the Deepsurv team which contains : four prognostic meta-genes (CIN, MES, LYM, and FGD3-SUSD3), the age at diagnosis, the number of positive lymph nodes, the tumor size, the ER status, the HER2 status, four indicators known to be predictive of breast cancer: ERBB2, MKI67, PGR and ESR1, and the prescribed treatment (i.e., chemotherapy, radiotherapy, or hormonotherapy) [2]. Metagenes are sets of genes coexpressed in multiple cancer types. Those features are calculated from genes expression. The four prognostic meta-genes were previously found to predict accurately the survival time of patients by the winners of the Sage Bionetworks-DREAM Breast Cancer Prognosis Challenge [ 6 ].

We were interested in knowing which selected features are in common with the hand selected features. Here is a table which summarizes the common features:

BMA RSF COX CHEMOTHERAPY ER STATUS 3 CIN RADIO THERAPY HER2 STATUS 1 FGD3-SUSD3 HORM THERAPY

It is interesting to note that each method seems specialized in a particular type of feature. BMA has selected only clinical features in common with the hand selected features. More exactly it has selected only clinical features related to treatment. RSF includes the two selected hormone-related features ER (estrogen) and HER2 (human epidermal growth factor receptor), which are known to be related to breast cancer. COX selected four genes related to the metagenes CIN and FGD3-SUSD3 [ 6 ].

C. In uence of the Number of Features

We aim to identify the in uence of the number of selected features. To do so we selected variable numbers of features for each feature section method. The sizes used are : 12, 14, 16, 18, 20, 22, 32, 64 and 128. We then tested each survival method on every subset generated. The Deepsurv method needs a lot of ne tuning and is very long to train. For this reason, we don't have as many results with Deepsurv as with the three other methods. Also, Deepsurv is extremely sensitive to the dataset used. It needs to be ne tuned di erently when we change feature selection method. This makes the method not appropriate for comparing feature selection methods. We will present the results we obtained with RSF, COX and AAF here and we will discuss Deepsurv in a later section.

It is interesting to notice in Fig. 1, Fig. 2, and Fig. 3 that the number of features seems to have a negative impact on the predictions of BMA and AAF. This is counter-intuitive as a dataset with many more features carries more information and then is expected to provide more accurate prediction. We also note that RSF clearly outperforms the two other methods presented here on every dataset and gets more accurate when the number of features increases.

D. In uence of Prevalent Features

The hand selected feature set composed of 16 features is used by the Deepsurv team to reach its best concordance index of 0.695. This feature set provides a concordance index of 0.7024 with RSF, 0.6635 with CPH and 0.6630 with AAF. None of our automatic feature selection methods could provide results as high as these on this same number of features. We have tried to evaluate the in uence of particular features present in the hand selected set. To do this, we elaborated two methods. First we replaced subsets of features in the hand selected set of features by features selected by automatic feature selection methods. For this rst method we expect to see a negative e ect on the concordance index compared to the results given by the hand selected features. We reproduced the hand selected feature set without the indicators ERBB2, MKI67, PGR and ESR1, one without the treatments, hormonal information, chemotherapy, radiotherapy, hormonotherapy, ER, and HER2 and one without the metagenes, CIN, MES, LYM, and FGD3-SUSD3.

One can see in Fig. 4 that the e ect of removing features is very slight but we can still notice a negative e ect. Those experiments do not permit to understand the di erence of accuracy with automatic feature selection methods.

The second method consisted in adding those same features to automatically selected feature sets. We chose to use the RSF method as it provided the best results. In this case we expect to see a positive e ect on the concordance index as can be veri ed on Fig. 5, although the impact is slight.

Prediction with Deepsurv to

A deep neural network needs ne-tuning in order nd the appropriate parameters which provide the best predictions. If the size of the input changes, the architecture of the network changes, and the ne-tuning must also change. The proposed analysis focused here on four subsets of the Metabric dataset: one with the 12 most likely features, one with the 14 most likely features, one with the 16 most likely features and one with the features selected by hand by the DeepSurv team. 12 BMA 14 hand-selected 14 RSF 64 RSF 0.5388 0.6676 0.6720 0.7173

These results show that RSF automatically selected 14 features perform at least as well as the hand-selected features with Deepsurv, which is an important result. Another important result is that a moderate additional number of features improves Deepsurv's performance even more, however only with the RSF feature selection method.

CONCLUSION

Although more work is needed to reach a de nitive conclusion, our results indicate that feature selection can play a helpful role when performing survival analysis on high-dimensional data. In addition to the uncontested role in decreasing computational costs, the survival prediction shows improved concordance index. RSF though takes advantage of the information contained in a moderate additional number of features. Overall, the best survival analysis performance was obtained by combining RSF for feature selection and Deepsurv for survival prediction. [1] Amalia Annest. Iterative Bayesian Model Averaging: A Method for the Application of Survival Analysis to HighDimensional Microarray Data. University of Washington,

2008 . [2]

Jonathan

Bates Alexander Cloninger Tingting Jiang Jared

Central , 2016 . [3] Adrian

Raftery Ka Yee Yeung , Roger E. Bumgarner.

croarray data . Oxford Academic, 2005 . [4]

Theano

Development Team. Theano : A Python frame-

arXiv e-prints, abs/1605 .02688, May 2016 . [5]

Balaji

Krishnapuram Vikas C. Raykar ,

Harald

Steck . "on

index" . [6] Tai-Hsien Ou Yang Wei-Yi Cheng and Dimitris Anastas-

metagenes. PLoS Comput Biol , 2013 . [7]

Jiawen

Yao Xinliang Zhu and

Junzhou

Huang . Deep

Whole

Slide Pathological Images. BioMed Central , 2016 .