Feature Selection for Survival Analysis in Bioinformatics

Charles Englebert 1,2, Thomas Quinn 3, Isabelle Bichindaritz 1
1 State University of New York, Oswego, NY, USA
2 Supelec, Metz, France
3 Deakin University, Geelong, Australia

Abstract: The development of microarray technology has made it possible to assemble biomedical datasets that measure the expression profiles of thousands of genes simultaneously. However, such high-dimensional datasets make computation costly and can complicate the interpretation of a predictive model. To address this, feature selection methods are used to extract biological information from a large amount of data in order to filter the expression dataset down to the smallest possible subset of accurate predictor genes. Feature selection has three main advantages: it decreases computational costs, mitigates the risk of overfitting due to high inter-variable correlations, and allows for an easier clinical interpretation of the model. In this paper we compare three methods of feature selection: iterative Bayesian Model Averaging (BMA), Random Survival Forest (RSF), and Cox Proportional Hazards (CPH), and four methods of survival analysis: Random Survival Forest (RSF), Cox Proportional Hazards (CPH), the Aalen Additive Fitter (AAF), and Deepsurv (a neural network). Feature selection methods automatically extract the features most relevant for survival analysis from the dataset; we compare them with a hand-selected set of features. Some survival analysis methods, such as RSF and CPH, can themselves perform feature selection. All the data we used came from the METABRIC breast cancer dataset. For every feature selection method, we compare varied numbers of selected features. Our results indicate that feature selection improves the performance of survival analysis methods. Overall, the best survival analysis performance was obtained by combining RSF for feature selection and Deepsurv for survival prediction.

I. PRINCIPLES OF SURVIVAL ANALYSIS

Survival analysis is about predicting the time until an event occurs [7]. In our case, we are interested in using microarray data to estimate the longevity of a patient diagnosed with breast cancer. However, the approach generalizes to any high-dimensional data used for survival analysis tasks.

A. Modeling Survival Analysis

In survival analysis, we do not necessarily know the actual longevity of every individual, since the study may have ended within their lifetime. Individuals who have not experienced the death event during the study are labeled as right censored; this includes any patient who drops out of the study early. As such, for each individual we record either the survival time (if we have the date of death) or a censored time (if we do not have the date of death but instead the date of the last visit to the doctor). An instance in the survival data is usually represented as (x_i, t_i, δ_i), where x_i is the feature vector, t_i is the observed time, and δ_i is the event indicator. An indicator of 1 denotes an uncensored instance (i.e., the death of a patient), while 0 denotes a censored instance. The survival function S(t|x) = P(O > t | x) represents the probability of being alive after time t, where O denotes the total survival time and x the data from which the prediction is inferred.

B. Concordance Index

We cannot evaluate the predictive ability of a survival model using classical loss functions such as the L2 loss, because we do not possess the true curve representing the probability of dying for our data [5]. Instead, we only know the time at which a patient actually died. One widely used measure is the concordance index, or c-index. This measure evaluates the ordering of predicted survival times: 0.5 is the expected result of random predictions, 1.0 is perfect concordance, and 0.0 is perfect anti-concordance (i.e., one must multiply the predictions by −1 to obtain a c-index of 1.0). Throughout this paper, we use the c-index to evaluate the accuracy of our predictions.
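As an illustration (not part of the original study), the c-index can be computed with the concordance_index utility from the Python lifelines package; the toy numbers below are invented purely for demonstration:

```python
from lifelines.utils import concordance_index

# Observed survival times, model scores (higher = predicted to survive longer),
# and event indicators (1 = death observed, 0 = right censored).
event_times = [5, 10, 14, 30, 42]
predicted_scores = [4, 12, 13, 25, 50]
event_observed = [1, 1, 0, 1, 1]

# Fraction of comparable pairs whose predicted ordering matches the observed one:
# 1.0 = perfect concordance, 0.5 = random, 0.0 = perfect anti-concordance.
print(concordance_index(event_times, predicted_scores, event_observed))
```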
II. METHODS OF SURVIVAL ANALYSIS

A. Cox Model

Cox hazards regression is a standard method for survival analysis. It consists in modeling the hazard function, defined by

  λ(t|x) = lim_{Δt→0} Pr(t ≤ O < t + Δt | O ≥ t, x) / Δt,   (1)

which represents how the risk of an event per unit time changes over time, as

  λ(t|x) = λ_0(t) exp(βᵀx),   (2)

where β = (β_1, ..., β_p)ᵀ is a vector of parameters, λ_0(t) is a baseline hazard function, O denotes the total survival time, and x the data from which the prediction is inferred. We also define h(x) = βᵀx as the risk function: it determines the multiplicative effect of an individual's covariates on the baseline hazard.

B. Aalen Additive Fitter

The Aalen Additive Fitter (AAF) works in the same spirit as Cox Proportional Hazards, but instead uses the additive regression function

  λ(t|x) = β_0(t) + Σ_{i=1}^{p} β_i(t) x_i.   (3)

C. Deepsurv

Deepsurv is a deep learning method for survival analysis based on the Faraggi and Simon network [2]. The implementation of the model is based on Theano [4]. The likelihood of the model is defined as

  L(θ) = Π_{k=1}^{K} exp(h_θ(x_k)) / Σ_{m=1}^{M} exp(h_θ(x_m)),   (4)

where θ represents the parameterized weights of the network on which the learning is performed, h_θ is the risk function of the Cox model as approximated by the network, M is the number of patients still at risk at time t_k, and K is the total number of patients in the dataset. The hazard function is then λ(t|x) = λ_0(t) exp(h(x)).

Since many numbers in the open interval (0, 1) are multiplied together, it is numerically preferable to sum the logarithms of those numbers instead; this is what Deepsurv does. The loss function (l) chosen for this model is therefore the negative log of this likelihood, which yields Equation 7 below. With this network and 16 hand-selected features, the team that developed Deepsurv achieved a concordance index of 0.69 [2].
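To make the loss concrete, here is a minimal NumPy sketch of the negative log partial likelihood in the common form where the outer sum runs over patients with an observed death; the function name is ours, and this is not the Theano implementation used by Deepsurv:

```python
import numpy as np

def neg_log_partial_likelihood(risk, time, event):
    """Negative log Cox partial likelihood (sketch).

    risk  : array of model risk scores h(x), one per patient
    time  : array of observed times t_i
    event : array of indicators (1 = death observed, 0 = censored)
    """
    loss = 0.0
    for k in range(len(risk)):
        if event[k] == 1:
            # Risk set: patients whose observed time is >= t_k.
            at_risk = risk[time >= time[k]]
            loss -= risk[k] - np.log(np.sum(np.exp(at_risk)))
    return loss

# Toy example (invented data): three patients, two observed deaths.
risk = np.array([0.3, 1.2, -0.5])
time = np.array([2.0, 5.0, 7.0])
event = np.array([1, 1, 0])
print(neg_log_partial_likelihood(risk, time, event))
```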
D. Random Survival Forest (RSF)

Random Forest is a method that operates by constructing a multitude of decision trees at training time and outputting the majority class (for a classification task) or the mean prediction (for a regression task). This popular machine learning method has been adapted to survival analysis. We used the R package randomForest. Random forests can also rank features, and so they provide another feature selection method.

III. FEATURE SELECTION

The aim of this paper is to demonstrate the usefulness of feature selection before survival analysis. High-dimensional data pose a great challenge for computational techniques. We aim to extract biological information from this large amount of data to filter the expression dataset down to the smallest possible subset of accurate predictor genes. Reducing the number of predictor genes has three main advantages: it decreases computational costs, mitigates the risk of overfitting due to high inter-variable correlations, and allows for an easier clinical interpretation of the model.

A. Cox Proportional Hazards for Feature Selection

Cox Proportional Hazards can also be used to perform feature selection. The method consists in ranking the features in descending order of their log likelihood.

B. Bayesian Model Averaging

For high-dimensional data, many features can represent the data equally well. Here, we use the term model to refer to any such subset of selected features. We focus our attention on one feature selection method in particular, Bayesian Model Averaging (BMA), used here to select subsets of genes from microarray data [1]. Instead of choosing a single model and proceeding as if the data were actually generated from it, BMA combines the effectiveness of multiple models by taking the weighted average of their posterior distributions [1]. The core idea behind Bayesian Model Averaging is expressed in the following equation:

  P(Δ | D) = Σ_{k=1}^{K} P(Δ | M_k, D) P(M_k | D),   (5)

where Δ is the quantity of interest we want to compute (it could be, for example, a blood pressure or a temperature), D is the data, and M_k is the k-th model. The probability of the quantity taking a certain value is thus the mean of the values given by all models, each weighted by its own probability. There are three apparent issues at this point: obtaining the subset of relevant models {M_k}, determining P(Δ | M_k, D), and determining P(M_k | D). The BMA algorithm addresses these: once we have P(M_k | D), we can deduce the predictive utility of gene x_i by

  P(x_i | D) = Σ_{M_k : x_i ∈ M_k} P(M_k | D).   (6)

Note that the number of models that exist for microarray data is usually very large: given G candidate explanatory genes in an expression set, there are 2^G possible models to consider. Meanwhile, the number of genes in microarray datasets typically ranges from hundreds to tens of thousands.

C. Bayesian Model Averaging for High-Dimensional Data

The BMA algorithm described above can only deal with data that have at most 30 features. However, the usual practice of employing stepwise backward elimination to reduce the number of genes down to 30 is not applicable in a situation where the number of predictive variables is greater than the number of samples. To address this issue, Yeung et al. developed an iterative BMA algorithm that takes a rank-ordered list of genes and successively applies the traditional BMA algorithm until all genes have been processed [3]. We apply iterative BMA to our data, using a Cox proportional hazards regression to create the rank-ordered list of genes. Genes are ranked in descending order of their log likelihood (l) based on Equation 2, defined as

  l(θ) = Σ_{k=1}^{K} [ h_θ(x_k) − log Σ_{m=1}^{M} exp(h_θ(x_m)) ],   (7)

where θ represents the parameterized weights of the network on which the learning is performed, h_θ is the risk function of the Cox model, M is the number of patients still at risk at time t_k, and K is the total number of patients in the dataset. The baseline hazard function λ_0(t) does not appear in the equation, as it is assumed to be the same for all patients.

D. Summary of the BMA Algorithm

The iterative BMA algorithm iterates through the user-specified list of p top-ranked genes, applying the traditional BMA algorithm for survival analysis to each group of variables in a BMA window of size W = 25. Genes with high posterior probabilities are retained, while genes with low posterior probabilities are eliminated. This entails (see the sketch after this list):

1. Define parameters:
   (a) D, the training set
   (b) G, the total number of genes in the dataset
   (c) n, the number of samples in the training set
   (d) p, the number of top-ranked genes to process
   (e) W, the size of the BMA window (i.e., the number of genes processed using traditional BMA)
2. Import the training set D.
3. Rank-order all G genes by applying Cox proportional hazards regression to each individual gene.
4. Apply BMA to X = (x_1, x_2, ..., x_W).
5. toBeProcessed ← (x_{W+1}, ..., x_p).
6. Repeat until all p genes are processed:
   (a) Remove from X any gene i with Pr(x_i | D) < 1%.
   (b) If all genes have Pr(x_i | D) < 1%, determine the minimum Pr(x_i | D), minProb_ne0, in the current BMA window. Next, remove from X any gene i with Pr(x_i | D) < minProb_ne0 + 1%.
   (c) Replace the genes removed from X with the top-ranked genes from toBeProcessed, removing them from the toBeProcessed set.
   (d) Apply BMA to X.
7. Output the selected genes in X and their corresponding posterior probabilities.
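A minimal Python sketch of this windowing loop follows. The callable run_bma, standing in for one application of traditional BMA (returning a posterior inclusion probability per gene), is hypothetical; the published implementation is the iterative BMA algorithm of Yeung et al. [3], not this code:

```python
def iterative_bma(genes, run_bma, W=25, p=1000):
    """Iterative BMA window loop (sketch).

    genes   : list of gene names, already rank-ordered (step 3)
    run_bma : hypothetical callable mapping a list of <= W genes to a
              dict {gene: posterior inclusion probability}  (steps 4, 6d)
    """
    window = genes[:W]                  # step 4
    to_be_processed = genes[W:p]        # step 5
    posteriors = run_bma(window)
    while to_be_processed:              # step 6
        keep = [g for g in window if posteriors[g] >= 0.01]       # step 6a
        if not keep:                    # step 6b: all genes below 1%
            min_prob = min(posteriors[g] for g in window)
            keep = [g for g in window if posteriors[g] >= min_prob + 0.01]
        n_new = min(W - len(keep), len(to_be_processed))
        if n_new == 0:
            break  # sketch-only guard: no gene was eliminated, stop early
        window = keep + to_be_processed[:n_new]                   # step 6c
        to_be_processed = to_be_processed[n_new:]
        posteriors = run_bma(window)                              # step 6d
    return {g: posteriors[g] for g in window}                     # step 7
```

In practice, step 3 corresponds to fitting one univariate Cox model per gene, and run_bma would wrap a traditional BMA routine for survival data.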
IV. RESULTS

A. Dataset

We applied our methods to the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) dataset, which originally used expression profiles to identify new breast cancer subgroups in an effort to help physicians provide better treatment recommendations. This dataset consists of gene expression data and clinical features for 1,981 patients, 43.85% of whom have had an observed death due to breast cancer (with a median survival time of 1,907 days).

We applied our three feature selection algorithms to this dataset. We also reproduced the hand-selected feature set presented by the Deepsurv team, which contains: four prognostic meta-genes (CIN, MES, LYM, and FGD3-SUSD3), the age at diagnosis, the number of positive lymph nodes, the tumor size, the ER status, the HER2 status, four indicators known to be predictive of breast cancer (ERBB2, MKI67, PGR, and ESR1), and the prescribed treatment (i.e., chemotherapy, radiotherapy, or hormone therapy) [2]. Meta-genes are sets of genes coexpressed across multiple cancer types; these features are calculated from gene expression. The four prognostic meta-genes were previously found to accurately predict the survival time of patients by the winners of the Sage Bionetworks-DREAM Breast Cancer Prognosis Challenge [6].

B. Selected Features

We were interested in knowing which automatically selected features the methods have in common with the hand-selected features. The table below summarizes the common features:

  BMA               RSF            COX
  ---------------   ------------   ------------------
  CHEMOTHERAPY      ER STATUS      3 CIN genes
  RADIOTHERAPY      HER2 STATUS    1 FGD3-SUSD3 gene
  HORMONE THERAPY

It is interesting to note that each method seems specialized in a particular type of feature. BMA selected only clinical features in common with the hand-selected features; more precisely, it selected only clinical features related to treatment. RSF includes the two selected hormone-related features, ER (estrogen receptor) and HER2 (human epidermal growth factor receptor 2), which are known to be related to breast cancer. COX selected four genes related to the meta-genes CIN and FGD3-SUSD3 [6].

C. Influence of the Number of Features

We aim to identify the influence of the number of selected features. To do so, we selected varying numbers of features with each feature selection method; the subset sizes used are 12, 14, 16, 18, 20, 22, 32, 64, and 128. We then tested each survival method on every subset generated (a sketch of this protocol follows).
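The sketch below ranks features with a random survival forest and scores a model retrained on each top-k subset. It uses the Python scikit-survival package and permutation importance rather than the R randomForest package used in this paper, so it approximates the procedure rather than reproducing the authors' code; X is assumed to be a numeric covariate DataFrame and y a structured survival array:

```python
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def top_k_subsets(X, y, sizes=(12, 14, 16, 18, 20, 22, 32, 64, 128)):
    """Rank features with an RSF, then score an RSF retrained on each top-k subset."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    rsf = RandomSurvivalForest(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    # Permutation importance: drop in c-index when each feature is shuffled.
    imp = permutation_importance(rsf, X_te, y_te, n_repeats=5, random_state=0)
    ranked = X.columns[np.argsort(imp.importances_mean)[::-1]]
    scores = {}
    for k in sizes:
        cols = list(ranked[:k])
        model = RandomSurvivalForest(n_estimators=200, random_state=0)
        model.fit(X_tr[cols], y_tr)
        scores[k] = model.score(X_te[cols], y_te)  # Harrell's c-index
    return scores
```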
The Deepsurv method requires a lot of fine-tuning and takes very long to train. For this reason, we do not have as many results with Deepsurv as with the three other methods. Deepsurv is also extremely sensitive to the dataset used: it needs to be fine-tuned differently whenever the feature selection method changes, which makes it inappropriate for comparing feature selection methods. We therefore present the results obtained with RSF, COX, and AAF here, and discuss Deepsurv in a later section.

FIG. 1. Aalen Additive Fitter, Random Survival Forest, and Cox Proportional Hazards on subsets of features of various sizes generated by Bayesian Model Averaging.

FIG. 2. Aalen Additive Fitter, Random Survival Forest, and Cox Proportional Hazards on subsets of features of various sizes generated by Cox Proportional Hazards.

FIG. 3. Aalen Additive Fitter, Random Survival Forest, and Cox Proportional Hazards on subsets of features of various sizes generated by Random Survival Forest.

It is interesting to notice in Fig. 1, Fig. 2, and Fig. 3 that the number of features seems to have a negative impact on the predictions of CPH and AAF. This is counter-intuitive, as a dataset with more features carries more information and would therefore be expected to yield more accurate predictions. We also note that RSF clearly outperforms the two other methods presented here on every dataset, and that it becomes more accurate as the number of features increases.

D. Influence of Prevalent Features

The hand-selected feature set of 16 features was used by the Deepsurv team to reach its best concordance index of 0.695. This feature set yields a concordance index of 0.7024 with RSF, 0.6635 with CPH, and 0.6630 with AAF. None of our automatic feature selection methods could match these results with the same number of features. We therefore tried to evaluate the influence of particular features present in the hand-selected set, using two methods.

First, we replaced subsets of features in the hand-selected set with features chosen by the automatic feature selection methods. For this first method, we expect a negative effect on the concordance index compared to the results given by the full hand-selected set. We reproduced the hand-selected feature set without the indicators (ERBB2, MKI67, PGR, and ESR1); without the treatment and hormonal information (chemotherapy, radiotherapy, hormone therapy, ER, and HER2); and without the meta-genes (CIN, MES, LYM, and FGD3-SUSD3).

FIG. 4. Aalen Additive Fitter, Random Survival Forest, and Cox Proportional Hazards on subsets of features generated from the hand-selected feature set, from which certain subsets of features were removed and replaced by others obtained by automatic feature selection methods.

One can see in Fig. 4 that the effect of removing features is very slight, but a negative effect is still noticeable. These experiments do not suffice to explain the difference in accuracy with respect to the automatic feature selection methods.

The second method consisted in adding those same features to the automatically selected feature sets. We chose the RSF method, as it provided the best results. In this case we expect a positive effect on the concordance index, as can be verified in Fig. 5, although the impact is slight.

FIG. 5. Aalen Additive Fitter, Random Survival Forest, and Cox Proportional Hazards on subsets of features generated from the RSF feature selection set, from which certain subsets of features were removed and replaced by others taken from the hand-selected features.
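For reference, c-index comparisons like those in this section can be reproduced in outline with the Python lifelines package. The sketch below is our illustration, not the authors' pipeline; the bundled Rossi dataset merely stands in for a DataFrame of selected METABRIC features with duration and event columns:

```python
from lifelines import CoxPHFitter, AalenAdditiveFitter
from lifelines.utils import concordance_index
from lifelines.datasets import load_rossi

# Stand-in dataset; replace with the selected METABRIC features.
df = load_rossi()  # duration column 'week', event column 'arrest'

cph = CoxPHFitter().fit(df, duration_col="week", event_col="arrest")
aaf = AalenAdditiveFitter(coef_penalizer=1.0).fit(df, duration_col="week", event_col="arrest")

# Higher predicted expected survival should match longer observed survival.
for name, model in [("CPH", cph), ("AAF", aaf)]:
    pred = model.predict_expectation(df).squeeze()
    print(name, concordance_index(df["week"], pred, df["arrest"]))
```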
E. Prediction with Deepsurv

A deep neural network needs fine-tuning to find the appropriate parameters that provide the best predictions. If the size of the input changes, the architecture of the network changes, and the fine-tuning must change as well. The analysis proposed here therefore focused on four subsets of the METABRIC dataset: one with the 12 most likely features, one with the 14 most likely features, one with the 16 most likely features, and one with the features selected by hand by the Deepsurv team. The concordance indices obtained with Deepsurv are:

  Feature set   12 BMA   14 hand-selected   14 RSF   64 RSF
  c-index       0.5388   0.6676             0.6720   0.7173

These results show that the 14 features automatically selected by RSF perform at least as well as the hand-selected features with Deepsurv, which is an important result. Another important result is that a moderate number of additional features improves Deepsurv's performance even further, although only with the RSF feature selection method.

V. CONCLUSION

Although more work is needed to reach a definitive conclusion, our results indicate that feature selection can play a helpful role when performing survival analysis on high-dimensional data. In addition to its uncontested role in decreasing computational costs, it improves the concordance index of the survival predictions. RSF, moreover, takes advantage of the information contained in a moderate number of additional features. Overall, the best survival analysis performance was obtained by combining RSF for feature selection and Deepsurv for survival prediction.

[1] Amalia Annest. Iterative Bayesian Model Averaging: A Method for the Application of Survival Analysis to High-Dimensional Microarray Data. University of Washington, 2008.
[2] Jared L. Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deep Survival: A Deep Cox Proportional Hazards Network. BioMed Central, 2016.
[3] Ka Yee Yeung, Roger E. Bumgarner, and Adrian E. Raftery. Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Oxford Academic, 2005.
[4] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
[5] Vikas C. Raykar, Harald Steck, and Balaji Krishnapuram. On ranking in survival analysis: Bounds on the concordance index.
[6] Wei-Yi Cheng, Tai-Hsien Ou Yang, and Dimitris Anastassiou. Biomolecular events in cancer revealed by attractor metagenes. PLoS Comput Biol, 2013.
[7] Xinliang Zhu, Jiawen Yao, and Junzhou Huang. Deep Multi-Instance Learning for Survival Prediction from Whole Slide Pathological Images. BioMed Central, 2016.