<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Charles Englebert</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Quinn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabelle Bichindaritz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Deakin University</institution>
          ,
          <addr-line>Geelong</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>State University of New York</institution>
          ,
          <addr-line>Oswego, NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Supelec</institution>
          ,
          <addr-line>Metz</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The development of microarray technology has made it possible to assemble biomedical datasets that measure the expression pro le of thousands of genes simultaneously. However, such high-dimensional datasets make computation costly and can complicate the interpretation of a predictive model. To address this, feature selection methods are used to extract biological information from a large amount of data in order to lter the expression dataset down to the smallest possible subset of accurate predictor genes. Feature selection has three main advantages: it decreases computational costs, mitigates the possibility of over tting due to high intervariable correlations, and allows for an easier clinical interpretation of the model. In this paper we compare three methods of feature selection : iterative Bayesian Model Averaging (BMA), Random Survival Forest (RSF) and Cox Proportional Hazard (CPH) and four methods of survival analysis: Analysis Random Survival Forest (RSF), Cox Proportional Hazard (CPH), Alan Additive Filter (AAF) and Deepsurv (neural network). Feature selection methods permit to extract the most relevant features for survival analysis from the dataset automatically and are compared with a hand selected set of features. Certain survival analysis methods also permit to perform feature selection such as RSF and CPH. All the data we used came from the Metabric breast cancer dataset. For every feature selection method, we compare varied numbers of selected features. Our results indicate that feature selection improves the performance of survival analysis methods. Overall, the best survival analysis performance was obtained by combining RSF for feature selection and Deepsurv for survival prediction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>PRINCIPLES OF SURVIVAL ANALYSIS</p>
      <p>In survival analysis, we do not necessarily know the
actual longevity of every individual since the experiment
may have stopped within their lifetime. Those
individuals who have not been subject to the death event during
the study are labeled as right censored. This includes
any patient who drops out of the study early. As such,
for each individual, we consider either the survival time
(if we have the death date) or a censored time (if we
do not have the date of the death but instead the date
of the last visit to the doctor). An instance in the
survival data is usually represented as (xi; ti; i) where xi
is the feature vector, ti is the observed time, and i is
the indicator. An indicator of 1 is used to indicate an
uncensored instance (i.e., the death of a patient) while
0 is used to indicate a censored instance. The survival
function S(tjx) = P (O &gt; tjx) represents the probability
of being alive after time t, where O represent the total
survival time and x the data from which is inferred the
prediction.</p>
      <p>We cannot evaluate the predictive ability of a survival
model using classical loss functions such as L2 because
we do not possess the curve which represents the
probability of dying for our data [5]. Instead, we only know
the time at which a patient actually died. One measure
used is the concordance index, or c-index. This
measure evaluates the ordering of predicted survival times:
0:5 is the expected result from random predictions, 1:0 is
perfect concordance, and 0:0 is perfect anti-concordance
(i.e., one must multiply predictions by 1 to get a
cindex of 1:0). Throughout this paper, we use the c-index
to evaluate the accuracy of our predictions.</p>
      <p>II.</p>
      <p>METHODS OF SURVIVAL ANALYSIS</p>
      <p>Cox hazard regression is a standard method for
survival analysis. It consists in modeling the hazard
function de ned by:
(tjx) = lim
t!0</p>
      <p>P r(t &lt;= O &lt;= t +
t
tjO &gt;= tjx)
which represents how the risk of an event per unit time
changes over time, as:
(tjx) =</p>
      <p>0(t)exp( T x)
where = ( 1; ::; 2)T is a vector of parameters, 0(t) is
a baseline hazard function, O represent the total survival
time, and x the data from which is inferred the
prediction. We also de ne h(x) = T x as the risk function: it
(1)
(2)
represents how the risk of an event per time unit changes
over time.</p>
      <p>Alan Additive Filter</p>
      <p>Alan Additive lter works the same way as Cox
Proportional Hazard but instead it uses :
as a regression function.</p>
      <p>(t) =</p>
      <p>T
0(t) + X i(t):xi</p>
      <p>i=1</p>
      <p>Deepsurv is a deep learning method for survival
analysis based on the Faraggi and Simon network [2]. The
implementation of the model is based on Theano [4], which
de nes the likelihood of the model as:</p>
      <p>K
L( ) = Y
k=1</p>
      <p>exp(h (xk))
PM</p>
      <p>m=1 exp(h (xm)
where represents the parameterized weights of the
network on which the learning is made, h is the risk
function of the Cox model, m is the number of patients
still at risk at time ti, and K is the total number of
patients in the dataset. The hazard function is then
(tjx) = 0(t)exp(h(x)).</p>
      <p>As many numbers included in ]0; 1[ are multiplied it is
relevant to sum the logarithm of those numbers instead.
This is what is done in Deepsurv. The loss function (l)
that was chosen for this model is then de ned as the
negative log of this likelihood. This method yields equation
7.</p>
      <p>With this network and 16 hand-selected features, the
team that developed Deepsurv achieved a concordance
index of 0.69 [2].</p>
      <p>Random Forest is a method that operates by
constructing a multitude of decision trees at training time and
outputting the class (in case of a classi cation analysis)
or the mean regression (in case of a regression analysis).
This popular machine learning method was adapted to
survival analysis. We used the R package randomF orest.
Random forest also permits to rank features and so it
provides another features selection method.</p>
      <p>The aim of this paper is to prove the usefulness of
feature selection before survival analysis. High-dimensional
(3)
(4)
data pose a great challenge for computational techniques.
We aim to extract biological information from this large
amount of data to lter the expression dataset down to
the smallest possible subset of accurate predictor genes.
Reducing the number of predictor genes has three main
advantages: it decreases computational costs, mitigates
the possibility of over tting due to high inter-variable
correlations, and allows for an easier clinical
interpretation of the model.</p>
      <p>Cox Proportional Hazard for feature selection
COX Proportional Hazard permits to perform feature
selection. The method consists in ranking the features in
descending order of their log likelihood.</p>
      <p>Bayesian Model Averaging</p>
      <p>For high-dimensional data, many features can
represent the data equally well. Here, we use the term model
to refer to any such subset of selected features. We focus
our attention on one feature selection method in
particular, Bayesian Model Averaging (BMA), used here to
select subsets of genes from microarray data [1]. Instead of
choosing a single model and proceeding as if the data was
actually generated from it, BMA combines the e
ectiveness of multiple models by taking the weighted average
of their posterior distributions [1]. The core idea behind
Bayesian Model Averaging is expressed in the following
equation:</p>
      <p>P ( jMk; D) =</p>
      <p>K
X P ( jMk; D)P (MkjD)
k=1
(5)
where is the quantity of interest we want to compute,
it can be be a blood pressure or a temperature, D is the
data, and Mk is the kth model. The probability of the
quantities to take a certain value is assumed to be the
mean given by all models weighted by their own
probability. There are three apparent issues at this point:
obtaining the subsets of relevant models fMkg, determining
P ( jMk; D), and determining P (MkjD). The BMA
algorithm addresses these: once we have P (MkjD), we can
deduce the predictive utility of gene xi by:</p>
      <p>X</p>
      <p>Mk=xi2Mk
P (xijD) =</p>
      <p>P (MkjD)
(6)</p>
      <p>Note that the number of models that exist for
microarray data is usually very large. Given G candidate
explanatory genes in an expression set, then there are 2G
possible models to consider. Meanwhile, the number of
genes in microarray datasets typically varies from 102s to
104s.</p>
      <p>C. Bayesian Model Averaging for</p>
      <p>High-Dimensional Data</p>
      <p>The BMA algorithm described above can only deal
with data that have at most 30 features. However, the
usual practice of employing stepwise backward
elimination to reduce the number of genes down to 30 is not
applicable in a situation where the number of predictive
variables is greater than the number of samples. To
address this issue, Yeung et al. developed an iterative BMA
algorithm that takes a rank-ordered list of genes and
successively applies the traditional BMA algorithm until all
genes have been processed [3]. We apply iterative BMA
to our data, using a Cox proportional hazards regression
to create the rank-ordered list of genes. Genes are ranked
in descending order of their log likelihood (l) based on 2,
de ned as:</p>
      <p>K M
l( ) = X h (xk) log X exp(h (xm))
k=1 m=1
(7)
where represents the parameterized weights of the
network on which the learning is made, h is the risk
function of the Cox model, M is the number of patients still
at risk at time ti, and K is the total number of patients
in the dataset. The baseline hazard function, 0(t), does
not appear in the equation as it is assumed to be the
same for all patients.</p>
      <p>D. Summary of BMA Algorithm</p>
      <p>The iterative BMA algorithm iterates through the
user-speci ed list of p top-ranked genes, applying the
traditional BMA algorithm for survival analysis to each
group of variables in a given BMA window of size W =
25. Then, genes with high posterior probabilities are
retained while genes with low posterior probabilities are
eliminated. This entails:
1. Define parameters:
(a) D training set
(b) G total number of genes in the
dataset
(c) n number of samples in the</p>
      <p>training set
(d) p number of top-ranked genes to
process
(e) W the size of the BMA window
(i.e., the number of genes
processed using traditional BMA)</p>
    </sec>
    <sec id="sec-2">
      <title>2. Import training set D</title>
      <p>3. Rank-order all G genes by applying
Cox proportional hazard regression
to each individual gene
4. Apply BMA to X = (x1; x2; :::; xW )
5. toBeP rocessed</p>
      <p>(xW +1; :::; xp)</p>
    </sec>
    <sec id="sec-3">
      <title>6. Repeat until all p genes are</title>
      <p>processed:
(a) Remove from X any gene i with</p>
      <p>P r(xijD) &lt; 1%
(b) If all genes have P r(xijD) &lt;
then determine the minimum
P r(xijD), minP robne0, in the
current BMA window. Next,
remove from X any gene i with</p>
      <p>P r(xijD) &lt; (minP robne0 + 1%)
(c) Replace genes removed from</p>
      <p>X with top-ranked genes from
toBeP rocessed, removing them from
the toBeP rocessed set
(d) Apply BMA to X
7. Output selected genes in X and their
corresponding posterior probability.
1%,
IV. RESULTS</p>
      <p>A. Dataset</p>
      <p>We applied our methods to the Molecular Taxonomy of
Breast Cancer International Consortium (METABRIC)
dataset which originally used expression pro les to
identify new breast cancer subgroups in an e ort to help
physicians provide better treatment recommendations.
This dataset consists of gene expression data and
clinical features for 1,981 patients, 43.85% of which have had
an observed death due to breast cancer (with a median
survival time of 1,907 days).</p>
      <p>
        We applied our three algorithms of feature selection
on this dataset. We also reproduced the feature set
of size 14 presented by the Deepsurv team which
contains : four prognostic meta-genes (CIN, MES, LYM,
and FGD3-SUSD3), the age at diagnosis, the number
of positive lymph nodes, the tumor size, the ER status,
the HER2 status, four indicators known to be
predictive of breast cancer: ERBB2, MKI67, PGR and ESR1,
and the prescribed treatment (i.e., chemotherapy,
radiotherapy, or hormonotherapy) [2]. Metagenes are sets of
genes coexpressed in multiple cancer types. Those
features are calculated from genes expression. The four
prognostic meta-genes were previously found to predict
accurately the survival time of patients by the winners of
the Sage Bionetworks-DREAM Breast Cancer Prognosis
Challenge [
        <xref ref-type="bibr" rid="ref5">6</xref>
        ].
      </p>
      <p>We were interested in knowing which selected features
are in common with the hand selected features. Here is
a table which summarizes the common features:</p>
      <p>BMA RSF COX
CHEMOTHERAPY ER STATUS 3 CIN
RADIO THERAPY HER2 STATUS 1 FGD3-SUSD3
HORM THERAPY</p>
      <p>
        It is interesting to note that each method seems
specialized in a particular type of feature. BMA has selected
only clinical features in common with the hand selected
features. More exactly it has selected only clinical
features related to treatment. RSF includes the two selected
hormone-related features ER (estrogen) and HER2
(human epidermal growth factor receptor), which are known
to be related to breast cancer. COX selected four genes
related to the metagenes CIN and FGD3-SUSD3 [
        <xref ref-type="bibr" rid="ref5">6</xref>
        ].
      </p>
      <p>C. In uence of the Number of Features</p>
      <p>We aim to identify the in uence of the number of
selected features. To do so we selected variable numbers of
features for each feature section method. The sizes used
are : 12, 14, 16, 18, 20, 22, 32, 64 and 128. We then
tested each survival method on every subset generated.
The Deepsurv method needs a lot of ne tuning and is
very long to train. For this reason, we don't have as many
results with Deepsurv as with the three other methods.
Also, Deepsurv is extremely sensitive to the dataset used.
It needs to be ne tuned di erently when we change
feature selection method. This makes the method not
appropriate for comparing feature selection methods. We
will present the results we obtained with RSF, COX and
AAF here and we will discuss Deepsurv in a later section.</p>
      <p>It is interesting to notice in Fig. 1, Fig. 2, and Fig. 3
that the number of features seems to have a negative
impact on the predictions of BMA and AAF. This is
counter-intuitive as a dataset with many more features
carries more information and then is expected to provide
more accurate prediction. We also note that RSF clearly
outperforms the two other methods presented here on
every dataset and gets more accurate when the number of
features increases.</p>
      <p>D. In uence of Prevalent Features</p>
      <p>The hand selected feature set composed of 16 features
is used by the Deepsurv team to reach its best
concordance index of 0.695. This feature set provides a
concordance index of 0.7024 with RSF, 0.6635 with CPH and
0.6630 with AAF. None of our automatic feature
selection methods could provide results as high as these on
this same number of features. We have tried to evaluate
the in uence of particular features present in the hand
selected set. To do this, we elaborated two methods.
First we replaced subsets of features in the hand selected
set of features by features selected by automatic feature
selection methods. For this rst method we expect to see
a negative e ect on the concordance index compared to
the results given by the hand selected features. We
reproduced the hand selected feature set without the
indicators ERBB2, MKI67, PGR and ESR1, one without the
treatments, hormonal information, chemotherapy,
radiotherapy, hormonotherapy, ER, and HER2 and one
without the metagenes, CIN, MES, LYM, and FGD3-SUSD3.</p>
      <p>One can see in Fig. 4 that the e ect of removing
features is very slight but we can still notice a negative
e ect. Those experiments do not permit to understand
the di erence of accuracy with automatic feature
selection methods.</p>
      <p>The second method consisted in adding those same
features to automatically selected feature sets. We chose
to use the RSF method as it provided the best results.
In this case we expect to see a positive e ect on the
concordance index as can be veri ed on Fig. 5, although
the impact is slight.</p>
      <p>E.</p>
      <p>Prediction with Deepsurv
to</p>
      <p>A deep neural network needs ne-tuning in order
nd the appropriate parameters which provide the
best predictions. If the size of the input changes, the
architecture of the network changes, and the ne-tuning
must also change. The proposed analysis focused here
on four subsets of the Metabric dataset: one with the
12 most likely features, one with the 14 most likely
features, one with the 16 most likely features and one
with the features selected by hand by the DeepSurv team.
12 BMA 14 hand-selected 14 RSF 64 RSF
0.5388 0.6676 0.6720 0.7173</p>
      <p>These results show that RSF automatically selected
14 features perform at least as well as the hand-selected
features with Deepsurv, which is an important result.
Another important result is that a moderate additional
number of features improves Deepsurv's performance
even more, however only with the RSF feature selection
method.</p>
      <p>CONCLUSION</p>
      <p>Although more work is needed to reach a de nitive
conclusion, our results indicate that feature selection can
play a helpful role when performing survival analysis on
high-dimensional data. In addition to the uncontested
role in decreasing computational costs, the survival
prediction shows improved concordance index. RSF though
takes advantage of the information contained in a
moderate additional number of features. Overall, the best
survival analysis performance was obtained by
combining RSF for feature selection and Deepsurv for survival
prediction.
[1] Amalia Annest. Iterative Bayesian Model Averaging: A
Method for the Application of Survival Analysis to
HighDimensional Microarray Data. University of Washington,</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <year>2008</year>
          . [2]
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Bates Alexander Cloninger Tingting</surname>
          </string-name>
          Jiang Jared
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Central</surname>
          </string-name>
          ,
          <year>2016</year>
          . [3]
          <string-name>
            <surname>Adrian</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Raftery Ka Yee Yeung</surname>
          </string-name>
          , Roger E. Bumgarner.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>croarray data</article-title>
          . Oxford Academic,
          <year>2005</year>
          . [4]
          <string-name>
            <given-names>Theano</given-names>
            <surname>Development Team. Theano</surname>
          </string-name>
          : A Python frame-
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>arXiv</surname>
          </string-name>
          e-prints,
          <source>abs/1605</source>
          .02688, May
          <year>2016</year>
          . [5]
          <string-name>
            <given-names>Balaji</given-names>
            <surname>Krishnapuram Vikas C. Raykar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Harald</given-names>
            <surname>Steck</surname>
          </string-name>
          .
          <article-title>"on</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>index"</article-title>
          . [6]
          <string-name>
            <surname>Tai-Hsien Ou Yang</surname>
          </string-name>
          Wei-Yi Cheng and Dimitris Anastas-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          metagenes.
          <source>PLoS Comput Biol</source>
          ,
          <year>2013</year>
          . [7]
          <string-name>
            <given-names>Jiawen</given-names>
            <surname>Yao Xinliang Zhu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Junzhou</given-names>
            <surname>Huang</surname>
          </string-name>
          . Deep
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Whole</given-names>
            <surname>Slide Pathological Images. BioMed Central</surname>
          </string-name>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>