<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The EFS-Server: A Web-Application for Feature Selection in Binary Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ursula Neumann</string-name>
          <email>u.neumann@wz-straubing.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dominik Heider</string-name>
          <email>d.heider@wz-straubing.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Straubing Center of Science</institution>
          ,
          <addr-line>Petersgasse 18, 94315 Straubing</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Feature selection methods are essential to identify a subset of features that improves the prediction performance of subsequent classification models and thereby also simplifies their interpretability. Preceding studies showed the defectiveness, in terms of specific biases, of single feature selection methods, whereas an ensemble of feature selection techniques has the advantage of alleviating and compensating for such biases. With the development of the ensemble feature selection (EFS) method we take advantage of the benefits of multiple feature selection methods and combine their normalized outputs into a quantitative ensemble importance. Eight different feature selection methods have been used for the EFS approach. We evaluated the EFS method on a test set, and the subset of features retrieved by the EFS method showed a significantly improved performance in a subsequent logistic regression (LR) model compared to a model using all available features. EFS can be downloaded as an R-package or used via the web server at http://EFS.heiderlab.de.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Machine learning models have been widely used for the classification of biomedical
problems, e.g., in drug resistance [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or prediction of the severity of diseases [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
However, in these areas one is frequently faced with high-dimensional data and
small-n-large-p problems; thus, the need to simplify datasets with many
parameters frequently arises.
      </p>
      <p>
        Therefore, a great variety of feature selection (FS) techniques already exists.
However, different feature selection methods provide different subsets of features.
There are several factors that can cause instability and unreliability of the feature
selection, e.g., the complexity of multiple relevant features, a
small-n-large-p problem, or an algorithm that simply ignores stability [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. To counteract
the instability, and thereby unreliability, of feature selection methods, we developed
an ensemble feature selection (EFS) method, which compensates for the biases of single
FS methods. The idea of ensemble methods is already widely used in learning algorithms
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. By using an ensemble of feature selection methods, a quantification of the
importance of features can be obtained and the method-specific biases can be
compensated.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>
        The EFS method provides eight different techniques for feature selection in
binary classification. Since random forests [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have been shown to give highly
accurate predictions on biological [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] and biomedical data [
        <xref ref-type="bibr" rid="ref1 ref9">1, 9</xref>
        ], four of the chosen
feature selection methods are embedded in a random forest algorithm. Further,
we considered the outcome of an LR (i.e., the coefficients) as another embedded
method, as well as the filter methods median, Pearson-, and Spearman-correlation
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The key features of our EFS method are:
1. The combination of widely known and extensively tested feature selection
methods.
2. The balancing of biases by using an ensemble.
3. The evaluation of EFS via LR.
      </p>
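      <p>As a minimal illustration of how such per-feature scores can be obtained with standard R tools (a sketch, not the implementation of the EFS package), assume a data frame d with a binary outcome column y; the use of a Mann-Whitney U test as the source of the p-value for the median-based filter is likewise an assumption made for this example:</p>
      <preformat>
# Illustrative sketch only; assumes a data.frame `d` with binary outcome `y`.
library(randomForest)

X = d[, setdiff(names(d), "y")]
y = d$y

# Filter scores: Pearson and Spearman correlation with the class label, and a
# per-feature p-value comparing the two classes (median-based filter, here
# approximated by a Mann-Whitney U test).
pearson  = sapply(X, function(x) abs(cor(x, y, method = "pearson")))
spearman = sapply(X, function(x) abs(cor(x, y, method = "spearman")))
p_median = sapply(X, function(x) wilcox.test(x[y == 0], x[y == 1])$p.value)

# Embedded scores: absolute LR coefficients and random forest importance.
lr      = glm(y ~ ., data = d, family = binomial)
beta_lr = abs(coef(lr)[-1])                     # drop the intercept
rf      = randomForest(X, as.factor(y), importance = TRUE)
imp_rf  = importance(rf, type = 1)              # mean decrease in accuracy
      </preformat>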
      <p>We normalized all individual outputs to a common scale, an interval from 0 to 1.
Thereby we ensure the comparability between different FS methods and conserve
the distances of importance between one feature and another. This normalization is
achieved in two different ways. For all feature selections, except for the median,
the absolute value of the FS method output illustrates the
increase of importance. Denoting this output for feature X_i by \beta_i, dividing by the maximum value yields values
between 0 and 1:
imp_{X_i} = \frac{\beta_i}{\max_{m \in M} \beta_m}.
In the case of the median FS we receive a p-value p_i for each feature X_i, which is
normalized as follows:
imp_{X_i} = 1 - p_i + \min(p_i).
By dividing the calculated importances by the number of selected methods
(1 to 8) and summing up all individual importances, we get an EFS importance
between 0 and 1. The EFS system selects those parameters that have a higher
importance than the mean importance:</p>
      <p>imp_{X_i} &gt; \overline{imp_X},
where \overline{imp_X} denotes the mean of all variable importances.</p>
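      <p>A compact sketch of this normalization and aggregation in R follows; the inputs scores (a named list of per-feature score vectors from the individual methods) and p_median (the p-values of the median-based filter) are hypothetical placeholders for this example:</p>
      <preformat>
# Sketch of the described normalization and aggregation (assumed inputs:
# `scores`, a list of per-feature score vectors; `p_median`, a p-value vector).
norm_score = function(s) abs(s) / max(abs(s))   # imp = beta_i / max(beta_m)
norm_pval  = function(p) 1 - p + min(p)         # imp = 1 - p_i + min(p_i)

imps = c(lapply(scores, norm_score), list(median = norm_pval(p_median)))

# Divide each contribution by the number of selected methods and sum up,
# giving an ensemble importance between 0 and 1; keep features whose
# importance exceeds the mean importance.
efs_imp  = Reduce(`+`, lapply(imps, function(v) v / length(imps)))
selected = names(efs_imp)[efs_imp > mean(efs_imp)]
      </preformat>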
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        In order to evaluate our EFS method, we used an LR model with
leave-one-out cross-validation (LOOCV). For comparison purposes, we also trained an LR
model without feature selection and examined both AUC values of the ROC
curves with ROCR [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The dataset SPECTF has been obtained from the UCI
Machine Learning Repository [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. It describes the diagnosis of cardiac Single
Photon Emission Computed Tomography (SPECT) images. The class variable
distinguishes between normal (= 0) and abnormal (= 1). In panel A) of
Figure 1 the resulting ROC curves are shown. Additionally, the p-value (p &lt; 0.001)
is located in the bottom right corner of the diagram. The p-value clearly shows
that there is a significant improvement in terms of AUC of the LR with features
selected by the EFS method compared to the LR model without feature selection.
The calculation of the p-value is based on the method of DeLong et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
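      <p>The evaluation described above can be sketched in R roughly as follows; the data frame d with outcome y and the selected feature set are assumptions carried over from the previous sketches, and pROC is used here as one readily available implementation of the DeLong test, which is not necessarily the tooling used for the original analysis:</p>
      <preformat>
# Sketch: LOOCV logistic regression with and without feature selection,
# AUC via ROCR and a DeLong p-value via pROC (assumed inputs: `d`, `selected`).
library(ROCR)
library(pROC)

loocv_lr = function(dat) {
  sapply(seq_len(nrow(dat)), function(i) {
    fit = glm(y ~ ., data = dat[-i, ], family = binomial)
    predict(fit, newdata = dat[i, , drop = FALSE], type = "response")
  })
}
p_efs = loocv_lr(d[, c(selected, "y")])
p_all = loocv_lr(d)

# AUC values of the two ROC curves.
auc_efs = performance(prediction(p_efs, d$y), "auc")@y.values[[1]]
auc_all = performance(prediction(p_all, d$y), "auc")@y.values[[1]]

# Paired DeLong test comparing the two AUCs.
roc.test(roc(d$y, p_efs), roc(d$y, p_all), method = "delong")
      </preformat>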
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>Besides the R-package EFS, a web application is provided at http://EFS.heiderlab.de for researchers who
are not familiar with the use of R. The
EFS-server provides a feature ranking by summing up the normalized importances of
all feature selection methods. Additionally, the EFS-server produces a barplot
of the importances if the number of features does not exceed 25. If a barplot for
more than 25 parameters is required, the barplot_fs function of the R-package EFS
can be used (cf. panel B in Figure 1). Moreover, the user can download all results
from the feature selection methods and the EFS method as a CSV file for further
analyses. Based on the results of our EFS method, a significant improvement
in prediction performance compared to using all features in an LR model could be
demonstrated. The EFS-server is a handy tool for inexperienced users
that provides feature selection in a simple and guided procedure. For
the experienced user, the corresponding R-package can be used, which also provides
deeper insights into the selection and evaluation.</p>
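      <p>A hypothetical usage sketch of the R-package workflow is given below; the call to the main EFS routine and the exact argument names are assumptions and should be checked against the package documentation, whereas barplot_fs is the plotting function referred to above:</p>
      <preformat>
# Hypothetical usage sketch; function signatures are assumptions and should be
# verified against the EFS package documentation.
library(EFS)

# Compute the ensemble importances (argument names assumed).
efs_table = ensemble_fs(data = d, classnumber = which(names(d) == "y"))

# Barplot of the importances, also for more than 25 features (cf. panel B, Figure 1).
barplot_fs("spectf_efs", efs_table)

# The per-method results can be exported for further analyses, analogous to the
# CSV download offered by the web application.
write.csv(efs_table, "efs_results.csv")
      </preformat>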
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Riemenschneider</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senge</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          , Hullermeier, E.,
          <string-name>
            <surname>Heider</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Exploiting HIV-1 protease and reverse transcriptase cross-resistance information for improved drug resistance prediction by means of multi-label classification</article-title>
          .
          <source>BioData Mining</source>
          ,
          <volume>9</volume>
          ,
          <issue>10</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baars</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jinawy</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendricks</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sowa</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kälsch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riemenschneider</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerken</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erbel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heider</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Canbay</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>In Acute Myocardial Infarction Liver Parameters Are Associated With Stenosis Diameter</article-title>
          .
          <source>Medicine</source>
          <year>2016</year>
          ,
          <volume>95</volume>
          (
          <issue>6</issue>
          ):
          <fpage>e2807</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zongker</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Feature selection: evaluation, application, and small sample performance</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>19</volume>
          (
          <issue>2</issue>
          ),
          <fpage>153</fpage>
          -
          <lpage>158</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Stable feature selection for biomarker discovery</article-title>
          .
          <source>Computational Biology and Chemistry</source>
          <volume>34</volume>
          ,
          <fpage>215</fpage>
          -
          <lpage>225</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kuncheva</surname>
            ,
            <given-names>L.I.</given-names>
          </string-name>
          :
          <article-title>Combining Pattern Classifiers: Methods and Algorithms</article-title>
          . John Wiley &amp; Sons (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Random forests</article-title>
          .
          <source>Machine Learning</source>
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. van den Boom, J.,
          <string-name>
            <surname>Heider</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pastore</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mueller</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          :
          <article-title>3'-phosphoadenosine 5'-phosphosulfate (PAPS) synthases, naturally fragile enzymes specifically stabilized by nucleotide binding</article-title>
          .
          <source>J Biol Chem</source>
          .
          <volume>287</volume>
          (
          <issue>21</issue>
          ),
          <fpage>17645</fpage>
          -
          <lpage>17655</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Touw</surname>
            ,
            <given-names>W.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bayjanov</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Overmars</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Backus</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boekhorst</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wels</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Hijum</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          :
          <article-title>Data mining in the life sciences with random forest: a walk in the park or lost in the jungle?</article-title>
          .
          <source>Brief Bioinform</source>
          <volume>14</volume>
          (
          <issue>3</issue>
          ),
          <fpage>315</fpage>
          -
          <lpage>326</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Dybowski</surname>
            ,
            <given-names>J.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riemenschneider</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hauke</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pyka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verheyen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Hoffmann, D.,
          <string-name>
            <surname>Heider</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Improved Bevirimat resistance prediction by combination of structural and sequence-based classifiers</article-title>
          .
          <source>BioData Mining</source>
          ,
          <volume>4</volume>
          ,
          <issue>26</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Efficient feature selection via analysis of relevance and redundancy</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>5</volume>
          ,
          <fpage>1205</fpage>
          -
          <lpage>1224</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sing</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sander</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beerenwinkel</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lengauer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>ROCR: visualizing classifier performance in R</article-title>
          .
          <source>Bioinformatics</source>
          <volume>21</volume>
          (
          <issue>20</issue>
          ),
          <fpage>3940</fpage>
          -
          <lpage>3941</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lichman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>UCI Machine Learning Repository</article-title>
          (
          <year>2013</year>
          ), http://archive.ics.uci.edu/ml
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>DeLong</surname>
            ,
            <given-names>E.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DeLong</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clarke-Pearson</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          :
          <article-title>Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach</article-title>
          .
          <source>Biometrics</source>
          ,
          <volume>44</volume>
          ,
          <fpage>837</fpage>
          -
          <lpage>845</lpage>
          (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>