<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <fpage>64</fpage>
      <lpage>79</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Explicit Control of Feature Relevance and Stability (V. Hamer, P. Dupont)</title>
      <p>models, which are easier to interpret. Such models can then be analyzed by
domain experts and are easier to validate. Getting more interpretable models is
also a key concern nowadays, and interpretability is even considered by many as a
requirement for deployment in the medical domain.</p>
      <p>Feature selection has already been widely studied. Yet, current methods
remain largely unsatisfactory, mainly because of the typical instability they exhibit.
Instability here refers to the fact that the selected features may be drastically
different for similar data, even though the true underlying processes (explaining
the target variable) are essentially constant. Such instability is a key issue as
it reduces the interpretability of the predictive models as well as the trust of
domain experts towards the selected feature subsets. We address this problem
here by designing methods balancing between the classification performance and
the selection stability of the well-known Recursive Feature Elimination (RFE)
algorithm. Our approach allows domain experts to explicitly control the trade-off
and to select Pareto-optimal compromises based on their personal preferences.</p>
      <p>In the rest of this section, the two distinct stability problems tackled in
this paper are introduced.</p>
      <p><bold>The Stability Problems</bold></p>
      <p><bold>Single Task Stability (1)</bold> Feature selection methods are often inherently
unstable, i.e. they return highly different feature sets when the training data is
slightly modified. Figure 1a illustrates such an instability. The initial dataset
is perturbed (here by bootstrapping, which is often used to measure such
instability, although any small perturbation could be used) to form different datasets.
Instability arises when there is little overlap between the
selected features. This prevents a correct and sound interpretation
of the selected features and strongly impacts their further validation by domain
experts. Unlike optimizing the accuracy of predictive models, optimizing
selection stability may look trivial, since an algorithm always returning an arbitrary
but fixed set of features would be stable by design. Yet, such an algorithm is not
expected to select informative and predictive features. This illustrates that
optimizing stability is only well-posed jointly with predictive accuracy, and possibly
additional criteria such as minimal model size or sparsity.</p>
      <p><bold>Transfer Learning Selection Stability (2)</bold> Multi-task feature selection aims at
discovering variables that are relevant for several similar, yet distinct,
classification tasks. Different feature subsets can be returned for each task. In this
paper, we focus on the case where not all learning tasks are directly available.
Information from the tasks arising first can be propagated to subsequent tasks
via transfer learning. Stability has to be encouraged from the domain expert's
point of view, as features that are relevant for different data sources are likely to
be particularly interesting to study. The accuracy-stability trade-off in such a
learning problem (represented in Figure 1b) can take two extreme values. With
complete disregard for stability, each feature set could be selected on a given task
independently of the others, with no control of the across-task stability. On the
contrary, maximum stability can trivially be reached by returning the feature set
computed for the first task, for all subsequent tasks. However, this is expected to
reduce the accuracy of the models built on these subsequent tasks as previously
learned features might turn out to be less informative for them. This would be
the case if the different tasks are obtained by gradually enriching or correcting
the data as features learned on the error-corrected data are expected to be more
relevant.</p>
      <p>[Figure 1: (a) Single-task; (b) Transfer learning]</p>
      <p>In Section 2, feature selection methods and propositions to increase stability
are reviewed. Section 3 introduces a metric to assess the performance of methods
compromising between feature selection stability and classification performance.
Then, a biased variant of the RFE algorithm is proposed in Section 4. Section 5
demonstrates the ability of this biased RFE to tackle the previously mentioned
stability problems.</p>
      <sec id="sec-1-1">
        <title>Related Work</title>
        <p>
          Feature selection techniques are generally split into three categories: filters,
wrappers and embedded methods. Filters evaluate the relevance of features
independently of the final model, most commonly a classifier, and remove low-ranked
features. Simple filters (e.g. t-test or ANOVA) are univariate, which is
computationally efficient and tends to produce a relatively stable selection, but they
plainly ignore the possible dependencies between various features. Information
theoretic methods, such as MRMR [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and many others, are based on mutual
information between features or with the response, but a robust estimation of
these quantities in high dimensional spaces remains difficult. Wrappers look for
the feature subset that will yield the best predictive performance on a validation
set. They are classifier dependent and very often multivariate. However, they
can be very computationally intensive and an optimal feature subset can rarely
be found. Embedded methods select features by determining which features are
more important in the decisions of a predictive model. Prominent examples
include SVM-RFE [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and logistic regression with a LASSO [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] or Elastic Net
penalty [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. These methods tend to be more computationally demanding than
filters but they integrate into a single procedure the feature selection and the
estimation of a predictive model. Yet, they also tend to produce much less stable
models.
        </p>
        <p>
          Some works specifically study the causes of selection instability. Results show
that it is mostly caused by the small sample/feature ratio [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], noise in the data
or imbalanced target variable [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and feature redundancy [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. While all of these
reasons clearly play a role, the first one is likely the most important in a
biomedical domain with typically several thousands, if not millions, of features
for only a few dozens or hundreds of samples. This is likely why stable feature
selection is intrinsically hard in this domain and why existing techniques are still
largely unsatisfactory.
        </p>
        <p>
          Looking for a stable feature selection also requires a proper way to quantify
stability itself and lots of measures have already been proposed: the Kuncheva
index [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], the Jaccard index [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], the POG [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and nPOG [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] indices among
others. Under such a profusion of different measures, it becomes difficult to
justify the choice of a particular index and even more to compare results of works
based on different metrics. Furthermore, the large number of available measures
can lead to publication bias (researchers may select the index that makes their
algorithm look the most stable) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In the hope of fixing this issue, a recent
work [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] lists and analyzes 15 different stability measures. They are compared
based on the satisfaction of 5 different properties with which a stability measure
should comply. A novel and unifying index has been proposed in this regard. This index,
used throughout this paper, measures the stability across M selected subsets of
features. It is computed according to equation (1).
        </p>
        <p>
          φ = 1 − ( (1/d) ∑_{f=1}^{d} s_f² ) / ( (k̄/d) ∗ (1 − k̄/d) )
          (1)
with s_f² = (M/(M − 1)) ∗ p̂_f (1 − p̂_f) the estimator of the variance of the selection of the
f-th feature over the M selected subsets, p̂_f the fraction of times feature f has been
selected among the M subsets, and k̄ the mean number of features
selected from the original d features. This measure is the only existing measure
satisfying the 5 (good) properties described in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], namely fully defined, strict
monotonicity, bounds, maximum stability ⇔ deterministic selection, and
correction for chance. It is formally bounded by −1 and 1 but is asymptotically lower
bounded by 0 as M → ∞. It is also equivalent to the Kuncheva Index (KI) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]
when the number of selected features k is constant across the M selected subsets,
but it can be computed in O(M ∗ d) time, whereas KI can only be computed in
O(M² ∗ d) time.
        </p>
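        <p>As a concrete reading of equation (1), the index can be computed directly from an M × d binary selection matrix. The sketch below (plain Python; the function name is ours) follows the definitions of s_f², p̂_f and k̄ given above.</p>

```python
def stability(selections):
    """Stability index of equation (1).

    selections: M runs of feature selection, each a list of d binary
    indicators (1 if the feature was selected in that run).
    """
    M = len(selections)
    d = len(selections[0])
    # p_hat[f]: fraction of the M subsets in which feature f was selected
    p_hat = [sum(run[f] for run in selections) / M for f in range(d)]
    # s_f^2 = M/(M-1) * p_hat_f * (1 - p_hat_f): per-feature selection variance
    s2 = [M / (M - 1) * p * (1 - p) for p in p_hat]
    # k_bar: mean number of features selected per run
    k_bar = sum(sum(run) for run in selections) / M
    return 1 - (sum(s2) / d) / ((k_bar / d) * (1 - k_bar / d))
```

        <p>A deterministic selection gives φ = 1, while two disjoint subsets covering half of the features each reach the formal lower bound φ = −1.</p>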
        <p>
          Several authors have already proposed methods to increase stability. For
instance, instance weighting for variance reduction [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] tends to increase
feature stability while keeping a comparable predictive performance. Ensemble
methods for feature selection have also been proposed [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and generally increase
feature stability. Nonetheless, the gain in stability offered by existing methods is
still limited and, maybe more importantly, the stability of the selection cannot
be controlled explicitly, which is the main goal of this paper.
        </p>
        <p>
          Multi-task feature selection has already been largely studied [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ].
Encouraging the selection of common predictors across tasks can be done by using the
ℓ1/ℓp regularization scheme. The cost of selecting different predictors for
different tasks can be controlled by the choice of the norm ℓ1/ℓp, as p → ∞ favors
the selection of common features. As with the differential shrinkage approach
proposed here, penalties caused by selecting the same feature several times are
reduced. Notably, the ℓ1/ℓ∞ [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and ℓ1/ℓ2 [
          <xref ref-type="bibr" rid="ref18 ref19">18,19</xref>
          ] penalties have been studied
in detail. Efficient projected gradient algorithms, for general p, are proposed,
and the effect of p on the shared sparsity pattern and the classification
performance is analyzed [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. The main goal of [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] is to find adequate feature-sharing
degrees such as to maximize the prediction performance of the models, which is
different from the objective of explicit control of the accuracy-stability trade-off
that is pursued in the present paper. Although this approach has been originally
introduced for standard multi-task feature selection, it can trivially be adapted
to the transfer learning setting [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Other similar approaches have also been
proposed [
          <xref ref-type="bibr" rid="ref3 ref4 ref8">3,4,8</xref>
          ] (see [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] for a complete survey).
        </p>
        <p><bold>A Multi-Objective Evaluation Framework Through Pareto Optimality</bold></p>
        <p>In this section, we propose to use a classical evaluation framework from
multi-objective optimization to assess the efficiency of methods balancing
classification performance and selection stability. An (accuracy, stability) pair
(a1, φ1) dominates another pair (a2, φ2) iff a1 ≥ a2 ∧ φ1 ≥ φ2 and at least one of
the inequalities is strict (&gt;). Common alternatives to the classification accuracy,
such as specificity/sensitivity or AUC, can also be used. A given method m is able
to generate some pairs Pm in the space of all possible pairs
P = { (a, φ) : 0 ≤ a, φ ≤ 1 }. (The careful reader may remember that the stability
measure φ formally lies in the [−1, 1] interval; however, as φ = 0 corresponds to the
stability of a uniformly random selection, we argue that the only interesting part of
the stability spectrum is in fact [0, 1].) From the set of generated pairs Pm, the set
of pairs that are not dominated by any other pair can be found.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <p>This set, called the Pareto set P^a_m, defines a subspace where
no point dominates any other point. A domain expert would then choose his or her
favorite pair based on personal preference towards classification performance
and feature selection stability.</p>
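      <p>The dominance relation above translates directly into code. A minimal sketch (the function name is ours) extracting the Pareto set from a list of (accuracy, stability) pairs:</p>

```python
def pareto_set(pairs):
    """Keep the (accuracy, stability) pairs not dominated by any other pair.

    (a1, s1) dominates (a2, s2) iff a1 >= a2 and s1 >= s2,
    with at least one strict inequality.
    """
    def dominates(p, q):
        return (p[0] >= q[0] and p[1] >= q[1]
                and (p[0] > q[0] or p[1] > q[1]))
    return [p for p in pairs if not any(dominates(q, p) for q in pairs)]
```

      <p>For instance, (0.7, 0.4) is dominated by (0.8, 0.5) and would be filtered out, while a high-accuracy and a high-stability point both survive.</p>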
      <p>
        As a performance metric, we propose the widely used hypervolume measure
[
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], also known as the S-metric. This volume represents the region of the objective
space containing the accuracy-stability pairs that are dominated by at least one point of the
Pareto set P^a_m. The hypervolume measure has the convenient property that
whenever a Pareto set dominates another, the hypervolume of the former is
greater. As our objective space is bidimensional, the hypervolume measure is
referred to as the Dominance Area (DA) in the rest of this work.
      </p>
      <p>
        An example of the DA metric can be seen in Figure 2. The blue method
starts from the left with a higher accuracy. It thus gains some area over the
red method. Nonetheless, the red method can reach higher stabilities without
dropping the accuracy as much as the blue one. Overall, the red method has a
larger DA. Note that this DA is also equal to the fraction of the total area that
is dominated by the method, or 1 minus the fraction of area that dominates the
method. Its value thus lies in the [0, 1] interval.
      </p>
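      <p>In two dimensions, the DA of a Pareto set is the area of the union of the rectangles each point dominates, which a single sweep computes. A sketch under our naming, with (0, 0) as reference point:</p>

```python
def dominance_area(points):
    """Area of [0,1]^2 dominated by a set of (accuracy, stability) points,
    i.e. the 2-D hypervolume with reference point (0, 0)."""
    area, best_s = 0.0, 0.0
    # sweep from the highest accuracy down; each point that improves the
    # running best stability contributes a new rectangle strip
    for a, s in sorted(points, reverse=True):
        if s > best_s:
            area += a * (s - best_s)
            best_s = s
    return area
```

      <p>Dominated points contribute nothing, so the function can be fed raw (a, φ) pairs without first extracting the Pareto set.</p>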
      <p>
        As noticed by [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], this DA measure is biased towards convex, inner parts
of the objective space. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] tackles this problem by giving different weights to
different portions of the objective space. This weighted DA can be computed via
the weighted integral
      </p>
      <p>
        DA_P = ∫₀¹ ∫₀¹ w(a, φ) f_P(a, φ) da dφ
        (2)
with w the weighting function and f_P the attainment function, which is equal
to 1 if (a, φ) is dominated and 0 otherwise. To preserve the [0, 1] bounds, w
has to be normalized such that its integral over the objective space is 1. For
example, the normalized weighting function w_a(a, φ) = A ∗ e^{A∗a} / (e^A − 1) gives a higher
weight to the portions of the space where the accuracy is high. In the example
of Figure 2, the blue method actually outperforms the red one for A &gt; 2.5.
For the sake of generality, our methods are evaluated with w(a, φ) = 1, but the
proposed evaluation framework allows for more general weightings, for instance if
domain experts are particularly interested in some parts of the objective space.
      </p>
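      <p>Equation (2) can be approximated on a grid. The sketch below (our naming; w defaults to the uniform weighting used in our evaluation) normalizes by the total weight so that the result stays in [0, 1]:</p>

```python
def weighted_da(pareto, w=lambda a, s: 1.0, n=200):
    """Grid approximation of equation (2): the integral of the normalized
    weighting function w over the region dominated by the Pareto set."""
    total = dominated = 0.0
    for i in range(n):
        for j in range(n):
            a, s = (i + 0.5) / n, (j + 0.5) / n  # cell midpoints in [0,1]^2
            wv = w(a, s)
            total += wv
            # attainment function f_P: 1 iff (a, s) is dominated by some point
            if any(pa >= a and ps >= s for pa, ps in pareto):
                dominated += wv
    return dominated / total
```

      <p>With w = 1 this reduces to a grid estimate of the plain DA; passing an accuracy-weighted w shifts the score towards the high-accuracy region, as discussed above.</p>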
      <p>In order to evaluate the pair (a, φ ) ∈ P corresponding, for instance, to a set of
meta-parameters, some data has to be used to learn the features and some data
to evaluate them on independent examples. This can be done via standard
cross-validation or by bootstrapping. Each set of meta-parameters produces different
pairs in P and their average value is reported. Concretely, each point in Figure
2 comes with an uncertainty linked to the sampling of the data. In the following,
we define a confidence interval on the true value of DA based on the derivation
of confidence regions for each Pareto-optimal pair.</p>
      <p>Let A be the random variable representing the accuracy value measured on
a given subsampling of the data and Φ be the corresponding stability value.
Let P = (A, Φ ) be the multivariate random variable with the accuracy and
stability as dimensions. Let us assume that the evaluation protocol produces B
measurements of P for each Pareto-optimal point (B could be, e.g., the number of
bootstrap samples), represented by the vector p.
The Hotelling distribution T 2 is the multivariate counterpart of the Student’s t
distribution, with which we can define confidence (here 2-dimensional) regions.</p>
      <p>T² = B ∗ (p̄ − μ(p))′ C⁻¹ (p̄ − μ(p)) ∼ (2(B − 1)/(B − 2)) ∗ F_{2,B−2}
(3)
with C the sample covariance matrix. It can be shown that T² is distributed like
a (scaled) Fisher distribution F_{2,B−2}. Thus,</p>
      <p>P( B ∗ (p̄ − μ(p))′ C⁻¹ (p̄ − μ(p)) ≤ (2(B − 1)/(B − 2)) ∗ F_{2,B−2}(α) ) = 1 − α
(4)
The inequality defines an ellipsoidal region that is likely to cover μ(p). The
center of the ellipsoid is p̄. The lengths of the axes and their angles can be found by
computing the eigenvalues and eigenvectors of the sample covariance matrix C.
To compute a confidence interval on the DA, the most dominant and most dominated
points of each ellipse are found and used to compute the upper and lower bounds
of the confidence interval (see Figures 3c and 3d for concrete examples in our
experiments).</p>
      <sec id="sec-2-1">
        <title>A Biased Variant of the RFE Algorithm</title>
        <p>In this section, we propose a simple method to balance between the
classification performance and selection stability of a logistic RFE algorithm. The RFE
algorithm was originally introduced with a hinge loss. We prefer here the
logistic variant for an expected smoother control of the trade-off under study. RFE
iteratively drops the least relevant features until the desired number of features
k is reached. We opt here to drop a fixed fraction (20%) of the features at each
iteration. The loss function that a logistic RFE minimizes for a binary
classification task is the following, with n the number of samples, xi sample number i
made of d features as dimensions, and yi its label.</p>
        <p>L = ∑_{i=1}^{n} log(1 + e^{−y_i ∗ (w · x_i)}) + λ ‖w‖²
(5)
The weight vector w contains a weight assigned to each feature. The features are
then ranked based on the absolute value of their weight, which represents the
importance of the feature in the final prediction. The term λ ‖w‖² of the loss
function is a regularization term, preventing the coefficients of the model from
taking too large values, which would most likely result in overfitting. In the classical
approach, every feature is regularized by the same amount λ.</p>
        <p>
          We propose to extend equation (5) such that the regularization term
becomes λ ∑_f β_f w_f². The function of the vector β is to bias the selection towards
certain features via differential shrinkage. A feature f with a small β_f is less
regularized, and vice versa. Its selection in the model is less penalized than that of a
feature with a higher β_f. The search is thus biased towards features with small β_f.
A similar differential shrinkage has already been applied to the `1-AROM and
`2-AROM methods [
          <xref ref-type="bibr" rid="ref12 ref13">12,13</xref>
          ] with the objective of biasing the selection towards a
priori relevant features or in a transfer learning context. In the remaining part of
this section, three possible schemes to set the β vector, according to the setting
of interest, are discussed.
        </p>
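        <p>A minimal sketch of this biased logistic RFE (plain NumPy; the naming is ours, plain gradient descent stands in for whatever solver an actual implementation would use, and beta = 1 recovers the standard, unbiased RFE):</p>

```python
import numpy as np

def logistic_weights(X, y, lam=0.1, beta=None, lr=0.1, iters=500):
    """Fit w by gradient descent on equation (5) with per-feature
    penalties lam * beta_f * w_f^2 (differential shrinkage)."""
    n, d = X.shape
    beta = np.ones(d) if beta is None else beta
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)
        sigma = 1.0 / (1.0 + np.exp(margins))  # logistic loss derivative factor
        grad = -(X * (y * sigma)[:, None]).sum(axis=0) + 2 * lam * beta * w
        w -= lr * grad / n
    return w

def biased_rfe(X, y, k, beta=None, drop=0.2, **fit_kw):
    """Recursively drop the `drop` fraction of lowest-|w| features until k remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > k:
        b = None if beta is None else beta[remaining]
        w = logistic_weights(X[:, remaining], y, beta=b, **fit_kw)
        n_drop = min(max(1, int(drop * len(remaining))), len(remaining) - k)
        order = np.argsort(np.abs(w))  # least important features first
        remaining = sorted(remaining[i] for i in order[n_drop:])
    return remaining
```

        <p>On data where a single feature determines the label, the recursion keeps that feature until the end, which is the behavior the ranking by |w_f| is meant to guarantee.</p>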
        <p><bold>Biased RFE for Single Task Feature Selection</bold> By varying the distribution of β,
the accuracy-stability trade-off of the biased RFE can be controlled. The biased
RFE is equivalent to a standard RFE when β = 1. Otherwise, the selection is
biased towards features with a small β f . This is expected to increase stability at
the possible cost of some classification performance, as uninformative features
could be prioritized. In this initial approach, we decide to favor some features
non-uniformly at random, following a gamma distribution.</p>
        <p>β_f ∼ Γ(α, 1)
with α the shape of the gamma distribution, which controls the trade-off. All
β_f are post-centered such that μ(β) = 1. As α → ∞, the centered gamma distribution
tends to a Dirac delta: all features then have the same weight (equal to
μ(β) = 1) and no bias is put on the selection. As α → 0, the distribution of
β departs from a Dirac delta, which increases the bias. Domain experts can thus play
with the α value and thereby explicitly tune the trade-off between selection
stability and prediction accuracy.</p>
        <p><bold>Using Prior Knowledge</bold> The biased RFE can take advantage of available prior
knowledge. If a ranking of the features is known, then the β_f can be assigned
such that this ranking is respected. If the prior knowledge is meaningful, the
selection is no longer biased towards arbitrary features, but towards features
that are high in the ranking, and thus potentially informative. Another type
of prior knowledge could be an unordered set of features that are suspected to
take part in the process of interest. The lowest β_f could then systematically be
assigned to those features.</p>
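        <p>A possible sampling of β (our naming; the text says β is "post-centered" so that μ(β) = 1, and rescaling by the empirical mean, as below, is one simple way to enforce this while keeping every β_f positive):</p>

```python
import numpy as np

def sample_beta(d, alpha, rng=None):
    """Draw per-feature shrinkage factors beta_f ~ Gamma(alpha, 1) and
    rescale them so that their mean is exactly 1.

    Small alpha -> widely spread beta -> strong selection bias;
    large alpha -> beta close to 1 -> (almost) unbiased RFE.
    """
    rng = np.random.default_rng() if rng is None else rng
    beta = rng.gamma(alpha, 1.0, size=d)
    return beta / beta.mean()
```

        <p>The normalized vector has coefficient of variation 1/√α, so increasing α smoothly removes the bias, matching the limiting behavior described above.</p>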
        <p><bold>Biased RFE for Transfer Learning</bold> We are now interested in the across-task
stability that can be obtained via transfer learning. Tasks are thus ranked (naturally,
from their chronological order, or by the domain expert) such that information from
previous tasks can be used in the selection of features for subsequent tasks.
In task i, features that have been returned for tasks 0..i − 1
should be prioritized over the rest, such that the feature stability is increased.
Given the definition of stability used here (equation (1)), it is actually possible to
compute the drop/gain in stability that the selection of a feature would cause.
Intuitively, we propose to bias the selection, through a specific choice of the
vector β, towards features that would cause the highest gain/lowest drop in
stability if they were to be selected. Constant terms put aside (we purposely drop
the M/(M − 1) term for convenience, and the denominator (k/d) ∗ (1 − k/d) is
constant if the number k of selected features is fixed), each feature
influences (negatively) the total stability through its selection variance s_f² ∝
p_f (1 − p_f). Feature f is given an attractiveness score sc_f, expressed in equation
(6).</p>
        <p>sc_f = ((N + 1)² / N) ∗ (s²_{f,no} − s²_{f,yes})
(6)
with s²_{f,no} the selection variance of feature f assuming f is not selected in the
current task and s²_{f,yes} its selection variance if it were to be selected. N is the
number of tasks for which feature sets have already been selected. This score
is thus proportional to the difference in stability between the cases where the
feature is or is not selected for the given task. This is illustrated in Table 1a, where
the current task is T4. For instance, the selection of feature F2 in task T4
would make its mean selection p_f equal to 0.75. If it were not selected, p_f
would be equal to 0.5. The attractiveness score of F2, sc_{F2}, is actually positive,
meaning that the selection of F2 in T4 would increase the measured stability.</p>
        <p>The (N + 1)²/N factor of equation (6) is there to correct a downward tendency of
sc_f when the index of the considered task increases. This is illustrated in Table
1b. If feature f is selected in each task, sc_f would actually decrease, which would
decrease the bias. It can be shown that including the correction term leads to
sc_f = 2 ∗ p_f − 1, with p_f the proportion of the selections of feature f in the
past N tasks. Let S be the sum of such selections. By definition, p_f = S/N,</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <p>s²_{f,no} = (S/(N + 1)) ∗ (1 − S/(N + 1)) and s²_{f,yes} = ((S + 1)/(N + 1)) ∗ (1 − (S + 1)/(N + 1)). Thus,
sc_f = ((N + 1)²/N) ∗ (s²_{f,no} − s²_{f,yes}) = (2S − N)/N = 2 ∗ p_f − 1.
These scores, together with a strength parameter α_t, define through equation (7)
the vector β used to bias the selection towards previously selected features (again,
β is post-centered such that μ(β) = 1 at each iteration of the RFE algorithm).
With α_t = 0, features are learned independently on each task. On the contrary,
an increasing α_t raises the bias towards features that were already selected in
past tasks. Domain experts can thus tune the α_t values to control the
accuracy-stability trade-off in such a transfer learning setting.</p>
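      <p>The identity sc_f = 2 p_f − 1 can be checked numerically. A sketch (our naming, exact rational arithmetic) computing equation (6) from the two candidate selection variances, reproducing the F2 example above (a feature selected S = 2 times over N = 3 past tasks):</p>

```python
from fractions import Fraction

def sc(S, N):
    """Attractiveness score (equation (6)) of a feature selected S times
    in the N past tasks, with the (N+1)^2/N correction factor."""
    S, N = Fraction(S), Fraction(N)
    p_no = S / (N + 1)            # p_hat_f if the feature is NOT selected now
    p_yes = (S + 1) / (N + 1)     # p_hat_f if it IS selected
    s2_no, s2_yes = p_no * (1 - p_no), p_yes * (1 - p_yes)
    return (N + 1) ** 2 / N * (s2_no - s2_yes)
```

      <p>Here sc(2, 3) = 1/3 &gt; 0, confirming that selecting F2 in T4 increases the measured stability, and the closed form 2S/N − 1 holds for every (S, N).</p>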
      <sec id="sec-3-1">
        <title>Experiments</title>
        <p>
          In this section, we evaluate to what extent an actual compromise between
prediction accuracy and selection stability can be made with the proposed approaches.
Experiments are performed on two distinct tasks, prostate cancer diagnosis from
microarray data and handwritten digit recognition. The Prostate dataset
contains 12600-dimensional (microarray) gene expression data from 52 patients with
prostate tumors and 50 healthy patients [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. The Gisette dataset contains
5000-dimensional integer data, with features aimed at discerning pictures of the
digit 4 from the digit 9. Gisette was originally constructed from the MNIST
data but was extended with 2500 noisy features [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. It consists of 6000 examples
but, in order to better illustrate the trade-off, only 100 examples are used here.
        </p>
        <p><bold>Evaluation Methodology</bold>
To obtain the results presented in the next sections, the following methodology
has been used. Each (a, φ) pair is obtained with a different α (problem 1) or
α_t (problem 2). For the single task stability problem, the β_f are first sampled
from the gamma distribution. Then, M bootstrap samples are built and k features
are selected using the proposed biased RFE on each bootstrap sample. For
the transfer learning stability problem, a single bootstrap sample is created for each
task. Features are selected from it, then β for the next task is computed
according to equation (7). The final prediction model is learned by minimizing
the classical, unbiased, logistic loss with an L2 regularization (see equation (5))
with a non-strongly fitted regularization parameter λ (0.1 for problem 1 and 1 for
problem 2; the final classification algorithm does not influence the selection
stability and can thus be optimized to maximize the predictive accuracy only).
Every model is evaluated on its out-of-bag examples. The mean accuracy as well as
the stability of the selected features are computed. As these values are obviously
dependent on the sampling of β (problem 1) or the features learned on the first few
tasks (problem 2), this procedure is repeated B times and the mean values are
reported. The 95% confidence regions of the expected value of the accuracy-stability
trade-off are computed, as well as the confidence interval on DA described in
section 3. The stability of the feature selection (x-axis on Figures 2, 3 and 4)
must not be confused with its corresponding uncertainty, which is the width of the
confidence regions along the x-axis.</p>
        <p><bold>Single Task Selection Stability</bold>
The λ meta-parameter of the RFE formulation (equation (5)) has not been
strongly optimized. A value of 0.1, which provides a good accuracy, has been
used for both tasks. To obtain the graphs below, the evaluation methodology
detailed above has been used with M = 30, k = 20 and B = 100.</p>
        <p>Results on both datasets can be seen in Figure 3. The blue curves are
obtained without any prior knowledge. The top-left point of each subgraph
corresponds to the (accuracy, stability) trade-off obtained with the classical logistic
RFE method. Following the Pareto lines from left to right, the shape α of the gamma
distribution decreases. This makes the biased logistic RFE depart from its
unbiased version, which raises stability but reduces classification performance. As the
method fails to reach maximum stability, it was extended with the trivial point
(a_rand, 1), obtained by always returning the same arbitrary feature subset (it is
actually impossible to reach a maximum stability of 1 for a finite regularization
parameter λ; even with no regularization, a feature is not guaranteed to be always
selected). It</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <p>seems that, while it is possible to significantly increase the stability without
degrading the accuracy too much on the Gisette dataset (Figure 3b), this is not the
case for the Prostate dataset (Figure 3a), where the accuracy drops immediately.</p>
      <p>To measure the effect of prior knowledge, N = 10 examples are sampled
randomly. The 100 features with the highest variance are selected as part of the
prior knowledge, here representing a set of potentially relevant features. As can
be seen on Figure 3a and 3b, even such a small prior knowledge improves the
Dominance Area considerably.</p>
      <p>Figures 3c and 3d have been obtained with a small subset of the Pareto points.
The ellipses are the 95% confidence regions of the expected value (over the β
sampling) of the accuracy-stability trade-off. For large α values, the importance
of β is reduced, and thus the uncertainty is limited. As α decreases, this confidence
region grows. The ellipses are also all inclined towards the right. This reflects
the covariance between the accuracy and stability for a single β sampling: if the
sampling happens to be bad, i.e. poor features are prioritized, poor accuracy
and poor stability are obtained. The opposite is true for a good sampling. By
using the top-right and bottom-left points of each ellipse, it is possible to derive
a confidence interval on the true DA of the method on these datasets.</p>
      <p><bold>Multi-task Selection Stability via Transfer Learning</bold></p>
      <p>To generate different, yet similar, classification tasks, normally distributed
noise has been added to the Prostate dataset. This noise is centered on 0 and
has a specific standard deviation for every pair of feature and task, such
that features relevant in some tasks could be irrelevant in others. Yet, tasks are
expected to share common informative features. Eight tasks are considered here, with
an arbitrary order between them. Results with k = 10, B = 500, λ = 1 are shown
on Figure 4a. The blue area is obtained by combining two trivial options. First,
the features learned on the first task can be selected for all subsequent tasks,
achieving a stability of 1. Or features can be learned independently from each
other (equivalent to α t = 0). This strategy offers poor selection stability, but also
a sub-optimal classification performance. Knowledge from previous tasks can be
used to guide the search towards potentially good features for subsequent tasks.
This increases both the accuracy and stability at first. Then, the accuracy starts
to decrease, as the selection of features is forced too much. This tendency is</p>
    </sec>
    <sec id="sec-5">
      <p>
        better illustrated in Figure 4b, which contains some non-Pareto-optimal points.
This result is consistent with the conclusion drawn from the analysis of the
Group-Lasso with ℓ<sub>1</sub>/ℓ<sub>p</sub> regularization [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], i.e. that weak coupling norms (1.5 ≤ p ≤
2) outperform no coupling and strong coupling norms. Unlike for single-task feature
selection, the confidence regions are similar for all compromises, meaning that
differential shrinkage does not increase the uncertainty of the obtained
accuracy-stability pair. Furthermore, as the ellipses are straight, the measured accuracy
and stability are uncorrelated.
      </p>
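      <p>The task-generation procedure described above can be sketched as follows (a toy illustration on synthetic data, not the paper's exact protocol): zero-mean Gaussian noise with a per-(feature, task) standard deviation is added to a common data matrix, so a feature left nearly clean in one task may be drowned in noise in another, while the underlying signal is shared.</p>
      <preformat>
```python
import numpy as np

def make_noisy_tasks(X, n_tasks=8, max_sigma=2.0, seed=0):
    """Derive n_tasks similar classification tasks from one data matrix X.

    Each task gets its own per-feature noise level, drawn uniformly in
    [0, max_sigma], so the set of informative features differs from task
    to task while common informative features remain.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    tasks = []
    for _ in range(n_tasks):
        sigma = rng.uniform(0.0, max_sigma, size=d)   # per-feature std
        noise = rng.normal(0.0, 1.0, size=(n, d)) * sigma
        tasks.append(X + noise)
    return tasks
```
      </preformat>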
      <sec id="sec-5-1">
        <title>Conclusion and Perspectives</title>
        <p>The typical instability of standard feature selection methods is a key concern
nowadays, as it reduces the interpretability of the predictive models as well as
the trust of domain experts in the selected feature subsets. Such experts
would often prefer a more stable feature selection algorithm over an unstable
and slightly more accurate one. In this paper, the compromise between feature
relevance and selection stability is made explicit by biasing the selection towards
some features through differential shrinkage within the Recursive Feature
Elimination algorithm. Domain experts are given the opportunity to select any
Pareto-optimal trade-off between accuracy and selection stability based on their preferences.
We propose the use of the hypervolume metric to assess the performance of
methods realizing such a compromise, together with an associated confidence interval
based on confidence regions of the accuracy-stability trade-off.</p>
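        <p>For two maximised objectives, the hypervolume reduces to the dominance area of the attained (accuracy, stability) points with respect to a reference point. A minimal sketch, assuming both objectives lie in [0, 1] and taking the reference point (0, 0) (a generic computation, not the authors' implementation):</p>
        <preformat>
```python
def dominance_area(points, ref=(0.0, 0.0)):
    """Area dominated by a set of (accuracy, stability) points.

    Sweep the points by decreasing accuracy; each point that improves
    on the best stability seen so far adds one rectangle to the area.
    """
    pts = sorted(points, key=lambda p: p[0], reverse=True)
    area, best_stab = 0.0, ref[1]
    for acc, stab in pts:
        if stab > best_stab:               # non-dominated point
            area += (acc - ref[0]) * (stab - best_stab)
            best_stab = stab
    return area
```
        </preformat>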
        <p>Results on prostate cancer diagnosis and handwriting recognition tasks show
that the selection stability can be increased at will, often at a cost in
classification performance. When some prior knowledge is available, far better
compromises can be made. The design and evaluation of hybrid methods, which learn
the prior knowledge from the data and use it to stabilize the selection, is part
of our future work.</p>
        <p>Motivated by the needs of domain experts, across-task feature stability is
also studied in a transfer learning setting (i.e. when tasks are ordered). A biasing
scheme that takes the stability measure explicitly into account is proposed. For
similar, yet different, tasks, we show on microarray data that some bias is at first
beneficial to both the accuracy and the stability. Too strong a bias continues to
increase the selection stability, but at the cost of some classification performance,
as the most relevant features vary across tasks. Our approach is evaluated here
in a simulated transfer learning setting; further experimental validations will
be conducted.</p>
        <p>
          Different multi-task feature selection methods have been proposed in the
literature (e.g. the Group-Lasso with ℓ<sub>1</sub>/ℓ<sub>p</sub> regularization [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], additive linear
models [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], . . . ). Such methods were introduced with the primary objective of building
accurate predictive models across several (this time unordered) tasks. We will
study to what extent they could also be used to allow the tuning of the trade-off
between across-task selection stability and classification performance. The biased RFE
proposed here can be extended to tackle classical multi-task feature selection, for
example by prioritizing the features that are most relevant when all tasks are considered
together. Our future work includes the evaluation of all these approaches in the
proposed assessment framework.
        </p>
        <p>The present paper addresses the growing need to consider selection
stability not merely as a side-effect of learning accurate predictive models but as an
actual goal in a bi-objective framework. It proposes initial approaches to learn
Pareto-optimal compromises in such a framework and, hopefully, opens the way
to new works and improvements in this area.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abeel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Helleputte</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Peer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dupont</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saeys</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Robust biomarker identification for cancer diagnosis with ensemble feature selection methods</article-title>
          .
          <source>Bioinformatics</source>
          <volume>26</volume>
          (
          <issue>3</issue>
          ),
          <fpage>392</fpage>
          -
          <lpage>398</lpage>
          (
          <year>2010</year>
          ). https://doi.org/10.1093/bioinformatics/btp630
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alelyani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>On feature selection stability: A data perspective</article-title>
          .
          <source>Ph.D. thesis</source>
          , Arizona State University (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Argyriou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evgeniou</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pontil</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Multi-task feature learning</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Argyriou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evgeniou</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pontil</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Convex multi-task feature learning</article-title>
          .
          <source>Machine Learning</source>
          <volume>73</volume>
          (
          <issue>3</issue>
          ),
          <fpage>243</fpage>
          -
          <lpage>272</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Awada</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khoshgoftaar</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dittman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wald</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napolitano</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A review of the stability of feature selection techniques for bioinformatics data</article-title>
          .
          <source>In: Information Reuse and Integration (IRI)</source>
          ,
          <year>2012</year>
          IEEE 13th International Conference on. pp.
          <fpage>356</fpage>
          -
          <lpage>363</lpage>
          . IEEE (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Boulesteix</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Slawski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Stability and aggregation of ranked gene lists</article-title>
          .
          <source>Briefings in bioinformatics 10(5)</source>
          ,
          <fpage>556</fpage>
          -
          <lpage>568</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Minimum redundancy feature selection from microarray gene expression data</article-title>
          .
          <source>Journal of bioinformatics and computational biology</source>
          <volume>3</volume>
          (
          <issue>02</issue>
          ),
          <fpage>185</fpage>
          -
          <lpage>205</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Evgeniou</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pontil</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Regularized multi-task learning</article-title>
          .
          <source>In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <fpage>109</fpage>
          -
          <lpage>117</lpage>
          . ACM (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Guyon</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gunn</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikravesh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zadeh</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          :
          <article-title>Feature extraction: foundations and applications</article-title>
          , vol.
          <volume>207</volume>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Guyon</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barnhill</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Gene selection for cancer classification using support vector machines</article-title>
          .
          <source>Machine learning 46(1-3)</source>
          ,
          <fpage>389</fpage>
          -
          <lpage>422</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A variance reduction framework for stable feature selection</article-title>
          .
          <source>Statistical Analysis and Data Mining: The ASA Data Science Journal</source>
          <volume>5</volume>
          (
          <issue>5</issue>
          ),
          <fpage>428</fpage>
          -
          <lpage>445</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Helleputte</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dupont</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Feature selection by transfer learning with linear regularized models</article-title>
          .
          <source>In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          . pp.
          <fpage>533</fpage>
          -
          <lpage>547</lpage>
          . Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Helleputte</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dupont</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Partially supervised feature selection with regularized linear models</article-title>
          .
          <source>In: Proceedings of the 26th Annual International Conference on Machine Learning</source>
          . pp.
          <fpage>409</fpage>
          -
          <lpage>416</lpage>
          . ACM (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kalousis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prados</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hilario</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Stability of feature selection algorithms</article-title>
          .
          <source>In: Data Mining</source>
          , Fifth IEEE International Conference on.
          <fpage>8</fpage>
          pp. IEEE
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Kuncheva</surname>
            ,
            <given-names>L.I.:</given-names>
          </string-name>
          <article-title>A stability index for feature selection</article-title>
          .
          <source>In: Artificial intelligence and applications</source>
          . pp.
          <fpage>421</fpage>
          -
          <lpage>427</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palatucci</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery</article-title>
          .
          <source>In: Proceedings of the 26th Annual International Conference on Machine Learning</source>
          . pp.
          <fpage>649</fpage>
          -
          <lpage>656</lpage>
          . ACM (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Nogueira</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sechidis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>On the stability of feature selection algorithms</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>18</volume>
          (
          <issue>1</issue>
          ),
          <fpage>6345</fpage>
          -
          <lpage>6398</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Obozinski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taskar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Multi-task feature selection</article-title>
          . Statistics Department, UC Berkeley,
          <source>Tech. Rep</source>
          <volume>2</volume>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Obozinski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taskar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Joint covariate selection and joint subspace selection for multiple classification problems</article-title>
          .
          <source>Statistics and Computing</source>
          <volume>20</volume>
          (
          <issue>2</issue>
          ),
          <fpage>231</fpage>
          -
          <lpage>252</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Saeys</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Inza</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , Larrañaga, P.:
          <article-title>A review of feature selection techniques in bioinformatics</article-title>
          .
          <source>Bioinformatics</source>
          <volume>23</volume>
          (
          <issue>19</issue>
          ),
          <fpage>2507</fpage>
          -
          <lpage>2517</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reid</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>W.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shippy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warrington</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Collins</surname>
            ,
            <given-names>P.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Longueville</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawasaki</surname>
            ,
            <given-names>E.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.Y.</given-names>
          </string-name>
          , et al.:
          <article-title>The microarray quality control (MAQC) project shows inter- and intra-platform reproducibility of gene expression measurements</article-title>
          .
          <source>Nature biotechnology 24(9)</source>
          ,
          <volume>1151</volume>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Febbo</surname>
            ,
            <given-names>P.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jackson</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manola</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ladd</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tamayo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Renshaw</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D'Amico</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richie</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          , et al.:
          <article-title>Gene expression correlates of clinical prostate cancer behavior</article-title>
          .
          <source>Cancer cell 1</source>
          (
          <issue>2</issue>
          ),
          <fpage>203</fpage>
          -
          <lpage>209</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Somol</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Novovicova</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Evaluating stability and comparing output of feature selectors that optimize feature subset cardinality</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>32</volume>
          (
          <issue>11</issue>
          ),
          <fpage>1921</fpage>
          -
          <lpage>1939</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Regression shrinkage and selection via the lasso</article-title>
          .
          <source>Journal of the Royal Statistical Society</source>
          . Series B (Methodological) pp.
          <fpage>267</fpage>
          -
          <lpage>288</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Vogt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
          </string-name>
          , V.:
          <article-title>A complete analysis of the ℓ1,p group-lasso</article-title>
          .
          <source>arXiv preprint arXiv:1206.4632</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Evaluating reproducibility of differential expression discoveries in microarray studies by considering correlated molecular changes</article-title>
          .
          <source>Bioinformatics</source>
          <volume>25</volume>
          (
          <issue>13</issue>
          ),
          <fpage>1662</fpage>
          -
          <lpage>1668</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>A survey on multi-task learning</article-title>
          .
          <source>arXiv preprint arXiv:1707.08114</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Zitzler</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brockhoff</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thiele</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>The hypervolume indicator revisited: On the design of pareto-compliant indicators via weighted integration</article-title>
          .
          <source>In: International Conference on Evolutionary Multi-Criterion Optimization</source>
          . pp.
          <fpage>862</fpage>
          -
          <lpage>876</lpage>
          . Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Zitzler</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thiele</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Multiobjective evolutionary algorithms: a comparative case study and the strength pareto approach</article-title>
          .
          <source>IEEE transactions on Evolutionary Computation</source>
          <volume>3</volume>
          (
          <issue>4</issue>
          ),
          <fpage>257</fpage>
          -
          <lpage>271</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Regularization and variable selection via the elastic net</article-title>
          .
          <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source>
          <volume>67</volume>
          (
          <issue>2</issue>
          ),
          <fpage>301</fpage>
          -
          <lpage>320</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>