<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Map-Elites Algorithm for Features Selection Problem</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Brenda Quin~onez</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego P. Pinto-Roa </string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel Garc a-Torres</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mar a E. Garc a-Diaz </string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Nun~ez-Castillo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Divina</string-name>
          <email>fdivinag@upo.es</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>In the High-dimensional data analysis there are several challenges in the elds of machine learning and data mining. Typically, feature selection is considered as a combinatorial optimization problem which seeks to remove irrelevant and redundant data by reducing computation time and improve learning measures. Given the complexity of this problem, we propose a novel Map-Elites based Algorithm that determines the minimum set of features maximizing learning accuracy simultaneously. Experimental results, on several data based from real scenarios, show the e ectiveness of the proposed algorithm.</p>
      </abstract>
      <kwd-group>
        <kwd>Feature Selection</kwd>
        <kwd>Map-Elites</kwd>
        <kwd>Combinatorial Optimization</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Data Mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recently, the available data has increased explosively in both the number of
samples and the dimensionality in di erent machine learning applications, such
as text mining, arti cial vision and bio-medical. Our interest is mainly focused
on the high dimensionality of the data. The large amount of high-dimensional
data has imposed a great challenge on existing machine learning methods. The
presence of noisy, redundant and irrelevant dimensions can make the learning
algorithms very slow and can also generate di culties when interpreting the
resulting models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In machine learning and statistics, the feature selection is
the process of selecting a subset of relevant characteristics to use in building the
model. Attribute selection methods greatly in uence the success of data mining
processes by reducing computational time and improving learning metrics, for
this reason we propose a new attribute technique selection based on Illumination
Algorithm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>This paper is organized as follows. Section 2 introduces to the features
selection problem. Then, we describe the Illumination search algorithms in
Section 3, speci cally the Map-Elites algorithm. The section 4 contains the
proposal of this paper and, nally, in the last section there is a brief discussion
of the results obtained so far.</p>
    </sec>
    <sec id="sec-2">
      <title>The Feature Selection Problem</title>
      <p>
        A feature selection algorithm basically is the combination of a search technique
to propose new subsets of features, with an evaluation measure that quali es the
di erent subsets. The simplest algorithm is to test every possible subset of
features to nd the one that minimizes the error rate. This is an exhaustive search
of space, and it is computationally intractable except for the smallest feature
sets; i.e. for n attributes, there are 2n solutions. The choice of the evaluation
metric has a great in uence on the algorithm, and they can distinguish among
three main categories: wrap methods, lter methods and embedded methods
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Wrappers use a search algorithm to search through the space of possible
features and evaluate each subset by running a model on the subset. Wrappers
can be computationally expensive and have a risk of over tting to the model.
Filters are similar to Wrappers in the search approach, but instead of
evaluating against a model, a simpler lter is evaluated. Embedded techniques are
embedded in and speci c to a model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Many popular search approaches use greedy hill climbing, which iteratively
evaluates a candidate subset of features, then modi es the subset and evaluates
if the new subset is an improvement over the old. Evaluation of the subsets
requires a scoring metric that grades a subset of features.</p>
      <p>
        Search approaches applied to the feature selection include: exhaustive, best
rst, simulated annealing, genetic algorithm, greedy forward selection, greedy
backward elimination, particle swarm optimization, targeted projection pursuit,
Scatter Search, Variable Neighborhood Search [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ].
      </p>
      <p>
        Genetic algorithm (GA) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] method due to the capability to evolve new
features of the selected features and a vast exploration of the search space for
new tter solutions. GA includes a subset of the growth-based optimization
methods aiming at the use of the GA operators such as selection, mutation
and recombination to a population of challenging problem solutions. GA has
been e ectively applied to several optimization problems such as classi cation
tasks and pattern recognition. The GA's stochastic component does not rule
out excitedly dissimilar solutions, which may give the better result. This has
the advantage that, given su cient time and a well bOlli1ded problem, the
algorithm can discover a global optimum. It is well suited to feature selection
problems because of the above reason. In the next section it will describe the
MAP Elites algorithm based on genetic algorithms.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>MAP Elites</title>
      <p>
        The Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] algorithm
illuminates search spaces, which produces a large diversity of high-performing,
yet qualitatively di erent solutions, which can be more helpful than a single,
high-performing solution. Interestingly, because MAP-Elites explores more of
the search space, it also tends to nd a better overall solution than
state-of-theart search algorithms. This is because MAP-Elites illuminates the relationship
between performance and dimensions of interest in solutions. MATP-Elites
returns a set of high-performing and improves the state-of-the-art for nding
a single, best solution, it will catalyze advances throughout all science and
engineering elds.
      </p>
      <p>
        MAP-Elites is quite simple. First, the user chooses a performance measure
f(x) that evaluates a solution x. Second, the user chooses N dimensions of
variation of interest that de ne a feature space of interest to the user. Each
dimension of variation is discretized based on user preference or available
computational resources. Given a particular discretization, MAP-Elites will search
for the highest performing solution for each cell in the N-dimensional feature
space. The search is conducted in the search space, which is the space of all
possible values of x, where x is a description of a candidate solution [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
4
      </p>
    </sec>
    <sec id="sec-4">
      <title>MAP-Elites for Feature Selection</title>
      <p>In this paper we propose to use the Map-Elites algorithm as an innovative
technique for features selection in the automatic learning process. The challenge
of this problem is that the inputs variable are binary one, whereas the basic
Map-Elites was designed for real numbers. Therefore, the main objective of
this proposal will be to create a search space for a MAP-Elites for the binary
variables given the feature selection is a combinatorial problem.</p>
      <p>To face this challenge, we represent a set of solutions as a vector with the
indexes of the selected features. Algorithm 1 shows the pseudo-code of the
Combinatorial MAP-Elites proposed. To create the map that allows us to distribute
the solutions in the search space, we de ne the number of cells as the algorithm
input parameter NC. Subsequently, this parameter is used to calculate a
number of xed features per cell NFF, which are used as cell identi ers and help
determine which cell of the map each solution will be associated with (1). We
also use two more input parameters for the algorithm that are typical of genetic
algorithms such as the number of iterations I and the number of initial genomes
G.</p>
      <p>MAP-Elites starts by randomly generating G genomes (solutions coded) and
determining the performance and features of each (2). In a random order, those
genomes are placed into the cells to which they belong in the feature space (if
multiple genomes map to the same cell, the highest-performing one per cell is
retained). At that point the algorithm is initialized, and the following steps are
repeated until a termination criterion is reached I. A cell in the map is randomly
chosen and the genome in that cell produces an o spring via mutation and/or
crossover (3). The features and performance of that o spring are determined,
and the o spring is placed in the cell if the cell is empty or if the o spring
is higher-performing than the current occupant of the cell, in which case that
occupant is replaced by the new solution. The algorithm returns a map with
the best solution found for each cell along with the corresponding tness.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>
        The proposed Map-Elites tries to determine the minimum set of features that
maximizes learning accuracy. This preliminary experiment was conducted using
di erent data sets from real scenarios obtained from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Table 1 presents
performance of the algorithm where solutions were evaluated using Bayes Classi er
and 2-fold cross validation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this experiment the input parameters for the
algorithm were 5,000 iterations (I ), a map of 8 cells (NC ) and the number of
initial genomes (G ) equal to 500. In addition, we able to see the accuracy
obtained by the proposed algorithm is promissory and at the same time it reduces
the number of features.
      </p>
      <p>Currently, the performance of Map-Elites is being tested and compared with
the competitive algorithms of the-state-of-the-art.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Jean-Baptiste Mouret</surname>
            and
            <given-names>Je</given-names>
          </string-name>
          <string-name>
            <surname>Clune</surname>
          </string-name>
          . \
          <article-title>Illuminating search spaces by mapping elites"</article-title>
          .
          <source>CoRR</source>
          .
          <year>2015</year>
          .vol.
          <source>abs/1504.04909</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Felix</given-names>
            <surname>Garc</surname>
          </string-name>
          <article-title>a Lopez, Miguel Garc a-Torres, Belen Melian Batista, Jose A. Moreno Perez, and</article-title>
          <string-name>
            <given-names>J.</given-names>
            <surname>Marcos</surname>
          </string-name>
          Moreno-Vegatitle.
          <article-title>\Solving feature subset selection problem by a Parallel Scatter Search"</article-title>
          .
          <source>European Journal of Operational Research</source>
          .
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Miao</surname>
            ,
            <given-names>Jianyu</given-names>
          </string-name>
          <string-name>
            <surname>Niu</surname>
          </string-name>
          , Lingfeng. (
          <year>2016</year>
          ).
          <article-title>A Survey on Feature Selection</article-title>
          .
          <source>Procedia Computer Science</source>
          .
          <volume>91</volume>
          .
          <fpage>919</fpage>
          -
          <lpage>926</lpage>
          .
          <fpage>10</fpage>
          .1016/j.procs.
          <year>2016</year>
          .
          <volume>07</volume>
          .111.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Garcia-Torres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gomez-Vela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Melian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.M.</given-names>
            <surname>Moreno-Vega</surname>
          </string-name>
          .
          <article-title>Highdimensional feature selection via feature grouping: A Variable Neighborhood Search approach</article-title>
          , Information Sciences, vol.
          <volume>326</volume>
          , pp.
          <fpage>102</fpage>
          -
          <lpage>118</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Sindhiya</surname>
            ,
            <given-names>S Selvaraj</given-names>
          </string-name>
          , Gunasundari. (
          <year>2015</year>
          ).
          <article-title>A survey on genetic algorithm based feature selection for disease diagnosis system</article-title>
          .
          <source>Proceedings of ICCCS 2014 - IEEE International Conference on Computer Communication and Systems</source>
          .
          <volume>164</volume>
          -
          <fpage>169</fpage>
          .
          <fpage>10</fpage>
          .1109/ICCCS.
          <year>2014</year>
          .
          <volume>7068187</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>