<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NDSE: Instance Generation for Classi cation by Given Meta-Feature Description</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University, Computer Technology Lab</institution>
          ,
          <addr-line>Kronverksky pr. 49, 197101 St.Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Introduction. Meta-features [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are variables re ecting various properties
of datasets. In machine learning, they are typically used for two purposes. The
rst one is predicting algorithm performance on new datasets never seen before,
that is meta-learning [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The second one is analyzing algorithms and nding
their weaknesses [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Thus, meta-features allow obtaining vector representations
of datasets and considering them as points in the corresponding space. The key
problem is that such a space will be extremely sparse if we use only real-world
datasets. This problem was recognized by researchers and several methods exists
for generating new datasets, e.g. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, most of them generate datasets
in an uncontrolled manner, without any guesses on how biased the resulting set
of datasets may be in terms of meta-feature space, which does not allow using
them for algorithm analysis. This paper is devoted to solve the dataset generation
problem in a guided way. The method we propose generates a dataset such that
its meta-feature vector is as close as possible to a target meta-feature vector.
      </p>
      <p>
        Related works. We found only two papers devoted to the problem we solve.
In both of them, the authors solve this problem by reducing it to a minimization
problem, in which the target function is the distance between target and
resulting datasets in a meta-feature space. To solve this problem, they use di erent
evolutionary techniques. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the authors explicitly represent datasets for the
genetic algorithm as vectors in #attributes #objects dimensional space. We
will refer to the search in this space as Vect. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the authors use the CMA-ES
algorithm to optimize parameters of the Gaussian mixture distribution, which
they used for sampling objects the of resulting dataset. We will refer to it as
GMM.
      </p>
      <p>
        NDSE Method. The NDSE method (Natural DataSets Evolution), that
we present earlier [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], shares the advantages of both related approaches. This
method also uses evolutionary algorithms and its operators have direct access to
datasets, not their intermediate representations.
      </p>
      <p>NDSE is based on using operators of adding and removing attributes or
objects from a dataset. These operators are equivalent to feature extraction,
feature selection, oversampling and undersampling, respectively. We use the
following implementation of these operators: random removing of attributes and
objects from a dataset, copying of random subsets of objects from the same class
for adding new object to dataset, applying random functions, which depends on
attributes, class label and random noise, for adding a new attribute to a dataset.</p>
      <p>In order to apply evolutionary algorithms, mutation and crossover operators
are required. Mutation operator uses the operators of adding and removing to
change the number of attributes or objects in a dataset. Crossover operator
merges two datasets into a single big dataset and then splits it by attributes
into two resulting datasets.</p>
      <p>NDSE works as follows. It receives a list of meta-features (functions of dataset)
and a target meta-feature vector as input. Vanilla version of this method (which
we will refer to as NDSE) starts with an empty dataset, while an improved
version (which we will refer to as NDSE') uses real-world datasets as an initial
population. During its runtime, the method follows a speci c evolutionary
algorithm, which is the hyperparameter of this method, and which exploits suggested
mutation and crossover operators. Fitness function is the distance between the
target meta-feature vector and meta-feature vector of a generated dataset.</p>
      <p>Experiments. We conduct two experiments. In both of them, the target
vector is meta-features of a real dataset for binary classi cation from OpenML1.
We excluded a target dataset from the initial population for the methods that
use them in that way. We use di erent evolutionary strategies implemented in
jMetal2 library. We compare NDSE and NDSE' with Vect, Vect' (uses
realworld datasets as an initial population) and GMM. The error is the resulting
tness-function value, that is the normalized Euclidean distance between the
target and resulting datasets: E( ) = Pi((fi i)= i)2, where i and fi are i-th
meta-feature values of the target and resulting datasets correspondingly, and i
is its standard deviation for real-world datasets. We use 29 meta-features, such
as general information, informational and statistical metrics, and characteristic
of decision tree. Each run in the Short-Run experiment is limited by the 2000
generated datasets, while this limitation is 105 for the Long-Run experiment.
The results of the experiments are shown in Table 1.
As it can be seen, the proposed methods are superior to the baselines. Usage
of real-world datasets as an initial population is always better than usage of
anything else. The best result is shown by NDSE' with SPEA2.
1 OpenML.org
2 jmetal.github.io/jMetal</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Filchenkov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pendryak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Datasets meta-feature description for recommending feature selection algorithm</article-title>
          .
          <source>In: AINL-ISMW FRUCT</source>
          . pp.
          <volume>11</volume>
          {
          <issue>18</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carrier</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vilalta</surname>
          </string-name>
          , R.: Metalearning:
          <article-title>Applications to data mining</article-title>
          . Springer Science &amp; Business
          <string-name>
            <surname>Media</surname>
          </string-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Smith-Miles</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baatar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wreford</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Towards objective measures of algorithm performance across instance space</article-title>
          .
          <source>Computers &amp; Operations Research</source>
          <volume>45</volume>
          ,
          <volume>12</volume>
          {
          <fpage>24</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Uci++:
          <article-title>Improved support for algorithm selection using datasetoids</article-title>
          .
          <source>In: Paci c-Asia Conference on Knowledge Discovery and Data Mining</source>
          . pp.
          <volume>499</volume>
          {
          <fpage>506</fpage>
          . Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Reif</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shafait</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dengel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Dataset generation for meta-learning</article-title>
          .
          <source>Poster and Demo Track of the 35th German Conference on Arti cial Intelligence (KI-2012</source>
          ) pp.
          <volume>69</volume>
          {
          <issue>73</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Mun~oz,
          <string-name>
            <given-names>M.A.</given-names>
            ,
            <surname>Smith-Miles</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          :
          <article-title>Generating custom classi cation datasets by targeting the instance space</article-title>
          .
          <source>In: Proceedings of the Genetic and Evolutionary Computation Conference Companion</source>
          . pp.
          <volume>1582</volume>
          {
          <fpage>1588</fpage>
          . GECCO '17,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Zabashta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Filchenkov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>NDSE: Method for classi cation instance generation given meta-feature description</article-title>
          . In: AutoML Workshop, International Conference on Machine Learning (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>