-

NDSE: Instance Generation for Classi cation by Given Meta-Feature Description

0 ITMO University, Computer Technology Lab , Kronverksky pr. 49, 197101 St.Petersburg , Russia

Introduction. Meta-features [ 1 ] are variables re ecting various properties of datasets. In machine learning, they are typically used for two purposes. The rst one is predicting algorithm performance on new datasets never seen before, that is meta-learning [ 2 ]. The second one is analyzing algorithms and nding their weaknesses [ 3 ]. Thus, meta-features allow obtaining vector representations of datasets and considering them as points in the corresponding space. The key problem is that such a space will be extremely sparse if we use only real-world datasets. This problem was recognized by researchers and several methods exists for generating new datasets, e.g. [ 4 ]. However, most of them generate datasets in an uncontrolled manner, without any guesses on how biased the resulting set of datasets may be in terms of meta-feature space, which does not allow using them for algorithm analysis. This paper is devoted to solve the dataset generation problem in a guided way. The method we propose generates a dataset such that its meta-feature vector is as close as possible to a target meta-feature vector.

Related works. We found only two papers devoted to the problem we solve. In both of them, the authors solve this problem by reducing it to a minimization problem, in which the target function is the distance between target and resulting datasets in a meta-feature space. To solve this problem, they use di erent evolutionary techniques. In [ 5 ], the authors explicitly represent datasets for the genetic algorithm as vectors in #attributes #objects dimensional space. We will refer to the search in this space as Vect. In [ 6 ], the authors use the CMA-ES algorithm to optimize parameters of the Gaussian mixture distribution, which they used for sampling objects the of resulting dataset. We will refer to it as GMM.

NDSE Method. The NDSE method (Natural DataSets Evolution), that we present earlier [ 7 ], shares the advantages of both related approaches. This method also uses evolutionary algorithms and its operators have direct access to datasets, not their intermediate representations.

NDSE is based on using operators of adding and removing attributes or objects from a dataset. These operators are equivalent to feature extraction, feature selection, oversampling and undersampling, respectively. We use the following implementation of these operators: random removing of attributes and objects from a dataset, copying of random subsets of objects from the same class for adding new object to dataset, applying random functions, which depends on attributes, class label and random noise, for adding a new attribute to a dataset.

In order to apply evolutionary algorithms, mutation and crossover operators are required. Mutation operator uses the operators of adding and removing to change the number of attributes or objects in a dataset. Crossover operator merges two datasets into a single big dataset and then splits it by attributes into two resulting datasets.

NDSE works as follows. It receives a list of meta-features (functions of dataset) and a target meta-feature vector as input. Vanilla version of this method (which we will refer to as NDSE) starts with an empty dataset, while an improved version (which we will refer to as NDSE') uses real-world datasets as an initial population. During its runtime, the method follows a speci c evolutionary algorithm, which is the hyperparameter of this method, and which exploits suggested mutation and crossover operators. Fitness function is the distance between the target meta-feature vector and meta-feature vector of a generated dataset.

Experiments. We conduct two experiments. In both of them, the target vector is meta-features of a real dataset for binary classi cation from OpenML1. We excluded a target dataset from the initial population for the methods that use them in that way. We use di erent evolutionary strategies implemented in jMetal2 library. We compare NDSE and NDSE' with Vect, Vect' (uses realworld datasets as an initial population) and GMM. The error is the resulting tness-function value, that is the normalized Euclidean distance between the target and resulting datasets: E( ) = Pi((fi i)= i)2, where i and fi are i-th meta-feature values of the target and resulting datasets correspondingly, and i is its standard deviation for real-world datasets. We use 29 meta-features, such as general information, informational and statistical metrics, and characteristic of decision tree. Each run in the Short-Run experiment is limited by the 2000 generated datasets, while this limitation is 105 for the Long-Run experiment. The results of the experiments are shown in Table 1. As it can be seen, the proposed methods are superior to the baselines. Usage of real-world datasets as an initial population is always better than usage of anything else. The best result is shown by NDSE' with SPEA2. 1 OpenML.org 2 jmetal.github.io/jMetal

1. Filchenkov , A. , Pendryak , A. : Datasets meta-feature description for recommending feature selection algorithm . In: AINL-ISMW FRUCT . pp. 11 { 18 ( 2015 )

2. Brazdil , P. , Carrier , C.G. , Soares , C. , Vilalta , R.: Metalearning: Applications to data mining . Springer Science & Business Media ( 2008 )

3. Smith-Miles , K. , Baatar , D. , Wreford , B. , Lewis , R. : Towards objective measures of algorithm performance across instance space . Computers & Operations Research 45 , 12 { 24 ( 2014 )

4. Soares , C. : Uci++: Improved support for algorithm selection using datasetoids . In: Paci c-Asia Conference on Knowledge Discovery and Data Mining . pp. 499 { 506 . Springer ( 2009 )

5. Reif , M. , Shafait , F. , Dengel , A. : Dataset generation for meta-learning . Poster and Demo Track of the 35th German Conference on Arti cial Intelligence (KI-2012 ) pp. 69 { 73 ( 2012 )

6. Mun~oz, M.A. , Smith-Miles , K. : Generating custom classi cation datasets by targeting the instance space . In: Proceedings of the Genetic and Evolutionary Computation Conference Companion . pp. 1582 { 1588 . GECCO '17, ACM , New York ( 2017 )

7. Zabashta , A. , Filchenkov , A. : NDSE: Method for classi cation instance generation given meta-feature description . In: AutoML Workshop, International Conference on Machine Learning ( 2017 )