NDSE: Instance Generation for Classification by Given Meta-Feature Description

Alexey Zabashta and Andrey Filchenkov
ITMO University, Computer Technology Lab, Kronverksky pr. 49, 197101 St. Petersburg, Russia
{azabashta,afilchenkov}@corp.ifmo.ru

Introduction. Meta-features [1] are variables that reflect various properties of datasets. In machine learning, they are typically used for two purposes. The first is predicting algorithm performance on datasets never seen before, that is, meta-learning [2]. The second is analyzing algorithms and finding their weaknesses [3]. Meta-features thus provide vector representations of datasets, which can then be treated as points in the corresponding space. The key problem is that this space is extremely sparse if only real-world datasets are used. Researchers have recognized this problem, and several methods exist for generating new datasets, e.g. [4]. However, most of them generate datasets in an uncontrolled manner, with no estimate of how biased the resulting collection may be in terms of the meta-feature space, which makes them unsuitable for algorithm analysis. This paper addresses the dataset generation problem in a guided way: the method we propose generates a dataset whose meta-feature vector is as close as possible to a given target meta-feature vector.

Related work. We found only two papers devoted to the problem we solve. In both, the authors reduce it to a minimization problem in which the objective function is the distance between the target and resulting datasets in a meta-feature space, and they solve it with different evolutionary techniques. In [5], the authors explicitly represent datasets for a genetic algorithm as vectors in a #attributes × #objects dimensional space. We refer to the search in this space as Vect. In [6], the authors use the CMA-ES algorithm to optimize the parameters of a Gaussian mixture distribution, from which they sample the objects of the resulting dataset. We refer to this approach as GMM.

NDSE Method. The NDSE method (Natural DataSets Evolution), which we presented earlier [7], combines the advantages of both related approaches: it also uses evolutionary algorithms, but its operators have direct access to datasets rather than to intermediate representations. NDSE is based on operators that add and remove attributes or objects of a dataset. These operators correspond to feature extraction, feature selection, oversampling, and undersampling, respectively. We use the following implementation of these operators: random removal of attributes and objects from a dataset; copying of a random subset of objects of the same class to add a new object; and applying a random function of the attributes, the class label, and random noise to add a new attribute.

In order to apply evolutionary algorithms, mutation and crossover operators are required. The mutation operator uses the adding and removing operators to change the number of attributes or objects in a dataset. The crossover operator merges two datasets into a single large dataset and then splits it by attributes into two resulting datasets. A sketch of these operators is given below.
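The following Python sketch (ours, not the authors' code) illustrates one way such operators could act directly on a dataset stored as an object matrix X and a label vector y. The uniform choice among the four edit operations, the random linear function used for new attributes, the averaging of a three-object subset, and the simplified label handling in crossover are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutate(X, y):
    """One NDSE-style edit: add or remove a random attribute or object (a sketch)."""
    op = rng.integers(4)
    if op == 0 and X.shape[1] > 1:                  # remove attribute (feature selection)
        X = np.delete(X, rng.integers(X.shape[1]), axis=1)
    elif op == 1:                                   # add attribute (feature extraction):
        w = rng.normal(size=X.shape[1])             # a random function of the attributes,
        new = X @ w + y + rng.normal(size=len(y))   # the class label and noise
        X = np.column_stack([X, new])               # (assumes numeric class labels)
    elif op == 2 and X.shape[0] > 2:                # remove object (undersampling)
        keep = np.delete(np.arange(len(y)), rng.integers(len(y)))
        X, y = X[keep], y[keep]
    else:                                           # add object (oversampling): build one
        c = rng.choice(y)                           # new object from a random subset of
        idx = np.flatnonzero(y == c)                # a single class (averaging is our guess)
        sub = rng.choice(idx, size=min(3, len(idx)), replace=False)
        X = np.vstack([X, X[sub].mean(axis=0)])
        y = np.append(y, c)
    return X, y

def crossover(Xa, ya, Xb, yb):
    """Merge two datasets column-wise, then split the pooled attributes into
    two children (object alignment and label handling are simplified here)."""
    n = min(len(ya), len(yb))                       # align object counts for stacking
    merged = np.column_stack([Xa[:n], Xb[:n]])
    cols = rng.permutation(merged.shape[1])         # shuffle the pooled attributes
    cut = merged.shape[1] // 2
    return (merged[:, cols[:cut]], ya[:n]), (merged[:, cols[cut:]], yb[:n])
```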
NDSE works as follows. It receives a list of meta-features (functions of a dataset) and a target meta-feature vector as input. The vanilla version of the method (which we refer to as NDSE) starts with an empty dataset, while an improved version (which we refer to as NDSE’) uses real-world datasets as its initial population. During its run, the method follows a specific evolutionary algorithm, which is a hyperparameter of the method and which employs the suggested mutation and crossover operators. The fitness function is the distance between the target meta-feature vector and the meta-feature vector of a generated dataset.

Experiments. We conducted two experiments. In both, the target vector consists of the meta-features of a real binary-classification dataset from OpenML (OpenML.org). For the methods that use real-world datasets as the initial population, the target dataset is excluded from it. We use different evolutionary strategies implemented in the jMetal library (jmetal.github.io/jMetal). We compare NDSE and NDSE’ with Vect, Vect’ (Vect with real-world datasets as the initial population), and GMM. The error is the resulting fitness-function value, that is, the normalized Euclidean distance between the target and resulting datasets: $E(\theta) = \sum_i \left( (f_i - \theta_i) / \sigma_i \right)^2$, where $\theta_i$ and $f_i$ are the $i$-th meta-feature values of the target and resulting datasets respectively, and $\sigma_i$ is the standard deviation of this meta-feature over real-world datasets. We use 29 meta-features, covering general information, information-theoretic and statistical metrics, and characteristics of a decision tree. A minimal sketch of this fitness computation follows.
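The sketch below (ours) computes $E(\theta)$ directly from the definition above, assuming each of the 29 meta-features is available as a callable that maps a dataset to a scalar; this interface is our assumption, as the paper does not specify one.

```python
import numpy as np

def fitness(meta_features, dataset, theta, sigma):
    """Normalized distance between a candidate dataset's meta-feature
    vector and the target vector theta, as in E(theta) above.

    meta_features: callables, each mapping a dataset to one scalar value;
    sigma: per-meta-feature standard deviation over real-world datasets.
    """
    f = np.array([mf(dataset) for mf in meta_features])
    return float(np.sum(((f - theta) / sigma) ** 2))

# Example: minimize this value with any black-box/evolutionary optimizer.
```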
Each run in the Short-Run experiment is limited to 2,000 generated datasets, while the limit is 10^5 for the Long-Run experiment. The results of the experiments are shown in Table 1.

Table 1. Experimental results: average error over the 298 target datasets for each evolutionary strategy in the Short-Run experiment, and minimal average error on three target datasets in the Long-Run experiment.

        Short-Run (evolutionary algorithms)             Long-Run (datasets)
        CMAES  MOCell  NSGAII  RAND   SMSOA  SPEA2   fertility   vote   Australian
GMM     4.51   6.48    6.08    6.76   6.02   6.08    2.39·10^-5  0.91    6.54
Vect    —      13.29   13.66   21.57  12.55  13.06   1.67·10^-9  5.50   12.29
NDSE    —      2.61    2.83    3.89   2.90   2.80    3.13·10^-5  0.07    0.51
Vect’   —      2.04    1.64    2.63   1.88   1.94    2.38·10^-9  0.69    2.21
NDSE’   —      1.84    1.75    6.95   1.74   1.60    1.18·10^-4  0.11    0.54

As can be seen, the proposed methods are superior to the baselines. Using real-world datasets as the initial population is always better than any alternative. The best result is achieved by NDSE’ with SPEA2.

Conclusion. In this paper, we proposed a method for generating datasets given their meta-feature description. The method outperformed the baselines.

Acknowledgments. This work was financially supported by the Russian Science Foundation, Agreement 17-71-30029, and the Russian Foundation for Basic Research, Grant 16-37-60115 mol_a_dk.

References

1. Filchenkov, A., Pendryak, A.: Datasets meta-feature description for recommending feature selection algorithm. In: AINL-ISMW FRUCT. pp. 11-18 (2015)
2. Brazdil, P., Carrier, C.G., Soares, C., Vilalta, R.: Metalearning: Applications to Data Mining. Springer Science & Business Media (2008)
3. Smith-Miles, K., Baatar, D., Wreford, B., Lewis, R.: Towards objective measures of algorithm performance across instance space. Computers & Operations Research 45, 12-24 (2014)
4. Soares, C.: UCI++: Improved support for algorithm selection using datasetoids. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 499-506. Springer (2009)
5. Reif, M., Shafait, F., Dengel, A.: Dataset generation for meta-learning. In: Poster and Demo Track of the 35th German Conference on Artificial Intelligence (KI-2012). pp. 69-73 (2012)
6. Muñoz, M.A., Smith-Miles, K.: Generating custom classification datasets by targeting the instance space. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO '17). pp. 1582-1588. ACM, New York (2017)
7. Zabashta, A., Filchenkov, A.: NDSE: Method for classification instance generation given meta-feature description. In: AutoML Workshop, International Conference on Machine Learning (2017)