Layered TPOT
Speeding up Tree-based Pipeline Optimization

Pieter Gijsbers1, Joaquin Vanschoren1, and Randal S. Olson2
1 Technische Universiteit Eindhoven
2 University of Pennsylvania

Abstract. With the demand for machine learning increasing, so does the demand for tools which make it easier to use. Automated machine learning (AutoML) tools have been developed to address this need, such as the Tree-based Pipeline Optimization Tool (TPOT), which uses genetic programming to build optimal pipelines. We introduce Layered TPOT, a modification to TPOT which aims to create pipelines as good as those of the original, but in significantly less time. This approach evaluates candidate pipelines on increasingly large subsets of the data according to their fitness, using a modified evolutionary algorithm to allow for separate competition between pipelines trained on different sample sizes. Empirical evaluation shows that, on sufficiently large datasets, Layered TPOT indeed finds better models faster.

1 Introduction

The field of Automated Machine Learning (AutoML) aims to automate many of the tasks required to construct machine learning models, hence lowering the barrier to entry and yielding better models, faster. AutoML methods typically automate one or more steps in the creation of useful machine learning pipelines, such as the selection of preprocessing or learning algorithms, hyperparameter optimization, or a combination of them. A few methods even construct and optimize entire pipelines, such as the Tree-based Pipeline Optimization Tool (TPOT) [14]. TPOT uses genetic programming to evolve optimal pipelines, aiming to find machine learning pipelines that yield accurate predictive models while keeping the pipeline as simple as possible.

In this paper, we introduce a novel improvement of TPOT, aimed at reducing the time needed to evaluate pipelines without reducing the quality of the final pipeline. Indeed, the most time-consuming part of the optimization process is evaluating the performance of candidate machine learning pipelines. In our modification, this time is reduced by initially evaluating the pipelines on a small subset of the data, and only allowing promising pipelines to be evaluated on the full dataset. To do this in a fair manner, the evolutionary algorithm is modified to prevent direct comparison between pipelines which are evaluated on different subsets of the data. As such, we aim to find pipelines of similar quality in much less time, making the tool more accessible and practical by requiring less computational time. We call this improvement Layered TPOT (LTPOT).

This paper is organized as follows. First, we review related work in Sect. 2. Then, in Sect. 3, we discuss the proposed modification to TPOT in detail. Next, in Sect. 4, we lay out how LTPOT is evaluated, discuss the results of these evaluations, and propose aspects of LTPOT which can be researched in future work. Finally, we conclude the study in Sect. 5.

2 Related Work

The field of AutoML is a culmination of work in the fields of algorithm selection, hyperparameter optimization and machine learning. Several AutoML systems support at least some form of automatic pipeline construction.

Auto-WEKA [18] uses Sequential Model-based Algorithm Configuration (SMAC) to do combined algorithm selection and hyperparameter optimization [10]; SMAC is an adaptation of Sequential Model-Based Optimization from the statistical literature.
Auto-WEKA uses algorithms from the WEKA library [8] to build pipelines with one learner (possibly an ensemble) and optionally a single feature selection technique. Auto-sklearn [6], built on the Python library scikit-learn [15], is a re-implementation of Auto-WEKA with two extensions. The first is the use of meta-learning to warm-start the Bayesian optimization procedure, a technique which has earlier been proven useful in [7]. The second is the automatic construction of ensembles from models evaluated during the optimization process. Furthermore, auto-sklearn allows for more preprocessing steps to be included in the pipeline: one feature preprocessor, which covers feature extraction as well as feature selection techniques, in addition to up to three data preprocessing methods, such as missing value imputation and rescaling.

TPOT differs from Auto-sklearn and Auto-WEKA by using an evolutionary algorithm instead of SMAC. Additionally, TPOT uses a tree-based structure to represent pipelines, and considers pipelines with any number of preprocessing steps. Hence, TPOT is constrained in neither the number nor the order of preprocessing steps.

The main idea of LTPOT is to first evaluate pipelines on a subset of the data, to get an indication of whether or not a pipeline is promising relative to other pipelines. This idea has been explored before, for example in sample-based active testing [1] and in algorithm selection using learning curves [16], where promising algorithm configurations are first evaluated on a smaller data sample to create a proxy for their performance on the full dataset. Subsets of the data are also used to evaluate machine learning configurations in Hyperband [11]. Hyperband dynamically allocates resources to more promising algorithm configurations, based on experiments executed with gradually more resources. One explored application uses data samples as the resource, evaluating algorithm configurations on increasingly larger subsets of the data. On a specific selection of datasets, this technique was shown to improve over Random Search, SMAC and Tree-structured Parzen Estimators, as introduced in [3]. To the best of our knowledge, however, it has not previously been used in combination with evolutionary algorithms for pipeline construction.

3 Methods

In this section, we describe the modifications made to TPOT. First, we give a brief description of how TPOT constructs and optimizes machine learning pipelines. Then, we describe the Layered TPOT (LTPOT) structure and motivate its design.

3.1 Structure of TPOT

To construct and optimize machine learning pipelines, TPOT uses tree-based genetic programming [2]. Pipelines are represented by genetic programming trees. An example of a tree representation of a machine learning pipeline is shown in Fig. 1. The tree consists of several nodes, which can be either Primitives or Terminals, depicted in Fig. 1 as squares and ellipses, respectively. Primitives are operators which require input, such as an algorithm requiring data and hyperparameter values. Terminals are constants which can be passed to primitives. Finally, a primitive can also be used as input to another primitive, as can be seen in Fig. 1, where the StandardScaler primitive provides the scaled dataset to the Bernoulli Naive Bayes algorithm. Information about required input and output types is used to ensure that only valid pipelines are constructed.

[Figure 1 diagram: a pipeline tree in which a Bernoulli Naive Bayes primitive receives the terminal Alpha = 1.0 and the output of a StandardScaler primitive, which in turn receives the Dataset terminal.]

Fig. 1: A visual representation of a tree-based machine learning pipeline that applies standard scaling to the data before learning a Bernoulli Naive Bayes model.
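As a concrete illustration, the pipeline of Fig. 1 corresponds to the following scikit-learn pipeline. This is a minimal sketch for readers unfamiliar with the representation; TPOT internally encodes it as a genetic programming tree rather than as a Pipeline object, and the variable names here are ours.

```python
# Sketch of the Fig. 1 pipeline expressed with scikit-learn (illustrative only):
# the data is standard-scaled and then passed to Bernoulli Naive Bayes with alpha=1.0.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import BernoulliNB

fig1_pipeline = Pipeline([
    ("scaler", StandardScaler()),              # primitive: takes the Dataset terminal
    ("bernoulli_nb", BernoulliNB(alpha=1.0)),  # primitive: takes scaled data and the Alpha terminal
])
# Usage: fig1_pipeline.fit(X_train, y_train); fig1_pipeline.predict(X_test)
```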
The evolutionary algorithm uses these machine learning pipelines as its individuals. It performs mutation, for example changing a hyperparameter or adding a preprocessing step, as well as crossover, by selecting two pipelines which share a primitive, which allows them to exchange subtrees or leaves. Finally, pipelines are evaluated and assigned a fitness score, so that a selection procedure can determine which individuals should be in the next generation.

As mentioned earlier, in theory these pipeline trees could be arbitrarily large. However, very extensive machine learning pipelines are often undesirable. With more hyperparameters, long pipelines can be harder to tune, be more prone to overfitting, hinder understanding of the final model, and require more time to be evaluated, slowing down the optimization process. For these reasons, we use a multiobjective optimization technique, NSGA-II [4], to select individuals based on the Pareto front of the trade-off between pipeline length and performance.

3.2 Layered TPOT

Concept. During the TPOT optimization process, most time is spent on evaluating machine learning pipelines. Every machine learning pipeline is evaluated on the full dataset, which can take a lot of time. Layered TPOT (LTPOT) aims to reduce the number of pipelines evaluated on the entire dataset through a selection process, so that only the most promising pipelines need to be evaluated on the full dataset. It achieves this by evaluating pipelines on a small subset of the data; only if a pipeline exhibits good performance on that subset will it be evaluated on more data.

Age-Layered Population. To incorporate a fair evaluation of pipelines on subsets gradually increasing in size, we designed a layered structure to separate competition among individuals. In this, we were inspired by the Age-Layered Population Structure (ALPS) [9], where individuals are segregated into layers to reduce the problem of premature convergence. Premature convergence occurs when the individuals in the population converge to a good local optimum, which means that any new individuals are unlikely to be competitive with this local optimum and are filtered out of the population by selection before they themselves can converge to a local optimum. In ALPS, each individual is given an age which increases as the individual or its offspring remains in the population, and breeding and selection are performed only within separate age groups called layers. This segregation is important, because otherwise the old, locally well-optimized individuals would often prevent young individuals from surviving the multiple generations they need to get closer to their own local optimum.

Layers in LTPOT. In LTPOT, we wish to evaluate pipelines on subsets of the data. However, the performance of a machine learning pipeline is influenced by the amount of training data it receives. This means that when individuals are evaluated on different subsets of the data, their performance cannot directly be compared to one another. Therefore, the individuals are segregated into layers. At each layer, individuals are evaluated on a subset of a different size.
The layers are ordered such that the first layer uses the smallest subset to evaluate its individuals, and every layer above it increases the subset size. When an individual performs well in one layer, it will eventually be transferred to the next. This way, only the best pipelines are evaluated on the entire dataset in the last layer. A visual overview of the structure and flow of LTPOT is given in Fig. 2.

[Figure 2 diagram: four layers evaluated on N/8, N/4, N/2 and N samples; each layer passes its top k individuals to the next layer, the rest remain, and the first layer receives a newly generated population.]

Fig. 2: A visual overview of the Layered TPOT structure. After a certain number of generations, each layer passes its best k individuals on to the next layer, while the first layer is provided with a newly generated set of individuals.

Correlation of performance between layers. In the extreme, the selection procedure implicitly assumes that the relative performance of two pipelines is the same when evaluated on a subset of the dataset as it is on the entire dataset. However, this assumption does not always hold. The learning curves of two pipelines may cross, meaning that one pipeline performs better after being trained on a small subset of the data, while the other performs better when trained on the full dataset. Generally speaking, as pipelines are evaluated on more data, their relative performance correlates more strongly with the relative performance obtained when they are trained on the entire dataset. This is why our design evaluates the pipelines on gradually larger subsets of data, so that when a pipeline performs worse than expected as the subset size increases, it need not be evaluated on bigger subsets. Unfortunately, in the case where a pipeline has poor performance on a small subset but good performance on the entire dataset, LTPOT will not pick up this pipeline.

We set up an experiment to verify whether there is indeed a correlation between the performance of a pipeline on a sample of the dataset and its performance on the entire dataset. In this experiment, we evaluated 50 pipelines on 12 datasets with ten times 10-fold cross-validation, on various sample sizes as well as on the entire dataset. All datasets are part of the Penn Machine Learning Benchmark (PMLB) [13]. The average AUROC of each pipeline for each sample size was determined for each dataset. Sample sizes were {N/2^1, ..., N/2^5}, where N is the number of instances in the dataset. Because the evolutionary algorithm performs pipeline selection based on the ranking of the pipelines, rather than the value of the score metric, we ranked the averaged scores and computed the correlation of the rankings; a minimal sketch of this computation is given after Table 1.

dataset           instances  features  ρ_{N/2}  ρ_{N/4}  ρ_{N/8}  ρ_{N/16}  ρ_{N/32}
satimage          6435       36        0.984    0.903    0.833    0.759     0.641
clean2            6598       168       0.972    0.946    0.890    0.825     0.765
ann-thyroid       7200       21        0.982    0.967    0.946    0.924     0.828
twonorm           7400       20        0.978    0.929    0.898    0.759     0.829
mushroom          8124       22        0.973    0.962    0.935    0.867     0.781
agaricus-lepiota  8145       22        0.989    0.980    0.920    0.877     0.805
coil2000          9822       85        0.939    0.878    0.815    0.568     0.529
pendigits         10992      16        0.984    0.983    0.945    0.844     0.771
nursery           12958      8         0.995    0.995    0.857    0.917     0.917
magic             19020      10        0.988    0.974    0.961    0.947     0.934
letter            20000      16        0.990    0.959    0.918    0.899     0.801
krkopt            28056      6         0.984    0.942    0.917    0.913     0.750

Table 1: An overview of the Spearman ρ calculated by comparing the ranking of pipelines trained on subsets of the data to the ranking on the entire dataset; p < 0.0001 in all cases. The subscripts in the column headers denote the size of the subset.
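As an illustration of how each entry in Table 1 can be obtained, the sketch below computes the Spearman rank correlation between per-pipeline AUROC scores on a sample and on the full dataset. The score vectors here are hypothetical placeholders; in the actual experiment they hold the ten-times 10-fold cross-validation AUROC averages of the 50 pipelines.

```python
# Minimal sketch of the ranking-correlation computation behind Table 1.
# The score vectors are random placeholders, not experimental results.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(seed=0)
auroc_full = rng.uniform(0.5, 1.0, size=50)                 # AUROC per pipeline on all N instances
auroc_sample = auroc_full + rng.normal(0.0, 0.02, size=50)  # AUROC per pipeline on N/2^k instances

# Spearman's rho is the correlation between the two rankings,
# which is what Table 1 reports per dataset and sample size.
rho, p_value = spearmanr(auroc_sample, auroc_full)
print(f"rho = {rho:.3f}, p = {p_value:.2g}")
```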
In Table 1, the Spearman ρ-values are displayed that signify the correlation between the ranking of pipelines trained on a sample of the dataset and the ranking of pipelines trained on the entire dataset. The p-values are omitted because they are smaller than 0.0001 in all cases, so all correlations are significant. From Table 1 we see that there is a positive correlation between the rankings for all sample sizes and datasets, and that the correlation gets stronger as the sample size gets closer to the full dataset size.

We experimented with various curve-fitting methods to extrapolate the learning curves of pipelines, so that crossing learning curves might be predicted earlier, but they did not improve the results. In future work, the use of meta-learning for learning curve extrapolation, such as in [16], will be tried.

3.3 Layered TPOT Algorithm

We will now give a more in-depth breakdown of the algorithm used in LTPOT. Algorithm 1 shows the core of the layered algorithm, LayeredEA, and Algorithm 2 gives descriptions of the subroutines called from LayeredEA.

Algorithm 1 Layered Evolutionary Algorithm
 1: function LayeredEA(population, S, g, G, D)
      population: a set of pipelines that will be the first generation
      S: set containing the sample size for each layer
      g: interval in generations at which a transfer should occur
      G: total number of generations
      D: dataset to construct a pipeline for
 2:   M ← |S|                         ▷ the number of layers
 3:   P ← |population|                ▷ the population size
 4:   L1 ← population
 5:   L2, ..., LM ← ∅
 6:   for i in 1..G do
 7:     for l in 1..M do
 8:       if Ll ≠ ∅ and i mod g < 2^(M−l+1) then
 9:         offspring ← VarOr(Ll, P)
10:         Evaluate(offspring, Sl, D)
11:         Ll ← Selection(Ll ∪ offspring, P)
12:       end if
13:     end for
14:     if i mod g = 0 then
15:       for l in (M − 1)..1 do
16:         Ll+1 ← Ll+1 ∪ Top(Ll, P/2)
17:       end for
18:       L1 ← NewPopulation(P)
19:     end if
20:   end for
21:   return Top(LM, 1)
22: end function

Algorithm 2 Functions called in Layered EA
1: function VarOr(Population, N)
     Performs mutation and crossover on the individuals in Population, creating N new individuals.
   end function
2: function Evaluate(Population, s, D)
     Evaluates each individual on a subset of dataset D, created by taking s instances by stratified sampling. Individuals are evaluated based on 3-fold CV accuracy and the number of components in the pipeline. Results are saved as attributes of the individuals.
   end function
3: function Selection(Population, p)
     Creates Pareto fronts based on the accuracy-pipeline complexity trade-off, then takes the first p individuals after ordering the population by the Pareto front they are in (individuals in the first front come first).
   end function
4: function Top(Population, k)
     Returns the k best individuals of the population, by constructing a Pareto front based on the accuracy-pipeline complexity trade-off.
   end function
5: function NewPopulation(P)
     Creates a new population of P individuals.
   end function

Selecting parameter values. Before calling LayeredEA, the number of layers as well as their sample sizes are defined. The sample size of a layer dictates how many instances are sampled from the dataset to create the subset that the pipelines in that layer are evaluated on. The subset is created by stratified uniform random sampling without replacement, and pipelines are evaluated on this subset with 5-fold cross-validation.
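The Evaluate subroutine can be approximated as follows. This is a sketch with assumed function and argument names, not the actual LTPOT implementation; the number of folds is left as a parameter since the text above uses 5-fold cross-validation while Algorithm 2 mentions 3-fold, and the second objective (pipeline length) is omitted for brevity.

```python
# Sketch of evaluating a pipeline on a stratified subsample of s instances.
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split

def evaluate_on_subset(pipeline, X, y, s, cv=5, random_state=0):
    """Return the mean cross-validated score of `pipeline` on a stratified
    uniform random sample (without replacement) of `s` instances from (X, y)."""
    if s < len(y):
        # train_test_split with an integer train_size and stratify=y yields a
        # stratified sample of exactly s instances.
        X_sub, _, y_sub, _ = train_test_split(
            X, y, train_size=s, stratify=y, random_state=random_state)
    else:
        X_sub, y_sub = X, y
    return np.mean(cross_val_score(pipeline, X_sub, y_sub, cv=cv))
```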
In this study, the sample sizes used in each layer are dictated by the size of the dataset. Let the dataset contain N instances; then the final layer always trains on the entire dataset, and each layer below it uses half of the data of the layer above. The number of layers is set to 4 for every dataset, so the respective sample sizes used at each layer are N/8, N/4, N/2 and N. In this paper we use the term higher layer loosely to denote layers which sample more of the dataset (i.e. the layer with sample size N/2 is higher than the layer with sample size N/4).

There are two ways to specify how long the main loop of lines 7 through 20 should run: a set number of generations G, or an amount of time. We chose not to work with a specific number of generations G, but instead let the main loop run for eight hours. For the parameter g, the number of generations between transfers, we experimented with the values 2 and 16. With a g value of 2, LTPOT acts almost as a filter, allowing only very little optimization in early layers. A value of 16, however, allows many generations of early optimization before individuals are passed through to the next layer. For the final part of the initialization, a population of size P is generated randomly.

LayeredEA. When calling LayeredEA, as shown in Algorithm 1, the first step is to define constants based on the input, specifically the number of layers M (line 2) and the size of the population P (line 3), as well as assigning the initial population to the first layer (line 4) and marking all other layers as empty (line 5). Then, the evolutionary algorithm iterates through the generations (lines 6-20), at each generation progressing the evolutionary algorithm in each layer (lines 7-13) and then, if required, transferring individuals from one layer to the next (lines 14-19).

Progressing the evolutionary algorithm in a layer (lines 7-13) only happens for layers which are active. This means that there must be a population in the layer (first clause on line 8), which it may not yet have if not enough transfers have taken place. Additionally, layers which evaluate on more data are not active every generation (second clause on line 8); this is motivated below.

When progressing the evolutionary algorithm, it executes the same steps as TPOT would. First, a new population is created from the individuals evaluated during the last generation in the same layer, by performing mutation and crossover (line 9). However, every time a layer is passed new individuals from a previous layer, as well as in the very first generation, the provided population is taken as is, without creating offspring.4 The new individuals are then evaluated on the sample of the data defined by their layer (line 10). Finally, based on the Pareto front of the trade-off between the performance score of the pipeline and the pipeline length, the best individuals are picked among the new individuals as well as last generation's (line 11).

4 This is not incorporated in the pseudo-code of Algorithm 1, to keep the general structure clear.

Then, every g generations, the best individuals from each layer are passed to the next one. In our configuration we chose to transfer half of the layer's population. The final pipeline chosen by LTPOT is the pipeline with the best score in the highest layer (line 21). A schematic sketch of this control flow is given below.
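The sketch below mirrors the control flow of Algorithm 1 in Python. The genetic operators (var_or, evaluate, select, top, new_population) are passed in as callables and correspond to the subroutines of Algorithm 2; their names are assumptions for illustration, not TPOT's actual API.

```python
def layered_ea(population, sample_sizes, g, n_generations, data,
               var_or, evaluate, select, top, new_population):
    """Schematic rendering of Algorithm 1 (LayeredEA)."""
    M = len(sample_sizes)          # number of layers; the last entry is the full dataset size
    P = len(population)            # population size per layer
    layers = [list(population)] + [[] for _ in range(M - 1)]

    for i in range(1, n_generations + 1):
        for l in range(1, M + 1):
            # A layer only runs when it has a population and is "on" this generation:
            # layer l is active for min(2^(M-l+1), g) out of every g generations.
            if layers[l - 1] and (i % g) < 2 ** (M - l + 1):
                offspring = var_or(layers[l - 1], P)            # mutation and crossover
                evaluate(offspring, sample_sizes[l - 1], data)  # CV on this layer's sample
                layers[l - 1] = select(layers[l - 1] + offspring, P)  # NSGA-II style selection

        if i % g == 0:
            # Every g generations, each layer passes its best P/2 individuals one layer up,
            # and the bottom layer is refilled with freshly generated pipelines.
            for l in range(M - 1, 0, -1):
                layers[l] = layers[l] + top(layers[l - 1], P // 2)
            layers[0] = new_population(P)

    return top(layers[-1], 1)  # best pipeline of the highest layer
```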
Optimizations. A few scenarios either require additional clarification or deviate from the above algorithm.

First, the first time a layer receives individuals from the layer before it, the selection procedure oversamples from this population so that the population in the layer also grows to size P (line 11). This is done so that in subsequent generations more variations of the original pipelines exist, allowing for a better search for optimal pipelines.

Secondly, if LTPOT runs for a specified number of generations, layers are turned off whenever their population can no longer reach the highest layer. For example, let LTPOT be configured with 4 layers. When LTPOT is less than 3g generations away from completion, any individual in the lowest layer can never reach the highest layer, rendering any results obtained in that layer useless. Whenever this happens, the respective layer no longer has its individuals evaluated or transferred to the next layer.

Next, there is the earlier mentioned restriction on activating layers, expressed in the second clause on line 8. LTPOT has a selection process in place for which pipelines are evaluated on the entire dataset, which means that at higher layers the pipelines in the population are likely already quite good. Because of these factors, we want to limit the exploration in higher layers. To do this, instead of running the evolutionary algorithm in each layer at every generation, higher layers can be turned off for some generations. In this study, for a layered structure with M layers, layer l is progressed for min(2^(M−l+1), g) generations out of every g generations, with l ∈ {1, ..., M}. This is demonstrated for g = 12 and M = 4 in Fig. 3, and checked in the second clause on line 8. We have not yet evaluated whether this leads to significant improvements.

Finally, to prevent any one pipeline from halting the algorithm, a pipeline's evaluation is automatically stopped if it exceeds a given time limit. If the evaluation is stopped this way, the pipeline is marked as a failure and is not considered as a parent for the next generation. This behavior is present in the original TPOT, and is adapted to LTPOT by further decreasing the time limit by layer. In the top layer, each individual is allowed the same evaluation time as it would be in TPOT. For lower layers, however, the time allowed decreases quadratically with the sample size (a layer with half the data gets a fourth of the time per individual).

[Figure 3 diagram: an on/off schedule over generations for Layers 1 through 4.]

Fig. 3: Not every layer runs the evolutionary algorithm every generation. This figure illustrates that higher layers are 'turned off' for some generations, meaning that no iterations of the evolutionary algorithm are executed in them.

4 Empirical evaluation

4.1 Experimental Questions

The goal of LTPOT is to find pipelines at least as good as TPOT's, but in less time. This also means that, given the same amount of time, LTPOT could very well find better pipelines. To assess whether this is achieved, we evaluate LTPOT in three ways.

First, we want to evaluate whether, given the same amount of time, LTPOT outperforms TPOT when their best found pipelines are compared. To do this, a ranking is constructed between TPOT and LTPOT for each dataset over time, by ranking the performance of the best pipeline found so far at regular time intervals. We omit a comparison to Random Search, as TPOT compared favorably to Random Search in earlier work [12].
Secondly, to quantify how much faster LTPOT is, we measure the time LTPOT needs to find a pipeline at least as good as the best pipeline found by TPOT, and compare it to the time TPOT needs to find this pipeline.

Finally, we also compare the Area Under the Receiver Operating Characteristic curve (AUROC) of the final pipelines found by each method, so we can quantify the difference in model quality between the methods.

4.2 Experimental setup

We compare LTPOT to the original TPOT by evaluating both on a selection of 18 large datasets. We specifically chose larger datasets so that there is a distinct difference in the time needed to evaluate individuals on the entire dataset versus just a subset. All datasets in the selection contain at least one hundred thousand instances, though most contain exactly one million. The selection includes pseudo-artificial datasets described in [17]. The datasets are available for download and inspection on OpenML5, an open database for machine learning experiments [19]. We previously evaluated LTPOT on a selection of datasets from the Penn Machine Learning Benchmark [13], on which TPOT was initially evaluated. Because those datasets were relatively small, pipeline evaluations were quick even on the full dataset, so there was no significant benefit to using LTPOT. On each dataset, each method is evaluated nine times, starting with a different initial generation each time.

5 https://www.openml.org/s/69

As described earlier, there are many hyperparameters with which to tune LTPOT. In this study, we only experiment with g, the number of generations between transfers. The choices for g are 2 and 16, and these configurations are referred to as LTPOT-2 and LTPOT-16, respectively. This is meant to give insight into the effectiveness of two functions of LTPOT: filtering and early optimization. For LTPOT-2, layers act almost solely as a filter: by passing the best individuals to the next layer every other iteration, not much optimization takes place in lower layers, but it does allow for the early discarding of pipelines which seem to perform poorly. With LTPOT-16, a lot of optimization can instead take place based on results found in lower layers, as relatively more time is spent evaluating and improving individuals in lower layers compared to LTPOT-2. In other words, LTPOT-2's first layer allows for more early exploration, while LTPOT-16's first layer is more focused on exploitation.

The number of individuals transferred, k, is set to 15, which is half of the total population size P = 30. Each LTPOT configuration, as well as TPOT, is run nine times per dataset, each time with a different random seed, guaranteeing a different initial population and subsequent choices for crossover and mutation. Each run is set to last 8 hours, but each individual pipeline may only be evaluated for at most 10 minutes. We explored different values for P and different amounts of evaluation time per individual, while keeping the total run time constant at 8 hours, and found that for the chosen datasets these values strike a balance between having a diverse enough population and being able to evaluate enough generations.

4.3 Results

First, we compare the various configurations by their average rank over time, shown in Fig. 4. In this figure, for each configuration, for every dataset and seed, the best pipelines found so far are ranked against each other every minute. For each method, the average Friedman rank [5] across all these datasets and seeds is calculated based on the highest AUROC score achieved by the best pipeline found so far, using the fixed hyperparameter values stated above. To calculate the rank, we consider the result a tie if the difference in AUROC values is smaller than 0.1. A lower rank is better.
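For illustration, the sketch below computes such tie-aware average ranks at a single time point. This is one possible interpretation of the tie rule (scores are bucketed into 0.1-wide bins before ranking), not the authors' exact procedure, and the example scores are hypothetical.

```python
# Sketch of computing average ranks across dataset/seed combinations at one
# time point, treating AUROC scores within 0.1 of each other as ties.
import numpy as np
from scipy.stats import rankdata

def average_ranks(scores, tie_threshold=0.1):
    """scores: (n_runs, n_methods) best-so-far AUROC per dataset/seed.
    Returns the mean rank per method (rank 1 = best)."""
    binned = np.floor(np.asarray(scores) / tie_threshold)
    # rankdata ranks ascending, so negate the binned scores to give the highest AUROC rank 1.
    ranks = np.vstack([rankdata(-row, method="average") for row in binned])
    return ranks.mean(axis=0)

# Hypothetical example: three methods (e.g. TPOT, LTPOT-2, LTPOT-16) over two runs.
print(average_ranks([[0.91, 0.83, 0.95], [0.72, 0.74, 0.60]]))  # -> [1.5  2.25 2.25]
```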
In Fig. 4, we see that LTPOT on average achieves the best scores throughout the entire 8-hour period. LTPOT-16 starts by outperforming LTPOT-2 slightly, but as time goes on its relative performance drops, and it is eventually even surpassed by TPOT. From this, it seems that while LTPOT works well as a filter, optimization in early layers does not pay off beyond the early stages. Thus, based on these results, a lower value for g is better. Furthermore, we see that towards the end TPOT is decisively improving over LTPOT-16, but only very slowly over LTPOT-2, as the average rank of LTPOT-2 increases only very slowly.

[Figure 4 plot: average rank (y-axis, lower is better) over time in minutes (x-axis) for Vanilla TPOT, Layered TPOT-2 and Layered TPOT-16.]

Fig. 4: Ranking of each method averaged over all datasets, based on internal AUC scores of the best individual found so far.

A different rank does not necessarily mean that the found model performance differs a lot. To clarify this, we look at the difference in AUROC per dataset per configuration, as shown for 9 of the 18 datasets in Fig. 5.6 In Fig. 5 we show boxplots describing the distribution of AUROC scores of the final pipelines found by each method. No single method is dominant over the others, and differences in AUROC scores are small for almost all datasets. Using a Student's t-test, we determined that there is no statistically significant difference (p < 0.05) between the final pipelines (after the full 8-hour time budget).

6 A similar figure for the other 9 datasets can be found in Appendix A.

[Figure 5 boxplots: AUC scores of the best pipeline by method (TPOT, L-2, L-16) for the datasets BNG-trains, BNG-labor, BNG-breast-cancer, BNG-ionosphere, BNG-spect_test, airlines, Click_prediction_small, skin-segmentation and BNG-mushroom.]

Fig. 5: An overview of the achieved AUROC score for each run of each method, by dataset.

However, looking at the average ranking by configuration in Fig. 4, we see that under smaller time budgets LTPOT-2 often finds pipelines which are better than TPOT's. In this scenario, it is interesting to see how much time LTPOT needs to find a pipeline at least as good as the best pipeline TPOT found. We compared LTPOT-2 to TPOT for each dataset and seed, and looked at how long it took the method which found the best pipeline to find a pipeline at least as good as the best pipeline found by the other method. Figure 6 shows the time difference (in minutes) between finding these equally good pipelines. Positive values indicate that LTPOT is faster. Yellow distributions correspond to seeds where LTPOT eventually found the best pipeline, and show how much sooner LTPOT found a pipeline at least as good as the best pipeline of TPOT. Blue distributions correspond to seeds where TPOT eventually found the best pipeline, and show how much later TPOT found a pipeline at least as good as LTPOT's best.
In general, when LTPOT finds the best pipeline, it finds a pipeline at least as good as TPOT's best pipeline much sooner; in particular for LTPOT-16 it is often at least 200 minutes faster. Even in the cases where TPOT eventually finds the best pipeline, it often finds a pipeline at least as good as LTPOT's best only after LTPOT has already found it. This again is especially true for LTPOT-16, where almost all yellow distributions are entirely positive.

[Figure 6 violin plots: distribution of the time difference (in minutes) between finding equally performing pipelines, for LTPOT-2 versus TPOT and for LTPOT-16 versus TPOT, per dataset.]

Fig. 6: Violin plots of the time difference between finding two equally good pipelines. Datasets are shown together with their OpenML ID number.

However, even when one method finds a pipeline at least as good as the other method's eventual best pipeline, it can still be the case that the eventually worse method already has quite a good pipeline at that same time. Therefore, we compare the performance of the best pipeline of each method found at time t, where t is the time at which the best method finds a pipeline at least as good as the eventual best pipeline of the worse method. Between 18 datasets and 9 seeds, there are 162 comparisons between TPOT and LTPOT-2 or LTPOT-16. The comparison in AUROC at time t is shown in Figure 7. We see that when TPOT finds a pipeline at least as good as LTPOT's best, in most cases it does so when LTPOT has already found a pipeline at most 0.01 AUROC worse. However, when LTPOT finds a pipeline at least as good as TPOT's best, TPOT more often has relatively worse pipelines at that time, in some cases over 0.2 AUROC worse.

[Figure 7 histograms: counts of the AUROC difference at time t (bins 0.0-0.01, 0.01-0.05, 0.05-0.1, 0.1-0.2, 0.2-0.5), shown separately for the cases where LTPOT-2/LTPOT-16 or TPOT finds the best pipeline.]

Fig. 7: The AUROC difference at time t, which is the time at which the best method (color coded) finds a pipeline at least as good as the best the other method will find.

From this we conclude that in many cases LTPOT finds good pipelines faster. While LTPOT-2 does not always find the best pipeline, when it does not, it finds comparable pipelines at least as quickly as TPOT. LTPOT-16 finds comparable pipelines even quicker, although it becomes less competitive under larger time budgets.

4.4 Future work

The structure of Layered TPOT allows for more hyperparameters to be tuned, and their effect remains unexplored at this point. The number of layers, as well as the effect of their granularity, could possibly be tuned to the dataset. Whether or not to turn off higher layers, and with which frequency, should be explored as well.
The number of individuals to transfer between layers may also change how quickly higher layers can optimize pipelines. The choice of the number of generations before a transfer, g, influences how much optimization is done in the lower layers. Perhaps these parameters should change over time within a single run. All of these choices come in addition to the hyperparameters already available for TPOT, such as the population size, mutation rate and crossover rate.

Finally, there are possible changes not yet captured in hyperparameters. For instance, it might be better to favor crossover over mutation in higher layers, so that the focus of exploration shifts to combining promising pipelines there. It could be that having a big population in the higher layers is unnecessary, and shrinking the population size there might yield similar results in less time. Currently, the way to evaluate individuals quickly is to sample a number of instances of the data; instead, one could create a subset by selecting features, or apply compression techniques to represent the data. We will explore these aspects in subsequent work. Results of this study will be made available on OpenML7, and the code for Layered TPOT can be found on GitHub8.

7 www.openml.org
8 https://github.com/PG-TUe/tpot/tree/layered

5 Conclusion

In this paper we presented an extension to TPOT called Layered TPOT. In this extension, instead of evaluating each pipeline on all data, only pipelines which have shown good results on subsets of the data are evaluated on all data. It does this by introducing layers. In each layer, a distinct group of individuals is subject to the evolutionary algorithm, but individuals are only evaluated on a subset of the data as defined by the layer. Each subsequent layer evaluates the individuals on more data, and if an individual performs well in one layer, it is transferred to the next, to be trained on more data and compete with other promising pipelines trained on the same subset size.

To determine the usefulness of LTPOT, two configurations have been compared to TPOT on a selection of 18 large datasets. The results showed that LTPOT is not strictly better than TPOT, but it often finds good pipelines much faster, and under smaller time budgets it outperforms TPOT on average. Moreover, LTPOT allows for a lot of flexibility in the configuration of its structure, and the effects of changes to these configurations remain to be explored.

References

1. Abdulrahman, S.M., Brazdil, P., van Rijn, J.N., Vanschoren, J.: Algorithm selection via meta-learning and sample-based active testing. In: Proceedings of the 2015 International Conference on Meta-Learning and Algorithm Selection - Volume 1455. pp. 55–66. MetaSel'15, CEUR-WS.org, Aachen, Germany (2015), http://dl.acm.org/citation.cfm?id=3053836.3053845
2. Banzhaf, W., Francone, F.D., Keller, R.E., Nordin, P.: Genetic Programming: An Introduction: on the Automatic Evolution of Computer Programs and Its Applications. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1998)
3. Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24, pp. 2546–2554. Curran Associates, Inc. (2011), http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
4. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2), 182–197 (Apr 2002)
5. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)
6. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 2962–2970. Curran Associates, Inc. (2015), http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf
7. Feurer, M., Springenberg, J.T., Hutter, F.: Initializing Bayesian hyperparameter optimization via meta-learning. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. pp. 1128–1135. AAAI'15, AAAI Press (2015), http://dl.acm.org/citation.cfm?id=2887007.2887164
8. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: An update. SIGKDD Explor. Newsl. 11(1), 10–18 (Nov 2009), http://doi.acm.org/10.1145/1656274.1656278
9. Hornby, G.S.: ALPS: The age-layered population structure for reducing the problem of premature convergence. In: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation. pp. 815–822. GECCO '06, ACM, New York, NY, USA (2006), http://doi.acm.org/10.1145/1143997.1144142
10. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration (extended version). Tech. Rep. TR-2010-10, University of British Columbia, Department of Computer Science (2010), available online: http://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf
11. Li, L., Jamieson, K.G., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Efficient hyperparameter optimization and infinitely many armed bandits. CoRR abs/1603.06560 (2016), http://arxiv.org/abs/1603.06560
12. Olson, R.S., Bartley, N., Urbanowicz, R.J., Moore, J.H.: Evaluation of a tree-based pipeline optimization tool for automating data science. CoRR abs/1603.06212 (2016), http://arxiv.org/abs/1603.06212
13. Olson, R.S., Cava, W.L., Orzechowski, P., Urbanowicz, R.J., Moore, J.H.: PMLB: A large benchmark suite for machine learning evaluation and comparison. CoRR abs/1703.00512 (2017), http://arxiv.org/abs/1703.00512
14. Olson, R.S., Urbanowicz, R.J., Andrews, P.C., Lavender, N.A., Kidd, L.C., Moore, J.H.: Automating biomedical data science through tree-based pipeline optimization. In: Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 - April 1, 2016, Proceedings, Part I, pp. 123–137. Springer International Publishing (2016), http://dx.doi.org/10.1007/978-3-319-31204-0_9
15. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (Nov 2011), http://dl.acm.org/citation.cfm?id=1953048.2078195
16. van Rijn, J., Abdulrahman, S., Brazdil, P., Vanschoren, J.: Fast algorithm selection using learning curves. In: Lecture Notes in Computer Science (IDA 2015). vol. 9385, pp. 298–309 (2015)
17. van Rijn, J., Holmes, G., Pfahringer, B., Vanschoren, J.: Algorithm selection on data streams. Lecture Notes in Computer Science (Proceedings of Discovery Science 2014) 8777, 325–336
18. Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 847–855. KDD '13, ACM, New York, NY, USA (2013), http://doi.acm.org/10.1145/2487575.2487629
19. Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: Networked science in machine learning. SIGKDD Explorations 15(2), 49–60 (2013), http://doi.acm.org/10.1145/2641190.2641198

A Additional Experiment Results

Below is a figure similar to Figure 5 for the other 9 datasets in the benchmark.

[Figure 8 boxplots: AUC scores of the best pipeline by method (TPOT, L-2, L-16) for the datasets BNG-sick, BNG-sonar, BNG-spambase, BNG-heart-statlog, BNG-adult, BNG-vote, BNG-hepatitis, BNG-credit-a and BNG-credit-g.]

Fig. 8: An overview of achieved AUROC score for each run of each method by dataset.

Figures 9 and 10 show the score of the best pipeline and the time at which it was found, per dataset, method and seed.

[Figures 9 and 10: scatter plots of the '1 - AUROC' of the best pipeline (y-axis) against the time in minutes at which it was found (x-axis), one panel per dataset, with points per seed and method.]

Fig. 9: Shows, per dataset, seed and method, when the best pipeline was found and its '1-AUROC' score.

Fig. 10: Shows, per dataset, seed and method, when the best pipeline was found and its '1-AUROC' score.