=Paper=
{{Paper
|id=Vol-2841/DARLI-AP_11
|storemode=property
|title=The impact of Auto-Sklearn’s Learning Settings: Meta-learning, Ensembling, Time Budget, and Search Space Size
|pdfUrl=https://ceur-ws.org/Vol-2841/DARLI-AP_11.pdf
|volume=Vol-2841
|authors=Hassan Eldeeb,Oleh Matsuk,Mohamed Maher,Abdelrhman Eldallal,Sherif Sakr
|dblpUrl=https://dblp.org/rec/conf/edbt/EldeebMMES21
}}
==The impact of Auto-Sklearn’s Learning Settings: Meta-learning, Ensembling, Time Budget, and Search Space Size==
Hassan Eldeeb* (hassan.eldeeb@ut.ee), Oleh Matsuk, Mohamed Maher, Abdelrhman Eldallal, Sherif Sakr
Data Systems Group, University of Tartu, Estonia

* Corresponding Author.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

With the booming demand for machine learning (ML) applications, it is recognized that the number of knowledgeable data scientists cannot scale with the growing data volumes and application needs in our digital world. Therefore, several automated machine learning (AutoML) frameworks have been developed to fill the gap of human expertise by automating most of the process of building an ML pipeline. In this paper, we present a micro-level analysis of the AutoML process by empirically evaluating and analyzing the impact of several learning settings and parameters, i.e., meta-learning, ensembling, time budget, and size of search space, on performance. In particular, we focus on AutoSklearn, the state-of-the-art AutoML framework. Our study reveals that no single configuration of these design decisions achieves the best performance across all conditions and datasets. However, some design parameters, such as using ensemble models, yield a statistically consistent performance improvement. Others are conditionally effective; e.g., meta-learning adds a statistically significant improvement only with a small time budget.

1 INTRODUCTION

Due to the increasing success of machine learning techniques in several application domains, they attract lots of attention from the research and business communities. Hence, a wide range of fields is witnessing many breakthroughs achieved by machine and deep learning techniques [18, 29].
Furthermore, machine learning has achieved significant results compared to human-level performance. For example, AlphaGo [20] defeated the GO game's champion, and deep learning models excelled in image recognition and surpassed human performance years ago [23].

Nevertheless, the machine learning modeling process is highly iterative, exploratory, and time-consuming. Therefore, several frameworks (e.g., AutoWeka [22], AutoSklearn [7], SmartML [17]) have recently been proposed to support automating the Combined Algorithm Selection and Hyper-parameter tuning (CASH) problem [13, 19, 28]. The performance of the pipelines automatically generated by these AutoML frameworks is perfect for some tasks, such that data scientists cannot develop pipelines to beat it; not even the AutoML designers can, as seen in the ChaLearn AutoML Challenge 2015/2016 [12].

These AutoML frameworks follow different learning settings, i.e., parameters or options that AutoML users have to preset while submitting the input dataset. For example, AutoSklearn [7] and SmartML [17] adopt a meta-learning-based mechanism to improve the performance of the automated search process; that is, they start with the most promising models that have performed well on similar datasets. ATM [21] limits the default search space to only three classifiers, namely, Decision Tree, K-Nearest Neighbours, and Logistic Regression. AutoSklearn offers an ensembling mechanism as a post-hoc optimization instead of reporting only the best-performing model. Additionally, most AutoML frameworks run within a user-determined time budget. Although the user has the option to use different flavors of AutoML tools by manipulating these learning settings, it is hard to decide which of them should be used for the input dataset. In effect, these learning settings are hyper-parameters of the AutoML tools themselves.

Understanding the impact of these learning settings on real-world datasets is vital, especially since the authors of AutoML frameworks evaluate their contributions using relatively small datasets [22]. Besides, the authors may knowingly or unknowingly select datasets on which their frameworks perform well.

In this study, we present a thorough analysis of the significance of various hyper-parameters (learning settings) of the AutoML process, including meta-learning, ensembling, length of time budget, and size of search space. Since AutoSklearn supports different settings for all of these hyper-parameters, we nominated it as the backbone of this study. The contributions of this paper can be summarized as follows:

• We benchmark 100 datasets on different learning settings (hyper-parameters) of AutoSklearn.
• The impact of each hyper-parameter of AutoSklearn has been examined using different configurations.
• For each positive/negative impact of a hyper-parameter configuration, we validate its consistency using the Wilcoxon statistical test [27].
• Eventually, we provide a simple guideline for which of these hyper-parameters is expected to improve the performance score based on the input datasets' characteristics.

This analysis is ongoing and extendable, so we will update it with new datasets and additional frameworks. To ensure reproducibility, we have released all artifacts (e.g., datasets, source code, log results) at https://datasystemsgrouput.github.io/AutoMLDesignDecisions/.

The remainder of this paper is organized as follows. We discuss the related work in Section 2. Section 3 describes our experiment design and defines the target learning settings. Section 4 presents the impact of meta-learning (Section 4.1), ensembling (Section 4.2), length of the time budget (Section 4.3), and size of the search space (Section 4.4). The datasets that show significant performance differences towards/against any of these design parameters are further analyzed in Section 4.5. Finally, we conclude the paper in Section 5.

2 RELATED WORK

Recently, several studies have surveyed and compared the performance of various AutoML frameworks [11, 13, 19, 24, 28]. In general, these studies show no clear winner, as there are always trade-offs that need to be considered and optimized according to the context of the problems and the user's goals.
For example, Gijsbers et al. [11] have conducted an experimental study to compare the performance of 4 AutoML systems, namely, AutoWeka, AutoSklearn, TPOT, and H2O, using 39 datasets and time budgets of 1 and 4 hours. The study observed that some AutoML tools perform significantly better or worse than others on several datasets. The authors could not draw clear conclusions about which data properties could explain this behavior.

Truong et al. [24] have conducted a study using 300 datasets to compare the performance of 7 AutoML frameworks, namely, H2O, Auto-keras [15], AutoSklearn, Ludwig (https://github.com/uber/ludwig), Darwin (https://www.sparkcognition.com/product/darwin/), TPOT, and Auto-ml, using dynamic time budgets. The results of this study showed that no framework managed to outperform all others on a plurality of tasks. Across the various evaluations and benchmarks, H2O, Auto-keras, and AutoSklearn performed better than the rest of the tools.

Zöller and Huber [28] have performed a comparison of 8 CASH optimization algorithms, namely, Grid Search, Random Search, RoBO (RObust Bayesian Optimization) [16], BTB (Bayesian Tuning and Bandits, https://github.com/HDI-Project/BTB), hyperopt [1], SMAC, BOHB [6], and Optunity (https://github.com/claesenm/optunity). The comparison showed that all CASH algorithms, except grid search, perform similarly on average. The authors also noted that a simple search algorithm such as random search did not perform worse than the other algorithms. Besides, the authors compared the performance of 6 AutoML frameworks, namely, TPOT, hpsklearn (https://github.com/hyperopt/hyperopt-sklearn/tree/master/hpsklearn), AutoSklearn, ATM, and H2O, in addition to Random Search. The comparison showed that, on average, all AutoML frameworks have similar performance; however, for a single dataset, the performance differs on average by 6% accuracy.

To the best of our knowledge, this study is the first that focuses on analyzing the learning settings and parameters of the AutoML process. The previous studies mainly focus on comparing whole frameworks' performance or comparing different optimization techniques' performance. Thus, a single configuration of the learning settings is used with all the tested datasets; mostly, it is the default settings, to allow a fair comparison among the benchmarked frameworks [11, 13, 19, 24, 28]. In contrast, this study focuses on the impact of the learning settings and the different configurations of the AutoML framework on accuracy. Understanding this relationship can help the domain expert use AutoML frameworks and select the learning-setting configuration that works well with their dataset.

3 EXPERIMENT DESIGN

In this paper, we followed the best practices on how to construct and run good machine learning benchmarks and general-purpose algorithm configuration libraries [2].

3.1 AutoML Framework: AutoSklearn

AutoSklearn [7], the winner of two ChaLearn AutoML challenges, is implemented on top of Scikit-Learn, a popular Python machine learning package [9]. It uses Sequential Model-based Algorithm Configuration (SMAC) as a Bayesian optimization technique [14]. Besides adopting the meta-learning and ensembling design decisions, the time budget and search space are also configurable in AutoSklearn. The framework uses meta-learning to initialize the search process as a warm start. It also utilizes the ensembling learning setting to improve the performance of the output models. Moreover, one of the main advantages of AutoSklearn is that it comes with different execution options. In particular, its basic vanilla version (AutoSklearn-v) applies only the SMAC optimization technique for the AutoML optimization process. However, AutoSklearn also allows the end-users to enable/disable the different optimization options, including the usage of meta-learning (AutoSklearn-m) and ensembling (AutoSklearn-e), in addition to the full version (AutoSklearn) where all options are enabled.
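The four variants can be obtained through auto-sklearn's Python API. The following is a minimal sketch assuming the auto-sklearn 1.x parameter names (time_left_for_this_task, initial_configurations_via_metalearning, ensemble_size); it illustrates the settings rather than reproducing our exact experimental harness.

```python
# Sketch: configuring the four AutoSklearn variants studied in this paper.
# Parameter names follow the auto-sklearn 1.x API and may differ in other
# versions; treat this as illustrative rather than the exact setup used here.
from autosklearn.classification import AutoSklearnClassifier

TIME_BUDGET = 10 * 60  # one of the four budgets: 10, 30, 60, or 240 minutes

# AutoSklearn-v: plain SMAC search, no warm start, no ensembling.
vanilla = AutoSklearnClassifier(
    time_left_for_this_task=TIME_BUDGET,
    initial_configurations_via_metalearning=0,  # disable meta-learning
    ensemble_size=1,                            # keep only the single best model
)

# AutoSklearn-m: meta-learning warm start, no ensembling.
meta = AutoSklearnClassifier(
    time_left_for_this_task=TIME_BUDGET,
    initial_configurations_via_metalearning=25,
    ensemble_size=1,
)

# AutoSklearn-e: post-hoc ensembling, no meta-learning warm start.
ensembling = AutoSklearnClassifier(
    time_left_for_this_task=TIME_BUDGET,
    initial_configurations_via_metalearning=0,
    ensemble_size=50,
)

# AutoSklearn: the full version with both options enabled (the defaults).
full = AutoSklearnClassifier(time_left_for_this_task=TIME_BUDGET)
```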
3.2 Datasets

We used 100 datasets collected from the OpenML repository [26]. OpenML datasets are already preprocessed into numerical features; therefore, they match the input criteria of AutoSklearn. The datasets include binary (50%) and multi-class (50%) classification tasks. The sizes of these datasets vary between 5KB and 643MB. The datasets used in this study cover a wide spectrum of meta-features (e.g., r_numerical_features, class_entropy, max_prob, mean_prob, std_dev, dataset_ratio, symbols_sum, symbols_std_dev, skew_std_dev, kurtosis_min, kurtosis_max, kurtosis_mean, and kurtosis_std_dev) [5]. Each dataset is partitioned into a training and validation split (80%) and a test split (20%), the latter used to evaluate the output pipeline.

3.3 Hardware Choice and Resources

The experiments are conducted on Google Cloud machines. Each machine is configured with 2 vCPUs, 7.5 GB RAM, and ubuntu-minimal-1804-bionic. Each experiment is run four times with different time budgets: 10, 30, 60, and 240 minutes.

3.4 Learning Settings Definitions

Learning settings are the parameters or options that AutoML users have to preset while submitting the input dataset. This paper focuses on four of them: meta-learning, ensembling, time budget, and search space size. In the following, we define each of them in the context of AutoML and briefly describe the mechanism used in AutoSklearn.

Meta-learning [25] is the process of learning from previous experience gained while applying various learning algorithms to different types of data. In the context of AutoML, the main advantage of meta-learning techniques is that they allow hand-engineered algorithms to be replaced with automated methods designed in a data-driven way. Thus, meta-learning partially simulates the machine learning expert's role for non-technical users and domain experts.

AutoSklearn applies a meta-learning mechanism based on a knowledge base storing the meta-features of datasets and the best-performing pipelines on these datasets. Thirty-eight statistical and information-theoretic meta-features are used. In the offline phase, the meta-features and the empirically best-performing pipelines are stored for each dataset in the repository (140 datasets from the OpenML repository) [7]. For any new dataset, in the online phase, the framework extracts its meta-features and searches for the most similar datasets to return the top k best-performing pipelines on these similar datasets. These k pipelines are used as a warm start for the Bayesian optimization algorithm used in the framework. In principle, the main goal of any meta-learning mechanism is to improve the search process by enabling the optimization technique to start from the most promising pipelines instead of starting from random pipelines. If the suggested pipelines' performance is bad, the Bayesian optimization can recover from these pipelines in the next iterations.
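Conceptually, the warm start is a nearest-neighbour lookup in the 38-dimensional meta-feature space. The sketch below is our illustration of that idea, not AutoSklearn's internal code; the knowledge base is filled with random placeholders (140 datasets × 38 meta-features).

```python
# Sketch of the warm start described above: find the k stored datasets whose
# meta-features are closest to the new dataset and reuse their best pipelines
# as initial configurations for the Bayesian optimizer. This is our
# illustration, not auto-sklearn's internal code.
import numpy as np

rng = np.random.default_rng(0)
META_FEATURES = rng.random((140, 38))                    # offline: meta-features per dataset
BEST_PIPELINES = [f"pipeline_{i}" for i in range(140)]   # offline: best pipeline per dataset

def warm_start_candidates(new_meta_features, k=25):
    """Return the best pipelines of the k nearest datasets in meta-feature space."""
    distances = np.abs(META_FEATURES - new_meta_features).sum(axis=1)  # L1 distance
    nearest = np.argsort(distances)[:k]
    return [BEST_PIPELINES[i] for i in nearest]

candidates = warm_start_candidates(rng.random(38))       # online phase for a new dataset
```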
Ensembling is the process of combining multiple ML base models trained on the same task to produce a better predictive model. These base models can be combined using several techniques, including simple/weighted voting (averaging), bagging, boosting, and others [4]. In principle, the main advantage of using ensembling techniques is that they allow the base models to collaborate in generating more generalized predictions than the predictions of any individual base model.

AutoSklearn stores the generated models instead of keeping only the best-performing one. These models are used in a post-processing phase to construct an ensemble. AutoSklearn uses the ensemble selection methodology introduced by Caruana et al. [3]: a greedy technique that starts with an empty ensemble and iteratively adds base models to the ensemble to maximize the validation performance.
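The following sketch illustrates the greedy ensemble selection of Caruana et al. [3] as just described; it is our simplified rendition, not auto-sklearn's actual implementation, and assumes the per-model validation probabilities are already available.

```python
# Sketch of greedy ensemble selection (Caruana et al. [3]): start from an
# empty ensemble and repeatedly add, with replacement, the base model whose
# addition maximizes validation accuracy of the averaged prediction.
import numpy as np

def ensemble_selection(val_probs, y_val, ensemble_size=50):
    """val_probs: list of (n_samples, n_classes) validation probability
    arrays, one per stored base model. Returns chosen model indices
    (repeats allowed; model i's weight is its count / ensemble_size)."""
    chosen = []
    running_sum = np.zeros_like(val_probs[0])
    for _ in range(ensemble_size):
        best_idx, best_acc = None, -1.0
        for i, probs in enumerate(val_probs):
            # Validation accuracy of the ensemble if model i were added next.
            candidate = (running_sum + probs) / (len(chosen) + 1)
            acc = (candidate.argmax(axis=1) == y_val).mean()
            if acc > best_acc:
                best_idx, best_acc = i, acc
        chosen.append(best_idx)
        running_sum += val_probs[best_idx]
    return chosen
```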
The time budget represents the time available to examine the search space for a pipeline that maximizes the performance metric. Generally, it is hypothesized that the more time allocated to the search process, the higher the achievable performance [7]. On the other hand, there is a trade-off to consider: the longer the allocated budget, the more computing resources the search process consumes, and the higher the potential that the model overfits the validation set.

The search space of any AutoML process is huge [7]. For example, AutoSklearn [7] is designed with over 15 classifiers from the scikit-learn library. Assuming that each classifier has only 2 hyper-parameters and each of them takes 100 discrete values, the search space already contains 15 × 100² = 150,000 different configurations. In practice, the real numbers are much bigger.

4 RESULTS AND DISCUSSION

The questions aimed at assessing the impact of each learning setting are as follows. Does the parameter improve or degrade accuracy? Is the difference statistically significant? And finally, when is it recommended to enable the parameter? We answer these questions for each learning setting.

4.1 The Impact of Meta-Learning

To assess the impact of the meta-learning mechanism, we compared the performance of AutoSklearn-m and AutoSklearn-v. Figure 1 shows the impact of meta-learning over different time budgets. From this figure, the meta-learning mechanism does not always lead to better performance. On average, AutoSklearn-v and AutoSklearn-m provide comparable performance for the four time budgets, as shown in Table 1. In particular, both versions have similar performance on 64, 55, 65, and 69 datasets for the 10, 30, 60, and 240 minute budgets, respectively. Table 1 summarizes the results for each time budget.

We used the Wilcoxon statistical test [10] to assess the significance of the performance difference between AutoSklearn-v and AutoSklearn-m. The results in Table 2 show that the meta-learning mechanism yields a statistically significant gain with 95% confidence (p value < 0.05) only with the 10-minute time budget. For the 30-minute time budget, the level of confidence decreases to 93.5%. In contrast, for the time budgets of 60 and 240 minutes, there is no statistically significant difference from using the meta-learning mechanism to initialize the search process. This implies that the longer the time budget, the lower the impact of the meta-learning mechanism. In general, the longer the time budget, the more time is available for the AutoML search process to explore more configurations in the search space, and the higher the probability of getting a better result; hence, the impact of the initial configurations is also lower.

We also asked whether the improvement/deterioration effect of meta-learning is constant for the same datasets across the four time budgets. As shown in Figure 2(a), we found that 25 datasets have better accuracy when using AutoSklearn-m than AutoSklearn-v with the 10-minute time budget. Out of those 25, the improvement holds for only 13 datasets with the 30-minute time budget. Similarly, among these 13 datasets, only 5 retain this behavior at 60 minutes. Finally, only one dataset is common among the four time budgets. From Figure 2, the datasets with improved performance scores are not the same in every time budget. By analyzing the meta-features of these datasets, we could not link them over different time budgets. This observation is also confirmed in Figure 2(b), where we could not draw a clear pattern out of these datasets.

Figure 1: The impact of meta-learning over all time budgets (panels: (a) 10 Min, (b) 30 Min, (c) 60 Min, (d) 240 Min). Upward triangles represent better performance with AutoSklearn-m, downward triangles represent better performance using AutoSklearn-v, and circles mean that the absolute difference is < 1%.

Table 1: Comparison between the performance of AutoSklearn-v and AutoSklearn-m in terms of accuracy over different time budgets.

| Time Budget | Framework | Accuracy Mean | Accuracy SD | Gain Min | Gain Mean | Gain Max | #datasets with gain > 1% |
|---|---|---|---|---|---|---|---|
| 10 | AutoSklearn-m | 0.870 | 0.144 | 1.1% | 2.9% | 6.7% | 25 |
| 10 | AutoSklearn-v | 0.868 | 0.145 | 1.1% | 5.6% | 15.6% | 10 |
| 30 | AutoSklearn-m | 0.873 | 0.143 | 1.1% | 2.8% | 18.8% | 27 |
| 30 | AutoSklearn-v | 0.873 | 0.142 | 1.1% | 4.5% | 16.7% | 17 |
| 60 | AutoSklearn-m | 0.873 | 0.141 | 1.1% | 3.4% | 18.8% | 17 |
| 60 | AutoSklearn-v | 0.874 | 0.137 | 1.1% | 4.4% | 13.3% | 17 |
| 240 | AutoSklearn-m | 0.877 | 0.136 | 1.1% | 5.5% | 18.8% | 13 |
| 240 | AutoSklearn-v | 0.872 | 0.149 | 1.1% | 2.7% | 8.3% | 17 |

Table 2: The results of the Wilcoxon test for assessing the statistical significance of the performance difference using AutoSklearn-m over AutoSklearn-v.

| Mode 1 | Mode 2 | Time Budget | P value |
|---|---|---|---|
| AutoSklearn-m | AutoSklearn-v | 10 | 0.004 |
| AutoSklearn-m | AutoSklearn-v | 30 | 0.065 |
| AutoSklearn-m | AutoSklearn-v | 60 | 0.434 |
| AutoSklearn-m | AutoSklearn-v | 240 | 0.305 |

Figure 2: Overlap among datasets having better performance using AutoSklearn-m (a) and AutoSklearn-v (b) through each time budget. The new color in each bar represents the number of new datasets with higher performance at this time budget, while the same color represents the same datasets from previous time budgets.
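The per-budget significance tests reported in Table 2 (and in the later tables) are paired Wilcoxon signed-rank tests over the per-dataset accuracies of two modes. A minimal sketch with scipy, using placeholder accuracy arrays rather than our measured results:

```python
# Sketch of the significance test used throughout Section 4: a paired
# Wilcoxon signed-rank test over per-dataset accuracies of two modes.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
acc_v = rng.uniform(0.6, 1.0, size=100)                      # placeholder: AutoSklearn-v accuracies
acc_m = np.clip(acc_v + rng.normal(0.005, 0.02, 100), 0, 1)  # placeholder: AutoSklearn-m accuracies

stat, p_value = wilcoxon(acc_m, acc_v)
print(f"p = {p_value:.3f}:",
      "significant with 95% confidence" if p_value < 0.05 else "not significant")
```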
4.2 The Impact of Ensembling

To assess the impact of ensembling, we compared the performance of AutoSklearn-v and AutoSklearn-e. Figure 3 shows the performance differences between the two versions over the 100 datasets. On average, AutoSklearn-e increased accuracy by 0.5%, 0.7%, 1%, and 1.4% over the four time budgets, respectively. In particular, the two modes have similar performance on 65, 62, 66, and 63 datasets for the 10, 30, 60, and 240 minute time budgets. Table 3 summarizes the results for each time budget.

Table 4 shows the outcomes of the Wilcoxon test, which is conducted to assess the statistical significance of the accuracy difference between AutoSklearn-v and AutoSklearn-e. The table shows that the ensembling techniques enhance the performance with a statistically significant gain with more than 95% confidence (p value < 0.05) on all four time budgets. The level of confidence is almost 99% over all the time budgets combined. Generally, an ensemble model strongly boosts accuracy compared to the individual base models as long as the base models' errors are independent of each other [4]. Although the base classifiers' errors are not completely independent, the ensemble model still enhances the accuracy in a statistically significant manner. This means that the accuracy improvement of the ensemble model generated by AutoSklearn-e is not a random effect and is expected to be repeated on new datasets.

The datasets with enhanced/declined performance using AutoSklearn-e within the four time budgets were also studied. Figure 4(a) shows the overlap among the datasets with better performance using the ensembling mechanism, and Figure 4(b) shows the overlap among the datasets with better performance using AutoSklearn-v. Our analysis shows no strong correlation between the meta-features of the datasets and the probability of having a better/lower performance impact using the ensembling mechanism.

Figure 3: The impact of ensembling over all time budgets (panels: (a) 10 Min, (b) 30 Min, (c) 60 Min, (d) 240 Min). Upward triangles represent better performance with AutoSklearn-e, downward triangles represent better performance using AutoSklearn-v, and circles mean that the absolute difference is < 1%.

Table 3: Comparison between the performance of AutoSklearn-v and AutoSklearn-e in terms of accuracy over different time budgets.

| Time Budget | Framework | Accuracy Mean | Accuracy SD | Gain Min | Gain Mean | Gain Max | #datasets with gain > 1% |
|---|---|---|---|---|---|---|---|
| 10 | AutoSklearn-e | 0.873 | 0.139 | 1.1% | 4.1% | 16.4% | 22 |
| 10 | AutoSklearn-v | 0.868 | 0.145 | 1.1% | 3.4% | 8.3% | 12 |
| 30 | AutoSklearn-e | 0.880 | 0.136 | 1.1% | 3.4% | 13.5% | 27 |
| 30 | AutoSklearn-v | 0.873 | 0.142 | 1.1% | 3.1% | 11.1% | 10 |
| 60 | AutoSklearn-e | 0.884 | 0.132 | 1.1% | 4.9% | 12.5% | 24 |
| 60 | AutoSklearn-v | 0.874 | 0.137 | 1.1% | 3.1% | 6.4% | 9 |
| 240 | AutoSklearn-e | 0.886 | 0.130 | 1.1% | 7.1% | 52.7% | 14 |
| 240 | AutoSklearn-v | 0.872 | 0.149 | 1.1% | 2.9% | 8.3% | 11 |

Table 4: The results of the Wilcoxon test for assessing the statistical significance of the performance difference using AutoSklearn-e over AutoSklearn-v.

| Framework 1 | Framework 2 | TB | P value |
|---|---|---|---|
| AutoSklearn-e | AutoSklearn-v | 10 | 0.011 |
| AutoSklearn-e | AutoSklearn-v | 30 | 0.000 |
| AutoSklearn-e | AutoSklearn-v | 60 | 0.000 |
| AutoSklearn-e | AutoSklearn-v | 240 | 0.008 |

Figure 4: Overlap among datasets having better performance using AutoSklearn-e (a) and AutoSklearn-v (b) through each time budget. The new color in each bar represents the number of new datasets with higher performance at this time budget, while the same color represents the same datasets from previous time budgets.
4.3 The Impact of Time Budget

This experiment compares the accuracy gain of all combinations of time budget increases (10/30 Min, 10/60 Min, 10/240 Min, 30/60 Min, 30/240 Min, 60/240 Min). For space limitations, we could not include all comparison figures; however, they are available in the project repository.

On average, the accuracy values at each of the four time budgets are comparable. For example, the base/increased time budget pairs 10/30 Min (Figure 5(a)), 30/60 Min (Figure 5(b)), 60/240 Min (Figure 5(c)), and 10/240 Min (Figure 5(d)) have similar performance on 65, 57, 66, and 60 datasets, respectively. The increased time budget performed better than the base budget for 17, 21, 22, and 26 datasets in the four figures, respectively, while it performed worse for 9, 9, 13, and 11 datasets. The majority of the datasets thus achieve a performance improvement when increasing the time budget, while few datasets witness a performance decline. Hence, offering more time for AutoSklearn to search for a better solution generally leads to accuracy gains, as previously established in [7]. Table 5 shows the statistical significance of increasing the time budget [27]. In particular, increasing the time budget from 10 to 30 minutes or from 60 to 240 minutes does not provide a statistically significant performance gain. On the other hand, increasing the time budget from 10 to 60, 30 to 60, 10 to 240, and 30 to 240 minutes provides a statistically significant accuracy gain.

Table 5: The results of the Wilcoxon test for assessing the statistical significance of the performance gain from increasing the time budget.

| Framework | TB 1 | TB 2 | Avg. Acc. Diff | P value |
|---|---|---|---|---|
| AutoSklearn-v | 30 | 10 | 0.005 | 0.226 |
| AutoSklearn-v | 60 | 10 | 0.007 | 0.004 |
| AutoSklearn-v | 60 | 30 | 0.002 | 0.141 |
| AutoSklearn-v | 240 | 10 | 0.007 | 0.000 |
| AutoSklearn-v | 240 | 30 | 0.002 | 0.027 |
| AutoSklearn-v | 240 | 60 | 0.000 | 0.110 |
| AutoSklearn-m | 30 | 10 | 0.004 | 0.211 |
| AutoSklearn-m | 60 | 10 | 0.004 | 0.198 |
| AutoSklearn-m | 60 | 30 | 0.000 | 0.956 |
| AutoSklearn-m | 240 | 10 | 0.008 | 0.099 |
| AutoSklearn-m | 240 | 30 | 0.004 | 0.614 |
| AutoSklearn-m | 240 | 60 | 0.004 | 0.398 |
| AutoSklearn-e | 30 | 10 | 0.007 | 0.000 |
| AutoSklearn-e | 60 | 10 | 0.011 | 0.000 |
| AutoSklearn-e | 60 | 30 | 0.004 | 0.675 |
| AutoSklearn-e | 240 | 10 | 0.013 | 0.000 |
| AutoSklearn-e | 240 | 30 | 0.006 | 0.038 |
| AutoSklearn-e | 240 | 60 | 0.002 | 0.265 |
| AutoSklearn | 30 | 10 | 0.003 | 0.362 |
| AutoSklearn | 60 | 10 | 0.009 | 0.000 |
| AutoSklearn | 60 | 30 | 0.005 | 0.019 |
| AutoSklearn | 240 | 10 | 0.014 | 0.001 |
| AutoSklearn | 240 | 30 | 0.011 | 0.002 |
| AutoSklearn | 240 | 60 | 0.005 | 0.117 |

Figure 5: The impact of increasing the time budget on AutoSklearn performance from x to y minutes (panels: (a) 10–30 Min, (b) 30–60 Min, (c) 60–240 Min, (d) 10–240 Min). Upward triangles represent better performance with the y time budget, downward triangles represent better performance with the x time budget, and circles mean that the difference between x and y is < 1%.
4.4 The Impact of the Size of the Search Space

In this experiment, we compare the accuracy obtained using the full search space with all available classifiers (FC) against a subset of the search space containing the best-performing classifiers (3C). In practice, we selected the top 3 classifiers, i.e., support vector machine, random forest, and decision trees, based on the results of the FC runs. Table 6 shows that both search spaces have comparable performance. Figure 6 shows the effect of using the FC against using only the 3C on AutoSklearn. The results show no clear winner. In particular, the FC exceeds the accuracy of the 3C on 28 datasets with an average accuracy gain of 3.3%, while the 3C achieves better performance on 21 datasets with an average accuracy difference of 5.9%. Besides, 50 datasets have negligible accuracy differences (less than 1%). The Wilcoxon test (Table 7) shows no statistically significant difference between the two search spaces. Although the FC is much larger, it does not reduce accuracy compared with the exploited search space (3C). Therefore, it is better to keep all classifiers in the target search space.

Table 6: Search space effect: result summary.

| Search Space | Mean | SD |
|---|---|---|
| 3C | 0.867 | 0.139 |
| FC | 0.863 | 0.153 |

Table 7: The results of the Wilcoxon test for assessing the statistical significance of the performance difference for increasing the search space.

| Mode 1 | Mode 2 | Avg. Acc. Diff | P value |
|---|---|---|---|
| FC | 3C | -0.003 | 0.618 |

Figure 6: The impact of reducing the search space size (accuracy difference between FC and 3C for the 30-minute budget in AutoSklearn). Upward triangles represent better performance with the FC search space, downward triangles represent better performance with the 3C search space, and circles mean that the difference between FC and 3C is < 1%.
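For reference, restricting auto-sklearn to a 3C-style search space can be sketched as follows; the include_estimators parameter and the component names follow the auto-sklearn 1.x API and may differ in other versions (newer releases use include={"classifier": [...]}).

```python
# Sketch: restricting auto-sklearn's search space to the three top-performing
# classifiers (the 3C setting). Parameter and component names are assumptions
# based on the auto-sklearn 1.x API.
from autosklearn.classification import AutoSklearnClassifier

three_c = AutoSklearnClassifier(
    time_left_for_this_task=30 * 60,  # 30-minute budget, as in Figure 6
    include_estimators=["libsvm_svc", "random_forest", "decision_tree"],
)
```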
4.5 Special Runs and Discussion

The datasets with substantial performance differences towards/against any of the discussed learning settings were further investigated and rerun 3 times per configuration. We noticed that most of these datasets have an order of magnitude fewer instances than features, or have very few instances (mostly datasets from medical domains); see Table 8. Generally, the generated pipelines and their accuracy for these kinds of datasets are completely different in each iteration. For instance, 5 different classifiers are selected in 6 unique pipelines for the dataset_40_sonar (sonar) dataset. In principle, the importance of the learners' hyper-parameters varies based on their effect on the accuracy [8]. Moreover, the importance of the hyper-parameters depends on the dataset characteristics. For example, the regularization parameter is critical for datasets with fewer instances than features to avoid over-fitting; hence, the AutoML tool should pay more attention to it for better generalization with so few instances.

In AutoSklearn, the ML pipeline structure consists of three fixed components, i.e., data preprocessor, feature preprocessor, and classifier. AutoSklearn tries several options for each stage and selects the one that maximizes the validation accuracy. Since the feature engineering phase is significant, the output pipelines show large performance differences when they have different feature preprocessors, even if the same classifier is selected for all of them. For example, although lda is selected as the classifier in two pipelines for dbworld-bodies (bodies), their accuracies are very different: 93.8% for AutoSklearn-v, which used the nystroem_sampler preprocessor, compared to 75% for AutoSklearn-m without any preprocessor. Additionally, with AutoSklearn-v, two pipelines with the same Gaussian naive Bayes (gaussian_nb) classifier produce 100% and 83.3% accuracy with different preprocessors. In practice, the feature engineering phase consumes most of data scientists' time; however, it is not handled very well by any AutoML tool, including AutoSklearn [24]. Therefore, there is huge room for improvement in the automated feature engineering phase.

These results reflect the great importance of the feature engineering phase as a crucial step in classical machine learning. The right feature engineering could turn the feature space into a linearly separable one, so even naive classifiers could achieve relatively high accuracy. On the other hand, skipping this phase or using the wrong feature engineering preprocessors makes it harder to achieve relatively high accuracy, even for the most efficient classifiers. Therefore, image datasets that use many raw pixels as features usually show oscillating performance depending on the preprocessors selected in the feature engineering phase. Consequently, the pipeline whose preprocessors are more suitable for the target dataset achieves relatively better accuracy.

Using AutoSklearn-e's greedy implementation of ensembling on datasets with very few instances degrades performance, since the validation set is then expected to contain very few instances too. The fitted model is vulnerable to over-fitting on such a small validation set, e.g., GCM in Table 8.

We believe that when dealing with really big datasets, the optimization process would not have the luxury of attempting the same large number of configurations, because of the significant cost (e.g., time and computing resources) associated with each configuration attempt. (The average size of the 100 datasets in our experiments is 21.2MB, i.e., relatively small; relatively big datasets such as Cifar-10 (643MB) failed with AutoSklearn.) Thus, to tackle the challenge of dealing with big datasets, there is a crucial need for a distributed AutoML search process. For such big datasets, the meta-learning mechanism can have a more significant impact, reducing the search space and optimizing the search process with possibly a lower number of attempts within the defined time budgets (see the CovPokElec dataset in Table 8).
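The feature-preprocessor effect discussed above can be illustrated with plain scikit-learn: the same classifier (LDA) with and without a Nystroem kernel approximation (the scikit-learn counterpart of auto-sklearn's nystroem_sampler component) can score very differently on a wide, small-sample dataset. The synthetic dataset below is a placeholder, not one of the benchmark datasets.

```python
# Sketch of the feature-preprocessor effect: identical classifier, different
# feature preprocessing, potentially very different accuracy.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Wide, small-sample placeholder dataset (more features than instances).
X, y = make_classification(n_samples=64, n_features=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

plain = make_pipeline(LinearDiscriminantAnalysis()).fit(X_tr, y_tr)
kernelized = make_pipeline(Nystroem(n_components=20, random_state=0),
                           LinearDiscriminantAnalysis()).fit(X_tr, y_tr)

print("lda only:      ", plain.score(X_te, y_te))
print("nystroem + lda:", kernelized.score(X_te, y_te))
```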
Table 8: A sample of the datasets' characteristics and results from the repeated (special) runs. 'm', 'e', and 'v' stand for the version of AutoSklearn (A).

| Time Budget | Dataset | #Feat. | #Inst. | A | Accuracy (3 runs) |
|---|---|---|---|---|---|
| 10 minutes | sonar | 61 | 208 | m | 0.885, 0.827, 0.788 |
| 10 minutes | sonar | 61 | 208 | v | 0.846, 0.827, 0.712 |
| 10 minutes | bodies | 4703 | 64 | m | 0.875, 0.875, 0.75 |
| 10 minutes | bodies | 4703 | 64 | v | 0.938, 0.938, 0.875 |
| 10 minutes | tumors_C | 7130 | 60 | m | 0.666, 0.666, 0.60 |
| 10 minutes | tumors_C | 7130 | 60 | v | 0.666, 0.466, 0.466 |
| 10 minutes | micro-mass | 1301 | 517 | m | 0.881, 0.839, 0.811 |
| 10 minutes | micro-mass | 1301 | 517 | v | 0.947, 0.867, 0.832 |
| 10 minutes | GCM | 16064 | 190 | e | 0.604, 0.604, 0.583 |
| 10 minutes | GCM | 16064 | 190 | v | 0.792, 0.708, 0.646 |
| 30 minutes | stemmed | 3722 | 64 | m | 0.812, 0.812, 0.75 |
| 30 minutes | stemmed | 3722 | 64 | v | 0.875, 0.875, 0.812 |
| 30 minutes | lymphoma | 4027 | 45 | m | 0.812, 0.812, 0.75 |
| 30 minutes | lymphoma | 4027 | 45 | v | 0.875, 0.875, 0.812 |
| 30 minutes | rsctc2010_3 | 22278 | 95 | m | 0.958, 0.875, 0.875 |
| 30 minutes | rsctc2010_3 | 22278 | 95 | v | 1.0, 0.833, 0.75 |
| 240 minutes | CovPokElec | 65 | 1.4M | m | 0.954, 0.888, 0.62 |
| 240 minutes | CovPokElec | 65 | 1.4M | v | 0.80, 0.572, 0.504 |
5 CONCLUSION

This paper analyzed various learning settings employed by AutoSklearn and by AutoML frameworks in general. The analysis revealed several insights that can help guide and improve the design of future AutoML techniques. For example, no single configuration of the learning settings can always guarantee improved performance for all datasets; each configuration usually leads to better performance on some datasets. The meta-learning mechanism pioneered by AutoSklearn achieves a statistically significant performance improvement with short time budgets only, and it loses its impact with longer time budgets. Hence, we recommend using meta-learning only with limited time budgets or with huge datasets that take a long time to train a single model. Using ensemble models, the results are consistently improved for all time budgets. Thus, ensembling is recommended, especially with datasets with many features and few instances, since it reduces the chances of overfitting the validation split. Increasing the time budget needs to be considered carefully, as it does not always lead to a significant improvement in accuracy; this decision can vary from one scenario/application to another according to the resource-accuracy tradeoff. Deliberately selecting a small search space with a few top-performing classifiers can lead to performance very comparable with that of a search space that includes many classifiers. This insight is essential, especially for large datasets that cannot be evaluated using many classifiers. Finally, this study opens the door to adaptively configuring the default learning settings for each input dataset based on its characteristics.

ACKNOWLEDGEMENT

This work is funded by the European Regional Development Funds via the Mobilitas Plus programme (grant MOBTT75).
REFERENCES

[1] James Bergstra, Dan Yamins, and David D Cox. 2013. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference. 13–20.
[2] Bernd Bischl et al. 2017. OpenML benchmarking suites and the OpenML100. arXiv preprint arXiv:1708.03731.
[3] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. 2004. Ensemble selection from libraries of models. In ICML.
[4] Thomas G Dietterich. 2000. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems.
[5] Salijona Dyrmishi, Radwa Elshawi, and Sherif Sakr. 2019. A Decision Support Framework for AutoML Systems: A Meta-Learning Approach. In Proceedings of the 1st IEEE ICDM Workshop on Autonomous Machine Learning (AML).
[6] Stefan Falkner, Aaron Klein, and Frank Hutter. 2018. BOHB: Robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774.
[7] Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. 2020. Auto-sklearn 2.0: The next generation. arXiv preprint arXiv:2007.04074.
[8] Matthias Feurer and Frank Hutter. 2019. Hyperparameter optimization. In Automated Machine Learning. Springer, Cham, 3–33.
[9] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and Robust Automated Machine Learning. In NIPS.
[10] Edmund A Gehan. 1965. A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika 52, 1-2, 203–224.
[11] Pieter Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, and Joaquin Vanschoren. 2019. An open source AutoML benchmark. arXiv preprint arXiv:1907.00909.
[12] Isabelle Guyon, Lisheng Sun-Hosoya, Marc Boullé, Hugo Jair Escalante, Sergio Escalera, Zhengying Liu, Damir Jajetic, Bisakha Ray, Mehreen Saeed, Michèle Sebag, et al. 2019. Analysis of the AutoML Challenge Series. Automated Machine Learning, 177.
[13] Xin He, Kaiyong Zhao, and Xiaowen Chu. 2019. AutoML: A Survey of the State-of-the-Art. arXiv preprint arXiv:1908.00709.
[14] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization.
[15] Haifeng Jin, Qingquan Song, and Xia Hu. 2019. Auto-Keras: An efficient neural architecture search system. In ACM KDD.
[16] Aaron Klein, Stefan Falkner, Numair Mansur, and Frank Hutter. 2017. RoBO: A flexible and robust Bayesian optimization framework in Python. In NIPS 2017 Bayesian Optimization Workshop.
[17] Mohamed Maher and Sherif Sakr. 2019. SmartML: A Meta Learning-Based Framework for Automated Selection and Hyperparameter Tuning for Machine Learning Algorithms. In EDBT.
[18] Sherif Sakr and Albert Y. Zomaya (Eds.). 2019. Encyclopedia of Big Data Technologies. Springer. https://doi.org/10.1007/978-3-319-63962-8
[19] Radwa El Shawi, Mohamed Maher, and Sherif Sakr. 2019. Automated Machine Learning: State-of-The-Art and Open Challenges. CoRR abs/1906.02287. http://arxiv.org/abs/1906.02287
[20] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587, 484–489.
[21] Thomas Swearingen, Will Drevo, Bennett Cyphers, Alfredo Cuesta-Infante, Arun Ross, and Kalyan Veeramachaneni. 2017. ATM: A distributed, collaborative, scalable system for automated machine learning. In 2017 IEEE International Conference on Big Data (Big Data).
[22] Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2013. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In ACM KDD.
[23] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. 2020. Fixing the train-test resolution discrepancy: FixEfficientNet. arXiv preprint arXiv:2003.08237.
[24] Anh Truong, Austin Walters, Jeremy Goodsitt, Keegan Hines, Bayan Bruss, and Reza Farivar. 2019. Towards automated machine learning: Evaluation and comparison of AutoML approaches and tools. arXiv preprint arXiv:1908.05557.
[25] Joaquin Vanschoren. 2018. Meta-learning: A survey. arXiv preprint arXiv:1810.03548.
[26] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: Networked Science in Machine Learning. SIGKDD Explorations 15, 2, 49–60. https://doi.org/10.1145/2641190.2641198
[27] Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics. Springer, 196–202.
[28] Marc-André Zöller and Marco F Huber. 2019. Benchmark and Survey of Automated Machine Learning Frameworks.
[29] Albert Y Zomaya and Sherif Sakr. 2017. Handbook of Big Data Technologies. Springer.