=Paper= {{Paper |id=Vol-2841/DARLI-AP_11 |storemode=property |title=The impact of Auto-Sklearn’s Learning Settings: Meta-learning, Ensembling, Time Budget, and Search Space Size |pdfUrl=https://ceur-ws.org/Vol-2841/DARLI-AP_11.pdf |volume=Vol-2841 |authors=Hassan Eldeeb,Oleh Matsuk,Mohamed Maher,Abdelrhman Eldallal,Sherif Sakr |dblpUrl=https://dblp.org/rec/conf/edbt/EldeebMMES21 }}
The impact of Auto-Sklearn’s Learning Settings:
Meta-learning, Ensembling, Time Budget, and Search Space Size

Hassan Eldeeb* (hassan.eldeeb@ut.ee), Oleh Matsuk, Mohamed Maher, Abdelrhman Eldallal, Sherif Sakr
Data Systems Group, University of Tartu, Estonia
ABSTRACT
With the booming demand for machine learning (ML) applications, it is recognized that the number of knowledgeable data scientists cannot scale with the growing data volumes and application needs in our digital world. Therefore, several automated machine learning (AutoML) frameworks have been developed to fill the gap of human expertise by automating most of the process of building an ML pipeline. In this paper, we present a micro-level analysis of the AutoML process by empirically evaluating and analyzing the impact of several learning settings and parameters, i.e., meta-learning, ensembling, time budget, and size of the search space, on performance. In particular, we focus on AutoSklearn, the state-of-the-art AutoML framework. Our study reveals that no single configuration of these design decisions achieves the best performance across all conditions and datasets. However, some design parameters, such as using ensemble models, yield a statistically consistent improvement in performance. Others are only conditionally effective; e.g., meta-learning adds a statistically significant improvement only with a small time budget.

* Corresponding Author.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
Due to their increasing success in several application domains, machine learning techniques have attracted considerable attention from the research and business communities. A wide range of fields is witnessing breakthroughs achieved by machine and deep learning techniques [18, 29], and machine learning has reached significant achievements compared to human-level performance. For example, AlphaGO [20] defeated the GO game’s champion, and deep learning models excelled in image recognition and surpassed human performance years ago [23].
   Nevertheless, the machine learning modeling process is highly iterative, exploratory, and time-consuming. Therefore, several frameworks (e.g., AutoWeka [22], AutoSklearn [7], SmartML [17]) have recently been proposed to automate the Combined Algorithm Selection and Hyper-parameter tuning (CASH) problem [13, 19, 28]. For some tasks, the performance of the pipelines automatically generated by these AutoML frameworks is so good that data scientists cannot develop pipelines to beat it, and neither can the AutoML designers themselves, as seen in the ChaLearn AutoML Challenge 2015/2016 [12].
   These AutoML frameworks follow different learning settings, i.e., parameters or options that AutoML users have to preset while submitting the input dataset. For example, AutoSklearn [7] and SmartML [17] adopt a meta-learning-based mechanism to improve the performance of the automated search process. That is, they start with the most promising models, namely those that have performed well on similar datasets. ATM [21] limits the default search space to only three classifiers, namely, Decision Tree, K-Nearest Neighbours, and Logistic Regression. AutoSklearn offers an ensembling mechanism as a post-hoc optimization instead of reporting only the best-performing model. Additionally, most AutoML frameworks run within a user-determined time budget. Although users can obtain different flavors of AutoML tools by manipulating these learning settings, it is hard to decide which of them should be used for a given input dataset. In effect, these learning settings are themselves hyper-parameters of the AutoML tools.
   Understanding the impact of these learning settings on real-world datasets is vital, especially since the authors of the AutoML frameworks evaluate their contributions using relatively small datasets [22]. Besides, the authors may knowingly or unknowingly select datasets on which their frameworks perform well.
   In this study, we present a thorough analysis of the significance of various hyper-parameters (learning settings) of the AutoML process, including meta-learning, ensembling, length of the time budget, and size of the search space. Since AutoSklearn supports different settings for all of these hyper-parameters, we nominated it as the backbone of this study.
   The contributions of this paper can be summarized as follows:
    • We benchmark 100 datasets on different learning settings (hyper-parameters) of AutoSklearn.
    • The impact of each hyper-parameter of AutoSklearn has been examined using different configurations.
    • For each positive/negative impact of a hyper-parameter configuration, we validate its consistency using the Wilcoxon statistical test [27].
    • Eventually, we provide a simple guideline for which of these hyper-parameters is expected to improve the performance score based on the input datasets’ characteristics.
   This analysis is ongoing and extendable, and we will update it with new datasets and additional frameworks. To ensure reproducibility, we have released all artifacts (e.g., datasets, source code, log results).1

1 https://datasystemsgrouput.github.io/AutoMLDesignDecisions/

   The remainder of this paper is organized as follows. We discuss the related work in Section 2. Section 3 describes our experiment design and defines the target learning settings. Section 4 presents the impact of meta-learning (Section 4.1), ensembling
(Section 4.2), length of the time budget (Section 4.3), and size of the search space (Section 4.4). The datasets that show significant performance differences towards/against any of these design parameters are further analyzed in Section 4.5. Finally, we conclude the paper in Section 5.

2 RELATED WORK
Recently, several studies have surveyed and compared the performance of various AutoML frameworks [11, 13, 19, 24, 28]. In general, these studies show no clear winner, as there are always trade-offs that need to be considered and optimized according to the context of the problem and the user’s goals. For example, Gijsbers et al. [11] conducted an experimental study comparing the performance of 4 AutoML systems, namely, AutoWeka, AutoSklearn, TPOT, and H2O, using 39 datasets and time budgets of 1 and 4 hours. The study observed that some AutoML tools perform significantly better or worse than others on several datasets. The authors could not draw clear conclusions about which data properties could explain this behavior.
   Truong et al. [24] conducted a study using 300 datasets to compare the performance of 7 AutoML frameworks, namely, H2O [15], AutoSklearn, Auto-keras, Ludwig2, Darwin3, TPOT, and Auto-ml, using dynamic time budgets. The results of this study showed that no framework managed to outperform all others on a plurality of tasks. Across the various evaluations and benchmarks, H2O, Auto-keras, and AutoSklearn performed better than the rest of the tools.
   Zöller and Huber [28] performed a comparison of 8 CASH optimization algorithms, namely, Grid Search, Random Search, ROBO (RObust Bayesian Optimization) [16], BTB (Bayesian Tuning and Bandits)4, hyperopt [1], SMAC, BOHB [6], and Optunity5. The comparison showed that all CASH algorithms except grid search perform similarly on average. The authors also noted that a simple search algorithm such as random search did not perform worse than the other algorithms. In addition, the authors compared the performance of 6 AutoML frameworks, namely, TPOT, hpsklearn6, AutoSklearn, ATM, and H2O, in addition to Random Search. The comparison showed that, on average, all AutoML frameworks have similar performance; however, for a single dataset, the performance differs on average by 6% accuracy.
   To the best of our knowledge, this study is the first that focuses on analyzing the learning settings and parameters of the AutoML process. Previous studies mainly focus on comparing whole frameworks or different optimization techniques. Thus, a single configuration of the learning settings is used with all the tested datasets, mostly the default settings, to ensure a fair comparison among the benchmarked frameworks [11, 13, 19, 24, 28]. In contrast, this study focuses on the impact of the learning settings and different configurations of the AutoML framework on accuracy. Understanding this relationship can help domain experts use AutoML frameworks and select the learning setting configuration that works well with their datasets.

2 https://github.com/uber/ludwig
3 https://www.sparkcognition.com/product/darwin/
4 https://github.com/HDI-Project/BTB
5 https://github.com/claesenm/optunity
6 https://github.com/hyperopt/hyperopt-sklearn/tree/master/hpsklearn

3 EXPERIMENT DESIGN
In this paper, we followed the best practices on how to construct and run good machine learning benchmarks and general-purpose algorithm configuration libraries [2].

3.1 AutoML framework: AutoSklearn
AutoSklearn [7], the winner of two ChaLearn AutoML challenges, is implemented on top of Scikit-Learn, a popular Python machine learning package [9]. It uses Sequential Model-based Algorithm Configuration (SMAC) as a Bayesian optimization technique [14]. Besides adopting the meta-learning and ensembling design decisions, the time budget and search space are also configurable in AutoSklearn. The framework uses meta-learning to initialize the search process as a warm start. It also utilizes the ensembling learning setting to improve the performance of the output models. Moreover, one of the main advantages of AutoSklearn is that it comes with different execution options. In particular, its basic vanilla version (AutoSklearn-v) applies only the SMAC optimization technique for the AutoML optimization process. However, AutoSklearn also allows end-users to enable/disable the different optimization options, including the usage of meta-learning (AutoSklearn-m) and ensembling (AutoSklearn-e), in addition to the full version (AutoSklearn) where all options are enabled.

3.2 Datasets
We used 100 datasets collected from the OpenML repository [26]. OpenML datasets are already preprocessed into numerical features; therefore, they match the input criteria of AutoSklearn. The datasets include binary (50%) and multi-class (50%) classification tasks. The sizes of these datasets vary between 5KB and 643MB. The datasets used in this study cover a wide spectrum of meta-features (e.g., r_numerical_features, class_entropy, max_prob, mean_prob, std_dev, dataset_ratio, symbols_sum, symbols_std_dev, skew_std_dev, kurtosis_min, kurtosis_max, kurtosis_mean and kurtosis_std_dev) [5]. Each dataset is partitioned into a training and validation split (80%) and a test split (20%), which is used to evaluate the output pipeline.

3.3 Hardware Choice and Resources
The experiments are conducted on Google Cloud machines. Each machine is configured with 2 vCPUs, 7.5 GB RAM, and ubuntu-minimal-1804-bionic. Each experiment is run four times with different time budgets: 10, 30, 60, and 240 minutes.

3.4 Learning Settings Definitions
Learning settings are the parameters or options that AutoML users have to preset while submitting the input dataset. This paper focuses on four of them: meta-learning, ensembling, time budget, and search space size. In the following, we define each of them in the context of AutoML and briefly describe the mechanism used in AutoSklearn.
   Meta-learning [25] is the process of learning from previous experience gained while applying various learning algorithms to different types of data. In the context of AutoML, the main advantage of meta-learning techniques is that they allow hand-engineered algorithms to be replaced with automated methods designed in a data-driven way. Thus, meta-learning is used to partially simulate the machine learning expert’s role for non-technical users and domain experts.
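As a concrete illustration of this warm-start idea, consider the following sketch. It is our own simplification, not AutoSklearn’s implementation: meta-features are reduced to a two-dimensional vector, and the knowledge base and pipeline names are hypothetical.

```python
import math

# Hypothetical knowledge base: meta-feature vector of a previously
# seen dataset -> its empirically best-performing pipeline.
KNOWLEDGE_BASE = [
    ((0.20, 0.90), "random_forest"),
    ((0.80, 0.10), "linear_svc"),
    ((0.25, 0.85), "gradient_boosting"),
]

def warm_start_pipelines(meta_features, k=2):
    """Return the pipelines of the k most similar stored datasets,
    to be used as initial configurations for Bayesian optimization."""
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda entry: math.dist(entry[0], meta_features))
    return [pipeline for _, pipeline in ranked[:k]]

# A new dataset whose meta-features resemble the first and third entries:
print(warm_start_pipelines((0.22, 0.88)))  # → ['random_forest', 'gradient_boosting']
```

In AutoSklearn the similarity is computed over 38 meta-features, and the returned 𝑘 pipelines seed the SMAC optimizer rather than being used directly as final models.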
AutoSklearn applies a meta-learning mechanism based on a knowledge base storing the meta-features of datasets and the best-performing pipelines on these datasets. Thirty-eight statistical and information-theoretic meta-features are used. In the offline phase, the meta-features and the empirically best-performing pipelines are stored for each dataset in the repository (140 datasets from the OpenML repository) [7]. For any new dataset, in the online phase, the framework extracts its meta-features and searches for the most similar datasets to return the top 𝑘 best-performing pipelines on these similar datasets. These 𝑘 pipelines are used as a warm start for the Bayesian optimization algorithm used in the framework.
   In principle, the main goal of any meta-learning mechanism is to improve the search process by enabling the optimization technique to start from the most promising pipelines instead of starting from random pipelines. If the suggested pipelines perform badly, the Bayesian optimization can recover from them in the next iterations.
   Ensembling is the process of combining multiple ML base models trained on the same task to produce a better predictive model. These base models can be combined using several techniques, including simple/weighted voting (averaging), bagging, boosting, and others [4]. In principle, the main advantage of using ensembling techniques is that they allow the base models to collaborate in generating more generalized predictions than those of any individual base model.
   AutoSklearn stores the generated models instead of keeping only the best-performing one. These models are used in a post-processing phase to construct an ensemble. AutoSklearn uses the ensemble selection methodology introduced by Caruana et al. [3]. It is a greedy technique that starts with an empty ensemble and iteratively adds base models to the ensemble to maximize the validation performance.
   The time budget represents the time available to examine the search space for identifying a pipeline that maximizes the performance metric. Generally, it is hypothesized that the more time allocated to the search process, the higher the achievable performance [7]. On the other hand, there is a trade-off to be considered: the longer the allocated budget, the more computing resources the search process consumes, and the higher the potential that the model overfits the validation set.
   The search space of any AutoML process is huge [7]. For example, AutoSklearn [7] is designed with over 15 classifiers from the scikit-learn library. Assuming that each classifier has only 2 hyper-parameters and each of them takes 100 discrete values, the search space already contains 15 × 100² = 150,000 different configurations. In practice, the real numbers are much bigger.

4 RESULTS AND DISCUSSION
The questions aimed at assessing the impact of each learning setting are as follows. Does the parameter improve or degrade accuracy? Is the difference statistically significant? And finally, when is it recommended to enable the parameter? We answer these questions for each learning setting.

4.1 The Impact of Meta-Learning
To assess the impact of the meta-learning mechanism, we compared the performance of AutoSklearn-m and AutoSklearn-v. Figure 1 shows the impact of meta-learning over different time budgets. From this figure, the meta-learning mechanism does not always lead to better performance. On average, AutoSklearn-v and AutoSklearn-m provide comparable performance for the four time budgets, as shown in Table 1. In particular, both versions have similar performance on 64, 55, 65, and 69 datasets for the 10, 30, 60, and 240 minute budgets, respectively. Table 1 summarizes the results for each time budget.
   We used the Wilcoxon statistical test [10] to assess the significance of the performance difference between AutoSklearn-v and AutoSklearn-m. The results in Table 2 show that the meta-learning mechanism yields a statistically significant gain with 95% confidence (𝑝 value < 0.05) only with the 10-minute time budget. For the 30-minute time budget, the level of confidence decreases to 93.5%. In contrast, for the time budgets of 60 and 240 minutes, there is no statistically significant difference in using the meta-learning mechanism to initialize the search process. This implies that the longer the time budget, the lower the impact of the meta-learning mechanism. In general, the longer the time budget, the more time is available for the AutoML search process to explore more configurations in the search space, and the higher the probability of getting a better result; hence, the impact of the initial configurations is lower.
   We also investigated whether the improvement/deterioration effect of meta-learning is constant for the same datasets across the four time budgets. As shown in Figure 2(a), we found that 25 datasets have better accuracy when using AutoSklearn-m than AutoSklearn-v in the 10-minute time budget. Out of those 25, the improvement holds for only 13 datasets in the 30-minute time budget. Similarly, among the 13 datasets, only 5 retain this behavior in the 60-minute budget. Finally, only one dataset is common among the four time budgets.
   From Figure 2, the datasets with improved performance scores are not the same in every time budget. By analyzing the meta-features of these datasets, we could not link them across the different time budgets. This observation is also confirmed in Figure 2(b), where we could not draw a clear pattern out of these datasets.

4.2 The Impact of Ensembling
To assess the impact of ensembling, we compared the performance of AutoSklearn-v and AutoSklearn-e. Figure 3 shows the performance differences between the two versions over the 100 datasets. On average, AutoSklearn-e increased the accuracy by 0.5%, 0.7%, 1%, and 1.4% over the four time budgets, respectively. In particular, the two modes have similar performance on 65, 62, 66, and 63 datasets for the 10, 30, 60, and 240 minute time budgets. Table 3 summarizes the results for each time budget.
   Table 4 shows the outcomes of the Wilcoxon test, which was conducted to assess the statistical significance of the accuracy difference between AutoSklearn-v and AutoSklearn-e. The table shows that the ensembling technique enhances the performance with a statistically significant gain with more than 95% confidence (𝑝 value < 0.05) on all four time budgets. The level of confidence is almost 99% over all the time budgets combined. Generally, an ensemble model substantially boosts accuracy compared to the individual base models as long as the base models’ errors are independent of each other [4]. Although the base classifiers’ errors are not completely independent, the ensemble model still enhances the accuracy in a statistically significant manner. This means that the accuracy improvement of the ensemble model generated by AutoSklearn-e is not a random effect and is expected to be repeated on new datasets.
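The greedy ensemble selection described in Section 3.4 can be sketched as follows. This is a simplified, self-contained illustration with plain majority voting; the model names, toy predictions, and deterministic tie-breaking rule are ours, not AutoSklearn’s.

```python
def accuracy(preds, y_true):
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)

def vote(ensemble, model_preds):
    """Majority vote of the (possibly repeated) ensemble members;
    ties are broken toward the smaller label for determinism."""
    n = len(next(iter(model_preds.values())))
    out = []
    for i in range(n):
        votes = [model_preds[name][i] for name in ensemble]
        out.append(max(sorted(set(votes)), key=votes.count))
    return out

def ensemble_selection(model_preds, y_valid, rounds=5):
    """Greedy forward selection (with replacement): in each round, add
    the stored model that maximizes validation accuracy of the vote."""
    ensemble = []
    for _ in range(rounds):
        best_name, best_score = None, -1.0
        for name in model_preds:
            score = accuracy(vote(ensemble + [name], model_preds), y_valid)
            if score > best_score:
                best_name, best_score = name, score
        ensemble.append(best_name)
    return ensemble

# Toy validation labels and stored per-model predictions (hypothetical):
y_valid = [1, 0, 1, 1, 0, 1]
model_preds = {
    "svc":  [1, 0, 0, 1, 0, 0],
    "tree": [1, 1, 1, 1, 0, 1],
    "knn":  [0, 0, 1, 1, 1, 1],
}
print(ensemble_selection(model_preds, y_valid, rounds=3))  # → ['tree', 'tree', 'svc']
```

Because selection is with replacement, repeated entries act as implicit weights; AutoSklearn applies the same greedy scheme to the validation predictions of the models stored during the search.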
[Figure 1 appears here: four scatter panels — (a) 10 Min, (b) 30 Min, (c) 60 Min, (d) 240 Min — plotting the per-dataset performance difference (M − V) over the 100 datasets, with markers for Negative, Same, Positive, and Failed.]

Figure 1: The impact of meta-learning over all time budgets. Upward triangles represent better performance with AutoSklearn-m, downward triangles represent better performance using AutoSklearn-v, and circles mean that the absolute difference is < 1%.
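The significance assessments in this section rely on the Wilcoxon signed-rank test. Its test statistic can be computed as in the following didactic sketch: pure Python, zero differences discarded, tied absolute differences given average ranks. In practice a library routine such as scipy.stats.wilcoxon would be used, which also yields the 𝑝 value; the paired accuracies below are hypothetical integer percentages chosen to keep the toy example exact.

```python
def wilcoxon_statistic(scores_a, scores_b):
    """Wilcoxon signed-rank statistic W = min(W+, W-) for paired scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    # Rank the absolute differences (1-based), averaging ranks over ties.
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)

# Hypothetical paired accuracies (in %) of two configurations on 5 datasets:
acc_m = [91, 85, 78, 90, 66]
acc_v = [90, 86, 70, 90, 60]
print(wilcoxon_statistic(acc_m, acc_v))  # → 1.5
```

A small W relative to its null distribution indicates that one configuration systematically outperforms the other across the paired datasets.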

Table 1: Comparison between the performance of AutoSklearn-v and AutoSklearn-m in terms of accuracy over different time budgets.

Time Budget   Framework        Accuracy (Mean / SD)   Gain in Accuracy (Min / Mean / Max)   #datasets with gain > 1%
10            AutoSklearn-m    0.870 / 0.144          1.1% / 2.9% / 6.7%                    25
10            AutoSklearn-v    0.868 / 0.145          1.1% / 5.6% / 15.6%                   10
30            AutoSklearn-m    0.873 / 0.143          1.1% / 2.8% / 18.8%                   27
30            AutoSklearn-v    0.873 / 0.142          1.1% / 4.5% / 16.7%                   17
60            AutoSklearn-m    0.873 / 0.141          1.1% / 3.4% / 18.8%                   17
60            AutoSklearn-v    0.874 / 0.137          1.1% / 4.4% / 13.3%                   17
240           AutoSklearn-m    0.877 / 0.136          1.1% / 5.5% / 18.8%                   13
240           AutoSklearn-v    0.872 / 0.149          1.1% / 2.7% / 8.3%                    17

Table 2: The results of Wilcoxon test for assessing the statistical significance of the performance difference using AutoSklearn-m over AutoSklearn-v

     Mode 1           Mode 2         Time Budget   𝑃 value
  AutoSklearn-m    AutoSklearn-v         10         0.004
                                         30         0.065
                                         60         0.434
                                        240         0.305


   The datasets with an enhanced/declined performance using AutoSklearn-e within the four time budgets are studied. Figure 4(a) shows the overlap among the datasets with better performance using the ensembling mechanism, and Figure 4(b) shows the overlap among the datasets with better performance using AutoSklearn-v. Our analysis shows no strong correlation between the meta-features of the datasets and the probability of having a better/lower performance impact using the ensembling mechanism.

4.3    The Impact of Time Budget

This experiment compares the accuracy gain of all combinations of time budget increases (10/30 Min, 10/60 Min, 10/240 Min, 30/60 Min, 30/240 Min, 60/240 Min). Due to space limitations, we could not include all comparison figures; however, they are available in the project repository.
   On average, the accuracy values on each of the four time budgets are comparable. For example, the base/increased time budgets, i.e., 10/30 Min (Figure 5(a)), 30/60 Min (Figure 5(b)), 60/240 Min (Figure 5(c)), and 10/240 Min (Figure 5(d)), have similar performance in 65, 57, 66, and 60 datasets, respectively. The accuracy of the base time budget was higher than that of the increased time budget for 9, 9, 13, and 11 datasets in the four figures, respectively. On the other hand, the increased time budgets achieve better performance for 17, 21, 22, and 26 datasets. The majority of the datasets
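The per-dataset bookkeeping behind these counts can be sketched in plain Python, using the same 1% threshold that Figures 3 and 5 apply when classifying a dataset as "same", "positive", or "negative". The accuracy values below are illustrative placeholders, not the paper's results:

```python
# Hypothetical per-dataset accuracies for a base and an increased time budget.
base = [0.80, 0.91, 0.75, 0.88, 0.66]
longer = [0.82, 0.91, 0.74, 0.93, 0.70]

def categorize(acc_a, acc_b, threshold=0.01):
    """Label each dataset as in Figures 3 and 5: 'same' when the absolute
    accuracy difference is below 1%, otherwise 'positive'/'negative'
    depending on the sign of (acc_b - acc_a)."""
    counts = {"positive": 0, "negative": 0, "same": 0}
    for a, b in zip(acc_a, acc_b):
        diff = b - a
        if abs(diff) < threshold:
            counts["same"] += 1
        elif diff > 0:
            counts["positive"] += 1
        else:
            counts["negative"] += 1
    return counts

print(categorize(base, longer))
```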
Table 3: Comparison between the performance of AutoSklearn-v and AutoSklearn-e in terms of accuracy over different time budgets.

                                   Accuracy         Gain (Accuracy)      #datasets with
 Time Budget   Framework         Mean     SD      Min    Mean     Max      gain > 1%
     10      AutoSklearn-e      0.873   0.139   1.1%   4.1%   16.4%          22
             AutoSklearn-v      0.868   0.145   1.1%   3.4%    8.3%          12
     30      AutoSklearn-e      0.880   0.136   1.1%   3.4%   13.5%          27
             AutoSklearn-v      0.873   0.142   1.1%   3.1%   11.1%          10
     60      AutoSklearn-e      0.884   0.132   1.1%   4.9%   12.5%          24
             AutoSklearn-v      0.874   0.137   1.1%   3.1%    6.4%           9
    240      AutoSklearn-e      0.886   0.130   1.1%   7.1%   52.7%          14
             AutoSklearn-v      0.872   0.149   1.1%   2.9%    8.3%          11
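The ensembling mechanism compared in Table 3 follows the greedy ensemble-selection idea: the final model is built by repeatedly adding (with replacement) the base model whose inclusion most improves validation performance. The sketch below is a simplified, hypothetical version for binary classifiers that output probabilities; `val_probs`, the model names, and the toy values are assumptions for illustration only:

```python
def accuracy(probs, labels):
    """Fraction of correct predictions at a 0.5 decision threshold."""
    return sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)

def ensemble_selection(val_probs, y_true, rounds=3):
    """Greedy forward selection with replacement: each round adds the model
    whose inclusion maximizes the accuracy of the averaged probabilities."""
    chosen = []
    for _ in range(rounds):
        best_name, best_acc = None, -1.0
        for name in val_probs:
            trial = chosen + [name]
            avg = [sum(val_probs[m][i] for m in trial) / len(trial)
                   for i in range(len(y_true))]
            acc = accuracy(avg, y_true)
            if acc > best_acc:
                best_name, best_acc = name, acc
        chosen.append(best_name)
    return chosen

# Toy validation predictions for three hypothetical base models,
# each 75% accurate alone but complementary in combination.
y_true = [1, 1, 0, 0]
val_probs = {
    "m1": [0.9, 0.4, 0.2, 0.1],
    "m2": [0.6, 0.8, 0.7, 0.3],
    "m3": [0.2, 0.9, 0.1, 0.4],
}
print(ensemble_selection(val_probs, y_true))
```

Averaging m1 and m2 already classifies every toy instance correctly, which is exactly the complementarity that makes an ensemble outperform its best single member.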


Figure 2: Overlap among datasets having better performance using AutoSklearn-m (a) and AutoSklearn-v (b) through each time budget. The new color in each bar represents the number of new datasets with higher performance at this time budget, while the same color represents the same datasets from previous time budgets.


Table 4: The results of Wilcoxon test for assessing the statistical significance of the performance difference using AutoSklearn-e over AutoSklearn-v

   Framework 1       Framework 2       TB    𝑃 value
  AutoSklearn-e     AutoSklearn-v      10     0.011
                                       30     0.000
                                       60     0.000
                                      240     0.008


Table 5: The results of Wilcoxon test for assessing the statistical significance of the performance gain for increasing the time budget

    Framework       TB 1   TB 2   Avg. Acc. Diff   𝑃 value
  AutoSklearn-v      30     10        0.005         0.226
                     60     10        0.007         0.004
                     60     30        0.002         0.141
                    240     10        0.007         0.000
                    240     30        0.002         0.027
                    240     60        0.000         0.110
  AutoSklearn-m      30     10        0.004         0.211
                     60     10        0.004         0.198
                     60     30        0.00          0.956
                    240     10        0.008         0.099
                    240     30        0.004         0.614
                    240     60        0.004         0.398
  AutoSklearn-e      30     10        0.007         0.000
                     60     10        0.011         0.000
                     60     30        0.004         0.675
                    240     10        0.013         0.000
                    240     30        0.006         0.038
                    240     60        0.002         0.265
  AutoSklearn        30     10        0.003         0.362
                     60     10        0.009         0.000
                     60     30        0.005         0.019
                    240     10        0.014         0.001
                    240     30        0.011         0.002
                    240     60        0.005         0.117


Table 6: Search space effect: result summary

  Search Space    Mean     SD
      3𝐶         0.867   0.139
      𝐹𝐶         0.863   0.153
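For reference, the Wilcoxon signed-rank statistic reported in Tables 2, 4, and 5 can be sketched in pure Python. This is a simplified, hypothetical implementation (normal approximation, zero differences dropped, no tie-variance correction), not the exact procedure used in the paper:

```python
import math

def wilcoxon_signed_rank(a, b):
    """Paired Wilcoxon signed-rank test, two-sided, normal approximation."""
    diffs = [y - x for x, y in zip(a, b) if y != x]  # drop zero differences
    n = len(diffs)
    # Rank the absolute differences (1 = smallest), averaging ranks in ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # W+ = sum of ranks of the positive differences.
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # Normal approximation of the null distribution of W+.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w_plus, p_value

w, p = wilcoxon_signed_rank([1, 2, 3, 4, 5], [2, 0, 6, 8, 0])
print(w, round(p, 3))
```

A small p-value (e.g., below 0.05) indicates that the paired accuracy differences are unlikely under the null hypothesis of no systematic difference, which is the criterion applied throughout these tables.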



achieve a performance improvement when increasing the time budget, while few datasets witness a performance decline. Thus, offering more time for AutoSklearn to search for a better solution generally leads to accuracy performance gains, as previously established in [7]. Table 5 shows the statistical significance of increasing the time budget [27]. In particular, increasing the time budget from 10 minutes to 30 minutes and from 60 minutes to 240 minutes does not provide a statistically significant performance gain. On the other hand, increasing the time budget from 10 to 60, 30 to 60, 10 to 240, and 30 to 240 provides a statistically significant accuracy gain.

4.4    The Impact of The Size of The Search Space

In this experiment, we compare the accuracy using the full search space, with all available classifiers (𝐹𝐶), to a subset of the search space containing the best-performing classifiers (3𝐶). In practice, we selected the top 3 classifiers, i.e., support vector machine, random forest, and decision trees, based on the results of the 𝐹𝐶. Table 6
Figure 3: The impact of ensembling over all time budgets ((a) 10 Min, (b) 30 Min, (c) 60 Min, (d) 240 Min). Upward triangles represent better performance with AutoSklearn-e, downward triangles represent better performance using AutoSklearn-v, and circles mean that the absolute difference is < 1%.
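The reduced 3𝐶 space keeps the classifiers that perform best under the full 𝐹𝐶 space. One plausible way to derive such a subset (a sketch; `best_per_dataset` and its values are hypothetical, not the study's actual per-dataset results) is to count how often each classifier wins across the 𝐹𝐶 runs:

```python
from collections import Counter

# Hypothetical record of the best classifier found per dataset in the
# full-search-space (FC) runs.
best_per_dataset = [
    "random_forest", "libsvm_svc", "decision_tree", "random_forest",
    "adaboost", "libsvm_svc", "random_forest", "decision_tree",
]

# The reduced 3C space keeps the three most frequently winning classifiers.
top3 = [name for name, _ in Counter(best_per_dataset).most_common(3)]
print(top3)
```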

shows that both search spaces have a comparable performance. Figure 6 shows the effect of using the 𝐹𝐶 against using only the 3𝐶 on AutoSklearn. The results show that there is no clear winner. In particular, the 𝐹𝐶 exceeds the accuracy of the 3𝐶 in 28 datasets with an average accuracy gain of 3.3%, while using the 3𝐶 achieves better performance on 21 datasets with an average accuracy difference of 5.9%. Besides, 50 datasets have negligible accuracy differences (less than 1%). The Wilcoxon test (Table 7) shows no statistically significant difference between the two search spaces. Although the 𝐹𝐶 is much larger, it does not reduce accuracy compared to the smaller exploited search space (3𝐶). Therefore, it is better to keep all classifiers in the target search space.


Table 7: The results of Wilcoxon test for assessing the statistical significance of the performance difference for increasing the search space

   Mode 1   Mode 2   Avg. Acc. Diff   𝑃 value
    𝐹𝐶       3𝐶         -0.003         0.618


Figure 4: Overlap among datasets having better performance using AutoSklearn-e (a) and AutoSklearn-v (b) through each time budget. The new color in each bar represents the number of new datasets with higher performance at this time budget, while the same color represents the same datasets from previous time budgets.


4.5    Special Runs and Discussion

The datasets with substantial performance differences towards/against any of the discussed learning settings were further investigated and rerun 3 times per configuration. We noticed that most of these datasets have an order of magnitude fewer instances than features or have very few instances (mostly datasets from medical domains); see Table 8. Generally, the generated pipelines and their accuracy for these kinds of datasets are completely different in each iteration. For instance, 5 different classifiers are selected in 6 unique pipelines for the dataset_40_sonar (sonar) dataset.
   In principle, the importance of the learners' hyper-parameters varies based on their effect on the accuracy [8]. Moreover, the importance of the hyper-parameters depends on the dataset characteristics. For example, the regularization parameter is critical for datasets with fewer instances than features to avoid over-fitting. Hence, the AutoML tool should pay more attention to it for better generalization with the current few instances.
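The point about regularization when instances are fewer than features can be made concrete with a toy example (the values are illustrative, not from the paper). With a single training instance and three features, ordinary least squares is underdetermined, while ridge regression, w = Xᵀ(XXᵀ + λI)⁻¹y, has a unique solution whose weights shrink as the regularization strength λ grows:

```python
# One instance (n = 1), three features (p = 3): X X^T is a 1x1 matrix,
# so the ridge solution reduces to a scalar formula.
x = [2.0, 1.0, 0.0]   # single training instance (hypothetical)
y = 3.0               # its target value

def ridge_weights(x, y, lam):
    """Closed-form ridge weights for one instance: w_j = x_j * y / (||x||^2 + lam)."""
    denom = sum(v * v for v in x) + lam
    return [v * y / denom for v in x]

for lam in (0.1, 1.0, 10.0):
    # Larger lam shrinks the weights, guarding against over-fitting
    # when instances are scarce relative to features.
    print(lam, ridge_weights(x, y, lam))
```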
                                                              Effect of increasing time budget form 10 Min to 30 Min in AutoSKLearn                                                     Effect of increasing time budget form 30 Min to 60 Min in AutoSKLearn
                                                                          Negative         Same          Positive          Failed                                                                  Negative          Same          Positive         Failed
                                                            0.15
                                   Performance Difference




                                                                                                                                                         Performance Difference
                                                                                                                                                                                   0.15
                                                            0.10
                                                                                                                                                                                   0.10
                                                            0.05                                                                                                                   0.05
                                                            0.00                                                                                                                   0.00
                                                                                                                                                                                   0.05
                                                                   0     10        20        30        40      50 60        70        80        90 100                                      0        10    20    30      40      50 60    70   80   90 100
                                                                                                            Data Sets                                                                                                         Data Sets
                                                                                                  (a) 10-30 Min.                                                                                                      (b) 30-60 Min.

                                                       Effect of increasing time budget form 60 Min to 4 Hours in AutoSKLearn                                                      Effect of increasing time budget form 10 Min to 4 Hours in AutoSKLearn
                                                                  Negative          Same          Positive          Failed                                                                   Negative          Same          Positive          Failed
                   Performance Difference




                                                                                                                                                         Performance Difference
                                                  0.3                                                                                                                             0.5
                                                                                                                                                                                  0.4
                                                  0.2                                                                                                                             0.3
                                                  0.1                                                                                                                             0.2
                                                  0.0                                                                                                                             0.1
                                                                                                                                                                                  0.0
                                                  0.1
                                                              0         10        20        30        40      50 60         70        80    90 100                                      0       10        20    30     40      50 60      70   80   90   100
                                                                                                           Data Sets                                                                                                        Data Sets
                                                                                                 (c) 60-240 Min.                                                                                                      (d) 10-240 Min.


 Figure 5: The impact of increasing the time budget on AutoSklearn performance from 𝑥 to 𝑦 minutes (x-y). Upward triangles
 represent better performance with the 𝑦 time budget; downward triangles represent better performance with the 𝑥 time budget;
 circles mean that the difference between 𝑥 and 𝑦 is < 1%.

[Figure 6 plot: "Accuracy Difference between fc and 3c for 30 minutes in AutoSklearn"; y-axis Performance Difference, x-axis Data set; legend: Negative, Same, Positive, Failed]

 Figure 6: The impact of reducing the search space size on each AutoML framework. Upward triangles represent better performance with the 𝐹𝐶 search space. Downward triangles represent better performance with the 3𝐶 search space. Circles mean that the difference between 𝐹𝐶 and 3𝐶 is < 1%.

   In AutoSklearn, the ML pipeline structure consists of three fixed components: a data preprocessor, a feature preprocessor, and a classifier. AutoSklearn tries several options for each stage and selects the one that maximizes the validation accuracy. Because the feature engineering phase is so influential, the output pipelines show large performance differences when they use different feature preprocessors, even if the same classifier is selected for all of them. For example, although lda is selected as the classifier in two pipelines for dbworld-bodies (bodies), their accuracies differ widely: 93.8% for AutoSklearn-v, which used the nystroem_sampler preprocessor, compared to 75% for AutoSklearn-m, which used no preprocessor. Similarly, within AutoSklearn-v, two pipelines with the same Gaussian naive Bayes (gaussian_nb) classifier achieve 100% and 83.3% accuracy with different preprocessors. In practice, the feature engineering phase consumes most of a data scientist's time; however, it is not handled well by any AutoML tool, including AutoSklearn [24]. There is therefore substantial room for improvement in the automated feature engineering phase.
   These results reflect the importance of feature engineering as a crucial step in classical machine learning. The right feature engineering can turn the feature space into a linearly separable one, so that even naive classifiers achieve relatively high accuracy. Conversely, skipping this phase or applying the wrong preprocessors makes high accuracy hard to reach, even for the most effective classifiers. This is why image datasets that use many raw pixels as features tend to show oscillating performance depending on the preprocessors selected in the feature engineering phase: pipelines whose preprocessors suit the target dataset achieve comparatively better accuracy.
   Using AutoSklearn-e's greedy implementation of ensembling on datasets with very few instances degrades performance, because the validation set then also contains very few instances and the fitted model is vulnerable to over-fitting on it (e.g., GCM in Table 8).
   We believe that when dealing with really big datasets⁷, the optimization process does not have the luxury of attempting the same large number of configurations, given the significant cost (e.g., time and computing resources) of each configuration attempt. Tackling big datasets therefore requires a distributed AutoML search process. For such datasets, the meta-learning mechanism can have a stronger impact, reducing the search space and guiding the optimization toward possibly fewer attempts within the defined time budget (see Table 8 for the CovPokElec dataset).

 ⁷ The average size of the 100 datasets in our experiments is 21.2MB (relatively small). Relatively big datasets such as Cifar-10 (643MB) have failed with AutoSklearn.
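To make the separability argument concrete, the following self-contained sketch (toy data and a toy classifier, not part of the paper's benchmark) labels grid points by whether they fall inside a circle: no threshold on a raw coordinate separates the classes well, but a single engineered feature r² = x² + y² lets a naive one-dimensional threshold classifier become perfect.

```python
# Minimal sketch: the right engineered feature can make a naive
# classifier succeed where it fails on raw features.
# Hypothetical toy data, not from the paper's benchmark.

def make_circle_data():
    # Grid points labeled 1 if they lie inside the circle of radius 1.5.
    pts = [(x / 2.0, y / 2.0) for x in range(-4, 5) for y in range(-4, 5)]
    return [((x, y), 1 if x * x + y * y < 1.5 ** 2 else 0) for (x, y) in pts]

def best_threshold_accuracy(values, labels):
    # Naive 1-D classifier: try every threshold on a single feature,
    # in both directions, and return the best achievable accuracy.
    n = len(labels)
    best = 0.0
    for t in sorted(set(values)):
        for direction in (1, -1):
            correct = sum(
                1 for v, y in zip(values, labels)
                if (1 if direction * v < direction * t else 0) == y
            )
            best = max(best, correct / n)
    return best

data = make_circle_data()
labels = [y for _, y in data]

# Raw coordinate x alone: the classes overlap badly.
acc_raw = best_threshold_accuracy([p[0] for p, _ in data], labels)

# Engineered feature r^2 = x^2 + y^2: the classes become separable.
acc_eng = best_threshold_accuracy([p[0] ** 2 + p[1] ** 2 for p, _ in data], labels)

print(acc_raw, acc_eng)  # the engineered feature reaches 1.0
```

The same naive classifier jumps from roughly 69% to 100% accuracy once the feature space is transformed, which mirrors how pipelines with a well-suited preprocessor dominate those without one.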
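The greedy ensembling mechanism mentioned above (Caruana-style forward selection [3], on which AutoSklearn's ensemble builder is based) can be sketched in a few lines. The model predictions below are hypothetical; note how, with only six validation instances, the selection tailors the ensemble perfectly to the validation split, which is exactly the over-fitting risk discussed above.

```python
def greedy_ensemble_selection(val_preds, val_labels, rounds=3):
    """Sketch of Caruana-style greedy forward selection: at each round,
    add the model whose inclusion maximizes majority-vote accuracy on
    the validation set. Selection is with replacement, so a strong
    model can be added more than once."""
    n = len(val_labels)
    chosen, votes = [], [0] * n  # votes[i]: sum of selected models' 0/1 predictions

    def accuracy(extra):
        # Majority-vote accuracy of the current ensemble plus one candidate.
        k = len(chosen) + 1
        preds = [1 if (votes[i] + extra[i]) / k >= 0.5 else 0 for i in range(n)]
        return sum(p == y for p, y in zip(preds, val_labels)) / n

    for _ in range(rounds):
        best = max(range(len(val_preds)), key=lambda m: accuracy(val_preds[m]))
        chosen.append(best)
        votes = [v + p for v, p in zip(votes, val_preds[best])]

    k = len(chosen)
    final_acc = sum(
        (1 if votes[i] / k >= 0.5 else 0) == val_labels[i] for i in range(n)
    ) / n
    return chosen, final_acc

# Hypothetical 0/1 predictions of three models on a 6-instance validation
# set; such a tiny split is the regime where the selected ensemble can
# over-fit the validation data (cf. GCM in Table 8).
val_labels = [1, 0, 1, 1, 0, 0]
val_preds = [
    [1, 0, 1, 0, 0, 0],   # model 0: 5/6 correct
    [0, 0, 1, 1, 0, 0],   # model 1: 5/6 correct, but errs on a different instance
    [1, 1, 1, 1, 1, 1],   # model 2: weak alone (3/6), useful as a tie-breaker
]
chosen, final_acc = greedy_ensemble_selection(val_preds, val_labels)
print(chosen, final_acc)
```

The greedy procedure combines the two complementary strong models and even the weak one to fit the six validation instances exactly; on a split this small, that perfect score says little about test performance.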
Table 8: A sample of the datasets' characteristics and results from the repeated (special) runs. In column A, 'm', 'e', and 'v' denote the AutoSklearn variant; the three accuracy values per row come from the repeated runs.

  Dataset        #Feat.    #Inst.   A    Accuracy
  --- 10 minutes ---
  sonar          61        208      m    0.885   0.827   0.788
                                    v    0.846   0.827   0.712
  bodies         4703      64       m    0.875   0.875   0.75
                                    v    0.938   0.938   0.875
  tumors_C       7130      60       m    0.666   0.666   0.60
                                    v    0.666   0.466   0.466
  micro-mass     1301      517      m    0.881   0.839   0.811
                                    v    0.947   0.867   0.832
  GCM            160064    190      e    0.604   0.604   0.583
                                    v    0.792   0.708   0.646
  --- 30 minutes ---
  stemmed        3722      64       m    0.812   0.812   0.75
                                    v    0.875   0.875   0.812
  lymphoma       4027      45       m    0.812   0.812   0.75
                                    v    0.875   0.875   0.812
  rsctc2010_3    22278     95       m    0.958   0.875   0.875
                                    v    1.0     0.833   0.75
  --- 240 minutes ---
  CovPokElec     65        1.4 M    m    0.954   0.888   0.62
                                    v    0.80    0.572   0.504

5 CONCLUSION
This paper analyzed the various learning settings employed by AutoSklearn and by AutoML frameworks in general. The analysis revealed several insights that can help guide and improve the design of future AutoML techniques. For example, no single configuration of the learning settings can guarantee improved performance for all datasets; each configuration usually leads to better performance on some of them. The meta-learning mechanism pioneered by AutoSklearn achieves a statistically significant performance improvement only with short time budgets, and it largely loses its impact with longer ones. Hence, we recommend meta-learning only with limited time budgets or with huge datasets for which training even a single model takes a long time. Ensembling, in contrast, consistently improves the results across all time budgets; it is therefore recommended, especially for datasets with many features and few instances, since it reduces the chance of overfitting the validation split. Increasing the time budget needs to be considered carefully, as it does not always lead to a significant accuracy improvement; this decision can vary from one scenario or application to another according to the resource-accuracy tradeoff. Deliberately selecting a small search space with a few top-performing classifiers can achieve performance very comparable to a search space that includes many classifiers; this insight is essential, especially for large datasets that cannot be evaluated with many classifiers. Finally, this study opens the door to adaptively configuring the default learning settings for each input dataset based on its characteristics.

ACKNOWLEDGEMENT
This work is funded by the European Regional Development Funds via the Mobilitas Plus programme (grant MOBTT75).

REFERENCES
 [1] James Bergstra, Dan Yamins, and David D. Cox. 2013. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference. Citeseer, 13-20.
 [2] Bernd Bischl et al. 2017. OpenML benchmarking suites and the OpenML100. arXiv preprint arXiv:1708.03731 (2017).
 [3] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. 2004. Ensemble selection from libraries of models. In ICML.
 [4] Thomas G. Dietterich. 2000. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems.
 [5] Salijona Dyrmishi, Radwa Elshawi, and Sherif Sakr. 2019. A Decision Support Framework for AutoML Systems: A Meta-Learning Approach. In Proceedings of the 1st IEEE ICDM Workshop on Autonomous Machine Learning (AML).
 [6] Stefan Falkner, Aaron Klein, and Frank Hutter. 2018. BOHB: Robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774 (2018).
 [7] Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. 2020. Auto-sklearn 2.0: The next generation. arXiv preprint arXiv:2007.04074 (2020).
 [8] Matthias Feurer and Frank Hutter. 2019. Hyperparameter optimization. In Automated Machine Learning. Springer, Cham, 3-33.
 [9] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and Robust Automated Machine Learning. In NIPS.
[10] Edmund A. Gehan. 1965. A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika 52, 1-2 (1965), 203-224.
[11] Pieter Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, and Joaquin Vanschoren. 2019. An open source AutoML benchmark. arXiv preprint arXiv:1907.00909 (2019).
[12] Isabelle Guyon, Lisheng Sun-Hosoya, Marc Boullé, Hugo Jair Escalante, Sergio Escalera, Zhengying Liu, Damir Jajetic, Bisakha Ray, Mehreen Saeed, Michèle Sebag, et al. 2019. Analysis of the AutoML Challenge Series. Automated Machine Learning (2019), 177.
[13] Xin He, Kaiyong Zhao, and Xiaowen Chu. 2019. AutoML: A Survey of the State-of-the-Art. arXiv preprint arXiv:1908.00709 (2019).
[14] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization.
[15] Haifeng Jin, Qingquan Song, and Xia Hu. 2019. Auto-Keras: An efficient neural architecture search system. In ACM KDD.
[16] Aaron Klein, Stefan Falkner, Numair Mansur, and Frank Hutter. 2017. RoBO: A flexible and robust Bayesian optimization framework in Python. In NIPS 2017 Bayesian Optimization Workshop.
[17] Mohamed Maher and Sherif Sakr. 2019. SmartML: A Meta Learning-Based Framework for Automated Selection and Hyperparameter Tuning for Machine Learning Algorithms. In EDBT.
[18] Sherif Sakr and Albert Y. Zomaya (Eds.). 2019. Encyclopedia of Big Data Technologies. Springer. https://doi.org/10.1007/978-3-319-63962-8
[19] Radwa El Shawi, Mohamed Maher, and Sherif Sakr. 2019. Automated Machine Learning: State-of-The-Art and Open Challenges. CoRR abs/1906.02287 (2019). arXiv:1906.02287 http://arxiv.org/abs/1906.02287
[20] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484-489.
[21] Thomas Swearingen, Will Drevo, Bennett Cyphers, Alfredo Cuesta-Infante, Arun Ross, and Kalyan Veeramachaneni. 2017. ATM: A distributed, collaborative, scalable system for automated machine learning. In 2017 IEEE International Conference on Big Data (Big Data).
[22] Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2013. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In ACM KDD.
[23] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. 2020. Fixing the train-test resolution discrepancy: FixEfficientNet. arXiv preprint arXiv:2003.08237 (2020).
[24] Anh Truong, Austin Walters, Jeremy Goodsitt, Keegan Hines, Bayan Bruss, and Reza Farivar. 2019. Towards automated machine learning: Evaluation and comparison of AutoML approaches and tools. arXiv preprint arXiv:1908.05557 (2019).
[25] Joaquin Vanschoren. 2018. Meta-learning: A survey. arXiv preprint arXiv:1810.03548 (2018).
[26] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: Networked Science in Machine Learning. SIGKDD Explorations 15, 2 (2013), 49-60. https://doi.org/10.1145/2641190.2641198
[27] Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics. Springer, 196-202.
[28] Marc-André Zöller and Marco F. Huber. 2019. Benchmark and Survey of Automated Machine Learning Frameworks. (2019).
[29] Albert Y. Zomaya and Sherif Sakr. 2017. Handbook of Big Data Technologies. Springer.