<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Effective data pre-processing for AutoML</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Joseph Giovanelli</string-name>
          <email>j.giovanelli@unibo.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Besim Bilalli</string-name>
          <email>bbilalli@essi.upc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Abelló</string-name>
          <email>aabello@essi.upc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Politècnica de Catalunya</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de Catalunya</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Bologna</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data pre-processing plays a key role in a data analytics process (e.g., supervised learning). It encompasses a broad range of activities that span from correcting errors to selecting the most relevant features for the analysis phase. There is no clear evidence, or rules defined, on how pre-processing transformations (e.g., normalization, discretization, etc.) impact the final results of the analysis. The problem is exacerbated when transformations are combined into pre-processing pipeline prototypes. Data scientists cannot easily foresee the impact of pipeline prototypes and hence require a method to discriminate between them and find the most relevant ones (e.g., with the highest positive impact) for their study at hand. Once found, these pipelines can be optimized using AutoML in order to generate executable pipelines (i.e., with parametrized operators for each transformation). In this work, we study the impact of transformations in general, and the impact of transformations when combined together into pipelines. We develop a generic method for finding effective pipeline prototypes. Evaluated using Scikit-learn, our effective pipeline prototypes, when optimized, provide results that reach 90% of the optimal predictive accuracy in the median, but at a cost that is 24 times smaller.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The decision making process has historically been key for
the success of any organization or business activity. Lately,
with the abundant presence of data, this process has become
data-driven, where data are continuously analyzed to be
transformed into knowledge. Along the way however, data
undergo several (sometimes necessary) processing steps, shown
in Figure 1. Firstly, data arriving in a raw format from
different sources are sifted out such that only a relevant subset
is selected. Next, this subset is pre-processed and is fed to
a machine learning (ML) algorithm for it to be analyzed.
The output of the analysis is then interpreted and the whole
process iterates until the results obtained are satisfactory and
significant for the decisions to be made.</p>
      <p>
        Unfortunately, this well known process does not have
universal well-defined practices for the different steps, which
translates to the data scientist manually configuring and
parameterising the operators for each step until an optimal
solution is found — an optimal data analytics pipeline. To
this end, most of the time is spent on the heavily
laborious work of pre-processing (i.e., 50-80% of the time [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]),
where the generated output is a pre-processing pipeline. Next,
once the data is transformed into the proper form, different
ML algorithms with different hyperparameters are evaluated
over the dataset until an acceptable result is obtained — ML
pipeline. This whole process requires expertise and is
particularly challenging for novice, inexperienced data scientists. To
help with the challenge of finding optimal analytics pipelines,
several techniques have been proposed, among them AutoML
being the most prominent. (Copyright © 2021 for this paper by its
authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).)
      </p>
      <p>
        AutoML is a technique originally proposed for
automatically selecting and parameterising the learning algorithm,
which has been known as the Combined Algorithm Selection
and Hyper-parameter Optimization (CASH) problem [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
Given a budget in terms of time or number of iterations, an
optimization algorithm iterates by visiting a potentially better
configuration each time, until a near optimal solution is
obtained. AutoML has become the de-facto standard procedure
for CASH, demonstrating very good results [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] even in
Kaggle competitions. However, traditionally its drawback has
been that it focuses mainly on optimizing the
hyperparameters of the learning algorithms, often overlooking the impact
of data pre-processing. The process gets stuck on a near
optimal hyper-parameter configuration, and no matter the
number of iterations, the results do not improve. This hinders
the real power of AutoML because pre-processing operators
do not get thoroughly considered. Since CASH is one
aspect of automatically finding the best data analytics pipeline,
AutoML has been extended to also specifically cover the
pre-processing part. The latter has been coined as the Data
Pipeline Selection and Optimization (DPSO) problem [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ],
where a pipeline prototype (sequence of transformations,
e.g., missing value imputation followed by normalization) is
fed to AutoML and an optimal instance of the prototype, in
the form of a pipeline (sequence of operators, e.g.,
imputation by mean followed by min-max normalization) is found.
By considering pre-processing as an integral component of
data analytics, and carefully configuring the pre-processing
pipelines, it is easy to obtain results that go beyond the ones
obtained by only optimizing the learning algorithm.
      </p>
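      <p>To make the distinction concrete, the following sketch (ours, not the authors' code) instantiates the prototype Imputation → Normalization into an executable Scikit-learn pipeline; the operators and parameters chosen here are illustrative:</p>
      <preformat>
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# Prototype: Imputation -> Normalization (the order is fixed).
# One concrete instantiation, as an optimizer might choose it:
pipeline = Pipeline([
    ("imputation", SimpleImputer(strategy="mean")),      # operator + parameter
    ("normalization", MinMaxScaler()),                   # operator
    ("classifier", KNeighborsClassifier(n_neighbors=3)), # ML algorithm
])

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 6.0], [5.0, 8.0]])
y = np.array([0, 0, 1, 1])
pipeline.fit(X, y)  # the missing value is imputed, then features are scaled
print(pipeline.predict([[4.5, 7.0]]))
```
      </preformat>
      <p>Swapping operators (e.g., imputation by median, standard scaling) yields a different pipeline of the same prototype; this set of instantiations is exactly the search space that AutoML explores in DPSO.</p>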
      <p>
        To briefly illustrate this, we perform an experiment on
the well known bank-marketing1 dataset, using
HyperOpt [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as an AutoML approach to optimize the
parameters of three different ML algorithms, namely Naive Bayes
(NB), K-Nearest Neighbor (KNN), and Random Forest (RF).
We provide an initial budget of 50 iterations for
optimizing the hyper-parameters of the algorithms, and after the
50th iteration, we fix the algorithm configuration to the best
one achieved so far and start optimizing the pre-processing
pipeline2. In Figure 2, the improvement ratio of predictive
accuracy (i.e., ratio of the accuracy obtained after the i-th
iteration to the baseline/default accuracy) is plotted against
the number of different configurations visited by HyperOpt
(i.e., iterations). Observe that after the 11th iteration for NB
and KNN, and after the 26th iteration for RF, the lines remain
flat. That is, from there on, no improvement is achieved by
optimizing the hyper-parameters of the algorithms until the
50th iteration is reached. At this point, a sudden jump is
observed and the results start to improve again, going clearly
beyond the ones obtained before, thanks to the
optimizations performed now over the pre-processing pipeline.
1https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
2This order is used only for the sake of illustration.
      </p>
      <p>[Figure 1: the data analytics process. Data source; Data selection; Data pre-processing (data pipeline prototype selection, data pipeline prototype optimization); Modeling (algorithm selection, hyper-parameter optimization); Interpretation/Evaluation; Knowledge.]</p>
      <p>
        Yet,
including the pre-processing in a free form in the
optimization, heavily increases the search space, making the problem
much harder. This is mitigated by creating a pre-processing
pipeline prototype that fixes the order of transformations,
leaving the freedom to only instantiate and parametrise them.
Therefore, the challenge for data scientists is to find the
right pre-processing pipeline prototype to optimize, that is,
(i) which pre-processing transformations to consider in the
prototype, and (ii) how to order them such that when
optimizing the parameters of their operators, better results are
obtained. The aim of this work is thus to study these two
questions in order to propose a method for generating
effective pre-processing pipeline prototypes, that once instantiated
through AutoML improve the final result of the analysis. To
keep discussions and experiments simpler, we stick to
supervised learning tasks, which encompass algorithms generating
a mapping function from pairs of input-output exemplars.
In particular, this work focuses on classification problems,
where the output to be predicted is of categorical type.
Contributions. The main contributions of this paper can be
summarized as follows:
• We empirically evaluate the impact of optimizing the
exhaustive set of potential pipeline prototypes and
find out that there is no single universal pipeline that
works best for every dataset and algorithm considered.
• We define a method that given a learning algorithm
and a set of pre-processing operators, is capable of
generating the right order between operators,
obtaining the most effective pre-processing pipelines.
• We perform a comprehensive evaluation by
comparing the performance of optimizing the pipelines
generated following our method, and find out that:
– with 24 times less time budget, our proposed
pipelines obtain results whose median is above 90%
of the optimal ones.
– on average, in 73% of the cases, splitting evenly
the time budget between pre-processing and
hyperparametrisation outperforms the results of only
optimizing the hyper-parameters of the ML algorithm.</p>
      <p>The remainder of this paper is organized as follows. In
Section 2, we discuss the related work. In Section 3, we
present our method for constructing effective pipeline
prototypes. In Section 4, we provide an extensive evaluation of
the pipelines created using that method. Finally, in Section 5,
we provide the conclusions and future work.
</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>A lot of ongoing research aims at addressing the problem
of providing user assistance for the different steps of the
data analytics process. In general, there is a trend to develop
(semi) automatic systems that assist the user in one or many
steps altogether. At the beginning, the focus was to provide
support exclusively for the learning step (i.e., the CASH
problem). Recently however, the direction has shifted towards
designing systems that additionally or specifically provide
user assistance in the data pre-processing step (i.e., the DPSO
problem).</p>
      <p>
        When it comes to data pre-processing, different works
have tackled this problem from different perspectives. For
instance, there are works that aim to apply pre-processing
for the sake of guaranteeing data quality, or enabling data
exchange, or even data integration. That is, they consider data
pre-processing in isolation or apart from data analysis [
        <xref ref-type="bibr" rid="ref11 ref13 ref14 ref7">7, 11,
13, 14</xref>
        ]. In this and our related work however, we consider
only the works that see pre-processing as an integral part of
data analytics and hence apply it for the sake of improving
the final result of the analysis.
      </p>
      <p>Finally, there are works that aim at fully automating the
data analytics process (i.e., automatically generate data
analytics flows), which roughly translates to combining DPSO
with CASH, where the boundary between the latter two
becomes blurry. Nevertheless, we tentatively group the works
based on the type of the problem they aim to solve.
</p>
    </sec>
    <sec id="sec-3">
      <title>DPSO</title>
      <p>
        In DPD [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], the DPSO problem, as we use it in this work,
is formally defined. The authors demonstrate the impact of
optimizing the pre-processing pipeline, but considering only
a single fixed pipeline prototype. However, as we will see
later (Section 4.1), a single fixed prototype cannot perform
best for every dataset. Therefore, we build on top of [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], yet
instead of relying on a fixed prototype, we define a method
to generate the right pipeline prototypes to be optimized.
      </p>
      <p>
        In PRESISTANT [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4–6</xref>
        ], we tackled the problem of
recommending pre-processing operators to the non-expert data
analyst. The goal, and at the same time the challenge was
to identify the pre-processing operators, and rank them in
advance, based on their potential impact to the final
analysis. However, we did not consider pre-processing pipelines,
but only single transformations, expecting that the analyst
applies the process iteratively. In this work, we consider sets
of transformations and thus study the impact of combining
transformations into a pipeline.
      </p>
      <p>
        In ActiveClean [
        <xref ref-type="bibr" rid="ref16">16</xref>
], the authors define an algorithm that aims
at prioritizing the cleaning of records that are more likely
to affect the results of the statistical modeling problems,
assuming that the latter belong to the class of convex loss
models (i.e., linear regression and SVMs). Hence, instead of
recommending the transformations to be applied, the system
recommends the subset of data which needs to be cleaned at
a given point. The type of pre-processing to be applied is left
to the user, assuming that the user is an expert.
      </p>
      <p>
        In Learn2Clean [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], based on a reinforcement learning
technique, for a given dataset, and an ML model, an optimal
sequence of operators for pre-processing the data is
generated, such that the quality of the ML model is maximized.
Here, similarly to [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], the pipeline prototype is fixed in
advance. Our work is a step further in that we help to choose
the right pipeline prototype, instead of fixing it in advance.
      </p>
      <p>
        In Alpine Meadow [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], authors follow a similar approach
to ours in that they define two steps for the pre-processing
phase. One, the so called logical pipeline plan, which is
roughly equivalent to the pipeline prototypes defined in this
work, and the second the physical pipeline plan which
translates to pipelines used in this work. The physical plan is
generated through a combination of Bayesian optimization,
meta-learning, and multi-armed bandits. For the logical plans,
they rely on rules but without clear evidence on how they are
generated. Moreover, it is not clear whether the logical plan
is fixed as in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and if some further adjustment from the
user is required.
      </p>
    </sec>
    <sec id="sec-4">
      <title>CASH</title>
      <p>
        The task in solving the CASH problem is to automatically
find an optimized instantiation for the hyper-parameters of
the ML algorithm. Most of the works use Bayesian
optimization methods to tune and optimize them [
        <xref ref-type="bibr" rid="ref18 ref21 ref8">8, 18, 21</xref>
        ].
Since Bayesian optimization is randomized, meta-learning
has been used to find a good seed for the search [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Most
of these works however, only minimally consider the data
pre-processing step.
      </p>
      <p>
        Auto-WEKA [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and its counterpart package for Python,
Auto-sklearn [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], solve the problem of learning algorithm
selection and their associated hyper-parameter optimization in
a combined search space. They use Sequential Model-based
Algorithm Configuration (SMAC) to explore the large search
space. These systems also consider pre-processing
transformations to generate end-to-end analytic pipelines. Yet, they
consider a small set of transformations and also consider a
single fixed pipeline prototype. Our work is complementary
to Auto-WEKA and Auto-sklearn.
      </p>
      <p>
        TPOT [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] is a tree-based pipeline optimization tool using
genetic programming while requiring little to no expertise
from the user. In TPOT however, they only consider one
transformation inside the optimization process (i.e., Feature
Engineering).
      </p>
      <p>
        Based on the work of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Microsoft developed an
AutoML tool via Azure. They build predictive ML pipelines
combining collaborative filtering and Bayesian Optimization.
In particular, they model the search space as a probability
distribution defined by a Probabilistic Matrix Factorization and
then use expected improvement as acquisition function to
choose the most promising pipeline.
      </p>
      <p>To summarize, full automation of data analytics has been
the ultimate goal of many research works. Yet, such an
automation has shown to be computationally expensive, mainly
due to the search space involved (i.e., pre-processing and
mining operators). Therefore, the usability of these approaches
in realistic scenarios is sometimes limited. Regardless of
the latter, our approach of finding a set of effective pipeline
prototypes can be seen as complementary to these solutions,
since it helps in pruning the large search space.
</p>
    </sec>
    <sec id="sec-5">
      <title>DATA PRE-PROCESSING PIPELINE CONSTRUCTION</title>
      <p>
        Following the notation from [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], we also distinguish
between a pipeline prototype and a pipeline. The former is
defined as a fixed, ordered sequence of kinds of pre-processing
transformations, where each kind of transformation can be
instantiated by a specific set of operators, into an actual
executable pipeline. Typically, pipeline prototype construction
is a manual and tedious task, where a data scientist
exhaustively iterates over a staggeringly large number of possible
pipeline orderings, until he/she finds one that works best for
the problem at hand. This is a challenging task due to the
fact that there are no clear rules and guidelines in terms of
which permutation of kinds of transformations would work
best (i.e., the final impact of a pipeline is difficult to foresee).
To facilitate it, we propose a method, sketched in Figure 3,
that in short breaks the combinatorial problem of finding the
best pipeline into studying kinds of transformations in pairs,
ultimately, generating effective pipeline prototypes. Some of
the steps of the method are generic and thus can be applied
regardless of the context, and yet others are specific, and
depend on the context (i.e., AutoML framework used or dataset
characteristics).
      </p>
      <p>The process starts by selecting the ML library and
AutoML framework to be used (e.g., Scikit-learn, AutoWeka),
which on the one hand determine the potential kinds of
transformations and their available instantiations, and on the other
hand allow to generate framework-related rules reflecting
the limitations in the concrete implementation of operators.
These rules enable the generation of precedence relationships
between the kinds of transformations for which they apply.
Next, the process follows with a study over all the possible
pairs of kinds of transformations, aiming to find the
correct/meaningful order between them using generic knowledge
about their behaviour. As a result, a set of heuristic rules that
determines precedences between transformations is
generated. Furthermore, for the pairs for which an order cannot be
clearly devised, an additional empirical study is performed.
This study relies on a testbed of dataset representatives, and
thus it implicitly corresponds to domain knowledge. The
output of this step is a set of learnt rules that determines
promising precedences of transformations (i.e. an order that
would potentially positively impact the final result of the
analysis).
      </p>
      <p>[Figure 3: sketch of the proposed method. From the selected machine learning framework and the list of pre-processing transformations of interest, framework-related rules are derived, followed by heuristic and learnt rules, which are finally composed into pipeline prototypes.]</p>
      <p>However, even after this phase, for some pairs of
transformations an order may not be determined. These are
pairs where the order is relevant but cannot be decided in
advance, thus all their permutations need to be present. Finally,
a step of composition follows, where given the overall set of
devised rules (i.e., framework-related, heuristic and learnt),
transformations are composed resulting in a set of valid and
potentially effective pipeline prototypes.</p>
      <p>Overall, each pair of kinds of transformations can be classified
along three dimensions: (1) Compatible/Incompatible pairs.
Depending on whether the framework at hand permits the
precedence between them (compatible), or not (incompatible)
(see Section 3.2). (2) Meaningful/Meaningless pairs. Depending
on whether there is generic knowledge (e.g., from the
literature) over the behaviour of transformations
(meaningful), or not (meaningless) (see Section 3.3).
(3) Promising/Unpromising pairs. Depending on whether
the precedence between them is expected to provide
positive impact on the final result of the analysis
(promising), or not (unpromising) (see Section 3.4).</p>
      <p>Thus, according to the relationships between its
transformations, a prototype can be described as either compatible,
well-formed, or effective. A prototype is defined to be
compatible, if all its precedence relationships are compatible. It
is defined as well-formed, if all its precedence relationships
are both compatible and meaningful. Finally, it is defined
as effective, if all its precedence relationships are
compatible, meaningful, and promising at the same time. In fact, the
ultimate goal of our method is to find effective pipeline prototypes.</p>
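      <p>These definitions can be stated operationally. The sketch below (ours) classifies a prototype from the pairwise relations of its ordered transformations; the relation sets are illustrative placeholders, not the contents of Table 1:</p>
      <preformat>
```python
from itertools import combinations

# Illustrative relation sets over ordered pairs (a, b), meaning "a before b".
compatible = {("Imputation", "Normalization"), ("Imputation", "Discretization"),
              ("Normalization", "Discretization")}
meaningful = {("Imputation", "Normalization"), ("Imputation", "Discretization")}
promising  = {("Imputation", "Normalization")}

def classify(prototype):
    """Classify an ordered sequence of kinds of transformations."""
    pairs = {(a, b) for a, b in combinations(prototype, 2)}  # induced order
    if not pairs.issubset(compatible):
        return "incompatible"
    if not pairs.issubset(meaningful):
        return "compatible"
    if not pairs.issubset(promising):
        return "well-formed"
    return "effective"

print(classify(["Imputation", "Normalization"]))                    # effective
print(classify(["Imputation", "Discretization"]))                   # well-formed
print(classify(["Imputation", "Normalization", "Discretization"]))  # compatible
```
      </preformat>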
    </sec>
    <sec id="sec-7">
      <title>Framework-related rules</title>
      <p>The compatibility of transformations is dependent on the
selected ML framework. We studied the transformations
implemented in Scikit-learn and detected a set of implicit rules
that are shown through an adjacency matrix, corresponding
to a precedence graph, in Table 1a. Each cell a_ij denotes
a precedence relationship between the row i and column j.
Hence, 1 means that an edge exists between the
transformation in the row and the transformation in the column, while 0
means that such an edge does not exist, hence a precedence
order is not established for that pair.</p>
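      <p>As an illustration (ours), such a precedence graph can be encoded as a set of directed edges and queried when validating a prototype. The edges below reconstruct the constraints described in this section, not the exact contents of Table 1a:</p>
      <preformat>
```python
# An edge (a, b) means "a must precede b" in any prototype containing both.
kinds = ["Imputation", "Encoding", "Normalization", "Discretization",
         "Rebalancing", "FeatureEngineering"]
edges = set()

# Most transformations cannot run on missing values, so Imputation goes
# first, except with Normalization, which tolerates them (footnote 5).
for t in kinds:
    if t not in ("Imputation", "Normalization"):
        edges.add(("Imputation", t))

# Mixed-type arrays need categorical attributes encoded into numbers first.
for t in ("Normalization", "Discretization", "Rebalancing", "FeatureEngineering"):
    edges.add(("Encoding", t))

def allowed(prototype):
    """Check that no framework-imposed precedence is violated."""
    pos = {t: i for i, t in enumerate(prototype)}
    return all(pos[b] > pos[a] for a, b in edges
               if a in pos and b in pos)

print(allowed(["Imputation", "Encoding", "Normalization"]))  # True
print(allowed(["Discretization", "Imputation"]))             # False
```
      </preformat>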
      <p>For example, most Scikit-learn transformations cannot be
applied in the presence of missing values. This is why in
every pair of transformations where Imputation is involved,
except the one with Normalization5, Imputation goes first.
Furthermore, Scikit-learn transformations are applied only
to all compatible attributes of a given dataset. Generally,
categorical attributes are physically represented as strings and
continuous attributes as numbers. However, a
transformation that is meant to be applied, say to continuous attributes,
cannot be applied over a dataset that contains both
continuous and categorical attributes (i.e., a dataset containing both
numbers and strings); Scikit-learn cannot deal with arrays
of mixed types. In that case, all the categorical attributes
need to be encoded into numeric representations, even if
they represent a categorical value. That is, a value can be a
number but represent a category. This is what happens when
Normalization and Discretization are meant to be applied to
a dataset containing mixed types of attributes. In order for
them to be applied to datasets of mixed types, an Encoding
transformation needs to be applied first. A similar constraint
5Normalization transformations are the only ones that Scikit-learn can apply
on datasets with missing values.
is imposed when considering Rebalancing and Feature
Engineering, since these transformations do not accept inputs
containing strings (i.e., representing a categorical type). For
the rest of the pairs of transformations there are no constraints
imposed by the framework, thus any order of such
transformations is permitted, reflected by a 0 in Table 1a. The graph
obtained in this case exclusively corresponds to the
limitations of Scikit-learn (as a matter of fact, if another framework
were to be chosen, it may have looked differently).
</p>
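      <p>A quick check (ours) of the behaviour behind these rules: Scikit-learn's normalization tolerates missing values, whereas discretization does not, so Imputation must precede it:</p>
      <preformat>
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, KBinsDiscretizer

X = np.array([[1.0], [2.0], [np.nan], [4.0]])

# Normalization: the NaN is simply passed through.
print(MinMaxScaler().fit_transform(X).ravel())

# Discretization: fitting on data with a NaN raises an error.
try:
    KBinsDiscretizer(n_bins=2, encode="ordinal",
                     strategy="uniform").fit_transform(X)
except ValueError as e:
    print("discretization failed:", e)
```
      </preformat>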
    </sec>
    <sec id="sec-8">
      <title>Heuristic rules</title>
      <p>In the previous section, we derived a precedence based on
the constraints of the framework. Now, we want to study the
precedence independently of the framework, and find
meaningful pairs. That is, for every given pair, we want to find
the relative order, based on generic (i.e., based on the
literature), domain-independent knowledge about transformations
and their applicability. To this end, note that some of the
constraints imposed by the framework may be contradicted
here, but this will be considered later when taking the union
of both of the graphs (see Section 3.5). Table 1b shows the
results obtained.</p>
      <p>Observe that the constraints on the Imputation
transformation still hold, that is, it is correct to apply Imputation first
when combining it with another transformation, including
when combining it with Normalization. However, the
precedences of Encoding are not present, hence not considering
the framework, an Encoding transformation makes sense to
be combined in any order with the rest of transformations,
except Imputation. For the sake of an example, Discretization
combined with Encoding is a meaningful combination (when
a mixed type dataset is considered), but incompatible from
the point of view of Scikit-learn. Furthermore, notice that
combining Discretization with Normalization does not make
sense, due to the fact that after the Discretization step,
continuous attributes are transformed into categorical ones, and
hence Normalization cannot be applied. Similarly, applying
Normalization first, changes the scale of the values and hence
impacts the Discretization step. Finally, another meaningful
precedence can be derived when combining Normalization
with Rebalancing. In this case, Normalization should be
applied first, since otherwise Rebalancing would impact the
scale of the values to be normalized.
</p>
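      <p>The order-dependence can be seen directly (a toy example, ours): discretizing first and then normalizing merely rescales bin labels into fractions that no longer look categorical, while normalizing first keeps raw bin indices, so the two orders produce different data:</p>
      <preformat>
```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler

X = np.array([[1.0], [5.0], [9.0], [10.0]])
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")

a = MinMaxScaler().fit_transform(disc.fit_transform(X))  # Discretize -> Normalize
b = disc.fit_transform(MinMaxScaler().fit_transform(X))  # Normalize -> Discretize

print(a.ravel())  # rescaled bin labels, e.g. 0.5 for the middle bin
print(b.ravel())  # raw bin indices 0, 1, 2
```
      </preformat>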
    </sec>
    <sec id="sec-9">
      <title>Learnt rules</title>
      <p>
        A third viewpoint to consider is that of learning a promising
order by empirically studying the impact of the combinations
on the final result of the analysis, using different
classification problems in the training. Specifically, we considered
three classification algorithms (i.e., NB, RF, KNN) on 60 out
of 72 classification problems from the OpenML-CC18
benchmark [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. We omitted 12 datasets since, due to their size,
the exhaustive search (see Section 4.2) was taking an infeasible
amount of computing time.
      </p>
      <p>We could try to learn the precedence of every pair of
transformations, but that would just be a waste of resources because,
as we can see in Tables 1a and 1b, some precedences are
already decided for one reason or another. Hence, only pairs
of transformations with a 0 for both directions (in both
Table 1a and 1b) need to be studied further. That is, they make
sense to be combined together, but a precedence order could
not be determined through framework-related or heuristic
rules. Thus, for instance pairs involving Encoding are not
considered in this phase, since for them an order is already
imposed by the framework (see Table 1a). To this end, the
pairs of transformations we consider for the third precedence
graph include only {F , N }, {F , D}, {F , R}, and {R, D}.</p>
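      <p>This selection can be computed mechanically. In the sketch below (ours), the edge sets are reconstructed from the constraints described in Sections 3.2 and 3.3 rather than copied from Tables 1a and 1b; only pairs undecided in both graphs, and not ruled out as meaningless, remain for the empirical study:</p>
      <preformat>
```python
from itertools import combinations

# F = Feature Eng., N = Normalization, D = Discretization,
# R = Rebalancing, E = Encoding, I = Imputation
kinds = ["F", "N", "D", "R", "E", "I"]

# Directed edges "a before b", reconstructed from the text.
framework = {("I", t) for t in ["E", "D", "R", "F"]} | \
            {("E", t) for t in ["N", "D", "R", "F"]}
heuristic = {("I", t) for t in ["E", "N", "D", "R", "F"]} | {("N", "R")}

decided = {frozenset(p) for p in framework | heuristic}
meaningless = {frozenset({"N", "D"})}  # Normalization x Discretization

undecided = [set(p) for p in combinations(kinds, 2)
             if frozenset(p) not in decided
             and frozenset(p) not in meaningless]
print(undecided)  # the four pairs studied empirically
```
      </preformat>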
      <p>3.4.1 Learning method. For every selected pair of
transformations, for a given classification algorithm, we
check which order of the pair improves most the
performance (e.g., predictive accuracy) of the algorithm over all
the datasets considered. To this end, for each dataset we get
a precedence order that gives better results (i.e., promising
precedence) in terms of predictive accuracy (other metrics
could have been used as well).</p>
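      <p>These steps can be sketched as follows (a simplification, ours, of Algorithm 1: smbo is stubbed, and the validity check is reduced to beating the baseline, whereas the paper applies the richer rules of Table 2):</p>
      <preformat>
```python
def find_promising_prototype(d, a, order1, order2, acc, smbo):
    acc_baseline = acc(d, a)           # line 1: baseline accuracy on d
    pipe1, acc1 = smbo(order1, d, a)   # line 2: optimize order T1 -> T2
    pipe2, acc2 = smbo(order2, d, a)   # line 3: optimize order T2 -> T1
    if max(acc1, acc2) > acc_baseline: # line 4: simplified validity check
        # line 5: the winner is the better-scoring optimized pipeline
        return pipe1 if acc1 >= acc2 else pipe2
    return None                        # lines 6-8: no valid winner

# Toy usage with stubbed components (all names here are ours):
winner = find_promising_prototype(
    d="dataset", a="algorithm",
    order1="T1->T2", order2="T2->T1",
    acc=lambda d, a: 0.70,
    smbo=lambda order, d, a: (order, 0.75 if order == "T2->T1" else 0.72),
)
print(winner)
```
      </preformat>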
      <p>
        To find a promising precedence order between a given
pair of transformations, we follow the steps shown in
Algorithm 1. To be able to compute the impact of transformations,
we first take the accuracy of the classification algorithm over
the original dataset, and use it as baseline for comparison
(see line 1). Next for each precedence order (lines 2-3), we
find both the pipelines with the best parametrisations using
Sequential Model-Based Optimization (SMBO) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and the
accuracy of the ML algorithm (with default parametrisation)
on the transformed datasets. Finally, the winner pipeline is
selected (line 5). Notice that a prior validity check
(line 4) is necessary because of the way SMBO works. When
optimizing a pre-processing pipeline (in this case consisting
of only two transformations), given a budget in terms of time
or number of iterations, SMBO tries many different
configurations of parameters for each of the operators implementing
the transformations, yet, one of the possible configurations
is to not instantiate one of the transformations (or both) in
the pipeline (i.e., represented with a  symbol). Hence, out
Nr. Pipeline 1
1.
2.
3.
4.
5.
6.
7.
8.
      </p>
      <p> → 
 → 
 → 
 → 
 → confT2
 → confT2
 → confT2
 → confT2
9. confT1 → 
10. confT1 → 
11. confT1 → 
12. confT1 → 
13. confT1 → confT2
14. confT1 → confT2
15. confT1 → confT2
confT2 → confT1
Pipeline 2
 → 
confT2 → 
 → confT1
 → 
confT2 → 
 → confT1
 → 
confT2 → 
 → confT1
confT2 → confT1
confT2 → confT1
 → 
confT2 → 
 → confT1</p>
      <p>Valid result
Draw
Draw
Draw
Draw
confT2 → confT1
Draw
Draw
 → confT2
confT2 → 
Draw
Draw
confT2 → confT1
Draw
Draw
Draw
confT1 → 
 → confT1
Draw
confT2 → confT1
Draw
confT1 → confT2
Draw
confT1 → confT2
Draw
confT1 → confT1
Draw
confT1 → confT2
confT2 → confT1</p>
      <p>Valid score
accbaseline
accbaseline
accbaseline
accbaseline
accT2→T1
accbaseline
accT2
accT2 or accT1
accT2
accT2→T1
accbaseline
accT1 or accT2
accT1
accT1
accT2→T1
accbaseline
accT1→T2
accT2
accT1→T2
accT1
accT1→T2
accT1 or accT2
accT1→T2
accT2→T1</p>
      <p>Winner prototype</p>
      <p>Baseline
Baseline
Baseline
Baseline
T2 → T1
Baseline</p>
      <p>T2
T2</p>
      <p>T2
T1 or T2</p>
      <p>T2
T2 → T1
Baseline
T1 or T2</p>
      <p>T1
T1
T1</p>
      <p>T1
T2 → T1
Baseline
T1 → T2</p>
      <p>T2
T1 → T2</p>
      <p>T1
T1 → T2
T1 or T2
T1 → T2
T2 → T1</p>
    </sec>
    <sec id="sec-10">
      <title>Algorithm 1: Find a promising pipeline prototype for transformations T1 and T2</title>
      <sec id="sec-10-1">
        <title>Pseudocode</title>
        <p>Require: d, a  # dataset, classification algorithm
Require: T1 → T2, T2 → T1  # precedence orders of a pair of transformations
1: accbaseline = Acc(d, a)  # get baseline performance of algorithm a on d
2: [pipelineT1→T2, accT1→T2] = SMBO(T1 → T2, d, a)  # get pipeline and accuracy for T1 → T2
3: [pipelineT2→T1, accT2→T1] = SMBO(T2 → T1, d, a)  # get pipeline and accuracy for T2 → T1
4: if IsValid(accT1→T2, accT2→T1, accbaseline) then  # see Table 2 for the rules applied
5:   return Winner([pipelineT1→T2, accT1→T2], [pipelineT2→T1, accT2→T1])  # see column Winner prototype in Table 2
6: else
7:   return
8: end if</p>
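        <p>The control flow of Algorithm 1 can be sketched as follows (a minimal Python rendering; <code>smbo</code> is a stand-in for the SMBO optimizer, and the validity rule shown is a simplification of the Table 2 rules):</p>
```python
def find_promising_prototype(acc_baseline, smbo):
    """Sketch of Algorithm 1 for a transformation pair (T1, T2).

    smbo(order) stands in for the paper's SMBO call and returns a
    (pipeline, accuracy) tuple for the given precedence order."""
    pipeline_12, acc_12 = smbo("T1 -> T2")
    pipeline_21, acc_21 = smbo("T2 -> T1")
    # Simplified validity rule (assumption): declare a winner only if
    # the two orders differ and the best one improves on the baseline.
    if acc_12 != acc_21 and max(acc_12, acc_21) > acc_baseline:
        if acc_12 > acc_21:
            return pipeline_12, acc_12
        return pipeline_21, acc_21
    return None  # draw: no promising precedence order

# A deterministic stand-in optimizer, for illustration only:
def toy_smbo(order):
    return (order, 0.80 if order == "T2 -> T1" else 0.75)
```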
        <p>[Figure (fragment): panels for KNN with T1 = Feat. Eng., T2 = Rebalance, and for KNN with T1 = Discretize, T2 = Rebalance.]</p>
        <p>3.4.2 Results. Applying Algorithm 1, we obtain a
promising order for each pair of transformations considered (i.e.,
{F , N }, {F , D}, {F , R}, {R, D}). Since SMBO is a
randomized algorithm we experimented with (a) running it several
times splitting the budget, and (b) running it only once with
a the total budget. For the experiments considered, no
significant differences where observed, therefore we opted for
running it once with all the budget (i.e., 200 seconds per run),
which allows for more configurations to be visited in a single
run. Aggregating all the results, Figure 4 shows the number
of datasets, for which a given combination (see Table 2,
column Winner prototype for the list of labels) is selected as
the winner. For instance, for the pair {F , N } (i.e., Feature
Engineering, Normalization), the most winning prototype for
KNN and NB is N → F , which means that for most datasets,
better results are obtained if Normalization is applied before
Feature Engineering. Next, the second position for KNN and
NB, and the first for RF is only N , which means that for
many datasets, in different algorithms, it is better to apply
only Normalization without combining it with Feature
Engineering. The third position is for  → , which means
that better results are obtained if no transformation is
applied. The remaining prototypes winning in some datasets
are F (only Feature Engineering), and F → N (Feature
Engineering preceding Normalization). Finally, for three datasets,
which are omitted for simplicity, there were no winning
pipelines (i.e., pipelines resulted in a draw). Since we want
to find the best order for a given pair of transformations, we
specifically focus on the performances of the corresponding
precedence orders (i.e., T1 → T2 versus T2 → T1) of a pair of
transformations, and disregard the comparison of the other
possibilities (i.e., T1, T2, or Baseline) since they are not
relevant for declaring a winner. Hence, we check whether the
difference between these pipelines are statistically significant
by running a binomial test assuming a theoretical
probability of 0.5. The results are shown in Table 3. In summary,
the results from Table 3 indicate that, with 95% confidence
we can assume that for the pair {F , N }, N → F performs
better than F → N , hence Normalization should precede
Feature Engineering. Similarly, for {D, F }, D → F performs
better than F → D, hence Discretization should precede
Feature Engineering. Finally, for the remaining pairs, {F , R}
and {R, D}, a precedence order can not be assumed since the
results obtained are not significant. Using these results, we
create the promising precedence adjacency matrix shown in
Table 1c, where as one can observe, precedence edges are
introduced for {N , F } and {D, F }, but no edges exist neither
for {F , R}, nor for {R, D}.
3.5</p>
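        <p>The significance check described above can be reproduced with an exact two-sided binomial test (a sketch; here <code>wins</code> is the number of datasets on which one precedence order beats the other, out of <code>n</code> non-draw datasets):</p>
```python
from math import comb

def binomial_p_value(wins, n, p=0.5):
    """Exact two-sided binomial test: probability of an outcome at
    least as extreme as `wins` out of `n` under win probability p."""
    lower = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(wins + 1))
    upper = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(wins, n + 1))
    # Double the smaller tail, capped at 1.
    return min(1.0, 2 * min(lower, upper))

# With a 0.05 threshold, one order is declared better only when the
# p-value falls below it (95% confidence, as in the text).
```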
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Pipeline prototype composition</title>
      <p>To generate the final pipeline prototypes, in this step we
combine all the matrices generated by the previous steps.
That is, we take the union of the edges (represented by 1’s)
from the matrices in Table 1 (a,b,c), and create a new final
adjacency matrix, shown in Table 4. This is the matrix that
will allow us to generate the final effective pipelines.</p>
      <p>Observing the table, one can realize that for pairs {F , R}
and {R, D}, no precedence edges exist. This means that these
pairs are somewhat equally relevant from either direction
(any order), and thus when generating the final prototypes,
both options should appear.</p>
      <p>For a better reading, in Figure 5, we visualize Table 4 in
form of a graph, where nodes represent the kinds of
transformations and the directed edges represent a precedence
order between them. Out of the graph, we generate the final
pipeline prototypes by taking all the maximum length
variations (ordered arrangements without repetition) of the nodes,
respecting the precedence rules (i.e., not contradicting the
direction of existing edges). The result is the set of five pipeline
prototypes shown in Table 5. This set, consisting of
compatible, meaningful, and promising pairs of transformations, is the
set of recommended effective pipeline prototypes.</p>
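      <p>The composition step can be sketched as enumerating orderings that respect the precedence edges (illustrative only; the node and edge sets below are toy values, not the exact graph of Figure 5):</p>
```python
from itertools import permutations

def valid_orderings(nodes, precedence):
    """All orderings of `nodes` that do not contradict any edge (u, v),
    where (u, v) means u must precede v."""
    result = []
    for perm in permutations(nodes):
        pos = {n: i for i, n in enumerate(perm)}
        if all(pos[v] > pos[u] for u, v in precedence):
            result.append(perm)
    return result

# Toy example: with nodes {A, B, C} and the single edge A -> B,
# half of the 3! orderings survive the precedence rule.
```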
    </sec>
    <sec id="sec-12">
      <title>EVALUATION</title>
      <p>The aim of our experimental study is three-fold:
(1) Check whether there exists a universal pipeline
prototype that works best for every classification problem
considered (i.e., dataset and ML algorithm) (Section 4.1).
(2) Assess and compare the performance of the effective
pipelines constructed using our method against the set
of exhaustively generated pipeline prototypes
(Section 4.2).
(3) Assess and compare the impact of dedicating a portion
of the optimization time to the effective pipelines
constructed using our method, with the impact of using
the whole optimization time for the hyper-parameters
of the ML algorithm (Section 4.3).</p>
      <p>The experiments were performed on an Intel Core i7
machine with 12 cores, running at 3.20 GHz with 64 GB of main
memory. As a platform for running the SMBO optimization
algorithm we use HyperOpt. Furthermore, the datasets used
in the experiments are the ones from the OpenML-CC18
suite (see Section 3.4). Finally, the classification algorithms
considered are NB, KNN, and RF. All the experiments for a
single algorithm on average took approximately two weeks (footnote 6).</p>
    </sec>
    <sec id="sec-13">
      <title>Universal pipeline prototype</title>
      <p>The goal of this experiment is to demonstrate the difficulty
of blindly finding the right pipeline prototype (i.e., without
considering any meaningful or promising precedence). In
Table 6, we list the exhaustive set of pipeline prototypes
generated considering the compatible precedence graph in
Table 1a (i.e., 24 compatible permutations). In a real scenario,
this number is too high to split the time budget among them
for optimization. Yet, for the sake of this experiment, we
exhaustively optimize all the prototypes for each dataset. Thus,
for each pipeline prototype and for each dataset, the SMBO
algorithm is configured to assign a 200-second time budget
to the phase of instantiating and optimizing the pipeline
prototype, and another 200 seconds to the phase of optimizing
the hyper-parameters of the ML algorithm. (Footnote 6: The source
code and the datasets for reproducing the experiments can be found at
https://github.com/josephgiovanelli/effective_preprocessing_pipeline_evaluation.)</p>
      <p>The results obtained are shown in Figure 6. The
enumerated prototypes are listed on the ordinate axis, and each
stacked bar represents the percentage of cases for which that
prototype achieved the best performance across different ML
algorithms (the contribution of each algorithm is represented
with a different color). In an ideal scenario, for a pipeline
to be considered universal, it should perform best in all, or
at least most, of the cases, which is clearly not happening.
Observe that even the best-performing pipeline is the best
in only 11% of the cases, which is obviously far from being
universal. Hence, all (or at least several) pipelines need to be
evaluated together in order to obtain optimal results.</p>
    </sec>
    <sec id="sec-14">
      <title>Exhaustive versus effective prototypes</title>
      <p>Given that there is no single universal pipeline, one can opt
for feeding all the possible prototypes (see Table 6) to the
optimization algorithm in order to get the optimal results. As
before, we assign a budget of 200 seconds for the
optimization of each prototype, hence 80 minutes in total for the
set of 24 exhaustive prototypes, in order to find the optimal
pipeline for every dataset. On the other hand, we take only
the five effective prototypes resulting from the application
of our method and assign just a 40-second time budget for
the optimization of each one of them, hence 200 seconds in
total. With the aim of comparing the two, and thus roughly
understanding how close we are to the optimal case, in both
cases, we dedicated the same time budget (i.e., 200 seconds)
for the phase of optimizing the hyper-parameters of the ML
algorithm. In order to evaluate how close the effective
prototypes are to the exhaustive ones, we calculate the normalized
distance from the result to the optimum:</p>
      <sec id="sec-14-1">
        <title>Normalized distance</title>
        <p>normalized distance = (Acc(d_effective, a∗) − Acc(d, a)) / (Acc(d_exhaustive, a∗) − Acc(d, a))</p>
        <p>where Acc(d, a) is the baseline performance (i.e., the
predictive accuracy of algorithm a with default
hyper-parameters over the original dataset d), Acc(d_effective, a∗) is
the accuracy of the optimized algorithm a∗ over the dataset
d_effective transformed using the optimized instantiation of
the effective set of prototypes (i.e., our approach), and
Acc(d_exhaustive, a∗) is the accuracy of the optimized
algorithm a∗ over the dataset d_exhaustive transformed using the
optimized pipeline instantiation of the exhaustive set of
prototypes. The subtraction of Acc(d, a) weights the
difficulty of a dataset, hence allowing for comparisons in terms
of the gain in accuracy: the bigger the potential gain
(denominator), the bigger the obtained gain (numerator) must be
for the latter to be relevant.</p>
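        <p>The normalized distance above reduces to a one-line computation (a sketch; the accuracies in the example are made up):</p>
```python
def normalized_distance(acc_effective, acc_exhaustive, acc_baseline):
    """Gain of the effective prototypes over the baseline, relative to
    the gain of the exhaustive search over the baseline."""
    return (acc_effective - acc_baseline) / (acc_exhaustive - acc_baseline)

# Made-up example: baseline 0.70, exhaustive optimum 0.80, effective
# prototypes 0.79 give a normalized distance of 0.9, i.e. 90% of the
# accuracy gain of the exhaustive search is recovered.
```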
        <p>The results obtained for every dataset and algorithm are
shown as boxplots in Figure 7. Observe that most of the cases
are very close to the results obtained using the exhaustive set,
the median distances being 91.51%, 93.13%, and 88.97% for NB,
KNN, and RF, respectively. In general, in 75% of the cases
the chosen pipelines are above 80%, and only a few outliers
are below 60%. Curiously, in some cases we outperform
the results over the exhaustive set of pipelines, but this is
due to the randomness of the optimization algorithm, which,
unless it is given an unrealistically high time budget, is not
capable of finding the true optimal solution. We discarded
the option of assigning a larger budget, since this was not
practical considering the huge search space and the lack of
any guarantee of improvement.</p>
        <p>To summarize, the experiment shows that with roughly 24
times less time budget, we can obtain results that reach 90%
of the optimal ones in the median. The raw results (i.e.,
without the normalized distances) can be found on the
aforementioned GitHub page.</p>
      </sec>
    </sec>
    <sec id="sec-15">
      <title>Complementing hyper-parameter optimization with pre-processing</title>
      <p>We have just shown that our effective pipeline prototypes
have a similar impact to the exhaustive prototypes. Now we
want to compare the impact of the effective prototypes against
optimizing only the hyper-parameters of the ML algorithm.
That is, we want to examine whether dedicating a part of the
optimization budget to the pre-processing pipeline has a greater
(positive) impact on the results of the analysis than using the
whole budget for the hyper-parameter optimization (footnote 7).</p>
      <p>
        To this end, for the latter we now dedicate the total
optimization budget (i.e., 400 seconds), and for the former,
inspired by [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], we split the budget 50-50 between the
preprocessing pipeline optimization and the hyper-parameter
optimization (i.e., 200 seconds for the pre-processing, and
200 seconds for the hyper-parameter optimization). The time
for the pre-processing is further split among the five different
pipeline prototypes (i.e., 40 seconds each).
      </p>
      <p>To compare the results, we calculate the impact using the
formulas below, that correspond to the normalized distance
from either pre-processing or hyper-parameter optimization
to the maximum improvement that can be achieved,
regardless of whether pre-processing is applied or not.
pp impact = (Acc(d_effective, a∗) − Acc(d, a)) / (max(Acc(d_effective, a∗), Acc(d, a∗)) − Acc(d, a))
hp impact = (Acc(d, a∗) − Acc(d, a)) / (max(Acc(d_effective, a∗), Acc(d, a∗)) − Acc(d, a))</p>
      <sec id="sec-15-1">
        <title>Impact of pre-processing and hyper-parameter optimization</title>
        <p>
where Acc(d, a) is the baseline accuracy (i.e., the predictive
accuracy of algorithm a with default hyper-parameters
over the original dataset d), Acc(d_effective, a∗) is the
accuracy of the optimized algorithm a∗ over the dataset d_effective
transformed using the optimized instantiation of the effective
set of prototypes obtained with the method above, and
Acc(d, a∗) is the accuracy of the optimized algorithm a∗ (i.e.,
using the entire budget) over the original dataset d.</p>
        <p>To obtain relative values that sum to 1, we normalize the
impacts by dividing them by their sum. For instance, for the
pre-processing score we calculate the following:
normalized pp impact = pp impact / (pp impact + hp impact)</p>
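        <p>The two impacts and their normalization can be sketched as follows (the accuracies in the example are made up):</p>
```python
def normalized_impacts(acc_baseline, acc_effective, acc_hp_only):
    """Return (normalized pp impact, normalized hp impact).

    acc_effective: accuracy with pre-processing plus hyper-parameter
    optimization; acc_hp_only: accuracy with hyper-parameter
    optimization alone, using the entire budget."""
    gain = max(acc_effective, acc_hp_only) - acc_baseline
    pp = (acc_effective - acc_baseline) / gain  # pp impact
    hp = (acc_hp_only - acc_baseline) / gain    # hp impact
    total = pp + hp
    return pp / total, hp / total

# Made-up example: baseline 0.70, with pre-processing 0.80, with
# hyper-parameter optimization only 0.75 gives pp : hp of 2 : 1.
```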
        <p>We perform the same for the hyper-parameter impact and
plot the results obtained for all the algorithms and datasets
in Figure 8, where each bar represents the results obtained
for a single dataset. The different colors represent the impact
values of pre-processing and hyper-parameter optimization.</p>
        <p>Observing the bar charts, one can see that (i) dedicating a
portion of the budget to pre-processing brings benefit to the
analysis in most of the cases (i.e., 73% of the cases), and (ii)
the impact of hyper-parameter optimization increases with
the number of hyper-parameters of the ML algorithm (e.g.,
hyper-parameter optimization impacts RF more than NB).
Overall, we can conclude that pre-processing is a critical step
that, once effectively applied, may have a high positive impact
on the final result of the analysis.</p>
      </sec>
    </sec>
    <sec id="sec-16">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>In this work, we first studied the overall impact of
transformations when combined into pre-processing pipelines,
and then delved into examining the impact of the precedence
order between pairs of transformations. As a result, we
defined a method that generates effective pre-processing
pipelines, that is, pipelines that consist of (i) compatible
pairs of transformations with respect to the framework used,
(ii) meaningful pairs of transformations in terms of general
knowledge (best practices), and (iii) promising pairs of
transformations that, once applied, are expected to provide
higher overall impact (domain knowledge). (Footnote 7: To
enable the application of the ML algorithms on all the datasets,
whenever required, we apply the necessary transformations,
e.g., imputation or encoding.)</p>
      <p>An extensive evaluation on 60 datasets with heterogeneous
characteristics, from sample size to feature types, and a set
of classification algorithms (i.e., NB, KNN, RF) showed that
our devised pipeline prototypes give promising results. More
specifically, we were able to observe that:
– The overall impact of optimizing pre-processing is
not negligible and it may boost the performance of
the overall analytics (e.g., predictive accuracy).
– There is no universal pre-processing pipeline
prototype that works best for every dataset and algorithm.
– With a 24 times smaller time budget, our proposed
pipeline prototypes were able to obtain results that
reached 90% (in the median) of the optimal ones found
through an exhaustive search.
– Dedicating a portion of the time to the pre-processing
optimization, instead of dedicating it entirely to
hyper-parameter optimization, may boost the final result of
the analysis: on average, in 73% of the cases, including
pre-processing in the optimization outperformed the
results of only optimizing the hyper-parameters.</p>
      <p>
        The results indicate that pre-processing can boost the
performance of the ML algorithm. Hence, it must be considered
as an integral part of the data analytics optimization process.
As immediate future work, we intend to: (i) study the datasets
that do not react well, or at all, to data pre-processing,
(ii) increase the number of datasets and algorithms used to
further validate our approach, and (iii) extend our approach
with a meta-learning module that, given some dataset
characteristics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], is capable of recommending the most useful
prototype among the five effective ones.
      </p>
    </sec>
    <sec id="sec-17">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported by the GENESIS project, funded
by the Spanish Ministerio de Ciencia e Innovación under
project TIN2016-79269-R. We thank the University of Bologna
for issuing a grant for the author's research stay at Universitat
Politècnica de Catalunya. Finally, we thank Matteo Golfarelli
for his comments and feedback on this work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bergstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yamins</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Cox</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures (ICML'13)</article-title>
          .
          <fpage>115</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Laure</given-names>
            <surname>Berti-Équille</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation (WWW '19)</article-title>
          .
          <fpage>2580</fpage>
          -
          <lpage>2586</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Besim</given-names>
            <surname>Bilalli</surname>
          </string-name>
          , Alberto Abelló, and
          <string-name>
            <given-names>Tomàs</given-names>
            <surname>Aluja-Banet</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>On the predictive power of meta-features in OpenML</article-title>
          .
          <source>Int. J. Appl. Math. Comput. Sci. 27</source>
          ,
          <issue>4</issue>
          (
          <year>2017</year>
          ),
          <fpage>697</fpage>
          -
          <lpage>712</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Besim</given-names>
            <surname>Bilalli</surname>
          </string-name>
          , Alberto Abelló, Tomàs Aluja-Banet, Rana Faisal Munir, and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Wrembel</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>PRESISTANT: Data Pre-processing Assistant (CAiSE Forum '18)</article-title>
          .
          <fpage>57</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Besim</given-names>
            <surname>Bilalli</surname>
          </string-name>
          , Alberto Abelló, Tomàs Aluja-Banet, and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Wrembel</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Intelligent assistance for data pre-processing</article-title>
          .
          <source>Comput. Stand. Interfaces</source>
          <volume>57</volume>
          (
          <year>2018</year>
          ),
          <fpage>101</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Besim</given-names>
            <surname>Bilalli</surname>
          </string-name>
          , Alberto Abelló, Tomàs Aluja-Banet, and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Wrembel</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>PRESISTANT: Learning based assistant for data pre-processing</article-title>
          .
          <source>Data Knowl. Eng</source>
          .
          <volume>123</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Xu</given-names>
            <surname>Chu</surname>
          </string-name>
          , John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti,
          <string-name>
            <given-names>Nan</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yin</given-names>
            <surname>Ye</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing (SIGMOD '15)</article-title>
          .
          <fpage>1247</fpage>
          -
          <lpage>1261</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Feurer</surname>
          </string-name>
          , Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Efficient and Robust Automated Machine Learning (NeurIPS '15)</article-title>
          .
          <fpage>2962</fpage>
          -
          <lpage>2970</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Feurer</surname>
          </string-name>
          , Jost Tobias Springenberg, and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Initializing Bayesian Hyperparameter Optimization via Meta-Learning (AAAI '15)</article-title>
          .
          <fpage>1128</fpage>
          -
          <lpage>1135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Nicolo</given-names>
            <surname>Fusi</surname>
          </string-name>
          , Rishit Sheth, and
          <string-name>
            <given-names>Melih</given-names>
            <surname>Elibol</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Probabilistic Matrix Factorization for Automated Machine Learning (NeurIPS '18)</article-title>
          .
          <fpage>3352</fpage>
          -
          <lpage>3361</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Floris</given-names>
            <surname>Geerts</surname>
          </string-name>
          , Giansalvatore Mecca, Paolo Papotti, and
          <string-name>
            <given-names>Donatello</given-names>
            <surname>Santoro</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>The LLUNATIC Data-cleaning Framework</article-title>
          .
          <source>Proc. VLDB Endow.</source>
          .
          <volume>6</volume>
          ,
          <issue>9</issue>
          (
          <year>July 2013</year>
          ),
          <fpage>625</fpage>
          -
          <lpage>636</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Frank</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Holger H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Leyton-Brown</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Sequential model-based optimization for general algorithm configuration</article-title>
          .
          <source>In International conference on learning and intelligent optimization</source>
          . Springer,
          <fpage>507</fpage>
          -
          <lpage>523</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Zhongjun</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael R.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Cafarella</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Foofah: A Programming-By-Example System for Synthesizing Data Transformation Programs (SIGMOD '17)</article-title>
          .
          <fpage>1607</fpage>
          -
          <lpage>1610</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Zuhair</given-names>
            <surname>Khayyat</surname>
          </string-name>
          , Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz,
          <string-name>
            <given-names>Nan</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Si</given-names>
            <surname>Yin</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>BigDansing: A System for Big Data Cleansing (SIGMOD '15)</article-title>
          .
          <fpage>1215</fpage>
          -
          <lpage>1230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kotthoff</surname>
          </string-name>
          , Chris Thornton,
          <string-name>
            <given-names>Holger</given-names>
            <surname>Hoos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          , and Kevin Leyton-Brown.
          <year>2017</year>
          .
          <article-title>Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>18</volume>
          (
          <year>03 2017</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Sanjay</given-names>
            <surname>Krishnan</surname>
          </string-name>
          , Jiannan Wang,
          <string-name>
            <given-names>Eugene</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ken</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>ActiveClean: Interactive Data Cleaning For Statistical Modeling</article-title>
          .
          <source>PVLDB 9</source>
          ,
          <issue>12</issue>
          (
          <year>2016</year>
          ),
          <fpage>948</fpage>
          -
          <lpage>959</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M. Arthur</given-names>
            <surname>Munson</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A Study on the Importance of and Time Spent on Different Modeling Steps</article-title>
          .
          <source>SIGKDD Explor. Newsl</source>
          .
          <volume>13</volume>
          ,
          <issue>2</issue>
          (
          <year>2012</year>
          ),
          <fpage>65</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Randal S.</given-names>
            <surname>Olson</surname>
          </string-name>
          , Nathan Bartley,
          <string-name>
            <given-names>Ryan J.</given-names>
            <surname>Urbanowicz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jason H.</given-names>
            <surname>Moore</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science</article-title>
          . In
          <source>GECCO '16</source>
          .
          <fpage>485</fpage>
          -
          <lpage>492</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Quemy</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Two-stage optimization for machine learning workflow</article-title>
          .
          <source>Information Systems</source>
          <volume>92</volume>
          (
          <year>2020</year>
          ),
          <fpage>101483</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Zeyuan</given-names>
            <surname>Shang</surname>
          </string-name>
          , Emanuel Zgraggen, Benedetto Buratti, Ferdinand Kossmann, Philipp Eichmann, Yeounoh Chung, Carsten Binnig, Eli Upfal, and
          <string-name>
            <given-names>Tim</given-names>
            <surname>Kraska</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Democratizing Data Science through Interactive Curation of ML Pipelines</article-title>
          . In
          <source>SIGMOD '19</source>
          .
          <fpage>1171</fpage>
          -
          <lpage>1188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Chris</given-names>
            <surname>Thornton</surname>
          </string-name>
          , Frank Hutter,
          <string-name>
            <given-names>Holger H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          , et al.
          <year>2013</year>
          .
          <article-title>Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms</article-title>
          . In
          <source>KDD</source>
          .
          <fpage>847</fpage>
          -
          <lpage>855</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Joaquin</given-names>
            <surname>Vanschoren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jan N.</given-names>
            <surname>van Rijn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bernd</given-names>
            <surname>Bischl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Luis</given-names>
            <surname>Torgo</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>OpenML: Networked Science in Machine Learning</article-title>
          .
          <source>SIGKDD Explorations</source>
          <volume>15</volume>
          ,
          <issue>2</issue>
          (
          <year>2013</year>
          ),
          <fpage>49</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>