Data Pipeline Selection and Optimization

Alexandre Quemy
IBM, Cracow, Poland
Faculty of Computing, Poznan University of Technology, Poznan, Poland
aquemy@pl.ibm.com

ABSTRACT

Data pipelines are known to influence machine learning performance. In this paper, we formulate the data pipeline hyperparameter optimization problem as a standard optimization problem that can be solved by a (meta)optimizer. We apply Sequential Model-Based Optimization techniques to demonstrate how they can automatically select and tune preprocessing operators to improve the baseline score within a restricted budget. For NLP preprocessing operators, we found that some configurations are optimal for several different algorithms, which suggests that algorithm-independent optimal parameter configurations exist for some datasets.

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.

1 INTRODUCTION

It is now well accepted that in machine learning, data are as important as algorithms. Algorithms have received a lot of attention regarding hyperparameter tuning, that is, the art of adjusting the parameters that do not depend on the instance data. By contrast, dataset generation and preprocessing have received little if any attention in hyperparameter tuning. For instance, [6] notices that algorithm hyperparameter tuning is performed in 16 out of 19 selected publications, while only 2 publications study the impact of data preprocessing. This can probably be explained by the fact that the research community mainly works with ready-to-consume datasets, which de facto hides the problem. In practice, however, raw data are rarely ready to be consumed and must be transformed by a succession of operations usually referred to as a data pipeline.

There are plenty of reasons why a data source cannot be used directly. If there are too many descriptive variables, feature selection or dimensionality reduction algorithms must be applied. If the data are too large, subsampling techniques can be used. For imbalanced datasets, oversampling or undersampling may help. One of the most common reasons to modify raw data is missing or incorrect values; the usual approaches are discarding the rows with missing or incorrect data, or imputation, i.e. replacing missing values with estimates based on the available data. Curating datasets from outliers using statistical techniques such as winsorization is also common. Finally, some learning models have intrinsic domain restrictions (e.g. Random Forest cannot directly work on categorical variables), which is handled by encoding variables into a suitable representation (e.g. numerical variables for Random Forest).

All those operations introduce bias, and whether or not they belong in a data pipeline may be subject to discussion. The data pipeline depends both on the data source and on the algorithm, so there is no universal pipeline that works for every data source and every algorithm. In practice, the data pipeline is usually defined by trial and error, relying on the experience of data scientists and on expert knowledge about the data.

In this paper, we propose to apply state-of-the-art hyperoptimization techniques to select and configure data pipelines. The main contributions can be summarized as follows:
• Showing the impact of the data pipeline configuration on the classification accuracy (the approach remains valid for any problem that consists in maximizing a score; in fact, it is enough to have a quality measure on the processed data).
• Defining the Data Pipeline Selection and Optimization (DPSO) problem.
• Showing that addressing the DPSO using Sequential Model-Based Optimization (SMBO) leads to a significant increase in classification performance, even with a restricted CPU or time budget.
• Defining a measure to quantify how specific to, or independent from, the algorithm an optimal configuration is, and showing that it returns the expected results.

In Section 2, we present the related work on data pipeline optimization and hyperparameter tuning. After introducing the problem in Section 3, we perform two sets of experiments: Section 4 demonstrates the capacity of SMBO to solve the problem, while Section 5 focuses on the link between optimal configurations and algorithms. We conclude in Section 6 by discussing the limitations of this preliminary work and outlining future work.
2 RELATED WORK

2.1 Data processing impact

The impact of data preprocessing has been evaluated for multiple algorithms and operators. In [6], the authors showed that the accuracy obtained by Neural Networks, SVM and Decision Trees is significantly impacted by data scaling, sampling, and continuous and categorical coding. A correlation between under- and oversampling is also demonstrated.

In [13], three specific data processing operators are tested for neural networks. Although the authors do not report results without any data processing, the results show a large accuracy variability between the alternatives, thus implying an impact of data processing.

For a more comprehensive view of the impact of data processing, we refer the reader to [7].

2.2 Optimizing data pipeline

AmazonML uses a sort of collaborative filtering to recommend a data pipeline based on data (meta)attributes and a meta-database about efficient pipelines. eIDA [11] solves a planning problem on top of an exhaustive grid, which is unsuitable for practical problems with a large configuration domain.

In [14], guidelines are used to verify the quality of preprocessed data in continuous machine learning, i.e. machine learning models that are in production and continuously receive new training data. The control is usually semi-automatic and supported by tools such as SeeDB [16], which automatically generates useful visualizations of data relations, or QUDE [18], which controls false discoveries. The drawback of those methods is the lack of automation.

Recently, a method using meta-features to estimate the impact of preprocessing operators on model accuracy has been proposed [3]. Meta-features can be general (e.g. number of classes or attributes) or statistical (e.g. entropy, noise-to-signal ratio). The approach constructs a latent space in which any dataset can be represented. A meta-learner is trained over several different datasets obtained from different raw data and data pipelines. The meta-model is thus able to predict the influence of data pipeline operators on new datasets without training the model and evaluating it using e.g. cross-validation.

In [12], the authors use a genetic algorithm to select a representative sample from the data. The objective is to find representative elements that decrease the learning time and increase the model accuracy. The fitness function used to evaluate a sample is the model accuracy, so the approach is iterative. This work can be seen as a special case of what is done in this paper, the sample selector being one particular operator to be optimized in the data pipeline.

2.3 Hyperparameter tuning and AutoML

The most basic technique for hyperparameter tuning is grid search, or factorial design, which consists in exhaustively testing parameter configurations on a grid. Randomized search may increase the probability of finding a good configuration, but in most cases the grid approach is computationally intractable.

Modern parameter tuning techniques fall into two categories. The first one consists of model-free techniques such as racing algorithms, e.g. F-RACE [5], or iterated local search algorithms, e.g. ParamILS [10]. The second one can be grouped under a general framework called Sequential Model-Based Optimization (SMBO), which iteratively fits models to determine promising but unseen regions of the configuration space [2, 9]. Given a new configuration p_{n+1}, the model aims at predicting the performance o_{n+1} of the target algorithm knowing the history {(p_1, o_1), ..., (p_n, o_n)}. Within this group of techniques, Bayesian approaches such as Gaussian processes model and estimate P(o|p). Another popular approach is the Tree-structured Parzen Estimator (TPE), which models not only P(o|p) but also P(p) to provide better recommendations.
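To make the SMBO framework concrete, the following minimal sketch shows the loop described above: a surrogate model is fitted on the history of evaluated configurations and used to pick the next candidate. This is only an illustration, not the implementation used in this paper; the random-forest surrogate and the greedy acquisition over random candidates are simplifying assumptions (actual SMBO implementations typically optimize an acquisition function such as expected improvement).

```python
# Minimal SMBO loop (illustrative sketch, not the paper's implementation).
# `objective` returns the score of a configuration (higher is better),
# `sample_config` draws a random configuration, `encode` maps it to numbers.
import random
from sklearn.ensemble import RandomForestRegressor

def smbo(objective, sample_config, encode, budget=100, n_init=10, n_candidates=500):
    history = [(c, objective(c)) for c in (sample_config() for _ in range(n_init))]
    for _ in range(budget - n_init):
        X = [encode(c) for c, _ in history]
        y = [o for _, o in history]
        surrogate = RandomForestRegressor(n_estimators=50).fit(X, y)  # model of P(o|p)
        candidates = [sample_config() for _ in range(n_candidates)]
        scores = surrogate.predict([encode(c) for c in candidates])
        best_candidate = candidates[int(scores.argmax())]  # greedy acquisition, for brevity
        history.append((best_candidate, objective(best_candidate)))
    return max(history, key=lambda t: t[1])  # best configuration and its observed score
```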
AutoML aims at automating the whole design of a machine learning experiment. Current AutoML approaches focus on solving the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem introduced by Auto-WEKA [15]. This problem is rather high-level as it considers the data pipeline selection and its configuration as part of the algorithm selection phase and of the general hyperparameter configuration. For instance, Auto-Sklearn defines pipelines as one feature preprocessing operator and up to three data preprocessing methods [8].

The most popular AutoML frameworks, such as Auto-WEKA [15], Auto-sklearn [8] or H2O (https://www.h2o.ai/), use Bayesian optimization to solve CASH. They usually add components that we do not consider in this study. For instance, Auto-Sklearn reuses the predictions made at every generation in an ensemble to improve results and prevent overfitting. It also uses meta-learning [1, 17] to solve the cold-start problem: a model has been pre-trained offline over 140 datasets so as to recommend good initial solutions to CASH on new datasets.

In this paper, we propose to deal specifically with selecting and optimizing the data pipeline, to demonstrate the influence of the data pipeline on the final results without configuring the algorithm. We hope this opens the road to more efficient techniques to solve CASH, notably by allowing transfer learning at the pipeline configuration step, in addition to meta-learning across datasets.

3 DPSO PROBLEM

We formulate the Data Pipeline Selection and Optimization (DPSO) problem. Let D be a dataset split into D_train and D_test. A data pipeline is a sequence of operators, each with its own configuration, transforming a data source into consumable data for a given algorithm A. Assume a data pipeline configuration space P, and denote by L(P, A, D_test) the loss achieved by algorithm A through cross-validation on D_test transformed by P. The DPSO problem can be written formally as:

Definition 3.1 (Data Pipeline Selection and Optimization (DPSO)).

    P* ∈ argmin_{P ∈ P} L(P, A, D_test)        (DPSO)

In practice, the training set D_train is used to find P, and the test set D_test to evaluate the overall performance. DPSO can be seen as a subpart of CASH: CASH agglomerates the pipeline and its configuration into the algorithm selection and hyperparameter optimization. To obtain a solution to CASH, a second optimization step can be performed to find the best hyperparameters for the algorithm A.
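As a concrete reading of the loss L(P, A, D), the following hedged sketch estimates it as one minus the mean cross-validated accuracy of a fixed algorithm A on data transformed by a candidate pipeline P. It assumes scikit-learn estimators and the imbalanced-learn Pipeline (so that resampling operators can be chained with transformers); it is not necessarily the authors' implementation.

```python
# Hedged sketch of L(P, A, D): 1 - mean cross-validated accuracy of a fixed
# algorithm on data transformed by the candidate pipeline.
from imblearn.pipeline import Pipeline  # supports samplers such as SMOTE inside the pipeline
from sklearn.model_selection import cross_val_score

def dpso_loss(operators, algorithm, X, y, cv=10):
    """operators: list of (name, operator) pairs describing the candidate pipeline P;
    algorithm: the fixed estimator A; returns the estimated loss L(P, A, D)."""
    pipeline = Pipeline(steps=list(operators) + [("clf", algorithm)])
    return 1.0 - cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy").mean()
```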
4 EXPERIMENTS WITH SMBO

In this section, we apply SMBO to solve the Data Pipeline Selection and Optimization problem.

4.1 Protocol

We created a pipeline prototype made of 3 steps: "rebalance" (handling imbalanced datasets), "normalizer" (scaling features) and "features" (feature selection or dimensionality reduction). For each step, we selected a few concrete operators, each with a specific configuration space. For instance, for "features", the choice is between a PCA keeping 1 to 4 axes, selecting the k ∈ {1, ..., 4} best features according to an ANOVA, or a combination of both. The rebalance step consists in downsampling with the Near Miss or Condensed Nearest Neighbour method, or oversampling with SMOTE. The normalization step gives the choice between a standard scaler, a scaler excluding some points based on a quantile interval, a min-max scaler and a power transformation. Each step can also be skipped, and we call baseline pipeline the pipeline skipping all operations. There is a total of 4750 possible pipeline configurations. For an exhaustive description of the configuration space, we refer the reader to the Supplementary Material (https://aquemy.github.io/DOLAP_2019_supplementary_material/).

We performed the experiment on 3 datasets: Wine, Iris and Breast. The choice of small datasets is justified by the need to know the optimal score in the search space to effectively evaluate the SMBO results; those results justify the SMBO approach since, in practice, only a fraction of the search space needs to be explored to drastically improve the score. We used 4 classification algorithms: SVM, Random Forest, Neural Network and Decision Tree. A 10-fold cross-validation is used to assess the pipeline performance.

We want to quantify the achievable improvement compared to the baseline, measure how likely it is to improve the baseline w.r.t. the configuration space, determine whether SMBO is capable of improving the baseline score, and measure how much and how fast SMBO is likely to improve it with a restricted budget. We performed an exhaustive search and a search using SMBO with a budget of 100 configurations to explore (about 2% of the configuration space).
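To give an idea of how such a protocol can be encoded, the sketch below builds a reduced version of the three-step search space and lets TPE explore it with a budget of 100 evaluations. The operator grids are simplified with respect to the paper, hyperopt is assumed as the SMBO implementation, and dpso_loss, clf, X and y refer to the helper and the fixed algorithm/dataset of the previous sketch; none of this is the authors' code.

```python
# Hedged sketch of the Section 4.1 search space explored with TPE (hyperopt).
# The operator grids are simplified; `dpso_loss`, `clf`, `X`, `y` are assumptions.
from hyperopt import fmin, tpe, hp, Trials
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss, CondensedNearestNeighbour
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, PowerTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

space = {
    "rebalance": hp.choice("rebalance", [None, SMOTE(k_neighbors=5), SMOTE(k_neighbors=7),
                                         NearMiss(), CondensedNearestNeighbour()]),
    "normalizer": hp.choice("normalizer", [None, StandardScaler(), RobustScaler(),
                                           MinMaxScaler(), PowerTransformer()]),
    "features": hp.choice("features", [None, PCA(n_components=2),
                                       SelectKBest(f_classif, k=2)]),
}

def objective(config):
    steps = [(name, op) for name, op in config.items() if op is not None]
    return dpso_loss(steps, clf, X, y)  # lower is better: TPE minimizes the loss

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=Trials())  # budget of 100 out of 4750 configurations
```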
4.2 Results

Figure 1 provides the results obtained with Random Forest on Breast. A summary of the results is provided in Table 1. All results being qualitatively similar, the remaining plots are provided in the Supplementary Material. The top part of Figure 1 shows that the baseline score is 0.9384 and the best score is 0.9619, i.e. an error reduction of 38% is achievable in the search space. Most configurations deteriorate the baseline score. However, the distribution of the configurations explored by SMBO is skewed towards better configurations compared to the exhaustive search, which indicates that SMBO has a better probability of finding a good configuration than random search. The bottom part shows that SMBO starts to improve the baseline score after only 4 iterations and reaches its best configuration after 19 iterations. There is only one optimal configuration in the search space, and it is not found. If we normalize the accuracy using the minimum and maximum over the configuration space, SMBO found a configuration that represents a score of 97.80% while exploring only 0.4% of the configuration space.

[Figure 1: Density of configurations, with the vertical line indicating the baseline score (top). Accuracy obtained by SMBO over the 100 configurations explored (bottom).]

Table 1: Pipeline optimization results.

                  Baseline  Exhaustive  SMBO    SMBO (norm.)  Imp. Inter.
Iris
  SVM             0.9667    0.9889      0.9778  0.9831        [11, 11]
  Random Forest   0.9222    0.9778      0.9667  0.9828        [8, 27]
  Neural Net      0.9667    0.9889      0.9778  0.9831        [17, 17]
  Decision Tree   0.9222    0.9889      0.9889  1.0000        [1, 83]
Breast
  SVM             0.9501    0.9765      0.9765  1.0000        [12, 20]
  Random Forest   0.9384    0.9619      0.9560  0.9780        [4, 19]
  Neural Net      0.9326    0.9765      0.9707  0.9903        [1, 7]
  Decision Tree   0.9296    0.9619      0.9589  0.9900        [0, 67]
Wine
  SVM             0.9151    1.0000      0.9906  0.9811        [3, 13]
  Random Forest   0.9623    0.9906      0.9811  0.9818        [5, 20]
  Neural Net      0.9057    0.9906      0.9906  1.0000        [1, 25]
  Decision Tree   0.9057    0.9811      0.9811  1.0000        [5, 35]

The column SMBO (norm.) is the SMBO score normalized within the search space. The last column is the interval whose left bound is the number of configurations required for SMBO to improve the baseline score, and whose right bound is the number of configurations explored before reaching the best score.

Table 1 shows that similar results are obtained for all methods on all datasets. SMBO always found a better configuration than the baseline, in at most 17 iterations. On average, the best score is achieved around 20 iterations (excluding Decision Tree on Iris and Breast). Decision Tree was able to reach the optimal configuration on Iris (resp. Wine) after 1 (resp. 5) iterations. In general, the score in the normalized score space belongs to [0.9780, 1.000]. To summarize, on average, with 20 iterations (0.42% of the search space) SMBO is able to decrease the error by 58.16% compared to the baseline score and finds configurations that score 98.92% in the normalized score space.
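For reference, the normalization and error-reduction figures quoted above can be reproduced with the following small computation; the formulas are our reading of the paper's description (normalization by the worst and best accuracies in the search space), not an excerpt from the authors' code.

```python
# Normalized score and error reduction as described in Section 4.2 (assumed formulas).
def normalized_score(score, worst, best):
    return (score - worst) / (best - worst)       # 1.0 means the optimum of the search space

def error_reduction(baseline, score):
    return (score - baseline) / (1.0 - baseline)  # fraction of the baseline error removed

# Breast with Random Forest: baseline 0.9384, exhaustive best 0.9619
print(round(error_reduction(0.9384, 0.9619), 3))  # 0.381, i.e. the ~38% quoted above
```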
Figure 2 shows the optimal pipeline in the search space and the four pipelines giving the best score under SMBO. All four pipelines have the correct operator for the rebalance and features steps. One uses the RobustScaler, but with an incorrect interval and without centering the data. It is hard to tell which configuration is the closest to the optimal one because there is no obvious metric on the configuration space; qualitatively, however, the best configurations are relatively similar to the optimal one. As similar results are observed for all methods and datasets, they are provided in the Supplementary Material.

[Figure 2: Optimal pipeline (top) and the best pipelines found by SMBO, using Random Forest on Breast. Optimal pipeline: SMOTE (k=5), RobustScaler [5, 95] with centering and scaling, no feature operator. Best SMBO pipelines (all with SMOTE (k=7) and no feature operator): RobustScaler [10, 90] with scaling only; StandardScaler with centering and scaling; StandardScaler; no normalizer.]

5 ALGORITHM-SPECIFIC CONFIGURATION

We would like to quantify how much an optimal configuration is specific to an algorithm or is universal, i.e. works well regardless of the algorithm. For this, the optimization process can be performed on a collection of methods A = {A_i}_{i=1}^N. The result is a sample of optimal configurations p* = {p_i*}_{i=1}^M, where M ≥ N since an algorithm might have several distinct optimal configurations. After normalizing the configuration space to bring each axis to [0, 1], the link between the processed data and the methods can be studied through the Normalized Mean Absolute Deviation (NMAD). The idea behind this metric is to measure how far the optimal points are from a reference optimal point. If the optimal configuration does not depend on the algorithm, the expected distance between the optimal configurations is 0. Conversely, if a point is specific to an algorithm, the other points will on average be far from it.

Working in the normalized configuration space has two advantages. First, it forces all parameters to have the same impact. Second, it allows comparisons from one dataset to another, since the NMAD belongs to [0, 1] for any number of algorithms or dimensions of the configuration space.

The Normalized Mean Absolute Deviation is the 1-norm of the Mean Absolute Deviation (as we work on a discrete space we use the 1-norm, but the Euclidean norm is probably a better choice in continuous spaces), divided by the number of dimensions K of the configuration space.

Definition 5.1 (Normalized Mean Absolute Deviation (NMAD)).

    NMAD(p*, r) = (1/K) · ‖ (1/N) Σ_{i=1}^{N} |p_i* − r| ‖_1

To measure how much each optimal point p_i* is specific to an algorithm A_j, we use it as a reference point and calculate the NMAD over a sample composed of all the optimal points. However, since an algorithm might have several optimal points, to be fair we use as representant of each algorithm its optimal point that is closest to the reference point.
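A direct computation of Definition 5.1 with the representant rule described above could look as follows. This is a hedged numpy sketch, assuming the configurations have already been rescaled to [0, 1] on each of the K axes.

```python
# Hedged numpy sketch of the NMAD (Definition 5.1) with the per-algorithm
# representant rule: for an algorithm with several optima, the point closest
# to the reference is used.
import numpy as np

def nmad(reference, optima_per_algorithm):
    """reference: array of shape (K,); optima_per_algorithm: one (M_j, K) array
    of optimal configurations per algorithm, all normalized to [0, 1]."""
    reference = np.asarray(reference, dtype=float)
    K = reference.shape[0]
    representants = []
    for points in optima_per_algorithm:
        points = np.asarray(points, dtype=float)
        distances = np.abs(points - reference).sum(axis=1)  # L1 distance to the reference
        representants.append(points[distances.argmin()])
    mad = np.mean(np.abs(np.array(representants) - reference), axis=0)  # per-axis MAD
    return mad.sum() / K  # 1-norm of the MAD, normalized by the number of dimensions K
```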
5.1 Protocol

As the configuration space described in Section 4.1 is not a metric space, we cannot directly use the NMAD. To avoid introducing bias with an ad-hoc distance, we perform another experiment with a configuration space that is embedded in ℕ.

We collected 1000 judgment documents provided by the European Court of Human Rights (ECHR) about Article 6. The HUDOC database (https://hudoc.echr.coe.int/) provides the ground truth, corresponding to a violation or no violation. The cases have been collected such that the dataset is balanced, and the conclusion part of each document is removed. To confirm the results, we used a second dataset composed of 855 documents from the categories atheism and religion of 20newsgroups.

Each document is preprocessed using a data pipeline consisting of tokenization and stopword removal, followed by n-gram generation. The processed documents are combined, and the k top tokens across the corpus are kept, forming the dictionary. Each case is then turned into a Bag-of-Words using the dictionary.

There are two hyperparameters in the preprocessing phase: n, the size of the n-grams, and k, the number of tokens in the dictionary. We defined the parameter configuration domain as follows:
• n ∈ {1, 2, 3, 4, 5},
• k ∈ {10, 100, 1000, 5000, 10000, 50000, 100000}.

We used the same four algorithms as in Section 4. As we are interested in the optimal configurations, we performed an exhaustive search.
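The (n, k) preprocessing described above can be realized, for example, with scikit-learn's CountVectorizer. This is an illustrative sketch only; in particular, interpreting n as the upper bound of the n-gram range (1-grams up to n-grams) is our assumption, and `corpus` is a placeholder for the document collection.

```python
# Hedged sketch of the Section 5.1 text pipeline: tokenization, stop-word removal,
# n-gram generation, and a dictionary restricted to the k most frequent tokens.
from sklearn.feature_extraction.text import CountVectorizer

def bag_of_words(documents, n, k):
    vectorizer = CountVectorizer(stop_words="english",
                                 ngram_range=(1, n),  # assumption: 1-grams up to n-grams
                                 max_features=k)      # keep the k most frequent tokens
    return vectorizer.fit_transform(documents), vectorizer

# e.g. one point of the grid: X, vec = bag_of_words(corpus, n=5, k=50000)
```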
5.2 Results

For both datasets, Figure 3 shows that the classifiers return poor results for configurations with a dictionary of only 10 or 100 tokens. Both parameters influence the results, and values that are too high deteriorate them.

[Figure 3: Heatmap depicting the accuracy depending on the pipeline parameter configuration. Top: ECHR, bottom: Newsgroup.]

Table 2: Best configurations depending on the method.

Method            (n, k)                                   accuracy
ECHR
  Decision Tree   (5, 50000)                               0.900
  Neural Network  (5, 50000)                               0.960
  Random Forest   (3, 10000), (4, 10000), (5, 50000)       0.910
  Linear SVM      (3, 50000), (4, 50000), (5, 50000)       0.921
Newsgroup
  Decision Tree   (4, 5000), (4, 100000)                   0.889
  Neural Network  (5, 50000)                               0.953
  Random Forest   (3, 10000)                               0.931
  Linear SVM      (2, 100000)                              0.946

Table 2 summarizes the best configurations per method. For the first dataset, 3 points give the optimal value for Random Forest and Linear SVM; in practice, however, the lowest parameter values are preferable because they imply lower preprocessing and training times. It is interesting to notice that (5, 50000) returns the best accuracy for every model: this point acts as a sort of universal configuration for the dataset, taking the best out of the data source rather than being well suited to a specific algorithm. On the contrary, on Newsgroup all optimal points are different. Our hypothesis is that the more structured a corpus is, the less algorithm-specific the optimal configurations are, because the preprocessing steps become more important for extracting the markers used by the algorithms to reach good performance. As the ECHR dataset describes standardized justice documents, it is far more structured than Newsgroup. This would also explain why generating n-grams for n = 5 still improves the results on ECHR while degrading them on Newsgroup.

This hypothesis is partially confirmed by Table 3, where it is clear that the n-gram operator has a strong impact on the accuracy on the ECHR dataset (up to 9.8% accuracy improvement) while having almost none on the Newsgroup dataset (with the exception of Random Forest).

Table 3: Impact of parameter n on the accuracy, measured as the relative difference between the best results obtained using only (1, k) and the best results obtained for any configuration (n, k).

Method            p = (1, k)   p = (n, k)   Δ acc
ECHR
  Decision Tree   0.850        0.900        5.9%
  Neural Network  0.874        0.960        9.8%
  Random Forest   0.863        0.910        5.4%
  Linear SVM      0.892        0.921        6.6%
Newsgroup
  Decision Tree   0.885        0.889        0.5%
  Neural Network  0.949        0.953        0.4%
  Random Forest   0.883        0.931        5.4%
  Linear SVM      0.945        0.946        0.1%

Table 4 contains the NMAD value for each distinct optimal configuration reported in Table 2; the Supplementary Material provides the calculation step by step. As can be expected, the point (5, 50000) has a NMAD of 0 since the point is present for every algorithm: (5, 50000) is a universal pipeline configuration for this data pipeline and dataset. The point (4, 50000) appears only once, but it is very close to (5, 50000) (which appears in the results of the 3 other algorithms), so its NMAD is low; it can be interpreted as belonging to the same area of optimal values. On the opposite, (3, 10000) and (4, 10000) have a high NMAD w.r.t. the other points, indicating that they are isolated and may be algorithm specific. Their NMAD values remain rather low because, despite being isolated, these points differ significantly from the others only on the second component; in comparison, if (1, 10) were an optimal point for Random Forest, its NMAD would be 0.5. On the contrary, for Newsgroup, the NMAD values are rather high and similar for all points, indicating that the points are at a similar distance from each other and really algorithm specific.

Table 4: Normalized Mean Absolute Deviation for each optimal configuration found.

ECHR                      Newsgroup
Point         NMAD        Point          NMAD
(5, 50000)    0           (4, 5000)      0.306
(3, 10000)    0.275       (4, 100000)    0.300
(4, 10000)    0.213       (5, 50000)     0.356
(3, 50000)    0.175       (3, 10000)     0.294
(4, 50000)    0.094       (2, 100000)    0.362

To summarize, the NMAD metric is coherent with the conclusions drawn from the heatmaps and Table 2, and suggests that there exist two types of optimal configurations: universal pipeline configurations that work well on a large range of algorithms for a given dataset, and algorithm-specific configurations. Thus, we are confident that the NMAD can be used in larger configuration spaces, where heatmaps and exhaustive results are not available for graphical interpretation, and that it can help to reuse configurations.

6 CONCLUSION

In this paper, we successfully applied Sequential Model-Based Optimization techniques to data pipeline selection and configuration. In addition, we provided a metric to study whether an optimal configuration is algorithm specific or rather universal.

The main practical drawback of the iterative approach presented in this paper is the cost involved in processing the data and training the model for each selected configuration. To mitigate this problem, we see a few possibilities to explore:
• decreasing the amount of data to preprocess using a sampling technique, as described in [12],
• using in priority the data pipelines suggested by a meta-learning algorithm such as the ones described in [3, 4],
• caching the intermediate results of the data pipeline to reuse, when possible, the outcome of some transformations (e.g. there is no need to regenerate the 2-grams for n ≥ 3 if a previous configuration with n = 2 has already been explored); a possible realization is sketched below.
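One hedged way to realize the caching idea of the last bullet with scikit-learn is to give the pipeline a joblib Memory object, so that transformers fitted with identical parameters and data are read from an on-disk cache instead of being recomputed across pipeline evaluations; this is an illustration under that assumption, not the setup used in this paper.

```python
# Hedged illustration of caching intermediate pipeline results with scikit-learn.
from joblib import Memory
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

memory = Memory(location="pipeline_cache", verbose=0)  # cache directory (illustrative name)
cached_pipeline = Pipeline(steps=[("normalizer", StandardScaler()),
                                  ("features", PCA(n_components=2)),
                                  ("clf", SVC())],
                           memory=memory)  # fitted transformers are cached and reused
```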
Another aspect to be addressed is the compromise between time and performance. Some parameters increase the preprocessing time but not the model training time (e.g. n-gram computation), while others may not affect the preprocessing time but significantly increase the model training time (e.g. the number of tokens k). A fine-grained time analysis would be required, and an intelligent pruning system could be a solution to avoid costly iterations.

Future work should focus on an online version such that the pipeline is tuned in a streaming way. Also, the NMAD indicator works only in Euclidean spaces, which is not the case for the first experiment; further work should therefore focus on extending the NMAD to non-vector spaces.

REFERENCES

[1] Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. 2013. Collaborative hyperparameter tuning. In Int. Conf. Mach. Learn. 199–207.
[2] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. 2011. Algorithms for Hyper-parameter Optimization. In Proc. Int. Conf. Neural Inf. Process. Syst. 2546–2554.
[3] B. Bilalli, A. Abelló, and T. Aluja-Banet. 2017. On the Predictive Power of Meta-features in OpenML. Int. J. Appl. Math. Comput. Sci. 27, 4 (2017), 697–712.
[4] B. Bilalli, A. Abelló, T. Aluja-Banet, and R. Wrembel. 2018. Intelligent assistance for data pre-processing. Computer Standards & Interfaces 57 (2018), 101–109.
[5] M. Birattari, Z. Yuan, P. Balaprakash, and T. Stützle. 2010. F-Race and Iterated F-Race: An Overview. Springer Berlin Heidelberg, Berlin, Heidelberg, 311–336.
[6] S. F. Crone, S. Lessmann, and R. Stahlbock. 2006. The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing. Eur. J. Oper. Res. 173, 3 (2006), 781–800.
[7] T. Dasu and T. Johnson. 2003. Exploratory data mining and data cleaning. Vol. 479. John Wiley & Sons.
[8] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and Robust Automated Machine Learning. In Adv. Neural Inf. Process. Syst., C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). 2962–2970.
[9] F. Hutter, H. H. Hoos, and K. Leyton-Brown. 2011. Sequential Model-based Optimization for General Algorithm Configuration. In Proc. Int. Conf. Learn. Intel. Optim. Springer-Verlag, Berlin, Heidelberg, 507–523.
[10] F. Hutter, H. H. Hoos, K. Leyton-Brown, and T. Stützle. 2009. ParamILS: An Automatic Algorithm Configuration Framework. J. Artif. Intel. Res. 36 (2009), 267–306.
[11] J. Kietz, F. Serban, S. Fischer, and A. Bernstein. 2014. "Semantics Inside!" But Let's Not Tell the Data Miners: Intelligent Support for Data Mining. In The Semantic Web: Trends and Challenges. Springer International Publishing, 706–720.
[12] J. Nalepa, M. Myller, S. Piechaczek, K. Hrynczenko, and M. Kawulok. 2018. Genetic Selection of Training Sets for (Not Only) Artificial Neural Networks. In Proc. Int. Conf. Beyond Databases, Architectures Struct. 194–206.
[13] N. M. Nawi, W. H. Atomi, and M. Z. Rehman. 2013. The Effect of Data Preprocessing on Optimized Training of Artificial Neural Networks. Procedia Technology 11 (2013), 32–39. Int. Conf. Elect. Eng. Info.
[14] N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich. 2017. Data Management Challenges in Production Machine Learning. In Proc. ACM Int. Conf. Manage. Data. ACM, 1723–1726.
[15] Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2013. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Int. Conf. Knowl. Disc. Data Min. ACM, 847–855.
[16] M. Vartak, S. Rahman, S. Madden, A. Parameswaran, and N. Polyzotis. 2015. SeeDB: efficient data-driven visualization recommendations to support visual analytics. Proc. VLDB Endowment 8, 13 (2015), 2182–2193.
[17] Dani Yogatama and Gideon Mann. 2014. Efficient transfer learning method for automatic hyperparameter tuning. In Int. Conf. Artif. Intel. Stat. 1077–1085.
[18] Z. Zhao, L. De Stefani, E. Zgraggen, C. Binnig, E. Upfal, and T. Kraska. 2017. Controlling False Discoveries During Interactive Data Exploration. In Proc. ACM Int. Conf. Manag. Data. ACM, 527–540.