Data Pipeline Selection and Optimization

Alexandre Quemy
IBM, Cracow, Poland
Faculty of Computing, Poznan University of Technology, Poznan, Poland
aquemy@pl.ibm.com

ABSTRACT

Data pipelines are known to influence machine learning performance. In this paper, we formulate the data pipeline hyperparameter optimization problem as a standard optimization problem that can be solved by a (meta)optimizer. We apply Sequential Model-Based Optimization techniques to demonstrate how they can automatically select and tune preprocessing operators to improve the baseline score within a restricted budget. For NLP preprocessing operators, we found that some configurations are optimal for several different algorithms, which suggests that algorithm-independent optimal parameter configurations exist for some datasets.

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.

1 INTRODUCTION

It is now well accepted that in machine learning, data are as important as algorithms. Algorithms have received a lot of attention regarding hyperparameter tuning, that is, the art of adjusting the parameters that do not depend on the instance data. By contrast, dataset generation and preprocessing have received little if any attention in hyperparameter tuning. For instance, [6] notices that algorithm hyperparameter tuning is performed in 16 out of 19 selected publications, while only 2 publications study the impact of data preprocessing. This can probably be explained by the fact that the research community mainly works with ready-to-consume datasets, which de facto hides the problem. In practice, however, raw data are rarely ready to be consumed and must be transformed by a succession of operations usually referred to as a data pipeline.

There are plenty of reasons why a data source cannot be used directly. If there are too many descriptive variables, feature selection or dimensionality reduction algorithms must be applied. If the data are too large, subsampling techniques can be used. For imbalanced datasets, oversampling or undersampling may help. One of the most common reasons to modify raw data is missing or incorrect values; the usual approaches are discarding the rows with missing or incorrect data, or imputation, i.e. replacing missing values with estimates based on the available data. Curating datasets from outliers using statistical techniques such as winsorization is also common. Finally, some learning models have intrinsic domain restrictions (e.g. Random Forest cannot directly work on categorical variables), which is handled by encoding variables into a suitable representation (e.g. numerical variables for Random Forest).

All those operations introduce bias, and whether or not they belong in a data pipeline may be subject to discussion. The data pipeline depends both on the data source and on the algorithm, so there is no universal pipeline that works for every data source and every algorithm. In practice, the data pipeline is usually defined by trial and error, relying on the experience of data scientists and on expert knowledge about the data.

In this paper, we propose to apply state-of-the-art hyperoptimization techniques to select and configure data pipelines. The main contributions can be summarized as follows:
• Showing the impact of the data pipeline configuration on the classification accuracy (the approach remains valid for any problem that consists in maximizing a score; in fact, it is enough to have a quality measure on the processed data).
• Defining the Data Pipeline Selection and Optimization (DPSO) problem.
• Showing that addressing the DPSO using Sequential Model-Based Optimization (SMBO) leads to a significant increase in classification performance, even with a restricted CPU or time budget.
• Defining a measure to quantify how specific to, or independent from, the algorithm an optimal configuration is, and showing that it returns the expected results.

In Section 2, we present the related work on data pipeline optimization and hyperparameter tuning. After introducing the problem in Section 3, we perform two sets of experiments: Section 4 demonstrates the capacity of SMBO to solve the problem, while Section 5 focuses on the link between optimal configurations and algorithms. We conclude in Section 6 by discussing the limitations of this preliminary work and outlining future work.
2 RELATED WORK

2.1 Data processing impact

The impact of data preprocessing has been evaluated for multiple algorithms and operators. In [6], the authors showed that the accuracy obtained by Neural Networks, SVM and Decision Trees is significantly impacted by data scaling, sampling, and continuous and categorical coding. A correlation between under- and oversampling is also demonstrated.

In [13], three specific data processing operators are tested for neural networks. Although the authors do not report results without any data processing, the results show a large accuracy variability between the alternatives, thus implying an impact of data processing.

For a more comprehensive view of the impact of data processing, we refer the reader to [7].

2.2 Optimizing data pipeline

AmazonML uses a sort of collaborative filtering to recommend a data pipeline based on data (meta)attributes and a meta-database about efficient pipelines. eIDA [11] solves a planning problem on top of an exhaustive grid, which is unsuitable for practical problems with a large configuration domain.

In [14], guidelines are used to verify the quality of preprocessed data in continuous machine learning, i.e. machine learning models that are in production and continuously receive new training data. The control is usually semi-automatic and supported by tools such as SeeDB [16], which automatically generates useful visualizations of data relations, or QUDE [18], which controls false discoveries. The drawback of those methods is the lack of automation.

Recently, a method using meta-features to estimate the impact of preprocessing operators on model accuracy has been proposed [3]. Meta-features can be general (e.g. number of classes or attributes) or statistical (e.g. entropy, noise-to-signal ratio). The approach constructs a latent space in which any dataset can be represented. A meta-learner is trained over several different datasets obtained from different raw data and data pipelines. The meta-model is thus able to predict the influence of data pipeline operators on new datasets without training the model and evaluating it using e.g. cross-validation.

In [12], the authors use a genetic algorithm to select a representative sample from the data. The objective is to find representative elements that decrease the learning time and increase the model accuracy. The fitness function used to evaluate a sample is the model accuracy, so the approach is iterative. This work can be seen as a special case of what is done in this paper, the sample selector being one particular operator to be optimized in the data pipeline.

2.3 Hyperparameter tuning and AutoML

The most basic technique for hyperparameter tuning is grid search, or factorial design, which consists in exhaustively testing parameter configurations on a grid. Randomized search may increase the probability of finding a good configuration, but in most cases the grid approach is computationally intractable.

Modern parameter tuning techniques fall into two categories. The first one consists of model-free techniques such as racing algorithms, e.g. F-RACE [5], or iterated local search algorithms, e.g. ParamILS [10]. The second one can be grouped under a general framework called Sequential Model-Based Optimization (SMBO), which iteratively fits models to determine promising but unseen regions of the configuration space [2, 9]. Given a new configuration p_{n+1}, the model aims at predicting the performance o_{n+1} of the target algorithm knowing the history {(p_1, o_1), ..., (p_n, o_n)}. Within this group of techniques, Bayesian approaches such as Gaussian processes model and estimate P(o|p). Another popular approach is the Tree-structured Parzen Estimator (TPE), which models not only P(o|p) but also P(p) to provide better recommendations.
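To make the SMBO framework concrete, the following minimal sketch shows the loop described above: a surrogate model is fitted on the history of evaluated configurations and used to pick the next candidate. This is only an illustration, not the implementation used in this paper; the random-forest surrogate and the greedy acquisition over random candidates are simplifying assumptions (actual SMBO implementations typically optimize an acquisition function such as expected improvement).

```python
# Minimal SMBO loop (illustrative sketch, not the paper's implementation).
# `objective` returns the score of a configuration (higher is better),
# `sample_config` draws a random configuration, `encode` maps it to numbers.
import random
from sklearn.ensemble import RandomForestRegressor

def smbo(objective, sample_config, encode, budget=100, n_init=10, n_candidates=500):
    history = [(c, objective(c)) for c in (sample_config() for _ in range(n_init))]
    for _ in range(budget - n_init):
        X = [encode(c) for c, _ in history]
        y = [o for _, o in history]
        surrogate = RandomForestRegressor(n_estimators=50).fit(X, y)  # model of P(o|p)
        candidates = [sample_config() for _ in range(n_candidates)]
        scores = surrogate.predict([encode(c) for c in candidates])
        best_candidate = candidates[int(scores.argmax())]  # greedy acquisition, for brevity
        history.append((best_candidate, objective(best_candidate)))
    return max(history, key=lambda t: t[1])  # best configuration and its observed score
```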
AutoML aims at automating the whole design of a machine learning experiment. Current AutoML approaches focus on solving the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem introduced by Auto-WEKA [15]. This problem is rather high-level as it considers the data pipeline selection and its configuration as part of the algorithm selection phase and of the general hyperparameter configuration. For instance, Auto-Sklearn defines pipelines as one feature preprocessing operator and up to three data preprocessing methods [8].

The most popular AutoML frameworks, such as Auto-WEKA [15], Auto-sklearn [8] or H2O (https://www.h2o.ai/), use Bayesian optimization to solve CASH. They usually add components that we do not consider in this study. For instance, Auto-Sklearn reuses the predictions made at every generation in an ensemble to improve results and prevent overfitting. It also uses meta-learning [1, 17] to solve the cold-start problem: a model has been pre-trained offline over 140 datasets so as to recommend good initial solutions to CASH on new datasets.

In this paper, we propose to deal specifically with selecting and optimizing the data pipeline, to demonstrate the influence of the data pipeline on the final results without configuring the algorithm. We hope this opens the road to more efficient techniques to solve CASH, notably by allowing transfer learning at the pipeline configuration step, in addition to meta-learning across datasets.

3 DPSO PROBLEM

We formulate the Data Pipeline Selection and Optimization (DPSO) problem. Let D be a dataset split into D_train and D_test. A data pipeline is a sequence of operators, each with its own configuration, transforming a data source into consumable data for a given algorithm A. Assume a data pipeline configuration space P, and denote by L(P, A, D_test) the loss achieved by algorithm A through cross-validation on D_test transformed by P. The DPSO problem can be written formally as:

Definition 3.1 (Data Pipeline Selection and Optimization (DPSO)).

    P* ∈ argmin_{P ∈ P} L(P, A, D_test)        (DPSO)

In practice, the training set D_train is used to find P, and the test set D_test to evaluate the overall performance. DPSO can be seen as a subpart of CASH: CASH agglomerates the pipeline and its configuration into the algorithm selection and hyperparameter optimization. To obtain a solution to CASH, a second optimization step can be performed to find the best hyperparameters for the algorithm A.
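As a concrete reading of the loss L(P, A, D), the following hedged sketch estimates it as one minus the mean cross-validated accuracy of a fixed algorithm A on data transformed by a candidate pipeline P. It assumes scikit-learn estimators and the imbalanced-learn Pipeline (so that resampling operators can be chained with transformers); it is not necessarily the authors' implementation.

```python
# Hedged sketch of L(P, A, D): 1 - mean cross-validated accuracy of a fixed
# algorithm on data transformed by the candidate pipeline.
from imblearn.pipeline import Pipeline  # supports samplers such as SMOTE inside the pipeline
from sklearn.model_selection import cross_val_score

def dpso_loss(operators, algorithm, X, y, cv=10):
    """operators: list of (name, operator) pairs describing the candidate pipeline P;
    algorithm: the fixed estimator A; returns the estimated loss L(P, A, D)."""
    pipeline = Pipeline(steps=list(operators) + [("clf", algorithm)])
    return 1.0 - cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy").mean()
```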
4 EXPERIMENTS WITH SMBO

In this section, we apply SMBO to solve the Data Pipeline Selection and Optimization problem.

4.1 Protocol

We created a pipeline prototype made of 3 steps: "rebalance" (handling imbalanced datasets), "normalizer" (scaling features) and "features" (feature selection or dimensionality reduction). For each step, we selected a few concrete operators, each with a specific configuration space. For instance, for "features", the choice is between a PCA keeping 1 to 4 axes, selecting the k ∈ {1, ..., 4} best features according to an ANOVA, or a combination of both. The rebalance step consists in downsampling with the Near Miss or Condensed Nearest Neighbour method, or oversampling with SMOTE. The normalization step gives the choice between a standard scaler, a scaler excluding some points based on a quantile interval, a min-max scaler and a power transformation. Each step can also be skipped, and we call baseline pipeline the pipeline skipping all operations. There is a total of 4750 possible pipeline configurations. For an exhaustive description of the configuration space, we refer the reader to the Supplementary Material (https://aquemy.github.io/DOLAP_2019_supplementary_material/).

We performed the experiment on 3 datasets: Wine, Iris and Breast. The choice of small datasets is justified by the need to know the optimal score in the search space to effectively evaluate the SMBO results; those results justify the SMBO approach since, in practice, only a fraction of the search space needs to be explored to drastically improve the score. We used 4 classification algorithms: SVM, Random Forest, Neural Network and Decision Tree. A 10-fold cross-validation is used to assess the pipeline performance.

We want to quantify the achievable improvement compared to the baseline, measure how likely it is to improve the baseline w.r.t. the configuration space, determine whether SMBO is capable of improving the baseline score, and measure how much and how fast SMBO is likely to improve it with a restricted budget. We performed an exhaustive search and a search using SMBO with a budget of 100 configurations to explore (about 2% of the configuration space).
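To give an idea of how such a protocol can be encoded, the sketch below builds a reduced version of the three-step search space and lets TPE explore it with a budget of 100 evaluations. The operator grids are simplified with respect to the paper, hyperopt is assumed as the SMBO implementation, and dpso_loss, clf, X and y refer to the helper and the fixed algorithm/dataset of the previous sketch; none of this is the authors' code.

```python
# Hedged sketch of the Section 4.1 search space explored with TPE (hyperopt).
# The operator grids are simplified; `dpso_loss`, `clf`, `X`, `y` are assumptions.
from hyperopt import fmin, tpe, hp, Trials
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss, CondensedNearestNeighbour
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, PowerTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

space = {
    "rebalance": hp.choice("rebalance", [None, SMOTE(k_neighbors=5), SMOTE(k_neighbors=7),
                                         NearMiss(), CondensedNearestNeighbour()]),
    "normalizer": hp.choice("normalizer", [None, StandardScaler(), RobustScaler(),
                                           MinMaxScaler(), PowerTransformer()]),
    "features": hp.choice("features", [None, PCA(n_components=2),
                                       SelectKBest(f_classif, k=2)]),
}

def objective(config):
    steps = [(name, op) for name, op in config.items() if op is not None]
    return dpso_loss(steps, clf, X, y)  # lower is better: TPE minimizes the loss

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=Trials())  # budget of 100 out of 4750 configurations
```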
4.2 Results

Figure 1 provides the results obtained with Random Forest on Breast. A summary of the results is provided in Table 1. All results being qualitatively similar, the remaining plots are provided in the Supplementary Material. The top part of Figure 1 shows that the baseline score is 0.9384 and the best score is 0.9619, i.e. an error reduction of 38% is achievable in the search space. Most configurations deteriorate the baseline score. However, the distribution of the configurations explored by SMBO is skewed towards better configurations compared to the exhaustive search, which indicates that SMBO has a better probability of finding a good configuration than random search. The bottom part shows that SMBO starts to improve the baseline score after only 4 iterations and reaches its best configuration after 19 iterations. There is only one optimal configuration in the search space, and it is not found. If we normalize the accuracy using the minimum and maximum over the configuration space, SMBO found a configuration that represents a score of 97.80% while exploring only 0.4% of the configuration space.

[Figure 1: Density of configurations, with the vertical line indicating the baseline score (top). Accuracy obtained by SMBO over the 100 configurations explored (bottom).]

Table 1: Pipeline optimization results.

                  Baseline  Exhaustive  SMBO    SMBO (norm.)  Imp. Inter.
Iris
  SVM             0.9667    0.9889      0.9778  0.9831        [11, 11]
  Random Forest   0.9222    0.9778      0.9667  0.9828        [8, 27]
  Neural Net      0.9667    0.9889      0.9778  0.9831        [17, 17]
  Decision Tree   0.9222    0.9889      0.9889  1.0000        [1, 83]
Breast
  SVM             0.9501    0.9765      0.9765  1.0000        [12, 20]
  Random Forest   0.9384    0.9619      0.9560  0.9780        [4, 19]
  Neural Net      0.9326    0.9765      0.9707  0.9903        [1, 7]
  Decision Tree   0.9296    0.9619      0.9589  0.9900        [0, 67]
Wine
  SVM             0.9151    1.0000      0.9906  0.9811        [3, 13]
  Random Forest   0.9623    0.9906      0.9811  0.9818        [5, 20]
  Neural Net      0.9057    0.9906      0.9906  1.0000        [1, 25]
  Decision Tree   0.9057    0.9811      0.9811  1.0000        [5, 35]

The column SMBO (norm.) is the SMBO score normalized within the search space. The last column is the interval whose left bound is the number of configurations required for SMBO to improve the baseline score, and whose right bound is the number of configurations explored before reaching the best score.

Table 1 shows that similar results are obtained for all methods on all datasets. SMBO always found a better configuration than the baseline, in at most 17 iterations. On average, the best score is achieved around 20 iterations (excluding Decision Tree on Iris and Breast). Decision Tree was able to reach the optimal configuration on Iris (resp. Wine) after 1 (resp. 5) iterations. In general, the score in the normalized score space belongs to [0.9780, 1.000]. To summarize, on average, with 20 iterations (0.42% of the search space) SMBO is able to decrease the error by 58.16% compared to the baseline score and finds configurations that score 98.92% in the normalized score space.
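For reference, the normalization and error-reduction figures quoted above can be reproduced with the following small computation; the formulas are our reading of the paper's description (normalization by the worst and best accuracies in the search space), not an excerpt from the authors' code.

```python
# Normalized score and error reduction as described in Section 4.2 (assumed formulas).
def normalized_score(score, worst, best):
    return (score - worst) / (best - worst)       # 1.0 means the optimum of the search space

def error_reduction(baseline, score):
    return (score - baseline) / (1.0 - baseline)  # fraction of the baseline error removed

# Breast with Random Forest: baseline 0.9384, exhaustive best 0.9619
print(round(error_reduction(0.9384, 0.9619), 3))  # 0.381, i.e. the ~38% quoted above
```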
Figure 2 shows the optimal pipeline in the search space and the four pipelines giving the best score under SMBO. All four pipelines have the correct operator for the rebalance and features steps. One uses the RobustScaler, but with an incorrect interval and without centering the data. It is hard to tell which configuration is the closest to the optimal one because there is no obvious metric on the configuration space; qualitatively, however, the best configurations are relatively similar to the optimal one. As similar results are observed for all methods and datasets, they are provided in the Supplementary Material.

[Figure 2: Optimal pipeline (top) and the best pipelines found by SMBO, using Random Forest on Breast. Optimal pipeline: SMOTE (k=5), RobustScaler [5, 95] with centering and scaling, no feature operator. Best SMBO pipelines (all with SMOTE (k=7) and no feature operator): RobustScaler [10, 90] with scaling only; StandardScaler with centering and scaling; StandardScaler; no normalizer.]

5 ALGORITHM-SPECIFIC CONFIGURATION

We would like to quantify how much an optimal configuration is specific to an algorithm or is universal, i.e. works well regardless of the algorithm. For this, the optimization process can be performed on a collection of methods A = {A_i}_{i=1}^N. The result is a sample of optimal configurations p* = {p_i*}_{i=1}^M, where M ≥ N since an algorithm might have several distinct optimal configurations. After normalizing the configuration space to bring each axis to [0, 1], the link between the processed data and the methods can be studied through the Normalized Mean Absolute Deviation (NMAD). The idea behind this metric is to measure how far the optimal points are from a reference optimal point. If the optimal configuration does not depend on the algorithm, the expected distance between the optimal configurations is 0. Conversely, if a point is specific to an algorithm, the other points will on average be far from it.

Working in the normalized configuration space has two advantages. First, it forces all parameters to have the same impact. Second, it allows comparisons from one dataset to another, since the NMAD belongs to [0, 1] for any number of algorithms or dimensions of the configuration space.

The Normalized Mean Absolute Deviation is the 1-norm of the Mean Absolute Deviation (as we work on a discrete space we use the 1-norm, but the Euclidean norm is probably a better choice in continuous spaces), divided by the number of dimensions K of the configuration space.

Definition 5.1 (Normalized Mean Absolute Deviation (NMAD)).

    NMAD(p*, r) = (1/K) · ‖ (1/N) Σ_{i=1}^{N} |p_i* − r| ‖_1

To measure how much each optimal point p_i* is specific to an algorithm A_j, we use it as a reference point and calculate the NMAD over a sample composed of all the optimal points. However, since an algorithm might have several optimal points, to be fair we use as representant of each algorithm its optimal point that is closest to the reference point.
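A direct computation of Definition 5.1 with the representant rule described above could look as follows. This is a hedged numpy sketch, assuming the configurations have already been rescaled to [0, 1] on each of the K axes.

```python
# Hedged numpy sketch of the NMAD (Definition 5.1) with the per-algorithm
# representant rule: for an algorithm with several optima, the point closest
# to the reference is used.
import numpy as np

def nmad(reference, optima_per_algorithm):
    """reference: array of shape (K,); optima_per_algorithm: one (M_j, K) array
    of optimal configurations per algorithm, all normalized to [0, 1]."""
    reference = np.asarray(reference, dtype=float)
    K = reference.shape[0]
    representants = []
    for points in optima_per_algorithm:
        points = np.asarray(points, dtype=float)
        distances = np.abs(points - reference).sum(axis=1)  # L1 distance to the reference
        representants.append(points[distances.argmin()])
    mad = np.mean(np.abs(np.array(representants) - reference), axis=0)  # per-axis MAD
    return mad.sum() / K  # 1-norm of the MAD, normalized by the number of dimensions K
```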
5.1 Protocol

As the configuration space described in Section 4.1 is not a metric space, we cannot directly use the NMAD. To avoid introducing bias with an ad-hoc distance, we perform another experiment with a configuration space that is embedded in ℕ.

We collected 1000 judgment documents provided by the European Court of Human Rights (ECHR) about Article 6. The HUDOC database (https://hudoc.echr.coe.int/) provides the ground truth, corresponding to a violation or no violation. The cases have been collected such that the dataset is balanced, and the conclusion part of each document is removed. To confirm the results, we used a second dataset composed of 855 documents from the categories atheism and religion of 20newsgroups.

Each document is preprocessed using a data pipeline consisting of tokenization and stopword removal, followed by n-gram generation. The processed documents are combined, and the k top tokens across the corpus are kept, forming the dictionary. Each case is then turned into a Bag-of-Words using the dictionary.

There are two hyperparameters in the preprocessing phase: n, the size of the n-grams, and k, the number of tokens in the dictionary. We defined the parameter configuration domain as follows:
• n ∈ {1, 2, 3, 4, 5},
• k ∈ {10, 100, 1000, 5000, 10000, 50000, 100000}.

We used the same four algorithms as in Section 4. As we are interested in the optimal configurations, we performed an exhaustive search.
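The (n, k) preprocessing described above can be realized, for example, with scikit-learn's CountVectorizer. This is an illustrative sketch only; in particular, interpreting n as the upper bound of the n-gram range (1-grams up to n-grams) is our assumption, and `corpus` is a placeholder for the document collection.

```python
# Hedged sketch of the Section 5.1 text pipeline: tokenization, stop-word removal,
# n-gram generation, and a dictionary restricted to the k most frequent tokens.
from sklearn.feature_extraction.text import CountVectorizer

def bag_of_words(documents, n, k):
    vectorizer = CountVectorizer(stop_words="english",
                                 ngram_range=(1, n),  # assumption: 1-grams up to n-grams
                                 max_features=k)      # keep the k most frequent tokens
    return vectorizer.fit_transform(documents), vectorizer

# e.g. one point of the grid: X, vec = bag_of_words(corpus, n=5, k=50000)
```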
5.2 Results

For both datasets, Figure 3 shows that the classifiers return poor results for configurations with a dictionary of only 10 or 100 tokens. Both parameters influence the results, and values that are too high deteriorate them.

[Figure 3: Heatmap depicting the accuracy depending on the pipeline parameter configuration. Top: ECHR, bottom: Newsgroup.]

Table 2: Best configurations depending on the method.

Method            (n, k)                                   accuracy
ECHR
  Decision Tree   (5, 50000)                               0.900
  Neural Network  (5, 50000)                               0.960
  Random Forest   (3, 10000), (4, 10000), (5, 50000)       0.910
  Linear SVM      (3, 50000), (4, 50000), (5, 50000)       0.921
Newsgroup
  Decision Tree   (4, 5000), (4, 100000)                   0.889
  Neural Network  (5, 50000)                               0.953
  Random Forest   (3, 10000)                               0.931
  Linear SVM      (2, 100000)                              0.946

Table 2 summarizes the best configurations per method. For the first dataset, 3 points give the optimal value for Random Forest and Linear SVM; in practice, however, the lowest parameter values are preferable because they imply lower preprocessing and training times. It is interesting to notice that (5, 50000) returns the best accuracy for every model: this point acts as a sort of universal configuration for the dataset, taking the best out of the data source rather than being well suited to a specific algorithm. On the contrary, on Newsgroup all optimal points are different. Our hypothesis is that the more structured a corpus is, the less algorithm-specific the optimal configurations are, because the preprocessing steps become more important for extracting the markers used by the algorithms to reach good performance. As the ECHR dataset describes standardized justice documents, it is far more structured than Newsgroup. This would also explain why generating n-grams for n = 5 still improves the results on ECHR while degrading them on Newsgroup.

This hypothesis is partially confirmed by Table 3, where it is clear that the n-gram operator has a strong impact on the accuracy on the ECHR dataset (up to 9.8% accuracy improvement) while having almost none on the Newsgroup dataset (with the exception of Random Forest).

Table 3: Impact of parameter n on the accuracy, measured as the relative difference between the best results obtained using only (1, k) and the best results obtained for any configuration (n, k).

Method            p = (1, k)   p = (n, k)   Δ acc
ECHR
  Decision Tree   0.850        0.900        5.9%
  Neural Network  0.874        0.960        9.8%
  Random Forest   0.863        0.910        5.4%
  Linear SVM      0.892        0.921        6.6%
Newsgroup
  Decision Tree   0.885        0.889        0.5%
  Neural Network  0.949        0.953        0.4%
  Random Forest   0.883        0.931        5.4%
  Linear SVM      0.945        0.946        0.1%

Table 4 contains the NMAD value for each distinct optimal configuration reported in Table 2; the Supplementary Material provides the calculation step by step. As can be expected, the point (5, 50000) has a NMAD of 0 since the point is present for every algorithm: (5, 50000) is a universal pipeline configuration for this data pipeline and dataset. The point (4, 50000) appears only once, but it is very close to (5, 50000) (which appears in the results of the 3 other algorithms), so its NMAD is low; it can be interpreted as belonging to the same area of optimal values. On the opposite, (3, 10000) and (4, 10000) have a high NMAD w.r.t. the other points, indicating that they are isolated and may be algorithm specific. Their NMAD values remain rather low because, despite being isolated, these points differ significantly from the others only on the second component; in comparison, if (1, 10) were an optimal point for Random Forest, its NMAD would be 0.5. On the contrary, for Newsgroup, the NMAD values are rather high and similar for all points, indicating that the points are at a similar distance from each other and really algorithm specific.

Table 4: Normalized Mean Absolute Deviation for each optimal configuration found.

ECHR                      Newsgroup
Point         NMAD        Point          NMAD
(5, 50000)    0           (4, 5000)      0.306
(3, 10000)    0.275       (4, 100000)    0.300
(4, 10000)    0.213       (5, 50000)     0.356
(3, 50000)    0.175       (3, 10000)     0.294
(4, 50000)    0.094       (2, 100000)    0.362

To summarize, the NMAD metric is coherent with the conclusions drawn from the heatmaps and Table 2, and suggests that there exist two types of optimal configurations: universal pipeline configurations that work well on a large range of algorithms for a given dataset, and algorithm-specific configurations. Thus, we are confident that the NMAD can be used in larger configuration spaces, where heatmaps and exhaustive results are not available for graphical interpretation, and that it can help to reuse configurations.

6 CONCLUSION

In this paper, we successfully applied Sequential Model-Based Optimization techniques to data pipeline selection and configuration. In addition, we provided a metric to study whether an optimal configuration is algorithm specific or rather universal.

The main practical drawback of the iterative approach presented in this paper is the cost involved in processing the data and training the model for each selected configuration. To mitigate this problem, we see a few possibilities to explore:
• decreasing the amount of data to preprocess using a sampling technique, as described in [12],
• using in priority the data pipelines suggested by a meta-learning algorithm such as the ones described in [3, 4],
• caching the intermediate results of the data pipeline to reuse, when possible, the outcome of some transformations (e.g. there is no need to regenerate the 2-grams for n ≥ 3 if a previous configuration with n = 2 has already been explored); a possible realization is sketched below.
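One hedged way to realize the caching idea of the last bullet with scikit-learn is to give the pipeline a joblib Memory object, so that transformers fitted with identical parameters and data are read from an on-disk cache instead of being recomputed across pipeline evaluations; this is an illustration under that assumption, not the setup used in this paper.

```python
# Hedged illustration of caching intermediate pipeline results with scikit-learn.
from joblib import Memory
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

memory = Memory(location="pipeline_cache", verbose=0)  # cache directory (illustrative name)
cached_pipeline = Pipeline(steps=[("normalizer", StandardScaler()),
                                  ("features", PCA(n_components=2)),
                                  ("clf", SVC())],
                           memory=memory)  # fitted transformers are cached and reused
```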
Another aspect to be addressed is the compromise between time and performance. Some parameters increase the preprocessing time but not the model training time (e.g. n-gram computation), while others may not affect the preprocessing time but significantly increase the model training time (e.g. the number of tokens k). A fine-grained time analysis would be required, and an intelligent pruning system could be a solution to avoid costly iterations.

Future work should focus on an online version such that the pipeline is tuned in a streaming way. Also, the NMAD indicator works only in Euclidean spaces, which is not the case for the first experiment; further work should therefore focus on extending the NMAD to non-vector spaces.

REFERENCES

[1] Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. 2013. Collaborative hyperparameter tuning. In Int. Conf. Mach. Learn. 199–207.
[2] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. 2011. Algorithms for Hyper-parameter Optimization. In Proc. Int. Conf. Neural Inf. Process. Syst. 2546–2554.
[3] B. Bilalli, A. Abelló, and T. Aluja-Banet. 2017. On the Predictive Power of Meta-features in OpenML. Int. J. Appl. Math. Comput. Sci. 27, 4 (2017), 697–712.
[4] B. Bilalli, A. Abelló, T. Aluja-Banet, and R. Wrembel. 2018. Intelligent assistance for data pre-processing. Computer Standards & Interfaces 57 (2018), 101–109.
[5] M. Birattari, Z. Yuan, P. Balaprakash, and T. Stützle. 2010. F-Race and Iterated F-Race: An Overview. Springer Berlin Heidelberg, Berlin, Heidelberg, 311–336.
[6] S. F. Crone, S. Lessmann, and R. Stahlbock. 2006. The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing. Eur. J. Oper. Res. 173, 3 (2006), 781–800.
[7] T. Dasu and T. Johnson. 2003. Exploratory data mining and data cleaning. Vol. 479. John Wiley & Sons.
[8] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and Robust Automated Machine Learning. In Adv. Neural Inf. Process. Syst., C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). 2962–2970.
[9] F. Hutter, H. H. Hoos, and K. Leyton-Brown. 2011. Sequential Model-based Optimization for General Algorithm Configuration. In Proc. Int. Conf. Learn. Intel. Optim. Springer-Verlag, Berlin, Heidelberg, 507–523.
[10] F. Hutter, H. H. Hoos, K. Leyton-Brown, and T. Stützle. 2009. ParamILS: An Automatic Algorithm Configuration Framework. J. Artif. Intel. Res. 36 (2009), 267–306.
[11] J. Kietz, F. Serban, S. Fischer, and A. Bernstein. 2014. "Semantics Inside!" But Let's Not Tell the Data Miners: Intelligent Support for Data Mining. In The Semantic Web: Trends and Challenges. Springer International Publishing, 706–720.
[12] J. Nalepa, M. Myller, S. Piechaczek, K. Hrynczenko, and M. Kawulok. 2018. Genetic Selection of Training Sets for (Not Only) Artificial Neural Networks. In Proc. Int. Conf. Beyond Databases, Architectures Struct. 194–206.
[13] N. M. Nawi, W. H. Atomi, and M. Z. Rehman. 2013. The Effect of Data Preprocessing on Optimized Training of Artificial Neural Networks. Procedia Technology 11 (2013), 32–39. Int. Conf. Elect. Eng. Info.
[14] N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich. 2017. Data Management Challenges in Production Machine Learning. In Proc. ACM Int. Conf. Manage. Data. ACM, 1723–1726.
[15] Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2013. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Int. Conf. Knowl. Disc. Data Min. ACM, 847–855.
[16] M. Vartak, S. Rahman, S. Madden, A. Parameswaran, and N. Polyzotis. 2015. SeeDB: efficient data-driven visualization recommendations to support visual analytics. Proc. VLDB Endowment 8, 13 (2015), 2182–2193.
[17] Dani Yogatama and Gideon Mann. 2014. Efficient transfer learning method for automatic hyperparameter tuning. In Int. Conf. Artif. Intel. Stat. 1077–1085.
[18] Z. Zhao, L. De Stefani, E. Zgraggen, C. Binnig, E. Upfal, and T. Kraska. 2017. Controlling False Discoveries During Interactive Data Exploration. In Proc. ACM Int. Conf. Manag. Data. ACM, 527–540.