Query-Centric Regression for In-DBMS Analytics

Qingzhi Ma (University of Warwick, Q.Ma.2@warwick.ac.uk)
Peter Triantafillou (University of Warwick, P.Triantafillou@warwick.ac.uk)

ABSTRACT

Research in enriching DBs with Machine Learning (ML) models is receiving increasing attention. This paper experimentally analyzes the problem of empowering data systems (and their users) with regression models (RMs). The paper offers a data system's perspective, which unveils an interesting 'impedance mismatch' problem: ML models aim to offer a high expected overall prediction accuracy, which essentially assumes that queries will target data using the same distributions as the data on which the models were trained. However, in data management it is widely recognized that query distributions do not necessarily follow data distributions. Queries using selection operators target specific data subspaces on which even an overall highly-accurate model may be weak. If such queried subspaces are popular, large numbers of queries will suffer. The paper will reveal, shed light on, and quantify this 'impedance mismatch' phenomenon. It will study in detail 8 real-life data sets and data from TPC-DS, and experiment with various dimensionalities therein. It will employ new, appropriate metrics, substantiating the problem across a wide variety of popular RMs, ranging from simple linear models to advanced, state-of-the-art ensembles (which enjoy excellent generalization performance). It will put forth and study a new, query-centric model that addresses this problem, improving per-query accuracy while also offering excellent overall accuracy. Finally, it will study the effects of scale on the problem and its solutions.

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

A new, dominant trend has emerged for next-generation data management and analytics systems, based on integrating ML models and data management platforms [10, 15, 24, 25, 35, 36, 42, 45]. Additional efforts pertain to connectors to back-end databases, which allow for statistical analyses and related queries on DB data, like MonetDB.R [37], SciDB-Py [23], and Psycopg [48]. Another class of efforts concerns learning from past answers to predict the answers to future analytical queries, e.g., for approximate query processing engines, which provide approximate answers to aggregate queries using ML techniques [3-5, 41], for tuning database systems [2], and for forecasting workloads [32]. Yet another class of efforts concerns model and query-prediction serving, like the Velox/Clipper systems [16, 17], which manage ML models for predictive analytics. Finally, vision papers suggest the move towards model selection management systems [29], where a primary task is model selection, whereby the system is able to select the best model to use for the task at hand.

In this realm, regression models, being a principal means for predictive analytics, are of particular interest to both analysts and data analytics platforms. RMs are playing an increasingly important role within data systems. Examples of their extended use and significance include many modern DBs that provide support for regression, such as XLeratorDB [6] for Microsoft SQL Server, Oracle UTL_NLA [1, 51], and IBM Intelligent Miner [47], which provide SQL interfaces for analysts to specify regression tasks. Academic efforts include MADLib (over PostgreSQL) [24], MAD [15], and MauveDB [18], which integrates regression models into an RDBMS. [39, 40] use User-Defined Functions (UDFs) to compute statistical machine learning models and data summarization. DBEst [33], an approximate query processing engine, uses ML models (regressions and density estimators) to provide answers to popular queries. Similarly, FunctionDB [46] builds regression models so that future queries can be evaluated using the models. Furthermore, [43] integrates and supports least squares regression models over training data sets defined by join queries on database tables. Thus, in general, RMs are useful for query processing and data analytics tasks. In addition, RMs are helpful for many other key tasks: imputing missing values, testing hypotheses, data generation, fast visualization, etc.
Motivations

Given the above increasing interest in bridging ML and RMs with DBs, we focus on how seamless this process can be. ML models (and RMs in particular) are trained to optimize a loss function (invariably concerning overall expected error). We refer to this as a workload-centric view, as the aim is to minimize the expected error among all queries in an expected workload. In essence, this assumes that query distributions (i.e., the expected workload) are similar to that of the training data. In contrast, data systems research has long recognized that query workloads follow patterns generally different from data distributions. Hence, queries on data subspaces (e.g., using range or equality selection predicates) where an ML model is weak will suffer from high errors. And, if such queries are popular, many queries will suffer. This gives rise to the need for a "query-centric perspective": we define query-centric regression as a model which strives to ensure both high average accuracy (across all queries in a workload) and high per-query accuracy, whereby each query is ensured to enjoy accuracy close to that of the best possible model.

The ML community's general answer to such problems is to turn to ensemble methods, in order to lower variance and generalize better (e.g., to different distributions). We wish to shed light on this possible impedance mismatch problem and see if it holds for simpler and even for state-of-the-art ensemble RMs. We further wish to (i) quantify the phenomenon: we shall use several real data sets (from the UCI ML repository and TPC-DS data), a wide variety of popular RMs, and new metrics that reveal workload-centric and query-centric performance differences; and (ii) see if the problem can be addressed by adopting a query-centric perspective (using a new ensemble method), whereby the error observed by each query is as low as possible (which also indirectly ensures high workload-centric performance).

The above bear strong practical consequences. Consider a data analyst using Python or R, linked with an ML library (like Apache Spark MLlib, Scikit-Learn, etc.), or using a DB connector like MonetDB.R or SciDB-Py, or a prediction serving system like Clipper, and the following use cases.

Scenario 1: The analyst uses a predicate to define a data subspace of interest and calls a specific RM method: it would be great if she knew which RMs to use for which data subspaces.

Scenario 2: Alternatively, the system could select the best RM automatically for the analyst's query at hand.

With this paper we wish to inform the community of this DB-RM impedance mismatch problem, study and quantify it, and lay the foundation for the seamless use of RMs for in-DBMS analytics, offering this query-centric perspective.
2 BACKGROUND

Our study employs a set of representative and popular RMs, grouped into two categories: simple and ensemble RMs.

2.1 Simple Regression Models

Simple RMs include linear regression (LR), polynomial regression (PR), decision tree regression (DTR), SVM regression (SVR), nearest neighbours regression (NNR), etc. An introduction to these simple regression models is omitted for space reasons. Table 1 summarizes the known asymptotic time complexity of training for key regression models; more detailed comparisons are made and discussed in §3.5.

Table 1: Complexity of typical regression models

  LR:               O(d^2 n)
  PR:               O(d^4 n)
  DTR:              between O(d n log n) and O(n^2 d)
  NNR:              O(n(d + k)) or O(ndk)
  SVR:              O(v n^2)
  Gaussian Process: O(n^3)

Note: d is the dimension of the data points, n is the number of points in the training data, k is the number of neighbors for kNN regression, and v is the number of support vectors for SVM regression.

2.2 Ensemble Methods

Ensemble methods are powerful methods that combine the predictions (and advantages) of base models. It is often observed that prediction accuracy is improved by combining the prediction results in some way, e.g., using a weighted average of the predictions from the various base models [8]. Ensemble learning is also useful for scaling up data mining and model prediction [44]. There are many well-developed ensemble methods, including averaging-based ensemble methods, bootstrap aggregating (bagging) [9], boosting [20, 21], stacking [50], mixtures of experts [26], etc.

Averaging-based ensemble methods calculate the weighted average of the predictions from all models. This incurs higher computational costs and higher response times.

Boosting refers to a family of algorithms that can potentially convert "weak models" into "strong models". AdaBoost [20], short for "adaptive boosting", is a popular boosting algorithm. Unlike bootstrap aggregating, whose models are trained in parallel, the prediction models in AdaBoost are trained in sequence. AdaBoost was first proposed to solve classification problems and was later applied to regression problems. Randomization may be incorporated into boosting to reduce its response time [22].

The objective of gradient boosting is to minimize the loss function

    L(y_i, f(x_i)) = MSE = \sum_i (y_i - f(x_i))^2    (1)

and the predictions are updated in the direction of gradient descent (i.e., against the gradient):

    f(x_i)_{r+1} = f(x_i)_r - \alpha \cdot \frac{\partial L(y_i, f(x_i)_r)}{\partial f(x_i)_r}    (2)

where r is the iteration number and \alpha the step size. Gradient boosting (GBoost) usually uses only first-order information; Chen et al. incorporate second-order information into gradient boosting for conditional random fields and improve its accuracy [12]. However, the base models are usually limited to classification and regression trees (CART); other regression models are not supported.

XGBoost [11] is a state-of-the-art boosting method and is widely used in competitions due to its fast training time and high accuracy. The objective of XGBoost is

    obj(\Theta) = L(\Theta) + \Omega(\Theta)    (3)

where L(\Theta) is the loss function, which controls how close predictions are to the targets, and \Omega(\Theta) is the regularization term, which controls the complexity of the model. Over-fitting is avoided if a proper \Omega(\Theta) is selected. The base models (boosters) can be gbtree, gblinear or dart [49]; gbtree and dart are tree models, while gblinear is linear.
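For concreteness, the following is a minimal sketch (ours, not from the paper) of how objective (3) maps onto the XGBoost Python API: the loss L(\Theta) is chosen via `objective`, the regularization \Omega(\Theta) via knobs such as `reg_lambda` and `reg_alpha`, and the base models via `booster`. The synthetic data and parameter values are purely illustrative.

```python
# Minimal, illustrative sketch (not the paper's code) of objective (3) in the
# XGBoost Python API: obj = L (via 'objective') + Omega (via reg_* knobs).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 3))                       # toy training data
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

model = xgb.XGBRegressor(
    booster="gbtree",             # base models: 'gbtree', 'gblinear' or 'dart'
    objective="reg:squarederror", # the loss term L(Theta)
    reg_lambda=1.0,               # L2 weight in the regularization Omega(Theta)
    reg_alpha=0.0,                # L1 weight in Omega(Theta)
    n_estimators=100,
    learning_rate=0.1,            # shrinkage, loosely the step size in (2)
)
model.fit(X, y)
```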
3 EXPERIMENTAL SETUP

All experiments run on an Ubuntu system with an Intel Core i5-7500 CPU @ 3.40GHz (4 cores) and 32GB of memory.

3.1 Hypotheses

The study rests on testing, validating, and quantifying two key hypotheses:

Hypothesis 1. Different RMs exhibit higher accuracy for different regions of the queried data spaces, and likewise for different data sets. Such differences can be large and may occur even though said RMs may enjoy similar overall accuracy.

The corollary of Hypothesis 1 is that, even if our analysts in Scenario 1 knew of highly-accurate RMs, many of their analytical queries would be susceptible to large errors. Hypothesis 1 aims to test whether the loss function used by even top-performing RMs, which minimizes overall expected error, 'hides' this issue. En route, the analysis will quantify this problem across many different RMs, data sets, and dimensionalities. Intuitively, our aim with testing Hypothesis 1 is to show that, regardless of which single model is used, some queries will always be processed by sub-optimal models; we wish to quantify this opportunity loss.

Hypothesis 2. Given Hypothesis 1, a model equipped with knowledge of the accuracy distribution of RMs in the query space can near-optimally serve each query.

Such a model, coined QReg (a classifier-based ensemble method that bears a query-centric perspective), will be studied here to validate Hypothesis 2. Hypothesis 2 aims to show that integrating an RM within a DB can be done in a query-centric manner, avoiding the aforementioned problems, and thus offering a solution for Scenario 2. Our aim with testing Hypothesis 2 is to show that a new ensemble model can significantly alleviate this problem; the ROL_{i,j} metric (§3.3) will help quantify how much such a model (QReg) improves on this opportunity loss for queries.

3.2 Data Sets and Dimensionality

To test the hypotheses, eight real-world data sets with different characteristics from the UCI machine learning repository [31] are used, varying the dimensionality from 2 to 5, as well as a large fact table from the TPC-DS benchmark [38].

Data set 1 is a collection of YouTube videos, with input and output video characteristics along with the transcoding time and memory requirements. Data set 2 contains physicochemical properties of protein tertiary structure; the task includes predicting the size of the residue (RMSD) based on nine properties; there are 45730 decoys, and the size varies from 0 to 21 angstroms. Data set 3 is an hourly data set containing PM2.5 gas concentration data in Beijing; the task is to predict the PM2.5 concentration (μg/m^3), and the independent variables include pressure (PRES), cumulated wind speed (Iws), etc. Data set 4 is an online news popularity data set; tasks include predicting the number of shares in social networks (popularity); there are 39797 records in total. Data set 5 contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011); the task is to predict the net hourly electrical energy output (EP) of the plant. Data set 6 is the YearPredictionMSD data set, used to predict the release year of a song from audio features; most of the songs are commercial tracks from 1922 to 2011.
Data set 7 contains the recordings of 16 chemical sensors exposed to two dynamic gas mixtures at varying concentrations; the goal is to predict the recording of one specific chemical sensor based on the other sensors and the gas mixtures. This is a time-series data set containing more than 4 million records in total. Data set 8 records the individual household electric power consumption of one household for more than four years; there are two million records.

We further employ the table store_sales from the popular TPC-DS benchmark [38]. Typical columns used for the experiments include ss_wholesale_cost, ss_list_price, ss_sales_price, ss_ext_sales_price and ss_ext_wholesale_cost.

3.3 Evaluation Metrics

Accuracy is measured using the Normalized Root Mean Square Error (NRMSE) metric, defined as:

    NRMSE = \frac{\sqrt{\frac{1}{n}\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}}{y_{max} - y_{min}}    (4)

NRMSE shows the overall deviation between predicted and measured values; it is built upon the root mean square error (RMSE) and is scaled to the range of the measured values. It provides a universal measure of prediction accuracy across different regression models.

The NRMSE ratio r compares the prediction accuracy of one RM_i against that of any other RM_j, and is defined as r = NRMSE_i / NRMSE_j. If NRMSE_j <= NRMSE_i, this ratio shows how much worse RM_i is compared to RM_j.

The above are standard metrics used for comparing accuracy. However, our study calls for additional metrics. Inherent in our study is the need to reflect the differences in accuracy observed by a query, as they depend on the model used. For this we define the concept of opportunity loss as a natural way to reflect how much a query loses in accuracy by using a sub-optimal model. Assuming RM_opt is the RM with the lowest NRMSE, we define the opportunity loss OL_i as

    OL_i = \frac{NRMSE_i}{NRMSE_{opt}} - 1,    (5)

which quantifies, as a percentage, the error (the opportunity loss) due to not using the best model RM_opt and using RM_i instead. Furthermore, we define the relative opportunity loss

    ROL_{i,j} = \frac{OL_i}{OL_j},    (6)

which quantifies how much better RM_j does vs RM_i in improving on the opportunity loss.
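These metrics are straightforward to compute; the following is a minimal sketch (ours, not the authors' code) of NRMSE (4), the NRMSE ratio r, OL (5), and ROL (6).

```python
# Minimal sketch (ours) of the evaluation metrics of this section.
import numpy as np

def nrmse(y_true, y_pred):
    """RMSE scaled to the range of the measured values, eq. (4)."""
    rmse = np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))
    return rmse / (np.max(y_true) - np.min(y_true))

def nrmse_ratio(nrmse_i, nrmse_j):
    """r = NRMSE_i / NRMSE_j: how much worse RM_i is than RM_j (when r > 1)."""
    return nrmse_i / nrmse_j

def opportunity_loss(nrmse_i, nrmse_opt):
    """OL_i = NRMSE_i / NRMSE_opt - 1, eq. (5); RM_opt has the lowest NRMSE."""
    return nrmse_i / nrmse_opt - 1.0

def relative_opportunity_loss(ol_i, ol_j):
    """ROL_{i,j} = OL_i / OL_j, eq. (6)."""
    return ol_i / ol_j
```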
3.4 Architecture

Assume that the data system maintains m regression models. When a query arrives, the system needs to identify the model with the least prediction error for this query. We treat this model selection problem as a classification problem. Fig. 1 shows the architecture of this classified regression, QReg.^1

[Figure 1: Architecture of the classified prediction method.]

There are two layers in the system: (i) the model maintenance layer deploys and maintains the base regression models; (ii) the query classification layer implements the core of QReg. A query is first passed to the pre-trained classifier. Because the classifier "knows" the prediction ability of each model in the various queried data spaces, the query is assigned to the model that performs best locally for this query's data space. Unlike typical ensemble methods, only one model is invoked per query (hence QReg is less computationally intensive).

Two configurations are studied for QReg: Simple QReg uses LR, PR, and DTR; Advanced QReg uses GBoost and XGBoost as its base models.

^1 The source code for QReg is available at https://github.com/qingzma/qreg

3.5 Candidate Base Models

When deciding which models to include, the following key criteria are considered.

(a.) Model training time: This should be as low as possible and should scale well as the number of training data points increases. Given the asymptotic complexities summarised in §2.1, a number of experiments were conducted to quantify the training times of the various regression models. Fig. 2 is a representative result for data set 4 using 4 dimensions; it shows how the training time of six regression models is impacted as the number of training instances increases.

[Figure 2: Training time of typical regression models.]

Model training time is shown to behave acceptably with respect to the number of instances in the training data set for LR, PR, DTR, and kNN regression. The training time of support vector regression with the radial basis function kernel tends to increase much more aggressively as the number of training points increases. The experiment was repeated for all data sets; the above conclusion holds across all experiments, which are omitted for space reasons.

(b.) Query response time: In the classifier training process, predictions are made by each base prediction model. To reduce the overall training time as well as the query response time, the models should have as low a response time as possible.

(c.) Prediction accuracy: An interesting issue arises from using different regression models together, as in QReg. If the base regression models have large differences in accuracy levels, this may result in QReg having poor accuracy, as a direct result of errors introduced during the separate classification process. Therefore, care must be taken to ensure that the base models enjoy similar and good accuracy levels.

3.6 Model Training Strategy

Fig. 3 shows the model training strategy of QReg. The data set is partitioned into three subsets: (i) Training_Dataset_1 is used to train the base models; (ii) Training_Dataset_2 is used to train the classifier in QReg; (iii) Testing_Dataset is used to evaluate the accuracy of QReg.

[Figure 3: Model training of QReg. The main data set is split into Training Dataset 1, Training Dataset 2, and the Testing Dataset; base models 1..n produce predictions 1..n on Training Dataset 2, which form a new data set used to build the classifier.]

After partitioning the data set, the m base models are trained on Training_Dataset_1. The selection of base models is in principle open and depends on the user's choice (taking into account the above issues). Each base model f_i(x) makes a prediction ŷ_i for each data point x in Training_Dataset_2. A comparison is made between the predicted ŷ_i and the real label y to find the best prediction model for each query x. Given the individual predictions and associated errors, a new data set is generated by combining each data point x with the index i of the best model for this query, depicted [x, i]. This data set is then used to build the classifier reflecting the prediction ability of the base models in the query space. The classifier is the core of QReg: a well-designed classifier can grasp the prediction ability of each model in the query space correctly, and thus the prediction accuracy can be significantly improved compared to the individual prediction models.

Note that the original data set is partitioned into 3 subsets instead of 2. This ensures that different training data sets are used to train the base models and the classifier, respectively, which avoids potential over-fitting problems. In addition, models are fine-tuned via cross-validation using GridSearchCV() in the scikit-learn package.
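To make the strategy concrete, here is a condensed, hypothetical sketch (ours, not the released implementation; see footnote 1 for the actual code) of the train-and-route loop of Fig. 3. The base models and classifier shown are merely one possible configuration.

```python
# Condensed sketch of QReg's strategy: train base RMs on one split, label a
# second split with the index of the per-point best model, fit a classifier
# on those labels, and route each incoming query to exactly one base model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBClassifier

class QRegSketch:
    def __init__(self, base_models=None):
        self.base_models = base_models or [LinearRegression(),
                                           DecisionTreeRegressor(max_depth=8)]
        self.router = XGBClassifier()    # the paper's default classifier

    def fit(self, X1, y1, X2, y2):
        for m in self.base_models:       # (i) base models on Training_Dataset_1
            m.fit(X1, y1)
        errors = np.column_stack(        # (ii) per-point errors on Training_Dataset_2
            [np.abs(m.predict(X2) - y2) for m in self.base_models])
        best = errors.argmin(axis=1)     # index i of the best model -> [x, i]
        self.router.fit(X2, best)        # (iii) classifier learns the win regions
        return self

    def predict(self, X):
        idx = self.router.predict(X)     # only one base model is invoked per query
        preds = np.empty(len(X))
        for i, m in enumerate(self.base_models):
            mask = idx == i
            if mask.any():
                preds[mask] = m.predict(X[mask])
        return preds
```

Swapping the `router` line for an SVM with a linear or RBF kernel is all it takes to reproduce the kind of classifier comparison discussed in §5.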
4 EVALUATING HYPOTHESIS I

Consider data set 3, the Beijing PM2.5 data set [30], using cumulated wind speed (Iws) and pressure (PRES) as the features, yielding a 3-dimensional regression problem.

Fig. 4(a) shows the distribution of the model with the least error for all data points, with LR, PR, and DTR as the base RMs (Simple QReg). Fig. 4(b) shows the distribution of the best models when the ensemble models GBoost and XGBoost are used (Advanced QReg).

[Figure 4: Distribution of best models for Beijing PM2.5. Panels: (a) Simple models (LR, PR, DTR); (b) Ensemble models (GBoost, XGBoost). Axes: Pressure (hPa) vs. cumulated wind speed (m/s).]

Take QReg using simple models as an example. LR dominates in the upper-central region, PR dominates in the lower-central regions, and DTR performs best in the rest of the space. The NRMSEs for LR, PR and DTR are 8.48%, 8.84%, and 8.32%, respectively. However, if for each point the best model can be selected to make the prediction (as shown in Fig. 4), the corresponding ("optimal") NRMSE drops to 7.19%. This is a large improvement in accuracy. Figures like Fig. 4 can help analysts decide which models to use when querying this space. Similar figures exist for all data sets studied in this work and are omitted for space reasons.

Table 2: Win-counts of simple RMs.

  Data set ID   Count of LR   Count of PR   Count of DTR
  1                    6468          6919           5366
  2                    3606          4133           6217
  3                    1077          1580           1472
  4                    3618          4826           4155
  5                     980           885           1307
  6                   62749         51953          57007
  7                    9646          4029           4953
  8                   35702         28800          40311

Table 2 adopts a different perspective. It shows the number of most accurate predictions ("wins") made by each simple RM, per data set. First, note that all models have a good number of wins. Second, no model has the highest number of wins across all data sets; there is no single winner. DTR enjoys the most wins for data sets 2, 5, and 8; PR makes the most accurate predictions for data sets 1, 3, and 4; LR wins for data sets 6 and 7.
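The bookkeeping behind Table 2 (and behind the conditional NRMSEs of Table 3 below) can be derived along the following lines; this is our own sketch, not the authors' code.

```python
# Sketch (ours): count per-query "wins" per model on a test set, and compute
# each model's NRMSE restricted to the subset of queries where a model wins.
import numpy as np

def wins_and_conditional_nrmse(y_true, preds_by_model):
    """preds_by_model: dict name -> prediction array over the same test set."""
    names = list(preds_by_model)
    errs = np.column_stack([np.abs(preds_by_model[n] - y_true) for n in names])
    winner = errs.argmin(axis=1)                 # best model per query
    span = y_true.max() - y_true.min()
    table = {}
    for j, win_name in enumerate(names):         # "where model j wins"
        mask = winner == j
        table[win_name] = {
            "wins": int(mask.sum()),
            "nrmse": {n: np.sqrt(np.mean((preds_by_model[n][mask]
                                          - y_true[mask]) ** 2)) / span
                      for n in names},
        }
    return table
```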
Table 3 zooms in, augmenting Table 2 by showing the NRMSEs when the different simple RMs win. For example, for data set 1, we know from Table 2 that LR wins 6468 times. For these queries, LR's NRMSE was 11.08%, as indicated by Table 3, whereas PR's was enormous and DTR's was 18.28% (see the three values in the "where LR wins" column for data set 1). Similarly, for the 6919 queries where PR won, LR's NRMSE was 11.82%, whereas PR's was 9.35% and DTR's was 11.68% (the "where PR wins" column).

Table 3: NRMSE values when different RMs win.

  Data set ID   Model   where LR wins   where PR wins   where DTR wins   Overall NRMSE
  1             LR       11.08%          11.82%          14.23%           12.32%
  1             PR      >>1000%           9.35%         >>1000%          >>1000%
  1             DTR      18.28%          11.68%          10.60%           14.06%
  2             LR       27.68%          23.78%          28.61%           27.02%
  2             PR       30.96%          20.20%          27.85%           26.72%
  2             DTR      34.24%          26.18%          21.81%           26.79%
  3             LR        8.47%           8.00%           8.99%            8.48%
  3             PR        9.91%           6.60%          10.04%            8.84%
  3             DTR      10.54%           7.99%           6.65%            8.32%
  4             LR        2.46%           4.96%           5.55%            4.62%
  4             PR        3.82%           2.62%           4.49%            3.67%
  4             DTR       4.07%           3.54%           2.51%            3.41%
  5             LR       16.69%          15.97%          19.91%           17.90%
  5             PR       18.75%          13.94%          18.79%           17.56%
  5             DTR      21.07%          16.99%          15.30%           17.72%
  6             LR       11.06%          12.43%          13.39%           12.29%
  6             PR       12.15%          11.47%          12.77%           12.15%
  6             DTR      12.30%          12.04%          12.15%           12.17%
  7             LR       80.48%         113.75%         106.84%           95.85%
  7             PR      210.45%          83.39%         104.22%          165.31%
  7             DTR     118.60%         111.17%          76.58%          107.31%
  8             LR        8.81%          10.02%          10.49%            9.82%
  8             PR       10.36%           8.05%          10.04%            9.66%
  8             DTR      11.13%           9.99%           8.29%            9.80%

Note: each row gives the NRMSE of the row's model (LR, PR, or DTR) for the given data set, computed over the queries where LR, PR, or DTR wins, respectively, and over all queries (Overall).

Consider data set 4. When LR wins, its error is markedly lower (almost half) than that of PR and DTR, unlike their overall NRMSEs, which show LR to be the worst model.

To further facilitate a query-centric perspective, we delve into the performance on the queries for which each RM reached a top-20% performance (a sketch of this selection is given at the end of this section). For data set 1, for example, this includes the best 20% of the 6468 queries for which LR wins, the best 20% of the 6919 queries for which PR wins, and the best 20% of the 5366 queries for which DTR wins. Hence, the NRMSE of interest does not come from all the queries, but from the top 20% of queries for which the least error was achieved by a simple RM. Table 4 shows these results, along with the overall NRMSE of each simple RM over the whole set of queries. Again, note that the overall NRMSEs are quite close. However, individual differences are very large. For data set 1, for instance, the top 20% of queries when LR wins enjoy an NRMSE that is about half of the NRMSE of the others. Interestingly, the same holds for PR and DTR! Similar conclusions hold for the other data sets.

Table 4: NRMSEs for the top 20% queries per simple RM.

  Data set ID   Model   where LR wins   where PR wins   where DTR wins   Overall NRMSE
  1             LR        3.61%           5.36%           6.67%            5.36%
  1             PR        7.06%           2.92%           6.33%            5.72%
  1             DTR       7.96%           5.78%           3.42%            6.01%
  2             LR       18.35%          16.14%          16.10%           16.89%
  2             PR       21.26%          12.39%          12.88%           16.04%
  2             DTR      24.14%          17.09%           7.97%           17.69%
  3             LR        3.03%           3.64%           4.83%            3.90%
  3             PR        4.89%           1.44%           4.07%            3.76%
  3             DTR       5.35%           2.78%           2.12%            3.69%
  4             LR        0.96%           2.33%           2.74%            2.15%
  4             PR        2.33%           0.63%           1.56%            1.66%
  4             DTR       2.54%           1.40%           0.78%            1.74%
  5             LR        7.56%           9.77%           9.44%            8.98%
  5             PR       10.00%           7.75%           8.11%            8.68%
  5             DTR      12.11%          10.55%           4.85%            9.69%
  6             LR        4.11%           5.26%           5.50%            5.00%
  6             PR        5.22%           4.12%           4.82%            4.74%
  6             DTR       5.42%           4.76%           4.13%            4.80%
  7             LR       42.14%         109.35%          95.42%           87.25%
  7             PR      106.72%          80.88%          82.48%           90.80%
  7             DTR      79.49%         106.48%          65.26%           85.47%
  8             LR        3.07%           5.57%           3.65%            4.23%
  8             PR        4.37%           3.77%           3.67%            3.95%
  8             DTR       5.17%           5.92%           1.60%            4.63%

The corresponding data for the advanced ensemble RMs is highly similar, and we omit it for space reasons.

Table 5: Win counts of ensemble RMs.

  Data set ID   Count of GBoost   Count of XGBoost
  1                       10334               8419
  2                        7037               6919
  3                        2653               1476
  4                        2142              10457
  5                        1596               1576
  6                       92916              78793
  7                        6991              11637
  8                       52949              51864
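One plausible reading of the top-20% selection above is sketched below; this is our own interpretation, as the paper does not spell out the exact procedure: among the queries a model wins, keep the 20% with the smallest errors and compute the NRMSE there.

```python
# Sketch (ours, one plausible reading) of the "top 20%" evaluation behind
# Table 4: within a model's win set, keep the 20% of queries with the
# smallest errors and report the NRMSE over that subset.
import numpy as np

def top20_nrmse(y_true, y_pred, win_mask, frac=0.2):
    err = np.abs(y_pred[win_mask] - y_true[win_mask])
    keep = np.argsort(err)[: max(1, int(frac * err.size))]
    yt, yp = y_true[win_mask][keep], y_pred[win_mask][keep]
    return np.sqrt(np.mean((yp - yt) ** 2)) / (y_true.max() - y_true.min())
```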
5 EVALUATING HYPOTHESIS II

This hypothesis aims to substantiate whether it is possible to develop a method that can learn from the key findings of the previous section and leverage them in order to address Scenario 2: automating the decision as to which RM to use, relieving the DB user/analyst of the conundrum, towards a query-centric RM. Specifically, we study if, and at what cost, a method can (i) near-optimally select the best regression model for any query at hand, and (ii) achieve better overall accuracy than any single (simple or ensemble) method. It is natural to treat this as a model selection problem, using a classifier for the method selection.

We show a new ensemble method, QReg, which materializes a query-centric perspective achieving the above two aims.^2 We have considered various classifiers for QReg, including SVM classifiers with a linear kernel, SVM classifiers with the RBF kernel, the XGBoost classifier, etc., and made a comprehensive comparison between them. Unless explicitly stated otherwise, results are shown for the XGBoost classifier, due to its overall prediction accuracy and scalability.

^2 NB: the aim here is not to find the best method to achieve this, but to show that this is achievable and that significant gains can be achieved using easy-to-deploy methods.

[Figure 5: QReg distribution of base models. Panels: (a) Simple QReg (LR, PR, DTR); (b) Advanced QReg (GBoost, XGBoost). Axes: Pressure (hPa) vs. cumulated wind speed (m/s).]

Fig. 5 shows the RMs chosen by Simple and Advanced QReg for each corresponding query, giving a feel for the overall RM distribution suggested by QReg. The model distribution shown resembles the distribution of the truly optimal models across the queried data space, as shown in Fig. 4. Thus, QReg does a good job in selecting (near-)optimal RMs per query. As per Scenario 1, presenting such visualisations can be of real value to analysts.

Workload-centric Perspective: Simple QReg

A workload-centric perspective assumes that the query distribution is identical to the data distribution, as described in §1. Simple QReg uses simple regression models: LR, PR, and DTR. Fig. 6 shows the NRMSE ratio r, as defined in §3.3, for all data sets in 3-d space. An NRMSE ratio r larger than 1 means that QReg has a lower prediction error than the other base model. QReg is shown to outperform, or be as good as, any of its base models. Specifically, for data sets 2, 3, 5, 6, and 8, QReg performs slightly better than the other regression models, whereas for data sets 1, 4, and 7, QReg is significantly superior to LR, PR, and DTR.

[Figure 6: Accuracy of Simple QReg vs LR, PR, DTR (NRMSE ratios per data set).]

Fig. 7 compares the prediction error of Simple QReg against the more sophisticated ensemble methods, including AdaBoost, GBoost, and XGBoost. It shows that even Simple QReg often achieves better prediction accuracy than any of the sophisticated ensemble methods. For example, up to a 25% reduction in NRMSE is achieved by Simple QReg for data set 1 in 3-d space.

[Figure 7: Accuracy of QReg vs ensemble RMs (AdaBoost, GBoost, XGBoost vs QReg). Panels: (a) 2d; (b) 3d; (c) 4d; (d) 5d.]
Workload-centric Perspective: Advanced QReg

For the majority of the cases, Simple QReg is shown to outperform simpler RMs and occasionally more complex ensemble models. For the remainder we concentrate on Advanced QReg, constructed using GBoost and XGBoost.

Delving deeper, we now show the NRMSE ratios between GBoost (or XGBoost) and QReg, broken down into sub-collections of points in the data space: specifically, the sub-collection of points where XGBoost regression (or GBoost regression) has the best prediction accuracy.

Fig. 8 shows bands of two bars each. Each band shows the NRMSE ratio between another RM and the best RM for the given collection of points. Take the 4-d data set 1 as an example, for the collection of points where XGBoost has the best prediction accuracy. The NRMSE ratio between GBoost (second-best RM) and XGBoost (collection-best RM) is 1.3288 (orange bar in the figure), while the corresponding NRMSE ratio between QReg and XGBoost is 1.0780 (green bar in the figure). This shows that, for this collection of points, GBoost suffers a 32.88% error relative to the optimal, while using QReg reduces this to 7.80%.

As another example, consider the collection of points where XGBoost has the best prediction accuracy in the 5-d data set 8. The NRMSE ratio between GBoost (second-best RM) and XGBoost is 1.1458, while the NRMSE ratio between QReg and XGBoost is 1.0316. Thus, the relative opportunity loss is 0.1458/0.0316 = 4.61, which means that the error caused by using GBoost (relative to the best model) is 4.61 times the error caused by QReg for this collection of points. The relative opportunity loss is much larger for data sets 4 and 5.

[Figure 8: Workload-centric collection-level NRMSE ratio. Panels: (a) 4-d GBoost win collection; (b) 4-d XGBoost win collection; (c) 5-d GBoost win collection; (d) 5-d XGBoost win collection.]
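The reported ROL values follow directly from the NRMSE ratios via eqs. (5)-(6); spelling out the step for the 5-d data set 8 example above:

```latex
% With RM_opt = XGBoost and r_i = NRMSE_i / NRMSE_opt, eq. (5) gives
% OL_i = r_i - 1, so:
\mathrm{ROL}_{\mathrm{GBoost},\,\mathrm{QReg}}
  = \frac{OL_{\mathrm{GBoost}}}{OL_{\mathrm{QReg}}}
  = \frac{r_{\mathrm{GBoost}} - 1}{r_{\mathrm{QReg}} - 1}
  = \frac{1.1458 - 1}{1.0316 - 1}
  = \frac{0.1458}{0.0316} \approx 4.61
```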
Fig. 9 shows the ratio r of NRMSE between the base (ensemble) methods and Advanced QReg over the whole data sets. The improvement in 2-d space is typically small, but from d = 4 we start seeing larger improvements brought about by QReg. For the 2-d case, the accuracies of the GBoost and XGBoost regressors are high and almost equal, which explains why QReg cannot improve things further. For the 3-, 4-, and 5-d cases, almost all ratios are above the horizontal line r = 1. This shows that the models do very well for most data sets. However, there are some cases where QReg is significantly better than the other ensemble methods, for example against GBoost for the 5-d data set 4 and against XGBoost for the 4-d data sets 5 and 8. We see that QReg can improve accuracy across data sets and dimensionalities.

[Figure 9: r between QReg and base ensemble models (GBoost vs QReg, XGBoost vs QReg). Panels: (a) 2-d; (b) 3-d; (c) 4-d; (d) 5-d.]

Comparing Fig. 9 with Fig. 8, we see that even though the overall NRMSE of the various RMs is similar, different RMs give different accuracy in different subspaces of the data. Interestingly, this figure also shows that for different data sets different ensemble methods win (as we have seen previously), showcasing the need for a method like QReg.

Query-centric Perspective: Advanced QReg

As discussed above, we calculated the NRMSE for the full collection of points where a single ensemble model wins. To zoom into the context of query-centric prediction serving, we now focus only on the top 20% of queries with the least error, as done previously. To summarize relative performance, the relative opportunity loss between the RMs and QReg is shown in Table 6.

Table 6: ROL w.r.t. QReg when different ensemble RMs win, for their top 20% queries.

  Data set ID   where GBoost wins   where XGBoost wins
  1                          3.00                 2.39
  2                          2.77                 2.68
  3                         75.50                 2.00
  4                          0.99                >1000
  5                          2.77                 1.87
  6                          5.67                 1.77
  7                          7.13                 3.44
  8                          2.64                 5.29

The values shown are the ROL of using the second-best RM (XGBoost when GBoost wins; GBoost when XGBoost wins) vs QReg. For example, cell [1,2] = 3.00 says that, for data set 1, if XGBoost was used (instead of the optimal, in this case GBoost), it would result in an error 3 times higher than if QReg was used. In other words, previous results have shown that, regardless of which RM is chosen, that RM will be suboptimal for certain queries; these ROL values show how QReg can minimize this cost when being suboptimal.

Similar to Fig. 8, Fig. 10 shows the NRMSE ratio for the 4-d and 5-d win collections, but from the query-centric (top 20% of queries) perspective. Focus on the collection of points where XGBoost has the best prediction accuracy in the 5-d data set 8 as an example. The NRMSE ratio between GBoost (second-best RM) and XGBoost is 1.3842, while the NRMSE ratio between QReg and XGBoost is 1.0872. Thus, the relative opportunity loss is 0.3842/0.0872 = 4.4, which means that the error caused by using GBoost is 4.4 times the error caused by QReg for this collection of points.

[Figure 10: Query-centric collection-level NRMSE ratio. Panels: (a) 4-d GBoost win collection; (b) 4-d XGBoost win collection; (c) 5-d GBoost win collection; (d) 5-d XGBoost win collection.]

It is noticeable that the NRMSE ratio for QReg is always lower than that of the second-best model, and is very close to 1. Thus, for the most vulnerable queried spaces (where a single ensemble model wins by far), QReg can near-optimally achieve the same accuracy, reconciling the otherwise unavoidable loss.

QReg Training Time

Fig. 11 shows the results of our study focusing on the scalability of QReg, i.e., the performance overheads that must be paid for QReg's accuracy improvements. There exists an approximately linear relationship between the model training time and the number of training points. Even for a relatively high number of training points (e.g., hundreds of thousands), the training time for QReg is shown to be a few dozen seconds. Although this is an order of magnitude worse than XGBoost in absolute value, it is acceptable for medium-sized data sets. Also, about 90% of the training time is spent on getting predictions from the individual base models. In the current version of the code, predictions are received sequentially from the base models; doing this in parallel would reduce the total training time.

[Figure 11: Comparison of model training time.]
6 QREG SCALABILITY

As discussed in §5, the total training time of QReg increases approximately linearly as the data size increases. This limits its application to very large data sets. An approach for addressing this issue is to build samples from the data and train QReg on the samples. We study the implications of this approach on QReg's performance, and also observe whether our hypotheses hold in this case as well.

6.1 Sample Size Planning

One major question is how big the sample size should be. A smaller sample requires less training time, but might lead to poor accuracy. Depending on the task, various strategies can be used to determine the sample size. For general purposes, Cochran's formula [13] is usually used to determine the sample size for a population:

    n_0 = \frac{z^2 p (1 - p)}{e^2}    (7)

where n_0 is the sample size, z is the critical value of the desired confidence level, p is the degree of variability, and e is the desired level of precision. For instance, suppose we need to determine the sample size for a large population whose degree of variability is unknown; p = 0.5 indicates maximum variability and produces a conservative sample size. Assuming we need a 95% confidence interval with 1% precision, the corresponding sample size is n_0 = 9604. For data sets of finite size, the sample size is slightly smaller than the value obtained from eq. (7).

For regression-specific tasks, sample size planning techniques include power analysis (PA) [14], accuracy in parameter estimation (AIPE) [28], etc. The sample sizes obtained from the two methods differ, and their magnitude is usually in the hundreds or thousands. [27] proposes a method to combine these methods with a specified probability, while [34] recommends that the largest sample size be used.

For classification-specific tasks, [19] finds that many prediction problems do not require a large training set for classifier training. [7] uses learning curves to analyze classification performance as a function of the training sample size, and concludes that 5-25 independent samples per class are enough to train classification models with acceptable performance; also, 75-100 samples are adequate for testing purposes.

In this study, the sample size varies over 10k, 100k, and 1m, which is conservative compared to the values obtained by the PA and AIPE methods for regression tasks, or the sizes for classification tasks.
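A one-line computation (ours, for illustration) reproduces the n_0 = 9604 figure quoted above from eq. (7):

```python
# Cochran's formula (7): z = 1.96 (95% confidence), p = 0.5 (maximum
# variability), e = 0.01 (1% precision) give n_0 = 9604, as in the text.
def cochran_sample_size(z=1.96, p=0.5, e=0.01):
    return z * z * p * (1.0 - p) / (e * e)

print(cochran_sample_size())  # -> ~9604
```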
6.2 Workload-centric Perspective

We show results for data sets 6, 7, 8 and the table store_sales from the TPC-DS data set. Data sets 6, 7, 8 contain 2-4 million records, and the table store_sales is scaled up to 2.6 billion records. We use reservoir sampling to generate uniform random samples from these data sets (a sketch is given at the end of this subsection). Experiments are done using Advanced QReg.

Table 7: Win counts of ensemble RMs (samples of size 100k).

  Data set ID   Count of GBoost   Count of XGBoost
  6                       16209              17124
  7                       15854              17479
  8                       13757              19576
  store_sales             13415              17003

Table 7 shows the occurrences of best predictions (wins) made by each model, for the samples of size 100k. Similarly to Table 2 in §4, each base model is shown to win for a substantial percentage of the queries (or, equivalently, for a considerable part of the data set). This supports Hypothesis 1: there is no single regression model capable of dealing with all data sets, and each regression model is only good on sub-spaces of the data sets.

Similar to §5, this section focuses on the workload-centric evaluation, but for sample-based QReg. We show the NRMSE ratio r between XGBoost (or GBoost) and QReg, broken down into sub-collections of points in the data space: specifically, the sub-collection of points where GBoost (or XGBoost) regression has the best accuracy.

Consider the collection of points where XGBoost has the best prediction accuracy in the 5-d data set 7. The NRMSE ratio r between GBoost regression (second-best RM) and XGBoost regression (best RM) is 1.2107, while the NRMSE ratio between QReg and XGBoost regression is 1.0499. Thus, the corresponding ROL between GBoost and QReg is 0.2107/0.0499 = 4.22, which means that, for this collection of points, GBoost induces 4.22 times higher error than QReg.

[Figure 12: Workload-centric collection-level NRMSE ratio (sample-based). Panels: (a) 4-d GBoost win collection; (b) 4-d XGBoost win collection; (c) 5-d GBoost win collection; (d) 5-d XGBoost win collection.]

The same conclusion holds for the query-centric perspective; it is omitted for space reasons.
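As referenced at the start of this subsection, uniform samples can be drawn in one pass with reservoir sampling; the following is a minimal sketch (ours) of the classic Algorithm R variant. Whether the paper uses exactly this variant is our assumption.

```python
# Minimal sketch (ours) of one-pass reservoir sampling (Algorithm R): draws
# a uniform random sample of k rows from a stream whose total size need not
# be known in advance.
import random

def reservoir_sample(stream, k, seed=None):
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(stream):
        if i < k:
            reservoir.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)          # row i survives with prob. k/(i+1)
            if j < k:
                reservoir[j] = row
    return reservoir
```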
6.3 Model Training Time

The training time of sample-based QReg consists of two parts: (a) the sampling time, to generate the samples from the base tables; and (b) the training time, to train QReg over the samples. Fig. 13 shows the training time of QReg for the 100m-row store_sales table, with sample sizes varying over {10k, 100k, 1m}. It takes ca. 68-72s to generate the samples. For 10k (100k, 1m) samples, it takes less than 3s (22s, 150s) to train QReg. With 100k samples, QReg performs excellently. So, in conclusion, sample-based QReg is scalable, and both hypotheses hold even when the models are trained from samples.

[Figure 13: Sample Size vs Training Time for store_sales.]

6.4 Application to AQP Engines

The previous experiments demonstrate the strengths of QReg. In this section, QReg is applied to DBEst, a new model-based approximate query processing (AQP) engine [33]. DBEst adopts classical machine learning models (regressors and density estimators) to provide approximate answers to SQL queries. We replace the default regression model in DBEst (XGBoost) with Advanced QReg, and compare the accuracy against DBEst using other ensemble methods, including XGBoost and GBoost. The well-known TPC-DS data set is scaled up with a scaling factor of 1000, which yields ~2.6 billion tuples (1TB). 96 synthetic SQL queries covering 13 column pairs are randomly generated for the SUM and AVG aggregate functions. The DBEst sample size is set to 100k.

Fig. 14 shows the relative error achieved by DBEst using the various regression models. For SUM, the relative errors using XGBoost and GBoost are 8.35% and 8.10%, respectively. However, if Advanced QReg is used, the relative error drops to 7.77%. Although Advanced QReg is built upon XGBoost and GBoost, the relative error of DBEst using Advanced QReg is better than that of DBEst using XGBoost or GBoost alone. For further comparison, if linear regression is used in DBEst, the relative error becomes 21.20%, which is much higher than DBEst using Advanced QReg. Thus, a query-centric regressor like QReg improves prediction accuracy and is very important for in-DBMS analytics.

[Figure 14: Application of QReg to DBEst. Relative error (%) for SUM, AVG, and OVERALL, for DBEst with XGBoost, GBoost, and Advanced QReg.]

7 MAJOR LESSONS LEARNED

The key lessons learned by this study are:

• Different RMs perform better for different data sets and, more interestingly, for different data subspaces within them. This holds for simpler models and, perhaps surprisingly, for advanced ensemble RMs, which are designed to generalize better.

• Each examined RM is best-performing (a winner) for a significant percentage of all queries. Necessarily, this implies that, for a significant percentage of queries, regardless of which (simple or ensemble) RM is chosen by a DB user/analyst, a suboptimal RM will be used.

• When said suboptimal RMs are used, significant additional errors emerge for a large percentage of queries.

• Best practice, which suggests using a top-performing ensemble, is misleading and leads to significant errors for large numbers of queries. In several cases, despite the fact that different RMs had a very similar overall error (NRMSE), a significant fraction of queries faced very large differences in error when using these seemingly-similarly-performing RMs. Thus, neither sophisticated nor simpler RMs can cope well in query-sensitive scenarios, where query distributions may target specific data subspaces.

• A query-centric perspective, as manifested by QReg, can offer higher accuracy across data sets and dimensionalities. This applies to overall NRMSEs; more importantly, it applies to query-centric evaluations. The study revealed that when QReg is used, there are significant accuracy gains compared to using any other non-optimal RM (which, as mentioned, is otherwise unavoidable).

• Accuracy improvements are achieved with small overheads, even for very large data sizes, using sampling.
8 CONCLUSIONS

The paper studied issues pertaining to the seamless integration of DBMSs and regression models. The analysis revealed the complexity of the problem of choosing an appropriate regression model: different models, despite having overall very similar accuracy, are shown to offer largely-varying accuracy for different data sets and for different subsets of the same data set. Given this, the analysis sheds light on solutions to the problem. It showed and studied in detail the performance of QReg, which can achieve both high accuracy over the whole data set and near-optimal accuracy per query targeting specific data subsets. The analysis also showed the impact of key decisions en route to QReg, such as selecting different constituent base regression models. In addition, it studied issues pertaining to scalability, showing that even with large data sets the same issues hold and the same model solution can be used to achieve per-query and overall high accuracy. In general, the proposed QReg offers a promising approach for taming the generalization-overfitting dilemma when employing ML models within DBMSs.

ACKNOWLEDGMENTS

This work was supported in part by the 'Tools, Practices and Systems' theme of the UKRI Strategic Priorities Fund (EPSRC Grant EP/T001569/1) and The Alan Turing Institute (EPSRC grant EP/N510129/1).

REFERENCES

[1] 2005. Database PL/SQL Packages and Types Reference. https://docs.oracle.com/cd/B28359_01/appdev.111/b28419/u_nla.htm#CIABEFIJ
[2] Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon, and Bohan Zhang. 2017. Automatic Database Management System Tuning Through Large-scale Machine Learning. In Proceedings of ACM SIGMOD.
[3] C. Anagnostopoulos and P. Triantafillou. 2015. Learning Set Cardinality in Distance Nearest Neighbours. In Proceedings of the IEEE International Conference on Data Mining (ICDM'15).
[4] C. Anagnostopoulos and P. Triantafillou. 2015. Learning to Accurately COUNT with Query-Driven Predictive Analytics. In Proceedings of the IEEE International Conference on Big Data.
[5] C. Anagnostopoulos and P. Triantafillou. 2017. Query-Driven Learning for Predictive Analytics of Data Subspace Cardinality. ACM Transactions on Knowledge Discovery from Data (TKDD) (2017).
[6] Anon. 2018. XLeratorDB. http://westclintech.com/
[7] Claudia Beleites, Ute Neugebauer, Thomas Bocklitz, Christoph Krafft, and Jürgen Popp. 2013. Sample size planning for classification models. Analytica Chimica Acta 760 (2013), 25–33.
[8] Christopher M Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
[9] Leo Breiman. 1996. Bagging predictors. Machine Learning 24, 2 (1996), 123–140.
[10] Zhuhua Cai, Zekai J Gao, Shangyu Luo, Luis L Perez, Zografoula Vagena, and Christopher Jermaine. 2014. A comparison of platforms for implementing and running very large scale machine learning algorithms. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 1371–1382.
[11] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.
[12] Tianqi Chen, Sameer Singh, Ben Taskar, and Carlos Guestrin. 2015. Efficient second-order gradient boosting for conditional random fields. In Artificial Intelligence and Statistics. 147–155.
[13] William G Cochran. 2007. Sampling Techniques. John Wiley & Sons.
[14] Jacob Cohen. 2013. Statistical Power Analysis for the Behavioral Sciences. Routledge.
[15] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. 2009. MAD skills: new analysis practices for big data. Proc. VLDB Endow. (2009).
[16] Dan Crankshaw, Peter Bailis, Joseph Gonzalez, Haoyuan Li, Zhao Zhang, Michael Franklin, Ali Ghodsi, and Michael Jordan. 2015. The missing piece in complex analytics: Low latency, scalable model management and serving with Velox. In Conference on Innovative Data Systems Research (CIDR).
[17] Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J Franklin, Joseph E Gonzalez, and Ion Stoica. 2016. Clipper: A Low-Latency Online Prediction Serving System. arXiv preprint arXiv:1612.03079 (2016).
[18] A. Deshpande and S. Madden. 2006. MauveDB: supporting model-based user views in database systems. In ACM SIGMOD.
[19] Kevin K Dobbin and Richard M Simon. 2006. Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics 8, 1 (2006), 101–117.
[20] Yoav Freund and Robert E Schapire. 1995. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory. Springer, 23–37.
[21] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics (2001), 1189–1232.
[22] Jerome H Friedman. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis 38, 4 (2002), 367–378.
[23] L. Gerhardt, C. H. Faham, and Y. Yao. 2018. SciDB-Py. http://scidb-py.readthedocs.io/en/stable/
[24] Joseph M Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, et al. 2012. The MADlib analytics library: or MAD skills, the SQL. Proceedings of the VLDB Endowment 5, 12 (2012), 1700–1711.
[25] Botong Huang, Matthias Boehm, Yuanyuan Tian, Berthold Reinwald, Shirish Tatikonda, and Frederick R Reiss. 2015. Resource elasticity for large-scale machine learning. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 137–152.
[26] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural Computation 3, 1 (1991), 79–87.
[27] Michael R Jiroutek, Keith E Muller, Lawrence L Kupper, and Paul W Stewart. 2003. A new method for choosing sample size for confidence interval-based inferences. Biometrics 59, 3 (2003), 580–590.
[28] Ken Kelley and Scott E Maxwell. 2003. Sample size for multiple regression: obtaining regression coefficients that are accurate, not simply significant. Psychological Methods 8, 3 (2003), 305.
[29] Arun Kumar, Robert McCann, Jeffrey Naughton, and Jignesh M. Patel. [n. d.]. Model Selection Management Systems: The Next Frontier of Advanced Analytics. SIGMOD Rec. 44 ([n. d.]).
[30] Xuan Liang, Tao Zou, Bin Guo, Shuo Li, Haozhe Zhang, Shuyi Zhang, Hui Huang, and Song Xi Chen. 2015. Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating. In Proc. R. Soc. A, Vol. 471. The Royal Society, 20150257.
[31] M. Lichman. 2013. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[32] Lin Ma, Dana Van Aken, Ahmed Hefny, Gustavo Mezerhane, Andrew Pavlo, and Geoffrey J Gordon. 2018. Query-based workload forecasting for self-driving database management systems. In Proceedings of the 2018 International Conference on Management of Data. ACM, 631–645.
[33] Qingzhi Ma and Peter Triantafillou. 2019. DBEst: Revisiting approximate query processing engines with machine learning models. In Proceedings of the 2019 International Conference on Management of Data. ACM, 1553–1570.
[34] Scott E Maxwell, Ken Kelley, and Joseph R Rausch. 2008. Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology 59 (2008), 537–563.
[35] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research 17, 1 (2016), 1235–1241.
[36] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research 17, 34 (2016), 1–7.
[37] Hannes Muehleisen, Anthony Damico, and Thomas Lumley. 2018. MonetDB.R. http://monetr.r-forge.r-project.org/
[38] Raghunath Othayoth Nambiar and Meikel Poess. 2006. The making of TPC-DS. In Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 1049–1058.
[39] Carlos Ordonez. 2010. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering 22, 12 (2010), 1752–1765.
[40] Carlos Ordonez, Carlos Garcia-Alvarado, and Veerabhadaran Baladandayuthapani. 2014. Bayesian variable selection in linear regression in one pass for large datasets. ACM Transactions on Knowledge Discovery from Data (TKDD) 9, 1 (2014), 3.
[41] Y. Park, A. S. Tajik, M. Cafarella, and B. Mozafari. 2017. Database Learning: Toward a Database that Becomes Smarter Every Time. In Proceedings of ACM SIGMOD.
[42] Christopher Ré, Divy Agrawal, Magdalena Balazinska, Michael Cafarella, Michael Jordan, Tim Kraska, and Raghu Ramakrishnan. 2015. Machine learning and databases: The sound of things to come or a cacophony of hype? In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 283–284.
[43] M. Schleich, D. Olteanu, and R. Ciucanu. 2016. Learning Linear Regression Models over Factorized Joins. In ACM SIGMOD.
[44] Riyaz Sikora et al. 2015. A modified stacking ensemble machine learning algorithm using genetic algorithms. In Handbook of Research on Organizational Transformations through Big Data Analytics. IGI Global, 43–53.
[45] T. Condie, P. Mineiro, N. Polyzotis, and M. Weimer. 2013. Machine learning for big data. In ACM SIGMOD. 939–942.
[46] A. Thiagarajan and S. Madden. 2008. Querying continuous functions in a database system. In ACM SIGMOD.
[47] D. S. Tkach. 1998. Information Mining with the IBM Intelligent Miner. IBM White Paper.
[48] Daniele Varrazzo. 2014. Psycopg. http://initd.org/psycopg/
[49] Rashmi Korlakai Vinayak and Ran Gilad-Bachrach. 2015. DART: Dropouts meet multiple additive regression trees. In Artificial Intelligence and Statistics. 489–497.
[50] David H Wolpert. 1992. Stacked generalization. Neural Networks 5, 2 (1992), 241–259.
[51] D. Wong. 2013. Oracle Data Miner. Oracle White Paper.