Query-Centric Regression for In-DBMS Analytics

Qingzhi Ma (University of Warwick, Q.Ma.2@warwick.ac.uk)
Peter Triantafillou (University of Warwick, P.Triantafillou@warwick.ac.uk)

ABSTRACT

Research in enriching DBs with Machine Learning (ML) models is receiving increasing attention. This paper experimentally analyzes the problem of empowering data systems (and their users) with regression models (RMs). The paper offers a data system's perspective, which unveils an interesting 'impedance mismatch' problem: ML models aim to offer a high expected overall prediction accuracy, which essentially assumes that queries will target data using the same distributions as the data on which the models were trained. However, in data management it is widely recognized that query distributions do not necessarily follow data distributions. Queries using selection operators target specific data subspaces on which even an overall highly-accurate model may be weak. If such queried subspaces are popular, large numbers of queries will suffer. The paper will reveal, shed light on, and quantify this 'impedance mismatch' phenomenon. It will study in detail 8 real-life data sets and data from TPC-DS, and experiment with various dimensionalities therein. It will employ new, appropriate metrics, substantiating the problem across a wide variety of popular RMs, ranging from simple linear models to advanced, state-of-the-art ensembles (which enjoy excellent generalization performance). It will put forth and study a new, query-centric model that addresses this problem, improving per-query accuracy while also offering excellent overall accuracy. Finally, it will study the effects of scale on the problem and its solutions.

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

A new, dominant trend has emerged for next-generation data management and analytics systems, based on integrating ML models and data management platforms [10, 15, 24, 25, 35, 36, 42, 45]. Additional efforts pertain to connectors to back-end databases, which allow for statistical analyses and related queries on DB data, like MonetDB.R [37], SciDB-Py [23], and Psycopg [48]. Another class of efforts concerns learning from past answers to predict the answers to future analytical queries, e.g., for approximate query processing engines, which provide approximate answers to aggregate queries using ML techniques [3-5, 41], for tuning database systems [2], and for forecasting workloads [32]. Yet another class of efforts concerns model and query-prediction serving, like the Velox/Clipper systems [16, 17], which manage ML models for predictive analytics. Finally, vision papers suggest the move towards model selection management systems [29], where a primary task is model selection, whereby the system is able to select the best model to use for the task at hand.

In this realm, regression models, being a principal means for predictive analytics, are of particular interest to both analysts and data analytics platforms. RMs are playing an increasingly important role within data systems. Examples of their extended use and significance include many modern DBs that provide support for regression, such as XLeratorDB [6] for Microsoft SQL Server, Oracle UTL_NLA [1, 51], and IBM Intelligent Miner [47], which provide SQL interfaces for analysts to specify regression tasks. Academic efforts include MADLib (over PostgreSQL) [24], MAD [15], and MauveDB [18], which integrates regression models into an RDBMS. [39, 40] use User-Defined Functions (UDFs) to compute statistical machine learning models and data summarization. DBEst [33], an approximate query processing engine, uses ML models (regressions and density estimators) to provide answers to popular queries. Similarly, FunctionDB [46] builds regression models so that future queries can be evaluated using the models. Furthermore, [43] integrates and supports least squares regression models over training data sets defined by join queries on database tables. Thus, in general, RMs are useful for query processing and data analytics tasks. In addition, RMs are helpful for many other key tasks: imputing missing values, testing hypotheses, data generation, fast visualization, etc.
Motivations

Given the above increasing interest in bridging ML and RMs with DBs, we focus on how seamless this process can be. ML models (and RMs in particular) are trained to optimize a loss function (invariably concerning overall expected error). We refer to this as a workload-centric view, as the aim is to minimize the expected error among all queries in an expected workload. In essence, this assumes that query distributions (i.e., the expected workload) are similar to that of the training data. In contrast, data systems research has long recognized that query workloads follow patterns generally different from data distributions. Hence, queries on data subspaces (e.g., using range or equality selection predicates) where an ML model is weak will suffer from high errors. And, if such queries are popular, many queries will suffer. This gives rise to the need for a "query-centric perspective": we define query-centric regression as a model which strives to ensure both high average accuracy (across all queries in a workload) and high per-query accuracy, whereby each query is ensured to enjoy accuracy close to that of the best possible model.

The ML community's general answer to such problems is to turn to ensemble methods, in order to lower variance and generalize better (e.g., to different distributions). We wish to shed light on this possible impedance mismatch problem and see if it holds for simpler and even for state-of-the-art ensemble RMs. We further wish to (i) quantify the phenomenon: we shall use several real data sets (from the UCI ML repository and TPC-DS data), a wide variety of popular RMs, and new metrics that reveal workload-centric and query-centric performance differences; and (ii) see if the problem can be addressed by adopting a query-centric perspective (using a new ensemble method), whereby the error observed by each query is as low as possible (which also indirectly ensures high workload-centric performance).

The above bear strong practical consequences. Consider a data analyst using Python or R, linked with an ML library (like Apache Spark MLlib, Scikit-Learn, etc.), or using a DB connector like MonetDB.R or SciDB-Py, or a prediction serving system like Clipper, and the following use cases.

Scenario 1: The analyst uses a predicate to define a data subspace of interest and calls a specific RM method: it would be great if she knew which RMs to use for which data subspaces.

Scenario 2: Alternatively, the system could select the best RM automatically for the analyst's query at hand.

With this paper we wish to inform the community of this DB-RM impedance mismatch problem, study and quantify it, and lay the foundation for the seamless use of RMs for in-DBMS analytics, offering this query-centric perspective.
2 BACKGROUND

Our study employs a set of representative and popular RMs, grouped into two categories: simple and ensemble RMs.

2.1 Simple Regression Models

Simple RMs include linear regression (LR), polynomial regression (PR), decision tree regression (DTR), SVM regression (SVR), nearest neighbours regression (NNR), etc. An introduction to these simple regression models is omitted for space reasons. Table 1 summarizes the known asymptotic time complexity of training for key regression models; more detailed comparisons are made and discussed in §3.5.

Table 1: Complexity of typical regression models

  LR:               O(d^2 n)
  PR:               O(d^4 n)
  DTR:              between O(d n log n) and O(n^2 d)
  NNR:              O(n(d + k)) or O(ndk)
  SVR:              O(v n^2)
  Gaussian Process: O(n^3)

Note: d is the dimension of the data points, n is the number of points in the training data, k is the number of neighbors for kNN regression, and v is the number of support vectors for SVM regression.

2.2 Ensemble Methods

Ensemble methods are powerful methods that combine the predictions (and advantages) of base models. It is often observed that prediction accuracy is improved by combining the prediction results in some way, e.g., using a weighted average of the predictions from the various base models [8]. Ensemble learning is also useful for scaling up data mining and model prediction [44]. There are many well-developed ensemble methods, including averaging-based ensemble methods, bootstrap aggregating (bagging) [9], boosting [20, 21], stacking [50], mixtures of experts [26], etc.

Averaging-based ensemble methods calculate the weighted average of the predictions from all models. This incurs higher computational costs and higher response times.

Boosting refers to a family of algorithms that can potentially convert "weak models" into "strong models". AdaBoost [20], short for "adaptive boosting", is a popular boosting algorithm. Unlike bootstrap aggregating, whose models are trained in parallel, the prediction models in AdaBoost are trained in sequence. AdaBoost was first proposed to solve classification problems and was later applied to regression problems. Randomization may be incorporated into boosting to reduce its response time [22].

The objective of gradient boosting is to minimize the loss function

    L(y_i, f(x_i)) = MSE = \sum_i (y_i - f(x_i))^2    (1)

and the predictions are updated in the direction of gradient descent (i.e., against the gradient):

    f(x_i)_{r+1} = f(x_i)_r - \alpha \cdot \frac{\partial L(y_i, f(x_i)_r)}{\partial f(x_i)_r}    (2)

where r is the iteration number and \alpha the step size. Gradient boosting (GBoost) usually uses only first-order information; Chen et al. incorporate second-order information into gradient boosting for conditional random fields and improve its accuracy [12]. However, the base models are usually limited to classification and regression trees (CART); other regression models are not supported.

XGBoost [11] is a state-of-the-art boosting method and is widely used in competitions due to its fast training time and high accuracy. The objective of XGBoost is

    obj(\Theta) = L(\Theta) + \Omega(\Theta)    (3)

where L(\Theta) is the loss function, which controls how close predictions are to the targets, and \Omega(\Theta) is the regularization term, which controls the complexity of the model. Over-fitting is avoided if a proper \Omega(\Theta) is selected. The base models (boosters) can be gbtree, gblinear or dart [49]; gbtree and dart are tree models, while gblinear is linear.
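For concreteness, the following is a minimal sketch (ours, not from the paper) of how objective (3) maps onto the XGBoost Python API: the loss L(\Theta) is chosen via `objective`, the regularization \Omega(\Theta) via knobs such as `reg_lambda` and `reg_alpha`, and the base models via `booster`. The synthetic data and parameter values are purely illustrative.

```python
# Minimal, illustrative sketch (not the paper's code) of objective (3) in the
# XGBoost Python API: obj = L (via 'objective') + Omega (via reg_* knobs).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 3))                       # toy training data
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

model = xgb.XGBRegressor(
    booster="gbtree",             # base models: 'gbtree', 'gblinear' or 'dart'
    objective="reg:squarederror", # the loss term L(Theta)
    reg_lambda=1.0,               # L2 weight in the regularization Omega(Theta)
    reg_alpha=0.0,                # L1 weight in Omega(Theta)
    n_estimators=100,
    learning_rate=0.1,            # shrinkage, loosely the step size in (2)
)
model.fit(X, y)
```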
3 EXPERIMENTAL SETUP

All experiments run on an Ubuntu system with an Intel Core i5-7500 CPU @ 3.40GHz (4 cores) and 32GB of memory.

3.1 Hypotheses

The study rests on testing, validating, and quantifying two key hypotheses:

Hypothesis 1. Different RMs exhibit higher accuracy for different regions of the queried data spaces, and likewise for different data sets. Such differences can be large and may occur even though said RMs may enjoy similar overall accuracy.

The corollary of Hypothesis 1 is that, even if our analysts in Scenario 1 knew of highly-accurate RMs, many of their analytical queries would be susceptible to large errors. Hypothesis 1 aims to test whether the loss function used by even top-performing RMs, which minimizes overall expected error, 'hides' this issue. En route, the analysis will quantify this problem across many different RMs, data sets, and dimensionalities. Intuitively, our aim with testing Hypothesis 1 is to show that, regardless of which single model is used, some queries will always be processed by sub-optimal models; we wish to quantify this opportunity loss.

Hypothesis 2. Given Hypothesis 1, a model equipped with knowledge of the accuracy distribution of RMs in the query space can near-optimally serve each query.

Such a model, coined QReg (a classifier-based ensemble method that bears a query-centric perspective), will be studied here to validate Hypothesis 2. Hypothesis 2 aims to show that integrating an RM within a DB can be done in a query-centric manner, avoiding the aforementioned problems, and thus offering a solution for Scenario 2. Our aim with testing Hypothesis 2 is to show that a new ensemble model can significantly alleviate this problem; the ROL_{i,j} metric (§3.3) will help quantify how much such a model (QReg) improves on this opportunity loss for queries.

3.2 Data Sets and Dimensionality

To test the hypotheses, eight real-world data sets with different characteristics from the UCI machine learning repository [31] are used, varying the dimensionality from 2 to 5, as well as a large fact table from the TPC-DS benchmark [38].

Data set 1 is a collection of YouTube videos, with input and output video characteristics along with the transcoding time and memory requirements. Data set 2 contains physicochemical properties of protein tertiary structure; the task includes predicting the size of the residue (RMSD) based on nine properties; there are 45730 decoys, and the size varies from 0 to 21 angstroms. Data set 3 is an hourly data set containing PM2.5 gas concentration data in Beijing; the task is to predict the PM2.5 concentration (μg/m^3), and the independent variables include pressure (PRES), cumulated wind speed (Iws), etc. Data set 4 is an online news popularity data set; tasks include predicting the number of shares in social networks (popularity); there are 39797 records in total. Data set 5 contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011); the task is to predict the net hourly electrical energy output (EP) of the plant. Data set 6 is the YearPredictionMSD data set, used to predict the release year of a song from audio features; most of the songs are commercial tracks from 1922 to 2011.
Data set 7 contains the recordings of 16 chemical sensors exposed to two dynamic gas mixtures at varying concentrations; the goal is to predict the recording of one specific chemical sensor based on the other sensors and the gas mixtures. This is a time-series data set containing more than 4 million records in total. Data set 8 records the individual household electric power consumption of one household for more than four years; there are two million records.

We further employ the table store_sales from the popular TPC-DS benchmark [38]. Typical columns used for the experiments include ss_wholesale_cost, ss_list_price, ss_sales_price, ss_ext_sales_price and ss_ext_wholesale_cost.

3.3 Evaluation Metrics

Accuracy is measured using the Normalized Root Mean Square Error (NRMSE) metric, defined as:

    NRMSE = \frac{\sqrt{\frac{1}{n}\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}}{y_{max} - y_{min}}    (4)

NRMSE shows the overall deviation between predicted and measured values; it is built upon the root mean square error (RMSE) and is scaled to the range of the measured values. It provides a universal measure of prediction accuracy across different regression models.

The NRMSE ratio r compares the prediction accuracy of one RM_i against that of any other RM_j, and is defined as r = NRMSE_i / NRMSE_j. If NRMSE_j <= NRMSE_i, this ratio shows how much worse RM_i is compared to RM_j.

The above are standard metrics used for comparing accuracy. However, our study calls for additional metrics. Inherent in our study is the need to reflect the differences in accuracy observed by a query, as they depend on the model used. For this we define the concept of opportunity loss as a natural way to reflect how much a query loses in accuracy by using a sub-optimal model. Assuming RM_opt is the RM with the lowest NRMSE, we define the opportunity loss OL_i as

    OL_i = \frac{NRMSE_i}{NRMSE_{opt}} - 1,    (5)

which quantifies, as a percentage, the error (the opportunity loss) due to not using the best model RM_opt and using RM_i instead. Furthermore, we define the relative opportunity loss

    ROL_{i,j} = \frac{OL_i}{OL_j},    (6)

which quantifies how much better RM_j does vs RM_i in improving on the opportunity loss.
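These metrics are straightforward to compute; the following is a minimal sketch (ours, not the authors' code) of NRMSE (4), the NRMSE ratio r, OL (5), and ROL (6).

```python
# Minimal sketch (ours) of the evaluation metrics of this section.
import numpy as np

def nrmse(y_true, y_pred):
    """RMSE scaled to the range of the measured values, eq. (4)."""
    rmse = np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))
    return rmse / (np.max(y_true) - np.min(y_true))

def nrmse_ratio(nrmse_i, nrmse_j):
    """r = NRMSE_i / NRMSE_j: how much worse RM_i is than RM_j (when r > 1)."""
    return nrmse_i / nrmse_j

def opportunity_loss(nrmse_i, nrmse_opt):
    """OL_i = NRMSE_i / NRMSE_opt - 1, eq. (5); RM_opt has the lowest NRMSE."""
    return nrmse_i / nrmse_opt - 1.0

def relative_opportunity_loss(ol_i, ol_j):
    """ROL_{i,j} = OL_i / OL_j, eq. (6)."""
    return ol_i / ol_j
```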
3.4 Architecture

Assume that the data system maintains m regression models. When a query arrives, the system needs to identify the model with the least prediction error for this query. We treat this model selection problem as a classification problem. Fig. 1 shows the architecture of this classified regression, QReg.^1

[Figure 1: Architecture of the classified prediction method.]

There are two layers in the system: (i) the model maintenance layer deploys and maintains the base regression models; (ii) the query classification layer implements the core of QReg. A query is first passed to the pre-trained classifier. Because the classifier "knows" the prediction ability of each model in the various queried data spaces, the query is assigned to the model that performs best locally for this query's data space. Unlike typical ensemble methods, only one model is invoked per query (hence QReg is less computationally intensive).

Two configurations are studied for QReg: Simple QReg uses LR, PR, and DTR; Advanced QReg uses GBoost and XGBoost as its base models.

^1 The source code for QReg is available at https://github.com/qingzma/qreg

3.5 Candidate Base Models

When deciding which models to include, the following key criteria are considered.

(a.) Model training time: This should be as low as possible and should scale well as the number of training data points increases. Given the asymptotic complexities summarised in §2.1, a number of experiments were conducted to quantify the training times of the various regression models. Fig. 2 is a representative result for data set 4 using 4 dimensions; it shows how the training time of six regression models is impacted as the number of training instances increases.

[Figure 2: Training time of typical regression models.]

Model training time is shown to behave acceptably with respect to the number of instances in the training data set for LR, PR, DTR, and kNN regression. The training time of support vector regression with the radial basis function kernel tends to increase much more aggressively as the number of training points increases. The experiment was repeated for all data sets; the above conclusion holds across all experiments, which are omitted for space reasons.

(b.) Query response time: In the classifier training process, predictions are made by each base prediction model. To reduce the overall training time as well as the query response time, the models should have as low a response time as possible.

(c.) Prediction accuracy: An interesting issue arises from using different regression models together, as in QReg. If the base regression models have large differences in accuracy levels, this may result in QReg having poor accuracy, as a direct result of errors introduced during the separate classification process. Therefore, care must be taken to ensure that the base models enjoy similar and good accuracy levels.

3.6 Model Training Strategy

Fig. 3 shows the model training strategy of QReg. The data set is partitioned into three subsets: (i) Training_Dataset_1 is used to train the base models; (ii) Training_Dataset_2 is used to train the classifier in QReg; (iii) Testing_Dataset is used to evaluate the accuracy of QReg.

[Figure 3: Model training of QReg. The main data set is split into Training Dataset 1, Training Dataset 2, and the Testing Dataset; base models 1..n produce predictions 1..n on Training Dataset 2, which form a new data set used to build the classifier.]

After partitioning the data set, the m base models are trained on Training_Dataset_1. The selection of base models is in principle open and depends on the user's choice (taking into account the above issues). Each base model f_i(x) makes a prediction ŷ_i for each data point x in Training_Dataset_2. A comparison is made between the predicted ŷ_i and the real label y to find the best prediction model for each query x. Given the individual predictions and associated errors, a new data set is generated by combining each data point x with the index i of the best model for this query, depicted [x, i]. This data set is then used to build the classifier reflecting the prediction ability of the base models in the query space. The classifier is the core of QReg: a well-designed classifier can grasp the prediction ability of each model in the query space correctly, and thus the prediction accuracy can be significantly improved compared to the individual prediction models.

Note that the original data set is partitioned into 3 subsets instead of 2. This ensures that different training data sets are used to train the base models and the classifier, respectively, which avoids potential over-fitting problems. In addition, models are fine-tuned via cross-validation using GridSearchCV() in the scikit-learn package.
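To make the strategy concrete, here is a condensed, hypothetical sketch (ours, not the released implementation; see footnote 1 for the actual code) of the train-and-route loop of Fig. 3. The base models and classifier shown are merely one possible configuration.

```python
# Condensed sketch of QReg's strategy: train base RMs on one split, label a
# second split with the index of the per-point best model, fit a classifier
# on those labels, and route each incoming query to exactly one base model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBClassifier

class QRegSketch:
    def __init__(self, base_models=None):
        self.base_models = base_models or [LinearRegression(),
                                           DecisionTreeRegressor(max_depth=8)]
        self.router = XGBClassifier()    # the paper's default classifier

    def fit(self, X1, y1, X2, y2):
        for m in self.base_models:       # (i) base models on Training_Dataset_1
            m.fit(X1, y1)
        errors = np.column_stack(        # (ii) per-point errors on Training_Dataset_2
            [np.abs(m.predict(X2) - y2) for m in self.base_models])
        best = errors.argmin(axis=1)     # index i of the best model -> [x, i]
        self.router.fit(X2, best)        # (iii) classifier learns the win regions
        return self

    def predict(self, X):
        idx = self.router.predict(X)     # only one base model is invoked per query
        preds = np.empty(len(X))
        for i, m in enumerate(self.base_models):
            mask = idx == i
            if mask.any():
                preds[mask] = m.predict(X[mask])
        return preds
```

Swapping the `router` line for an SVM with a linear or RBF kernel is all it takes to reproduce the kind of classifier comparison discussed in §5.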
4 EVALUATING HYPOTHESIS I

Consider data set 3, the Beijing PM2.5 data set [30], using cumulated wind speed (Iws) and pressure (PRES) as the features, yielding a 3-dimensional regression problem.

Fig. 4(a) shows the distribution of the model with the least error for all data points, with LR, PR, and DTR as the base RMs (Simple QReg). Fig. 4(b) shows the distribution of the best models when the ensemble models GBoost and XGBoost are used (Advanced QReg).

[Figure 4: Distribution of best models for Beijing PM2.5. Panels: (a) Simple models (LR, PR, DTR); (b) Ensemble models (GBoost, XGBoost). Axes: Pressure (hPa) vs. cumulated wind speed (m/s).]

Take QReg using simple models as an example. LR dominates in the upper-central region, PR dominates in the lower-central regions, and DTR performs best in the rest of the space. The NRMSEs for LR, PR and DTR are 8.48%, 8.84%, and 8.32%, respectively. However, if for each point the best model can be selected to make the prediction (as shown in Fig. 4), the corresponding ("optimal") NRMSE drops to 7.19%. This is a large improvement in accuracy. Figures like Fig. 4 can help analysts decide which models to use when querying this space. Similar figures exist for all data sets studied in this work and are omitted for space reasons.

Table 2: Win-counts of simple RMs.

  Data set ID   Count of LR   Count of PR   Count of DTR
  1                    6468          6919           5366
  2                    3606          4133           6217
  3                    1077          1580           1472
  4                    3618          4826           4155
  5                     980           885           1307
  6                   62749         51953          57007
  7                    9646          4029           4953
  8                   35702         28800          40311

Table 2 adopts a different perspective. It shows the number of most accurate predictions ("wins") made by each simple RM, per data set. First, note that all models have a good number of wins. Second, no model has the highest number of wins across all data sets; there is no single winner. DTR enjoys the most wins for data sets 2, 5, and 8; PR makes the most accurate predictions for data sets 1, 3, and 4; LR wins for data sets 6 and 7.
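The bookkeeping behind Table 2 (and behind the conditional NRMSEs of Table 3 below) can be derived along the following lines; this is our own sketch, not the authors' code.

```python
# Sketch (ours): count per-query "wins" per model on a test set, and compute
# each model's NRMSE restricted to the subset of queries where a model wins.
import numpy as np

def wins_and_conditional_nrmse(y_true, preds_by_model):
    """preds_by_model: dict name -> prediction array over the same test set."""
    names = list(preds_by_model)
    errs = np.column_stack([np.abs(preds_by_model[n] - y_true) for n in names])
    winner = errs.argmin(axis=1)                 # best model per query
    span = y_true.max() - y_true.min()
    table = {}
    for j, win_name in enumerate(names):         # "where model j wins"
        mask = winner == j
        table[win_name] = {
            "wins": int(mask.sum()),
            "nrmse": {n: np.sqrt(np.mean((preds_by_model[n][mask]
                                          - y_true[mask]) ** 2)) / span
                      for n in names},
        }
    return table
```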
Table 3 zooms in, augmenting Table 2 by showing the NRMSEs when the different simple RMs win. For example, for data set 1, we know from Table 2 that LR wins 6468 times. For these queries, LR's NRMSE was 11.08%, as indicated by Table 3, whereas PR's was enormous and DTR's was 18.28% (see the three values in the "where LR wins" column for data set 1). Similarly, for the 6919 queries where PR won, LR's NRMSE was 11.82%, whereas PR's was 9.35% and DTR's was 11.68% (the "where PR wins" column).

Table 3: NRMSE values when different RMs win.

  Data set ID   Model   where LR wins   where PR wins   where DTR wins   Overall NRMSE
  1             LR       11.08%          11.82%          14.23%           12.32%
  1             PR      >>1000%           9.35%         >>1000%          >>1000%
  1             DTR      18.28%          11.68%          10.60%           14.06%
  2             LR       27.68%          23.78%          28.61%           27.02%
  2             PR       30.96%          20.20%          27.85%           26.72%
  2             DTR      34.24%          26.18%          21.81%           26.79%
  3             LR        8.47%           8.00%           8.99%            8.48%
  3             PR        9.91%           6.60%          10.04%            8.84%
  3             DTR      10.54%           7.99%           6.65%            8.32%
  4             LR        2.46%           4.96%           5.55%            4.62%
  4             PR        3.82%           2.62%           4.49%            3.67%
  4             DTR       4.07%           3.54%           2.51%            3.41%
  5             LR       16.69%          15.97%          19.91%           17.90%
  5             PR       18.75%          13.94%          18.79%           17.56%
  5             DTR      21.07%          16.99%          15.30%           17.72%
  6             LR       11.06%          12.43%          13.39%           12.29%
  6             PR       12.15%          11.47%          12.77%           12.15%
  6             DTR      12.30%          12.04%          12.15%           12.17%
  7             LR       80.48%         113.75%         106.84%           95.85%
  7             PR      210.45%          83.39%         104.22%          165.31%
  7             DTR     118.60%         111.17%          76.58%          107.31%
  8             LR        8.81%          10.02%          10.49%            9.82%
  8             PR       10.36%           8.05%          10.04%            9.66%
  8             DTR      11.13%           9.99%           8.29%            9.80%

Note: each row gives the NRMSE of the row's model (LR, PR, or DTR) for the given data set, computed over the queries where LR, PR, or DTR wins, respectively, and over all queries (Overall).

Consider data set 4. When LR wins, its error is markedly lower (almost half) than that of PR and DTR, unlike their overall NRMSEs, which show LR to be the worst model.

To further facilitate a query-centric perspective, we delve into the performance on the queries for which each RM reached a top-20% performance (a sketch of this selection is given at the end of this section). For data set 1, for example, this includes the best 20% of the 6468 queries for which LR wins, the best 20% of the 6919 queries for which PR wins, and the best 20% of the 5366 queries for which DTR wins. Hence, the NRMSE of interest does not come from all the queries, but from the top 20% of queries for which the least error was achieved by a simple RM. Table 4 shows these results, along with the overall NRMSE of each simple RM over the whole set of queries. Again, note that the overall NRMSEs are quite close. However, individual differences are very large. For data set 1, for instance, the top 20% of queries when LR wins enjoy an NRMSE that is about half of the NRMSE of the others. Interestingly, the same holds for PR and DTR! Similar conclusions hold for the other data sets.

Table 4: NRMSEs for the top 20% queries per simple RM.

  Data set ID   Model   where LR wins   where PR wins   where DTR wins   Overall NRMSE
  1             LR        3.61%           5.36%           6.67%            5.36%
  1             PR        7.06%           2.92%           6.33%            5.72%
  1             DTR       7.96%           5.78%           3.42%            6.01%
  2             LR       18.35%          16.14%          16.10%           16.89%
  2             PR       21.26%          12.39%          12.88%           16.04%
  2             DTR      24.14%          17.09%           7.97%           17.69%
  3             LR        3.03%           3.64%           4.83%            3.90%
  3             PR        4.89%           1.44%           4.07%            3.76%
  3             DTR       5.35%           2.78%           2.12%            3.69%
  4             LR        0.96%           2.33%           2.74%            2.15%
  4             PR        2.33%           0.63%           1.56%            1.66%
  4             DTR       2.54%           1.40%           0.78%            1.74%
  5             LR        7.56%           9.77%           9.44%            8.98%
  5             PR       10.00%           7.75%           8.11%            8.68%
  5             DTR      12.11%          10.55%           4.85%            9.69%
  6             LR        4.11%           5.26%           5.50%            5.00%
  6             PR        5.22%           4.12%           4.82%            4.74%
  6             DTR       5.42%           4.76%           4.13%            4.80%
  7             LR       42.14%         109.35%          95.42%           87.25%
  7             PR      106.72%          80.88%          82.48%           90.80%
  7             DTR      79.49%         106.48%          65.26%           85.47%
  8             LR        3.07%           5.57%           3.65%            4.23%
  8             PR        4.37%           3.77%           3.67%            3.95%
  8             DTR       5.17%           5.92%           1.60%            4.63%

The corresponding data for the advanced ensemble RMs is highly similar, and we omit it for space reasons.

Table 5: Win counts of ensemble RMs.

  Data set ID   Count of GBoost   Count of XGBoost
  1                       10334               8419
  2                        7037               6919
  3                        2653               1476
  4                        2142              10457
  5                        1596               1576
  6                       92916              78793
  7                        6991              11637
  8                       52949              51864
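One plausible reading of the top-20% selection above is sketched below; this is our own interpretation, as the paper does not spell out the exact procedure: among the queries a model wins, keep the 20% with the smallest errors and compute the NRMSE there.

```python
# Sketch (ours, one plausible reading) of the "top 20%" evaluation behind
# Table 4: within a model's win set, keep the 20% of queries with the
# smallest errors and report the NRMSE over that subset.
import numpy as np

def top20_nrmse(y_true, y_pred, win_mask, frac=0.2):
    err = np.abs(y_pred[win_mask] - y_true[win_mask])
    keep = np.argsort(err)[: max(1, int(frac * err.size))]
    yt, yp = y_true[win_mask][keep], y_pred[win_mask][keep]
    return np.sqrt(np.mean((yp - yt) ** 2)) / (y_true.max() - y_true.min())
```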
5 EVALUATING HYPOTHESIS II

This hypothesis aims to substantiate whether it is possible to develop a method that can learn from the key findings of the previous section and leverage them in order to address Scenario 2: automating the decision as to which RM to use, relieving the DB user/analyst of the conundrum, towards a query-centric RM. Specifically, we study if, and at what cost, a method can (i) near-optimally select the best regression model for any query at hand, and (ii) achieve better overall accuracy than any single (simple or ensemble) method. It is natural to treat this as a model selection problem, using a classifier for the method selection.

We show a new ensemble method, QReg, which materializes a query-centric perspective achieving the above two aims.^2 We have considered various classifiers for QReg, including SVM classifiers with a linear kernel, SVM classifiers with the RBF kernel, the XGBoost classifier, etc., and made a comprehensive comparison between them. Unless explicitly stated otherwise, results are shown for the XGBoost classifier, due to its overall prediction accuracy and scalability.

^2 NB: the aim here is not to find the best method to achieve this, but to show that this is achievable and that significant gains can be achieved using easy-to-deploy methods.

[Figure 5: QReg distribution of base models. Panels: (a) Simple QReg (LR, PR, DTR); (b) Advanced QReg (GBoost, XGBoost). Axes: Pressure (hPa) vs. cumulated wind speed (m/s).]

Fig. 5 shows the RMs chosen by Simple and Advanced QReg for each corresponding query, giving a feel for the overall RM distribution suggested by QReg. The model distribution shown resembles the distribution of the truly optimal models across the queried data space, as shown in Fig. 4. Thus, QReg does a good job in selecting (near-)optimal RMs per query. As per Scenario 1, presenting such visualisations can be of real value to analysts.

Workload-centric Perspective: Simple QReg

A workload-centric perspective assumes that the query distribution is identical to the data distribution, as described in §1. Simple QReg uses simple regression models: LR, PR, and DTR. Fig. 6 shows the NRMSE ratio r, as defined in §3.3, for all data sets in 3-d space. An NRMSE ratio r larger than 1 means that QReg has a lower prediction error than the other base model. QReg is shown to outperform, or be as good as, any of its base models. Specifically, for data sets 2, 3, 5, 6, and 8, QReg performs slightly better than the other regression models, whereas for data sets 1, 4, and 7, QReg is significantly superior to LR, PR, and DTR.

[Figure 6: Accuracy of Simple QReg vs LR, PR, DTR (NRMSE ratios per data set).]

Fig. 7 compares the prediction error of Simple QReg against the more sophisticated ensemble methods, including AdaBoost, GBoost, and XGBoost. It shows that even Simple QReg often achieves better prediction accuracy than any of the sophisticated ensemble methods. For example, up to a 25% reduction in NRMSE is achieved by Simple QReg for data set 1 in 3-d space.

[Figure 7: Accuracy of QReg vs ensemble RMs (AdaBoost, GBoost, XGBoost vs QReg). Panels: (a) 2d; (b) 3d; (c) 4d; (d) 5d.]
Workload-centric Perspective: Advanced QReg

For the majority of the cases, Simple QReg is shown to outperform simpler RMs and occasionally more complex ensemble models. For the remainder we concentrate on Advanced QReg, constructed using GBoost and XGBoost.

Delving deeper, we now show the NRMSE ratios between GBoost (or XGBoost) and QReg, broken down into sub-collections of points in the data space: specifically, the sub-collection of points where XGBoost regression (or GBoost regression) has the best prediction accuracy.

Fig. 8 shows bands of two bars each. Each band shows the NRMSE ratio between another RM and the best RM for the given collection of points. Take the 4-d data set 1 as an example, for the collection of points where XGBoost has the best prediction accuracy. The NRMSE ratio between GBoost (second-best RM) and XGBoost (collection-best RM) is 1.3288 (orange bar in the figure), while the corresponding NRMSE ratio between QReg and XGBoost is 1.0780 (green bar in the figure). This shows that, for this collection of points, GBoost suffers a 32.88% error relative to the optimal, while using QReg reduces this to 7.80%.

As another example, consider the collection of points where XGBoost has the best prediction accuracy in the 5-d data set 8. The NRMSE ratio between GBoost (second-best RM) and XGBoost is 1.1458, while the NRMSE ratio between QReg and XGBoost is 1.0316. Thus, the relative opportunity loss is 0.1458/0.0316 = 4.61, which means that the error caused by using GBoost (relative to the best model) is 4.61 times the error caused by QReg for this collection of points. The relative opportunity loss is much larger for data sets 4 and 5.

[Figure 8: Workload-centric collection-level NRMSE ratio. Panels: (a) 4-d GBoost win collection; (b) 4-d XGBoost win collection; (c) 5-d GBoost win collection; (d) 5-d XGBoost win collection.]
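The reported ROL values follow directly from the NRMSE ratios via eqs. (5)-(6); spelling out the step for the 5-d data set 8 example above:

```latex
% With RM_opt = XGBoost and r_i = NRMSE_i / NRMSE_opt, eq. (5) gives
% OL_i = r_i - 1, so:
\mathrm{ROL}_{\mathrm{GBoost},\,\mathrm{QReg}}
  = \frac{OL_{\mathrm{GBoost}}}{OL_{\mathrm{QReg}}}
  = \frac{r_{\mathrm{GBoost}} - 1}{r_{\mathrm{QReg}} - 1}
  = \frac{1.1458 - 1}{1.0316 - 1}
  = \frac{0.1458}{0.0316} \approx 4.61
```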
Fig. 9 shows the ratio r of NRMSE between the base (ensemble) methods and Advanced QReg over the whole data sets. The improvement in 2-d space is typically small, but from d = 4 we start seeing larger improvements brought about by QReg. For the 2-d case, the accuracies of the GBoost and XGBoost regressors are high and almost equal, which explains why QReg cannot improve things further. For the 3-, 4-, and 5-d cases, almost all ratios are above the horizontal line r = 1. This shows that the models do very well for most data sets. However, there are some cases where QReg is significantly better than the other ensemble methods, for example against GBoost for the 5-d data set 4 and against XGBoost for the 4-d data sets 5 and 8. We see that QReg can improve accuracy across data sets and dimensionalities.

[Figure 9: r between QReg and base ensemble models (GBoost vs QReg, XGBoost vs QReg). Panels: (a) 2-d; (b) 3-d; (c) 4-d; (d) 5-d.]

Comparing Fig. 9 with Fig. 8, we see that even though the overall NRMSE of the various RMs is similar, different RMs give different accuracy in different subspaces of the data. Interestingly, this figure also shows that for different data sets different ensemble methods win (as we have seen previously), showcasing the need for a method like QReg.

Query-centric Perspective: Advanced QReg

As discussed above, we calculated the NRMSE for the full collection of points where a single ensemble model wins. To zoom into the context of query-centric prediction serving, we now focus only on the top 20% of queries with the least error, as done previously. To summarize relative performance, the relative opportunity loss between the RMs and QReg is shown in Table 6.

Table 6: ROL w.r.t. QReg when different ensemble RMs win, for their top 20% queries.

  Data set ID   where GBoost wins   where XGBoost wins
  1                          3.00                 2.39
  2                          2.77                 2.68
  3                         75.50                 2.00
  4                          0.99                >1000
  5                          2.77                 1.87
  6                          5.67                 1.77
  7                          7.13                 3.44
  8                          2.64                 5.29

The values shown are the ROL of using the second-best RM (XGBoost when GBoost wins; GBoost when XGBoost wins) vs QReg. For example, cell [1,2] = 3.00 says that, for data set 1, if XGBoost was used (instead of the optimal, in this case GBoost), it would result in an error 3 times higher than if QReg was used. In other words, previous results have shown that, regardless of which RM is chosen, that RM will be suboptimal for certain queries; these ROL values show how QReg can minimize this cost when being suboptimal.

Similar to Fig. 8, Fig. 10 shows the NRMSE ratio for the 4-d and 5-d win collections, but from the query-centric (top 20% of queries) perspective. Focus on the collection of points where XGBoost has the best prediction accuracy in the 5-d data set 8 as an example. The NRMSE ratio between GBoost (second-best RM) and XGBoost is 1.3842, while the NRMSE ratio between QReg and XGBoost is 1.0872. Thus, the relative opportunity loss is 0.3842/0.0872 = 4.4, which means that the error caused by using GBoost is 4.4 times the error caused by QReg for this collection of points.

[Figure 10: Query-centric collection-level NRMSE ratio. Panels: (a) 4-d GBoost win collection; (b) 4-d XGBoost win collection; (c) 5-d GBoost win collection; (d) 5-d XGBoost win collection.]

It is noticeable that the NRMSE ratio for QReg is always lower than that of the second-best model, and is very close to 1. Thus, for the most vulnerable queried spaces (where a single ensemble model wins by far), QReg can near-optimally achieve the same accuracy, reconciling the otherwise unavoidable loss.

QReg Training Time

Fig. 11 shows the results of our study focusing on the scalability of QReg, i.e., the performance overheads that must be paid for QReg's accuracy improvements. There exists an approximately linear relationship between the model training time and the number of training points. Even for a relatively high number of training points (e.g., hundreds of thousands), the training time for QReg is shown to be a few dozen seconds. Although this is an order of magnitude worse than XGBoost in absolute value, it is acceptable for medium-sized data sets. Also, about 90% of the training time is spent on getting predictions from the individual base models. In the current version of the code, predictions are received sequentially from the base models; doing this in parallel would reduce the total training time.

[Figure 11: Comparison of model training time.]
6 QREG SCALABILITY

As discussed in §5, the total training time of QReg increases approximately linearly as the data size increases. This limits its application to very large data sets. An approach for addressing this issue is to build samples from the data and train QReg on the samples. We study the implications of this approach on QReg's performance, and also observe whether our hypotheses hold in this case as well.

6.1 Sample Size Planning

One major question is how big the sample size should be. A smaller sample requires less training time, but might lead to poor accuracy. Depending on the task, various strategies can be used to determine the sample size. For general purposes, Cochran's formula [13] is usually used to determine the sample size for a population:

    n_0 = \frac{z^2 p (1 - p)}{e^2}    (7)

where n_0 is the sample size, z is the critical value of the desired confidence level, p is the degree of variability, and e is the desired level of precision. For instance, suppose we need to determine the sample size for a large population whose degree of variability is unknown; p = 0.5 indicates maximum variability and produces a conservative sample size. Assuming we need a 95% confidence interval with 1% precision, the corresponding sample size is n_0 = 9604. For data sets of finite size, the sample size is slightly smaller than the value obtained from eq. (7).

For regression-specific tasks, sample size planning techniques include power analysis (PA) [14], accuracy in parameter estimation (AIPE) [28], etc. The sample sizes obtained from the two methods differ, and their magnitude is usually in the hundreds or thousands. [27] proposes a method to combine these methods with a specified probability, while [34] recommends that the largest sample size be used.

For classification-specific tasks, [19] finds that many prediction problems do not require a large training set for classifier training. [7] uses learning curves to analyze classification performance as a function of the training sample size, and concludes that 5-25 independent samples per class are enough to train classification models with acceptable performance; also, 75-100 samples are adequate for testing purposes.

In this study, the sample size varies over 10k, 100k, and 1m, which is conservative compared to the values obtained by the PA and AIPE methods for regression tasks, or the sizes for classification tasks.
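A one-line computation (ours, for illustration) reproduces the n_0 = 9604 figure quoted above from eq. (7):

```python
# Cochran's formula (7): z = 1.96 (95% confidence), p = 0.5 (maximum
# variability), e = 0.01 (1% precision) give n_0 = 9604, as in the text.
def cochran_sample_size(z=1.96, p=0.5, e=0.01):
    return z * z * p * (1.0 - p) / (e * e)

print(cochran_sample_size())  # -> ~9604
```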
6.2 Workload-centric Perspective

We show results for data sets 6, 7, 8 and the table store_sales from the TPC-DS data set. Data sets 6, 7, 8 contain 2-4 million records, and the table store_sales is scaled up to 2.6 billion records. We use reservoir sampling to generate uniform random samples from these data sets (a sketch is given at the end of this subsection). Experiments are done using Advanced QReg.

Table 7: Win counts of ensemble RMs (samples of size 100k).

  Data set ID   Count of GBoost   Count of XGBoost
  6                       16209              17124
  7                       15854              17479
  8                       13757              19576
  store_sales             13415              17003

Table 7 shows the occurrences of best predictions (wins) made by each model, for the samples of size 100k. Similarly to Table 2 in §4, each base model is shown to win for a substantial percentage of the queries (or, equivalently, for a considerable part of the data set). This supports Hypothesis 1: there is no single regression model capable of dealing with all data sets, and each regression model is only good on sub-spaces of the data sets.

Similar to §5, this section focuses on the workload-centric evaluation, but for sample-based QReg. We show the NRMSE ratio r between XGBoost (or GBoost) and QReg, broken down into sub-collections of points in the data space: specifically, the sub-collection of points where GBoost (or XGBoost) regression has the best accuracy.

Consider the collection of points where XGBoost has the best prediction accuracy in the 5-d data set 7. The NRMSE ratio r between GBoost regression (second-best RM) and XGBoost regression (best RM) is 1.2107, while the NRMSE ratio between QReg and XGBoost regression is 1.0499. Thus, the corresponding ROL between GBoost and QReg is 0.2107/0.0499 = 4.22, which means that, for this collection of points, GBoost induces 4.22 times higher error than QReg.

[Figure 12: Workload-centric collection-level NRMSE ratio (sample-based). Panels: (a) 4-d GBoost win collection; (b) 4-d XGBoost win collection; (c) 5-d GBoost win collection; (d) 5-d XGBoost win collection.]

The same conclusion holds for the query-centric perspective; it is omitted for space reasons.
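As referenced at the start of this subsection, uniform samples can be drawn in one pass with reservoir sampling; the following is a minimal sketch (ours) of the classic Algorithm R variant. Whether the paper uses exactly this variant is our assumption.

```python
# Minimal sketch (ours) of one-pass reservoir sampling (Algorithm R): draws
# a uniform random sample of k rows from a stream whose total size need not
# be known in advance.
import random

def reservoir_sample(stream, k, seed=None):
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(stream):
        if i < k:
            reservoir.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)          # row i survives with prob. k/(i+1)
            if j < k:
                reservoir[j] = row
    return reservoir
```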
6.3 Model Training Time

The training time of sample-based QReg consists of two parts: (a) the sampling time, to generate the samples from the base tables; and (b) the training time, to train QReg over the samples. Fig. 13 shows the training time of QReg for the 100m-row store_sales table, with sample sizes varying over {10k, 100k, 1m}. It takes ca. 68-72s to generate the samples. For 10k (100k, 1m) samples, it takes less than 3s (22s, 150s) to train QReg. With 100k samples, QReg performs excellently. So, in conclusion, sample-based QReg is scalable, and both hypotheses hold even when the models are trained from samples.

[Figure 13: Sample Size vs Training Time for store_sales.]

6.4 Application to AQP Engines

The previous experiments demonstrate the strengths of QReg. In this section, QReg is applied to DBEst, a new model-based approximate query processing (AQP) engine [33]. DBEst adopts classical machine learning models (regressors and density estimators) to provide approximate answers to SQL queries. We replace the default regression model in DBEst (XGBoost) with Advanced QReg, and compare the accuracy against DBEst using other ensemble methods, including XGBoost and GBoost. The well-known TPC-DS data set is scaled up with a scaling factor of 1000, which yields ~2.6 billion tuples (1TB). 96 synthetic SQL queries covering 13 column pairs are randomly generated for the SUM and AVG aggregate functions. The DBEst sample size is set to 100k.

Fig. 14 shows the relative error achieved by DBEst using the various regression models. For SUM, the relative errors using XGBoost and GBoost are 8.35% and 8.10%, respectively. However, if Advanced QReg is used, the relative error drops to 7.77%. Although Advanced QReg is built upon XGBoost and GBoost, the relative error of DBEst using Advanced QReg is better than that of DBEst using XGBoost or GBoost alone. For further comparison, if linear regression is used in DBEst, the relative error becomes 21.20%, which is much higher than DBEst using Advanced QReg. Thus, a query-centric regressor like QReg improves prediction accuracy and is very important for in-DBMS analytics.

[Figure 14: Application of QReg to DBEst. Relative error (%) for SUM, AVG, and OVERALL, for DBEst with XGBoost, GBoost, and Advanced QReg.]

7 MAJOR LESSONS LEARNED

The key lessons learned by this study are:

• Different RMs perform better for different data sets and, more interestingly, for different data subspaces within them. This holds for simpler models and, perhaps surprisingly, for advanced ensemble RMs, which are designed to generalize better.

• Each examined RM is best-performing (a winner) for a significant percentage of all queries. Necessarily, this implies that, for a significant percentage of queries, regardless of which (simple or ensemble) RM is chosen by a DB user/analyst, a suboptimal RM will be used.

• When said suboptimal RMs are used, significant additional errors emerge for a large percentage of queries.

• Best practice, which suggests using a top-performing ensemble, is misleading and leads to significant errors for large numbers of queries. In several cases, despite the fact that different RMs had a very similar overall error (NRMSE), a significant fraction of queries faced very large differences in error when using these seemingly-similarly-performing RMs. Thus, neither sophisticated nor simpler RMs can cope well in query-sensitive scenarios, where query distributions may target specific data subspaces.

• A query-centric perspective, as manifested by QReg, can offer higher accuracy across data sets and dimensionalities. This applies to overall NRMSEs; more importantly, it applies to query-centric evaluations. The study revealed that when QReg is used, there are significant accuracy gains compared to using any other non-optimal RM (which, as mentioned, is otherwise unavoidable).

• Accuracy improvements are achieved with small overheads, even for very large data sizes, using sampling.
8 CONCLUSIONS

The paper studied issues pertaining to the seamless integration of DBMSs and regression models. The analysis revealed the complexity of the problem of choosing an appropriate regression model: different models, despite having overall very similar accuracy, are shown to offer largely-varying accuracy for different data sets and for different subsets of the same data set. Given this, the analysis sheds light on solutions to the problem. It showed and studied in detail the performance of QReg, which can achieve both high accuracy over the whole data set and near-optimal accuracy per query targeting specific data subsets. The analysis also showed the impact of key decisions en route to QReg, such as selecting different constituent base regression models. In addition, it studied issues pertaining to scalability, showing that even with large data sets the same issues hold and the same model solution can be used to achieve per-query and overall high accuracy. In general, the proposed QReg offers a promising approach for taming the generalization-overfitting dilemma when employing ML models within DBMSs.

ACKNOWLEDGMENTS

This work was supported in part by the 'Tools, Practices and Systems' theme of the UKRI Strategic Priorities Fund (EPSRC Grant EP/T001569/1) and The Alan Turing Institute (EPSRC grant EP/N510129/1).

REFERENCES

[1] 2005. Database PL/SQL Packages and Types Reference. https://docs.oracle.com/cd/B28359_01/appdev.111/b28419/u_nla.htm#CIABEFIJ
[2] Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon, and Bohan Zhang. 2017. Automatic Database Management System Tuning Through Large-scale Machine Learning. In Proceedings of ACM SIGMOD.
[3] C. Anagnostopoulos and P. Triantafillou. 2015. Learning Set Cardinality in Distance Nearest Neighbours. In Proceedings of the IEEE International Conference on Data Mining (ICDM'15).
[4] C. Anagnostopoulos and P. Triantafillou. 2015. Learning to Accurately COUNT with Query-Driven Predictive Analytics. In Proceedings of the IEEE International Conference on Big Data.
[5] C. Anagnostopoulos and P. Triantafillou. 2017. Query-Driven Learning for Predictive Analytics of Data Subspace Cardinality. ACM Transactions on Knowledge Discovery from Data (TKDD) (2017).
[6] Anon. 2018. XLeratorDB. http://westclintech.com/
[7] Claudia Beleites, Ute Neugebauer, Thomas Bocklitz, Christoph Krafft, and Jürgen Popp. 2013. Sample size planning for classification models. Analytica Chimica Acta 760 (2013), 25–33.
[8] Christopher M Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
[9] Leo Breiman. 1996. Bagging predictors. Machine Learning 24, 2 (1996), 123–140.
[10] Zhuhua Cai, Zekai J Gao, Shangyu Luo, Luis L Perez, Zografoula Vagena, and Christopher Jermaine. 2014. A comparison of platforms for implementing and running very large scale machine learning algorithms. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 1371–1382.
[11] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.
[12] Tianqi Chen, Sameer Singh, Ben Taskar, and Carlos Guestrin. 2015. Efficient second-order gradient boosting for conditional random fields. In Artificial Intelligence and Statistics. 147–155.
[13] William G Cochran. 2007. Sampling Techniques. John Wiley & Sons.
[14] Jacob Cohen. 2013. Statistical Power Analysis for the Behavioral Sciences. Routledge.
[15] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. 2009. MAD skills: new analysis practices for big data. Proc. VLDB Endow. (2009).
[16] Dan Crankshaw, Peter Bailis, Joseph Gonzalez, Haoyuan Li, Zhao Zhang, Michael Franklin, Ali Ghodsi, and Michael Jordan. 2015. The missing piece in complex analytics: Low latency, scalable model management and serving with Velox. In Conference on Innovative Data Systems Research (CIDR).
[17] Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J Franklin, Joseph E Gonzalez, and Ion Stoica. 2016. Clipper: A Low-Latency Online Prediction Serving System. arXiv preprint arXiv:1612.03079 (2016).
[18] A. Deshpande and S. Madden. 2006. MauveDB: supporting model-based user views in database systems. In ACM SIGMOD.
[19] Kevin K Dobbin and Richard M Simon. 2006. Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics 8, 1 (2006), 101–117.
[20] Yoav Freund and Robert E Schapire. 1995. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory. Springer, 23–37.
[21] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics (2001), 1189–1232.
[22] Jerome H Friedman. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis 38, 4 (2002), 367–378.
[23] L. Gerhardt, C. H. Faham, and Y. Yao. 2018. SciDB-Py. http://scidb-py.readthedocs.io/en/stable/
[24] Joseph M Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, et al. 2012. The MADlib analytics library: or MAD skills, the SQL. Proceedings of the VLDB Endowment 5, 12 (2012), 1700–1711.
[25] Botong Huang, Matthias Boehm, Yuanyuan Tian, Berthold Reinwald, Shirish Tatikonda, and Frederick R Reiss. 2015. Resource elasticity for large-scale machine learning. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 137–152.
[26] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural Computation 3, 1 (1991), 79–87.
[27] Michael R Jiroutek, Keith E Muller, Lawrence L Kupper, and Paul W Stewart. 2003. A new method for choosing sample size for confidence interval-based inferences. Biometrics 59, 3 (2003), 580–590.
[28] Ken Kelley and Scott E Maxwell. 2003. Sample size for multiple regression: obtaining regression coefficients that are accurate, not simply significant. Psychological Methods 8, 3 (2003), 305.
[29] Arun Kumar, Robert McCann, Jeffrey Naughton, and Jignesh M. Patel. [n. d.]. Model Selection Management Systems: The Next Frontier of Advanced Analytics. SIGMOD Rec. 44 ([n. d.]).
[30] Xuan Liang, Tao Zou, Bin Guo, Shuo Li, Haozhe Zhang, Shuyi Zhang, Hui Huang, and Song Xi Chen. 2015. Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating. In Proc. R. Soc. A, Vol. 471. The Royal Society, 20150257.
[31] M. Lichman. 2013. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[32] Lin Ma, Dana Van Aken, Ahmed Hefny, Gustavo Mezerhane, Andrew Pavlo, and Geoffrey J Gordon. 2018. Query-based workload forecasting for self-driving database management systems. In Proceedings of the 2018 International Conference on Management of Data. ACM, 631–645.
[33] Qingzhi Ma and Peter Triantafillou. 2019. DBEst: Revisiting approximate query processing engines with machine learning models. In Proceedings of the 2019 International Conference on Management of Data. ACM, 1553–1570.
[34] Scott E Maxwell, Ken Kelley, and Joseph R Rausch. 2008. Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology 59 (2008), 537–563.
[35] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research 17, 1 (2016), 1235–1241.
[36] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research 17, 34 (2016), 1–7.
[37] Hannes Muehleisen, Anthony Damico, and Thomas Lumley. 2018. MonetDB.R. http://monetr.r-forge.r-project.org/
[38] Raghunath Othayoth Nambiar and Meikel Poess. 2006. The making of TPC-DS. In Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 1049–1058.
[39] Carlos Ordonez. 2010. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering 22, 12 (2010), 1752–1765.
[40] Carlos Ordonez, Carlos Garcia-Alvarado, and Veerabhadaran Baladandayuthapani. 2014. Bayesian variable selection in linear regression in one pass for large datasets. ACM Transactions on Knowledge Discovery from Data (TKDD) 9, 1 (2014), 3.
[41] Y. Park, A. S. Tajik, M. Cafarella, and B. Mozafari. 2017. Database Learning: Toward a Database that Becomes Smarter Every Time. In Proceedings of ACM SIGMOD.
[42] Christopher Ré, Divy Agrawal, Magdalena Balazinska, Michael Cafarella, Michael Jordan, Tim Kraska, and Raghu Ramakrishnan. 2015. Machine learning and databases: The sound of things to come or a cacophony of hype? In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 283–284.
[43] M. Schleich, D. Olteanu, and R. Ciucanu. 2016. Learning Linear Regression Models over Factorized Joins. In ACM SIGMOD.
[44] Riyaz Sikora et al. 2015. A modified stacking ensemble machine learning algorithm using genetic algorithms. In Handbook of Research on Organizational Transformations through Big Data Analytics. IGI Global, 43–53.
[45] T. Condie, P. Mineiro, N. Polyzotis, and M. Weimer. 2013. Machine learning for big data. In ACM SIGMOD. 939–942.
[46] A. Thiagarajan and S. Madden. 2008. Querying continuous functions in a database system. In ACM SIGMOD.
[47] D. S. Tkach. 1998. Information Mining with the IBM Intelligent Miner. IBM White Paper.
[48] Daniele Varrazzo. 2014. Psycopg. http://initd.org/psycopg/
[49] Rashmi Korlakai Vinayak and Ran Gilad-Bachrach. 2015. DART: Dropouts meet multiple additive regression trees. In Artificial Intelligence and Statistics. 489–497.
[50] David H Wolpert. 1992. Stacked generalization. Neural Networks 5, 2 (1992), 241–259.
[51] D. Wong. 2013. Oracle Data Miner. Oracle White Paper.