<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Query-Centric Regression for In-DBMS Analytics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qingzhi Ma</string-name>
          <email>Q.Ma.2@warwick.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Triantafillou</string-name>
          <email>P.Triantafillou@warwick.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Warwick</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Research in enriching DBs with Machine Learning (ML) models is receiving increasingly greater attention. This paper experimentally analyzes the problem of empowering data systems with (and giving their users access to) regression models (RMs). The paper offers a data system's perspective, which unveils an interesting 'impedance mismatch' problem: ML models aim to offer a high expected overall prediction accuracy, which essentially assumes that queries will target data using the same distributions as the data on which the models are trained. However, in data management it is widely recognized that query distributions do not necessarily follow data distributions. Queries using selection operators target specific data subspaces on which even an overall highly accurate model may be weak. If such queried subspaces are popular, large numbers of queries will suffer. The paper will reveal, shed light on, and quantify this 'impedance mismatch' phenomenon. It will study in detail 8 real-life data sets and data from TPC-DS and experiment with various dimensionalities therein. It will employ new appropriate metrics, substantiating the problem across a wide variety of popular RMs, ranging from simple linear models to advanced, state-of-the-art ensembles (which enjoy excellent generalization performance). It will put forth and study a new, query-centric model that addresses this problem, improving per-query accuracy while also offering excellent overall accuracy. Finally, it will study the effects of scale on the problem and its solutions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        A new dominating trend has emerged for the next-generation
data management and analytics systems based on integrating
ML models and data management platforms [
        <xref ref-type="bibr" rid="ref10 ref15 ref24 ref25 ref35 ref36 ref42 ref45">10, 15, 24, 25, 35,
36, 42, 45</xref>
        ]. Additional efforts pertain to connectors to back-end
databases, which allow for statistical analyses and related queries
on DB data, like MonetDB.R [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ], SciDB-Py [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and Psycopg
[
        <xref ref-type="bibr" rid="ref48">48</xref>
        ]. Another class of efforts concerns learning from past
answers to predict the answers to future analytical queries, e.g.
for approximate query processing engines, which provide
approximate answers to aggregate queries, using ML techniques
[
        <xref ref-type="bibr" rid="ref3 ref4 ref41 ref5">3–5, 41</xref>
        ], or for tuning database systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and for forecasting
workloads [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. Yet another class of efforts concerns model and
query-prediction serving, like the Velox/Clipper systems [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]
managing ML models for predictive analytics. Finally, vision
papers suggest the move towards model selection management
systems [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], where a primary task is model selection whereby
the system is able to select the best model to use for the task at
hand.
      </p>
      <p>In this realm, regression models, being a principal means for
predictive analytics, are of particular interest to both analysts
and data analytics platforms. RMs are playing an increasingly
important role within data systems. Examples of their extended
use and significance include many modern DBs which provide</p>
    </sec>
    <sec id="sec-2">
      <title>Motivations</title>
      <p>Given the above increasing interest in bridging ML and RMs with
DBs, we focus on how seamless this process can be. ML models
(and RMs in particular) are trained to optimize a loss function
(invariably concerning overall expected error). We refer to this
as a workload-centric view, as the aim is to minimize expected
error among all queries in an expected workload. In essence, this
assumes that the query distribution (the expected workload) is similar to
the distribution of the training data. In contrast, data systems research has
long recognized that query workloads follow patterns generally
different from data distributions. Hence, queries on data subspaces
(e.g., using range or equality selection predicates) where an
ML model is weak will suffer from high errors. And, if such
queries are popular, many queries will suffer. This gives rise to
the need for a “query-centric perspective”: We define
query-centric regression as a model which strives to ensure both high
average accuracy (across all queries in a workload) and
high per-query accuracy, whereby each query is ensured to enjoy
accuracy close to that of the best possible model.</p>
      <p>The ML community’s general answer to such problems is
to turn to ensemble methods, in order to lower variance and
generalize better (e.g., to different distributions). We wish to shed
light on this possible impedance mismatch problem and see if it
holds for simpler and even for state-of-the-art ensemble RMs. We
further wish to (i) quantify the phenomenon: We shall use several
real data sets (from the UCI ML repository and TPC-DS data),
a wide variety of popular RMs, and new metrics that reveal
workload-centric and query-centric performance differences, and
(ii) see if the problem can be addressed by adopting a
query-centric perspective (using a new ensemble method), whereby the
error observed by each query is as low as possible (which will
also indirectly ensure high workload-centric performance).</p>
      <p>The above bear strong practical consequences. Consider a
data analyst using Python or R, linked with an ML library (like
Apache Spark MLlib, scikit-learn, etc.), or using a DB connector
like MonetDB.R or SciDB-Py, or a prediction serving system like
Clipper, etc., and the following use cases.</p>
      <p>Scenario 1: The analyst uses a predicate to define a data
subspace of interest and calls a specific RM method: It would be great
if she knew which RMs to use for which data subspaces.</p>
      <p>Scenario 2: Alternatively, the system could select the best RM
automatically for the analyst’s query at hand.</p>
      <p>With this paper we wish to inform the community of this
DB-RM impedance mismatch problem, study and quantify it, and lay
the foundation for seamless use of RMs for in-DBMS analytics,
offering this query-centric perspective.</p>
    </sec>
    <sec id="sec-3">
      <title>BACKGROUND</title>
      <p>Our study employs a set of representative and popular RMs,
grouped into two categories: simple and ensemble RMs.</p>
    </sec>
    <sec id="sec-4">
      <title>Simple Regression Models</title>
      <p>Simple RMs include linear regression (LR), polynomial
regression (PR), decision tree regression (DTR), SVM Regression (SVR),
Nearest Neighbours Regression (NNR), etc. An introduction to
these simple regression models is omitted for space reasons.</p>
      <p>Table 1 summarizes the known asymptotic time complexity
of training for key regression models. More detailed
comparisons are made and discussed in §3.5.</p>
    </sec>
    <sec id="sec-5">
      <title>Ensemble Methods</title>
      <p>
        Ensemble methods are powerful methods that combine the
predictions and advantages from base models. It is often observed
that prediction accuracy is improved by combining the prediction
results in some way (e.g., using weighted averaging of predictions
from various base models) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Ensemble learning is also useful
for scaling-up data mining and model prediction [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ]. Many
well-developed ensemble methods exist, including
averaging-based ensemble methods, bootstrap aggregating (bagging)
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], boosting [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
        ], stacking [
        <xref ref-type="bibr" rid="ref50">50</xref>
        ], mixture of experts [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], etc.
      </p>
      <p>Averaging-based ensemble methods calculate the weighted
average of predictions from all models. Because every base model must be
evaluated for each query, this incurs higher
computational costs and higher response time.</p>
      <p>
        Boosting refers to a family of algorithms that could potentially
convert “weak models” to “strong models”. AdaBoost [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], short
for “adaptive boosting”, is a popular boosting algorithm. Unlike
bootstrap aggregating, whose models are trained in parallel, the
prediction models in AdaBoost are trained in sequence. AdaBoost
was first proposed to solve classification problems, and was
later applied to regression problems. Randomization
may be incorporated into boosting, so that its response time is
reduced [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>The objective of gradient boosting is to minimize the loss
function:
        <disp-formula>
          <label>(1)</label>
          <tex-math><![CDATA[L(y_i, f(x_i)) = \mathrm{MSE} = \sum_i \left(y_i - f(x_i)\right)^2]]></tex-math>
        </disp-formula>
        <disp-formula>
          <label>(2)</label>
          <tex-math><![CDATA[f(x_i)^{r+1} = f(x_i)^{r} + \alpha \, \frac{\partial L(y_i, f(x_i)^{r})}{\partial f(x_i)^{r}}]]></tex-math>
        </disp-formula>
where r is the iteration number. Gradient boosting (GBoost)
usually uses only the first-order information; Chen et al. incorporate
the second-order information in gradient boosting for conditional
random fields, and improve its accuracy [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, the base
models are usually limited to a set of classification and regression
trees (CART). Other regression models are not supported.
      </p>
      <p>
        XGBoost [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is a state-of-the-art boosting method, and is widely
used for competitions due to its fast training time and high
accuracy. The objective of XGBoost is
        <disp-formula>
          <label>(3)</label>
          <tex-math><![CDATA[\mathrm{obj}(\Theta) = L(\Theta) + \Omega(\Theta)]]></tex-math>
        </disp-formula>
where L(Θ) is the loss function, which controls how close
predictions are to the targets, and Ω(Θ) is the regularization term, which
controls the complexity of the model. Over-fitting is avoided if
a proper Ω(Θ) is selected. The base models (boosters) can be
gbtree, gblinear or dart [
        <xref ref-type="bibr" rid="ref49">49</xref>
        ]. Gbtree and dart are tree models
while gblinear is linear.
      </p>
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTAL SETUP</title>
      <p>All experiments run on an Ubuntu system, with an Intel Core i5-7500
CPU @ 3.40GHz × 4 processors and 32GB memory.</p>
    </sec>
    <sec id="sec-7">
      <title>Hypotheses</title>
      <p>The study rests on testing, validating, and quantifying two key
hypotheses:</p>
      <p>Hypothesis 1. Different RMs exhibit higher accuracy for
different regions of the queried data spaces. Likewise for different data
sets. Such differences can be large and may occur even though said
RMs may enjoy similar overall accuracy.</p>
      <p>The corollary of Hypothesis 1 is that, even if our analysts in
Scenario 1 knew of highly accurate RMs, many of their analytical
queries would be susceptible to large errors. Hypothesis 1 aims
to test whether the loss function used by even top-performing
RMs, which minimizes the overall expected error, 'hides' this
issue. En route, the analysis will quantify this problem across
many different RMs, data sets, and dimensionalities.</p>
      <p>Hypothesis 2. Given Hypothesis 1, a model equipped with
knowledge of the accuracy distribution of RMs in the query space,
can near-optimally serve each query.</p>
      <p>Such a model, coined QReg, is a classifier-based
ensemble method that bears a query-centric perspective; it will be studied
here to validate Hypothesis 2. Hypothesis 2 aims to show that
integrating an RM within a DB can be done in a
query-centric manner, avoiding the aforementioned problems and thus
offering a solution for Scenario 2.</p>
    </sec>
    <sec id="sec-8">
      <title>Data Sets and Dimensionality</title>
      <p>
        To test the hypotheses, eight real-world data sets with different
characteristics from the UCI machine learning repository [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]
are used, varying the dimensionality from 2 to 5, as well as a
large fact table from the TPC-DS benchmark [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ].
      </p>
      <p>Data set 1 is a collection of YouTube videos showing input
and output video characteristics along with the transcoding time
and memory requirements. Data set 2 contains physicochemical
properties of protein tertiary structure. Tasks include predicting
the size of the residue (RMSD) based on nine properties. There are
45730 decoys and the size varies from 0 to 21 angstrom. Data set 3 is
an hourly data set containing the PM2.5 concentration data in
Beijing. The task is to predict PM2.5 concentration (μg/m<sup>3</sup>), and
the independent variables include pressure (PRES), cumulated
wind speed (Iws), etc. Data set 4 is an online news popularity
data set and tasks include predicting the number of shares in
social networks (popularity). There are 39797 records in total in
this data set. Data set 5 contains 9568 data points collected from
a Combined Cycle Power Plant over 6 years (2006-2011), and the
task is to predict the net hourly electrical energy output (EP) of
the plant. Data set 6 is the YearPredictionMSD data set used to
predict the release year of a song from audio features. Most of
the songs are commercial tracks from 1922 to 2011. Data set 7
contains the recordings of 16 chemical sensors exposed to two
dynamic gas mixtures at varying concentrations. The goal with
this data set is to predict the recording of one specific chemical
sensor based on the other sensors and the gas mixtures. This is a
time-series data set containing more than 4 million records in
total. Data set 8 records the electric power
consumption of one household for more than four years, and
contains two million records.</p>
      <p>
        We further employ table store_sales from the popular
TPC-DS benchmark [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ]. Typical columns used for the experiments
include ss_wholesale_cost, ss_list_price, ss_sales_price,
ss_ext_sales_price and ss_ext_wholesale_cost.
      </p>
    </sec>
    <sec id="sec-9">
      <title>Evaluation Metrics</title>
      <p>Accuracy is measured using the Normalized Root Mean Square
Error (NRMSE) metric, defined as:
        <disp-formula>
          <label>(4)</label>
          <tex-math><![CDATA[\mathrm{NRMSE} = \frac{\sqrt{\frac{\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}{n}}}{y_{\max} - y_{\min}}]]></tex-math>
        </disp-formula>
NRMSE shows overall deviations between predicted and
measured values; it is built upon root mean square error (RMSE),
and is scaled to the range of the measured values. It provides
a universal measure of prediction accuracy between different
regression models.</p>
      <p>The NRMSE ratio r compares the prediction accuracy of
one RM<sub>i</sub> against that of any other RM<sub>j</sub>, and is defined as
r = NRMSE<sub>i</sub> / NRMSE<sub>j</sub>. If NRMSE<sub>j</sub> ≤ NRMSE<sub>i</sub>, this ratio shows how much worse
RM<sub>i</sub> is compared to RM<sub>j</sub>.</p>
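      <p>The two definitions above can be illustrated with a minimal sketch in plain Python. The function names nrmse and nrmse_ratio and the toy values are assumptions made for illustration only; they are not taken from the QReg source code or from the data sets studied here.</p>
      <preformat><![CDATA[
```python
import math


def nrmse(y_true, y_pred):
    """Root mean square error scaled by the range of the measured values."""
    n = len(y_true)
    mse = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n
    return math.sqrt(mse) / (max(y_true) - min(y_true))


def nrmse_ratio(nrmse_i, nrmse_j):
    """r = NRMSE_i / NRMSE_j; a value above 1 means RM_i is worse than RM_j."""
    return nrmse_i / nrmse_j


# Toy measured values and the predictions of two hypothetical models.
y = [0.0, 2.0, 4.0, 6.0, 10.0]
pred_a = [1.0, 2.0, 4.0, 6.0, 9.0]   # errors: 1, 0, 0, 0, -1
pred_b = [2.0, 2.0, 4.0, 6.0, 8.0]   # errors: 2, 0, 0, 0, -2

e_a = nrmse(y, pred_a)
e_b = nrmse(y, pred_b)
r = nrmse_ratio(e_b, e_a)  # model b doubles every error of model a
print(e_a, e_b, r)
```
]]></preformat>
      <p>Because both models are scaled by the same value range, the ratio depends only on their RMSEs, which is why r is convenient for comparing models on one data set.</p>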
      <p>The above are standard metrics used for comparing accuracy.
However, our study calls for additional metrics. Inherent in our
study is the need to reflect the differences in accuracy observed
by a query as they depend on the model used. For this we define
the concept of Opportunity Loss as a natural way to reflect how
much the query loses in accuracy by using a sub-optimal model.</p>
      <p>Assuming RM<sub>opt</sub> is the RM with the lowest NRMSE, we define
Opportunity Loss OL<sub>i</sub> as
        <disp-formula>
          <label>(5)</label>
          <tex-math><![CDATA[OL_i = \frac{\mathrm{NRMSE}_i}{\mathrm{NRMSE}_{opt}} - 1,]]></tex-math>
        </disp-formula>
which quantifies as a % the error (the opportunity loss) due to
not using the best model RM<sub>opt</sub> and using RM<sub>i</sub> instead.</p>
      <p>Furthermore, we define
        <disp-formula>
          <label>(6)</label>
          <tex-math><![CDATA[ROL_{i,k} = \frac{OL_i}{OL_k},]]></tex-math>
        </disp-formula>
as Relative Opportunity Loss, which quantifies how much better
RM<sub>k</sub> does vs RM<sub>i</sub> in improving on the opportunity loss.</p>
      <p>Intuitively, our aim (with testing Hypothesis 1) is to show that,
whichever single model is used, some queries will always be
processed by sub-optimal models. So we wish to quantify this
opportunity loss. Furthermore, our aim (with testing Hypothesis 2)
is to show that a new ensemble model can help significantly
alleviate this problem. The ROL<sub>i,k</sub> metric will help quantify how much
a model (QReg) improves on this opportunity loss for queries.</p>
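      <p>As a small worked illustration of Opportunity Loss and Relative Opportunity Loss, the sketch below computes both from NRMSE values. The helper names and the three NRMSE values are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from this study.</p>
      <preformat><![CDATA[
```python
def opportunity_loss(nrmse_i, nrmse_opt):
    """OL_i = NRMSE_i / NRMSE_opt - 1 (multiply by 100 for a percentage)."""
    return nrmse_i / nrmse_opt - 1.0


def relative_opportunity_loss(ol_i, ol_k):
    """ROL = OL_i / OL_k: how many times larger RM_i's loss is than RM_k's."""
    return ol_i / ol_k


# Hypothetical NRMSEs on one collection of queries.
nrmse_opt = 0.080      # best model for these queries
nrmse_single = 0.100   # some other single model
nrmse_select = 0.084   # a model-selecting ensemble

ol_single = opportunity_loss(nrmse_single, nrmse_opt)   # about 0.25
ol_select = opportunity_loss(nrmse_select, nrmse_opt)   # about 0.05
rol = relative_opportunity_loss(ol_single, ol_select)   # about 5
print(ol_single, ol_select, rol)
```
]]></preformat>
      <p>In this toy case the single model loses about 25% against the best model while the selecting ensemble loses about 5%, so the relative opportunity loss is about 5.</p>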
    </sec>
    <sec id="sec-10">
      <title>Architecture</title>
      <p>Assume that the data system maintains m regression models.
When a query arrives, the system needs to identify the model
with the least prediction error for this query. We treat this model
selection problem as a classification problem. Fig. 1 shows the
architecture of this classified regression, QReg<sup>1</sup>.</p>
      <p>There are two layers in the system. (i) The model
maintenance layer deploys and maintains the base regression models.
(ii) The query classification layer implements the core of QReg.
A query is first passed to the pre-trained classifier. Because the
classifier “knows” the prediction ability of each model in the
various queried data spaces, the query will be assigned to the model
that performs best locally for this query’s data space. Unlike
typical ensemble methods, only one model is invoked for each
query (hence QReg is less computationally intensive).</p>
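      <p>The two-layer design can be sketched as follows. This is a minimal illustration, not the QReg implementation: the class QueryRouter, the two toy base models, and the hand-written region_rule (standing in for a trained classifier) are all assumptions introduced here.</p>
      <preformat><![CDATA[
```python
class QueryRouter:
    """Routes each query point to exactly one base regression model."""

    def __init__(self, classifier, models):
        self.classifier = classifier      # maps a query point to a model index
        self.models = models              # base models (plain callables here)
        self.calls = [0] * len(models)    # how often each model was invoked

    def predict(self, x):
        i = self.classifier(x)            # model expected to be best at x
        self.calls[i] += 1
        return self.models[i](x)          # only this one model is evaluated


# Hypothetical stand-ins for the model maintenance layer.
def linear(x):
    return 2.0 * x[0] + 1.0


def constant(x):
    return 5.0


# Hypothetical stand-in for the pre-trained classifier.
def region_rule(x):
    return 0 if x[0] < 3.0 else 1


router = QueryRouter(region_rule, [linear, constant])
y1 = router.predict([1.0])   # routed to the linear model
y2 = router.predict([4.0])   # routed to the constant model
print(y1, y2, router.calls)
```
]]></preformat>
      <p>The call counters make the cost argument visible: each query triggers one classifier call and one base-model call, whereas an averaging ensemble would evaluate every base model.</p>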
      <p>Two configurations are studied for QReg: Simple QReg uses
LR, PR, and DTR. Advanced QReg uses GBoost and XGBoost as
its base models.</p>
      <p>When deciding which models to include, the following key criteria
are considered.</p>
      <p>(a.) Model training time. This should be as low as possible
and should exhibit good scalability as the number of the training
data points increases.</p>
      <p>Given the asymptotic complexity, as summarised in §2.1, a
number of experiments were conducted to quantify the training
times for various regression models. Fig. 2 is a representative
result for data set 4 using 4 dimensions. It shows how model
training time (for six regression models) is impacted as the number
of training instances increases.</p>
      <p>Model training time is shown to behave acceptably with
respect to the number of instances in the training data set for LR,
PR, DTR, and kNN regression. The training time of Support
Vector Regression with the Radial Basis Function kernel tends to increase much
more aggressively as the number of training points increases. The
experiment was repeated for all data sets. The above conclusion
holds across all experiments; the remaining results are omitted for space reasons.</p>
      <p><sup>1</sup> The source code for QReg is available at https://github.com/qingzma/qreg</p>
      <p>(b.) Query response time: In the classifier training process,
predictions are made from each base prediction model. To reduce
the overall training time as well as the query response time, the
models should have as low a response time as possible.</p>
      <p>(c.) Prediction accuracy: An interesting issue arises from
using different regression models together, as in QReg. If the base
regression models have large differences in accuracy levels, then
this may result in QReg having poor accuracy. This is a direct
result of errors introduced during the separate classification
process. Therefore, care must be taken to ensure that base models
enjoy similar and good accuracy levels.</p>
    </sec>
    <sec id="sec-11">
      <title>Model Training Strategy</title>
      <p>After partitioning the data set, m base models are trained upon
Training_Dataset_1. The selection of base models is in
principle open and depends on the users’ choice (taking into account
the above issues). Each base model f<sub>i</sub>(x) makes a prediction ŷ<sub>i</sub>
for each data point x in Training_Dataset_2. A comparison is
made between the predicted ŷ<sub>i</sub> and the real label y to find the
best prediction model for each query x.</p>
      <p>Having the individual predictions and associated errors, a new
data set is generated by combining the data point x and the index
i of the best model for this query, denoted [x, i]. This data set is
then used to build the classifier reflecting the prediction ability
of base models in the query space. The classifier is the core of
QReg, and a well-designed classifier could potentially grasp the
prediction ability of each model in the query space correctly. Thus,
the prediction accuracy can be significantly improved compared
to individual prediction models.</p>
      <p>[Figure: QReg training workflow. Main Dataset; Training Dataset 1; Training Dataset 2; Testing Dataset; Model 1 ... Model n; Predictions 1 ... Predictions n; new dataset to build classifier.]</p>
      <p>Note that the original data set is partitioned into 3 subsets
instead of 2. This is done in order to ensure that different
training data sets are used to train the base models and the classifier,
respectively, which avoids potential over-fitting problems. In
addition, models are fine-tuned via cross-validation using
GridSearchCV() in the scikit-learn package.</p>
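      <p>The training strategy described in this section can be sketched in plain Python as below. Everything here is an illustrative assumption rather than the QReg implementation: the two toy base models (fit_mean, fit_line), the nearest-neighbour classifier used in place of the XGBoost classifier, the synthetic 1-d data, and the equal three-way split.</p>
      <preformat><![CDATA[
```python
import random


def fit_mean(points):
    """Hypothetical base model 1: always predicts the training mean of y."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y


def fit_line(points):
    """Hypothetical base model 2: least-squares line through the points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def best_model_labels(models, points):
    """Build the [x, i] data set: i indexes the model with least error at x."""
    labelled = []
    for x, y in points:
        errors = [abs(m(x) - y) for m in models]
        labelled.append((x, errors.index(min(errors))))
    return labelled


def fit_nearest_neighbour(labelled):
    """Stand-in classifier: returns the label of the closest stored x."""
    def classify(x):
        return min(labelled, key=lambda pair: abs(pair[0] - x))[1]
    return classify


# Synthetic 1-d data: y grows linearly up to x = 5 and is flat afterwards.
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(0.0, 10.0)
    data.append((x, min(x, 5.0)))

# Three subsets: base models, classifier, and held-out testing data.
train_1, train_2, test = data[:100], data[100:200], data[200:]

models = [fit_mean(train_1), fit_line(train_1)]    # trained on subset 1
labelled = best_model_labels(models, train_2)      # [x, i] pairs from subset 2
classifier = fit_nearest_neighbour(labelled)       # learns who wins where


def qreg_predict(x):
    """Answer a query with the single model the classifier selects."""
    return models[classifier(x)](x)


print([round(qreg_predict(x), 3) for x, _ in test[:3]])
```
]]></preformat>
      <p>Keeping the classifier's training data (subset 2) separate from the base models' training data (subset 1) means the best-model labels reflect out-of-sample errors, which is the over-fitting concern the three-way partition is meant to address.</p>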
    </sec>
    <sec id="sec-12">
      <title>EVALUATING HYPOTHESIS I</title>
      <p>
        Consider data set 3, the Beijing PM2.5 data set [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], using
Cumulated Wind Speed (IWS) and Pressure (PRES) as the features,
yielding a 3-dimensional regression problem.
      </p>
      <p>Fig. 4.(a) shows the distribution of the model with the least
error for all data points. LR, PR, and DTR are used as the base RMs
(Simple QReg). Fig. 4.(b) shows the distribution of best models
when ensemble models GBoost and XGBoost are used (Advanced
QReg).</p>
      <p>[Fig. 4: model with the least error at each data point; x-axis: pressure (hPa), y-axis: cumulated wind speed (m/s). (a) Simple models: LR, PR, DTR. (b) Ensemble models: GBoost, XGBoost.]</p>
      <p>Take QReg using simple models as an example. LR dominates
in the upper-central region. PR dominates in the lower-central
regions. DTR performs best in the rest of the space. The NRMSEs
for LR, PR and DTR are 8.48%, 8.84%, and 8.32%, respectively.
However, if for each point the best model can be selected to make
the prediction (as shown in Fig. 4), the corresponding ("optimal")
NRMSE drops to 7.19%. Relative to the best single model (DTR, at 8.32%),
this is a reduction of about 13.6%, a large improvement in accuracy.</p>
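      <p>The "optimal" NRMSE discussed above corresponds to an oracle that, for every point, picks whichever model has the least error. A minimal sketch of that computation follows; the function oracle_selection and the toy predictions are assumptions for illustration and do not reproduce the paper's numbers.</p>
      <preformat><![CDATA[
```python
import math


def nrmse(y_true, y_pred):
    """Root mean square error scaled by the range of the measured values."""
    n = len(y_true)
    mse = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n
    return math.sqrt(mse) / (max(y_true) - min(y_true))


def oracle_selection(y_true, predictions):
    """For each point, pick the model with the least absolute error.

    predictions maps a model name to its list of predictions.
    Returns (oracle predictions, number of wins per model).
    Ties go to the model listed first.
    """
    names = list(predictions)
    wins = {name: 0 for name in names}
    chosen = []
    for t, y in enumerate(y_true):
        best = min(names, key=lambda name: abs(predictions[name][t] - y))
        wins[best] += 1
        chosen.append(predictions[best][t])
    return chosen, wins


# Toy values, not taken from the data sets studied here.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
preds = {
    "LR": [1.0, 2.5, 3.5, 4.0, 4.0, 7.0],
    "DTR": [2.0, 2.0, 3.0, 5.0, 5.0, 5.0],
}
oracle, wins = oracle_selection(y, preds)

for name in preds:
    print(name, round(nrmse(y, preds[name]), 4), "wins:", wins[name])
print("oracle", round(nrmse(y, oracle), 4))
```
]]></preformat>
      <p>By construction the oracle's error at each point is no larger than that of any single model, so its NRMSE is a lower bound that a model-selecting method can approach but not beat.</p>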
      <p>Figures like Fig. 4 can help analysts decide on which models
to use when querying this space. Similar figures exist for all data
sets studied in this work and are omitted due to space reasons.</p>
      <p>sets. So, there is no single winner. DTR enjoys the most wins for
data sets 2, 5, and 8; PR makes the most accurate predictions for
data sets 1, 3, and 4; LR wins for data sets 6 and 7.</p>
      <p>
        Table 3 zooms in, augmenting Table 2 by showing the NRMSEs
when different simple RMs win. For example, for data set 1, we
know from Table 2 that LR wins 6468 times. For these queries, LR’s
NRMSE was 11.08%, as indicated by Table 3, whereas PR’s was
enormous and DTR’s was 18.28% (see the 3 numbers in cell
[1,2] in Table 3). Similarly, for the 6919 queries where PR won,
LR’s NRMSE was 11.82%, as indicated by Table 3, whereas PR’s
was 9.35% and DTR’s was 11.68% (see the 3 numbers in cell
[1,3]).
      </p>
      <p>Consider data set 4. When LR wins, its error is markedly lower
(almost half) than that of PR and DTR, unlike their overall NRMSEs,
which show LR to be the worst model.</p>
      <p>To further facilitate a query-centric perspective, we delve into
the performance of the queries for which each RM reached a
top-20% performance. For data set 1, for example, this includes
the best 20% of the 6468 queries for which LR wins, the best 20%
of the 6919 queries for which PR wins, and the best 20% of the 5366
queries for which DTR wins. Hence, the NRMSE of interest does not
come from all the queries, but from the top 20% of queries for which
the least error was achieved by a simple RM. Table 4 shows these
results, along with the overall NRMSE of each simple RM for the
whole set of queries. Again, note that the overall NRMSEs are
quite close. However, individual differences are very large. For
data set 1, for instance, the top 20% of queries when LR wins
enjoy an NRMSE that is about half of the NRMSE of the others.
Interestingly, the same holds for PR and DTR! Similar conclusions
hold for the other data sets.</p>
      <p>The corresponding data for the advanced ensemble RMs is
highly similar and we omit it for space reasons.</p>
      <p>Hypothesis 2 aims to substantiate whether it is possible to
develop a method that can learn from the key findings of the
previous section and leverage them in order to address Scenario
2, automating the decision as to which RM to use, relieving the
DB user/analyst of the conundrum, towards a query-centric RM.
Specifically, we study if and at what costs a method can: (i)
near-optimally select the best regression model for any query at hand
and (ii) achieve better overall accuracy than any single (simple
or ensemble) method.</p>
      <p>It is natural to treat this problem as a model selection problem,
using a classifier for the method selection. We show a new
ensemble method, QReg, which materializes a query-centric perspective
achieving the above two aims.<sup>2</sup> We have considered various
classifiers for QReg, including SVM-linear classifiers, SVM classifiers
using the RBF kernel, the XGBoost classifier, etc. A
comprehensive comparison between various classifiers is made. Unless
explicitly stated otherwise, results for the XGBoost classifier
are shown, due to its overall prediction accuracy and scalability
performance.</p>
    </sec>
    <sec id="sec-13">
      <title>Workload-centric Perspective: Simple QReg</title>
      <p>A workload-centric perspective assumes that the query
distribution is identical to the data distribution, as described in §1. Simple
QReg uses simple regression models, including LR, PR, and DTR.
Fig. 6 shows the NRMSE ratio r as defined in Equation (4) for all
data sets in 3-d space. An NRMSE ratio r larger than 1 means
QReg has less prediction error than the base model it is compared against. QReg
is shown to outperform or be as good as any of its base models.
Specifically, for data sets 2, 3, 5, 6, and 8, QReg performs slightly
better than the other regression models, whereas for data sets 1, 4,
and 7, we can see QReg being significantly superior to LR,
PR, or DTR.</p>
      <p>Fig. 7 compares the prediction error of Simple QReg
against the more sophisticated ensemble methods, including
AdaBoost, GBoost, and XGBoost. Fig. 7 shows that even
Simple QReg often achieves better prediction accuracy than any
of the sophisticated ensemble methods. For example, up to 25%
reduction in NRMSE is achieved by Simple QReg for data set 1
in 3-d space.</p>
    </sec>
    <sec id="sec-14">
      <title>Workload-centric Perspective: Advanced QReg</title>
      <p>For the majority of the cases, Simple QReg is shown to
outperform simpler RMs and occasionally more complex ensemble
models. For the remainder we concentrate on Advanced QReg
constructed using GBoost and XGBoost.</p>
      <p>Delving deeper, we now show the NRMSE ratios between
GBoost (or XGBoost) and QReg, broken down to sub-collections
of points in the data space, specifically, for the sub-collection of
points where XGBoost regression (or GBoost regression) has the
best prediction accuracy.</p>
      <p><sup>2</sup> NB: the aim here is not to find the best method to achieve this, but to show that
this is achievable and that significant gains can be achieved using easy-to-deploy
methods.</p>
      <p>Fig. 8 shows bands of 2 bars each. Each band of bars shows the
NRMSE ratio between the other RMs and the best RM for the collection
of points. Take the 4-d data set 1 as an example, for the collection
of points where XGBoost has the best prediction accuracy. The
NRMSE ratio between GBoost (second best RM) and XGBoost
(collection-best RM) is 1.3288 (orange bar in the figure), while
the corresponding NRMSE ratio between QReg and XGBoost is
1.0780 (green bar in the figure). This shows that for this collection
of points where XGBoost has the best prediction accuracy, GBoost
suffers from a 32.88% error relative to the optimal, while using
QReg reduces this to 7.80%.</p>
      <p>As another example, consider the collection of points where
XGBoost has the best prediction accuracy in the 5-d data set 8.
The NRMSE ratio between GBoost (second best RM) and XGBoost
is 1.1458, while the NRMSE ratio between QReg and XGBoost
is 1.0316. Thus, the relative opportunity loss is 0.1458/0.0316 =
4.61, which means the error caused by using GBoost (relative to
the best model) is 4.61 times the error caused by QReg for the
collection of points where XGBoost has the best accuracy. The
relative opportunity loss is much larger for data sets 4 and 5.</p>
      <p>[Figure: bar charts of NRMSE ratio (y-axis) by data set ID (x-axis); legend: GBoost, QReg.]</p>
    </sec>
    <sec id="sec-15">
      <title>Query-centric Perspective: Advanced QReg</title>
      <p>As discussed before, we calculate the NRMSE for the full
collection of points where a single ensemble model wins. To
zoom into the context of query-centric prediction serving, we
now focus only on the top 20% of queries with the least error, as
done previously. To summarize relative performance, the relative
opportunity loss between RMs and QReg is shown in Table 6.</p>
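      <p>The query-centric measurement over the top 20% of queries can be sketched as below. The function top_fraction_nrmse is hypothetical, and it assumes that errors are still normalised by the value range of the full set of queries passed in; the text does not state which range is used for this subset. In the setting above, the inputs would be restricted to the queries for which a given model wins.</p>
      <preformat><![CDATA[
```python
import math


def top_fraction_nrmse(y_true, y_pred, fraction=0.2):
    """NRMSE over the `fraction` of queries with the smallest error.

    Assumption: the result is normalised by the value range of all
    queries passed in, so it stays comparable with the overall NRMSE.
    """
    value_range = max(y_true) - min(y_true)
    sq_errors = sorted((p - t) ** 2 for p, t in zip(y_pred, y_true))
    k = max(1, int(len(sq_errors) * fraction))   # keep at least one query
    return math.sqrt(sum(sq_errors[:k]) / k) / value_range


# Toy queries with known errors; not values from the paper.
y = [float(v) for v in range(10)]
errors = [0.1, 0.3, 2.0, 1.0, 0.5, 3.0, 0.2, 1.5, 0.4, 2.5]
pred = [v + e for v, e in zip(y, errors)]

overall = top_fraction_nrmse(y, pred, fraction=1.0)   # all queries
top20 = top_fraction_nrmse(y, pred, fraction=0.2)     # the 2 best-served queries
print(round(overall, 4), round(top20, 4))
```
]]></preformat>
      <p>Sorting by squared error and truncating keeps only the best-served queries, which is why the top-20% figure can differ sharply from the overall NRMSE even when overall figures of different models are close.</p>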
      <p>
        The values shown are the ROL of using the second-best RM,
GBoost (XGBoost), vs QReg when XGBoost (GBoost)
wins. For example, cell [1,2] = 3.00 says that if XGBoost were used
(instead of the optimal RM, in this case GBoost), it would result in
an opportunity loss that is 3 times that of QReg. In other
words, previous results have shown that, regardless of which RM
is chosen, this RM will be suboptimal for certain queries. So these
ROL values show how QReg can minimize the cost of being
suboptimal.
      </p>
      <p>
        horizontal line r = 1. It shows that models do very well for most data
sets. However, there are some cases where QReg is significantly
better than other ensemble methods, for example against GBoost
for the 5-d data set 4 and against XGBoost for the 4-d data set
8. We see that QReg can improve accuracy across data sets and
dimensionalities.
      </p>
      <p>Comparing Fig. 9 with Fig. 8, we see that even though the
overall NRMSE of various RMs is similar, different RMs give different
accuracy in different subspaces of the data. Interestingly, this
figure also shows that for different data sets different ensemble
methods win (as we have seen previously), showcasing the need
for a method like QReg.</p>
      <p>Similar to Fig. 8, Fig. 10 shows the NRMSE ratio for the
4-5 dimensional space, but from the query-centric perspective (the top 20% of
queries). As an example, consider the collection of points where
XGBoost has the best prediction accuracy in the 5-d data set 8.
The NRMSE ratio between GBoost (second-best
RM) and XGBoost is 1.3842, while the NRMSE ratio between
QReg and XGBoost is 1.0872. Thus, the relative opportunity loss
is 0.3842/0.0872 = 4.4, which means the error caused by using
GBoost is 4.4 times the error caused by QReg for the collection
of points where XGBoost has the best prediction accuracy.</p>
      <p>Notably, the NRMSE ratio of QReg is always lower
than that of the second-best model, and is very close to 1. Thus,
for the most vulnerable queried spaces (where a single ensemble
model wins by far), QReg achieves near-optimal
accuracy, recovering the otherwise unavoidable loss.</p>
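      <p>The ROL arithmetic used in the examples above can be sketched as follows. This is a minimal illustration that assumes ROL is the excess NRMSE ratio of the second-best RM divided by the excess NRMSE ratio of QReg, as the worked numbers in the text suggest; the function name is hypothetical and not taken from the authors' code.</p>
      <preformat>
```python
def relative_opportunity_loss(ratio_second_best, ratio_qreg):
    """ROL = (r_second_best - 1) / (r_qreg - 1).

    Both arguments are NRMSE ratios relative to the best model on the
    same collection of points, so a ratio of 1 means "as good as the best".
    """
    excess_qreg = ratio_qreg - 1.0
    if excess_qreg == 0.0:
        raise ValueError("QReg matches the best model; ROL is undefined")
    return (ratio_second_best - 1.0) / excess_qreg


# NRMSE ratios quoted in the text: (second-best RM, QReg).
examples = {
    "workload-centric example": (1.1458, 1.0316),
    "query-centric example, data set 8": (1.3842, 1.0872),
}
for label, (r_second, r_qreg) in examples.items():
    print(label, round(relative_opportunity_loss(r_second, r_qreg), 2))
```
      </preformat>
      <p>A value well above 1 indicates that falling back to the second-best single RM costs considerably more than using QReg on that collection of points.</p>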
    </sec>
    <sec id="sec-16">
      <title>QReg Training Time</title>
      <p>Fig. 11 shows the results of our study of the scalability
of QReg, that is, the performance overhead that must be paid for
QReg’s accuracy improvement. There is an approximately
linear relationship between the model training time and the number
of training points. Even for a relatively high number of
training points (e.g., hundreds of thousands), the training time for
QReg is a few dozen seconds. Although this is
an order of magnitude worse than XGBoost in absolute terms, it
is acceptable for medium-sized data sets. Also, about 90% of the
training time is spent obtaining predictions from the individual
base models. In the current version of the code, predictions are
received sequentially from base models; doing this in parallel
would reduce the total training time.</p>
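      <p>One way to obtain base-model predictions concurrently rather than sequentially is sketched below. It is an illustration only: the callable-based model interface and the toy models are assumptions, not the authors' implementation, and threads reduce wall-clock time only when the underlying prediction routines release Python's global interpreter lock (otherwise a process pool would be the analogous choice).</p>
      <preformat>
```python
from concurrent.futures import ThreadPoolExecutor


def collect_predictions(base_models, points):
    """Return {model_name: predictions}, querying all base models concurrently.

    `base_models` maps a name to any callable that takes a list of points
    and returns one prediction per point (a hypothetical interface).
    """
    with ThreadPoolExecutor(max_workers=len(base_models)) as pool:
        futures = {name: pool.submit(predict, points)
                   for name, predict in base_models.items()}
        return {name: future.result() for name, future in futures.items()}


# Toy stand-ins for trained regressors (illustrative only).
toy_models = {
    "linear": lambda xs: [2.0 * x + 1.0 for x in xs],
    "constant": lambda xs: [5.0 for _ in xs],
}
predictions = collect_predictions(toy_models, [0.0, 1.0, 2.0])
```
      </preformat>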
    </sec>
    <sec id="sec-17">
      <title>QREG SCALABILITY</title>
      <p>As discussed in §5, the total training time of QReg increases
approximately linearly with the data size. This limits its
application to very large data sets. One approach to addressing
this issue is to build samples from the data and train QReg on the
samples. We study the implications of this approach for QReg’s
performance and examine whether our hypotheses still hold in
this case.</p>
    </sec>
    <sec id="sec-18">
      <title>Sample Size Planning</title>
      <p>
        One major question is how big the sample should be. A
smaller sample requires less training time, but might lead to poor
accuracy. Depending on the task, various strategies can be used
to determine the sample size. For general purposes, Cochran’s
formula [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is usually used to determine the sample size for a
population.
      </p>
      <p>$$ n_0 = \frac{z^{2}\, p\,(1 - p)}{e^{2}} \qquad (7) $$
where \(n_0\) is the sample size, z is the critical value for the
desired confidence level, p is the degree of variability, and e is the
desired level of precision. For instance, suppose we need to determine the
sample size for a large population whose degree of variability is
unknown. Setting p = 0.5 indicates maximum variability and produces
a conservative sample size. Assuming a 95% confidence
level (z = 1.96) with 1% precision (e = 0.01), the corresponding sample size is \(n_0\) =
9604. For data sets of finite size, the sample size is slightly
smaller than the value obtained from eq. (7).</p>
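      <p>The calculation in eq. (7) can be checked with a short sketch. The finite-population adjustment shown here is the standard correction n = n0 / (1 + (n0 - 1) / N); the text only states that the finite-size value is slightly smaller, so the exact correction, the function names, and the population of one million are assumptions made for illustration.</p>
      <preformat>
```python
import math


def cochran_sample_size(z, p, e):
    """Cochran's formula, eq. (7): n0 = z^2 * p * (1 - p) / e^2."""
    return z ** 2 * p * (1.0 - p) / e ** 2


def finite_population_size(n0, population):
    """Standard finite-population correction (assumed, not from the text)."""
    return n0 / (1.0 + (n0 - 1.0) / population)


# 95% confidence (z = 1.96), maximum variability (p = 0.5), 1% precision.
n0 = cochran_sample_size(z=1.96, p=0.5, e=0.01)
# Hypothetical finite population of one million records.
n_finite = finite_population_size(n0, population=1_000_000)
# Round before taking the ceiling to absorb floating-point noise.
print(math.ceil(round(n0, 6)), math.ceil(n_finite))
```
      </preformat>
      <p>For populations far larger than the sample, the correction is negligible, which is why the infinite-population value is a reasonable conservative choice.</p>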
      <p>
        For regression-specific tasks, sample size planning techniques
include power analysis (PA) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], accuracy in parameter
estimation (AIPE) [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], etc. The sample sizes obtained from the two
methods differ, and their magnitude is usually in the hundreds
or thousands. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] proposes a method to combine these
methods with a specified probability, while [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ] recommends that the
largest sample size should be used.
      </p>
      <p>
        For classification-specific tasks, [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] finds that many
prediction problems do not require a large training set for classifier
training. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] uses learning curves to analyze the classification
performance as a function of the training sample size, and
concludes that 5-25 independent samples per class are enough to
train classification models with acceptable performance. Also,
75-100 samples will be adequate for testing purposes.
      </p>
      <p>In this study, the sample size is set to 10k, 100k, and 1m,
which is conservative compared to the values obtained by the
PA and AIPE methods for regression tasks, or the sizes suggested for
classification tasks.</p>
    </sec>
    <sec id="sec-19">
      <title>Workload-centric Perspective</title>
      <p>We show results for data sets 6, 7, 8 and Table store_sales from
the TPC-DS data set. Data sets 6, 7, 8 contain 2-4 million records,
and Table store_sales is scaled-up to 2.6 billion records. We
use reservoir sampling to generate uniform random samples for
these data sets. Experiments are done using Advanced QReg.</p>
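      <p>Reservoir sampling draws a uniform random sample of fixed size in a single pass, without knowing the table size in advance. The sketch below shows the classic Algorithm R variant; the paper does not say which variant was used, so this choice and the function name are assumptions.</p>
      <preformat>
```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if k > len(reservoir):
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)  # both bounds inclusive
            if k > slot:
                reservoir[slot] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, seed=0)
```
      </preformat>
      <p>Each record is read once and only k records are held in memory, which is what makes the approach practical for tables with billions of rows.</p>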
      <p>Table 7 shows the occurrences of best predictions (wins) made
by each model, for the samples of size 100k. Similarly to Table 2
in §4, each base model is shown to win for a substantial
percentage of queries (or, equivalently for a considerable part of the
data set). This supports Hypothesis I that there is not a single
regression model capable of dealing with various data sets, and
each regression model is only good at sub-spaces of the data sets.</p>
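      <p>The win counts reported in such a table can be computed along the following lines. This is a sketch under stated assumptions: a model wins a point when its absolute error is smallest, ties go to the model listed first, and the toy values are made up for illustration rather than taken from the experiments.</p>
      <preformat>
```python
from collections import Counter


def count_wins(true_values, predictions_by_model):
    """Count, per model, the points where it has the smallest absolute error."""
    wins = Counter({name: 0 for name in predictions_by_model})
    for i, truth in enumerate(true_values):
        best = min(predictions_by_model,
                   key=lambda name: abs(predictions_by_model[name][i] - truth))
        wins[best] += 1
    return wins


# Made-up values for illustration only.
truth = [1.0, 2.0, 3.0, 4.0]
toy_predictions = {
    "XGBoost": [1.1, 2.5, 3.0, 4.4],
    "GBoost": [1.3, 2.1, 3.2, 4.1],
}
wins = count_wins(truth, toy_predictions)
```
      </preformat>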
      <p>As in §5, this section focuses on the workload-centric
evaluation, but for sample-based QReg. We show the NRMSE
ratio r between XGBoost (or GBoost) and QReg, broken down
by subcollections of points in the data space, specifically for the
subcollection of points where GBoost (or XGBoost)
has the best accuracy.</p>
      <p>Consider the collection of points where XGBoost has the best
prediction accuracy in the 5-d data set 7. The NRMSE ratio r
between GBoost (second-best RM) and XGBoost
(best RM) is 1.2107, while the NRMSE ratio between
QReg and XGBoost is 1.0499. Thus, the corresponding
ROL between GBoost and QReg is 0.2107/0.0499 = 4.22, which
means that, for this collection of points, GBoost induces 4.22 times
higher error than QReg.</p>
      <p>The same conclusion holds for the query-centric perspective,
and is omitted for space reasons.</p>
    </sec>
    <sec id="sec-20">
      <title>Model Training Time</title>
      <p>The training time of sample-based QReg consists of two parts:
(a) the sampling time to generate samples from the base tables, and (b)
the training time to train QReg over the samples. Fig. 13 shows
the training time of QReg for the 100m store_sales table, with
sample sizes of 10k, 100k, and 1m. It takes ca. 68-72s to generate
the samples. For 10k (100k, 1m) samples, it takes less than 3s (22s,
150s) to train QReg. With 100k samples, QReg performs
excellently. In conclusion, sample-based QReg is scalable, and both
hypotheses hold even when models are trained from samples.</p>
    </sec>
    <sec id="sec-21">
      <title>Application to AQP Engines</title>
      <p>
        Previous experiments demonstrate the strength of QReg. In this
section, QReg is applied to DBEst, a new model-based
approximate query processing (AQP) engine [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. DBEst adopts classical
machine learning models (regressors and density estimators) to
provide approximate answers to SQL queries. We replace the
default regression model in DBEst (XGBoost) with Advanced
QReg, and compare the accuracy with that of DBEst using other
ensemble methods, including XGBoost and GBoost. The well-known
TPC-DS data set is scaled up with a scaling factor of 1000, which
then contains ∼2.6 billion tuples (1TB). 96 synthetic SQL queries
covering 13 column pairs are randomly generated for the SUM and AVG
aggregate functions. The DBEst sample size is set to 100k.
      </p>
      <p>Fig. 14 shows the relative error achieved by DBEst using
various regression models. For SUM, the relative errors using XGBoost
and GBoost are 8.35% and 8.10%, respectively. However, if Advanced QReg is
used, the relative error drops to 7.77%. Although Advanced QReg
is built upon XGBoost and GBoost, the relative error of DBEst
using Advanced QReg is lower than that of DBEst using XGBoost or
GBoost alone. For further comparison, if linear regression is
used in DBEst, the relative error becomes 21.20%, which is much
higher than that of DBEst using Advanced QReg. Thus, a query-centric
regressor like QReg improves prediction accuracy and is
very important for in-DBMS analytics.</p>
    </sec>
    <sec id="sec-22">
      <title>MAJOR LESSONS LEARNED</title>
      <p>The key lessons learned from this study are:
• Different RMs perform better for different data sets
and, more interestingly, for different data subspaces within
them. This holds for simpler models and, perhaps
surprisingly, for advanced ensemble RMs, which are designed to
generalize better.
• Each examined RM is best-performing (a winner) for a
significant percentage of all queries. Necessarily, this
implies that, for a significant percentage of queries,
regardless of which (simple or ensemble) RM is chosen by a DB
user/analyst, a suboptimal RM will be used.
• When such suboptimal RMs are used, significant additional
errors emerge for a large percentage of queries.
• Best practice, which suggests using a top-performing
ensemble, is misleading and leads to significant errors for
large numbers of queries. In several cases, although
different RMs had a very similar overall error
(NRMSE), a significant fraction of queries faced very large
differences in error when using seemingly similarly performing
RMs. Thus, neither sophisticated nor simpler RMs cope
well with query-sensitive scenarios, where
query distributions may target specific data subspaces.
• A query-centric perspective, as manifested in QReg, can
offer higher accuracy across data sets and dimensionalities.
This applies to overall NRMSEs and, more importantly, to
query-centric evaluations. The study revealed that
when QReg is used, there are significant accuracy gains
compared to using any other non-optimal RM (which, as
mentioned, is unavoidable).
• Accuracy improvements are achieved with small
overheads, even with very large data sizes, using sampling.</p>
    </sec>
    <sec id="sec-23">
      <title>CONCLUSIONS</title>
      <p>The paper studied issues pertaining to the seamless integration
of DBMSs and regression models. The analysis revealed the
complexity of the problem of choosing an appropriate regression
model: different models, despite having very similar overall
accuracy, are shown to offer widely varying accuracy for different
data sets and for different subsets of the same data set. Given this,
the analysis sheds light on solutions to the problem. It showed
and studied in detail the performance of QReg, which can achieve
both high accuracy over the whole data set and near-optimal
accuracy per query targeting specific data subsets. The
analysis also showed the impact of key decisions en route to QReg,
such as selecting different constituent base regression models.
In addition, it studied issues pertaining to scalability, showing
that even with large data sets, the same issues hold and the same
model solution can be used to achieve per-query and overall
high accuracy. In general, the proposed QReg offers a promising
approach for taming the generalization-overfit dilemma when
employing ML models within DBMSs.</p>
    </sec>
    <sec id="sec-24">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported in part by the ‘Tools, Practices and
Systems’ theme of the UKRI Strategic Priorities Fund (EPSRC
Grant EP/T001569/1) &amp; The Alan Turing Institute (EPSRC grant
EP/N510129/1).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <year>2005</year>
          .
          <article-title>Database PL/SQL Packages and Types Reference</article-title>
          . https://docs.oracle. com/cd/B28359_01/appdev.111/b28419/u_nla.htm#CIABEFIJ
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Dana</given-names>
            <surname>Van Aken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Pavlo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Geoffrey J.</given-names>
            <surname>Gordon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bohan</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Automatic Database Management System Tuning Through Large-scale Machine Learning</article-title>
          .
          <source>In Proceeding of ACM SIGMOD.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Anagnostopoulos</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Triantafillou</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Learning Set Cardinality in Distance Nearest Neighbours</article-title>
          .
          <source>In Proceeding of IEEE International Conference on Data Mining</source>
          , (
          <year>ICDM15</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Anagnostopoulos</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Triantafillou</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Learning to Accurately COUNT with Query-Driven Predictive Analytics</article-title>
          .
          <source>In Proceeding of IEEE International Conference on Big Data.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Anagnostopoulos</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Triantafillou</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Query-Driven Learning for Predictive Analytics of Data Subspace Cardinality</article-title>
          .
          <source>ACM Trans. on Knowledge Discovery from Data</source>
          ,
          <source>(ACM TKDD)</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Anon</surname>
          </string-name>
          .
          <year>2018</year>
          . XLeratorDB. http://westclintech.com/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Claudia</given-names>
            <surname>Beleites</surname>
          </string-name>
          , Ute Neugebauer, Thomas Bocklitz, Christoph Kraft, and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Popp</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Sample size planning for classification models</article-title>
          .
          <source>Analytica chimica acta 760</source>
          (
          <year>2013</year>
          ),
          <fpage>25</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Christopher M</given-names>
            <surname>Bishop</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Pattern recognition and machine learning</article-title>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Leo</given-names>
            <surname>Breiman</surname>
          </string-name>
          .
          <year>1996</year>
          .
          <article-title>Bagging predictors</article-title>
          .
          <source>Machine learning 24, 2</source>
          (
          <year>1996</year>
          ),
          <fpage>123</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Zhuhua</surname>
            <given-names>Cai</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zekai J Gao</surname>
          </string-name>
          , Shangyu Luo, Luis L Perez,
          <string-name>
            <surname>Zografoula Vagena</surname>
            , and
            <given-names>Christopher</given-names>
          </string-name>
          <string-name>
            <surname>Jermaine</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A comparison of platforms for implementing and running very large scale machine learning algorithms</article-title>
          .
          <source>In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM</source>
          ,
          <fpage>1371</fpage>
          -
          <lpage>1382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Tianqi</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Guestrin</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Xgboost: A scalable tree boosting system</article-title>
          .
          <source>In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. ACM</source>
          ,
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Tianqi</surname>
            <given-names>Chen</given-names>
          </string-name>
          , Sameer Singh,
          <string-name>
            <given-names>Ben</given-names>
            <surname>Taskar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Guestrin</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Efficient second-order gradient boosting for conditional random fields</article-title>
          .
          <source>In Artificial Intelligence and Statistics</source>
          .
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>William G</given-names>
            <surname>Cochran</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Sampling techniques</article-title>
          . John Wiley &amp; Sons.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Statistical power analysis for the behavioral sciences</article-title>
          .
          <source>Routledge.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dolan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dunlap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Welton</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>MAD skills: new analysis practices for big data</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <article-title>(</article-title>
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Dan</surname>
            <given-names>Crankshaw</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Bailis</surname>
          </string-name>
          , Joseph Gonzalez,
          <string-name>
            <given-names>Haoyuan Li</given-names>
            ,
            <surname>Zhao</surname>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , Michael Franklin, Ali Ghodsi, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>The missing piece in complex analytics: Low latency, scalable model management and serving with Velox</article-title>
          .
          <source>In Conference on Innovative Data Systems Research (CIDR).</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Crankshaw</surname>
          </string-name>
          , Xin Wang,
          <string-name>
            <surname>Giulio Zhou</surname>
            , Michael J Franklin, Joseph E Gonzalez, and
            <given-names>Ion</given-names>
          </string-name>
          <string-name>
            <surname>Stoica</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Clipper: A Low-Latency Online Prediction Serving System</article-title>
          .
          <source>arXiv preprint arXiv:1612.03079</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>MauveDB: supporting model-based user views in database systems</article-title>
          .
          <source>In ACM SIGMOD.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Kevin K</given-names>
            <surname>Dobbin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Richard M</given-names>
            <surname>Simon</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Sample size planning for developing classifiers using high-dimensional DNA microarray data</article-title>
          .
          <source>Biostatistics 8</source>
          ,
          <issue>1</issue>
          (
          <year>2006</year>
          ),
          <fpage>101</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Yoav</given-names>
            <surname>Freund</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert E</given-names>
            <surname>Schapire</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>A desicion-theoretic generalization of on-line learning and an application to boosting</article-title>
          .
          <source>In European conference on computational learning theory. Springer</source>
          ,
          <fpage>23</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Jerome H</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Greedy function approximation: a gradient boosting machine</article-title>
          .
          <source>Annals of statistics</source>
          (
          <year>2001</year>
          ),
          <fpage>1189</fpage>
          -
          <lpage>1232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Jerome H</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Stochastic gradient boosting</article-title>
          .
          <source>Computational Statistics &amp; Data Analysis</source>
          <volume>38</volume>
          ,
          <issue>4</issue>
          (
          <year>2002</year>
          ),
          <fpage>367</fpage>
          -
          <lpage>378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>L</given-names>
            <surname>Gerhardt</surname>
          </string-name>
          , CH Faham, and
          <string-name>
            <given-names>Y</given-names>
            <surname>Yao</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>SciDB-Py</article-title>
          . http://scidb-py.readthedocs. io/en/stable/.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Joseph</surname>
            <given-names>M Hellerstein</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Christoper</given-names>
            <surname>Ré</surname>
          </string-name>
          , Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng,
          <string-name>
            <given-names>Kun</given-names>
            <surname>Li</surname>
          </string-name>
          , et al.
          <year>2012</year>
          .
          <article-title>The MADlib analytics library: or MAD skills, the SQL</article-title>
          .
          <source>Proceedings of the VLDB Endowment 5</source>
          ,
          <issue>12</issue>
          (
          <year>2012</year>
          ),
          <fpage>1700</fpage>
          -
          <lpage>1711</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Botong</surname>
            <given-names>Huang</given-names>
          </string-name>
          , Matthias Boehm, Yuanyuan Tian, Berthold Reinwald, Shirish Tatikonda, and
          <string-name>
            <surname>Frederick R Reiss</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Resource elasticity for large-scale machine learning</article-title>
          .
          <source>In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM</source>
          ,
          <fpage>137</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Robert</surname>
            <given-names>A Jacobs</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michael I Jordan</surname>
          </string-name>
          ,
          <string-name>
            <surname>Steven J Nowlan,</surname>
          </string-name>
          and
          <string-name>
            <given-names>Geoffrey E</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>1991</year>
          .
          <article-title>Adaptive mixtures of local experts</article-title>
          .
          <source>Neural computation 3</source>
          ,
          <issue>1</issue>
          (
          <year>1991</year>
          ),
          <fpage>79</fpage>
          -
          <lpage>87</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Michael</surname>
            <given-names>R Jiroutek</given-names>
          </string-name>
          , Keith E Muller, Lawrence L Kupper, and Paul W Stewart.
          <year>2003</year>
          .
          <article-title>A new method for choosing sample size for confidence interval-based inferences</article-title>
          .
          <source>Biometrics</source>
          <volume>59</volume>
          ,
          <issue>3</issue>
          (
          <year>2003</year>
          ),
          <fpage>580</fpage>
          -
          <lpage>590</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Ken</given-names>
            <surname>Kelley</surname>
          </string-name>
          and
          <string-name>
            <given-names>Scott E</given-names>
            <surname>Maxwell</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Sample size for multiple regression: obtaining regression coefficients that are accurate, not simply significant</article-title>
          .
          <source>Psychological Methods</source>
          <volume>8</volume>
          ,
          <issue>3</issue>
          (
          <year>2003</year>
          ),
          <fpage>305</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Arun</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Robert</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Naughton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jignesh M.</given-names>
            <surname>Patel</surname>
          </string-name>
          . [n. d.].
          <article-title>Model Selection Management Systems: The Next Frontier of Advanced Analytics</article-title>
          .
          <source>SIGMOD Rec.</source>
          <volume>44</volume>
          ([n. d.]).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Xuan</given-names>
            <surname>Liang</surname>
          </string-name>
          , Tao Zou, Bin Guo,
          <string-name>
            <given-names>Shuo</given-names>
            <surname>Li</surname>
          </string-name>
          , Haozhe Zhang, Shuyi Zhang, Hui Huang, and Song Xi Chen.
          <year>2015</year>
          .
          <article-title>Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating</article-title>
          .
          <source>In Proc. R. Soc. A</source>
          , Vol.
          <volume>471</volume>
          . The Royal Society,
          <elocation-id>20150257</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lichman</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>UCI Machine Learning Repository</article-title>
          . http://archive.ics.uci.edu/ml
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Lin</given-names>
            <surname>Ma</surname>
          </string-name>
          , Dana Van Aken,
          <string-name>
            <given-names>Ahmed</given-names>
            <surname>Hefny</surname>
          </string-name>
          , Gustavo Mezerhane,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Pavlo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Geoffrey J</given-names>
            <surname>Gordon</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Query-based workload forecasting for self-driving database management systems</article-title>
          .
          <source>In Proceedings of the 2018 International Conference on Management of Data. ACM</source>
          ,
          <fpage>631</fpage>
          -
          <lpage>645</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Qingzhi</given-names>
            <surname>Ma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Triantafillou</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>DBEst: Revisiting approximate query processing engines with machine learning models</article-title>
          .
          <source>In Proceedings of the 2019 International Conference on Management of Data. ACM</source>
          ,
          <fpage>1553</fpage>
          -
          <lpage>1570</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Scott E</given-names>
            <surname>Maxwell</surname>
          </string-name>
          , Ken Kelley, and
          <string-name>
            <given-names>Joseph R</given-names>
            <surname>Rausch</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Sample size planning for statistical power and accuracy in parameter estimation</article-title>
          .
          <source>Annu. Rev. Psychol</source>
          .
          <volume>59</volume>
          (
          <year>2008</year>
          ),
          <fpage>537</fpage>
          -
          <lpage>563</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Xiangrui</given-names>
            <surname>Meng</surname>
          </string-name>
          , Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde,
          <string-name>
            <given-names>Sean</given-names>
            <surname>Owen</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>MLlib: Machine learning in Apache Spark</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>17</volume>
          ,
          <issue>1</issue>
          (
          <year>2016</year>
          ),
          <fpage>1235</fpage>
          -
          <lpage>1241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Xiangrui</given-names>
            <surname>Meng</surname>
          </string-name>
          , Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde,
          <string-name>
            <given-names>Sean</given-names>
            <surname>Owen</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>MLlib: Machine learning in Apache Spark</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>17</volume>
          ,
          <issue>34</issue>
          (
          <year>2016</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Hannes</given-names>
            <surname>Muehleisen</surname>
          </string-name>
          , Anthony Damico, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Lumley</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <source>MonetDB.R</source>
          . http://monetr.r-forge.r-project.org/.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Raghunath Othayoth</given-names>
            <surname>Nambiar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Meikel</given-names>
            <surname>Poess</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>The making of TPC-DS</article-title>
          .
          <source>In Proceedings of the 32nd international conference on Very large data bases. VLDB Endowment</source>
          ,
          <fpage>1049</fpage>
          -
          <lpage>1058</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Ordonez</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Statistical model computation with UDFs</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>22</volume>
          ,
          <issue>12</issue>
          (
          <year>2010</year>
          ),
          <fpage>1752</fpage>
          -
          <lpage>1765</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Ordonez</surname>
          </string-name>
          , Carlos Garcia-Alvarado, and
          <string-name>
            <given-names>Veerabhadran</given-names>
            <surname>Baladandayuthapani</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Bayesian variable selection in linear regression in one pass for large datasets</article-title>
          .
          <source>ACM Transactions on Knowledge Discovery from Data (TKDD)</source>
          <volume>9</volume>
          ,
          <issue>1</issue>
          (
          <year>2014</year>
          ),
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Tajik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafarella</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Mozafari</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Database Learning: Toward a Database that Becomes Smarter Every Time</article-title>
          .
          <source>In Proceedings of ACM SIGMOD.</source>
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Ré</surname>
          </string-name>
          , Divy Agrawal, Magdalena Balazinska,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Cafarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tim</given-names>
            <surname>Kraska</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Raghu</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Machine learning and databases: The sound of things to come or a cacophony of hype?</article-title>
          .
          <source>In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM</source>
          ,
          <fpage>283</fpage>
          -
          <lpage>284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schleich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Olteanu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Ciucanu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Learning Linear Regression Models over Factorized Joins</article-title>
          .
          <source>In ACM SIGMOD.</source>
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>Riyaz</given-names>
            <surname>Sikora</surname>
          </string-name>
          et al.
          <year>2015</year>
          .
          <article-title>A modified stacking ensemble machine learning algorithm using genetic algorithms</article-title>
          .
          <source>In Handbook of Research on Organizational Transformations through Big Data Analytics. IGI Global</source>
          ,
          <fpage>43</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Condie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mineiro</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Weimer</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Machine learning for big data</article-title>
          .
          <source>In ACM SIGMOD</source>
          .
          <fpage>939</fpage>
          -
          <lpage>942</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>A.</given-names>
            <surname>Thiagarajan</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Querying continuous functions in a database system</article-title>
          .
          <source>In ACM SIGMOD.</source>
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Tkach</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Information Mining with the IBM Intelligent Miner</article-title>
          .
          <source>IBM White Paper.</source>
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>Daniele</given-names>
            <surname>Varrazzo</surname>
          </string-name>
          .
          <year>2014</year>
          . Psycopg. http://initd.org/psycopg/.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>Rashmi Korlakai</given-names>
            <surname>Vinayak</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ran</given-names>
            <surname>Gilad-Bachrach</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>DART: Dropouts meet multiple additive regression trees</article-title>
          .
          <source>In Artificial Intelligence and Statistics</source>
          .
          <fpage>489</fpage>
          -
          <lpage>497</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>David H</given-names>
            <surname>Wolpert</surname>
          </string-name>
          .
          <year>1992</year>
          .
          <article-title>Stacked generalization</article-title>
          .
          <source>Neural Networks</source>
          <volume>5</volume>
          ,
          <issue>2</issue>
          (
          <year>1992</year>
          ),
          <fpage>241</fpage>
          -
          <lpage>259</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wong</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Oracle Data Miner</article-title>
          . Oracle White Paper.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>