=Paper=
{{Paper
|id=Vol-2324/Paper26-CWang
|storemode=property
|title=A Vision of a Decisional Model for Re-optimizing Query Execution Plans Based on Machine Learning Techniques
|pdfUrl=https://ceur-ws.org/Vol-2324/Paper26-CWang.pdf
|volume=Vol-2324
|authors=Chenxiao Wang,Zachary Arani,Le Gruenwald,Laurent d'Orazio
|dblpUrl=https://dblp.org/rec/conf/dolap/WangAGd19
}}
==A Vision of a Decisional Model for Re-optimizing Query Execution Plans Based on Machine Learning Techniques==
Chenxiao Wang (University of Oklahoma, Norman, Oklahoma, United States of America), chenxiao@ou.edu
Zachary Arani (University of Oklahoma, Norman, Oklahoma, United States of America), myrrhman@ou.edu
Le Gruenwald (University of Oklahoma, Norman, Oklahoma, United States of America), ggruenwald@ou.edu
Laurent d'Orazio (Rennes 1 University, Lannion, France), laurent.dorazio@univ-rennes1.edu
ABSTRACT
Many of the existing cloud database query optimization algorithms target reducing the monetary cost paid to cloud service providers in addition to query response time. These query optimization algorithms rely on an accurate cost estimation so that the optimal query execution plan (QEP) is selected. The cloud environment is dynamic, meaning that the hardware configuration, data usage, and workload allocations are continuously changing. These dynamic changes make an accurate query cost estimation difficult to obtain, and the query execution plan must be adjusted automatically to address them. In order to optimize the QEP with a more accurate cost estimation, the query needs to be optimized multiple times during execution, and the most up-to-date estimation should be used for each optimization. However, issues arise when deciding when to pause the execution so that the overhead stays minimal. In this paper, we present our vision of a method that uses machine learning techniques to predict the best timings for optimization during execution.

KEYWORDS
Query Optimization, Cloud Computing, Cloud Databases

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.
1 INTRODUCTION
Many of the existing cloud database query optimization algorithms target reducing the monetary cost paid to cloud service providers in addition to query response time. The time and monetary costs needed to execute a query are estimated based on the data statistics that the query optimizer has available when the query optimization is performed. These statistics are often not accurate, which may result in inaccurate estimates for the time and monetary costs needed to execute the query [3, 12]. Thus, the QEP generated before the query is executed may not be the best one. One approach to this issue is to adaptively optimize the QEP during the query execution: employing more accurate statistics will yield better QEP selection and thus improve query performance [3, 6]. A QEP is not executed as a whole at one time but is divided and executed part by part. After completion of one part of the QEP, the data statistics are updated so that the remaining part of the QEP can be re-optimized using the new data statistics. The new QEP is expected to change, as it either contains different operators or is scheduled to be executed on different machines. Such changes result in a different way of executing the QEP, which brings different response time and monetary cost. However, re-optimizing a QEP costs time overhead, which in turn produces extra monetary cost as well. Avoiding unnecessary re-optimization, that is, deciding whether or not a QEP should be re-optimized, is not an easy task. In [6], the authors manually set checkpoints between different operators of a QEP; re-optimizations at these checkpoints are necessary but still not accurately timed. The work [3] proposed a query optimization method where the query is re-optimized multiple times during its execution based on stages. Stages are formed automatically by the query optimizer, and operators that do not rely on the completion of others are grouped together. Every time one stage finishes, the query is forcibly re-optimized. We implemented this algorithm in our previous work [11]. However, our experimental studies show that after applying many re-optimizations, the QEP remains the same as the original one. This is because the stages are not aligned with the best timing to apply the re-optimization, and the unnecessary optimizations waste time and increase overhead. In this paper, we provide our vision of a method using machine learning techniques to predict whether a QEP should be re-optimized during query execution, so that the overall overhead of re-optimization is further reduced as unnecessary re-optimization is avoided more accurately.

The rest of the paper is organized as follows. Section 2 discusses the related works. Section 3 discusses the effects of query re-optimization. Section 4 presents how to predict query re-optimization by machine learning. Section 5 discusses the feature selection and machine learning model selection issues according to the status of our current work. Section 6 provides conclusions and future work.

2 RELATED WORK
There are several works [4-9] that study when a query should be re-optimized. Some of them are interactive, which means human input is required in the re-optimization process. [6] presents a mid-query re-optimization algorithm in which a few pointers are placed over different locations of the QEP, and whenever the query operators before a pointer finish, the query is re-optimized. These locations are chosen based on a set of rules built by the authors; locations that satisfy these rules indicate that re-optimization is worthwhile. For example, the re-optimization will take place before a Join operator. [7] introduces an algorithm for re-optimizing the schedule used to run different queries. The algorithm compares the distance between an initial schedule and an ideal schedule defined by multiple human-defined rules; whenever the distance exceeds some threshold, the schedule is re-optimized.
[4] also presents a mid-query re-optimization algorithm, in which a statistics-collect operator is introduced and placed at key points to collect updated data statistics during the query execution. These updated statistics are used to re-optimize the QEP itself and the memory allocation used to execute the query. The estimated execution time after re-optimization is compared to the estimated execution time of the same QEP before re-optimization; if the difference exceeds a manually set threshold, the re-optimization is conducted. However, as presented in our previous work [11], taking human input into the decision-making process greatly increases time overhead and introduces a source of unreliable accuracy. In [8], instead of using human input, reinforcement learning is used for the optimizer to decide which physical operators it should select (the optimizer's actions) based on the current data statistics (the optimizer's states). But still, inaccurate data statistics will result in sub-optimal selection of physical operators. [9] presents a query re-optimization algorithm that estimates the cost of the current query by looking at similar queries answered in the past. This method uses previously known column statistics to predict unknown column statistics using a joint probability density function. However, a matrix inversion is required to calculate the cost for the unknown column statistics, and applying such an operation online would incur a large time overhead. [5] presents an algorithm to adaptively optimize the QEP on cloud databases. This work assumes that users are willing to accept incomplete query results if the cost is under the users' budget. However, the optimization in this work considers either time cost or monetary cost, not both.

All these existing works, while using query re-optimization to improve query response time, are not concerned with monetary costs at the same time or are not designed for cloud databases. To fill these gaps, the approach we envision in this paper emphasizes addressing a number of important issues. First, the approach utilizes query re-optimization to reduce not only query response time but also monetary cost for cloud databases. Second, to achieve greater accuracy in the timing of re-optimization and to do so autonomously, the approach uses a machine learning model trained on historical queries to predict when to re-optimize, instead of deciding manually. Third, to reduce the time overhead, the approach consists of an offline and an online process: the offline process takes the majority of the time to build the machine learning model, while the online process only applies the model to do the prediction, which limits the time overhead. Fourth, the approach always uses the actual data statistics so that the optimizer is always able to select the correct QEP.

Figure 1: The query re-optimization procedure.
3 QUERY RE-OPTIMIZATIONS
In a traditional DBMS, a query is first converted to multiple QEPs. All the QEPs are then evaluated by the query optimizer to obtain their time costs. In some systems, additional costs like monetary cost, network bandwidth, and hardware utilization are also calculated. Finally, the optimizer chooses the best QEP and sends it to the execution engine. Unlike a traditional DBMS, we apply mid-query re-optimization (Figure 1), which means a query plan can be optimized multiple times during execution. In a traditional DBMS, the QEPs are optimized using cost estimations that are evaluated based on data statistics such as the cardinality of a column, the number of rows in a table, etc. However, such data statistics are often not accurate when the estimation is evaluated. Thus, the QEP generated may not be the most efficient. Re-optimizing the QEP during query execution to employ more accurate statistics will yield better QEP selection, and thus will improve query performance. In our previous work [10], we discovered that query re-optimization enables the optimizer to select better physical operators to execute the QEP and better hardware configurations to execute it (such as the number and the type of containers). These optimizations are beneficial for improving the overall query execution time, the monetary cost, or both.

Figure 2 shows the result of executing the query from our previous work [10]. In the experiment query, there is a Join of two subqueries, and the data size of each subquery is unknown. We want to see how the physical operator of this Join changes depending on the data size of the subquery. So, we purposely made the data size of the right side of the Join operator small enough to fit in the cache. As a consequence, the Shuffle Join operator is changed to the Broadcast Join operator only after re-optimization. Broadcast Join executes around 40% faster than Shuffle Join in our environment. The results show that the overall time cost using re-optimization has approximately a 20% improvement on average over no re-optimization, while the monetary costs of the two approaches were close, with only a 4% difference. This increase in monetary cost is due to the fact that the more powerful containers being selected are the containers which charge more hourly.
Figure 2: Impact of physical operators on time and monetary cost for the execution of Query 2 (response time in seconds and monetary cost in units of $0.001, Shuffle Join vs. Broadcast Join with re-optimization).
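For illustration, the staged procedure of Figure 1 can be sketched as the following per-stage loop, in which the remaining plan is unconditionally re-optimized after every stage, as in [3, 11]. This is only a sketch in Python; the optimizer, executor, and statistics-store interfaces are hypothetical placeholders and do not correspond to the API of any specific system.

def run_query_with_forced_reoptimization(query, optimizer, executor, stats_store):
    # Generate the initial staged QEP from the currently available statistics.
    plan = optimizer.optimize(query, stats_store.current())
    stage_no = 0
    while stage_no < len(plan.stages):
        # Execute one stage and collect runtime statistics from it.
        result = executor.run(plan.stages[stage_no])
        stats_store.update(result.runtime_statistics)
        # Re-optimize the remaining stages with the updated statistics,
        # regardless of whether the plan would actually change.
        plan = optimizer.reoptimize(plan, stats_store.current(),
                                    next_stage=stage_no + 1)
        stage_no += 1
    return executor.final_result(plan)

As Section 4 discusses, many of these forced re-optimizations leave the plan unchanged, which is the overhead our prediction model aims to avoid.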
4 MACHINE LEARNING BASED QUERY RE-OPTIMIZATION DECISION PROCESS

4.1 Overview
Though we find that re-optimization improves query response time, the re-optimization itself increases the overhead. Besides that, not all re-optimizations are effective for the QEP; that is, some re-optimizations may yield no changes. These re-optimizations are unnecessary and increase overhead. In order to avoid this issue, we envision a method using a machine learning model built on past data (training data) to predict whether or not a future re-optimization would be useful. If the prediction indicates that the re-optimization will have no effect on the QEP, then the query will not be re-optimized.

4.2 Architecture
Figure 3 shows the architecture of how the training data is collected. First, we collect the training data by running random queries on our current system and observing the data statistics. This way the prediction model can be applied to all queries. If re-optimization is intended only for the most costly or most representative queries, then in this first step the training data should be collected from running only random but most costly/representative queries.

Figure 3: The procedure of collecting training data.

After a query is submitted, we record the current data statistics in the system; these statistics are called Stat_curr. Then, the query is sent to the optimizer and an optimal QEP is generated by the optimizer of a traditional DBMS. This QEP includes the stage information and the nodes on which these stages will be executed. Figure 4 shows an example of a QEP generated by the query optimizer for the following query:

SELECT Department, COUNT(Name)
FROM STUDENT
WHERE Grade <= 'C'
GROUP BY Department;

Figure 4: A QEP is divided into different stages after being compiled from the query.

In Figure 4, TS, SOR, FIL and AGG stand for the TableScan, Sort, Filter and Aggregate operators, respectively. The subscripts distinguish the same operators that are executed in parallel on different data. After that, Stage 1 is sent to the query execution engine. During the execution, we update the data statistics using the method described in [3]; we call these updated statistics Stat_update. Since these statistics are collected from the actual running query, Stat_update is more accurate than Stat_curr, which is obtained from the estimation. The difference between Stat_update and Stat_curr, called Stat_diff, is then used in the machine learning model to predict whether the re-optimization is effective. For example, if the current selectivity and the updated selectivity of column A are 0.5 and 0.1, respectively, then the difference 0.4 is added as one feature of the training dataset. This process is applied to all the features. The selected features are shown in Table 1.

Table 1: List of Features and their types
DIFF_SELECTIVITY(C1), DIFF_SELECTIVITY(C2), ..., DIFF_SELECTIVITY(Cn)
DIFF_NDV(C1), DIFF_NDV(C2), ..., DIFF_NDV(Cn)
DIFF_HISTOGRAM(C1), DIFF_HISTOGRAM(C2), ..., DIFF_HISTOGRAM(Cn)
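To make the construction of Stat_diff concrete, the following sketch builds one feature vector of the form listed in Table 1 from Stat_curr and Stat_update. The dictionary layout, the histogram difference measure, and the column names are illustrative assumptions, not the implementation of our system.

def build_feature_vector(stat_curr, stat_update, columns):
    # Return one training example: the per-column differences between the
    # estimated statistics (Stat_curr) and the observed statistics (Stat_update).
    features = {}
    for col in columns:
        features["DIFF_SELECTIVITY(%s)" % col] = abs(
            stat_update[col]["selectivity"] - stat_curr[col]["selectivity"])
        features["DIFF_NDV(%s)" % col] = abs(
            stat_update[col]["ndv"] - stat_curr[col]["ndv"])
        # One simple histogram difference: summed absolute difference of the
        # per-bucket frequencies (other distance measures are possible).
        features["DIFF_HISTOGRAM(%s)" % col] = sum(
            abs(u - c) for u, c in zip(stat_update[col]["histogram"],
                                       stat_curr[col]["histogram"]))
    return features

# Example matching the text: the selectivity of column A is estimated as 0.5
# but observed as 0.1, so DIFF_SELECTIVITY(A) = 0.4 becomes one feature value.
stat_curr = {"A": {"selectivity": 0.5, "ndv": 100, "histogram": [25, 25, 25, 25]}}
stat_update = {"A": {"selectivity": 0.1, "ndv": 40, "histogram": [70, 10, 10, 10]}}
print(build_feature_vector(stat_curr, stat_update, ["A"]))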
If the re-optimization is predicted to be effective, the QEP is then re-optimized using the updated data statistics. Following this, the next stage (Stage 2) is executed based on the new QEP, and the process is repeated for the rest of the stages. In this example, Stage 2 is possibly changed. At this point, Stage 2 after the re-optimization is compared to Stage 2 before the re-optimization to observe any potential changes. We define that a QEP has changed if one of the following three aspects occurs: 1) changes in the physical operator types; 2) changes in the number of containers; or 3) changes in the types of containers.
A change in the physical operator types means that there exists a physical operator in the current QEP that is different from the physical operators in the previous QEP. For example, in our previous experiment, the change of the physical operator from Shuffle Join to Broadcast Join is a change in the physical operator types. This kind of change highly influences query execution time. Thus, if such a change is detected in the QEP after a re-optimization, the re-optimization was probably effective, and re-optimization will be applied when a similar situation is encountered.
A change in the number of containers means that the total number of containers used to execute the current QEP is different from that of the previous QEP. Such changes are also called changes in the degree of parallelism. For example, the TableScan operators are distributed over four containers before the re-optimization and over three containers after the re-optimization. This kind of change highly influences the monetary cost of query execution. Thus, such a re-optimization becomes useful if such changes are detected. Similarly, a change in the types of containers means the QEP after the re-optimization is executed on different types of containers, either more powerful ones or weaker ones. Detecting such changes may influence the monetary cost as well.
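For labeling the training data, a re-optimization can be marked effective ("YES") when any of these three kinds of change is detected between the stage before and after re-optimization. The sketch below illustrates one possible check; the Stage structure and the container type labels are hypothetical.

from dataclasses import dataclass

@dataclass
class Stage:
    operator_types: list       # e.g. ["TableScan", "ShuffleJoin"]
    num_containers: int        # degree of parallelism
    container_types: list      # hypothetical container type labels

def reoptimization_effective(before, after):
    # Label "YES" (True) if the re-optimized stage differs in any of the
    # three aspects defined above, otherwise label "NO" (False).
    if set(before.operator_types) != set(after.operator_types):
        return True            # 1) physical operator types changed
    if before.num_containers != after.num_containers:
        return True            # 2) number of containers changed
    if sorted(before.container_types) != sorted(after.container_types):
        return True            # 3) types of containers changed
    return False

before = Stage(["TableScan", "ShuffleJoin"], 4, ["standard"] * 4)
after = Stage(["TableScan", "BroadcastJoin"], 3, ["high-memory"] * 3)
print(reoptimization_effective(before, after))   # True, so the label is "YES"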
4.3 Feature Selection
The three changes discussed in Section 4.2 usually occur whenever the estimated data size has also changed. This is because the optimizer uses these estimations to decide how to execute the query and how many containers should be used. Thus, in order to tell whether the re-optimization is effective, we use data features that are highly relevant to changes in the data size estimation. Assume that in the current DBMS there exist the columns C = {C1, C2, ..., Cn} over all the tables. The selectivity differences (DIFF_SELECTIVITY), the differences in the number of distinct values (DIFF_NDV), and the histogram differences (DIFF_HISTOGRAM) of each column are used as the data features in the training data used for prediction, as shown in Table 1, and the binary value YES/NO is used as the predicted class, where YES means the re-optimization is predicted to be useful and NO otherwise. Many works show that the selectivity, the number of distinct values, and the histogram influence the data size estimation [6, 12, 13]. Thus, differences in these three features result in changes to the data size estimation of intermediate results; hence, they are relevant for deciding the effectiveness of re-optimization.

In our preliminary work to test our vision, we collected training data by running 5000 random queries on a DBMS for 24 hours and recording the data for the above features. Then a prediction model was trained using these data and applied to a new query to predict whether or not re-optimization should be conducted after each stage of this query is executed. Figure 5 illustrates how this model is applied during the query execution. The query is first converted to a QEP and Stage 1 is submitted for execution. The prediction model is used to check whether or not re-optimization should be conducted; only a 'YES' prediction will trigger re-optimization. By doing this, the unnecessary re-optimization discussed previously is eliminated.

Figure 5: The procedure of applying the prediction model to decide re-optimization.
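The procedure of Figure 5 can be sketched as the following prediction-gated loop, reusing the hypothetical helpers and the build_feature_vector sketch introduced earlier; model stands for the trained classifier. Again, this is an illustrative sketch under assumed interfaces, not our actual implementation.

def run_query_with_predicted_reoptimization(query, optimizer, executor,
                                            stats_store, model, columns):
    plan = optimizer.optimize(query, stats_store.current())
    stage_no = 0
    while stage_no < len(plan.stages):
        stat_curr = stats_store.current()
        result = executor.run(plan.stages[stage_no])
        stats_store.update(result.runtime_statistics)
        stat_update = stats_store.current()
        # Build the Stat_diff vector for this stage boundary and re-optimize
        # only when the trained model predicts that the plan would change.
        features = build_feature_vector(stat_curr, stat_update, columns)
        if model.predict([list(features.values())])[0] == "YES":
            plan = optimizer.reoptimize(plan, stat_update,
                                        next_stage=stage_no + 1)
        stage_no += 1
    return executor.final_result(plan)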
5 DISCUSSION

5.1 Additional Feature Selection
Some additional features could also be added to the training model, although the utility of new features may not be immediately clear. For instance, we considered adding the total available CPU usage to the feature list. However, this metric appears not to be relevant for predicting re-optimization. Consider the statement "If the total available CPU is low, then the re-optimization will not be conducted." This statement is false, because when we collect the training data, re-optimizations are conducted regardless of how low the available CPU usage is. Re-optimization is only influenced by the estimation of data size; the CPU usage only determines which containers should be assigned. Thus, this attribute is not relevant.

5.2 Machine Learning Model Selection
We postulate that several machine learning techniques can be applied to predict whether or not re-optimization should be conducted. Pre-processing the data to reduce the number of features before feeding it to a machine learning model is necessary. In our model, the number of features is linearly related to the cumulative number of columns in all tables; if there is a high number of columns, there will be too many predictors. Principal Component Analysis (PCA) can be a useful option to find out which features are important.
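As an illustration only, scikit-learn's PCA could be applied to the DIFF_* feature matrix before training; the matrix below is random placeholder data used purely to show the shapes involved.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder training matrix: one row per observed stage boundary, one column
# per DIFF_* feature (3 features for each of, say, 40 table columns).
X = rng.normal(size=(5000, 3 * 40))

pca = PCA(n_components=0.95)        # keep components explaining 95% of the variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)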
Unsupervised learning such as clustering has been used in query optimization [14]. In our case, clustering can be used to determine whether re-optimization is useful by identifying a specific pattern of data statistics for the DBMS. There exist several popular clustering algorithms such as K-means and DBSCAN [1], which are used on a situational basis. In our study, we think K-means is suitable as the value of K is fixed: in order to predict whether or not re-optimization should be conducted, we can set K to two. Two clusters are formed, one for the "YES" cases and another for the "NO" cases. Similar data statistics collected after re-optimizations are measured by the normalized Euclidean distance function and are grouped together to form these two clusters. When new data statistics arrive, they are measured to determine to which cluster they belong; if they belong to the "YES" cluster, then a re-optimization is necessary.
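A minimal sketch of this two-cluster idea with scikit-learn's KMeans is shown below; the synthetic Stat_diff vectors and the majority-label mapping are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic Stat_diff vectors: small differences tend to leave the QEP
# unchanged ("NO"), large differences tend to change it ("YES").
X = np.vstack([rng.normal(0.0, 0.05, size=(200, 6)),
               rng.normal(0.6, 0.20, size=(200, 6))])
labels = np.array(["NO"] * 200 + ["YES"] * 200)

scaler = StandardScaler().fit(X)                 # normalize before Euclidean distance
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(X))

# Name each cluster after the majority label of its training members.
cluster_name = {}
for c in (0, 1):
    members, counts = np.unique(labels[kmeans.labels_ == c], return_counts=True)
    cluster_name[c] = members[np.argmax(counts)]

new_stat_diff = scaler.transform([[0.5] * 6])    # newly arrived statistics
print(cluster_name[kmeans.predict(new_stat_diff)[0]])   # expected: "YES"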
Besides unsupervised learning such as K-means clustering, supervised learning can also be used to predict whether or not re-optimization would be necessary. Supervised learning is not without issues, however; for instance, it is usually hard to get labeled data for training a model. In our case, labels can be easily obtained by observing the effect of re-optimization on past QEPs. As there is a fair number of supervised learning algorithms, several possible options exist for prediction. For instance, a binary decision tree classifier can be examined to classify whether or not re-optimization should be conducted. Each feature will be represented by a node that splits into either "YES" or "NO". Figure 6 shows an example of such a partial decision tree. When the final classification is "YES", a re-optimization is necessary.

Figure 6: An example of a decision tree showing that re-optimization should take place.
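As an illustration, the following sketch trains a binary decision tree on synthetic Stat_diff features with scikit-learn; the held-out split gives a rough view of the training/testing error gap mentioned in the overfitting discussion below. The data and the labeling rule are placeholders.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 6))                          # placeholder DIFF_* features
# Synthetic labeling rule: a large difference on any feature means the
# re-optimization would have changed the QEP.
y = np.where(np.abs(X).max(axis=1) > 1.5, "YES", "NO")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:", tree.score(X_train, y_train))   # fully grown tree: near 1.0
print("testing accuracy: ", tree.score(X_test, y_test))     # lower when the tree overfits
print(tree.predict([[0.1, -2.0, 0.0, 0.3, 0.0, 0.1]])[0])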
Also, the performance of several other supervised learning algorithms such as Neural Networks, Support Vector Machines, and Linear Regression was studied in [2] for cloud provisioning, but not for query re-optimization. We are currently investigating which machine learning models are most suitable for our study. An appropriate machine learning model should be highly accurate in prediction (i.e., have low error rates when applied to the training data and the test data). The training data must be selected carefully to avoid model overfitting, i.e., the case where the model provides a low training error rate but a high testing error rate. In addition, an appropriate model should incur a low overhead while being applied online for re-optimization prediction and should perform effectively in reducing the overall query response time and monetary cost.

6 CONCLUSION AND FUTURE WORK
In this paper, we provide our vision of a model that utilizes machine learning techniques to study previously observed data statistics from a running system in order to build a prediction model. This model is used to predict whether or not a query should be re-optimized, in order to avoid unnecessary re-optimizations. By doing this, a query's execution time and/or monetary cost can be further reduced. In future work, we plan to fully implement the approach we envisioned and use it to predict additional behaviors of a DBMS. For example, we would like to study methods for increasing or decreasing the number of executing containers based on current data statistics. We believe that predicting additional useful behaviors will make the query re-optimization process more efficient. In addition, we will also extend our approach to predict, independently of query stages, when a query re-optimization should be done, and to predict how many times such query re-optimizations should occur.

ACKNOWLEDGMENTS
This work is partially supported by the National Science Foundation Award No. 1349285.
REFERENCES
[1] G. Ahalya and H. M. Pandey. 2015. Data clustering approaches survey and analysis. In 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE). 532-537. https://doi.org/10.1109/ABLAZE.2015.7154919
[2] A. A. Bankole and S. A. Ajila. 2013. Predicting cloud resource provisioning using machine learning techniques. In 2013 26th IEEE Canadian Conference on Electrical and Computer Engineering (CCECE). 1-4. https://doi.org/10.1109/CCECE.2013.6567848
[3] Nicolas Bruno, Sapna Jain, and Jingren Zhou. 2013. Continuous Cloud-scale Query Optimization and Processing. Proc. VLDB Endow. 6, 11 (Aug. 2013), 961-972. https://doi.org/10.14778/2536222.2536223
[4] Navin Kabra and David J. DeWitt. 1998. Efficient Mid-query Re-optimization of Sub-optimal Query Execution Plans. SIGMOD Rec. 27, 2 (June 1998), 106-117. https://doi.org/10.1145/276305.276315
[5] Willis Lang, Rimma V. Nehme, and Ian Rae. 2015. Database Optimization in the Cloud: Where Costs, Partial Results, and Consumer Choice Meet. In CIDR.
[6] Volker Markl, Vijayshankar Raman, David Simmen, Guy Lohman, Hamid Pirahesh, and Miso Cilimdzic. 2004. Robust Query Processing Through Progressive Optimization. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD '04). ACM, New York, NY, USA, 659-670. https://doi.org/10.1145/1007568.1007642
[7] David Meignan. 2014. A Heuristic Approach to Schedule Reoptimization in the Context of Interactive Optimization. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation (GECCO '14). ACM, New York, NY, USA, 461-468. https://doi.org/10.1145/2576768.2598213
[8] Jennifer Ortiz, Magdalena Balazinska, Johannes Gehrke, and S. Sathiya Keerthi. 2018. Learning State Representations for Query Optimization with Deep Reinforcement Learning. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning (DEEM'18). ACM, New York, NY, USA, Article 4, 4 pages. https://doi.org/10.1145/3209889.3209890
[9] Yongjoo Park, Ahmad Shahab Tajik, Michael Cafarella, and Barzan Mozafari. 2017. Database Learning: Toward a Database That Becomes Smarter Every Time. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). ACM, New York, NY, USA, 587-602. https://doi.org/10.1145/3035918.3064013
[10] Chenxiao Wang, Zach Arani, Le Gruenwald, and Laurent d'Orazio. 2018. Adaptive Time, Monetary Cost Aware Query Optimization on Cloud Database Systems. In 2018 IEEE International Conference on Big Data (Big Data). 3374-3382. https://doi.org/10.1109/BigData.2018.8622401
[11] Chenxiao Wang, Jason Arenson, Florian Helff, Le Gruenwald, and Laurent d'Orazio. 2017. Improving user interaction in mobile-cloud database query processing. In Workshop on Scalable Cloud Data Management. Boston, United States. https://hal.archives-ouvertes.fr/hal-01640072
[12] Florian Wolf, Norman May, Paul R. Willems, and Kai-Uwe Sattler. 2018. On the Calculation of Optimality Ranges for Relational Query Execution Plans. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). ACM, New York, NY, USA, 663-675. https://doi.org/10.1145/3183713.3183742
[13] Wentao Wu, Jeffrey F. Naughton, and Harneet Singh. 2016. Sampling-Based Query Re-Optimization. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 1721-1736. https://doi.org/10.1145/2882903.2882914
[14] J. Zahir, A. El Qadi, and S. Mouline. 2014. Access plan recommendation: A clustering based approach using queries similarity. In 2014 Second World Conference on Complex Systems (WCCS). 55-60. https://doi.org/10.1109/ICoCS.2014.7060936