=Paper= {{Paper |id=Vol-2324/Paper26-CWang |storemode=property |title=A Vision of a Decisional Model for Re-optimizing Query Execution Plans Based on Machine Learning Techniques |pdfUrl=https://ceur-ws.org/Vol-2324/Paper26-CWang.pdf |volume=Vol-2324 |authors=Chenxiao Wang,Zachary Arani,Le Gruenwald,Laurent d'Orazio |dblpUrl=https://dblp.org/rec/conf/dolap/WangAGd19 }} ==A Vision of a Decisional Model for Re-optimizing Query Execution Plans Based on Machine Learning Techniques== https://ceur-ws.org/Vol-2324/Paper26-CWang.pdf
       A Vision of a Decisional Model for Re-optimizing Query
       Execution Plans Based on Machine Learning Techniques
                            Chenxiao Wang                                                           Zachary Arani
                   University of Oklahoma                                                    University of Oklahoma
          Norman, Oklahoma, United States of America                                Norman, Oklahoma, United States of America
                      chenxiao@ou.edu                                                          myrrhman@ou.edu

                              Le Gruenwald                                                        Laurent d’Orazio
                   University of Oklahoma                                                        Rennes 1 University
          Norman, Oklahoma, United States of America                                               Lannion, France
                    ggruenwald@ou.edu                                                     laurent.dorazio@univ-rennes1.edu

ABSTRACT                                                                        executing a QEP which brings different response time and mon-
Many of the existing cloud database query optimization algo-                    etary cost. However, re-optimizing a QEP costs time overhead
rithms target reducing the monetary cost paid to cloud service                  which in turn produces extra monetary cost as well. To avoid
providers in addition to query response time. These query opti-                 unnecessary re-optimization and decide whether or not a QEP
mization algorithms rely on an accurate cost estimation so that                 should be re-optimized is not an easy task. In the work [6], the
the optimal query execution plan (QEP) is selected. The cloud                   authors manually set check points between different operators of
environment is dynamic, meaning the hardware configuration,                     a QEP. Re-optimizations at these check points are necessary but
data usage, and workload allocations are continuously changing.                 still not accurate. The work [3] proposed a query optimization
These dynamic changes make an accurate query cost estimation                    method where the query is re-optimized multiple times during
difficult to obtain. Concurrently, the query execution plan must                its execution based on stages. Stages are formed automatically
be adjusted automatically to address these changes. In order to                 by the query optimizer and operators that do not rely on the
optimize the QEP with a more accurate cost estimation, the query                completion of others are grouped together. Every time one stage
needs to be optimized multiple times during execution. On top                   is finished, the query is re-optimized by force. We implemented
of this, the most updated estimation should be used for each                    this algorithm in our previous work [11]. However, our experi-
optimization. However, issues arise when deciding to pause the                  mental studies show that after applying many re-optimizations,
execution for minimum overhead. In this paper, we present our                   the QEP remains the same compared to the original one. This is
vision of a method that uses machine learning techniques to                     because the stages are not aligned with the best timing to apply
predict the best timings for optimization during execution.                     the re-optimization. This wastes time as unnecessary optimiza-
                                                                                tion increases overhead. In this paper, we provide our vision of a
KEYWORDS                                                                        method using machine learning techniques to predict whether a
                                                                                QEP should be optimized or not during the query execution, so
Query Optimization, Cloud Computing, Cloud databases
                                                                                that the overall overhead of re-optimization is further reduced
                                                                                as unnecessary re-optimization is avoided more accurately.
                                                                                    The rest of the paper is organized as follows. Section 2 dis-
1    INTRODUCTION
                                                                                cusses the related works. Section 3 discusses the effects of query
Many of the existing cloud database query optimization algo-                    re-optimization. Section 4 presents how to predict query re-
rithms target reducing the monetary cost paid to cloud service                  optimization by machine learning. Section 5 discusses the feature
providers in addition to query response time. The time and mon-                 selection and machine learning model selection issues according
etary costs needed to execute a query are estimated based on the                to the status of our current work. Section 6 provides conclusion
data statistics that the query optimizer has available when the                 and future work.
query optimization is performed. These statistics are often not
accurate, which may result in inaccurate estimates for the time
and monetary costs needed to execute the query [3, 12]. Thus,                   2   RELATED WORK
the QEP generated before the query is executed may not be the                   There are several works [4–9] that study when a query should be
best one. One approach can be applied to solve this issue. Adap-                re-optimized. Some of them are interactive, which means human
tively optimizing the QEP during the query execution to employ                  input is required in the re-optimization process. [6] presents a
more accurate statistics will yield better QEP selection, and thus              mid-query re-optimization algorithm. In this work, a few point-
will improve query performance [3, 6] . QEP is not executed as                  ers are put over different locations of the QEP and whenever
whole at one time but is divided and executed part by part. After               the query operators before the pointers finish, the query will
completion of one part of the QEP, data statistics are updated so               be re-optimized. These locations are chosen based on a set of
that the rest part of the QEP is re-optimized adopting the new                  rules built by the authors. The locations that satisfy these rules
data statistics. The new QEP is expected to be changed as it either             indicate that re-optimization is worthwhile. For example, the
contains different operators or is scheduled to be executed on                  re-optimization will take place before a Join operator. [7] intro-
different machines. Such changes result in a different way of                   duces an algorithm for re-optimizing the schedule to run different
                                                                                queries. In this work, the algorithm compares the distance be-
© 2019 Copyright held by the author(s). Published in the Workshop Proceedings   tween an initial schedule and the ideal schedule which is defined
of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on
CEUR-WS.org.                                                                    by multiple human defined rules. Whenever the distance exceeds
some threshold, the schedule is re-optimized. [4] also presents
                                                                                                          User
another mid-query re-optimization algorithm where a statistic-
collect operator is introduced to be placed at key points and used
to collect the updated data statistics during the query execution.                              Query and User Constraints
These updated statistics are used to re-optimize the QEP itself
and the memory allocation to execute the query. The estimated
execution time after re-optimization is compared to the estimated                                       Optimizer
execution time before the same QEP is re-optimized. If the dif-
ference in execution time exceeds a manually set threshold, the                               Staged Optimized Operator Tree         Estimation
re-optimization is conducted. However, as presented in our pre-
vious work [11], taking human input into the decision-making
process greatly increases time overhead and introduces a source                                         scheduler
                                                                                                                                                  Query Result
of unreliable accuracy. In [8], instead of using human input, rein-
forced learning is used for the optimizer to decide which physical
                                                                                                  One Stage of Operators
operators the optimizer should select (the optimizer’s actions)
based on the current data statistics (the optimizer’s states). But
                                                                         Updated Statistics
still, inaccurate data statistics will result in sub-optimal selection   and Constraints                Execution
of physical operators. [9] presents a query re-optimization algo-
rithm to estimate the cost of the current query by looking at the           Updated Op-
similar queries answered in the past. This method uses previous              erator Tree           Runtime Information
known column statistics to predict unknown column statistics
by using the joint probability density function. However, in this
work, a matrix inversion is required to calculate the cost of the                                       All stages
                                                                                                                               Yes
unknown column statistics. Applying such an operation online                                            finished?
would cost a lot of time overhead. [5] presents an algorithm to
adaptively optimize the QEP on cloud databases. This work as-
sumes that users are willing to accept incomplete query results                                            No
if the cost is under users’ budget. However, the optimization in
this work considers either time cost or monetary cost, not both.
All these existing works, while using query re-optimization to                                           Statistic
improve query response time, are not concerned about monetary                                            Updater
costs at the same time or are not designed for cloud databases.
    To fill the gaps in the existing works, the approach we envision
                                                                              Figure 1: The query re-optimization procedure.
in this paper emphasizes on addressing a number of important
issues. First, the approach utilizes query re-optimization not only
for query response time but also monetary costs at the same
time for cloud databases. Second, to achieve a greater accuracy in       when the estimation is evaluated. Thus, the QEP generated may
terms of the timing for re-optimization to occur and to do so au-        not be the most efficient. Re-optimizing the QEP during query
tonomously, the approach uses a machine learning model trained           execution to employ more accurate statistics will yield better
by historical queries to predict when to do re-optimization in-          QEP selection, and thus will improve query performance. In
stead of manually deciding. Third, to reduce the time overhead,          our previous work [10], we discover that query re-optimization
the approach consists of the offline and online processes. The           will enable the optimizer to select better physical operators to
offline process takes the majority of the time to build the ma-          execute the QEP and select better hardware configurations to
chine learning model, while the online process is only applying          execute the QEP (such as the number of containers and the type
the model to do the prediction, which limits the time overhead.          of containers). These optimizations are beneficial for improving
Fourth, the approach always uses the actual data statistics so that      either the overall query execution time or the monetary cost or
the optimizer is always able to select the correct QEP.                  both. Figure 2 shows the result of executing the query from our
                                                                         previous work [10]. In the experiment query, there is a Join of
                                                                         two subqueries. The data size of each subquery is unknown. We
3    QUERY RE-OPTIMIZATIONS                                              want to see how the physical operator of this Join will change
In a traditional DBMS, queries are first converted to multiple           depending on the data size of the subquery. So, we purposely
QEPs. Following this, all the QEPs are then evaluated by the query       made the data size of the right side of the Join operator small
optimizer to obtain the time costs. In some systems, additional          enough to fit in the cache. As a consequence, the Shuffle Join
costs like monetary cost, network bandwidth, and hardware uti-           operator will be changed to the Broadcast Join operator only
lization are also calculated. Finally, the optimizer chooses the         after re-optimization. Broadcast Join is executed around 40%
best QEP and sends it to the execution engine. However, unlike           faster than Shuffle Join in our environment. The results show
traditional DBMS, we apply mid-query re-optimization (Figure             the overall time cost using re-optimization has approximately
1). This means a query plan can be optimized for multiple times          20% improvement on average over no re-optimization, while the
during execution. In a traditional DBMS, the QEPs are optimized          monetary costs of the two approaches were close, with only a 4%
by the cost estimation(s) which are evaluated based on data sta-         difference. This increase of monetary cost is due to the fact that
tistics such as the cardinality of a column, number of rows in a         the more powerful containers being selected are the containers
table, etc. However, such data statistics are often not accurate         which charge more hourly.
                    50
                                  44.39

                    40
#participants




                                        35

                    30
                                                           23.825

                    20



                           Response Time (sec) Monetary Cost (x$0.001)

                         Shuffle Join     BroadCast Join with Re-Optimization

Figure 2: Impact of physical operators on time and mone-                        Figure 4: A QEP is divided into different stages after being
tary cost for the execution of Query 2.                                         compiled from the query.


                                                                                running only random but most costly/representative queries. Af-
                                                                                ter a query is submitted, we record the current data statistics in
                                                                                the system. These current statistics are called Statcurr. Then, the
                                                                                query is sent to the optimizer and an optimal QEP is generated
                                                                                by using the optimizer in a traditional DBMS. This QEP includes
                                                                                the stage information and on which nodes these stages will be
                                                                                executed. Figure 4 shows an example of a QEP generated by the
                                                                                query optimizer for the following query:
                                                                                SELECT Department , c o u n t ( Name )
                                                                                FROM STUDENT
                                                                                GROUP BY Department
                                                                                WHERE Grade <= 'C ' ;
                                                                                In Figure 4, TS, SOR, FIL and AGG stand for TableScan, Sort, Filter
                Figure 3: The procedure of collecting training data.
                                                                                and Aggregate operators, respectively. The subscripts distinguish
                                                                                the same operators that are executed in parallel on different
                                                                                data. After that, Stage 1 is sent to the query execution engine.
4  MACHINE LEARNING BASED QUERY                                                 During the execution, we update the data statistics using the
   RE-OPTIMIZATION DECISION PROCESS                                             method mentioned in the work [3] and we call these updated
4.1 Overview                                                                    statistics Statupdate. Since these statistics are collected from the
                                                                                actual running query, Statupdate is more accurate than Statcurr
Though we find that re-optimization improves the query re-
                                                                                which is obtained from the estimation. The difference between
sponse time, the re-optimization itself increases the overhead.
                                                                                the Statupdate and Statcurr, called Statdiff, is then used in the
Besides that, not all the re-optimizations are effective for the QEP,
                                                                                machine learning model to predict whether the re-optimization
that is, some re-optimizations may yield no changes. These re-
                                                                                is effective. For example, the current selectivity and the updated
optimizations are unnecessary and increase overhead. In order to
                                                                                selectivity of column A are 0.5 and 0.1, respectively, then the
avoid this issue, we envision a method using a machine learning
                                                                                difference 0.4 is added as one feature of the training dataset. This
model built on past data (training data) to predict whether or
                                                                                process is applied to all the features. The selected features are
not a future re-optimization would be useful. If the prediction
                                                                                shown in Table 1.
indicates that the re-optimization will have no effect on the QEP,
                                                                                   If the re-optimization is predicted to be effective, the QEP is
then the query will not be re-optimized.
                                                                                then re-optimized using the updated data statistics. Following
                                                                                this, the next stage (Stage 2) is executed based on the new QEP.
4.2                Architecture                                                 The process is then repeated for the rest of the stages. In this
Figure 3 shows the architecture of how the training data is col-                example, Stage 2 is possibly changed. At this point, Stage 2 after
lected. First, we collect the training data by running random                   the re-optimization is compared to the Stage 2 before the re-
queries on our current system and observing the data statistics.                optimization to observe any potential changes. We define that a
This way the prediction model can be applied to all queries. If re-             QEP is changed if one of the following three aspects occurs: 1)
optimization is only for most costly/most representative queries,               changes in the physical operator types; 2) changes in the number
then in this first step, the training data should be collected from             of containers; or 3) changes in the types of containers.
                                                                                    Table 1: List of Features and their types

                                                                                              DIFF_SELECTIVITY(C1)
                                                                                              DIFF_SELECTIVITY(C2)
                                                                                              DIFF_SELECTIVITY(Cn)
                                                                                              DIFF_NDV(C1)
                                                                                              DIFF_NDV(C2)
                                                                                              DIFF_NDV(Cn)
                                                                                              DIFF_HISTOGRAM(C1)
                                                                                              DIFF_HISTOGRAM(C2)
                                                                                              DIFF_HISTOGRAM(Cn)


Figure 5: The procedure of applying the prediction model                  a prediction model was trained using these data and applied
to decide re-optimization.                                                to a new query to predict whether or not the re-optimization
                                                                          should be conducted after each stage of this query is executed.
                                                                          Figure 5 illustrates how this model is applied during the query
   A change in the physical operator types means that if there            execution. The query is first converted to a QEP and Stage 1 is
exists any physical operator in the current QEP that is different         submitted for execution. The prediction model is used to check
from the physical operators in the previous QEP, then the QEP             whether or not the re-optimization should be conducted. Only a
has changed. For example, in our previous experiment, the change          ’YES’ prediction will trigger re-optimization. By doing this, the
in the physical operator from Shuffle Join to Broadcast Join is           unnecessary re-optimization discussed previously is eliminated.
defined as a change in the physical operator types. This change
highly influences query execution time. Thus, by detecting such           5 DISCUSSION
changes in QEP after a re-optimization, this re-optimization is
probably effective, and the re-optimization will be applied if the        5.1 Additional Feature Selection
similar situation is encountered.                                         Some additional features can also be added to the training model,
   A change in the number or types of containers means the total          although the utility of new features may not be immediately
number of containers used to execute the current QEP is differ-           clear. For instance, we considered adding the total available CPU
ent from that of the previous QEP. Such changes are also called           usage to the feature list. However, the metric of the total available
changes in the degree of parallelism. For example, the TableScan          CPU appears to not be relevant in predicting re-optimization.
containers are distributed into four containers before the re-            For instance, take the statement "If the total available CPU is
optimization and use three containers after the re-optimization.          low, then the re-optimization will not be conducted." The answer
This change highly influences the monetary cost of query execu-           to this statement is false as when we collect the training data,
tion. Thus, such re-optimization becomes useful if such changes           re-optimizations are conducted anyway no matter how low the
are detected. Similarly, a change in the types of containers means        available CPU usage is. Re-optimization is only influenced by the
the QEP after the re-optimization is executed on different types of       estimation of data size. The CPU usage only determines which
containers, either more powerful ones or weaker ones. Detecting           containers should be assigned. Thus, this attribute is not relevant.
such changes may influence the monetary cost as well.
                                                                          5.2    Machine Learning Model Selection
4.3    Feature Selection                                                  We postulate that several machine learning techniques can be
The three changes discussed in Section 4.2 usually occur when-            applied to predict whether or not re-optimization should be con-
ever the estimated data size has also changed. This is because the        ducted. Pre-processing data to reduce the number of features
optimizer uses these estimations to decide how to execute the             before processing with a machine learning model is necessary.
query and how many containers should be used. Thus, in order to           As in our model, the number of features is linearly related to the
tell whether the re-optimization is effective, we use data features       accumulative number of columns in all tables. If there is a high
that are highly relevant to the changes in data size estimation.          number of columns, there will be too many predictors. Principle
   Assume in the current DBMS, there exist the C={C 1 ,C 2 ,...,Cn }      Component Analysis (PCA) can be a useful option to find out
columns in all the tables. Selectivity differences (DIFF_SELECTIVITY ),   what features are important.
distinct values (DIFF_NVD), and histograms (DIFF_HISTOGRAM)                  Unsupervised learning such as clustering has been used in
of each column are used as the data features in the training data         query optimization [14]. In our case, clustering can be used to
used for prediction as shown in Table 1, and the binary value             determine if re-optimization is useful by identifying a specific
YES/NO is used as the predicted class, where YES means the re-            pattern of data statistics for the DBMS. There exist serval popular
optimization is predicted to be useful and NO otherwise. Many             clustering algorithms such as K-means and DBSCAN [1], which
works show that the selectivity, number of distinct values and the        are used on a situational basis. In our study, we think K-means is
histogram influence the data size estimation [6, 12, 13]. Thus, the       suitable as the value of K is fixed. In order to predict whether or
differences in these three features result in changes to data size        not re-optimization should be conducted, we can set the K value
estimation of intermediate results. Hence, they become relevant           to two. Two clusters are formed, one is for the "YES" cases and
in deciding the effectiveness of re-optimization                          another one is for the "NO" cases. Similar data statistics collected
   In our preliminary work to test our vision, we have collected          after re-optimizations are measured by the normalized Euclidean
the training data from running 5000 random queries on a DBMS              distance function and are grouped together to form these two
for 24 hours and collected the data for the above features. Then          clusters. When new data statistics arrive, they are measured to
                                                                         re-optimization should be done, and to predict how many times
                                                                         such query re-optimizations should occur.

                                                                         ACKNOWLEDGMENTS
                                                                         This work is partially supported by the National Science Founda-
                                                                         tion Award No. 1349285.

                                                                         REFERENCES
                                                                          [1] G. Ahalya and H. M. Pandey. 2015. Data clustering approaches survey and
                                                                              analysis. In 2015 International Conference on Futuristic Trends on Computational
                                                                              Analysis and Knowledge Management (ABLAZE). 532–537. https://doi.org/10.
                                                                              1109/ABLAZE.2015.7154919
                                                                          [2] A. A. Bankole and S. A. Ajila. 2013. Predicting cloud resource provisioning
                                                                              using machine learning techniques. In 2013 26th IEEE Canadian Conference on
                                                                              Electrical and Computer Engineering (CCECE). 1–4. https://doi.org/10.1109/
                                                                              CCECE.2013.6567848
Figure 6: An Example of a decision tree showing that re-                  [3] Nicolas Bruno, Sapna Jain, and Jingren Zhou. 2013. Continuous Cloud-scale
optimization should take place                                                Query Optimization and Processing. Proc. VLDB Endow. 6, 11 (Aug. 2013),
                                                                              961–972. https://doi.org/10.14778/2536222.2536223
                                                                          [4] Navin Kabra and David J. DeWitt. 1998. Efficient Mid-query Re-optimization of
                                                                              Sub-optimal Query Execution Plans. SIGMOD Rec. 27, 2 (June 1998), 106–117.
determine to which cluster they belong. If it belongs to the "YES"            https://doi.org/10.1145/276305.276315
                                                                          [5] Willis Lang, Rimma V. Nehme, and Ian Rae. 2015. Database Optimization in
cluster, then a re-optimization is necessary. Besides unsupervised            the Cloud: Where Costs, Partial Results, and Consumer Choice Meet. In CIDR.
learning such as the K-means clustering, supervised learning can          [6] Volker Markl, Vijayshankar Raman, David Simmen, Guy Lohman, Hamid
also be used to predict whether or not re-optimization would be               Pirahesh, and Miso Cilimdzic. 2004. Robust Query Processing Through Pro-
                                                                              gressive Optimization. In Proceedings of the 2004 ACM SIGMOD International
necessary. Supervised learning is not without issue, however. For             Conference on Management of Data (SIGMOD ’04). ACM, New York, NY, USA,
instance, it is usually hard to get labeled data for training a model.        659–670. https://doi.org/10.1145/1007568.1007642
In our case, labels can be easily obtained by observing the effect        [7] David Meignan. 2014. A Heuristic Approach to Schedule Reoptimization in
                                                                              the Context of Interactive Optimization. In Proceedings of the 2014 Annual
of re-optimization on past QEPs. As there are a fair amount of                Conference on Genetic and Evolutionary Computation (GECCO ’14). ACM, New
supervised learning algorithms, several possible options exist for            York, NY, USA, 461–468. https://doi.org/10.1145/2576768.2598213
                                                                          [8] Jennifer Ortiz, Magdalena Balazinska, Johannes Gehrke, and S. Sathiya Keerthi.
prediction. For instance, a binary classifier decision tree can be            2018. Learning State Representations for Query Optimization with Deep
examined to classify whether or not re-optimization should be                 Reinforcement Learning. In Proceedings of the Second Workshop on Data Man-
conducted. Each feature will be represented by a node split into              agement for End-To-End Machine Learning (DEEM’18). ACM, New York, NY,
                                                                              USA, Article 4, 4 pages. https://doi.org/10.1145/3209889.3209890
either "YES" or "NO". Figure 6 shows an example of this partial           [9] Yongjoo Park, Ahmad Shahab Tajik, Michael Cafarella, and Barzan Mozafari.
decision tree. When the final classification is "YES", then a re-             2017. Database Learning: Toward a Database That Becomes Smarter Every
optimization is necessary. Also, the performance of several other             Time. In Proceedings of the 2017 ACM International Conference on Management
                                                                              of Data (SIGMOD ’17). ACM, New York, NY, USA, 587–602. https://doi.org/10.
supervised learning algorithms like Neural Networks, Support                  1145/3035918.3064013
Vector Machines, and Linear Regression were studied in [2] for           [10] Chenxiao Wang, Zach Arani, Le Gruenwald, and Laurent d’Orazio. 2018.
                                                                              Adaptive Time, Monetary Cost Aware Query Optimization on Cloud Database
cloud provisioning, but not for query re-optimization. We are                 Systems. 3374–3382. https://doi.org/10.1109/BigData.2018.8622401
currently investigating which machine learning models are most           [11] Chenxiao Wang, Jason Arenson, Florian Helff, Le Gruenwald, and Laurent
suitable for our study. An appropriate machine learning model                 d’Orazio. 2017. Improving user interaction in mobile-cloud database query
                                                                              processing. In Workshop on Scalable Cloud Data Management. Boston, United
should be highly accurate in prediction (i.e. having low error rates          States. https://hal.archives-ouvertes.fr/hal-01640072
when applied to the training data and test data). The training data      [12] Florian Wolf, Norman May, Paul R. Willems, and Kai-Uwe Sattler. 2018. On
must be selected carefully to avoid the cases of model overfitting,           the Calculation of Optimality Ranges for Relational Query Execution Plans.
                                                                              In Proceedings of the 2018 International Conference on Management of Data
i.e., the cases where the model provides a low training error                 (SIGMOD ’18). ACM, New York, NY, USA, 663–675. https://doi.org/10.1145/
rate, but a high testing error rate. In addition, an appropriate              3183713.3183742
                                                                         [13] Wentao Wu, Jeffrey F. Naughton, and Harneet Singh. 2016. Sampling-Based
model should incur a low overhead while being applied online                  Query Re-Optimization. In Proceedings of the 2016 International Conference
for re-optimization prediction and should perform effectively in              on Management of Data (SIGMOD ’16). ACM, New York, NY, USA, 1721–1736.
reducing the overall query response time and monetary cost.                   https://doi.org/10.1145/2882903.2882914
                                                                         [14] J. Zahir, A. El Qadi, and S. Mouline. 2014. Access plan recommendation:
                                                                              A clustering based approach using queries similarity. In 2014 Second World
6    CONCLUSION AND FUTURE WORK                                               Conference on Complex Systems (WCCS). 55–60. https://doi.org/10.1109/ICoCS.
                                                                              2014.7060936
In this paper, we provide our vision of a model that utilizes ma-
chine learning techniques to study previously observed statistics
data from a running system in order to build a prediction model.
This model is used to predict whether or not a query should be
re-optimized in order to avoid unnecessary re-optimizations. By
doing this, a query’s execution time and/or monetary cost can be
further reduced. In future work, we plan to fully implement the
approach we envisioned and use it to predict additional behav-
iors of a DBMS. For example, we would like to study methods
for increasing or decreasing the number of executing contain-
ers based on current data statistics. We believe that predicting
additional useful behaviors will make the query re-optimization
process more efficient. In addition, we will also extend our ap-
proach to predict, independently of query stages, when a query