<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Toulouse: Learning Join Order Optimization Policies for Rule-based Data Engines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonios Karvelas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yannis Foufoulas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alkis Simitsis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yannis Ioannidis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Athena Research Center</institution>
          ,
          <addr-line>Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Athens</institution>
          ,
          <addr-line>Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In recent times, several research works have explored the idea of leveraging machine learning techniques to improve or even replace core components of traditional database architectures, such as the query optimizer and selectivity and cardinality cost estimators. These efforts often rely on existing, cost-based optimizers and cost models to avoid a cold-start, and build on top of the optimizer's decisions. In this paper, we investigate whether learning could also be beneficial in rule-based optimizers for known and unknown workloads alike. As a proof of concept, we use MonetDB, an open-source, column-store analytics data engine, and explore whether a learning model based on Graph Neural Networks that is trained on a cost-based engine, such as PostgreSQL, could improve the decisions of MonetDB's optimizer. Our experimental results reveal deficiencies in MonetDB's query execution plans, especially for queries with long chains of join operators, and potential opportunities in exploiting learning techniques.</p>
      </abstract>
      <kwd-group>
<kwd>Query processing</kwd>
        <kwd>Query optimization</kwd>
        <kwd>Join ordering</kwd>
        <kwd>Machine learning</kwd>
        <kwd>Graph neural networks</kwd>
        <kwd>Reinforcement learning</kwd>
        <kwd>Proximal policy optimization</kwd>
        <kwd>Cost-based optimization</kwd>
        <kwd>Rule-based optimization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Query optimization has been a long-standing challenge from the very early days of data management systems. Traditional optimization techniques to search the space of alternative query execution plans and find the most efficient one include various heuristic and cost-based approaches, rule-based strategies, randomized algorithms, and so on [1, 2].</p>
      <p>Cost-based optimizers depend on accurate cardinality and selectivity estimates, especially for intermediate results, to produce reasonable plans. In general, these techniques are successful given specific assumptions such as attribute value independence, uniformity, data independence, etc. When these assumptions cannot be met, the optimizers typically fall back to an educated guess. In practice, getting such accurate estimates is a significant challenge, and the challenge is more evident in queries having a long chain of join operators, where one should decide in which order the joins should be computed, forming the so-called join-ordering problem. On the other hand, rule-based techniques rely on a set of rules to produce query execution plans, usually with a single pass of the SQL statement. The rules are well-crafted based on experience, theory, and common practice, but they often miss optimization opportunities as they tend to overlook data properties.</p>
      <p>A recent line of research exploits the recent advances in machine learning (ML) technology and explores the potential of so-called learning query optimizers, which aim at learning the behavior of query operators and query patterns over time and tend to learn also from their previous decisions (i.e., execution plans). In the context of learning query optimization, earlier research efforts focus mainly on the problems of join ordering and end-to-end optimization. However, the vast majority of past work focuses on cost-based optimizers, relying on their choices and respective cost models to avoid a cold-start and also to train the proposed models.</p>
      <p>In this work, we present Toulouse, our optimization approach for rule-based data engines, and investigate the feasibility and potential benefit of applying learning techniques to a rule-based query optimizer. Our initial focus is on the join ordering optimization problem. Generally speaking, without a cost model and cost estimates assigned to query plans and query operators, the problem formulation should rely on optimization patterns for the query workloads at hand.</p>
      <p>As a proof of concept, we used MonetDB [3] in our investigation. A query in MonetDB undergoes an optimizer pipeline, which examines and applies a series of optimizer steps [4]. Our preliminary investigation showed that a featurization scheme based solely on the optimizer steps is not sufficient to produce useful learning patterns. Subsequently, we tried to leverage a model trained on a cost-based optimizer and obtained three interesting results. First, MonetDB's optimizer presents significant limitations when it comes to queries having long chains of join operators and thus, it seems that it does not handle the join ordering problem very effectively. Second, a model trained on a cost-based optimizer following a supervised approach behaves reasonably well when used in MonetDB, without requiring significant fine-tuning or massaging of MonetDB's execution runtime and optimizer pipeline. We find this result very interesting, as the alternative would be to develop a brand new rule-based optimizer step and add it to MonetDB's optimizer pipeline. Third, the approach can be further extended to support query and workload agnostic optimization using a more elaborate, reinforcement learning approach.</p>
      <p>To the best of our knowledge, our work is the first attempt to employ learning in rule-based database optimization. Our contributions can be summarized as follows:</p>
      <p>• We argue that training a model using a set of query optimization rules does not seem as effective as relying on a cost model.</p>
      <p>• We show that applying a model trained on a cost-based optimizer to a rule-based system does seem to be effective, especially if the underlying system is not optimized for join ordering.</p>
      <p>• We present a simple, supervised learning approach able to transfer effective schema and workload specific optimization policies to a rule-based system.</p>
      <p>• We present a more elaborate, reinforcement learning approach that transfers effective schema and workload agnostic optimization policies to a rule-based system.</p>
      <p>Outline. Section 2 presents related efforts on learning query optimization. Section 3 describes the main principles and components of Toulouse. Section 4 presents our experimental evaluation. Finally, Section 5 summarizes our findings.</p>
      <p>DataPlat'23: 2nd International Workshop on Data Platform Design, Management, and Optimization, March 28, 2023, Ioannina, Greece. sdi1600060@di.uoa.gr (A. Karvelas); johnfouf@di.uoa.gr (Y. Foufoulas); alkis@athenarc.gr (A. Simitsis); yannis@di.uoa.gr (Y. Ioannidis). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>There are several approaches to learning query optimization [5]. These include system frameworks, e.g., SageDB [6], that aim at replacing core database components, such as data structures, indices, and query execution, with learned components.</p>
      <p>Other approaches focus on the join ordering problem. DQ [7] and ReJOIN [8] combine reinforcement learning (RL) with a human-engineered cost model to automatically learn search strategies to navigate the space of possible join orderings. RTOS [9] extends these by employing a Tree-LSTM to compute query cost. SkinnerDB [10] uses RL to learn optimal join orders during query execution and enables fast join order switching. AlphaJoin [11] uses Monte Carlo Tree Search (MCTS) for join ordering and an Adaptive Decision Network (ADN) to choose between plans produced by AlphaJoin (for long queries) and PostgreSQL (for short queries).</p>
      <p>There are also attempts at end-to-end learning optimizers. Neo [12] replaces most traditional optimizer components with ML models and deep neural networks. Bao [13] follows a schema and data agnostic approach to learning optimization. It runs a query with a predefined set of hints and keeps the plan ranked first by a tree convolutional neural network. Microlearner [14] divides a complex cloud workload into a hierarchy of subsets, using them to learn micro-models independently and in parallel.</p>
      <p>These methods focus on cost-based optimizers. Although several of these techniques could also be used in our work, we consider them complementary to our goal: to explore whether a learning optimization approach could be beneficial for rule-based systems.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The Toulouse approach</title>
      <p>In this section, we present our attempts towards learning optimization for rule-based query optimization. First, we report on our investigation toward building a learning model directly from query optimization rules. Next, we present a different direction: employ a simple, supervised approach to train a model on a cost-based system and apply it to a rule-based system. Finally, we present a more elaborate, reinforcement learning approach that serves reasonably well previously unseen (i.e., not included in the model training) query workloads.</p>
      <sec id="sec-3-1">
        <title>3.1. Learning optimization rules</title>
        <p>Our first attempt was to learn directly from query optimization rules. Interestingly, this attempt did not produce very positive results.</p>
        <p>Under the hood, MonetDB uses a low level intermediate language called MAL (MonetDB Assembly Language) that encodes the execution complexity of a SQL statement. Roughly speaking, a MAL program is equivalent to a query plan. Instead of using a single optimizer, MonetDB employs a collection of 20+ query optimizer transformers, each responsible for a single task, such as removing scalar expressions or aliases, avoiding multiple calculations of the same operator, range propagation, code parallelization, resource usage optimization, garbage collection, operator rewriting, caching aggregation results, dead code removal, etc. [15]. There is also an optimizer that deals with join paths, i.e., looking for join operators and cascading them into multiple join paths, and considers strategies such as materialization of common subpaths. Additional optimizers can be implemented and added as needed.</p>
        <p>Hence, query optimization in MonetDB is realized as a linear sequence of MAL transformations (optimizers) forming an optimizer pipeline. An optimizer may be called multiple times in the same optimizer pipeline. MonetDB employs a cost model that to date does not use any statistics on the actual data, but relies on the size of rows provided as an input to an optimizer. We found that this feature is not very mature yet and it does not seem to expose the costs to the runtime.</p>
        <p>Inspired by Bao [13], which gets training data by running a query with various hint sets, our first attempt was to explore alternative variations of optimizer pipelines and employ both the optimizers and various common optimizer pipeline patterns as features in our featurization scheme. This can be realized easily, as MonetDB offers the functionality to enable or disable optimizers via a 'set optimizer' command. On the other hand, there are a few constraints to consider. First, there are dependencies among the optimizers; e.g., static evaluation may happen only after constants propagation. In addition, there is a notion of the minimal optimizer pipeline, which specifies a number of absolutely necessary optimizers to run. These constraints limit significantly the degrees of freedom we can consider.</p>
        <p>Our modeling and design efforts, compared to the out-of-the-box MonetDB optimization strategies, revealed limited benefits in generic optimizations for simple SPJ queries including aggregation and sort operations, but the results on join ordering were rather disappointing, as the improvements seemed to be sporadic and random, which renders this path neither robust nor practical.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Employ a cost-based learning model</title>
        <p>We explored next the idea of developing a model based on a cost-based query optimizer and investigating whether this would be effective on a rule-based system. This could be seen as a type of a transfer learning approach. In our current implementation, we used PostgreSQL as the reference cost-based system and MonetDB as the target rule-based system. Hence, our approach could be summarized as 'develop (train) a model that learns how the PostgreSQL optimizer works (with its successes and mistakes) and employ it (inference) on MonetDB, bypassing its optimization choices'.</p>
        <p>Several available methods in the literature are bound to a specific schema and fail to generalize well and/or require expensive training. Still, previous approaches following a graph neural network (GNN) modeling (without this being the only good solution) provide promising results [5]. Inspired by this observation, Toulouse employs a GNN approach to collect structural information about all possible join decisions and learn how to pick the best one in each case, or at least try to approximate it and learn more about the problem through this process.</p>
        <p>Next, we present the two techniques developed in Toulouse, a supervised approach and a reinforcement learning approach. The first relies heavily on training data, and thus it is amenable to specific schema and query workloads. The latter aims at providing a schema and query agnostic solution.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3.2.1. Preliminaries: GNN to the rescue</title>
      <p>Graph Neural Networks (GNN) are a relatively new class of deep learning techniques that aim at solving the problem of having structural data of variable complexity, by aggregating data from immediate neighbors. GNNs have been effective for problems that can be described as graphs: entities that have variable size and structure and where the interactions between them are as important as their features. They can be best understood as message passing networks, where every node exchanges data with its directly connected nodes (see also Figure 1).</p>
      <p>Figure 1: Example message passing network</p>
      <p>In a graph convolutional network (GCN), each layer computes X′ = D̂^(−1/2) Â D̂^(−1/2) X Θ, where the normalized sum calculates the means of the neighbors' features, X is the result of the previous level, Â = A + I is the adjacency matrix with inserted self-loops, D̂ with D̂_ii = Σ_j Â_ij is the diagonal degree matrix, and Θ is the layer's trainable weights.</p>
      <p>GCNII is a continuation of GCN with the added features of initial residual connections and identity mapping, which help tremendously with the problem of oversmoothing (i.e., the more levels in a GNN, the less expressive the later layers are). To support this functionality, we change the GCN formulation as follows: X′ = ((1−α) D̂^(−1/2) Â D̂^(−1/2) X + α X^(0)) ((1−β) I + β Θ), where X^(0) represents the initial features that get added to each layer.</p>
      <sec id="sec-4-1">
        <title>3.2.2. Join optimization as an MDP</title>
        <p>(a) An example query graph. (b) Applying a join decision.</p>
        <p>We model a query as a query graph, where every edge e is an implicit join predicate (Figure 2a). At every step of the process, we choose an edge e = (v1, v2) and the two connected nodes v1 and v2. We remove those two nodes from the graph, add a new node v12, and reconnect it with all the nodes previously connected to either v1 or v2 (Figure 2b).</p>
      </sec>
    </sec>
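    <p>The join step described above (choose an edge, merge its two endpoints into one node, and rewire their neighbors) can be sketched as follows. This is a minimal illustration with plain dictionaries, not the Toulouse implementation; the function name contract_join is ours.</p>

```python
def contract_join(adj, v1, v2):
    """Apply one join decision on a query graph.

    adj maps each node to the set of its neighbors (an undirected
    graph whose edges are implicit join predicates). The chosen
    edge (v1, v2) is contracted: v1 and v2 are removed, a merged
    node "v1+v2" is added and reconnected to every node that was
    previously adjacent to either endpoint.
    """
    merged = v1 + "+" + v2
    neighbors = (adj[v1] | adj[v2]) - {v1, v2}
    # keep all other nodes, dropping references to v1 and v2
    new_adj = {n: (nbrs - {v1, v2}) for n, nbrs in adj.items()
               if n not in (v1, v2)}
    for n in neighbors:
        new_adj[n].add(merged)
    new_adj[merged] = set(neighbors)
    return new_adj

# A 3-table chain query graph A-B-C; joining A and B first
# leaves a single edge between "A+B" and "C".
g = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
g = contract_join(g, "A", "B")
```

    <p>Repeating this contraction until one node remains yields a complete join order, which is exactly the sequence of decisions the learned policy has to make.</p>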
    <sec id="sec-5">
      <title>-</title>
      <p>Similar to earlier efforts (e.g., [7]), we formulate the join-ordering optimization problem as a Markov Decision Process (MDP). Every such step has a cost, and as we describe the process as an MDP, we can assume that it follows Bellman's Principle of Optimality; thus, we can approximate a function, or else a policy, that selects the best action at each step, assuming that all subsequent steps are also chosen optimally.</p>
      <p>Therefore, the problem we deal with is to find a sequence of joins in the query graph that minimizes the total cost, until, after considering all joins needed, there is only one node left. Formally, in a query graph G, with a cost model C, we search for the sequence of joins j1 ∘ j2 ∘ … ∘ jn with the minimum total cost, min_{j1,…,jn} Σ_{i=1}^{n} C(ji).</p>
    </sec>
    <sec id="sec-6">
      <title>3.2.3. A supervised approach</title>
      <p>As a first approach to the problem, Toulouse employs a supervised approach using a simple GNN architecture. For each node, we consider operator (scan or join) types, cardinality, and cost estimates. The architecture uses two Multilayer Perceptron (MLP) encoding layers and a number of GNN layers matching the typical diameter (i.e., the length of the longest shortest path between any two graph nodes) in realistic query graphs, which is also a value that seems to balance nicely the trade-off between oversmoothing and performance boosting.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2.4. A reinforcement learning approach</title>
      <p>Although our supervised approach provides a simple and fast solution, our investigation shows that it does not capture effectively the case of unknown query workloads, i.e., queries and schemas that have not been seen by the model before. Related efforts have dealt with the problem of cost estimation for unseen queries considering transfer-learning-inspired techniques, such as zero-shot learning (e.g., [17]). Although past work has considered zero-shot learning for learned cost prediction, we believe that the basic principles could also apply to the join-ordering problem. Towards this end, we investigate a more complex architecture that combines a Graph Convolutional Network (GCN) with Reinforcement Learning (RL), aiming at providing a more practical solution to the join-ordering problem and developing a model that could also cope with new query workloads.</p>
      <p>Featurization. As we aim at a query and schema agnostic approach, we choose to use as general features as possible. Hence, for each node, we consider:</p>
      <p>• cardinality of an operator (here, scan or join)</p>
      <p>• width; the estimated average width of the result (in bytes)</p>
      <p>• selectivity, for scan operators; for join operators, the average of the two subtrees (multiplying them is prohibited due to vanishing gradient issues)</p>
      <p>• total tuples; can be set to 0 for joins, or the averages (see selectivity above)</p>
      <p>• total pages</p>
      <p>• a boolean that is true for joins, false otherwise</p>
      <p>• a boolean capturing whether the node contributes in an aggregation</p>
      <p>We also consider a number of boolean features indicating whether the query contains specific predicate expressions, such as filterEQ, filterNEQ, filterLike, filterIs, filterIn, filterBetween, filterGTELTE, and so on.</p>
      <p>Finally, we complement our featurization scheme with features related to the edges of the graph, essentially capturing information about the join predicate they represent, and use statistics for the columns to be joined, such as:</p>
      <p>• index; whether an index exists for the column(s)</p>
      <p>• null_frac; the fraction of nulls in the column(s)</p>
      <p>• n_distinct; the number of distinct elements in the column(s)</p>
      <p>• correlation; a PostgreSQL metric showing the statistical correlation between physical row ordering and logical ordering of the column values, which is useful to determine whether, for example, an index scan can be efficient (similar info is captured in most modern databases)</p>
    </sec>
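    <p>The formal objective above (pick the join sequence j1 ∘ … ∘ jn that minimizes the summed cost) can be made concrete with a brute-force baseline. This is our own illustrative sketch with a toy, state-dependent cost function, not Toulouse's learned policy; it shows the quantity the policy approximates without enumeration.</p>

```python
from itertools import permutations

def best_join_sequence(joins, step_cost):
    """Brute-force the objective min over j1..jn of sum_i C(j_i):
    try every permutation of the required joins and keep the
    cheapest. step_cost(join, done) may depend on the joins already
    applied, mimicking the effect of intermediate-result sizes."""
    candidates = []
    for seq in permutations(joins):
        done, total = [], 0.0
        for j in seq:
            total += step_cost(j, done)
            done.append(j)
        candidates.append((total, seq))
    return min(candidates)

# Toy cost model (ours, for illustration): a join applied after a
# selective join is cheaper, so the ordering matters.
sizes = {("A", "B"): 1000.0, ("B", "C"): 10.0}
def step_cost(join, done):
    return sizes[join] / (10 ** len(done))

total, seq = best_join_sequence(list(sizes), step_cost)
# the selective (B, C) join is ordered first
```

    <p>Enumeration explodes factorially with the number of joins, which is precisely why a learned policy that picks one edge per step is attractive.</p>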
    <sec id="sec-14">
      <title>-</title>
      <p>Currently, Toulouse employs the PostgreSQL optimizer as a source of the cost/statistics estimates, but another approach could also be used, e.g., leveraging Bayesian Optimization to build performance models [18].</p>
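      <p>One way to obtain such per-node estimates from PostgreSQL is EXPLAIN (FORMAT JSON), which returns the optimizer's cardinality, width, and cost estimates without running the query. The sketch below only parses such a plan tree; the paper does not detail its extraction code, so the function name and the (commented) psycopg2 usage are our assumptions.</p>

```python
# With a live connection one could do, e.g. (assumption, not from
# the paper):
#   cur.execute("EXPLAIN (FORMAT JSON) " + sql)
#   plan = cur.fetchone()[0][0]["Plan"]

def plan_features(plan):
    """Flatten an EXPLAIN (FORMAT JSON) plan tree into per-node
    features: node type, estimated rows, estimated width (bytes),
    and total cost, i.e., the statistics Toulouse featurizes."""
    node = {
        "type": plan["Node Type"],
        "rows": plan["Plan Rows"],
        "width": plan["Plan Width"],
        "cost": plan["Total Cost"],
    }
    children = plan.get("Plans", [])
    return [node] + [f for c in children for f in plan_features(c)]

# A hand-written example in the shape PostgreSQL produces.
example = {
    "Node Type": "Hash Join", "Plan Rows": 42, "Plan Width": 16,
    "Total Cost": 35.5,
    "Plans": [
        {"Node Type": "Seq Scan", "Plan Rows": 100, "Plan Width": 8,
         "Total Cost": 11.0},
        {"Node Type": "Seq Scan", "Plan Rows": 50, "Plan Width": 8,
         "Total Cost": 10.5},
    ],
}
feats = plan_features(example)
```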
      <p>Architecture. In this approach, Toulouse employs a GCN architecture in combination with the popular Proximal Policy Optimization (PPO) algorithm. In particular, it uses multiple levels of GCNII, combines them, and passes them through two MLP heads (actor and value) that can be used for PPO. The value head represents the cost of the query so far and the actor head represents the probability of choosing a certain join at each step.</p>
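      <p>The actor/value split can be sketched in miniature. This is a deliberately framework-free illustration of how the actor head yields a categorical distribution over candidate joins, sampled during training and taken greedily at inference; the real model uses GCNII layers in PyTorch, and the names below are ours.</p>

```python
import math, random

def softmax(scores):
    """Turn actor-head scores into a categorical distribution
    over the candidate joins."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def pick_action(scores, training, rng=random):
    """During training, sample a join from the categorical
    distribution; during inference, take the argmax, as described
    for the actor head."""
    probs = softmax(scores)
    if training:
        return rng.choices(range(len(probs)), weights=probs)[0]
    return max(range(len(probs)), key=lambda i: probs[i])

# inference is deterministic: the highest-scoring join is chosen
assert pick_action([0.1, 2.0, -1.0], training=False) == 1
```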
      <p>For each step, we maintain two graphs, the one described above and its line graph. A line graph of an undirected graph G is another graph L(G) that is created as follows: for each edge in G, we make a vertex in L(G); for every two edges in G that have a vertex in common, we make an edge between their corresponding vertices in L(G) [19]. In other words, we convert every edge of G to a node in L(G) and every node of G to an edge in L(G). Therefore, we end up with every node carrying its own features (of the edge turned into a node) and the features of the corresponding adjacent nodes that in the line graph were turned into edges. We also concatenate global features, such as the number of nodes, the diameter of the graph, etc., that help enrich the graph representation. In addition, we calculate positional encodings with random walks [20] to encode positional information of graph nodes and make them more expressive. Nearby nodes have similar positional features and distant nodes have dissimilar positional features. Still, we keep the positional encoding separately from the features. Finally, we pass each node through the model shown in Figure 5.</p>
      <p>Figure 5: Network architecture of Toulouse RL approach</p>
      <p>The node features are passed through the layers of GCNII. We pool the sum of each separate layer and concatenate them at the end to compute the value head. The positional features are passed through separate GCNII layers and we keep only the last one, which we concatenate with the layer results (a.k.a. tensors). The layer results are used directly in the actor head and the pooled ones in the value head. The actor head ends up in a categorical distribution. The categorical distribution is used to sample join actions during training and to choose the best action with argmax during inference.</p>
      <p>Reward function. An important part of any reinforcement learning implementation is the reward function. For our model, we work as follows. For the training of our model we use a multiplicity of datasets (databases) and query workloads to achieve as broad and generic learning as possible. (We explain this in more detail in the next section.) Then, we choose random queries from all the available workloads (evenly distributed to avoid any bias toward a specific workload), create the graphs with the tables of all the schemata, add all implicit connections (join predicates), add also the join predicates that we get from equality propagation, and finally, we proceed to use the model for choosing the next best join. Hence, at each step we compute the reward as follows:</p>
      <p>cut_parameter = 10.0
for database, query in queries:
    base_cost = postgres_cost(query)
    while number_of_nodes &gt; 1:
        choose_join(query)
        cost = make_join(query)
        if cost &gt; base_cost * cut_parameter:
            reward = -10.0 * cut_parameter
            done = True
        else:
            reward = log(cost)
            reward = - (reward / reward_running_std)</p>
      <p>The cut_parameter is set at a starting value experimentally configured (e.g., 10-20) and then we slowly reduce it as the training progresses. This helps avoid bad plans diluting our training data: instead of continuing when we consider a join that costs much more than what the query optimizer (e.g., PostgreSQL's optimizer) can do by default, we penalize it appropriately using the cut_parameter and continue with the next query. Otherwise, we log the cost and use running statistics (Welford's algorithm) to normalize it. Another approach would be to penalize more the initial steps of a query, rather than the later ones; e.g., a non-performant join in the first or second steps is worse than a sub-optimal join toward the tail of the chain of join operators.</p>
      <p>It is worth mentioning two advantages we find in favor of the reinforcement learning approach. First, it shows a potential to alleviate a limiting data generation performance challenge we faced with the supervised approach; that is, queries with many joins (e.g., more than 10 joins) result in a huge number of potential join paths, which in turn increases significantly the training time. In addition, the reinforcement learning approach is amenable to online improvement, as we could fine-tune the produced model at runtime using real running times and potentially have a system that continuously learns as query workloads run. We consider this interesting optimization opportunity as future work.</p>
      <p>3.2.5. Application to the rule-based system</p>
      <p>To close the loop, the model that has been trained on a cost-based system is then used (inference) on a rule-based system. Given a new query to run on the rule-based system, Toulouse proposes a plan with a specific join ordering, which in turn executes on the rule-based system bypassing the system's optimizer.</p>
      <p>4. Evaluation</p>
      <p>4.1. Setup and implementation details</p>
      <p>In our experiments, we used the following setup.</p>
      <p>Hardware. Several rule-based systems are used either as server databases deployed on server machines or as embedded databases running on lightweight configurations. We investigated both scenarios using two setups: (S1) 20-core Intel(R) Xeon(R) CPU E5-2630 v4 2.20GHz (3.1GHz max), 142GB memory; and (S2) 4-core AMD Ryzen 5 2500U 2GHz (3.6GHz max), 8GB memory.</p>
      <p>Software. We used PyTorch (1.13.0) with PyTorch Geometric (2.1.0) for the GNNs, GATv2 for the graph layer (although we also successfully tried SageConv and GINConv), Python for SQL query rewriting, PostgreSQL (13.3), and MonetDB 11.41.11.</p>
      <p>Data and preparation. We tested two use case scenarios: (C1) training and inference using two workloads on the same schema, and (C2) training on a multiplicity of schemas and workloads, and inference on a new schema and a new workload. C1 resembles a common use case, which is typically seen in the experimental analysis of other learning optimization approaches. C2 represents a more challenging, but clearly more interesting case scenario.</p>
      <p>Use case C1. Data generation, training, and testing were performed on PostgreSQL on S1, which was tuned for performance. For the graph layer, 20 epochs were enough, with early stopping to the best approximate result. We used two workloads: (W1) the Join Order Benchmark (JOB) [16] queries and data, which is based on the IMDB dataset and comprises 113 queries containing a varying number of joins, ranging from joining 5 to 17 tables; and (W2) a workload we created comprising 1200 queries generated by randomizing various parameters and predicates of the JOB queries, effectively changing several properties such as operator selectivity, size of the intermediate results, etc. For the evaluation, we used W2 and did not consider any of the queries used in the training phase.</p>
      <p>Use case C2. We employed a variety of relational datasets and workloads as described in the DBGen benchmark [21] (available at https://github.com/DataManagementLab/zero-shot-cost-estimation). This collection comprises 21 datasets, including IMDB and IMDB_full, which are the base of the JOB benchmark. Initially, we trained the model on PostgreSQL as described in (C1) on all datasets except the two IMDB ones. Then, we tested the model on both PostgreSQL and MonetDB using the JOB workload (IMDB dataset). In this case, the JOB workload is unknown to the trained model –an experiment designed to test whether our approach could potentially serve as a schema and workload agnostic solution.</p>
      <p>Figure 6: Toulouse supervised: runtime perf and % improvement vs. PostgreSQL (psql) and MonetDB optimizers</p>
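      <p>The reward scheme sketched above (penalize a join exceeding cut_parameter times the reference cost, otherwise log the cost and normalize it by a running standard deviation via Welford's algorithm) can be written out as follows. Variable and class names are ours; this is an illustrative sketch, not the training code.</p>

```python
import math

class Welford:
    """Running mean/variance (Welford's algorithm), used to
    normalize the log-cost rewards online."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def add(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)
    def std(self):
        if self.n > 1:
            return math.sqrt(self.m2 / (self.n - 1))
        return 1.0  # no spread observed yet

def step_reward(cost, base_cost, stats, cut_parameter=10.0):
    """Reward for one join decision. A join costing more than
    cut_parameter times the reference (PostgreSQL) cost gets a
    large fixed penalty and ends the episode; otherwise the
    log-cost is normalized by the running std and negated, so
    cheaper joins earn higher reward."""
    if cost > base_cost * cut_parameter:
        return -10.0 * cut_parameter, True   # penalize and stop
    r = math.log(cost)
    stats.add(r)
    return -(r / stats.std()), False
```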
      <p>An illustration of the scenarios tested is as follows: Evaluation on MonetDB (mdb). Toulouse has
learnt eficient join paths from the psql optimizer,
workload database hardware so next, we investigated whether it could leverage
C1: supervised □ □ ■ □ ■ those on MonetDB and on two setups, S1 (same as
□C:2:saRmLe with training■ ■ : dife□ r■ent tha n□■ training psql) and S2 (new). Figures 6b and 6e show the avg
runtime performance with varying # and the
4.2. Model creation and evaluation % of improvement per query pattern, respectively,
for the S1 setup (Figures 6c and 6f show the same
Evaluating the supervised approach. First, we analysis on S2). Note that the more challenging
present the results of our experimental analysis for queries involve 7-10 tables. These run better on the
Toulouse’s supervised approach on the C1 scenario. S1 setup that has more memory and cpus. The less</p>
      <p>For training, we used 200,000 query graphs created by searching recursively in a subset of the JOB queries for valid join combinations, each of which creates a different query graph, hence forming scenarios that require join decisions. Every potential join has an approximate best query runtime –should the join happen– which could be used directly for regression. In our case, we set the join with the minimum runtime to 1 and everything else to 0 for binary classification. Next, we fed this training data to our GNN structure, using in our implementation the Adam optimizer with Binary Cross Entropy loss.</p>
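      <p>The labeling scheme and loss described above can be sketched as follows; the helper names are hypothetical (the actual Toulouse implementation feeds these targets to its GNN), but the logic matches the text: the candidate join with the lowest approximate runtime gets label 1, all others 0, and training minimizes binary cross entropy.</p>

```python
import math

def label_candidate_joins(candidate_runtimes):
    """Turn per-candidate best runtimes into binary targets:
    the fastest candidate join gets 1, all others get 0.
    (Candidates tied for the minimum would all be labeled 1.)"""
    best = min(candidate_runtimes)
    return [1 if t == best else 0 for t in candidate_runtimes]

def bce_loss(predicted_probs, labels):
    """Binary cross entropy over the model's per-join probabilities,
    the quantity optimized (e.g., with Adam) during training."""
    eps = 1e-12  # guard against log(0)
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(predicted_probs, labels)
    ) / len(labels)
```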
      <p>Evaluation on PostgreSQL (psql). We tested the queries with explicit join orders produced by Toulouse vs. queries optimized by the psql optimizer using the W2 workload. We ran each query 6 times and report the average results. Figure 6a shows the avg runtime performance of queries grouped by the # of tables they involve. Figure 6d shows the % of improvement per query pattern. The results show that although Toulouse has learnt the trend of the psql optimizer’s decisions, it still does not outperform psql. This is normal, as we did not try to beat psql.</p>
      <p>Evaluation on MonetDB (mdb). Toulouse has learnt efficient join paths from the psql optimizer, so next, we investigated whether it could leverage those on MonetDB and on two setups, S1 (same as psql) and S2 (new). Figures 6b and 6e show the avg runtime performance with varying # of tables and the % of improvement per query pattern, respectively, for the S1 setup (Figures 6c and 6f show the same analysis on S2). Note that the more challenging queries involve 7-10 tables. These run better on the S1 setup, which has more memory and cpus. The less resource-demanding queries involving 4-7 tables run better on the S2 setup, whose resources suffice for these queries, as it has more modern and faster cpus. The plans specified by Toulouse (with a few exceptions) improve the mdb performance by an average 42% on S1 and 36% on S2. Hence, join path knowledge seems to be effectively transferable. Looking deeper into the mdb plans, we realized that the mdb optimizer does not make optimal decisions for complex queries with many joins. This is an interesting finding towards future research in query optimization: an ML-based technique, trained on a different engine and setup, can effectively complement a traditional optimizer. The alternative, improving the optimizer codebase, would potentially be orders of magnitude more challenging to implement.</p>
      <p>The supervised approach, however, does not perform equally well on the use case C2 (results not shown here), which also adds a different workload to the equation. This motivated us to examine a more complex architecture, with a richer feature set, and a different design to cope with the challenge of optimizing a previously unseen workload.</p>
      <p>Figure 7: Toulouse RL: runtime perf and % improvement vs. PostgreSQL (psql) and MonetDB optimizers</p>
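      <p>Issuing an explicit join order against PostgreSQL, as in the evaluation above, can rely on a standard psql mechanism: with join_collapse_limit set to 1, the planner keeps explicit JOIN clauses in the order they are written. A minimal sketch (hypothetical helper, not the actual Toulouse code) rendering a left-deep chain chosen by a model:</p>

```python
def explicit_join_sql(select_list, base_table, ordered_joins):
    """Render a query whose join order PostgreSQL will keep as written.
    ordered_joins: (table, join_predicate) pairs in the order chosen
    by the learned model, forming a left-deep join chain."""
    sql = f"SELECT {select_list} FROM {base_table}"
    for table, predicate in ordered_joins:
        sql += f" JOIN {table} ON {predicate}"
    # join_collapse_limit = 1 stops psql from reordering explicit JOINs
    return f"SET join_collapse_limit = 1; {sql};"
```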
      <p>Evaluating the reinforcement learning approach. Motivated by the positive results of re-using a cost-based learning model on a rule-based engine, next we increased the challenge to work with an unknown schema and workload. Hence, our second line of experiments focuses on the effectiveness of our reinforcement learning approach. As this approach performs similarly to the supervised one on the C1 scenario, we omit the presentation of results for this case and focus on the more complex, C2 scenario.</p>
      <p>For starters, we trained the model on psql using a similar process as before. In more detail, we altered the query generation code of DBGen [21] to output implicit joins and used 18 different workloads from different database schemas, leaving out the two IMDB schemas completely during training and only performing inference on these two schemas. This way, the learning happens in a schema-agnostic manner, which allows us to test whether it could be transferable to different databases. In each epoch, we collect 6400 steps and train in batches of 64 for 3 mini-epochs.</p>
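      <p>The per-epoch update schedule just described (collect 6400 steps, then train in minibatches of 64 for 3 mini-epochs) is the familiar PPO-style pattern; a generic sketch of that schedule, not the exact Toulouse training loop, is:</p>

```python
import random

def minibatch_schedule(steps, batch_size=64, mini_epochs=3, seed=0):
    """Split the steps collected in one epoch into shuffled minibatches,
    re-visiting the whole buffer once per mini-epoch. Each returned
    batch would drive one gradient update."""
    rng = random.Random(seed)
    batches = []
    for _ in range(mini_epochs):
        order = list(range(len(steps)))
        rng.shuffle(order)  # fresh shuffle per mini-epoch
        for start in range(0, len(order), batch_size):
            batches.append([steps[i] for i in order[start:start + batch_size]])
    return batches

# 6400 collected steps => 100 minibatches per mini-epoch, 300 updates total
```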
      <p>Evaluation on PostgreSQL (psql). We tested the queries produced by Toulouse vs. queries optimized by the psql optimizer using the 113 queries of the JOB benchmark. We ran each query 6 times and report the average results. Figure 7a shows the avg runtime performance of queries grouped by the # of tables they involve. Figure 7d shows the % of improvement per query id. The results show that Toulouse manages to follow the general trend of the psql optimizer’s decisions. Observe that, for some queries, Toulouse RL performs worse than the supervised approach (compare Figures 7d and 6d). This is explained by the fact that in this scenario Toulouse deals with queries and a schema that it has never seen before. Still, it seems to be able to apply effective optimization strategies that it has learnt from the different workloads used in the training phase.</p>
      <p>Evaluation on MonetDB. Next, we evaluate how Toulouse could cope with an unknown workload on two different hardware setups, S1 (same as psql) and S2 (new). Figures 7b and 7e show the avg runtime performance with varying # of tables and the % of improvement per query id, respectively, for the S1 setup (Figures 7c and 7f show the same analysis on S2). Notably, Toulouse manages to optimize almost all queries by a significant fraction despite the fact that the inference is performed with a new workload, on a new database, and on a new hardware setup (S2). Figure 8 shows the average improvement in MonetDB, on both hardware setups, for a varying number of tables (i.e., # of tables) per query. On average, Toulouse improves query performance in MonetDB by 45.68% on S1 (server machine) and 24.4% on S2 (lower end machine).</p>
      <p>Discussion. We measured the robustness of Toulouse’s plans. The standard deviation of query performance is rather low for the entire workload, with the slight deviations effectively caused by typical runtime variability. Looking at the plans, we verified that Toulouse consistently produced the same join paths for each query, having all the other parameters of the experiment unchanged.</p>
      <p>Caveats. Although PostgreSQL is the predominant system used in the majority of related work, both for model training and as a baseline, clearly there are more advanced optimizers in other database systems. We believe that using a more advanced optimizer to train our models would only make our approach perform better, but we leave this as future work. The JOB benchmark (i.e., IMDB schema) has been used extensively in the literature as an analytics workload comprising queries with varying complexity of join chains. We find this a useful first step for our analysis; still, we plan to investigate more real-world workloads in the continuation of our work. Finally, our results reported here are based on our experimentation with MonetDB, an excellent open source analytics database system that comes with one of the most popular and successful rule-based optimizers. However, we plan to extend our investigation to other rule-based engines as well, in an effort to generalize our findings to rule-based systems beyond MonetDB.</p>
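      <p>The robustness measurement described in the discussion above (averaging the 6 executions of each query and checking their spread) can be sketched with a hypothetical helper as:</p>

```python
from statistics import mean, pstdev

def runtime_stats(runs_seconds):
    """Per-query robustness: average runtime over the repeated
    executions (each query was run 6 times) and its population
    standard deviation; a small deviation indicates stable plans."""
    return mean(runs_seconds), pstdev(runs_seconds)
```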
      <p>Figure 8: Toulouse RL: % improvement in MonetDB running on two hw setups S1 (server) and S2 (low end)</p>
      <p>5. Conclusions</p>
      <p>We presented an investigation on whether learning could improve query optimization, and in particular the join ordering, in rule-based systems. Our initial attempt to use rules and optimization strategies as features was not very successful. Still, as we show in this paper, applying a model trained on a cost-based optimizer to a rule-based system for known and unknown workloads alike shows potential and deserves additional research. Finally, our work shows that ML techniques could be a reasonable supplement to potential shortcomings of traditional optimizers without requiring changing their codebase.</p>
      <p>Acknowledgements. This work has been partially supported by EU’s Horizon 2020 projects INODE and HBP-SGA3 with grant agreement numbers 863410 and 945539, respectively.</p>
      <p>References</p>
      <p>[1] L. M. Haas, J. C. Freytag, G. M. Lohman, H. Pirahesh, Extensible query processing in Starburst, in: SIGMOD, 1989.</p>
      <p>[2] G. Graefe, W. J. McKenna, The Volcano optimizer generator: Extensibility and efficient search, in: ICDE, 1993.</p>
      <p>[3] S. Idreos, F. Groffen, N. Nes, S. Manegold, K. S. Mullender, M. L. Kersten, MonetDB: Two decades of research in column-oriented database architectures, IEEE Data Eng. Bull. 35 (2012) 40–45.</p>
      <p>[4] MonetDB Docs, Optimizer pipelines, 2022. https://www.monetdb.org/documentation-Sep2022/admin-guide/performance-tips/optimizer-pipelines/.</p>
      <p>[5] D. Tsesmelis, A. Simitsis, Database optimizers in the era of learning, in: ICDE, 2022.</p>
      <p>[6] T. Kraska, M. Alizadeh, A. Beutel, E. H. Chi, A. Kristo, G. Leclerc, S. Madden, H. Mao, V. Nathan, SageDB: A learned database system, in: CIDR, 2019.</p>
      <p>[7] S. Krishnan, Z. Yang, K. Goldberg, J. M. Hellerstein, I. Stoica, Learning to optimize join queries with deep reinforcement learning, CoRR abs/1808.03196, 2018.</p>
      <p>[8] R. Marcus, O. Papaemmanouil, Deep reinforcement learning for join order enumeration, in: aiDM@SIGMOD, 2018.</p>
      <p>[9] X. Yu, G. Li, C. Chai, N. Tang, Reinforcement learning with Tree-LSTM for join order selection, in: ICDE, 2020.</p>
      <p>[10] I. Trummer, J. Wang, D. Maram, S. Moseley, S. Jo, J. Antonakakis, SkinnerDB: Regret-bounded query evaluation via reinforcement learning, in: SIGMOD, 2019.</p>
      <p>[11] J. Zhang, AlphaJoin: Join order selection à la AlphaGo, in: PVLDB-PhD, 2020.</p>
      <p>[12] R. C. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, N. Tatbul, Neo: A learned query optimizer, in: PVLDB, 2019.</p>
      <p>[13] R. Marcus, P. Negi, H. Mao, N. Tatbul, M. Alizadeh, T. Kraska, Bao: Making learned query optimization practical, in: SIGMOD, 2021.</p>
      <p>[14] A. Jindal, S. Qiao, R. Sen, H. Patel, Microlearner: A fine-grained learning optimizer for big data workloads at Microsoft, in: ICDE, 2021.</p>
      <p>[15] MonetDB Docs, MAL optimizers, 2022. https://www.monetdb.org/documentation-Sep2022/dev-guide/monetdb-internals/mal-optimizers/.</p>
      <p>[16] V. Leis, A. Gubichev, A. Mirchev, P. A. Boncz, A. Kemper, T. Neumann, How good are query optimizers, really?, Proc. VLDB Endow. 9 (2015) 204–215.</p>
      <p>[17] B. Hilprecht, C. Binnig, One model to rule them all: Towards zero-shot learning for databases, in: CIDR, 2022.</p>
      <p>[18] O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, M. Zhang, CherryPick: Adaptively unearthing the best cloud configurations for big data analytics, in: USENIX NSDI, 2017.</p>
      <p>[19] F. Harary, R. Z. Norman, Some properties of line digraphs, Rendiconti del Circolo Matematico di Palermo 9 (1960) 161–168.</p>
      <p>[20] V. P. Dwivedi, A. T. Luu, T. Laurent, Y. Bengio, X. Bresson, Graph neural networks with learnable structural and positional representations, in: ICLR, OpenReview.net, 2022.</p>
      <p>[21] B. Hilprecht, C. Binnig, Zero-shot cost models for out-of-the-box learned cost prediction, Proc. VLDB Endow. 15 (2022) 2361–2374.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>