=Paper= {{Paper |id=Vol-3186/paper5 |storemode=property |title=Building Learned Federated Query Optimizers |pdfUrl=https://ceur-ws.org/Vol-3186/paper_5.pdf |volume=Vol-3186 |authors=Victor Giannakouris |dblpUrl=https://dblp.org/rec/conf/vldb/Giannakouris22 }} ==Building Learned Federated Query Optimizers== https://ceur-ws.org/Vol-3186/paper_5.pdf
Building Learned Federated Query Optimizers
Victor Giannakouris
Advised by: Immanuel Trummer
Cornell University, NY, USA


                                          Abstract
                                          The goal of this thesis is to introduce a new design for building federated query optimizers, based on machine learning. We
                                          propose a modular and flexible architecture, allowing a federated query optimizer to integrate with any database system that
                                          supports SQL, with close-to-zero engineering effort. By observing the performance of the external systems, our optimizer
                                          learns and builds cost models on-the-fly, enabling federated query optimization with negligible communication with the
                                          external systems. To demonstrate the potential of this research plan, we present a prototype of our federated query optimizer
                                          built on top of Spark SQL. Our implementation effectively accelerates federated queries, achieving up to 7.5x better query
                                          execution times compared to the vanilla implementation of Spark SQL.

                                          Keywords
                                          federated query processing, query optimization, machine learning,



1. Introduction                                                                                                    subquery pushdown, might also be applied. Finally, the
                                                                                                                   resulting query plan is executed in the federation engine.
In the complex infrastructure of the modern “big data”                                                                Ideally, an efficient optimizer should be able to gener-
ecosystem, data are usually distributed across multiple,                                                           ate more sophisticated federated query plans, like in tra-
diverse database systems. This has led to the develop-                                                             ditional databases. For example, instead of just pushing
ment of federated query engines that enable users to                                                               down selections to the external systems, it should con-
simultaneously query multiple databases, using a unified,                                                          sider pushing down larger parts of the federated query,
SQL-based interface. For instance, it is common for a                                                              like a join sub-tree. However, the heterogeneous na-
data scientist to issue a query that joins a ”small” table in                                                      ture and architectural differences of the external systems
a relational database with a bigger table that resides in                                                          make the task of deciding which parts of the query to push
a distributed data lake, like Amazon S31 or Delta Lake2 .                                                          down and where particularly complex. One of the main
A number of federation engines developed by some of                                                                challenges is the complexity of estimating the subquery
the largest database vendors, including Athena Federated                                                           execution cost in an external system. This is a tricky
Query3 , BigQuery4 , Spark SQL [1], Presto5 or Dremio6                                                             task, for a number of factors. For example, due to the
over the last years, provide clear evidence for the impor-                                                         lack of access to statistics in the remote database system,
tance of federated query engines. Taking into account                                                              estimating the local execution cost (in the external sys-
the heterogeneity of the underlying systems that a federa-                                                         tem) and result size is very challenging. Furthermore, the
tion engine integrates with, optimizing federated queries                                                          larger search space that derives from the additional plan-
is one of the most challenging tasks for these systems.                                                            ning decisions (i.e. where to execute each operator) due
Usually, a federated query engine follows a one-size-fits-                                                         to federated execution, makes optimization even more
all approach, to connect with as many external database                                                            challenging.
systems as possible. In summary, the query lifecycle in                                                               As a result, the majority of federation engines apply
most federation systems (e.g. Spark, Presto) is quite sim-                                                         very few rule-based optimizations (i.e., selection push-
ple. First, the federated engine transfers all the tables and                                                      down), discarding optimization opportunities that would
views included in the query from the external database                                                             leverage the full potential of the external systems. While
systems to federation execution engine through the net-                                                            there have been some attempts to develop wrappers [2, 3]
work. A number of specific rule-based optimizations, e.g.                                                          and custom cost models [4] to enable more fine-grained
                                                                                                                   federated query plan generation, these approaches face
Proceedings of the VLDB 2022 PhD Workshop, September 5, 2022.
                                                                                                                   the following challenges. First, developing custom wrap-
Sydney, Australia
Envelope-Open vg292@cornell.edu (V. Giannakouris)                                                                  pers and cost models for new systems is a tedious task,
                                    © 2022 Copyright for this paper by its authors. Use permitted under Creative
                                    Commons License Attribution 4.0 International (CC BY 4.0).
                                                                                                                   making the integration with new systems extremely dif-
 CEUR
 Workshop
 Proceedings         CEUR Workshop Proceedings (CEUR-WS.org)
               http://ceur-ws.org
               ISSN 1613-0073
                                                                                                                   ficult. Second, the communication with external systems
               1
                 https://aws.amazon.com/s3                                                                         to obtain cost estimates can slow optimization down,
               2
                 https://delta.io/
               3                                                                                                   something known as cost of costing [5]. Taking into ac-
                 https://aws.amazon.com/athena
               4
                 https://cloud.google.com/bigquery                                                                 count these challenges, we try to answer the following
               5
                 https://prestodb.io/                                                                              question: Can we develop a generic design for federated
               6
                 https://www.dremio.com
query optimizers that integrate with any external database     processor follows a similar design to a traditional, single-
system with 1. close-to-zero engineering effort and 2. mini-   engine query processor. The main extension is the ability
mal communication overhead during optimization?                to load data from external data sources. In this work,
   To answer this question, we present an engine-              our main focus will be the optimizer. Federated query
agnostic approach for federated query optimization, that       optimization is more complex than traditional query op-
copes with the heterogeneity of the underlying infras-         timization, as it has to tackle the challenges that arise
tructure. Using machine learning, our solution allows          from the heterogeneity of the external systems.
the optimizer to learn the performance of the external         Federated Query Processing. Federated query pro-
database systems, without relying on any system-specific       cessing is not a new problem, and there has been exten-
knowledge. Instead, it treats the external systems as black    sive work over the last few decades [7, 8] that aims at
boxes. The key idea behind our approach is the following.      optimizing queries across diverse data sources. For in-
In contrast to previous approaches that depend on cost         stance, Garlic [2] introduces a federated query optimizer
estimates obtained from external systems (e.g. by parsing      based on a cost-based, dynamic-programming approach
the output of EXPLAIN clause [6]), we use a unified query      that uses data wrappers in order to integrate, and exe-
vector model, to represent queries in the vector space.        cute queries across different data sources. Recent works,
Using this model, query trees can be simply transformed        like MuSQLE [4] and System-PV [3] present federated
to vectors, and fed to various machine learning models         (a.k.a. multi-engine) query optimization approaches that
in order to learn and predict the performance of the ex-       perform both inter- and intra-engine optimizations. A
ternal systems. We can then leverage these learned cost        similar approach based on data wrappers, is followed by
models in order to develop a federated query optimizer         another notable category of systems, called polystores
that can easily connect to different systems, and has zero     [9, 10]. However, these approaches depend strongly on
communication cost during optimization. In summary,            cost models, provided by the external systems. If a sys-
our contributions are the following:                           tem does not provide cost estimates, the only way to
                                                               integrate new systems is to implement the cost models.
     • We introduce a machine learning based architec-         This process makes the integration with new systems im-
       ture for federated query optimization that is able      practical. Moreover, the communication needed with the
       to integrate with any SQL-based database system         external systems to obtain cost estimates of local query
       with close-to-zero engineering effort.                  executions leads to excessive overheads that make the
     • We present an implementation of our architecture        optimization process slow (cost-of-costing [5]).
       on top of Spark SQL and we demonstrate how our          Learned Query Optimization. The idea of learned
       system can effectively optimize federated queries       query optimization has gained a lot of attention through
       over multiple systems, with zero communication          the last years. Approaches like Neo [11] and Bao [12]
       overhead.                                               are representative examples that showcase how machine
     • We discuss an experimental evaluation that              learning can be utilized for self-driving query optimizers.
       demonstrates our system’s ability to effectively        However, most of these works are focusing mainly on
       learn the performance of the external systems,          single-node database systems. The closest approach to
       while generated federated query plans always            our system is the one presented by Liqi Xu et al. [6],
       outperform Spark SQL.                                   which presents a supervised-learning approach for fed-
                                                               erated query optimization. The system learns the per-
   The rest of the paper is organized as follows. Section 2    formance of the externally connected systems and it pre-
presents some background on federated query optimiza-          dicts the best federation engine, which can be any of
tion, as well as a brief overview of the federated query       the connected data sources. However, this work does
processing research. Next, we present a prototype and          not consider splitting further the complement queries,
architecture of a machine learning based federated query       ignoring potential query plans that could achieve better
optimizer in Section 3. Then, we present some early ex-        performance. Furthermore, it relies on information (e.g.,
perimental evaluation in Section 4. Finally, in Sections 5     cost or row estimates), gathered by the EXPLAIN clause of
and 6, we conclude and present ongoing and planned             the target data source. Continuously invoking EXPLAIN
PhD research.                                                  during query planning on multiple external data sources
                                                               can make the process significantly slower, due to the com-
                                                               munication overhead. Furthermore, the assumption that
2. Background and Related Work                                 any of the connected data sources can be considered as
                                                               the federation engine is not realistic, as it is not common
A federated query contains tables that reside in one, or
                                                               for a DBMS to support reading data from external data
multiple external database systems. The corresponding
                                                               sources.
system is called a federation engine. A federated query
                                                                  The main weaknesses of prior work are the following.
First, integration with new systems can be tedious. Next,       cost model. For example, in some cases it might make
the cost estimation and interpretation for the external         sense to break a subtree that joins four tables into two
systems, as well as the continuous communication of the         subtrees that join two tables, in order to avoid computing
external systems during query optimization makes the            a large result, and fetching that result to the federation
process relatively slow, resulting in a high cost-of-costing.   engine over the network. This is achieved by adjusting
In the next section, we describe how we address these           a parameter called join_limit , which defines the max-
problems with machine learning, as well as a unified            imum subquery size that can be pushed down for local
query vectorizer that maps queries to vectors.                  execution to an external system.
                                                                Federated Rewriter. This module takes as input the
                                                                federated query plan produced by the previous step. For
3. ML-Based Federated Query                                     each subtree that refers to a specific location (database
   Optimization                                                 system) of the query plan, it performs on-the-fly SQL
                                                                code generation that will be pushed down to the exter-
3.1. Architecture                                               nal system for local execution. Finally, it generates the
                                                                SQL code that will be executed in the federation engine,
Figure 1 depicts the architecture of our federated query        which will aggregate the results of each component query
optimizer prototype. In this section, we describe in detail     executed in the external locations.
each individual component of our system.
Query Vectorizer. The query vectorizer takes as input
a parsed SQL query in its abstract syntax tree (AST) form       3.2. Query Lifecycle
and converts it into a vector that represents the semantics     The query lifecycle follows the same steps as in the pre-
of the query, e.g., which tables are joined in the query or     vious section. First, an SQL query is parsed and trans-
in which columns a GROUP BY operator is applied. In the         formed to the corresponding AST form. Then, this query
current version, we follow a simple one-hot-encoding            is passed to the vectorizer, which will transform it to the
approach. Each query operator is represented by a vector.       corresponding vector form. Next, the optimizer takes
For example, the aggregation vector 𝑔 = [1, 0, 0, 1] rep-       the AST of the query, it converts it to the corresponding
resents a query in which the GROUP BY clause is applied         graph, and it produces the final federated query plan.
on the first and the fourth columns. We combine all the         Finally, the federated plan is passed to the rewriter which
vectors for all the predicates that we need to include in       will perform the necessary SQL code generation for the
our search space and create a unified vector that repre-        external systems and the federation engine. The query
sents the full query.                                           is then executed by leveraging both the external systems
Cost Model Learning. In order to learn cost models,             and the federation engine, and the result is returned to
we use data obtained from past and current workloads.           the user. At the end of execution, we also keep the query
For each query, we keep its execution time and its vector       execution metrics, like the total execution time and the in-
form. We feed this data to a machine learning model, that       dividual execution times of the subqueries in the external
predicts the execution time of future queries. Our cur-         engines. We keep these metrics in order to re-train and
rent prototype trains its models with respect to execution      refine our learned cost models and keep them up-to-date.
time. However, the approach can be easily modified in           As mentioned in the previous sections, the key advan-
order to take into account more objectives, like monetary       tages of our federated query optimizer are the following.
cost in a cloud setting.                                        First, the query vectorizer allows our system to be easily
Federated Query Optimizer. Our federated query opti-            integrated with any system that supports SQL, making
mizer uses the first two components in order to generate        the design specifics of the external system transparent
near-optimal federated plans. First, it transforms the AST      to the federation engine. For example, in our implemen-
form of the query to a graph, in which each vertex repre-       tation over Spark SQL, we use JDBC drivers in order to
sents one table and its location, while an edge represents      connect to the external system. Then, our system oper-
a join between two tables. The optimizer works in two           ates only on Spark’s intermediate representation (AST) of
phases. The first pass, which we call Location-First Search,    the input query. This design will work exactly the same
is an extension of the traditional Breadth-First-Search al-     over any possible set of connected systems. Second, we
gorithm which traverses the graph, and generates a new          minimize the potential communication overhead during
binary tree with the following property. It is guaranteed       optimization. By leveraging learned cost models, the cost
that all vertices (tables) that reside in the same location,    estimates for each subquery in each external system are
will be co-located under the same subtree (whenever that        computed fast, while most of the optimization time is
is possible, given the query semantics). The second pass        spent on useful work, i.e. plan enumeration.
processes each subtree at each location, and makes the
required transformations, being advised by the learned
                                                               an external engine. This parameter needs adjustment for
                                                               the following reason. Imagine pushing down a large join,
                                                               that produces a very large intermediate result. Fetch-
                                                               ing this result from the external database system to the
                                                               federation engine will result in excessive network over-
Figure 1: Optimizer Design and Query Lifecycle                 heads. Thus, the query performance will decrease. Us-
                                                               ing our very first prototype of a Reinforcement Learn-
                                                               ing based optimizer that tries out different values of the
Table 1
                                                               join_limit parameter, we conclude that for this setup
Execution Time
                                                               the join limit should be either two or three. Sticking to
            System                  Avg. Exec Time (s)
                                                               these values, the average query execution time is 60%
            Spark SQL               1.15
                                                               of time that the vanilla implementation of Spark SQL re-
            Optimal                 0.122
                                                               quires. This improvement is achieved mainly by splitting
            Federated Optimizer     0.152
                                                               the query execution across Spark SQL and the external
                                                               systems (e.g. pushing down part of the join), utilizing
                                                               both the federation engine and the systems it connects
4. Preliminary Results                                         to.
We evaluate our prototype experimentally and compare
to Spark SQL. Our goal is to show that our design can ef-      5. Research Plan
fectively chose the right engine to execute each subquery,
and the resulting plans outperform the vanilla implemen-       There is still work that needs to be done in order to release
tation of Spark SQL. We evaluate our first prototype on        a fully-functional prototype of our ML-based optimizer.
a MacBook Pro with 16GB of memory and an Apple M1              We presented individual parts of the system that we cur-
chip. The infrastructure consists of a standalone, single-     rently work on, and an early experimental evaluation
node Spark SQL cluster, one Postgres 14.0 instance and         of those components that showcase the current perfor-
one MySQL 8.0.27 instance, everything running on the           mance improvements that our system achieves.
same machine. We used the TPC-H 7 (1GB) and the JOB            Short-Term Goals. We are working towards the full
[13] benchmarks.                                               integration of our optimizer with the learned cost models.
  Learned Cost Models. We first evaluate our opti-             Our main idea is to adopt a dynamic programming ap-
mizer’s effectiveness in choosing the most performant          proach for plan enumeration, as in traditional databases.
execution engine. We use the TPC-H dataset for this ex-        We plan to modify the algorithms and make them lever-
periment, making all tables available in all systems. First,   age the learned cost models in order to evaluate the cost
we run a set of micro-benchmarks by randomly picking           of the enumerated federated, cross-database plans. We
some of the TPC-H queries in order to collect data and         foresee the following challenges. First, in case of updates,
train the cost models. Next, we run all TPC-H queries          the cost models will become outdated and might mislead
to evaluate our system. For each TPC-H query, the opti-        the optimizer. In this case, the system should notice the
mizer is assisted by the learned cost models, and decides      performance degradation and re-calibrate models with
whether it is better to push the full query into MySQL,        respect to new data, possibly by re-training the models.
Postgres, or fetch all the data and execute the query in       Next, for queries with many joins, the large optimization
Spark. For the sake of the experiment, we run each query       space due to the multiple execution engine options might
in all three modes, which allows us to compare our op-         slow the optimizer down. We plan to develop specialized
timizer’s decision both with Spark SQL and the optimal         heuristics to prune the search space to maintain reason-
decision (i.e. minimum execution time). The results are        able optimization time.
reported in Table 1, and depict the end-to-end execution       Long-Term Plan. Our current design already has some
time, including query processing and data fetch from the       promising results for OLAP workloads on static data.
external systems. Using our optimizer, we achieve an           However, as previously mentioned, relying on past query
average speedup of 7.5x compared to Spark SQL.                 executions might be limiting if any updates are included
Optimized Queries. We use the JOB benchmark for                in the workload, i.e. the learned cost models will become
this section. In this scenario, tables are randomly placed     outdated. Furthermore, the larger search space that de-
across MySQL and Postgres. For those queries, we ex-           rives from the multiple-systems scenario might result
perimented with changing the number of the maximum             in slow optimization for larger join queries. To address
tables (join_limit parameter) that can be included in          these challenges, we are working towards an extension,
a subquery that is pushed-down for local execution to          based on Reinforcement Learning. Instead of using cost
    7
                                                               estimates, this RL-based optimizer will try out different
        https://www.tpc.org/tpch/
splits and join orders for the initial query during the ex-    [3] M. Karpathiotakis, A. Floratou, F. Özcan, A. Aila-
ploration phase. Using RL, we can develop an adaptive              maki, No data left behind: real-time insights from
optimizer, that will explore different subquery combina-           a complex data ecosystem, in: Proceedings of the
tions and pushdowns across the different systems. This             2017 Symposium on Cloud Computing, 2017, pp.
will not require prior training, and past workloads. Our           108–120.
goal is to create a solution that quickly adapts to data       [4] V. Giannakouris, N. Papailiou, D. Tsoumakos,
changes (in case of updates), while avoiding the costly            N. Koziris, Musqle: Distributed sql query execution
enumerations in query planning.                                    over multiple engine environments, in: 2016 IEEE
Vision. A federated query engine should be flexible, con-          International Conference on Big Data (Big Data),
nect easily to new data sources and hide the complexity            IEEE, 2016, pp. 452–461.
of the underlying infrastructure from the user. Existing       [5] A. Deshpande, J. M. Hellerstein, Decoupled query
approaches on federated query optimization, like System-           optimization for federated database systems, in:
PV [3] and MuSQLE [4], still require a lot of manual work          Proceedings 18th International Conference on Data
from the user. Our intuition is that the more generic, ML-         Engineering, IEEE, 2002, pp. 716–727.
based design that we develop as part of this PhD research      [6] L. Xu, R. L. Cole, D. Ting, Learning to optimize
will democratize federated query optimization, and make            federated queries, in: Proceedings of the Second
it possible to adopt these optimization schemes in the             International Workshop on Exploiting Artificial In-
real world. While we implement our optimizer on top                telligence Techniques for Data Management, 2019,
of Spark SQL, our design is generic enough and can be              pp. 1–7.
easily implemented in similar systems. The long-term           [7] D. McLeod, D. Heimbigner, A federated architecture
goal of this PhD is to introduce new federated query opti-         for database systems, in: Proceedings of the May
mization designs that are autonomous, and can be easily            19-22, 1980, national computer conference, 1980, pp.
adopted by systems both in industry and academia.                  283–289.
                                                               [8] A. P. Sheth, J. A. Larson, Federated database sys-
                                                                   tems for managing distributed, heterogeneous, and
6. Conclusions                                                     autonomous databases, ACM Computing Surveys
                                                                   (CSUR) 22 (1990) 183–236.
We presented a PhD research plan that proposes a new
                                                               [9] J. Duggan, A. J. Elmore, M. Stonebraker, M. Bal-
federated query optimization design, based on machine
                                                                   azinska, B. Howe, J. Kepner, S. Madden, D. Maier,
learning. Our design is still under development and in an
                                                                   T. Mattson, S. Zdonik, The bigdawg polystore sys-
early stage. The preliminary results show that ML-based
                                                                   tem, ACM Sigmod Record 44 (2015) 11–16.
federated query optimization achieves notable perfor-
                                                              [10] J. LeFevre, J. Sankaranarayanan, H. Hacigumus,
mance improvements when compared to Spark SQL. In
                                                                   J. Tatemura, N. Polyzotis, M. J. Carey, Miso: soup-
contrast to past works on federated query processing,
                                                                   ing up big data query processing with a multistore
our prototype leverages machine learning in order to
                                                                   system, in: Proceedings of the 2014 ACM SIGMOD
cope with the heterogeneity of the underlying database
                                                                   international conference on Management of data,
systems. Our optimizer is able to connect new systems
                                                                   2014, pp. 1591–1602.
with close-to-zero engineering effort, and effectively op-
                                                              [11] R. Marcus, P. Negi, H. Mao, C. Zhang, M. Al-
timize federated queries with minimal communication
                                                                   izadeh, T. Kraska, O. Papaemmanouil, N. Tatbul,
overhead.
                                                                   Neo: A learned query optimizer, arXiv preprint
                                                                   arXiv:1904.03711 (2019).
References                                                    [12] R. Marcus, P. Negi, H. Mao, N. Tatbul, M. Alizadeh,
                                                                   T. Kraska, Bao: Making learned query optimization
 [1] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K.       practical, ACM SIGMOD Record 51 (2022) 6–13.
     Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Gh-      [13] V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kem-
     odsi, et al., Spark sql: Relational data processing in        per, T. Neumann, How good are query optimizers,
     spark, in: Proceedings of the 2015 ACM SIGMOD                 really?, Proceedings of the VLDB Endowment 9
     international conference on management of data,               (2015) 204–215.
     2015, pp. 1383–1394.
 [2] V. Josifovski, P. Schwarz, L. Haas, E. Lin, Garlic: a
     new flavor of federated query processing for db2, Acknowledgments
     in: Proceedings of the 2002 ACM SIGMOD interna-
     tional conference on Management of data, 2002, pp. This research project is supported by NSF grant
     524–532.                                                IIS1910830 (“Regret-Bounded Query Evaluation via Rein-
                                                             forcement Learning”).