=Paper=
{{Paper
|id=Vol-3186/paper5
|storemode=property
|title=Building Learned Federated Query Optimizers
|pdfUrl=https://ceur-ws.org/Vol-3186/paper_5.pdf
|volume=Vol-3186
|authors=Victor Giannakouris
|dblpUrl=https://dblp.org/rec/conf/vldb/Giannakouris22
}}
==Building Learned Federated Query Optimizers==
Victor Giannakouris (vg292@cornell.edu)
Advised by: Immanuel Trummer
Cornell University, NY, USA

Abstract. The goal of this thesis is to introduce a new design for building federated query optimizers based on machine learning. We propose a modular and flexible architecture that allows a federated query optimizer to integrate with any database system that supports SQL with close-to-zero engineering effort. By observing the performance of the external systems, our optimizer learns and builds cost models on the fly, enabling federated query optimization with negligible communication with the external systems. To demonstrate the potential of this research plan, we present a prototype of our federated query optimizer built on top of Spark SQL. Our implementation effectively accelerates federated queries, achieving up to 7.5x better query execution times compared to the vanilla implementation of Spark SQL.

Keywords: federated query processing, query optimization, machine learning

===1. Introduction===

In the complex infrastructure of the modern "big data" ecosystem, data are usually distributed across multiple, diverse database systems. This has led to the development of federated query engines that enable users to simultaneously query multiple databases using a unified, SQL-based interface. For instance, it is common for a data scientist to issue a query that joins a "small" table in a relational database with a bigger table that resides in a distributed data lake, like Amazon S3 (https://aws.amazon.com/s3) or Delta Lake (https://delta.io/). The number of federation engines developed over the last years by some of the largest database vendors, including Athena Federated Query (https://aws.amazon.com/athena), BigQuery (https://cloud.google.com/bigquery), Spark SQL [1], Presto (https://prestodb.io/), and Dremio (https://www.dremio.com), provides clear evidence for the importance of federated query engines. Given the heterogeneity of the underlying systems that a federation engine integrates with, optimizing federated queries is one of the most challenging tasks for these systems. Usually, a federated query engine follows a one-size-fits-all approach in order to connect with as many external database systems as possible. In summary, the query lifecycle in most federation systems (e.g., Spark, Presto) is quite simple. First, the federated engine transfers all the tables and views included in the query from the external database systems to the federation engine over the network. A number of specific rule-based optimizations, e.g., subquery pushdown, might also be applied. Finally, the resulting query plan is executed in the federation engine.
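To make this baseline lifecycle concrete, the sketch below shows how a federation engine such as Spark SQL typically exposes external tables over JDBC and executes a cross-system join. This is a minimal illustration, not the paper's implementation: the connection URLs, table names, and query are placeholder assumptions. By default, Spark fetches both tables over the network and pushes down at most simple filters and projections.

```python
# A minimal PySpark sketch of the baseline federated-query lifecycle described
# above: external tables are registered via JDBC, and the federation engine
# (Spark SQL) fetches them over the network, pushing down at most simple
# filters/projections. Connection details and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("federated-baseline").getOrCreate()

def register_jdbc_table(name: str, url: str, table: str) -> None:
    """Expose an external table to Spark SQL as a temporary view."""
    df = (spark.read.format("jdbc")
          .option("url", url)            # e.g. jdbc:postgresql://host/db
          .option("dbtable", table)
          .load())
    df.createOrReplaceTempView(name)

register_jdbc_table("customer", "jdbc:postgresql://pg-host/tpch", "customer")
register_jdbc_table("orders", "jdbc:mysql://mysql-host/tpch", "orders")

# The join itself runs inside Spark; only the filter on o_orderdate is a
# candidate for pushdown into the external systems.
result = spark.sql("""
    SELECT c.c_name, COUNT(*) AS num_orders
    FROM customer c JOIN orders o ON c.c_custkey = o.o_custkey
    WHERE o.o_orderdate >= DATE '1995-01-01'
    GROUP BY c.c_name
""")
result.show()
```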
Ideally, an efficient optimizer should be able to generate more sophisticated federated query plans, as in traditional databases. For example, instead of just pushing down selections to the external systems, it should consider pushing down larger parts of the federated query, like a join sub-tree. However, the heterogeneous nature and architectural differences of the external systems make the task of deciding which parts of the query to push down, and where, particularly complex. One of the main challenges is the complexity of estimating the subquery execution cost in an external system. This is a tricky task for a number of reasons. For example, due to the lack of access to statistics in the remote database system, estimating the local execution cost (in the external system) and the result size is very challenging. Furthermore, the larger search space that derives from the additional planning decisions introduced by federated execution (i.e., where to execute each operator) makes optimization even more challenging.

As a result, the majority of federation engines apply very few rule-based optimizations (e.g., selection pushdown), discarding optimization opportunities that would leverage the full potential of the external systems. While there have been some attempts to develop wrappers [2, 3] and custom cost models [4] to enable more fine-grained federated query plan generation, these approaches face the following challenges. First, developing custom wrappers and cost models for new systems is a tedious task, making the integration with new systems extremely difficult. Second, the communication with external systems to obtain cost estimates can slow optimization down, a problem known as the cost of costing [5]. Taking these challenges into account, we try to answer the following question: ''Can we develop a generic design for federated query optimizers that integrates with any external database system with (1) close-to-zero engineering effort and (2) minimal communication overhead during optimization?''

To answer this question, we present an engine-agnostic approach for federated query optimization that copes with the heterogeneity of the underlying infrastructure. Using machine learning, our solution allows the optimizer to learn the performance of the external database systems without relying on any system-specific knowledge. Instead, it treats the external systems as black boxes. The key idea behind our approach is the following. In contrast to previous approaches that depend on cost estimates obtained from external systems (e.g., by parsing the output of the EXPLAIN clause [6]), we use a unified query vector model to represent queries in the vector space. Using this model, query trees can simply be transformed to vectors and fed to various machine learning models in order to learn and predict the performance of the external systems. We can then leverage these learned cost models to develop a federated query optimizer that can easily connect to different systems and has zero communication cost during optimization. In summary, our contributions are the following:

* We introduce a machine learning based architecture for federated query optimization that is able to integrate with any SQL-based database system with close-to-zero engineering effort.
* We present an implementation of our architecture on top of Spark SQL and demonstrate how our system can effectively optimize federated queries over multiple systems with zero communication overhead.
* We discuss an experimental evaluation that demonstrates our system's ability to effectively learn the performance of the external systems, while the generated federated query plans consistently outperform Spark SQL.

The rest of the paper is organized as follows. Section 2 presents some background on federated query optimization, as well as a brief overview of federated query processing research. Next, we present the architecture and a prototype of a machine learning based federated query optimizer in Section 3. Then, we present an early experimental evaluation in Section 4. Finally, in Sections 5 and 6, we present ongoing and planned PhD research and conclude.
===2. Background and Related Work===

A federated query contains tables that reside in one or multiple external database systems. The corresponding system is called a federation engine. A federated query processor follows a similar design to a traditional, single-engine query processor; the main extension is the ability to load data from external data sources. In this work, our main focus is the optimizer. Federated query optimization is more complex than traditional query optimization, as it has to tackle the challenges that arise from the heterogeneity of the external systems.

Federated Query Processing. Federated query processing is not a new problem, and there has been extensive work over the last few decades [7, 8] that aims at optimizing queries across diverse data sources. For instance, Garlic [2] introduces a federated query optimizer based on a cost-based, dynamic-programming approach that uses data wrappers in order to integrate and execute queries across different data sources. Recent works, like MuSQLE [4] and System-PV [3], present federated (a.k.a. multi-engine) query optimization approaches that perform both inter- and intra-engine optimizations. A similar approach based on data wrappers is followed by another notable category of systems, called polystores [9, 10]. However, these approaches depend strongly on cost models provided by the external systems. If a system does not provide cost estimates, the only way to integrate it is to implement its cost model, which makes the integration of new systems impractical. Moreover, the communication with the external systems needed to obtain cost estimates of local query executions leads to excessive overheads that make the optimization process slow (the cost-of-costing problem [5]).

Learned Query Optimization. The idea of learned query optimization has gained a lot of attention over the last years. Approaches like Neo [11] and Bao [12] are representative examples that showcase how machine learning can be utilized for self-driving query optimizers. However, most of these works focus mainly on single-node database systems. The closest approach to our system is the one presented by Xu et al. [6], a supervised-learning approach for federated query optimization. Their system learns the performance of the externally connected systems and predicts the best federation engine, which can be any of the connected data sources. However, this work does not consider splitting the component queries further, ignoring potential query plans that could achieve better performance. Furthermore, it relies on information (e.g., cost or row estimates) gathered via the EXPLAIN clause of the target data source. Continuously invoking EXPLAIN on multiple external data sources during query planning can make the process significantly slower due to the communication overhead. Finally, the assumption that any of the connected data sources can serve as the federation engine is not realistic, as it is not common for a DBMS to support reading data from external data sources.

The main weaknesses of prior work are therefore the following. First, integration with new systems can be tedious. Second, cost estimation and interpretation for the external systems, as well as the continuous communication with the external systems during query optimization, make the process relatively slow, resulting in a high cost of costing. In the next section, we describe how we address these problems with machine learning, together with a unified query vectorizer that maps queries to vectors.

===3. ML-Based Federated Query Optimization===

====3.1. Architecture====

Figure 1 depicts the architecture of our federated query optimizer prototype. In this section, we describe each individual component of our system in detail.

[Figure 1: Optimizer Design and Query Lifecycle]

Query Vectorizer. The query vectorizer takes as input a parsed SQL query in its abstract syntax tree (AST) form and converts it into a vector that represents the semantics of the query, e.g., which tables are joined in the query or on which columns a GROUP BY operator is applied. In the current version, we follow a simple one-hot-encoding approach: each query operator is represented by a vector. For example, the aggregation vector g = [1, 0, 0, 1] represents a query in which the GROUP BY clause is applied on the first and the fourth columns. We combine the vectors of all the predicates that we need to include in our search space into a unified vector that represents the full query.
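The following is a minimal sketch of such a one-hot vectorizer. The operator set, the fixed table/column ordering, and the simplified "parsed query" structure are illustrative assumptions, not the prototype's actual implementation.

```python
# A minimal sketch of the one-hot query vectorizer described above. The set of
# operators, the fixed column ordering, and the simplified "parsed query"
# structure are illustrative assumptions, not the paper's actual implementation.
from typing import Dict, List

COLUMNS = ["c_custkey", "c_name", "o_orderdate", "o_totalprice"]  # fixed order
TABLES = ["customer", "orders", "lineitem"]

def one_hot(items: List[str], universe: List[str]) -> List[int]:
    """1 at each position of `universe` that appears in `items`, else 0."""
    return [1 if u in items else 0 for u in universe]

def vectorize(query: Dict[str, List[str]]) -> List[int]:
    """Concatenate per-operator one-hot vectors into one query vector."""
    join_vec = one_hot(query.get("joined_tables", []), TABLES)
    filter_vec = one_hot(query.get("filter_columns", []), COLUMNS)
    group_vec = one_hot(query.get("group_by_columns", []), COLUMNS)
    return join_vec + filter_vec + group_vec

# Example: GROUP BY on the first and fourth columns yields [1, 0, 0, 1] for the
# aggregation part of the vector, as in the text.
q = {"joined_tables": ["customer", "orders"],
     "filter_columns": ["o_orderdate"],
     "group_by_columns": ["c_custkey", "o_totalprice"]}
print(vectorize(q))  # [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1]
```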
Cost Model Learning. In order to learn cost models, we use data obtained from past and current workloads. For each query, we keep its execution time and its vector form. We feed this data to a machine learning model that predicts the execution time of future queries. Our current prototype trains its models with respect to execution time. However, the approach can easily be modified to take additional objectives into account, such as monetary cost in a cloud setting.
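A sketch of this learning step is shown below: one regression model per external engine, trained on (query vector, observed runtime) pairs, so that cost estimates at optimization time require no communication with the engines. The choice of scikit-learn's RandomForestRegressor is an assumption for illustration; the paper does not specify the model family.

```python
# A sketch of cost-model learning as described above: one regression model per
# external engine, trained on (query vector, observed execution time) pairs.
# RandomForestRegressor is an illustrative choice, not the paper's.
from collections import defaultdict
from typing import Dict, List, Tuple

from sklearn.ensemble import RandomForestRegressor

class LearnedCostModels:
    def __init__(self) -> None:
        # engine name -> list of (query_vector, runtime_seconds)
        self.history: Dict[str, List[Tuple[List[int], float]]] = defaultdict(list)
        self.models: Dict[str, RandomForestRegressor] = {}

    def record(self, engine: str, query_vector: List[int], runtime_s: float) -> None:
        """Store an observed execution, e.g. collected after each query run."""
        self.history[engine].append((query_vector, runtime_s))

    def retrain(self) -> None:
        """(Re)fit one model per engine from the recorded history."""
        for engine, samples in self.history.items():
            X = [vec for vec, _ in samples]
            y = [t for _, t in samples]
            model = RandomForestRegressor(n_estimators=50, random_state=0)
            model.fit(X, y)
            self.models[engine] = model

    def estimate(self, engine: str, query_vector: List[int]) -> float:
        """Predict the runtime of a (sub)query on an engine, with no
        communication to that engine at optimization time."""
        return float(self.models[engine].predict([query_vector])[0])
```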
Federated Query Optimizer. Our federated query optimizer uses the first two components in order to generate near-optimal federated plans. First, it transforms the AST form of the query into a graph in which each vertex represents a table and its location, while an edge represents a join between two tables. The optimizer works in two phases. The first pass, which we call Location-First Search, is an extension of the traditional Breadth-First Search algorithm: it traverses the graph and generates a new binary tree with the following property. It is guaranteed that all vertices (tables) that reside in the same location will be co-located under the same subtree (whenever that is possible, given the query semantics). The second pass processes each subtree at each location and makes the required transformations, advised by the learned cost model. For example, in some cases it might make sense to break a subtree that joins four tables into two subtrees that join two tables each, in order to avoid computing a large result and fetching that result to the federation engine over the network. This is achieved by adjusting a parameter called join_limit, which defines the maximum subquery size that can be pushed down for local execution to an external system.
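The sketch below illustrates the Location-First Search idea on a toy join graph. The join-graph representation, the left-deep grouping of same-location tables, and the tie-breaking order are simplifying assumptions made for illustration, not the prototype's exact algorithm.

```python
# A minimal sketch of the Location-First Search pass described above: a BFS
# over the join graph that groups tables by location, so that tables from the
# same external system end up under the same join subtree. The join-graph
# representation and the left-deep grouping are simplifying assumptions.
from collections import defaultdict, deque
from typing import Dict, List, Tuple

def location_first_search(
    tables: Dict[str, str],                 # table -> location, e.g. {"orders": "mysql"}
    join_edges: List[Tuple[str, str]],      # joined table pairs
    start: str,
):
    """Return a nested-tuple join tree where same-location tables are grouped."""
    adj = defaultdict(list)
    for a, b in join_edges:
        adj[a].append(b)
        adj[b].append(a)

    # BFS, but visit same-location neighbors before remote ones.
    visited, order = {start}, [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        neighbors = sorted(adj[node], key=lambda t: tables[t] != tables[node])
        for nxt in neighbors:
            if nxt not in visited:
                visited.add(nxt)
                order.append(nxt)
                queue.append(nxt)

    # Build one left-deep subtree per location, in discovery order ...
    per_location: Dict[str, object] = {}
    for t in order:
        loc = tables[t]
        per_location[loc] = t if loc not in per_location else (per_location[loc], t)

    # ... then join the per-location subtrees in the federation engine.
    plan = None
    for subtree in per_location.values():
        plan = subtree if plan is None else (plan, subtree)
    return plan

# Example: two Postgres tables and one MySQL table.
tables = {"customer": "postgres", "nation": "postgres", "orders": "mysql"}
edges = [("customer", "orders"), ("customer", "nation")]
print(location_first_search(tables, edges, "customer"))
# (('customer', 'nation'), 'orders')
```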
Federated Rewriter. This module takes as input the federated query plan produced by the previous step. For each subtree of the plan that refers to a specific location (database system), it performs on-the-fly SQL code generation for the subquery that will be pushed down to the external system for local execution. Finally, it generates the SQL code that will be executed in the federation engine, which aggregates the results of each component query executed in the external locations.
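As a rough illustration of what the rewriter's output could look like, the sketch below emits one pushdown SQL string per per-location subtree. The join-tree format, the helper names, and the generated SQL shapes are hypothetical; the prototype generates SQL from Spark's intermediate representation rather than from tuples like these.

```python
# A sketch of the federated rewriter described above: for each per-location
# subtree it emits a pushdown SQL string; the federation engine then joins the
# pushed-down results. Join-tree format and SQL shapes are illustrative.
from typing import Dict, List

def leaves(tree) -> List[str]:
    """Flatten a nested-tuple join tree into its table names."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

def pushdown_sql(tree, join_conds: Dict[frozenset, str]) -> str:
    """Generate the SQL pushed down to one external system for one subtree."""
    tbls = leaves(tree)
    conds = [cond for pair, cond in join_conds.items() if pair <= frozenset(tbls)]
    where = " WHERE " + " AND ".join(conds) if conds else ""
    return f"SELECT * FROM {', '.join(tbls)}{where}"

# Per-location subtrees produced by the optimizer (see previous sketch).
subtrees = {"postgres": ("customer", "nation"), "mysql": "orders"}
join_conds = {
    frozenset({"customer", "nation"}): "customer.c_nationkey = nation.n_nationkey",
    frozenset({"customer", "orders"}): "customer.c_custkey = orders.o_custkey",
}

for location, subtree in subtrees.items():
    print(location, "->", pushdown_sql(subtree, join_conds))
# postgres -> SELECT * FROM customer, nation WHERE customer.c_nationkey = nation.n_nationkey
# mysql -> SELECT * FROM orders
```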
====3.2. Query Lifecycle====

The query lifecycle builds on the components described above. First, an SQL query is parsed and transformed into the corresponding AST form. Then, the query is passed to the vectorizer, which transforms it into the corresponding vector form. Next, the optimizer takes the AST of the query, converts it into the corresponding graph, and produces the final federated query plan. Finally, the federated plan is passed to the rewriter, which performs the necessary SQL code generation for the external systems and the federation engine. The query is then executed by leveraging both the external systems and the federation engine, and the result is returned to the user. At the end of execution, we also keep the query execution metrics, like the total execution time and the individual execution times of the subqueries in the external engines. We keep these metrics in order to re-train and refine our learned cost models and keep them up to date.

As mentioned in the previous sections, the key advantages of our federated query optimizer are the following. First, the query vectorizer allows our system to be easily integrated with any system that supports SQL, making the design specifics of the external system transparent to the federation engine. For example, in our implementation over Spark SQL, we use JDBC drivers in order to connect to the external systems. Our system then operates only on Spark's intermediate representation (AST) of the input query. This design works exactly the same over any possible set of connected systems. Second, we minimize the potential communication overhead during optimization. By leveraging learned cost models, the cost estimates for each subquery in each external system are computed fast, while most of the optimization time is spent on useful work, i.e., plan enumeration.

===4. Preliminary Results===

We evaluate our prototype experimentally and compare it to Spark SQL. Our goal is to show that our design can effectively choose the right engine to execute each subquery, and that the resulting plans outperform the vanilla implementation of Spark SQL. We evaluate our first prototype on a MacBook Pro with 16GB of memory and an Apple M1 chip. The infrastructure consists of a standalone, single-node Spark SQL cluster, one Postgres 14.0 instance, and one MySQL 8.0.27 instance, everything running on the same machine. We used the TPC-H (https://www.tpc.org/tpch/) benchmark (1GB) and the JOB benchmark [13].

Learned Cost Models. We first evaluate our optimizer's effectiveness in choosing the most performant execution engine. We use the TPC-H dataset for this experiment, making all tables available in all systems. First, we run a set of micro-benchmarks by randomly picking some of the TPC-H queries in order to collect data and train the cost models. Next, we run all TPC-H queries to evaluate our system. For each TPC-H query, the optimizer, assisted by the learned cost models, decides whether it is better to push the full query into MySQL or Postgres, or to fetch all the data and execute the query in Spark. For the sake of the experiment, we run each query in all three modes, which allows us to compare our optimizer's decision both with Spark SQL and with the optimal decision (i.e., the minimum execution time). The results are reported in Table 1 and depict the end-to-end execution time, including query processing and data fetching from the external systems. Using our optimizer, we achieve an average speedup of 7.5x compared to Spark SQL.

Table 1: Execution Time
  System               Avg. Exec Time (s)
  Spark SQL            1.15
  Optimal              0.122
  Federated Optimizer  0.152

Optimized Queries. We use the JOB benchmark for this section. In this scenario, tables are randomly placed across MySQL and Postgres. For these queries, we experimented with changing the maximum number of tables (the join_limit parameter) that can be included in a subquery that is pushed down for local execution to an external engine. This parameter needs adjustment for the following reason. Imagine pushing down a large join that produces a very large intermediate result. Fetching this result from the external database system to the federation engine will result in excessive network overheads, and query performance will decrease. Using our very first prototype of a Reinforcement Learning based optimizer that tries out different values of the join_limit parameter, we conclude that for this setup the join limit should be either two or three. Sticking to these values, the average query execution time is 60% of the time that the vanilla implementation of Spark SQL requires. This improvement is achieved mainly by splitting the query execution across Spark SQL and the external systems (e.g., pushing down part of the join), utilizing both the federation engine and the systems it connects to.
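One simple way such a search over join_limit values could be organized is as an epsilon-greedy bandit, sketched below. The paper does not specify its RL formulation, so this is purely illustrative; `run_federated_query` is a hypothetical stand-in for executing a query with a given join_limit and measuring its runtime.

```python
# A sketch of RL-style tuning of the join_limit parameter described above,
# modeled here as an epsilon-greedy multi-armed bandit over a few candidate
# values. Purely illustrative; `run_federated_query` stands in for actual
# plan execution and timing.
import random
from collections import defaultdict
from typing import Callable, List

def tune_join_limit(
    run_federated_query: Callable[[str, int], float],  # (query, join_limit) -> runtime (s)
    workload: List[str],
    candidates: List[int] = (1, 2, 3, 4),
    epsilon: float = 0.2,
) -> int:
    totals = defaultdict(float)   # join_limit -> summed runtime
    counts = defaultdict(int)     # join_limit -> number of trials

    for query in workload:
        if random.random() < epsilon or not counts:
            limit = random.choice(list(candidates))                    # explore
        else:
            limit = min(counts, key=lambda k: totals[k] / counts[k])   # exploit
        runtime = run_federated_query(query, limit)
        totals[limit] += runtime
        counts[limit] += 1

    # Return the join_limit with the lowest observed average runtime.
    return min(counts, key=lambda k: totals[k] / counts[k])
```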
===5. Research Plan===

There is still work to be done in order to release a fully functional prototype of our ML-based optimizer. We have presented the individual parts of the system that we are currently working on, together with an early experimental evaluation of those components that showcases the performance improvements our system currently achieves.

Short-Term Goals. We are working towards the full integration of our optimizer with the learned cost models. Our main idea is to adopt a dynamic programming approach for plan enumeration, as in traditional databases. We plan to modify the algorithms so that they leverage the learned cost models in order to evaluate the cost of the enumerated federated, cross-database plans. We foresee the following challenges. First, in the case of updates, the cost models will become outdated and might mislead the optimizer. In this case, the system should notice the performance degradation and re-calibrate the models with respect to new data, possibly by re-training them. Second, for queries with many joins, the large optimization space due to the multiple execution-engine options might slow the optimizer down. We plan to develop specialized heuristics to prune the search space and maintain reasonable optimization times.

Long-Term Plan. Our current design already shows promising results for OLAP workloads on static data. However, as previously mentioned, relying on past query executions might be limiting if the workload includes updates, i.e., the learned cost models will become outdated. Furthermore, the larger search space that derives from the multiple-systems scenario might result in slow optimization for larger join queries. To address these challenges, we are working towards an extension based on Reinforcement Learning. Instead of using cost estimates, this RL-based optimizer will try out different splits and join orders for the initial query during the exploration phase. Using RL, we can develop an adaptive optimizer that explores different subquery combinations and pushdowns across the different systems. This will require neither prior training nor past workloads. Our goal is to create a solution that quickly adapts to data changes (in the case of updates), while avoiding costly enumerations during query planning.

Vision. A federated query engine should be flexible, connect easily to new data sources, and hide the complexity of the underlying infrastructure from the user. Existing approaches to federated query optimization, like System-PV [3] and MuSQLE [4], still require a lot of manual work from the user. Our intuition is that the more generic, ML-based design that we develop as part of this PhD research will democratize federated query optimization and make it possible to adopt these optimization schemes in the real world. While we implement our optimizer on top of Spark SQL, our design is generic enough to be easily implemented in similar systems. The long-term goal of this PhD is to introduce new federated query optimization designs that are autonomous and can be easily adopted by systems both in industry and academia.

===6. Conclusions===

We presented a PhD research plan that proposes a new federated query optimization design based on machine learning. Our design is still under development and at an early stage. The preliminary results show that ML-based federated query optimization achieves notable performance improvements compared to Spark SQL. In contrast to past works on federated query processing, our prototype leverages machine learning in order to cope with the heterogeneity of the underlying database systems. Our optimizer is able to connect to new systems with close-to-zero engineering effort and to effectively optimize federated queries with minimal communication overhead.

===Acknowledgments===

This research project is supported by NSF grant IIS1910830 ("Regret-Bounded Query Evaluation via Reinforcement Learning").

===References===

[1] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al., Spark SQL: Relational data processing in Spark, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, pp. 1383–1394.

[2] V. Josifovski, P. Schwarz, L. Haas, E. Lin, Garlic: A new flavor of federated query processing for DB2, in: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, 2002, pp. 524–532.

[3] M. Karpathiotakis, A. Floratou, F. Özcan, A. Ailamaki, No data left behind: Real-time insights from a complex data ecosystem, in: Proceedings of the 2017 Symposium on Cloud Computing, 2017, pp. 108–120.

[4] V. Giannakouris, N. Papailiou, D. Tsoumakos, N. Koziris, MuSQLE: Distributed SQL query execution over multiple engine environments, in: 2016 IEEE International Conference on Big Data (Big Data), IEEE, 2016, pp. 452–461.

[5] A. Deshpande, J. M. Hellerstein, Decoupled query optimization for federated database systems, in: Proceedings of the 18th International Conference on Data Engineering, IEEE, 2002, pp. 716–727.

[6] L. Xu, R. L. Cole, D. Ting, Learning to optimize federated queries, in: Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, 2019, pp. 1–7.

[7] D. McLeod, D. Heimbigner, A federated architecture for database systems, in: Proceedings of the May 19-22, 1980, National Computer Conference, 1980, pp. 283–289.

[8] A. P. Sheth, J. A. Larson, Federated database systems for managing distributed, heterogeneous, and autonomous databases, ACM Computing Surveys (CSUR) 22 (1990) 183–236.

[9] J. Duggan, A. J. Elmore, M. Stonebraker, M. Balazinska, B. Howe, J. Kepner, S. Madden, D. Maier, T. Mattson, S. Zdonik, The BigDAWG polystore system, ACM SIGMOD Record 44 (2015) 11–16.

[10] J. LeFevre, J. Sankaranarayanan, H. Hacigumus, J. Tatemura, N. Polyzotis, M. J. Carey, MISO: Souping up big data query processing with a multistore system, in: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 2014, pp. 1591–1602.

[11] R. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, N. Tatbul, Neo: A learned query optimizer, arXiv preprint arXiv:1904.03711 (2019).

[12] R. Marcus, P. Negi, H. Mao, N. Tatbul, M. Alizadeh, T. Kraska, Bao: Making learned query optimization practical, ACM SIGMOD Record 51 (2022) 6–13.

[13] V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, T. Neumann, How good are query optimizers, really?, Proceedings of the VLDB Endowment 9 (2015) 204–215.