<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Distributed Processing in the Query Optimizer</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="editor">
          <string-name>Vancouver, Canada</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>supervised by Prof. Dr. Thomas Neumann, Technische Universität München</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Distributed database systems gain relevance both in industry and academia. However, existing research on query optimization for relational database systems focuses largely on systems running on a single machine. Work on distributed systems neglects available workload information in database systems. In this work, we present optimization strategies to fully leverage the potential of distributed systems running on modern cloud architectures with fast networks. We focus on the optimal assignment of tasks to compute nodes and the joint optimization of join ordering and distribution layout of data. Furthermore, we introduce distributed plans and simulation-based evaluations using a new cost model for computation time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Considering the very high bandwidth available in modern cloud systems, distributed processing becomes increasingly attractive, not only to handle large data sizes but also to improve processing performance. Still, good physical plans are key for efficient execution. We argue that distributed execution engines require changes to existing query optimizers for optimal performance. Existing join ordering algorithms yield suboptimal results because they fail to model the cost of data transfers. Furthermore, optimizers need to spread computational load while avoiding waiting on data when assigning tasks to machines. We plan to contribute the following components to investigate new optimization opportunities:</p>
      <sec id="sec-1-1">
        <title>1. A strategy to transform query plans for efficient distributed execution.</title>
        <p>Methods like hash-distributed joins require repartitioning of their input data. Our strategy chooses favorable distributions and introduces the necessary data shuffling.
2. A new operator-based computation time estimation method that allows us to compare the cost of transferring data over the network against local processing time.
3. An optimization method for the task assignment problem which determines on which node each part of a distributed query should be executed to minimize query response time.</p>
      </sec>
      <sec id="sec-1-2">
        <title>4. A simulator that models the execution of distributed query plans on a cluster, considering each node's computation and network capabilities.</title>
        <p>This simulator uses our previous computation time estimations to track execution times accurately. We can verify the quality of assignments across various cluster setups using simulation-based evaluation.
5. A new join ordering algorithm that can take effects of distributed data partitioning and execution into account to jointly optimize the distribution layout.</p>
        <p>(3) will directly improve query response time. It makes use of (2) to estimate computational load and can be evaluated using (4). (5) also has a direct impact on query performance and can be compared against existing algorithms using (1).</p>
        <p>VLDB 2023 PhD Workshop, co-located with the 49th International Conference on Very Large Data Bases.</p>
        <p>[Figure 1: (a) Initial query plan over base relations; (b) New pipeline boundaries; (c) Distributed query plan with pipelines, pipeline breaks, required data, and a shuffle stage.]</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>There are several distributed database systems, but the number of publications on distributed optimization is rather limited.</title>
        <p>
          Microsoft extended the search space of SQL Server’s query optimizer with data distribution information for cost-based search [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Its cloud-native
successor Polaris avoids the task assignment problem by
writing and reading all intermediate results from a decoupled
storage service [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. This removes many effects of data
locality. Redshift automatically chooses partition keys
and distribution for observed query workloads, but there
is little information published about its query optimizer
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Snowflake [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] uses a classical Cascades-like query
optimizer [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] but fixes the data distribution at query
runtime. Vertica segments tables by their columns instead
of hash partitioning [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. It chooses join ordering with a
worklist-based approach that considers distribution
information and terminates when the memory budget is
exhausted. MemSQL performs cost-based query rewrites
in a heuristically pruned search space, weighing data
transfers with a constant factor [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. SparkSQL uses cost
and rule-based optimizations to broadcast small tables
and perform preaggregations [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Rödiger et al. propose
network optimal partition assignment using MILP for
single join operations in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. They approach data skew
with selective broadcast and Flow-Join that dynamically
broadcast partitions and tuples respectively [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          There is also a lot of related work in the area of big
data that covers similar optimization aspects, such as
task scheduling [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In contrast to big data systems like Hadoop and Spark, relational database systems have much more information available ahead of time. We build upon that research, utilizing this additional information and database-specific optimizations such as join ordering.
        </p>
        <p>
          Hash-distributed execution can be used to effectively perform aggregations and joins on large amounts of data. However, the processing speed of joins can be improved by broadcasting data units in cases of skewed data or vast differences in cardinalities [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Furthermore, it is possible that the overhead of data transfers in further stages outweighs the advantages of distributed processing. Thus, the optimizer should also be able to decide that a data unit should only reside on a single node. In summary, data units can have the following four partition layouts:
• Hash-partitioned: The data unit is hash partitioned by a key.
• Broadcast: All data resides in a single partition that is broadcast to all eligible nodes.
• Single-node: All data resides on one node; further processing will not be distributed.
        </p>
        <p>• Scattered: Tuples are partitioned without a key.</p>
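        <p>The four partition layouts can be captured in a small sketch. This is our illustration in Python with invented names; the paper does not prescribe an implementation, and the rule that a broadcast data unit satisfies any required layout is our assumption.</p>
        <preformat>
```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class PartitionLayout(Enum):
    """The four partition layouts a data unit can have."""
    HASH_PARTITIONED = auto()  # hash partitioned by a key
    BROADCAST = auto()         # one partition, replicated to all eligible nodes
    SINGLE_NODE = auto()       # all data on one node, no further distribution
    SCATTERED = auto()         # tuples partitioned without a key


@dataclass(frozen=True)
class Layout:
    kind: PartitionLayout
    key: Optional[str] = None  # partition key, only for HASH_PARTITIONED


def needs_shuffle(produced: Layout, required: Layout) -> bool:
    """A shuffle stage is needed when the produced layout does not
    satisfy the layout required by the consuming pipeline."""
    if produced == required:
        return False
    # Assumption: a broadcast data unit is already available on every
    # eligible node and therefore needs no repartitioning.
    if produced.kind is PartitionLayout.BROADCAST:
        return False
    return True
```
        </preformat>
        <p>For example, a data unit hash partitioned on one key must be shuffled before a join that requires partitioning on a different key, while a broadcast input can be consumed as-is.</p>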
      </sec>
      <sec id="sec-2-2">
        <title>There are many metrics, such as throughput, query latency, cloud cost, and energy consumption.</title>
        <p>We choose to optimize for latency, as we expect that optimizing for lower execution and transfer times will improve results in all metrics.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Components for Distributed Query Optimization</title>
    </sec>
    <sec id="sec-4">
      <sec id="sec-4-1">
        <title>The main components of our research project are distributed plan generation, computation time estimation, task assignment, a simulator for distributed execution, and a new join ordering optimizer.</title>
        <p>4.1. Distributed Plan Generation</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3. Distributed Processing Model</title>
      <sec id="sec-5-1">
        <title>We compose distributed query plans from three main components:</title>
        <p>
          First, we define the characteristics of distributed systems for which we want to optimize. We focus our work on OLAP systems, but the concepts are also applicable to transactional workloads. The system of concern has disaggregated compute from storage to allow flexible scaling of compute nodes, similar to Snowflake [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Nodes can vary in computational and network capabilities to use available cloud instances cost-effectively. The full dataset has to be stored on a storage service. Tables are stored in a columnar fashion and hash partitioned on user-defined distribution keys. Similarly to Polaris [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], nodes may cache arbitrary partitions locally. In contrast, we explicitly do not disaggregate query state from compute nodes to avoid the latency overhead of writing back and reading all intermediate results from the storage service. We generalize relations, intermediate results, and final query results to data units.
        </p>
        <p>Data Units can be base relations from a database, intermediate results, or the final result of a query. They contain the available attributes and the estimated number of tuples in the data unit. Each data unit is annotated with a partition layout determining the type of partitioning and the partition key, if any.</p>
        <p>Pipelines represent the fused computation of operators that is not interrupted by data materialization or transfers. A pipeline always takes one data unit as input and creates one data unit as output. Additionally, a pipeline may require the presence of further data units, e.g., a pipeline performing a hash join would require the data unit of the build side while taking the data unit of the probe side as input.</p>
        <p>Shuffle Stages repartition data. A shuffle stage takes one data unit as input and returns one data unit as output. The only difference between input and output data unit is their respective partition layout.</p>
        <p>We initially create distributed query plans from physical plans created for single machines, as depicted in Figure 1a. Our method takes a query plan and partition layout information for all base relations and distributes the plan in several passes.</p>
        <p>[Figure 2 (histogram; x-axis: relative cost of assignment, y-axis: frequency): Cost distribution of 100 thousand sampled task assignments for TPC-H Q21 on 16 machines.]</p>
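        <p>The plan components described in this section can be sketched as plain data structures. This is a minimal Python sketch with names of our choosing; the paper does not specify its system at this level.</p>
        <preformat>
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataUnit:
    """Base relation, intermediate, or final result, annotated with its
    attributes, a cardinality estimate, and a partition layout."""
    name: str
    attributes: List[str]
    est_tuples: int
    layout: str  # e.g. "hash(x)", "broadcast", "single-node", "scattered"


@dataclass
class Pipeline:
    """Fused operators between materialization points: consumes one data
    unit, produces one, and may require further data units
    (e.g. the build side of a hash join)."""
    input: DataUnit
    output: DataUnit
    required: List[DataUnit] = field(default_factory=list)


@dataclass
class ShuffleStage:
    """Repartitions a data unit; input and output differ only in layout."""
    input: DataUnit
    output: DataUnit


def link(producer: Pipeline, consumer: Pipeline) -> Optional[ShuffleStage]:
    """If the produced layout matches what the consumer scans, connect the
    pipelines directly; otherwise keep two data units and link them
    with a shuffle stage."""
    if producer.output.layout == consumer.input.layout:
        consumer.input = producer.output
        return None
    return ShuffleStage(input=producer.output, output=consumer.input)
```
        </preformat>
        <p>Linking a pipeline that produces a hash(x)-partitioned result to one that scans a hash(y)-partitioned input yields a shuffle stage; matching layouts need none.</p>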
        <p>First, the operators in the plan are combined to pipelines. Next, we determine the best partition layout at each operator and split pipelines where necessary, as in Figure 1b. Finally, we explicitly name the data units at the ends of each pipeline. The output of a pipeline may have a different partition layout than the required input layout of its scanning pipeline. In this case, we create two data units and link them with a shuffle stage, as shown for data units 5 and 6 in Figure 1c.</p>
        <p>Distributing single-node plans like this will not yield optimal results, as the original plan does not incorporate any information about the distributed system. Ultimately, the optimizer should consider distribution in all phases. Most stages of this method can be reused for such an end-to-end optimizer, and we can create distributed plans to conduct experiments early. Our work on this rule-based plan generation is mostly done.</p>
        <p>4.2. Computation Time Estimation</p>
        <p>
          Traditional single-node query optimizers rely on relatively simple cost models because cardinality estimation errors outweigh the effect of more detailed models [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. For distributed processing, however, we need to compare the relative cost of data transfers over the network to local computation to find good execution strategies. In the presence of fast modern networks, it is no longer sufficient to simply rely on the sizes of intermediate results and ignore the computation time. For an exact comparison, we want to accurately predict the computation time of pipelines. We build a fine-grained operator-based cost model to predict the average computation time required for each tuple at each operator. The profiling method proposed by Beischl et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] provides very detailed data for modern compiled data processing engines. We use this data and available information about each operator, such as tuple size, input cardinality, and expression complexity, to create a detailed performance model. Optimization stages can consider the performance estimations of this model if they are included in the query plan. We have yet to implement this method.
        </p>
        <p>4.3. Task Assignment Optimizer</p>
        <p>Single pipelines can be distributed using data parallelism. If the scanned data unit is partitioned into n partitions, we create n tasks for this pipeline, where each task scans one partition and outputs one partition of the pipeline's result. The task assignment optimizer focuses on the choice of which node should execute which tasks. Each node can execute any task if the necessary data is transferred accordingly. However, this will have a significant impact on performance. We want to evenly spread the computational load among nodes and minimize time spent waiting on data transfers. Each assignment also has an effect on subsequent execution, as it determines on which node the resulting partition will reside.</p>
        <p>As depicted in Figure 2, good task assignments can improve performance by over 2x. The number of possible assignments, n_assign = n_nodes^n_tasks, grows exponentially in the number of tasks, which renders exhaustive enumeration of assignments infeasible. Using our computation time estimates and estimated time spent for data transfers, we plan to build a heuristic optimizer for the task assignment problem that is able to generate good assignments in a short time. We will consider sampling-based and greedy methods to find good initial plans quickly, and refining methods like iterative improvement and simulated annealing to further improve these plans. We have implemented first approaches to this problem.</p>
        <p>4.4. Distributed Execution Simulator</p>
        <p>The best way to evaluate optimizations is to conduct benchmarks on a real system. However, it is intricate to conduct thorough large-scale benchmarks on distributed systems. Experiments on large compute clusters are expensive and take substantial effort to realize with a work-in-progress system. Hence, we decided to simulate the distributed system without the need for a real implementation. This simulator is much more flexible, as we can make fundamental changes to the execution model with little effort. Also, the structure of the simulated cluster can be changed in compute node count, hardware, and network speeds effortlessly. It can also be used as a direct cost function for cost-based optimization methods.</p>
        <p>The simulator takes a distributed query plan, a cluster definition, each node's cached partitions of base relations, and a task assignment as input. It maintains pending and currently active tasks and data transfers over the network and their current progress in percent. First, it computes the estimated remaining time to finish for each active task and transfer. The shortest time t_min determines when the set of currently running tasks and transfers changes.</p>
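        <p>The sampling-plus-refinement idea for task assignment can be sketched as follows. This Python sketch is ours: the cost function stands in for the simulator, and all names are invented for illustration.</p>
        <preformat>
```python
import random
from typing import Callable, Dict, List

Assignment = Dict[int, int]  # task id -> node id
CostFn = Callable[[Assignment], float]


def sample_assignments(tasks: List[int], nodes: List[int],
                       cost: CostFn, samples: int,
                       rng: random.Random) -> Assignment:
    """Sampling-based initial search: draw random assignments, keep the cheapest."""
    best: Assignment = {t: nodes[0] for t in tasks}
    best_cost = cost(best)
    for _ in range(samples):
        cand = {t: rng.choice(nodes) for t in tasks}
        c = cost(cand)
        if c < best_cost:
            best, best_cost = cand, c
    return best


def iterative_improvement(assignment: Assignment, nodes: List[int],
                          cost: CostFn) -> Assignment:
    """Refinement: move single tasks to other nodes while that lowers the cost."""
    current = dict(assignment)
    improved = True
    while improved:
        improved = False
        for task in list(current):
            for node in nodes:
                if node == current[task]:
                    continue
                trial = dict(current)
                trial[task] = node
                if cost(trial) < cost(current):
                    current, improved = trial, True
    return current
```
        </preformat>
        <p>With a toy cost such as the load on the most-loaded node, iterative improvement moves tasks off overloaded nodes until the assignment is balanced; plugging in a simulator-based cost would instead balance estimated response time.</p>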
      </sec>
      <sec id="sec-5-2">
        <title>The simulator advances the progress of all operations by t_min.</title>
        <p>At least one of them will finish and therefore make a new partition available at some node. Finally, it finds all pending operations that can start now that new partitions are available. By accumulating all values of t_min, we can compute the overall runtime of the query.</p>
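        <p>The advance-by-t_min loop can be sketched as follows. This Python sketch simplifies heavily: operations are bare remaining durations, and a pending operation becomes runnable once the operations it waits on have finished; the names are ours.</p>
        <preformat>
```python
from typing import Dict, Set


def simulate(durations: Dict[str, float], deps: Dict[str, Set[str]]) -> float:
    """Advance all runnable operations (tasks and transfers alike) by the
    shortest remaining time t_min until everything has finished; the
    accumulated t_min values give the overall query runtime."""
    remaining = dict(durations)
    finished: Set[str] = set()
    runtime = 0.0
    while len(finished) < len(durations):
        # An operation runs once all operations it waits on have finished.
        running = [op for op in remaining
                   if op not in finished and deps.get(op, set()) <= finished]
        t_min = min(remaining[op] for op in running)
        runtime += t_min
        for op in running:
            remaining[op] -= t_min
            if remaining[op] <= 1e-12:
                finished.add(op)
    return runtime
```
        </preformat>
        <p>For instance, a scan and a transfer running in parallel followed by a dependent probe accumulate runtime max(scan, transfer) + probe.</p>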
        <p>
          Our implementation of this simulator is ready for use. We use it to evaluate randomly sampled task assignments and give their execution time distribution in Figure 2. As our implementation is fast, it can easily simulate tens of thousands of executions per second and is hence suitable for direct integration in the optimization loop.
        </p>
        <p>
          Not only the new problem of task assignment has optimization potential. Join ordering algorithms for single-node execution yield deficient plans for distributed execution [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. As large base relations are likely to be hash partitioned on join keys, it will be advantageous to execute the respective join first and avoid reshuffling the data, even if that might not be optimal for single-node execution. Furthermore, the join ordering algorithm can directly choose the distribution (hash distributed, broadcast, or simply using only a single node) of intermediate results. We will investigate the feasibility of applying exhaustive dynamic programming algorithms that extend solutions by physical properties, similar to SQL Server PDW [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. This will enlarge the search space significantly, and exhaustive search will be infeasible in many cases. Hence, we will work on further possibilities to restrict the search space and gracefully fall back to fast approximations. Additionally, we will work on a new cost model for enumeration algorithms which incorporates both computation and network time. We have not yet started working on this problem.
        </p>
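        <p>The envisioned enumeration can be illustrated with a toy dynamic program whose states are (relation set, distribution layout) pairs, so plans differing only in layout are kept separately, similar in spirit to tracking physical properties. The selectivity, costs, and the one-unit-per-tuple shuffle charge below are placeholders of our choosing, not the paper's cost model.</p>
        <preformat>
```python
from itertools import combinations
from typing import Dict, FrozenSet

SEL = 0.5  # placeholder join selectivity, chosen arbitrarily


def card(rels: FrozenSet[str], base: Dict[str, int]) -> float:
    """Toy cardinality estimate: product of base cardinalities times
    one selectivity factor per join."""
    c = 1.0
    for r in rels:
        c *= base[r]
    return c * SEL ** (len(rels) - 1)


def enumerate_joins(base: Dict[str, int],
                    layout: Dict[str, str]) -> Dict[FrozenSet[str], Dict[str, float]]:
    """Subset DP where each subset keeps one best cost per distribution
    layout of its intermediate result; repartitioning an input to the
    target layout is charged one cost unit per tuple."""
    best = {frozenset([r]): {layout[r]: 0.0} for r in base}
    rels = sorted(base)
    for size in range(2, len(rels) + 1):
        for subset in combinations(rels, size):
            s = frozenset(subset)
            plans = best.setdefault(s, {})
            for k in range(1, size // 2 + 1):
                for left in combinations(subset, k):
                    ls = frozenset(left)
                    rs = s - ls
                    if ls not in best or rs not in best:
                        continue
                    out = card(s, base)
                    for lay1, c1 in best[ls].items():
                        for lay2, c2 in best[rs].items():
                            for target in {lay1, lay2}:
                                cost = c1 + c2 + out  # processing cost placeholder
                                if lay1 != target:    # shuffle the left input
                                    cost += card(ls, base)
                                if lay2 != target:    # shuffle the right input
                                    cost += card(rs, base)
                                if cost < plans.get(target, float("inf")):
                                    plans[target] = cost
    return best
```
        </preformat>
        <p>On three relations where two share a partitioning key, the DP prefers joining the co-partitioned pair first and shuffling only the small third input, exactly the effect the text describes.</p>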
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>Distributed query processing opens potential for optimizations at many different stages of physical plan generation. This work proposes approaches to use that potential in several ways. We describe a way to lift current physical query plans for distributed execution. Then, we create a simulation-based evaluation method for these plans. We highlight the importance of the task assignment problem and sketch several methods to find good assignments. Finally, we present our vision for new enumeration-based join ordering algorithms that jointly optimize the distribution of data with the join ordering.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. V.</given-names>
            <surname>Nehme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aguilar-Saborit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhemali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halverson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>DeWitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Galindo-Legaria</surname>
          </string-name>
          ,
          <article-title>Query optimization in microsoft SQL server PDW</article-title>
          , in: SIGMOD Conference, ACM,
          <year>2012</year>
          , pp.
          <fpage>767</fpage>
          -
          <lpage>776</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Aguilar-Saborit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <article-title>POLARIS: the distributed SQL engine in azure synapse</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>3204</fpage>
          -
          <lpage>3216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Armenatzoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bhanoori</surname>
          </string-name>
          , et al.,
          <article-title>Amazon redshift re-invented</article-title>
          , in: SIGMOD Conference, ACM,
          <year>2022</year>
          , pp.
          <fpage>2205</fpage>
          -
          <lpage>2217</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dageville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cruanes</surname>
          </string-name>
          , et al.,
          <article-title>The snowflake elastic data warehouse</article-title>
          , in: SIGMOD Conference, ACM,
          <year>2016</year>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Graefe</surname>
          </string-name>
          ,
          <article-title>The cascades framework for query optimization</article-title>
          ,
          <source>IEEE Data Eng. Bull</source>
          .
          <volume>18</volume>
          (
          <year>1995</year>
          )
          <fpage>19</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shrinivas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bodagala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <article-title>The vertica query optimizer: The case for specialized query optimizers</article-title>
          , in: ICDE, IEEE Computer Society,
          <year>2014</year>
          , pp.
          <fpage>1108</fpage>
          -
          <lpage>1119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jindel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Walzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jimsheleishvilli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Andrews</surname>
          </string-name>
          ,
          <article-title>The memsql query optimizer: A modern optimizer for real-time analytics in a distributed database</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>9</volume>
          (
          <year>2016</year>
          )
          <fpage>1401</fpage>
          -
          <lpage>1412</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Armbrust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kaftan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghodsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>Spark SQL: relational data processing in spark</article-title>
          , in: SIGMOD Conference, ACM,
          <year>2015</year>
          , pp.
          <fpage>1383</fpage>
          -
          <lpage>1394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Rödiger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mühlbauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Unterbrunner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <article-title>Locality-sensitive operators for parallel main-memory database clusters</article-title>
          , in: ICDE, IEEE Computer Society,
          <year>2014</year>
          , pp.
          <fpage>592</fpage>
          -
          <lpage>603</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Rödiger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Idicula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <article-title>Flow-join: Adaptive skew handling for distributed joins over high-speed networks</article-title>
          ,
          <source>in: ICDE, IEEE Computer Society</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1194</fpage>
          -
          <lpage>1205</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Soualhia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Khomh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tahar</surname>
          </string-name>
          ,
          <article-title>Task scheduling in big data platforms: A systematic literature review</article-title>
          ,
          <source>J. Syst. Softw</source>
          .
          <volume>134</volume>
          (
          <year>2017</year>
          )
          <fpage>170</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Leis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gubichev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mirchev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Boncz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <article-title>How good are query optimizers, really?</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>9</volume>
          (
          <year>2015</year>
          )
          <fpage>204</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Beischl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kersten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bandle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Giceva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <article-title>Profiling dataflow systems on multiple abstraction levels</article-title>
          , in: EuroSys, ACM,
          <year>2021</year>
          , pp.
          <fpage>474</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>