1. Introduction

1613-0073

Towards a Future of Fully Self-Optimizing Query Engines

Paul Blockhaus

blockhau@ovgu.de 0 2

Gabriel Campero Durand

campero@ovgu.de 0 2

David Broneske

broneske@dzhw.eu 0 1

Gunter Saake

saake@ovgu.de 0 2 0 Adaptive Reprogramming , Micro-Adaptivity, Query Engines, Compiled Query Execution, MLIR, Voila 1 German Centre for Higher Education Research and Science Studies , Hannover , Germany 2 Otto-von-Guericke University , Magdeburg , Germany

3 10

With the ever-increasing heterogeneity of hardware, the database community is tasked with adapting to the new reality of diverse ecosystems. The traditional workflow of hand-tuning query engine implementations to the underlying hardware might become untenable for an ever-growing variety of hardware with diferent performance characteristics. Systems like Micro-Adaptivity in Vectorwise or HAWK have been studied as adaptive solutions, but their adoption remains limited. Envisioning a solution simplified for adoption, we propose a practical take on adaptive reprogramming using the domainspecific language Voila and the MLIR compiler framework. We identify five main challenges in the area, and demonstrate how we tackle the first challenges. To show the feasibility of our approach, we include a brief evaluation of its performance on TPC-H; comparing 120 generated variants from a small subspace of potential optimizations.

1. Introduction

Emerging trends in hardware development show a paradigm shift away from a single instruction set architecture (x86) towards a more heterogeneous ecosystem (e.g., RISC-V, ARM). While the complete consequences for the performance and design of database systems in this new era continue to be fleshed out [ 1], it is certain that engineers will continue to be tasked with adapting storage and query engines to new hardware. This trend has already been visible in the last few years with database systems adopting heterogeneous hardware [2] and finding benefits by exploiting new processing capaThis increased heterogeneity opens-up entire new sets of improvements for query engines. In this circumstance, the choice of a physical algorithm is likely to increasingly hinge less on the characteristics of queries, but rather on those of the underlying hardware. Broneske et al., among others, showed that with the increasing heterogeneity, not only the choice of algorithm and operator placement matters for query processing, but the implementation variant of the algorithms is crucial[3]. In numerous instances, promising variants are based only on so-called micro-optimizations commonly used by developers, as rithms (e.g., branch predication, loop unrolling).

Over the last decade, research has cemented that these micro-optimizations have a considerable influence on query performance. Hence, several solutions have

Our implementation is based on Voila [6], a DSL for

query execution able to model various strategies. For our work, we implemented a new compilation backend for Voilas’s execution. The backend is based on the

MLIR framework, which enabled us to implement micro

optimizations of entire query plans at a per-operator level.

To show the potential of our approach, in this work we

CEUR

ceur-ws.org empirically evaluate the performance of the DSL and its variants on an adapted subset of the TPC-H benchmark.

The remainder of the paper is structured as follows: First, we introduce the tools upon which our prototype is built, with an overview of previous work. Next, we introduce the five main challenges we identified for a fully adaptive query engine. We follow with an overview of the architecture of our prototype, that tackles the first two challenges. Finally, we present an early empirical evaluation.

2. Background & Related Work In this section, we discuss the MLIR framework and the Voila DSL, which are the basis for our prototype; enabling eficient and adaptive generation of operator variants. Afterward, we give a brief overview of previous work.

2.1. MLIR Voila [6] is a DSL designed for eficient query execution, independent of the execution paradigm. To this end, the language is decoupled from the execution backend, which can flexibly run Voila programs in a vectorized or compiled fashion. To achieve this independence, Voila is designed with vector data types as first class citizens. The size of the vector determines the execution style, where a size of ∞ corresponds to operator-at-a-time execution and a size of 1 being a tuple-at-a-time execution. To have a language that allows this degree of flexibility, the granularity of the operations is set to execute on entire vectors, but the instructions are kept general purpose to be able to write diferent algorithms without a large re-implementation efort. The resulting granularity is somewhere between relational algebra operators, e.g., MAL [9], and virtual CPU instructions, e.g., LLVM IR.

2.3. Previous Approaches

MLIR [7] is a compiler framework relying on LLVM [8] to In recent years, several approaches for adaptivity in quickly build DSLs. MLIR ofers a generic intermediate database query execution have been proposed. One of representation providing a unified syntax and support for the first approaches achieved micro-adaptivity in Veccommonly needed functionalities of DSL compilers, such torwise [4] through operator variants that were preas type inference, lowering to executable code, interfaces compiled in libraries and could be replaced during runfor common code transformations, and many more. The time by dynamic library loading. two core components of MLIR are dialects and transfor- A new approach on micro-adaptivity was proposed mations, where dialects are diferent DSLs, tailored to with Excalibur. Excalibur is a virtual machine for adapspecific use cases, e.g., the scf -dialect represents struc- tive query execution [10]. It utilizes Voila to generate tured control flow elements, the memref -dialect repre- queries that are executed in a vectorized fashion and sents memory references and access, and the gpu-dialect replaces parts of the execution pipeline with compiled allows expressing instructions executed by GPUs, as well variants to search for an optimal execution plan variant. as the communication between CPU and GPU. The HAWK framework [5] used the C preprocessor and

The structuring of MLIR in dialects and the possibility OpenCL to achieve adaptivity of the query algorithms on of mixing multiple dialects into this single representation heterogeneous hardware. Looking beyond the variant allows for an eficient interoperation between dialects generation, both solutions also considered the selection and simplifies their implementation. Other than dialects, of the optimal variant from the variant pool, using matransformations are the central component of MLIR. They chine learning. This was necessary as the spanned search convert between diferent dialects, while also support- space is extremely large, and the performance of each on ing transformation patterns within a dialect, for example a given hardware-query combination is inherently hard an optimization or canonicalization. As dialects are de- to predict. While the micro-adaptivity approach used an signed to represent problems in their domain eficiently, online multi-armed bandit to find the optimal variants, transformations can also be handled eficiently. This HAWK used ofline-learning with a heuristic to test in leads to better optimization times, and simpler rules, com- the optimization space for the most eficient variants. pared to general-purpose compilers such as LLVM. Addi- Micro-adaptivity and HAWK sufer from a rather retionally, transformations can be used in a plug-and-play stricted set of adaptivity due to pre-compiled variants fashion by conversion to the particular dialect. These using a template mechanism on top of a high-level lanconcepts show, to the best of our knowledge, a great guage. In addition, HAWK doesn’t provide any run-time potential for a compiled query engine capable to utilize adaptivity and has to be trained for a workload in adheterogeneous hardware and are unique for compiler vance. Excalibur is able to overcome the restrictions of frameworks. high-level language templating, but still sufers from the problem of having to implement each variant transformation.

3. Challenges for Adaptive Query Execution Frameworks

levels of granularity with diferent language constructs, as required. Additionally, the DSL can be designed to eficiently support various types of hardware eficiently Challenge 1 – Granularity of Adaptivity and even handle work distribution scenarios [13]. Unfortunately, to our knowledge, there are no established Finding the right granularity at which adaptive repro- approaches that utilize a DSL towards adaptive reprogramming is introduced is a key factor for the eficiency gramming in data management. in several aspects. The granularity determines the complexity of the system, as well as the resulting possibilities Challenge 2 – Adaptivity Mechanism to reach peak-performance. A coarse granularity at the level of, e.g., pipeline programs will only allow for a Closely linked to the choice of granularity is the choice global level of flexibility, such as optimizing compiler of the adaptivity mechanism. This refers to the comlfags, and their influence will overall not lead to large ponent responsible for creating variants on the chosen benefits compared to the choice of algorithms and access granularity, and therefore also (partially) defining the paths used for the query. Though compiler flags could variant space available. This component will depend on improve usage of hardware capabilities, the approach the granularity and query execution paradigm. Current is still limited in optimizing queries individually with approaches introduce adaptivity through an additional regard to the query characteristics. abstraction layer, producing the variants in a preprocess

In contrast, fine-grained adaptivity incurs additional ing step before compilation, i.e., using the C preprocessor overhead during variant generation, and high complexity for variant templates. These variants are then used eion the system for optimization due to the large variant ther directly in a compiled execution engine [5] or loaded space. For example, introducing reprogramming into Hy- dynamically from libraries [4].

Per’s operators [11] would lead to a huge variant space, While this approach can be easily implemented, and as the operators can be written in LLVM IR, but it also is shown to be very efective, it has some drawbacks, adds the complexity of an entire compiler that has to such as limited adaptivity according to the expressivebe able to exploit this variability, entirely disregarding ness of the underlying language and specification of the the complexity of implementing algorithms directly in preprocessor. Additionally, using a high-level language LLVM IR. The use of mainstream compilers for this level for adaptivity does not guarantee that the optimizations of variability is rather limited, as current compilers only are actually applied because state-of-the-art compilers concentrate on generating the best code for the common apply their own heuristics on the code that may undo the case, or work with profiles to optimize for a pre-profiled applied optimizations or optimize diferently. We argue use-case [12]. To our knowledge, there is no approach that a direct integration of the adaptivity mechanism into that instructs the compiler to transform LLVM IR or simi- the compiler will help to overcome these problems, as lar low-level intermediate languages to optimize the code the compilers already support many micro-optimizations at such fine granularity. by default; they just have to be applied. This eliminates

Therefore, current approaches choose the granularity redundancy in the design and also gives full control over of pipeline-operators for variability, as a compromise the compilation process, which also allows restricting the between complexity and variability. These operators optimizations or heuristics that the compiler uses, to not are usually implemented in a high-level general-purpose interfere with manually adapted code. Another advanprogramming language, which limits the degree of vari- tage of this approach is that it combines nicely with our ability through its language constructs and mapping to proposed solution for the first challenge, as modifications hardware instructions. In the presence of heterogeneous of the compiler to support a DSL can already require the systems, the language in which the operators are im- design of a compilation pipeline. In our prototype, we plemented has to support all target platforms, which are able to utilize the diferent granularities of diferent either leads to using a hardware-oblivious language, e.g., DSLs that MLIR ofers to adaptively optimize the query OpenCL, or a hardware-sensitive alternative, mixing mul- in MLIR’s own compilation pipeline. tiple languages. Both approaches come with their own disadvantages, they might not be performance portable, Challenge 3 – Learning Variants or add a lot of complexity through re-implementation of algorithms to optimize for hardware characteristics. With the first two challenges, we described the problem

In order to overcome the above-mentioned problems of variability, but without a good variant selection mechof the individual approaches, Broneske et al. proposed anism, such variability might be essentially useless. The the use of DSLs [3]. While this does not directly solve the main problem for finding the best variants is the extenproblem of choosing the right granularity, it allows de- sive search space that increases exponentially with every signing the DSL to fit an arbitrary granularity, or multiple added optimization. The state-of-the-art approaches, we described in subsection 2.3, leverage machine learning though both versions are possible, we plan to adapt the techniques to learn and predict the best variants for a modeling of relational algebra in MLIR as it is a more flexcertain workload. The HAWK-framework uses an ofline- ible approach; with the added possibility of combining it learning approach that is fed with example queries and, with the learning mechanism mentioned before. in this way, learns the best flavors for the underlying hardware [5]. To reduce the learning time, additional Challenge 5 – Adaptivity for OLTP heuristics are used to prune the exploration space. This approach focuses solely on optimization for a mostly Current approaches in code generation for data processstatic workload and best working parameters for optimal ing focus heavily on OLAP queries, which is reasonable hardware utilization. Another approach used by micro- due to their complexity and high latencies. This focus adaptivity is an online-learning approach that can also neglects OLTP queries, which, we believe, might still adapt to changing workloads [4]. This is achieved by for- provide room for improvement through code generation, mulating the variant selection as a multi-armed bandit with a focus on optimizing together sets of concurrent learning problem. This continuously relearning system queries or templates of them. We envision, at this stage, is also eficiently usable without previous learning. that OLTP queries could be served in at least three main

We argue that an online-learning approach is, in gen- ways. First, recent work has shown promising results eral, preferable over ofline learning, as it is designed to by tackling the challenge of learning an adaptive concuradapt to changes in the data management workloads by rency control mechanism for a mix of pre-defined query itself, whereas ofline learning would need re-training templates at diferent contention scenarios [ 15]. We conover new representative workloads, which nowadays can sider that diferent mixes of these queries and the variable change quickly and therefore is not easily predictable. concurrency control mechanism might be able to leverA solution combining ofline-learning to establish the age a code generation framework for a higher flexibility starting models for online-learning might combine the and adaptivity, in the search for high throughput. To best of both approaches. achieve this, we consider that multi-query optimization

We envision to not only use models to learn the best might be necessary, opening the door for further kinds optimizations that can be used to instruct the compiler of improvements, such as operator sharing. Secondly, we to apply certain optimizations on chosen places, but also consider that interpretation of MLIR for faster execution to generate parameters for each optimization, and to without compile overhead, and a switch between interguide the compiler regarding the number of optimiza- pretation and compilation through coroutines could be tion passes for applying the optimizations. Finally, we relevant for workload mixes that could be categorized envision that the selected solution should integrate well as HTAP. Finally, we argue that the flexibility of MLIR with an MLOps framework, considering the lifecycle, allows for the study of a new dialect focusing on robust evaluation, and maintenance of the models used. processing over adjacent data versions (as used in MVCC or in Delta Lakes), which might be able to provide a competitive support for high-contention scenarios.

Challenge 4 – Integration Into High-Level Optimization 4. A Transformation-Based Approach for Adaptivity

To take a step from a pure query execution engine towards a more widely usable database engine, an adaptive reprogramming framework must be compatible with other high-level optimizations such as algebraic optimiza- In the following, we present our prototype that seeks to tion, index and algorithm selection. To achieve this goal, tackle, at this stage, the first two challenges described in one of two challenges has to be accomplished: Either the the previous section. We start with the architecture of reprogramming framework needs to be integrated into an our approach (cf. Figure 1). existing DBMS, or the missing components are integrated To have a more fine-grained granularity with a larger into the framework. Neither of these two solutions has variant space while only adding a rather small overhead, been successfully applied yet. To the best of our knowl- we adapted Voila [6] as a DSL with sub-operator granuedge, there only exist approaches that integrate parts larity (Challenge 1), and the additional feature of being of traditional optimizations into the adaptivity pipeline. agnostic to the query execution strategy. The design of The HAWK-framework, for instance, has limited support Voila also allows for simple adoption to be eficiently exefor hash table algorithm selection besides the operator- cuted on heterogeneous processors, as similar designs allevel variant granularity. Furthermore, Jungmair et al. ready demonstrated [16]. Instead of explicit predication, showed a first example of how to integrate algebraic op- we rely on a separate compilation pass to automatically timization successfully into MLIR, from which on further choose when to use predication and when to materialhigh-level optimizations can be introduced [14]. Even ize selections, which allows for more flexibility in the

MLIR framework and its dialects to introduce adaptiv

Hardware OblivioVusoQiulaery Representation ity into the code and lower it to LLVM, which is then

JIT-compiled and executed.

MLIR The usual compilation pipeline of our framework looks SQL Query Adaptive Code Generation Query Results as follows, once Voila has been translated to Voila-MLIR, which includes type resolution and transformation of Target DepeLndLenVtCMode Generation columns to the tensor data type, aggressive function inlining is performed to achieve redundant sub-query removal

JIT Execution through canonicalization and common sub-expression Query ExecuHtioanrdownaHreeterogeneous elimination, as well as allowing for better optimizations by eliminating function call boundaries.

Figure 1: High-level architectural schema of the framework Afterward, the dematerialization pass is performed and the resulting plan is lowered to MLIR’s internal dialects, optimization while also simplifying the language. either linalg, or affine. The linalg dialect describes

Implementing those types of adaptivity directly into loops using the concept of iterators. Each linalg operathe DSL has the great advantage that it makes the DSL tion consists of a set of input tensors and generates a set adaptive by design (Challenge 2). In contrast to other of output tensors. The body of linalg operations has to approaches, there is no longer the need to provide an ad- be largely side efect free, and aside of regular reductions ditional callback to pre-compiled algorithms or to change doesn’t allow for data or control flow dependencies bethe query plan. Instead, the query engine directly calls tween the iterations. Therefore, we resort to the affine the generated query code, which is adapted during JIT- dialect for operations that need broader constraints. The compilation by the supplied transformation and optimiza- affine dialect models range-based for loops and allows tion passes. direct memory manipulations, only constrained by the

The DSL itself is built upon the MLIR and LLVM loop induction variables. frameworks, which enable simple definitions of micro- Once, Voila-MLIR is translated to these high-level interoptimizations as transformations. Some commonly used nal dialects of MLIR, we apply loop optimization passes transformations, such as loop unrolling, vectorization, such as parallelization, tiling, unrolling and loop fusion. parallelization, etc., are already implemented and can be Afterward, we continue lowering towards more lowerused. Transformations in MLIR are applied in passes, and level dialects such as the vector dialect that represents additional filters even allow applying transformations virtual, hardware-independent vectors and instructions selectively on certain instructions, or supplying certain that can directly be mapped to the hardware’s SIMD parameters to the passes. In addition to the pre-existing capabilities. Other translations in this phase are lowpasses, we implemented a selection dematerialization ering of parallel loops to the async dialect, which reppass to compensate for the missing predication by re- resents multi-threading through LLVM coroutines, or placing the materialized result of the selection with a to the openmp dialect. In this step, it is also possible to predicate, indicating the selected tuples, when possible. transform parallel loops for execution on further coproThe dematerialization pass uses a forward data flow anal- cessors, such as GPU, e.g. by using the gpu dialect. This ysis of selection results to decide whether to materialize dialect models the GPU execution model and also handles the result or keep the predication. Currently, we mate- communication between CPU and GPU. For a concrete rialize the results any time, where pipeline breakers are execution on GPU, the dialect can further be lowered to encountered, but the pass can be extended with further SPIR-V or AMD and NVIDIA specific dialects. options and more complex scenarios on when to replace In the next step, the low-level MLIR is lowered to materialization with predication and vice versa. LLVM and finally compiled to executable byte code and

As MLIR is built upon LLVM, it also comes with na- then the query can be executed. tive support for heterogeneous hardware. Hence, we are confident that our DSL is able to adjust its behavior to fit processing paradigms on heterogeneous hardware 5. Evaluation with only minimal extensions. However, in the following w.l.o.g., we focus on the extensible CPU-only part. There- 5.1. Dataset & Modifications fore, we use vectorization to utilize SIMD capabilities of To get an overview of the real-world behavior of our modern CPUs, as well as multithreading with built-in framework, we use a subset of the TPC-H benchmark. LLVM coroutines, or the OpenMP runtime. More precisely, we choose the queries Q1, Q3, Q6, Q9

In order to translate Voila to executable machine and Q18 as a representative sample for typical OLAP code, we added a backend to Voila that translates Voila queries [17]. Since our framework does not yet support to a Voila-MLIR dialect. From there on, we use the joins eficiently and has no support for string data types, we modify the TPC-H dataset slightly to still be able to run the queries.

We use order-preserving dictionary compression for string columns and denormalize tables that need to be joined in order to replace joins by selections that are simpler to implement eficiently. We argue that the dictionary compression has no significant impact on the results of our benchmarks, as strings in TPC-H are mostly only used as payloads [18] and since we are already using column-oriented query processing, the strings would commonly not be loaded at all for most of the time. The only exception here is Q9, where a string pattern matching takes place, which has to be translated in the case of dictionary compression. We choose to work around the matching by using a large IN-clause, checking for existence of each tuple using a hash table. For the replacement of joins with simple selections on denormalized tables, Li and Patel suggest that it has no large influence on the query performance and leads to comparable results [19]. 5.2. Setup these flags are not applied to the query compiler. To get more robust timing results, we disabled software-based frequency scaling and ran each benchmark 100 times.

The reported times were averaged using the median.

5.3. Adaptivity Behavior In order to get an overview of the efect of our optimiza

tions on the runtime of the queries, we only generate a subset of the variant space that we are able to generate. To restrict the number of variants to a maintainable size, we only create all configurations of the following five major optimizations and configurations: loop unrolling: copying the loop body multiple times within the loop to reduce control-flow overhead of the loop condition check loop tiling: splitting the loop body into an inner and outer loop, efectively creating a blocking of memory access to increase cache performance selection dematerialization: replacing materializations of intermediate selection results by bitvectors indicating selected tuples to reduce memory usage and increase vectorization capabilities parallelization: splitting independent loop iterations to multiple cores using LLVM coroutines or OpenMP vectorization: transform instructions in loop bodies to SIMD counterparts for data-parallel execution

In order to get a first overview of the performance char

acteristics of our experiments on diferent platforms, we use three diferent machines listed in Table 1. All three In addition, we used some minor, permanently enabled machines use Intel processors, but from diferent genera- optimizations such as CSE, inlining and bufer optimizations and with slightly diferent architectures and clock- tion passes to obtain both better optimizable and optirates, which should already show varying results. mized code.

To keep performance diferences limited to the hard- Furthermore, we restricted the configuration of these ware, we created a unified build tool chain with all ma- optimizations to a per-program level instead of generatchines using the same bootstrapped LLVM version 13.0.1 ing variants for each operation. With these prerequisites, to build the experiments. Additionally, we supplied the we generated 120 variants for each machine and show the following compiler flags for optimization: -O3 -flto results of the pure execution times without compilation -march=native . Where -O3 instructs the compiler to as heatmaps in Figure 2. To keep the time for running the apply more aggressive, but also expensive (in terms of configurations acceptable, we used the TPC-H dataset compile-time) optimization of the code, -flto enables with scale factor 1. link-time optimizations, such as de-virtualization of func- Each column of a heatmap represents a diferent vectortions and -march=native tells the compiler to enable in- size, while the rows describe combinations of unrolling, structions that are available for the architecture on which tiling and selection dematerialization. The diferent parthe code is compiled. Such instructions are usually SSE4 allelization techniques are divided by Figure 2a, Figure 2b and AVX, among others. While more advanced instruc- and Figure 2c. tions increase the overall latency of our framework to Due to an outdated OpenMP runtime on Machine 3, optimize and run code, this also makes the binary largely there are no heatmaps for OpenMP parallelization in this not portable to other architectures. Overall, the optimiza- machine. Furthermore, combining vectorization together tions only have an efect on the framework compile-time with tiling for Q6 was problematic. The main reason of the query, but not on the query execution time, as was that the generated code could not be compiled because parallel loops only allow reductions on scalar types through atomic read-modify-write operations. However, as the loop body was vectorized, the reduction would have to be done on vectors, which is not supported. Due to this limitation in MLIR, we decided to exclude variant results with tiling altogether for Q6. The key observations from this experiment are:

1. All machines show similar trends regarding vari

ants, but diferent configurations turn out to be most eficient (e.g., unrolling vs. no unrolling for Q6 or OpenMP vs. async). 2. Tiling has a mostly negative influence on the query performance, with a speed-up ×1 to ×0.1 compared to no optimization. 3. For Q1 and Q6, selection dematerialization improves the performance by a factor of up to ×10, whereas Q3 and Q9 showed decreased performance by as low as ×0.25. 4. Loop unrolling had only a moderate influence when combined with vectorization for larger vector sizes. 5. Vectorization had slightly positive efects on the runtime of Q6 (×2 speed-up), for the other queries it had slightly negative efects.

Overall, these experiments show that our approach is able to generate diferently performing variants that can be used to optimize the runtime of diferent queries depending on the executing hardware and software characteristics – even with only a small part of the possible variant space explored. On the other hand, we also observe some unexpected behavior, such as the partially low influence of vectorization and parallelization on the runtime, especially visible for Q1, Q9, and Q18. After inspection of the generated MLIR code, we saw that these queries are often not transformed to parallel loops because the transformation pass of MLIR is too restrictive in its side efect analysis and does not parallelize/vectorize loop nests with non-afine branches instead of traversing and analyzing their memory accesses recursively.

6. Conclusion & Future Work

In this paper, we presented our vision of the future of fully adaptive database query execution engines and showed that support of micro-optimizations in database query engines can have a performance impact in the order of magnitudes. At first, we identified the main challenges to tackle in order to achieve a fully adaptive database engine that, despite the clear advantage of such systems, still does not exist today. We find that there are approaches that tackle each of the challenges individually, but up to now there exists no framework that takes everything into none sel 1 u ien u + sel ch t aM t + sel

t + u t + u + sel none sel 2 u ien u + sel ch t aM t + sel

t + u t + u + sel none sel 3 u ien u + sel ch t aM t + sel

t + u t + u + sel none sel 1 u ien u + sel ch t aM t + sel

t + u t + u + sel none sel 2 u ien u + sel ch t aM t + sel

t + u t + u + sel none sel 3 u ien u + sel ch t aM t + sel

t + u t + u + sel none sel 1 u ien u + sel ch t aM t + sel

t + u t + u + sel none sel 2 u ien u + sel ch t aM t + sel

t + u t + u + sel (a) Variants performance for all machines without parallelization (b) Variants performance for all machines with coroutines 1 2 4 8 16 Vector Size 1 2 4 8 16 Vector Size 1 2 4 8 16 Vector Size 1 2 4 8 16 Vector Size 1 2 4 8 16

Vector Size 500 1000 pipeline and, thus, we tackle the challenges of adaptivity [9] P. A. Boncz, et al., Monet: A next-generation DBMS and granularity of the optimizations. Moreover, we high- kernel for query-intensive applications, 2002. light further pathways to tackle the remaining challenges [10] T. Gubner, P. Boncz, Excalibur: A virtual machine in the context of our approach. for adaptive fine-grained jit-compiled query execu

We briefly evaluated the functionality and perfor- tion based on voila, Proc. VLDB Endow. 16 (2022). mance using a modified version of the TPC-H benchmark. [11] T. Neumann, Eficiently Compiling Eficient Query We showed that our approach is able to generate diverse Plans for Modern Hardware., Proc. VLDB Endow. variants with diferent performance properties. While 4 (2011) 539–550. we still found some issues and the prototype is at an early [12] Y. Srikant, P. Shankar, The Compiler Design Handstage, we are confident that when addressing the issues, book: Optimizations and Machine Code Generation, our approach will be able to outperform non-adaptive CRC Press, 2003. systems. [13] C. Bertoni, J. Kwack, T. Applencourt, Y. Ghadar,

Our immediate next step is the addition of support B. Homerding, C. Knight, B. Videau, H. Zheng, for heterogeneous hardware, as well as improvement of V. Morozov, S. Parker, Performance portability the prototype to be more competitive by overcoming the evaluation of opencl benchmarks across intel and identified problems. Additionally, we want to extend our nvidia platforms, in: Proc. IPDPSW 2020, 2020, pp. approach to support joins eficiently, as well as string 330–339. types. For the future, we plan to extend our prototype to [14] M. Jungmair, A. Kohn, J. Giceva, Designing an open tackle all of our presented challenges, to become a fully framework for query optimization and compilation, adaptive reprogramming, heterogeneous database sys- Proc. VLDB Endow. 15 (2022) 2389–2401. tem that is able to achieve peak performance independent [15] J.-C. Wang, D. Ding, H. Wang, C. Christensen, of the supplied workload and hardware. Z. Wang, H. Chen, J. Li, Polyjuice: Highperformance transactions via learned concurrency control., in: OSDI, 2021, pp. 198–216.

References [16] H. Pirk, O. R. Moll, M. Zaharia, S. Madden, Voodoo - A Vector Algebra for Portable Database Perfor[1] T. Gubner, P. Boncz, Highlighting the performance mance on Modern Hardware, Proc. VLDB Endowdiversity of analytical queries using voila, in: Proc. ment 9 (2016) 1707–1718.

ADMS 2021, 2021. [17] T. Kersten, V. Leis, A. Kemper, T. Neumann, [2] K. Dursun, C. Binnig, U. Cetintemel, R. Petrocelli, A. Pavlo, P. A. Boncz, Everything You Always SiliconDB: Rethinking DBMSs for Modern Hetero- Wanted to Know About Compiled and Vectorized geneous Co-Processor Environments, in: Proc. Da- Queries But Were Afraid to Ask, Proc. VLDB Endow. mon 2017, ACM, 2017. 11 (2018) 2209–2222. [3] D. Broneske, Adaptive Reprogramming for [18] A. Vogelsgesang, M. Haubenschild, J. Finis, A. KemDatabases on Heterogeneous Processors, SIGMOD per, V. Leis, T. Mühlbauer, T. Neumann, M. Then, 2015 PhD Symposium, ACM, 2015, p. 51–55. Get Real: How Benchmarks Fail to Represent the [4] B. Răducanu, P. Boncz, M. Zukowski, Micro Adap- Real World, in: Proc. DBTest 2018, 2018, p. 1–6. tivity in Vectorwise, in: Proc. SIGMOD 2013, ACM, [19] Y. Li, J. M. Patel, WideTable: An Accelerator for 2013, p. 1231–1242. Analytical Data Processing, Proc. VLDB Endow. 7 [5] S. Breß, B. Köcher, H. Funke, S. Zeuch, T. Rabl, (2014) 907–918.

V. Markl, Generating custom code for eficient query execution on heterogeneous processors,

VLDB Journal 27 (2018) 797–822. [6] T. Gubner, P. Boncz, Charting the Design Space of

Query Execution Using VOILA, Proc. VLDB Endow.

14 (2021) 1067–1079. [7] C. Lattner, M. Amini, U. Bondhugula, A. Cohen,

A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, O. Zinenko, MLIR: Scaling Compiler Infrastructure for Domain Specific Computation, in: Proc.

CGO 2021, IEEE/ACM, 2021, p. 2–14. [8] C. Lattner, V. Adve, LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation, in: Proc. CGO 2004, IEEE, 2004, p. 75–86.