=Paper=
{{Paper
|id=Vol-3714/paper3
|storemode=property
|title=Towards a Future of Fully Self-Optimizing Query Engines
|pdfUrl=https://ceur-ws.org/Vol-3714/paper3.pdf
|volume=Vol-3714
|authors=Paul Blockhaus,Gabriel Campero Durand,David Broneske,Gunter Saake
|dblpUrl=https://dblp.org/rec/conf/gvd/BlockhausDBS23
}}
==Towards a Future of Fully Self-Optimizing Query Engines==
Paul Blockhaus¹, Gabriel Campero Durand¹, David Broneske² and Gunter Saake¹
¹ Otto-von-Guericke University, Magdeburg, Germany
² German Centre for Higher Education Research and Science Studies, Hannover, Germany
Abstract
With the ever-increasing heterogeneity of hardware, the database community is tasked with adapting to the new reality of diverse ecosystems. The traditional workflow of hand-tuning query engine implementations to the underlying hardware might become untenable for an ever-growing variety of hardware with different performance characteristics. Systems like Micro-Adaptivity in Vectorwise or HAWK have been studied as adaptive solutions, but their adoption remains limited. Envisioning a solution simplified for adoption, we propose a practical take on adaptive reprogramming using the domain-specific language Voila and the MLIR compiler framework. We identify five main challenges in the area and demonstrate how we tackle the first two. To show the feasibility of our approach, we include a brief evaluation of its performance on TPC-H, comparing 120 generated variants from a small subspace of potential optimizations.

Keywords
Adaptive Reprogramming, Micro-Adaptivity, Query Engines, Compiled Query Execution, MLIR, Voila
34th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), June 7-9, 2023, Hirsau, Germany
blockhau@ovgu.de (P. Blockhaus); campero@ovgu.de (G. C. Durand); broneske@dzhw.eu (D. Broneske); saake@ovgu.de (G. Saake)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Emerging trends in hardware development show a paradigm shift away from a single instruction set architecture (x86) towards a more heterogeneous ecosystem (e.g., RISC-V, ARM). While the complete consequences for the performance and design of database systems in this new era continue to be fleshed out [1], it is certain that engineers will continue to be tasked with adapting storage and query engines to new hardware. This trend has already been visible in the last few years, with database systems adopting heterogeneous hardware [2] and finding benefits by exploiting new processing capabilities such as SIMD and non-volatile RAM, among others. This increased heterogeneity opens up entire new sets of improvements for query engines. In this circumstance, the choice of a physical algorithm is likely to hinge less on the characteristics of queries and increasingly on those of the underlying hardware. Broneske et al., among others, showed that with increasing heterogeneity, not only the choice of algorithm and operator placement matters for query processing, but also the implementation variant of the algorithms is crucial [3]. In numerous instances, promising variants are based only on so-called micro-optimizations commonly used by developers, as well as compilers, to optimize high-performance algorithms (e.g., branch predication, loop unrolling).

Over the last decade, research has cemented that these micro-optimizations have a considerable influence on query performance. Hence, several solutions have been proposed to identify and use the best combinations of micro-optimizations to reach peak performance in hardware-sensitive algorithms over specific heterogeneous hardware [4, 5]. However, to the best of our knowledge, none of these approaches has led to a fully adaptive reprogramming system that can automatically generate, choose and maintain the best query execution plan and strategy, applying micro-optimizations to the query on the fly before executing it. Consequently, the goal of a flexible and highly adaptive approach to peak performance that relieves the burden of ongoing hardware-aware optimization and specialization arguably remains only partly realized. To help achieve this goal of adaptive reprogramming, we outline a vision for a fully adaptive, self-optimizing, hardware-tuned query engine. We identify five main challenges that need to be addressed on the way and integrate the components into a single system. In order to have a sustainable platform for our vision, we choose to link upfront some solutions in compiler development. This enables us to directly benefit from the early adoption of new features into compilers and unifies the infrastructure for better maintenance and efficient development. To demonstrate a first step towards realizing our vision, we develop a prototypical open-source approach (https://github.com/SuperFoo42/voila_mlir/tree/v0.1) that allows for adaptivity in the entire query execution through the use of a domain-specific language (DSL).

Our implementation is based on Voila [6], a DSL for query execution able to model various strategies. For our work, we implemented a new compilation backend for Voila's execution. The backend is based on the MLIR framework, which enabled us to implement micro-optimizations of entire query plans at a per-operator level. To show the potential of our approach, in this work we empirically evaluate the performance of the DSL and its variants on an adapted subset of the TPC-H benchmark.
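To illustrate the kind of micro-optimization meant above, the following minimal Python sketch (illustrative only, not code from the paper's prototype) contrasts a branching selection with a branch-free, predicated variant. Both compute the same filter, but the predicated form has no data-dependent branch in the loop body:

```python
def select_branching(values, threshold):
    """Branching filter: a data-dependent 'if' per element."""
    out = []
    for v in values:
        if v < threshold:
            out.append(v)
    return out

def select_predicated(values, threshold):
    """Branch predication: always store, advance the cursor by the
    comparison result (0 or 1), so the loop body never branches."""
    out = [0] * len(values)
    n = 0
    for v in values:
        out[n] = v            # unconditional store
        n += v < threshold    # cursor moves only if the predicate holds
    return out[:n]

data = [5, 1, 9, 3, 7]
assert select_branching(data, 4) == select_predicated(data, 4) == [1, 3]
```

On real hardware, the predicated form trades extra stores for the absence of branch mispredictions, which is exactly the kind of trade-off whose profitability depends on the data and the machine.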
The remainder of the paper is structured as follows: First, we introduce the tools upon which our prototype is built, with an overview of previous work. Next, we introduce the five main challenges we identified for a fully adaptive query engine. We follow with an overview of the architecture of our prototype, which tackles the first two challenges. Finally, we present an early empirical evaluation.

2. Background & Related Work

In this section, we discuss the MLIR framework and the Voila DSL, which are the basis for our prototype, enabling efficient and adaptive generation of operator variants. Afterward, we give a brief overview of previous work.

2.1. MLIR

MLIR [7] is a compiler framework relying on LLVM [8] to quickly build DSLs. MLIR offers a generic intermediate representation providing a unified syntax and support for commonly needed functionalities of DSL compilers, such as type inference, lowering to executable code, interfaces for common code transformations, and many more. The two core components of MLIR are dialects and transformations. Dialects are different DSLs tailored to specific use cases: e.g., the scf dialect represents structured control flow elements, the memref dialect represents memory references and accesses, and the gpu dialect allows expressing instructions executed by GPUs, as well as the communication between CPU and GPU.

The structuring of MLIR into dialects, and the possibility of mixing multiple dialects in a single representation, allows for efficient interoperation between dialects and simplifies their implementation. Besides dialects, transformations are the other central component of MLIR. They convert between different dialects, while also supporting transformation patterns within a dialect, for example an optimization or canonicalization. As dialects are designed to represent problems in their domain efficiently, transformations can also be handled efficiently. This leads to better optimization times and simpler rules compared to general-purpose compilers such as LLVM. Additionally, transformations can be used in a plug-and-play fashion by conversion to the particular dialect. These concepts show, to the best of our knowledge, a great potential for a compiled query engine capable of utilizing heterogeneous hardware, and are unique among compiler frameworks.

2.2. Voila

Voila [6] is a DSL designed for efficient query execution, independent of the execution paradigm. To this end, the language is decoupled from the execution backend, which can flexibly run Voila programs in a vectorized or compiled fashion. To achieve this independence, Voila is designed with vector data types as first-class citizens. The size of the vector determines the execution style, where a size of ∞ corresponds to operator-at-a-time execution and a size of 1 to tuple-at-a-time execution. To have a language that allows this degree of flexibility, the granularity of the operations is set to execute on entire vectors, but the instructions are kept general-purpose, so that different algorithms can be written without a large re-implementation effort. The resulting granularity lies somewhere between relational algebra operators, e.g., MAL [9], and virtual CPU instructions, e.g., LLVM IR.

2.3. Previous Approaches

In recent years, several approaches for adaptivity in database query execution have been proposed. One of the first achieved micro-adaptivity in Vectorwise [4] through operator variants that were pre-compiled in libraries and could be replaced during run-time by dynamic library loading.

A newer approach to micro-adaptivity was proposed with Excalibur, a virtual machine for adaptive query execution [10]. It utilizes Voila to generate queries that are executed in a vectorized fashion and replaces parts of the execution pipeline with compiled variants to search for an optimal execution plan variant. The HAWK framework [5] used the C preprocessor and OpenCL to achieve adaptivity of the query algorithms on heterogeneous hardware. Looking beyond variant generation, both solutions also considered the selection of the optimal variant from the variant pool using machine learning. This was necessary because the spanned search space is extremely large, and the performance of each variant on a given hardware-query combination is inherently hard to predict. While the micro-adaptivity approach used an online multi-armed bandit to find the optimal variants, HAWK used offline learning with a heuristic to probe the optimization space for the most efficient variants.

Micro-adaptivity and HAWK suffer from a rather restricted set of adaptivity due to pre-compiled variants using a template mechanism on top of a high-level language. In addition, HAWK doesn't provide any run-time adaptivity and has to be trained for a workload in advance. Excalibur is able to overcome the restrictions of high-level language templating, but still suffers from the problem of having to implement each variant transformation.
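Voila's idea of a single vector size spanning the spectrum from tuple-at-a-time to operator-at-a-time execution can be mimicked in a few lines of Python (a conceptual sketch, not Voila syntax): the same selection logic runs unchanged whether the chunk size is 1, a cache-friendly block, or the whole column:

```python
def select_lt(column, threshold, chunk_size):
    """Filter 'column' in chunks; chunk_size picks the execution style:
    1 = tuple-at-a-time, len(column) = operator-at-a-time,
    anything in between = vectorized execution."""
    result = []
    for start in range(0, len(column), chunk_size):
        chunk = column[start:start + chunk_size]
        result.extend(v for v in chunk if v < threshold)
    return result

column = [4, 8, 1, 9, 2, 7]
full = select_lt(column, 5, len(column))   # operator-at-a-time
assert select_lt(column, 5, 1) == full     # tuple-at-a-time
assert select_lt(column, 5, 2) == full     # vectorized, chunks of two
```

The point of the abstraction is that the backend, not the program, decides the chunk size, which is what lets one Voila program be executed under different paradigms.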
3. Challenges for Adaptive Query Execution Frameworks

Challenge 1 – Granularity of Adaptivity

Finding the right granularity at which adaptive reprogramming is introduced is a key factor for efficiency in several aspects. The granularity determines the complexity of the system, as well as the resulting possibilities to reach peak performance. A coarse granularity at the level of, e.g., pipeline programs will only allow for a global level of flexibility, such as optimizing compiler flags, and their influence will overall not lead to large benefits compared to the choice of algorithms and access paths used for the query. Though compiler flags could improve the usage of hardware capabilities, the approach is still limited in optimizing queries individually with regard to their characteristics.

In contrast, fine-grained adaptivity incurs additional overhead during variant generation and puts high complexity on the system for optimization due to the large variant space. For example, introducing reprogramming into HyPer's operators [11] would lead to a huge variant space, as the operators can be written in LLVM IR, but it also adds the complexity of an entire compiler that has to be able to exploit this variability, entirely disregarding the complexity of implementing algorithms directly in LLVM IR. The use of mainstream compilers for this level of variability is rather limited, as current compilers only concentrate on generating the best code for the common case, or work with profiles to optimize for a pre-profiled use case [12]. To our knowledge, there is no approach that instructs the compiler to transform LLVM IR or similar low-level intermediate languages to optimize the code at such fine granularity.

Therefore, current approaches choose the granularity of pipeline operators for variability, as a compromise between complexity and variability. These operators are usually implemented in a high-level general-purpose programming language, which limits the degree of variability through its language constructs and their mapping to hardware instructions. In the presence of heterogeneous systems, the language in which the operators are implemented has to support all target platforms, which leads either to using a hardware-oblivious language, e.g., OpenCL, or to a hardware-sensitive alternative mixing multiple languages. Both approaches come with their own disadvantages: they might not be performance-portable, or they add a lot of complexity through re-implementation of algorithms to optimize for hardware characteristics.

In order to overcome the above-mentioned problems of the individual approaches, Broneske et al. proposed the use of DSLs [3]. While this does not directly solve the problem of choosing the right granularity, it allows designing the DSL to fit an arbitrary granularity, or multiple levels of granularity with different language constructs, as required. Additionally, the DSL can be designed to efficiently support various types of hardware and even handle work distribution scenarios [13]. Unfortunately, to our knowledge, there are no established approaches that utilize a DSL for adaptive reprogramming in data management.

Challenge 2 – Adaptivity Mechanism

Closely linked to the choice of granularity is the choice of the adaptivity mechanism. This refers to the component responsible for creating variants at the chosen granularity, and therefore also (partially) defining the available variant space. This component will depend on the granularity and query execution paradigm. Current approaches introduce adaptivity through an additional abstraction layer, producing the variants in a preprocessing step before compilation, i.e., using the C preprocessor for variant templates. These variants are then used either directly in a compiled execution engine [5] or loaded dynamically from libraries [4].

While this approach can be easily implemented, and is shown to be very effective, it has some drawbacks, such as adaptivity bounded by the expressiveness of the underlying language and the specification of the preprocessor. Additionally, using a high-level language for adaptivity does not guarantee that the optimizations are actually applied, because state-of-the-art compilers apply their own heuristics to the code, which may undo the applied optimizations or optimize differently. We argue that a direct integration of the adaptivity mechanism into the compiler will help to overcome these problems, as compilers already support many micro-optimizations by default; they just have to be applied. This eliminates redundancy in the design and gives full control over the compilation process, which also allows restricting the optimizations or heuristics the compiler uses, so as not to interfere with manually adapted code. Another advantage of this approach is that it combines nicely with our proposed solution for the first challenge, as modifications of the compiler to support a DSL can already require the design of a compilation pipeline. In our prototype, we are able to utilize the different granularities of the different DSLs that MLIR offers to adaptively optimize the query in MLIR's own compilation pipeline.

Challenge 3 – Learning Variants

With the first two challenges, we described the problem of variability, but without a good variant selection mechanism, such variability might be essentially useless. The main problem for finding the best variants is the extensive search space, which increases exponentially with every added optimization.
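The exponential growth of this search space is easy to see: with n independent on/off micro-optimizations there are already 2^n variants before any parameters (unroll factors, tile sizes) are considered. A small Python illustration (the optimization names and parameter values here are examples, not the prototype's actual pass names):

```python
from itertools import product

opts = ["unroll", "tile", "dematerialize", "parallelize", "vectorize"]

# Every on/off combination of the five optimizations: 2**5 = 32 variants.
variants = list(product([False, True], repeat=len(opts)))
assert len(variants) == 2 ** len(opts) == 32

# A single parameterized knob per optimization, e.g. three unroll
# factors and three tile sizes, multiplies the space further:
unroll_factors = [2, 4, 8]
tile_sizes = [16, 64, 256]
parameterized = len(variants) * len(unroll_factors) * len(tile_sizes)
assert parameterized == 288
```

This blow-up is why both HAWK and Micro-Adaptivity resort to learned or heuristic pruning rather than exhaustive search.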
The state-of-the-art approaches we described in subsection 2.3 leverage machine learning techniques to learn and predict the best variants for a certain workload. The HAWK framework uses an offline-learning approach that is fed with example queries and, in this way, learns the best flavors for the underlying hardware [5]. To reduce the learning time, additional heuristics are used to prune the exploration space. This approach focuses solely on optimization for a mostly static workload and on the best working parameters for optimal hardware utilization. Another approach, used by micro-adaptivity, is online learning, which can also adapt to changing workloads [4]. This is achieved by formulating the variant selection as a multi-armed bandit learning problem. This continuously relearning system is also efficiently usable without previous learning.

We argue that an online-learning approach is, in general, preferable over offline learning, as it is designed to adapt to changes in the data management workloads by itself, whereas offline learning would need re-training over new representative workloads, which nowadays can change quickly and are therefore not easily predictable. A solution combining offline learning, to establish the starting models, with online learning might combine the best of both approaches.

We envision not only using models to learn the best optimizations, instructing the compiler to apply certain optimizations in chosen places, but also generating parameters for each optimization and guiding the compiler regarding the number of optimization passes to apply. Finally, we envision that the selected solution should integrate well with an MLOps framework, considering the lifecycle, evaluation, and maintenance of the models used.

Challenge 4 – Integration Into High-Level Optimization

To take a step from a pure query execution engine towards a more widely usable database engine, an adaptive reprogramming framework must be compatible with other high-level optimizations such as algebraic optimization and index and algorithm selection. To achieve this goal, one of two challenges has to be accomplished: either the reprogramming framework needs to be integrated into an existing DBMS, or the missing components are integrated into the framework. Neither of these two solutions has been successfully applied yet. To the best of our knowledge, there only exist approaches that integrate parts of traditional optimizations into the adaptivity pipeline. The HAWK framework, for instance, has limited support for hash table algorithm selection besides the operator-level variant granularity. Furthermore, Jungmair et al. showed a first example of how to integrate algebraic optimization successfully into MLIR, from which further high-level optimizations can be introduced [14]. Even though both versions are possible, we plan to adapt the modeling of relational algebra in MLIR, as it is the more flexible approach, with the added possibility of combining it with the learning mechanism mentioned before.

Challenge 5 – Adaptivity for OLTP

Current approaches in code generation for data processing focus heavily on OLAP queries, which is reasonable due to their complexity and high latencies. This focus neglects OLTP queries, which, we believe, might still provide room for improvement through code generation, with a focus on jointly optimizing sets of concurrent queries or templates of them. We envision, at this stage, that OLTP queries could be served in at least three main ways. First, recent work has shown promising results by tackling the challenge of learning an adaptive concurrency control mechanism for a mix of pre-defined query templates at different contention scenarios [15]. We consider that different mixes of these queries and the variable concurrency control mechanism might be able to leverage a code generation framework for higher flexibility and adaptivity in the search for high throughput. To achieve this, we consider that multi-query optimization might be necessary, opening the door for further kinds of improvements, such as operator sharing. Secondly, we consider that interpretation of MLIR for faster execution without compile overhead, and a switch between interpretation and compilation through coroutines, could be relevant for workload mixes that could be categorized as HTAP. Finally, we argue that the flexibility of MLIR allows for the study of a new dialect focusing on robust processing over adjacent data versions (as used in MVCC or in Delta Lakes), which might be able to provide competitive support for high-contention scenarios.

4. A Transformation-Based Approach for Adaptivity

In the following, we present our prototype, which seeks to tackle, at this stage, the first two challenges described in the previous section. We start with the architecture of our approach (cf. Figure 1).

To have a more fine-grained granularity with a larger variant space while only adding a rather small overhead, we adapted Voila [6] as a DSL with sub-operator granularity (Challenge 1) and the additional feature of being agnostic to the query execution strategy. The design of Voila also allows for simple adaptation to efficient execution on heterogeneous processors, as similar designs have already demonstrated [16]. Instead of explicit predication, we rely on a separate compilation pass to automatically choose when to use predication and when to materialize selections, which allows for more flexibility in the optimization while also simplifying the language.
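The choice such a pass makes can be illustrated with a small Python sketch (illustrative only; the prototype operates on MLIR, not Python): a selection can either be materialized as a vector of qualifying positions, or kept dematerialized as a predicate bitmask that downstream operators consume:

```python
def select_materialized(col, threshold):
    """Materialized selection: an explicit vector of row positions."""
    return [i for i, v in enumerate(col) if v < threshold]

def select_predicate(col, threshold):
    """Dematerialized selection: a predicate bitmask, one flag per row."""
    return [v < threshold for v in col]

def sum_selected(col, positions=None, mask=None):
    """Downstream aggregation works with either representation."""
    if positions is not None:
        return sum(col[i] for i in positions)
    # With a mask, multiplying by the flag acts like predication:
    # every row is touched, but only selected rows contribute.
    return sum(v * m for v, m in zip(col, mask))

col = [10, 3, 8, 1, 6]
pos = select_materialized(col, 7)   # positions [1, 3, 4]
mask = select_predicate(col, 7)     # [False, True, False, True, True]
assert sum_selected(col, positions=pos) == sum_selected(col, mask=mask) == 10
```

The paper's pass decides between these two forms via data-flow analysis, materializing at pipeline breakers; in this sketch the final aggregation stands in for such a consumer.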
[Figure 1: High-level architectural schema of the framework. A SQL query enters as a Voila program (hardware-oblivious query representation), is lowered through MLIR (adaptive code generation) to LLVM (target-dependent code generation), and is JIT-executed on heterogeneous hardware to produce the query results.]

Implementing those types of adaptivity directly into the DSL has the great advantage that it makes the DSL adaptive by design (Challenge 2). In contrast to other approaches, there is no longer the need to provide an additional callback to pre-compiled algorithms or to change the query plan. Instead, the query engine directly calls the generated query code, which is adapted during JIT compilation by the supplied transformation and optimization passes.

The DSL itself is built upon the MLIR and LLVM frameworks, which enable simple definitions of micro-optimizations as transformations. Some commonly used transformations, such as loop unrolling, vectorization and parallelization, are already implemented and can be used. Transformations in MLIR are applied in passes, and additional filters even allow applying transformations selectively to certain instructions, or supplying certain parameters to the passes. In addition to the pre-existing passes, we implemented a selection dematerialization pass that compensates for the missing predication by replacing the materialized result of a selection with a predicate indicating the selected tuples, when possible. The dematerialization pass uses a forward data-flow analysis of selection results to decide whether to materialize the result or keep the predication. Currently, we materialize the results whenever pipeline breakers are encountered, but the pass can be extended with further options and more complex scenarios on when to replace materialization with predication and vice versa.

As MLIR is built upon LLVM, it also comes with native support for heterogeneous hardware. Hence, we are confident that our DSL is able to adjust its behavior to fit processing paradigms on heterogeneous hardware with only minimal extensions. However, in the following, w.l.o.g., we focus on the extensible CPU-only part. Therefore, we use vectorization to utilize the SIMD capabilities of modern CPUs, as well as multithreading with built-in LLVM coroutines or the OpenMP runtime.

In order to translate Voila to executable machine code, we added a backend to Voila that translates Voila to a Voila-MLIR dialect. From there on, we use the MLIR framework and its dialects to introduce adaptivity into the code and lower it to LLVM, which is then JIT-compiled and executed.

The usual compilation pipeline of our framework looks as follows. Once Voila has been translated to Voila-MLIR, which includes type resolution and the transformation of columns to the tensor data type, aggressive function inlining is performed to achieve redundant sub-query removal through canonicalization and common sub-expression elimination, as well as to allow for better optimizations by eliminating function call boundaries.

Afterward, the dematerialization pass is performed and the resulting plan is lowered to MLIR's internal dialects, either linalg or affine. The linalg dialect describes loops using the concept of iterators. Each linalg operation consists of a set of input tensors and generates a set of output tensors. The body of linalg operations has to be largely side-effect free and, aside from regular reductions, doesn't allow for data or control flow dependencies between the iterations. Therefore, we resort to the affine dialect for operations that need broader constraints. The affine dialect models range-based for loops and allows direct memory manipulations, only constrained by the loop induction variables.

Once Voila-MLIR is translated to these high-level internal dialects of MLIR, we apply loop optimization passes such as parallelization, tiling, unrolling and loop fusion. Afterward, we continue lowering towards lower-level dialects such as the vector dialect, which represents virtual, hardware-independent vectors and instructions that can be mapped directly to the hardware's SIMD capabilities. Other translations in this phase are the lowering of parallel loops to the async dialect, which represents multi-threading through LLVM coroutines, or to the openmp dialect. In this step, it is also possible to transform parallel loops for execution on further coprocessors such as GPUs, e.g., by using the gpu dialect. This dialect models the GPU execution model and also handles communication between CPU and GPU. For concrete execution on a GPU, the dialect can be lowered further to SPIR-V or to AMD- and NVIDIA-specific dialects.

In the next step, the low-level MLIR is lowered to LLVM and finally compiled to executable byte code, and then the query can be executed.

5. Evaluation

5.1. Dataset & Modifications

To get an overview of the real-world behavior of our framework, we use a subset of the TPC-H benchmark. More precisely, we choose the queries Q1, Q3, Q6, Q9 and Q18 as a representative sample of typical OLAP queries [17].
Table 1: Benchmark machine specifications

            Processor                         RAM      SIMD
Machine 1   Intel Xeon E-2286M @ 2.4 GHz      128 GB   AVX2
Machine 2   Intel Xeon Gold 6130 @ 2.1 GHz    346 GB   AVX-512
Machine 3   Intel Xeon E5-2630 v3 @ 2.4 GHz   1008 GB  AVX2

Since our framework does not yet support joins efficiently and has no support for string data types, we modify the TPC-H dataset slightly to still be able to run the queries. We use order-preserving dictionary compression for string columns and denormalize tables that need to be joined, in order to replace joins by selections that are simpler to implement efficiently. We argue that the dictionary compression has no significant impact on the results of our benchmarks, as strings in TPC-H are mostly used only as payloads [18], and since we are already using column-oriented query processing, the strings would commonly not be loaded at all most of the time. The only exception here is Q9, where a string pattern matching takes place, which has to be translated in the case of dictionary compression. We choose to work around the matching by using a large IN-clause, checking for the existence of each tuple using a hash table. For the replacement of joins with simple selections on denormalized tables, Li and Patel suggest that it has no large influence on the query performance and leads to comparable results [19].

5.2. Setup

In order to get a first overview of the performance characteristics of our experiments on different platforms, we use the three machines listed in Table 1. All three machines use Intel processors, but from different generations and with slightly different architectures and clock rates, which should already show varying results.

To keep performance differences limited to the hardware, we created a unified build toolchain, with all machines using the same bootstrapped LLVM version 13.0.1 to build the experiments. Additionally, we supplied the following compiler flags for optimization: -O3 -flto -march=native. Here, -O3 instructs the compiler to apply more aggressive, but also expensive (in terms of compile time), optimization of the code; -flto enables link-time optimizations, such as de-virtualization of functions; and -march=native tells the compiler to enable instructions that are available on the architecture on which the code is compiled. Such instructions are usually SSE4 and AVX, among others. While more advanced instructions increase the overall latency of our framework to optimize and run code, this also makes the binary largely not portable to other architectures. Overall, these optimizations only have an effect on the compile time of the framework, but not on the query execution time, as these flags are not applied to the query compiler. To get more robust timing results, we disabled software-based frequency scaling and ran each benchmark 100 times. The reported times were aggregated using the median.

5.3. Adaptivity Behavior

In order to get an overview of the effect of our optimizations on the runtime of the queries, we only generate a subset of the variant space that we are able to generate. To restrict the number of variants to a maintainable size, we create all configurations of only the following five major optimizations:

loop unrolling: copying the loop body multiple times within the loop to reduce the control-flow overhead of the loop condition check

loop tiling: splitting the loop body into an inner and outer loop, effectively creating a blocking of memory accesses to increase cache performance

selection dematerialization: replacing materializations of intermediate selection results by bitvectors indicating selected tuples, to reduce memory usage and increase vectorization capabilities

parallelization: splitting independent loop iterations across multiple cores using LLVM coroutines or OpenMP

vectorization: transforming instructions in loop bodies to SIMD counterparts for data-parallel execution

In addition, we used some minor, permanently enabled optimizations such as CSE, inlining and buffer optimization passes to obtain both better optimizable and better optimized code.

Furthermore, we restricted the configuration of these optimizations to a per-program level instead of generating variants for each operation. With these prerequisites, we generated 120 variants for each machine and show the results of the pure execution times, without compilation, as heatmaps in Figure 2. To keep the time for running the configurations acceptable, we used the TPC-H dataset with scale factor 1.

Each column of a heatmap represents a different vector size, while the rows describe combinations of unrolling, tiling and selection dematerialization. The different parallelization techniques are divided among Figure 2a, Figure 2b and Figure 2c.

Due to an outdated OpenMP runtime on Machine 3, there are no heatmaps for OpenMP parallelization on this machine. Furthermore, combining vectorization together with tiling for Q6 was problematic.
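Two of the optimization knobs above, unrolling and tiling, can be made concrete with a short Python sketch (the loop shapes are for illustration; the prototype applies these as MLIR passes, not by hand):

```python
def sum_plain(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_unrolled(xs):
    """Loop unrolling by a factor of 4: four additions per iteration,
    so the loop condition is checked once every four elements; a
    scalar epilogue handles the remainder."""
    total, i, n = 0, 0, len(xs)
    while i + 4 <= n:
        total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3]
        i += 4
    for j in range(i, n):  # remainder loop
        total += xs[j]
    return total

def sum_tiled(xs, tile=64):
    """Loop tiling: an outer loop over blocks and an inner loop within
    each block, so every tile can stay cache-resident on real hardware."""
    total = 0
    for start in range(0, len(xs), tile):
        for x in xs[start:start + tile]:
            total += x
    return total

data = list(range(1000))
assert sum_plain(data) == sum_unrolled(data) == sum_tiled(data) == 499500
```

All three variants compute the same result; whether unrolling or tiling pays off depends on the machine, which matches the per-machine differences visible in Figure 2.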
was that the generated code could not be compiled, because parallel loops only allow reductions on scalar types through atomic read-modify-write operations. However, as the loop body was vectorized, the reduction would have to be done on vectors, which is not supported. Due to this limitation in MLIR, we decided to exclude variant results with tiling altogether for Q6. The key observations from this experiment are:

1. All machines show similar trends regarding variants, but different configurations turn out to be most efficient (e.g., unrolling vs. no unrolling for Q6, or OpenMP vs. async).
2. Tiling has a mostly negative influence on query performance, with a speed-up of ×1 to ×0.1 compared to no optimization.
3. For Q1 and Q6, selection dematerialization improves the performance by a factor of up to ×10, whereas Q3 and Q9 showed performance decreased to as low as ×0.25.
4. Loop unrolling had only a moderate influence when combined with vectorization for larger vector sizes.
5. Vectorization had slightly positive effects on the runtime of Q6 (×2 speed-up); for the other queries, it had slightly negative effects.
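To make the "sel" variant dimension from observation 3 concrete, the following C++ sketch contrasts a materialized selection vector with a dematerialized, branch-free variant of a filtered aggregation. This is purely illustrative hand-written code: the predicate, column names, and types are invented, and the prototype emits such variants as MLIR rather than C++.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Variant A: materialized selection -- first build a selection vector of
// qualifying row indices, then aggregate only over those rows.
int64_t sum_with_selection_vector(const std::vector<int32_t>& price,
                                  const std::vector<int32_t>& qty) {
    std::vector<size_t> sel;
    for (size_t i = 0; i < price.size(); ++i)
        if (qty[i] < 24) sel.push_back(i);        // selection phase
    int64_t sum = 0;
    for (size_t idx : sel) sum += price[idx];     // aggregation phase
    return sum;
}

// Variant B: selection dematerialization -- fuse the predicate into the
// aggregation as a branch-free multiply, so no index vector is ever built.
int64_t sum_dematerialized(const std::vector<int32_t>& price,
                           const std::vector<int32_t>& qty) {
    int64_t sum = 0;
    for (size_t i = 0; i < price.size(); ++i)
        sum += static_cast<int64_t>(qty[i] < 24) * price[i];
    return sum;
}
```

One plausible reason for the mixed results in observation 3 is that the dematerialized variant processes every tuple unconditionally: this is cheap and branch-free when the predicate is hard to predict, but wasted work for queries where a materialized selection vector would have skipped most tuples.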
Overall, these experiments show that our approach is able to generate differently performing variants that can be used to optimize the runtime of different queries, depending on the executing hardware and software characteristics – even with only a small part of the possible variant space explored. On the other hand, we also observe some unexpected behavior, such as the partially low influence of vectorization and parallelization on the runtime, especially visible for Q1, Q9, and Q18. After inspecting the generated MLIR code, we saw that these queries are often not transformed to parallel loops, because the transformation pass of MLIR is too restrictive in its side-effect analysis: it does not parallelize/vectorize loop nests with non-affine branches, instead of traversing and analyzing their memory accesses recursively.
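The reduction limitation behind the excluded Q6 variants can be illustrated outside of MLIR. A parallel loop can reduce into a scalar because each thread's contribution collapses to a single machine word, which fits one atomic read-modify-write; once the loop body is vectorized, the running state is a whole SIMD vector, for which no such atomic operation exists. The following C++ sketch shows only the supported scalar case; it is illustrative (thread count and data layout are invented, and the prototype expresses this as MLIR parallel loops rather than std::thread):

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstddef>
#include <thread>
#include <vector>

// Parallel sum over disjoint chunks. Each thread keeps a private scalar
// accumulator and folds it into the shared result with one atomic
// read-modify-write -- the reduction shape that parallel loops allow.
int64_t parallel_scalar_sum(const std::vector<int32_t>& data,
                            unsigned nthreads) {
    std::atomic<int64_t> sum{0};
    std::vector<std::thread> workers;
    size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            size_t lo = std::min(data.size(), t * chunk);
            size_t hi = std::min(data.size(), lo + chunk);
            int64_t local = 0;                      // scalar accumulator
            for (size_t i = lo; i < hi; ++i) local += data[i];
            sum.fetch_add(local, std::memory_order_relaxed);  // atomic RMW
        });
    }
    for (auto& w : workers) w.join();
    return sum.load();
}
```

A vectorized loop body would instead accumulate into a vector of partial sums that must be combined at the end; since there is no atomic read-modify-write over such a vector, the combination of the vectorizing and parallelizing passes is rejected, matching the behavior we encountered for Q6 with tiling.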
Figure 2: Heatmaps showing the runtimes of the query variants (Q1, Q3, Q6, Q9, Q18; vector sizes 1, 2, 4, 8, 16) with the optimizations tiling (t), unrolling (u), and selection dematerialization (sel): (a) without parallelization (all machines), (b) with coroutines (all machines), (c) with OpenMP (Machines 1 and 2).

6. Conclusion & Future Work
In this paper, we presented our vision of the future of fully adaptive database query execution engines and showed that support for micro-optimizations in database query engines can have a performance impact of orders of magnitude. First, we identified the main challenges to tackle in order to achieve a fully adaptive database engine, which, despite the clear advantages of such systems, still does not exist today. We find that there are approaches that tackle each of the challenges individually, but up to now there exists no framework that takes everything into account to achieve an entirely adaptive solution. Therefore, we propose a simple and practical approach that leverages the Voila DSL and the MLIR compiler framework to introduce adaptivity into the query compilation pipeline, and thus we tackle the challenges of adaptivity and granularity of the optimizations. Moreover, we highlight further pathways to tackle the remaining challenges in the context of our approach.
We briefly evaluated the functionality and performance using a modified version of the TPC-H benchmark. We showed that our approach is able to generate diverse variants with different performance properties. While we still found some issues and the prototype is at an early stage, we are confident that, once these issues are addressed, our approach will be able to outperform non-adaptive systems.
Our immediate next step is the addition of support for heterogeneous hardware, as well as improvement of the prototype to be more competitive by overcoming the identified problems. Additionally, we want to extend our approach to support joins efficiently, as well as string types. In the future, we plan to extend our prototype to tackle all of our presented challenges, to become a fully adaptive-reprogramming, heterogeneous database system that is able to achieve peak performance independent of the supplied workload and hardware.

References
[1] T. Gubner, P. Boncz, Highlighting the Performance Diversity of Analytical Queries Using VOILA, in: Proc. ADMS 2021, 2021.
[2] K. Dursun, C. Binnig, U. Cetintemel, R. Petrocelli, SiliconDB: Rethinking DBMSs for Modern Heterogeneous Co-Processor Environments, in: Proc. DaMoN 2017, ACM, 2017.
[3] D. Broneske, Adaptive Reprogramming for Databases on Heterogeneous Processors, in: SIGMOD 2015 PhD Symposium, ACM, 2015, pp. 51–55.
[4] B. Răducanu, P. Boncz, M. Zukowski, Micro Adaptivity in Vectorwise, in: Proc. SIGMOD 2013, ACM, 2013, pp. 1231–1242.
[5] S. Breß, B. Köcher, H. Funke, S. Zeuch, T. Rabl, V. Markl, Generating Custom Code for Efficient Query Execution on Heterogeneous Processors, VLDB Journal 27 (2018) 797–822.
[6] T. Gubner, P. Boncz, Charting the Design Space of Query Execution Using VOILA, Proc. VLDB Endow. 14 (2021) 1067–1079.
[7] C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, O. Zinenko, MLIR: Scaling Compiler Infrastructure for Domain Specific Computation, in: Proc. CGO 2021, IEEE/ACM, 2021, pp. 2–14.
[8] C. Lattner, V. Adve, LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation, in: Proc. CGO 2004, IEEE, 2004, pp. 75–86.
[9] P. A. Boncz, et al., Monet: A Next-Generation DBMS Kernel for Query-Intensive Applications, 2002.
[10] T. Gubner, P. Boncz, Excalibur: A Virtual Machine for Adaptive Fine-Grained JIT-Compiled Query Execution Based on VOILA, Proc. VLDB Endow. 16 (2022).
[11] T. Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware, Proc. VLDB Endow. 4 (2011) 539–550.
[12] Y. Srikant, P. Shankar, The Compiler Design Handbook: Optimizations and Machine Code Generation, CRC Press, 2003.
[13] C. Bertoni, J. Kwack, T. Applencourt, Y. Ghadar, B. Homerding, C. Knight, B. Videau, H. Zheng, V. Morozov, S. Parker, Performance Portability Evaluation of OpenCL Benchmarks Across Intel and NVIDIA Platforms, in: Proc. IPDPSW 2020, 2020, pp. 330–339.
[14] M. Jungmair, A. Kohn, J. Giceva, Designing an Open Framework for Query Optimization and Compilation, Proc. VLDB Endow. 15 (2022) 2389–2401.
[15] J.-C. Wang, D. Ding, H. Wang, C. Christensen, Z. Wang, H. Chen, J. Li, Polyjuice: High-Performance Transactions via Learned Concurrency Control, in: Proc. OSDI 2021, 2021, pp. 198–216.
[16] H. Pirk, O. R. Moll, M. Zaharia, S. Madden, Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware, Proc. VLDB Endow. 9 (2016) 1707–1718.
[17] T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, P. A. Boncz, Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask, Proc. VLDB Endow. 11 (2018) 2209–2222.
[18] A. Vogelsgesang, M. Haubenschild, J. Finis, A. Kemper, V. Leis, T. Mühlbauer, T. Neumann, M. Then, Get Real: How Benchmarks Fail to Represent the Real World, in: Proc. DBTest 2018, 2018, pp. 1–6.
[19] Y. Li, J. M. Patel, WideTable: An Accelerator for Analytical Data Processing, Proc. VLDB Endow. 7 (2014) 907–918.