<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards a Future of Fully Self-Optimizing Query Engines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paul Blockhaus</string-name>
          <email>blockhau@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriel Campero Durand</string-name>
          <email>campero@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Broneske</string-name>
          <email>broneske@dzhw.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gunter Saake</string-name>
          <email>saake@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Adaptive Reprogramming</institution>
          ,
          <addr-line>Micro-Adaptivity, Query Engines, Compiled Query Execution, MLIR, Voila</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>German Centre for Higher Education Research and Science Studies</institution>
          ,
          <addr-line>Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Otto-von-Guericke University</institution>
          ,
          <addr-line>Magdeburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>With the ever-increasing heterogeneity of hardware, the database community is tasked with adapting to the new reality of diverse ecosystems. The traditional workflow of hand-tuning query engine implementations to the underlying hardware might become untenable for an ever-growing variety of hardware with diferent performance characteristics. Systems like Micro-Adaptivity in Vectorwise or HAWK have been studied as adaptive solutions, but their adoption remains limited. Envisioning a solution simplified for adoption, we propose a practical take on adaptive reprogramming using the domainspecific language Voila and the MLIR compiler framework. We identify five main challenges in the area, and demonstrate how we tackle the first challenges. To show the feasibility of our approach, we include a brief evaluation of its performance on TPC-H; comparing 120 generated variants from a small subspace of potential optimizations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Emerging trends in hardware development show a
paradigm shift away from a single instruction set
architecture (x86) towards a more heterogeneous ecosystem
(e.g., RISC-V, ARM). While the complete consequences
for the performance and design of database systems in
this new era continue to be fleshed out [ 1], it is certain
that engineers will continue to be tasked with
adapting storage and query engines to new hardware. This
trend has already been visible in the last few years with
database systems adopting heterogeneous hardware [2]
and finding benefits by exploiting new processing
capaThis increased heterogeneity opens-up entire new sets of
improvements for query engines. In this circumstance,
the choice of a physical algorithm is likely to increasingly
hinge less on the characteristics of queries, but rather on
those of the underlying hardware. Broneske et al., among
others, showed that with the increasing heterogeneity,
not only the choice of algorithm and operator placement
matters for query processing, but the implementation
variant of the algorithms is crucial[3]. In numerous
instances, promising variants are based only on so-called
micro-optimizations commonly used by developers, as
rithms (e.g., branch predication, loop unrolling).</p>
      <p>Over the last decade, research has cemented that
these micro-optimizations have a considerable influence
on query performance. Hence, several solutions have</p>
      <sec id="sec-1-1">
        <title>Our implementation is based on Voila [6], a DSL for</title>
        <p>query execution able to model various strategies. For
our work, we implemented a new compilation backend
for Voilas’s execution. The backend is based on the</p>
      </sec>
      <sec id="sec-1-2">
        <title>MLIR framework, which enabled us to implement micro</title>
        <p>optimizations of entire query plans at a per-operator level.</p>
      </sec>
      <sec id="sec-1-3">
        <title>To show the potential of our approach, in this work we</title>
        <p>CEUR</p>
        <p>ceur-ws.org
empirically evaluate the performance of the DSL and its
variants on an adapted subset of the TPC-H benchmark.</p>
        <p>The remainder of the paper is structured as follows:
First, we introduce the tools upon which our prototype
is built, with an overview of previous work. Next, we
introduce the five main challenges we identified for a
fully adaptive query engine. We follow with an overview
of the architecture of our prototype, that tackles the first
two challenges. Finally, we present an early empirical
evaluation.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background &amp; Related Work</title>
      <sec id="sec-2-1">
        <title>In this section, we discuss the MLIR framework and the Voila DSL, which are the basis for our prototype; enabling eficient and adaptive generation of operator variants. Afterward, we give a brief overview of previous work.</title>
        <p>2.1. MLIR
Voila [6] is a DSL designed for eficient query execution,
independent of the execution paradigm. To this end,
the language is decoupled from the execution backend,
which can flexibly run Voila programs in a vectorized or
compiled fashion. To achieve this independence, Voila is
designed with vector data types as first class citizens. The
size of the vector determines the execution style, where
a size of ∞ corresponds to operator-at-a-time execution
and a size of 1 being a tuple-at-a-time execution. To
have a language that allows this degree of flexibility, the
granularity of the operations is set to execute on entire
vectors, but the instructions are kept general purpose
to be able to write diferent algorithms without a large
re-implementation efort. The resulting granularity is
somewhere between relational algebra operators, e.g.,
MAL [9], and virtual CPU instructions, e.g., LLVM IR.</p>
        <sec id="sec-2-1-1">
          <title>2.3. Previous Approaches</title>
          <p>MLIR [7] is a compiler framework relying on LLVM [8] to In recent years, several approaches for adaptivity in
quickly build DSLs. MLIR ofers a generic intermediate database query execution have been proposed. One of
representation providing a unified syntax and support for the first approaches achieved micro-adaptivity in
Veccommonly needed functionalities of DSL compilers, such torwise [4] through operator variants that were
preas type inference, lowering to executable code, interfaces compiled in libraries and could be replaced during
runfor common code transformations, and many more. The time by dynamic library loading.
two core components of MLIR are dialects and transfor- A new approach on micro-adaptivity was proposed
mations, where dialects are diferent DSLs, tailored to with Excalibur. Excalibur is a virtual machine for
adapspecific use cases, e.g., the scf -dialect represents struc- tive query execution [10]. It utilizes Voila to generate
tured control flow elements, the memref -dialect repre- queries that are executed in a vectorized fashion and
sents memory references and access, and the gpu-dialect replaces parts of the execution pipeline with compiled
allows expressing instructions executed by GPUs, as well variants to search for an optimal execution plan variant.
as the communication between CPU and GPU. The HAWK framework [5] used the C preprocessor and</p>
          <p>The structuring of MLIR in dialects and the possibility OpenCL to achieve adaptivity of the query algorithms on
of mixing multiple dialects into this single representation heterogeneous hardware. Looking beyond the variant
allows for an eficient interoperation between dialects generation, both solutions also considered the selection
and simplifies their implementation. Other than dialects, of the optimal variant from the variant pool, using
matransformations are the central component of MLIR. They chine learning. This was necessary as the spanned search
convert between diferent dialects, while also support- space is extremely large, and the performance of each on
ing transformation patterns within a dialect, for example a given hardware-query combination is inherently hard
an optimization or canonicalization. As dialects are de- to predict. While the micro-adaptivity approach used an
signed to represent problems in their domain eficiently, online multi-armed bandit to find the optimal variants,
transformations can also be handled eficiently. This HAWK used ofline-learning with a heuristic to test in
leads to better optimization times, and simpler rules, com- the optimization space for the most eficient variants.
pared to general-purpose compilers such as LLVM. Addi- Micro-adaptivity and HAWK sufer from a rather
retionally, transformations can be used in a plug-and-play stricted set of adaptivity due to pre-compiled variants
fashion by conversion to the particular dialect. These using a template mechanism on top of a high-level
lanconcepts show, to the best of our knowledge, a great guage. In addition, HAWK doesn’t provide any run-time
potential for a compiled query engine capable to utilize adaptivity and has to be trained for a workload in
adheterogeneous hardware and are unique for compiler vance. Excalibur is able to overcome the restrictions of
frameworks. high-level language templating, but still sufers from the
problem of having to implement each variant
transformation.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Challenges for Adaptive Query</title>
    </sec>
    <sec id="sec-4">
      <title>Execution Frameworks</title>
      <p>levels of granularity with diferent language constructs,
as required. Additionally, the DSL can be designed to
eficiently support various types of hardware eficiently
Challenge 1 – Granularity of Adaptivity and even handle work distribution scenarios [13].
Unfortunately, to our knowledge, there are no established
Finding the right granularity at which adaptive repro- approaches that utilize a DSL towards adaptive
reprogramming is introduced is a key factor for the eficiency gramming in data management.
in several aspects. The granularity determines the
complexity of the system, as well as the resulting possibilities Challenge 2 – Adaptivity Mechanism
to reach peak-performance. A coarse granularity at the
level of, e.g., pipeline programs will only allow for a Closely linked to the choice of granularity is the choice
global level of flexibility, such as optimizing compiler of the adaptivity mechanism. This refers to the
comlfags, and their influence will overall not lead to large ponent responsible for creating variants on the chosen
benefits compared to the choice of algorithms and access granularity, and therefore also (partially) defining the
paths used for the query. Though compiler flags could variant space available. This component will depend on
improve usage of hardware capabilities, the approach the granularity and query execution paradigm. Current
is still limited in optimizing queries individually with approaches introduce adaptivity through an additional
regard to the query characteristics. abstraction layer, producing the variants in a
preprocess</p>
      <p>In contrast, fine-grained adaptivity incurs additional ing step before compilation, i.e., using the C preprocessor
overhead during variant generation, and high complexity for variant templates. These variants are then used
eion the system for optimization due to the large variant ther directly in a compiled execution engine [5] or loaded
space. For example, introducing reprogramming into Hy- dynamically from libraries [4].</p>
      <p>Per’s operators [11] would lead to a huge variant space, While this approach can be easily implemented, and
as the operators can be written in LLVM IR, but it also is shown to be very efective, it has some drawbacks,
adds the complexity of an entire compiler that has to such as limited adaptivity according to the
expressivebe able to exploit this variability, entirely disregarding ness of the underlying language and specification of the
the complexity of implementing algorithms directly in preprocessor. Additionally, using a high-level language
LLVM IR. The use of mainstream compilers for this level for adaptivity does not guarantee that the optimizations
of variability is rather limited, as current compilers only are actually applied because state-of-the-art compilers
concentrate on generating the best code for the common apply their own heuristics on the code that may undo the
case, or work with profiles to optimize for a pre-profiled applied optimizations or optimize diferently. We argue
use-case [12]. To our knowledge, there is no approach that a direct integration of the adaptivity mechanism into
that instructs the compiler to transform LLVM IR or simi- the compiler will help to overcome these problems, as
lar low-level intermediate languages to optimize the code the compilers already support many micro-optimizations
at such fine granularity. by default; they just have to be applied. This eliminates</p>
      <p>Therefore, current approaches choose the granularity redundancy in the design and also gives full control over
of pipeline-operators for variability, as a compromise the compilation process, which also allows restricting the
between complexity and variability. These operators optimizations or heuristics that the compiler uses, to not
are usually implemented in a high-level general-purpose interfere with manually adapted code. Another
advanprogramming language, which limits the degree of vari- tage of this approach is that it combines nicely with our
ability through its language constructs and mapping to proposed solution for the first challenge, as modifications
hardware instructions. In the presence of heterogeneous of the compiler to support a DSL can already require the
systems, the language in which the operators are im- design of a compilation pipeline. In our prototype, we
plemented has to support all target platforms, which are able to utilize the diferent granularities of diferent
either leads to using a hardware-oblivious language, e.g., DSLs that MLIR ofers to adaptively optimize the query
OpenCL, or a hardware-sensitive alternative, mixing mul- in MLIR’s own compilation pipeline.
tiple languages. Both approaches come with their own
disadvantages, they might not be performance portable, Challenge 3 – Learning Variants
or add a lot of complexity through re-implementation of
algorithms to optimize for hardware characteristics. With the first two challenges, we described the problem</p>
      <p>In order to overcome the above-mentioned problems of variability, but without a good variant selection
mechof the individual approaches, Broneske et al. proposed anism, such variability might be essentially useless. The
the use of DSLs [3]. While this does not directly solve the main problem for finding the best variants is the
extenproblem of choosing the right granularity, it allows de- sive search space that increases exponentially with every
signing the DSL to fit an arbitrary granularity, or multiple added optimization. The state-of-the-art approaches, we
described in subsection 2.3, leverage machine learning though both versions are possible, we plan to adapt the
techniques to learn and predict the best variants for a modeling of relational algebra in MLIR as it is a more
flexcertain workload. The HAWK-framework uses an ofline- ible approach; with the added possibility of combining it
learning approach that is fed with example queries and, with the learning mechanism mentioned before.
in this way, learns the best flavors for the underlying
hardware [5]. To reduce the learning time, additional Challenge 5 – Adaptivity for OLTP
heuristics are used to prune the exploration space. This
approach focuses solely on optimization for a mostly Current approaches in code generation for data
processstatic workload and best working parameters for optimal ing focus heavily on OLAP queries, which is reasonable
hardware utilization. Another approach used by micro- due to their complexity and high latencies. This focus
adaptivity is an online-learning approach that can also neglects OLTP queries, which, we believe, might still
adapt to changing workloads [4]. This is achieved by for- provide room for improvement through code generation,
mulating the variant selection as a multi-armed bandit with a focus on optimizing together sets of concurrent
learning problem. This continuously relearning system queries or templates of them. We envision, at this stage,
is also eficiently usable without previous learning. that OLTP queries could be served in at least three main</p>
      <p>We argue that an online-learning approach is, in gen- ways. First, recent work has shown promising results
eral, preferable over ofline learning, as it is designed to by tackling the challenge of learning an adaptive
concuradapt to changes in the data management workloads by rency control mechanism for a mix of pre-defined query
itself, whereas ofline learning would need re-training templates at diferent contention scenarios [ 15]. We
conover new representative workloads, which nowadays can sider that diferent mixes of these queries and the variable
change quickly and therefore is not easily predictable. concurrency control mechanism might be able to
leverA solution combining ofline-learning to establish the age a code generation framework for a higher flexibility
starting models for online-learning might combine the and adaptivity, in the search for high throughput. To
best of both approaches. achieve this, we consider that multi-query optimization</p>
      <p>We envision to not only use models to learn the best might be necessary, opening the door for further kinds
optimizations that can be used to instruct the compiler of improvements, such as operator sharing. Secondly, we
to apply certain optimizations on chosen places, but also consider that interpretation of MLIR for faster execution
to generate parameters for each optimization, and to without compile overhead, and a switch between
interguide the compiler regarding the number of optimiza- pretation and compilation through coroutines could be
tion passes for applying the optimizations. Finally, we relevant for workload mixes that could be categorized
envision that the selected solution should integrate well as HTAP. Finally, we argue that the flexibility of MLIR
with an MLOps framework, considering the lifecycle, allows for the study of a new dialect focusing on robust
evaluation, and maintenance of the models used. processing over adjacent data versions (as used in MVCC
or in Delta Lakes), which might be able to provide a
competitive support for high-contention scenarios.</p>
      <sec id="sec-4-1">
        <title>Challenge 4 – Integration Into High-Level</title>
      </sec>
      <sec id="sec-4-2">
        <title>Optimization</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. A Transformation-Based</title>
    </sec>
    <sec id="sec-6">
      <title>Approach for Adaptivity</title>
      <p>To take a step from a pure query execution engine
towards a more widely usable database engine, an
adaptive reprogramming framework must be compatible with
other high-level optimizations such as algebraic optimiza- In the following, we present our prototype that seeks to
tion, index and algorithm selection. To achieve this goal, tackle, at this stage, the first two challenges described in
one of two challenges has to be accomplished: Either the the previous section. We start with the architecture of
reprogramming framework needs to be integrated into an our approach (cf. Figure 1).
existing DBMS, or the missing components are integrated To have a more fine-grained granularity with a larger
into the framework. Neither of these two solutions has variant space while only adding a rather small overhead,
been successfully applied yet. To the best of our knowl- we adapted Voila [6] as a DSL with sub-operator
granuedge, there only exist approaches that integrate parts larity (Challenge 1), and the additional feature of being
of traditional optimizations into the adaptivity pipeline. agnostic to the query execution strategy. The design of
The HAWK-framework, for instance, has limited support Voila also allows for simple adoption to be eficiently
exefor hash table algorithm selection besides the operator- cuted on heterogeneous processors, as similar designs
allevel variant granularity. Furthermore, Jungmair et al. ready demonstrated [16]. Instead of explicit predication,
showed a first example of how to integrate algebraic op- we rely on a separate compilation pass to automatically
timization successfully into MLIR, from which on further choose when to use predication and when to
materialhigh-level optimizations can be introduced [14]. Even ize selections, which allows for more flexibility in the</p>
      <sec id="sec-6-1">
        <title>MLIR framework and its dialects to introduce adaptiv</title>
        <p>Hardware OblivioVusoQiulaery Representation ity into the code and lower it to LLVM, which is then</p>
        <p>JIT-compiled and executed.</p>
        <p>MLIR The usual compilation pipeline of our framework looks
SQL Query Adaptive Code Generation Query Results as follows, once Voila has been translated to Voila-MLIR,
which includes type resolution and transformation of
Target DepeLndLenVtCMode Generation columns to the tensor data type, aggressive function
inlining is performed to achieve redundant sub-query removal</p>
        <p>JIT Execution through canonicalization and common sub-expression
Query ExecuHtioanrdownaHreeterogeneous elimination, as well as allowing for better optimizations
by eliminating function call boundaries.</p>
        <p>Figure 1: High-level architectural schema of the framework Afterward, the dematerialization pass is performed and
the resulting plan is lowered to MLIR’s internal dialects,
optimization while also simplifying the language. either linalg, or affine. The linalg dialect describes</p>
        <p>Implementing those types of adaptivity directly into loops using the concept of iterators. Each linalg
operathe DSL has the great advantage that it makes the DSL tion consists of a set of input tensors and generates a set
adaptive by design (Challenge 2). In contrast to other of output tensors. The body of linalg operations has to
approaches, there is no longer the need to provide an ad- be largely side efect free, and aside of regular reductions
ditional callback to pre-compiled algorithms or to change doesn’t allow for data or control flow dependencies
bethe query plan. Instead, the query engine directly calls tween the iterations. Therefore, we resort to the affine
the generated query code, which is adapted during JIT- dialect for operations that need broader constraints. The
compilation by the supplied transformation and optimiza- affine dialect models range-based for loops and allows
tion passes. direct memory manipulations, only constrained by the</p>
        <p>The DSL itself is built upon the MLIR and LLVM loop induction variables.
frameworks, which enable simple definitions of micro- Once, Voila-MLIR is translated to these high-level
interoptimizations as transformations. Some commonly used nal dialects of MLIR, we apply loop optimization passes
transformations, such as loop unrolling, vectorization, such as parallelization, tiling, unrolling and loop fusion.
parallelization, etc., are already implemented and can be Afterward, we continue lowering towards more
lowerused. Transformations in MLIR are applied in passes, and level dialects such as the vector dialect that represents
additional filters even allow applying transformations virtual, hardware-independent vectors and instructions
selectively on certain instructions, or supplying certain that can directly be mapped to the hardware’s SIMD
parameters to the passes. In addition to the pre-existing capabilities. Other translations in this phase are
lowpasses, we implemented a selection dematerialization ering of parallel loops to the async dialect, which
reppass to compensate for the missing predication by re- resents multi-threading through LLVM coroutines, or
placing the materialized result of the selection with a to the openmp dialect. In this step, it is also possible to
predicate, indicating the selected tuples, when possible. transform parallel loops for execution on further
coproThe dematerialization pass uses a forward data flow anal- cessors, such as GPU, e.g. by using the gpu dialect. This
ysis of selection results to decide whether to materialize dialect models the GPU execution model and also handles
the result or keep the predication. Currently, we mate- communication between CPU and GPU. For a concrete
rialize the results any time, where pipeline breakers are execution on GPU, the dialect can further be lowered to
encountered, but the pass can be extended with further SPIR-V or AMD and NVIDIA specific dialects.
options and more complex scenarios on when to replace In the next step, the low-level MLIR is lowered to
materialization with predication and vice versa. LLVM and finally compiled to executable byte code and</p>
        <p>As MLIR is built upon LLVM, it also comes with na- then the query can be executed.
tive support for heterogeneous hardware. Hence, we
are confident that our DSL is able to adjust its behavior
to fit processing paradigms on heterogeneous hardware 5. Evaluation
with only minimal extensions. However, in the following
w.l.o.g., we focus on the extensible CPU-only part. There- 5.1. Dataset &amp; Modifications
fore, we use vectorization to utilize SIMD capabilities of To get an overview of the real-world behavior of our
modern CPUs, as well as multithreading with built-in framework, we use a subset of the TPC-H benchmark.
LLVM coroutines, or the OpenMP runtime. More precisely, we choose the queries Q1, Q3, Q6, Q9</p>
        <p>In order to translate Voila to executable machine and Q18 as a representative sample for typical OLAP
code, we added a backend to Voila that translates Voila queries [17]. Since our framework does not yet support
to a Voila-MLIR dialect. From there on, we use the
joins eficiently and has no support for string data types,
we modify the TPC-H dataset slightly to still be able to
run the queries.</p>
        <p>We use order-preserving dictionary compression for
string columns and denormalize tables that need to be
joined in order to replace joins by selections that are
simpler to implement eficiently. We argue that the
dictionary compression has no significant impact on the
results of our benchmarks, as strings in TPC-H are mostly
only used as payloads [18] and since we are already
using column-oriented query processing, the strings would
commonly not be loaded at all for most of the time. The
only exception here is Q9, where a string pattern
matching takes place, which has to be translated in the case of
dictionary compression. We choose to work around the
matching by using a large IN-clause, checking for
existence of each tuple using a hash table. For the
replacement of joins with simple selections on denormalized
tables, Li and Patel suggest that it has no large
influence on the query performance and leads to comparable
results [19].
5.2. Setup
these flags are not applied to the query compiler. To get
more robust timing results, we disabled software-based
frequency scaling and ran each benchmark 100 times.</p>
        <p>The reported times were averaged using the median.</p>
        <sec id="sec-6-1-1">
          <title>5.3. Adaptivity Behavior</title>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>In order to get an overview of the efect of our optimiza</title>
        <p>tions on the runtime of the queries, we only generate a
subset of the variant space that we are able to generate.
To restrict the number of variants to a maintainable size,
we only create all configurations of the following five
major optimizations and configurations:
loop unrolling: copying the loop body multiple times
within the loop to reduce control-flow overhead
of the loop condition check
loop tiling: splitting the loop body into an inner and
outer loop, efectively creating a blocking of
memory access to increase cache performance
selection dematerialization: replacing
materializations of intermediate selection results by
bitvectors indicating selected tuples to reduce
memory usage and increase vectorization
capabilities
parallelization: splitting independent loop iterations
to multiple cores using LLVM coroutines or
OpenMP
vectorization: transform instructions in loop bodies to
SIMD counterparts for data-parallel execution</p>
      </sec>
      <sec id="sec-6-3">
        <title>In order to get a first overview of the performance char</title>
        <p>acteristics of our experiments on diferent platforms, we
use three diferent machines listed in Table 1. All three In addition, we used some minor, permanently enabled
machines use Intel processors, but from diferent genera- optimizations such as CSE, inlining and bufer
optimizations and with slightly diferent architectures and clock- tion passes to obtain both better optimizable and
optirates, which should already show varying results. mized code.</p>
        <p>To keep performance diferences limited to the hard- Furthermore, we restricted the configuration of these
ware, we created a unified build tool chain with all ma- optimizations to a per-program level instead of
generatchines using the same bootstrapped LLVM version 13.0.1 ing variants for each operation. With these prerequisites,
to build the experiments. Additionally, we supplied the we generated 120 variants for each machine and show the
following compiler flags for optimization: -O3 -flto results of the pure execution times without compilation
-march=native . Where -O3 instructs the compiler to as heatmaps in Figure 2. To keep the time for running the
apply more aggressive, but also expensive (in terms of configurations acceptable, we used the TPC-H dataset
compile-time) optimization of the code, -flto enables with scale factor 1.
link-time optimizations, such as de-virtualization of func- Each column of a heatmap represents a diferent
vectortions and -march=native tells the compiler to enable in- size, while the rows describe combinations of unrolling,
structions that are available for the architecture on which tiling and selection dematerialization. The diferent
parthe code is compiled. Such instructions are usually SSE4 allelization techniques are divided by Figure 2a, Figure 2b
and AVX, among others. While more advanced instruc- and Figure 2c.
tions increase the overall latency of our framework to Due to an outdated OpenMP runtime on Machine 3,
optimize and run code, this also makes the binary largely there are no heatmaps for OpenMP parallelization in this
not portable to other architectures. Overall, the optimiza- machine. Furthermore, combining vectorization together
tions only have an efect on the framework compile-time with tiling for Q6 was problematic. The main reason
of the query, but not on the query execution time, as
was that the generated code could not be compiled
because parallel loops only allow reductions on scalar types
through atomic read-modify-write operations. However,
as the loop body was vectorized, the reduction would
have to be done on vectors, which is not supported. Due
to this limitation in MLIR, we decided to exclude variant
results with tiling altogether for Q6. The key
observations from this experiment are:</p>
      </sec>
      <sec id="sec-6-4">
        <title>1. All machines show similar trends regarding vari</title>
        <p>ants, but diferent configurations turn out to be
most eficient (e.g., unrolling vs. no unrolling for
Q6 or OpenMP vs. async).
2. Tiling has a mostly negative influence on the
query performance, with a speed-up ×1 to ×0.1
compared to no optimization.
3. For Q1 and Q6, selection dematerialization
improves the performance by a factor of up to ×10,
whereas Q3 and Q9 showed decreased
performance by as low as ×0.25.
4. Loop unrolling had only a moderate influence
when combined with vectorization for larger
vector sizes.
5. Vectorization had slightly positive efects on the
runtime of Q6 (×2 speed-up), for the other queries
it had slightly negative efects.</p>
        <p>Overall, these experiments show that our approach
is able to generate diferently performing variants that
can be used to optimize the runtime of diferent queries
depending on the executing hardware and software
characteristics – even with only a small part of the possible
variant space explored. On the other hand, we also
observe some unexpected behavior, such as the partially
low influence of vectorization and parallelization on the
runtime, especially visible for Q1, Q9, and Q18. After
inspection of the generated MLIR code, we saw that these
queries are often not transformed to parallel loops
because the transformation pass of MLIR is too restrictive in
its side efect analysis and does not parallelize/vectorize
loop nests with non-afine branches instead of traversing
and analyzing their memory accesses recursively.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion &amp; Future Work</title>
      <p>In this paper, we presented our vision of the future of fully
adaptive database query execution engines and showed
that support of micro-optimizations in database query
engines can have a performance impact in the order of
magnitudes. At first, we identified the main challenges to
tackle in order to achieve a fully adaptive database engine
that, despite the clear advantage of such systems, still
does not exist today. We find that there are approaches
that tackle each of the challenges individually, but up to
now there exists no framework that takes everything into
none
sel
1 u
ien u + sel
ch t
aM t + sel</p>
      <p>t + u
t + u + sel
none
sel
2 u
ien u + sel
ch t
aM t + sel</p>
      <p>t + u
t + u + sel
none
sel
3 u
ien u + sel
ch t
aM t + sel</p>
      <p>t + u
t + u + sel
none
sel
1 u
ien u + sel
ch t
aM t + sel</p>
      <p>t + u
t + u + sel
none
sel
2 u
ien u + sel
ch t
aM t + sel</p>
      <p>t + u
t + u + sel
none
sel
3 u
ien u + sel
ch t
aM t + sel</p>
      <p>t + u
t + u + sel
none
sel
1 u
ien u + sel
ch t
aM t + sel</p>
      <p>t + u
t + u + sel
none
sel
2 u
ien u + sel
ch t
aM t + sel</p>
      <p>t + u
t + u + sel
(a) Variants performance for all machines without parallelization
(b) Variants performance for all machines with coroutines
1 2 4 8 16
Vector Size
1 2 4 8 16
Vector Size
1 2 4 8 16
Vector Size
1 2 4 8 16
Vector Size
1 2 4 8 16</p>
      <p>Vector Size
500
1000
pipeline and, thus, we tackle the challenges of adaptivity [9] P. A. Boncz, et al., Monet: A next-generation DBMS
and granularity of the optimizations. Moreover, we high- kernel for query-intensive applications, 2002.
light further pathways to tackle the remaining challenges [10] T. Gubner, P. Boncz, Excalibur: A virtual machine
in the context of our approach. for adaptive fine-grained jit-compiled query
execu</p>
      <p>We briefly evaluated the functionality and perfor- tion based on voila, Proc. VLDB Endow. 16 (2022).
mance using a modified version of the TPC-H benchmark. [11] T. Neumann, Eficiently Compiling Eficient Query
We showed that our approach is able to generate diverse Plans for Modern Hardware., Proc. VLDB Endow.
variants with diferent performance properties. While 4 (2011) 539–550.
we still found some issues and the prototype is at an early [12] Y. Srikant, P. Shankar, The Compiler Design
Handstage, we are confident that when addressing the issues, book: Optimizations and Machine Code Generation,
our approach will be able to outperform non-adaptive CRC Press, 2003.
systems. [13] C. Bertoni, J. Kwack, T. Applencourt, Y. Ghadar,</p>
      <p>Our immediate next step is the addition of support B. Homerding, C. Knight, B. Videau, H. Zheng,
for heterogeneous hardware, as well as improvement of V. Morozov, S. Parker, Performance portability
the prototype to be more competitive by overcoming the evaluation of opencl benchmarks across intel and
identified problems. Additionally, we want to extend our nvidia platforms, in: Proc. IPDPSW 2020, 2020, pp.
approach to support joins eficiently, as well as string 330–339.
types. For the future, we plan to extend our prototype to [14] M. Jungmair, A. Kohn, J. Giceva, Designing an open
tackle all of our presented challenges, to become a fully framework for query optimization and compilation,
adaptive reprogramming, heterogeneous database sys- Proc. VLDB Endow. 15 (2022) 2389–2401.
tem that is able to achieve peak performance independent [15] J.-C. Wang, D. Ding, H. Wang, C. Christensen,
of the supplied workload and hardware. Z. Wang, H. Chen, J. Li, Polyjuice:
Highperformance transactions via learned concurrency
control., in: OSDI, 2021, pp. 198–216.</p>
      <p>References [16] H. Pirk, O. R. Moll, M. Zaharia, S. Madden, Voodoo
- A Vector Algebra for Portable Database
Perfor[1] T. Gubner, P. Boncz, Highlighting the performance mance on Modern Hardware, Proc. VLDB
Endowdiversity of analytical queries using voila, in: Proc. ment 9 (2016) 1707–1718.</p>
      <p>ADMS 2021, 2021. [17] T. Kersten, V. Leis, A. Kemper, T. Neumann,
[2] K. Dursun, C. Binnig, U. Cetintemel, R. Petrocelli, A. Pavlo, P. A. Boncz, Everything You Always
SiliconDB: Rethinking DBMSs for Modern Hetero- Wanted to Know About Compiled and Vectorized
geneous Co-Processor Environments, in: Proc. Da- Queries But Were Afraid to Ask, Proc. VLDB Endow.
mon 2017, ACM, 2017. 11 (2018) 2209–2222.
[3] D. Broneske, Adaptive Reprogramming for [18] A. Vogelsgesang, M. Haubenschild, J. Finis, A.
KemDatabases on Heterogeneous Processors, SIGMOD per, V. Leis, T. Mühlbauer, T. Neumann, M. Then,
2015 PhD Symposium, ACM, 2015, p. 51–55. Get Real: How Benchmarks Fail to Represent the
[4] B. Răducanu, P. Boncz, M. Zukowski, Micro Adap- Real World, in: Proc. DBTest 2018, 2018, p. 1–6.
tivity in Vectorwise, in: Proc. SIGMOD 2013, ACM, [19] Y. Li, J. M. Patel, WideTable: An Accelerator for
2013, p. 1231–1242. Analytical Data Processing, Proc. VLDB Endow. 7
[5] S. Breß, B. Köcher, H. Funke, S. Zeuch, T. Rabl, (2014) 907–918.</p>
      <p>V. Markl, Generating custom code for eficient
query execution on heterogeneous processors,</p>
      <p>VLDB Journal 27 (2018) 797–822.
[6] T. Gubner, P. Boncz, Charting the Design Space of</p>
      <p>Query Execution Using VOILA, Proc. VLDB Endow.</p>
      <p>14 (2021) 1067–1079.
[7] C. Lattner, M. Amini, U. Bondhugula, A. Cohen,</p>
      <p>A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N.
Vasilache, O. Zinenko, MLIR: Scaling Compiler
Infrastructure for Domain Specific Computation, in: Proc.</p>
      <p>CGO 2021, IEEE/ACM, 2021, p. 2–14.
[8] C. Lattner, V. Adve, LLVM: A Compilation
Framework for Lifelong Program Analysis &amp;
Transformation, in: Proc. CGO 2004, IEEE, 2004, p. 75–86.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>