=Paper= {{Paper |id=Vol-3714/paper3 |storemode=property |title=Towards a Future of Fully Self-Optimizing Query Engines |pdfUrl=https://ceur-ws.org/Vol-3714/paper3.pdf |volume=Vol-3714 |authors=Paul Blockhaus,Gabriel Campero Durand,David Broneske,Gunter Saake |dblpUrl=https://dblp.org/rec/conf/gvd/BlockhausDBS23 }} ==Towards a Future of Fully Self-Optimizing Query Engines== https://ceur-ws.org/Vol-3714/paper3.pdf
Towards a Future of Fully Self-Optimizing Query Engines

Paul Blockhaus¹, Gabriel Campero Durand¹, David Broneske² and Gunter Saake¹

¹ Otto-von-Guericke University, Magdeburg, Germany
² German Centre for Higher Education Research and Science Studies, Hannover, Germany


Abstract

With the ever-increasing heterogeneity of hardware, the database community is tasked with adapting to the new reality of diverse ecosystems. The traditional workflow of hand-tuning query engine implementations to the underlying hardware might become untenable for an ever-growing variety of hardware with different performance characteristics. Systems like Micro-Adaptivity in Vectorwise or HAWK have been studied as adaptive solutions, but their adoption remains limited. Envisioning a solution simplified for adoption, we propose a practical take on adaptive reprogramming using the domain-specific language Voila and the MLIR compiler framework. We identify five main challenges in the area and demonstrate how we tackle the first two. To show the feasibility of our approach, we include a brief evaluation of its performance on TPC-H, comparing 120 generated variants from a small subspace of potential optimizations.

Keywords

Adaptive Reprogramming, Micro-Adaptivity, Query Engines, Compiled Query Execution, MLIR, Voila



34th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), June 7-9, 2023, Hirsau, Germany
blockhau@ovgu.de (P. Blockhaus); campero@ovgu.de (G. C. Durand); broneske@dzhw.eu (D. Broneske); saake@ovgu.de (G. Saake)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
¹ https://github.com/SuperFoo42/voila_mlir/tree/v0.1

1. Introduction

Emerging trends in hardware development show a paradigm shift away from a single instruction set architecture (x86) towards a more heterogeneous ecosystem (e.g., RISC-V, ARM). While the complete consequences for the performance and design of database systems in this new era continue to be fleshed out [1], it is certain that engineers will continue to be tasked with adapting storage and query engines to new hardware. This trend has already been visible in the last few years, with database systems adopting heterogeneous hardware [2] and finding benefits in exploiting new processing capabilities such as SIMD and non-volatile RAM, among others. This increased heterogeneity opens up entire new sets of improvements for query engines. In this circumstance, the choice of a physical algorithm is likely to hinge less on the characteristics of queries and increasingly on those of the underlying hardware. Broneske et al., among others, showed that with increasing heterogeneity, not only the choice of algorithm and operator placement matters for query processing, but also the implementation variant of the algorithms is crucial [3]. In numerous instances, promising variants are based only on so-called micro-optimizations commonly used by developers, as well as compilers, to optimize high-performance algorithms (e.g., branch predication, loop unrolling).

Over the last decade, research has cemented that these micro-optimizations have a considerable influence on query performance. Hence, several solutions have been proposed to identify and use the best combinations of micro-optimizations to reach peak performance in hardware-sensitive algorithms over specific heterogeneous hardware [4, 5]. However, to the best of our knowledge, none of these approaches has led to a fully adaptive reprogramming system that can automatically generate, choose and maintain the best query execution plan and strategy, applying micro-optimizations to the query on the fly before executing it. Consequently, the goal of a flexible and highly adaptive approach to peak performance, relieving the burden of ongoing hardware-aware optimization and specialization, arguably remains only partly realized. To help achieve this goal of adaptive reprogramming, we outline a vision for a fully adaptive, self-optimizing, hardware-tuned query engine. We identify five main challenges that need to be addressed on the way and integrate the components into a single system. In order to have a sustainable platform for our vision, we choose to link upfront some solutions in compiler development. This enables us to directly benefit from the early adoption of new features into compilers and unifies the infrastructure for better maintenance and efficient development. To demonstrate a first step towards realizing our vision, we develop a prototypical open-source approach¹ that allows for adaptivity in the entire query execution through the use of a domain-specific language (DSL).

Our implementation is based on Voila [6], a DSL for query execution able to model various strategies. For our work, we implemented a new compilation backend for Voila's execution. The backend is based on the MLIR framework, which enabled us to implement micro-optimizations of entire query plans at a per-operator level. To show the potential of our approach, in this work we
empirically evaluate the performance of the DSL and its variants on an adapted subset of the TPC-H benchmark.

The remainder of the paper is structured as follows: First, we introduce the tools upon which our prototype is built, with an overview of previous work. Next, we introduce the five main challenges we identified for a fully adaptive query engine. We follow with an overview of the architecture of our prototype, which tackles the first two challenges. Finally, we present an early empirical evaluation.

2. Background & Related Work

In this section, we discuss the MLIR framework and the Voila DSL, which are the basis for our prototype, enabling efficient and adaptive generation of operator variants. Afterward, we give a brief overview of previous work.

2.1. MLIR

MLIR [7] is a compiler framework relying on LLVM [8] to quickly build DSLs. MLIR offers a generic intermediate representation providing a unified syntax and support for commonly needed functionalities of DSL compilers, such as type inference, lowering to executable code, interfaces for common code transformations, and many more. The two core components of MLIR are dialects and transformations. Dialects are different DSLs tailored to specific use cases; e.g., the scf dialect represents structured control flow elements, the memref dialect represents memory references and accesses, and the gpu dialect allows expressing instructions executed by GPUs, as well as the communication between CPU and GPU.

The structuring of MLIR into dialects, and the possibility of mixing multiple dialects in a single representation, allows for efficient interoperation between dialects and simplifies their implementation. Besides dialects, transformations are the second central component of MLIR. They convert between different dialects, while also supporting transformation patterns within a dialect, for example an optimization or canonicalization. As dialects are designed to represent problems of their domain efficiently, transformations can also be handled efficiently. This leads to better optimization times and simpler rules, compared to general-purpose compilers such as LLVM. Additionally, transformations can be used in a plug-and-play fashion by conversion to the particular dialect. These concepts show, to the best of our knowledge, a great potential for a compiled query engine capable of utilizing heterogeneous hardware, and are unique among compiler frameworks.

2.2. Voila

Voila [6] is a DSL designed for efficient query execution, independent of the execution paradigm. To this end, the language is decoupled from the execution backend, which can flexibly run Voila programs in a vectorized or compiled fashion. To achieve this independence, Voila is designed with vector data types as first-class citizens. The size of the vector determines the execution style, where a size of ∞ corresponds to operator-at-a-time execution and a size of 1 to tuple-at-a-time execution. To have a language that allows this degree of flexibility, the granularity of the operations is set to execute on entire vectors, but the instructions are kept general-purpose to be able to write different algorithms without a large re-implementation effort. The resulting granularity lies somewhere between relational algebra operators, e.g., MAL [9], and virtual CPU instructions, e.g., LLVM IR.

2.3. Previous Approaches

In recent years, several approaches for adaptivity in database query execution have been proposed. One of the first achieved micro-adaptivity in Vectorwise [4] through operator variants that were pre-compiled in libraries and could be replaced during runtime by dynamic library loading.

A newer approach to micro-adaptivity was proposed with Excalibur, a virtual machine for adaptive query execution [10]. It utilizes Voila to generate queries that are executed in a vectorized fashion and replaces parts of the execution pipeline with compiled variants to search for an optimal execution plan variant.

The HAWK framework [5] used the C preprocessor and OpenCL to achieve adaptivity of the query algorithms on heterogeneous hardware. Looking beyond variant generation, both solutions also considered the selection of the optimal variant from the variant pool using machine learning. This was necessary as the spanned search space is extremely large, and the performance of each variant on a given hardware-query combination is inherently hard to predict. While the micro-adaptivity approach used an online multi-armed bandit to find the optimal variants, HAWK used offline learning with a heuristic to probe the optimization space for the most efficient variants.

Micro-adaptivity and HAWK suffer from a rather restricted degree of adaptivity due to pre-compiled variants using a template mechanism on top of a high-level language. In addition, HAWK does not provide any run-time adaptivity and has to be trained for a workload in advance. Excalibur is able to overcome the restrictions of high-level language templating, but still suffers from the problem of having to implement each variant transformation.
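To make the execution-style flexibility of Voila (Section 2.2) concrete, the following Python sketch, which is our own illustration and not Voila code, runs one selection-plus-aggregation program over vectors of a configurable size: a size of 1 corresponds to tuple-at-a-time execution, the full column to operator-at-a-time execution, and sizes in between to classic vectorized execution.

```python
def run_query(column, predicate, chunk_size):
    """SELECT SUM(x) WHERE predicate(x), executed over vectors of chunk_size."""
    total = 0
    for start in range(0, len(column), chunk_size):
        chunk = column[start:start + chunk_size]       # fetch the next vector
        selected = [x for x in chunk if predicate(x)]  # selection on the vector
        total += sum(selected)                         # aggregation on the vector
    return total

data = [3, 8, 1, 9, 4, 7]
pred = lambda x: x > 4

tuple_at_a_time = run_query(data, pred, chunk_size=1)             # vector size 1
vectorized = run_query(data, pred, chunk_size=2)                  # small vectors
operator_at_a_time = run_query(data, pred, chunk_size=len(data))  # "infinite" vector
# all three execution styles compute the same result: 8 + 9 + 7 = 24
```

Only the vector size changes between the three calls; the program text stays identical, which is precisely what lets one Voila program be executed under different paradigms.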
3. Challenges for Adaptive Query Execution Frameworks

Challenge 1 – Granularity of Adaptivity

Finding the right granularity at which adaptive reprogramming is introduced is a key factor for efficiency in several respects. The granularity determines the complexity of the system, as well as the resulting possibilities to reach peak performance. A coarse granularity at the level of, e.g., pipeline programs will only allow for a global level of flexibility, such as optimizing compiler flags, and their influence will overall not lead to large benefits compared to the choice of algorithms and access paths used for the query. Though compiler flags could improve the usage of hardware capabilities, the approach is still limited in optimizing queries individually with regard to the query characteristics.

In contrast, fine-grained adaptivity incurs additional overhead during variant generation, and high complexity on the system for optimization due to the large variant space. For example, introducing reprogramming into HyPer's operators [11] would lead to a huge variant space, as the operators can be written in LLVM IR, but it also adds the complexity of an entire compiler that has to be able to exploit this variability, entirely disregarding the complexity of implementing algorithms directly in LLVM IR. The use of mainstream compilers for this level of variability is rather limited, as current compilers only concentrate on generating the best code for the common case, or work with profiles to optimize for a pre-profiled use case [12]. To our knowledge, there is no approach that instructs the compiler to transform LLVM IR or similar low-level intermediate languages to optimize the code at such fine granularity.

Therefore, current approaches choose the granularity of pipeline operators for variability, as a compromise between complexity and variability. These operators are usually implemented in a high-level general-purpose programming language, which limits the degree of variability through its language constructs and mapping to hardware instructions. In the presence of heterogeneous systems, the language in which the operators are implemented has to support all target platforms, which either leads to using a hardware-oblivious language, e.g., OpenCL, or a hardware-sensitive alternative, mixing multiple languages. Both approaches come with their own disadvantages: they might not be performance-portable, or add a lot of complexity through re-implementation of algorithms to optimize for hardware characteristics.

In order to overcome the above-mentioned problems of the individual approaches, Broneske et al. proposed the use of DSLs [3]. While this does not directly solve the problem of choosing the right granularity, it allows designing the DSL to fit an arbitrary granularity, or multiple levels of granularity with different language constructs, as required. Additionally, the DSL can be designed to efficiently support various types of hardware and even handle work distribution scenarios [13]. Unfortunately, to our knowledge, there are no established approaches that utilize a DSL for adaptive reprogramming in data management.

Challenge 2 – Adaptivity Mechanism

Closely linked to the choice of granularity is the choice of the adaptivity mechanism. This refers to the component responsible for creating variants at the chosen granularity, and therefore also (partially) defining the available variant space. This component will depend on the granularity and query execution paradigm. Current approaches introduce adaptivity through an additional abstraction layer, producing the variants in a preprocessing step before compilation, i.e., using the C preprocessor for variant templates. These variants are then used either directly in a compiled execution engine [5] or loaded dynamically from libraries [4].

While this approach can be easily implemented, and has been shown to be very effective, it has some drawbacks, such as adaptivity limited by the expressiveness of the underlying language and the specification of the preprocessor. Additionally, using a high-level language for adaptivity does not guarantee that the optimizations are actually applied, because state-of-the-art compilers apply their own heuristics to the code, which may undo the applied optimizations or optimize differently. We argue that a direct integration of the adaptivity mechanism into the compiler will help to overcome these problems, as compilers already support many micro-optimizations by default; they just have to be applied. This eliminates redundancy in the design and also gives full control over the compilation process, which allows restricting the optimizations or heuristics that the compiler uses, so as not to interfere with manually adapted code. Another advantage of this approach is that it combines nicely with our proposed solution for the first challenge, as modifications of the compiler to support a DSL can already require the design of a compilation pipeline. In our prototype, we are able to utilize the different granularities of the different DSLs that MLIR offers to adaptively optimize the query in MLIR's own compilation pipeline.

Challenge 3 – Learning Variants

With the first two challenges, we described the problem of variability, but without a good variant selection mechanism, such variability might be essentially useless. The main problem for finding the best variants is the extensive search space that increases exponentially with every added optimization. The state-of-the-art approaches we
described in subsection 2.3 leverage machine learning techniques to learn and predict the best variants for a certain workload. The HAWK framework uses an offline-learning approach that is fed with example queries and, in this way, learns the best flavors for the underlying hardware [5]. To reduce the learning time, additional heuristics are used to prune the exploration space. This approach focuses solely on optimization for a mostly static workload and on the best working parameters for optimal hardware utilization. Another approach, used by micro-adaptivity, is an online-learning approach that can also adapt to changing workloads [4]. This is achieved by formulating variant selection as a multi-armed bandit learning problem. This continuously relearning system is also efficiently usable without prior training.

We argue that an online-learning approach is, in general, preferable over offline learning, as it is designed to adapt to changes in the data management workloads by itself, whereas offline learning would need re-training over new representative workloads, which nowadays can change quickly and are therefore not easily predictable. A solution combining offline learning to establish the starting models for online learning might combine the best of both approaches.

We envision not only using models to learn the best optimizations, instructing the compiler to apply certain optimizations at chosen places, but also generating parameters for each optimization, and guiding the compiler regarding the number of optimization passes to apply. Finally, we envision that the selected solution should integrate well with an MLOps framework, considering the lifecycle, evaluation, and maintenance of the models used.

Challenge 4 – Integration Into High-Level Optimization

To take a step from a pure query execution engine towards a more widely usable database engine, an adaptive reprogramming framework must be compatible with other high-level optimizations such as algebraic optimization, index selection and algorithm selection. To achieve this goal, one of two challenges has to be accomplished: either the reprogramming framework is integrated into an existing DBMS, or the missing components are integrated into the framework. Neither of these two solutions has been successfully applied yet. To the best of our knowledge, there only exist approaches that integrate parts of traditional optimizations into the adaptivity pipeline. The HAWK framework, for instance, has limited support for hash table algorithm selection besides the operator-level variant granularity. Furthermore, Jungmair et al. showed a first example of how to successfully integrate algebraic optimization into MLIR, from which further high-level optimizations can be introduced [14]. Even though both versions are possible, we plan to adapt the modeling of relational algebra in MLIR, as it is a more flexible approach, with the added possibility of combining it with the learning mechanism mentioned before.

Challenge 5 – Adaptivity for OLTP

Current approaches in code generation for data processing focus heavily on OLAP queries, which is reasonable due to their complexity and high latencies. This focus neglects OLTP queries, which, we believe, might still provide room for improvement through code generation, with a focus on jointly optimizing sets of concurrent queries or templates of them. We envision, at this stage, that OLTP queries could be served in at least three main ways. First, recent work has shown promising results by tackling the challenge of learning an adaptive concurrency control mechanism for a mix of pre-defined query templates at different contention scenarios [15]. We consider that different mixes of these queries and the variable concurrency control mechanism might be able to leverage a code generation framework for higher flexibility and adaptivity, in the search for high throughput. To achieve this, we consider that multi-query optimization might be necessary, opening the door for further kinds of improvements, such as operator sharing. Secondly, we consider that interpretation of MLIR for faster execution without compilation overhead, and a switch between interpretation and compilation through coroutines, could be relevant for workload mixes that could be categorized as HTAP. Finally, we argue that the flexibility of MLIR allows for the study of a new dialect focusing on robust processing over adjacent data versions (as used in MVCC or in Delta Lakes), which might be able to provide competitive support for high-contention scenarios.

4. A Transformation-Based Approach for Adaptivity

In the following, we present our prototype, which seeks to tackle, at this stage, the first two challenges described in the previous section. We start with the architecture of our approach (cf. Figure 1).

To have a more fine-grained granularity with a larger variant space while only adding a rather small overhead, we adapted Voila [6] as a DSL with sub-operator granularity (Challenge 1), and the additional feature of being agnostic to the query execution strategy. The design of Voila also allows for simple adaptation to be efficiently executed on heterogeneous processors, as similar designs have already demonstrated [16]. Instead of explicit predication, we rely on a separate compilation pass to automatically choose when to use predication and when to materialize selections, which allows for more flexibility in the
optimization while also simplifying the language.

[Figure 1 (schematic): SQL Query → Voila (hardware-oblivious query representation) → MLIR (adaptive code generation) → LLVM (target-dependent code generation) → JIT execution on heterogeneous hardware → Query Results]
Figure 1: High-level architectural schema of the framework

Implementing those types of adaptivity directly into the DSL has the great advantage that it makes the DSL adaptive by design (Challenge 2). In contrast to other approaches, there is no longer the need to provide an additional callback to pre-compiled algorithms or to change the query plan. Instead, the query engine directly calls the generated query code, which is adapted during JIT-compilation by the supplied transformation and optimization passes.

The DSL itself is built upon the MLIR and LLVM frameworks, which enable simple definitions of micro-optimizations as transformations. Some commonly used transformations, such as loop unrolling, vectorization and parallelization, are already implemented and can be used. Transformations in MLIR are applied in passes, and additional filters even allow applying transformations selectively to certain instructions, or supplying certain parameters to the passes. In addition to the pre-existing passes, we implemented a selection dematerialization pass to compensate for the missing predication by replacing the materialized result of a selection with a predicate indicating the selected tuples, when possible. The dematerialization pass uses a forward data-flow analysis of selection results to decide whether to materialize the result or keep the predication. Currently, we materialize the results whenever pipeline breakers are encountered, but the pass can be extended with further options and more complex scenarios on when to replace materialization with predication and vice versa.

As MLIR is built upon LLVM, it also comes with native LLVM integration: we use the MLIR framework and its dialects to introduce adaptivity into the code and lower it to LLVM, which is then JIT-compiled and executed.

The usual compilation pipeline of our framework looks as follows. Once Voila has been translated to Voila-MLIR, which includes type resolution and the transformation of columns to the tensor data type, aggressive function inlining is performed to achieve redundant sub-query removal through canonicalization and common sub-expression elimination, as well as to allow for better optimizations by eliminating function call boundaries.

Afterward, the dematerialization pass is performed, and the resulting plan is lowered to MLIR's internal dialects, either linalg or affine. The linalg dialect describes loops using the concept of iterators. Each linalg operation consists of a set of input tensors and generates a set of output tensors. The body of linalg operations has to be largely side-effect free and, aside from regular reductions, does not allow for data or control flow dependencies between the iterations. Therefore, we resort to the affine dialect for operations that need broader constraints. The affine dialect models range-based for loops and allows direct memory manipulation, only constrained by the loop induction variables.

Once Voila-MLIR is translated to these high-level internal dialects of MLIR, we apply loop optimization passes such as parallelization, tiling, unrolling and loop fusion. Afterward, we continue lowering towards lower-level dialects such as the vector dialect, which represents virtual, hardware-independent vectors and instructions that can be directly mapped to the hardware's SIMD capabilities. Other translations in this phase are the lowering of parallel loops to the async dialect, which represents multi-threading through LLVM coroutines, or to the openmp dialect. In this step, it is also possible to transform parallel loops for execution on further coprocessors such as GPUs, e.g., by using the gpu dialect. This dialect models the GPU execution model and also handles communication between CPU and GPU. For concrete execution on a GPU, the dialect can further be lowered to SPIR-V or to AMD- and NVIDIA-specific dialects.

In the next step, the low-level MLIR is lowered to LLVM and finally compiled to executable code, after which the query can be executed.
tive support for heterogeneous hardware. Hence, we
are confident that our DSL is able to adjust its behavior
to fit processing paradigms on heterogeneous hardware                        5. Evaluation
with only minimal extensions. However, in the following
w.l.o.g., we focus on the extensible CPU-only part. There-                   5.1. Dataset & Modifications
fore, we use vectorization to utilize SIMD capabilities of
                                                                             To get an overview of the real-world behavior of our
modern CPUs, as well as multithreading with built-in
                                                                             framework, we use a subset of the TPC-H benchmark.
LLVM coroutines, or the OpenMP runtime.
                                                                             More precisely, we choose the queries Q1, Q3, Q6, Q9
   In order to translate Voila to executable machine
                                                                             and Q18 as a representative sample for typical OLAP
code, we added a backend to Voila that translates Voila
                                                                             queries [17]. Since our framework does not yet support
to a Voila-MLIR dialect. From there on, we use the
joins efficiently and has no support for string data types, we modify the TPC-H dataset slightly to still be able to run the queries.
   We use order-preserving dictionary compression for string columns, and we denormalize tables that would need to be joined in order to replace joins by selections, which are simpler to implement efficiently. We argue that the dictionary compression has no significant impact on the results of our benchmarks: strings in TPC-H are mostly used as payloads [18], and since we already use column-oriented query processing, the string columns would commonly not be loaded at all. The only exception is Q9, where a string pattern matching takes place that has to be translated in the case of dictionary compression. We work around the matching by using a large IN-clause, checking for the existence of each tuple using a hash table. For the replacement of joins with simple selections on denormalized tables, Li and Patel suggest that it has no large influence on query performance and leads to comparable results [19].

5.2. Setup

In order to get a first overview of the performance characteristics of our experiments on different platforms, we use the three machines listed in Table 1. All three machines use Intel processors, but from different generations and with slightly different architectures and clock rates, which should already show varying results.

Table 1
Benchmark machine specifications

             Processor                         RAM      SIMD
  Machine 1  Intel Xeon E-2286M @ 2.4 GHz      128 GB   AVX2
  Machine 2  Intel Xeon Gold 6130 @ 2.1 GHz    346 GB   AVX-512
  Machine 3  Intel Xeon E5-2630 v3 @ 2.4 GHz   1008 GB  AVX2

   To keep performance differences limited to the hardware, we created a unified build toolchain, with all machines using the same bootstrapped LLVM version 13.0.1 to build the experiments. Additionally, we supplied the following compiler flags for optimization: -O3 -flto -march=native. Here, -O3 instructs the compiler to apply more aggressive, but also more expensive (in terms of compile time) optimizations; -flto enables link-time optimizations, such as de-virtualization of functions; and -march=native tells the compiler to enable instructions that are available on the architecture on which the code is compiled, usually SSE4 and AVX, among others. While such advanced instructions can lower the latency of the optimized code, they also make the binary largely non-portable to other architectures. Overall, these optimizations only affect the compile time of the framework itself, but not the query execution time, as these flags are not applied to the query compiler. To get more robust timing results, we disabled software-based frequency scaling and ran each benchmark 100 times. The reported time is the median over these runs.

5.3. Adaptivity Behavior

In order to get an overview of the effect of our optimizations on the runtime of the queries, we generate only a subset of the variant space that we are able to generate. To restrict the number of variants to a maintainable size, we create all configurations of the following five major optimizations:

loop unrolling: copying the loop body multiple times within the loop to reduce the control-flow overhead of the loop condition check

loop tiling: splitting the loop into an inner and an outer loop, effectively creating a blocking of memory accesses to increase cache performance

selection dematerialization: replacing materializations of intermediate selection results by bitvectors indicating the selected tuples, to reduce memory usage and increase vectorization capabilities

parallelization: distributing independent loop iterations to multiple cores using LLVM coroutines or OpenMP

vectorization: transforming instructions in loop bodies to their SIMD counterparts for data-parallel execution

   In addition, we used some minor, permanently enabled optimizations, such as CSE, inlining, and buffer optimization passes, to obtain both better optimizable and better optimized code.
   Furthermore, we restricted the configuration of these optimizations to a per-program level instead of generating variants for each operation. With these prerequisites, we generated 120 variants for each machine and show the results of the pure execution times, without compilation, as heatmaps in Figure 2. To keep the time for running the configurations acceptable, we used the TPC-H dataset with scale factor 1.
   Each column of a heatmap represents a different vector size, while the rows describe combinations of unrolling, tiling and selection dematerialization. The different parallelization techniques are separated into Figure 2a, Figure 2b and Figure 2c.
   Due to an outdated OpenMP runtime on Machine 3, there are no heatmaps for OpenMP parallelization on this machine. Furthermore, combining vectorization with tiling for Q6 was problematic. The main reason
was that the generated code could not be compiled, because parallel loops only allow reductions on scalar types through atomic read-modify-write operations. However, as the loop body was vectorized, the reduction would have to be performed on vectors, which is not supported. Due to this limitation in MLIR, we decided to exclude variant results with tiling altogether for Q6. The key observations from this experiment are:

    1. All machines show similar trends regarding variants, but different configurations turn out to be most efficient (e.g., unrolling vs. no unrolling for Q6, or OpenMP vs. async).
    2. Tiling has a mostly negative influence on query performance, with a speed-up of ×1 to ×0.1 compared to no optimization.
    3. For Q1 and Q6, selection dematerialization improves the performance by a factor of up to ×10, whereas Q3 and Q9 showed decreased performance, slowing down to as much as ×0.25.
    4. Loop unrolling had only a moderate influence when combined with vectorization for larger vector sizes.
    5. Vectorization had slightly positive effects on the runtime of Q6 (×2 speed-up); for the other queries it had slightly negative effects.
   Overall, these experiments show that our approach is able to generate differently performing variants that can be used to optimize the runtime of different queries depending on the executing hardware and software characteristics, even with only a small part of the possible variant space explored. On the other hand, we also observe some unexpected behavior, such as the partially low influence of vectorization and parallelization on the runtime, especially visible for Q1, Q9, and Q18. After inspecting the generated MLIR code, we saw that these queries are often not transformed to parallel loops, because MLIR's transformation pass is too restrictive in its side-effect analysis: it refuses to parallelize/vectorize loop nests with non-affine branches instead of traversing and analyzing their memory accesses recursively.
Figure 2: Heatmaps showing the runtimes of the query variants (Q1, Q3, Q6, Q9, Q18) with the optimizations tiling (t), unrolling (u), and selection dematerialization (sel); each heatmap column corresponds to a vector size of 1, 2, 4, 8, or 16. (a) Variant performance for all machines without parallelization. (b) Variant performance for all machines with coroutines. (c) Variant performance for Machines 1 and 2 with OpenMP.


6. Conclusion & Future Work

In this paper, we presented our vision of the future of fully adaptive database query execution engines and showed that support for micro-optimizations in database query engines can have a performance impact of orders of magnitude. First, we identified the main challenges to tackle in order to achieve a fully adaptive database engine, which, despite the clear advantages of such systems, still does not exist today. We find that there are approaches that tackle each of the challenges individually, but up to now no framework exists that takes everything into account to achieve an entirely adaptive solution. Therefore, we propose a simple and practical approach that leverages the Voila DSL and the MLIR compiler framework to introduce adaptivity into the query compilation
pipeline; thus, we tackle the challenges of adaptivity and granularity of the optimizations. Moreover, we highlight further pathways to tackle the remaining challenges in the context of our approach.
   We briefly evaluated the functionality and performance using a modified version of the TPC-H benchmark. We showed that our approach is able to generate diverse variants with different performance properties. While we still found some issues and the prototype is at an early stage, we are confident that, once these issues are addressed, our approach will be able to outperform non-adaptive systems.
   Our immediate next step is to add support for heterogeneous hardware, as well as to improve the prototype to be more competitive by overcoming the identified problems. Additionally, we want to extend our approach to support joins efficiently, as well as string types. In the future, we plan to extend our prototype to tackle all of our presented challenges, to become a fully adaptive, reprogramming, heterogeneous database system that is able to achieve peak performance independent of the supplied workload and hardware.


References

 [1] T. Gubner, P. Boncz, Highlighting the Performance Diversity of Analytical Queries using VOILA, in: Proc. ADMS 2021, 2021.
 [2] K. Dursun, C. Binnig, U. Cetintemel, R. Petrocelli, SiliconDB: Rethinking DBMSs for Modern Heterogeneous Co-Processor Environments, in: Proc. DaMoN 2017, ACM, 2017.
 [3] D. Broneske, Adaptive Reprogramming for Databases on Heterogeneous Processors, in: SIGMOD 2015 PhD Symposium, ACM, 2015, pp. 51–55.
 [4] B. Răducanu, P. Boncz, M. Zukowski, Micro Adaptivity in Vectorwise, in: Proc. SIGMOD 2013, ACM, 2013, pp. 1231–1242.
 [5] S. Breß, B. Köcher, H. Funke, S. Zeuch, T. Rabl, V. Markl, Generating Custom Code for Efficient Query Execution on Heterogeneous Processors, VLDB Journal 27 (2018) 797–822.
 [6] T. Gubner, P. Boncz, Charting the Design Space of Query Execution Using VOILA, Proc. VLDB Endow. 14 (2021) 1067–1079.
 [7] C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, O. Zinenko, MLIR: Scaling Compiler Infrastructure for Domain Specific Computation, in: Proc. CGO 2021, IEEE/ACM, 2021, pp. 2–14.
 [8] C. Lattner, V. Adve, LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation, in: Proc. CGO 2004, IEEE, 2004, pp. 75–86.
 [9] P. A. Boncz, et al., Monet: A Next-Generation DBMS Kernel for Query-Intensive Applications, 2002.
[10] T. Gubner, P. Boncz, Excalibur: A Virtual Machine for Adaptive Fine-Grained JIT-Compiled Query Execution Based on VOILA, Proc. VLDB Endow. 16 (2022).
[11] T. Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware, Proc. VLDB Endow. 4 (2011) 539–550.
[12] Y. Srikant, P. Shankar, The Compiler Design Handbook: Optimizations and Machine Code Generation, CRC Press, 2003.
[13] C. Bertoni, J. Kwack, T. Applencourt, Y. Ghadar, B. Homerding, C. Knight, B. Videau, H. Zheng, V. Morozov, S. Parker, Performance Portability Evaluation of OpenCL Benchmarks across Intel and NVIDIA Platforms, in: Proc. IPDPSW 2020, 2020, pp. 330–339.
[14] M. Jungmair, A. Kohn, J. Giceva, Designing an Open Framework for Query Optimization and Compilation, Proc. VLDB Endow. 15 (2022) 2389–2401.
[15] J.-C. Wang, D. Ding, H. Wang, C. Christensen, Z. Wang, H. Chen, J. Li, Polyjuice: High-Performance Transactions via Learned Concurrency Control, in: OSDI, 2021, pp. 198–216.
[16] H. Pirk, O. R. Moll, M. Zaharia, S. Madden, Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware, Proc. VLDB Endow. 9 (2016) 1707–1718.
[17] T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, P. A. Boncz, Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask, Proc. VLDB Endow. 11 (2018) 2209–2222.
[18] A. Vogelsgesang, M. Haubenschild, J. Finis, A. Kemper, V. Leis, T. Mühlbauer, T. Neumann, M. Then, Get Real: How Benchmarks Fail to Represent the Real World, in: Proc. DBTest 2018, 2018, pp. 1–6.
[19] Y. Li, J. M. Patel, WideTable: An Accelerator for Analytical Data Processing, Proc. VLDB Endow. 7 (2014) 907–918.