Towards Porting Hardware-Oblivious Vectorized Query Operators to GPUs

Johannes Fett, Annett Ungethüm, Dirk Habich, Wolfgang Lehner
Database Systems Group, Technische Universität Dresden, Dresden, Germany
{johannes.fett,annett.ungethuem,dirk.habich,wolfgang.lehner}@tu-dresden.de

32nd GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), September 01-03, 2021, Munich, Germany. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Nowadays, query processing in column-store database systems is highly tuned to the underlying (co-)processors. This approach works very well from a performance perspective, but has several shortcomings from a conceptual perspective. For example, this tuning introduces high implementation as well as maintenance cost and one implementation cannot be ported to other (co-)processors. To overcome that, we developed a column-store specific abstraction layer for hardware-driven vectorization based on the Single Instruction Multiple Data (SIMD) parallel paradigm. Thus, we are able to implement vectorized query operators in a hardware-oblivious manner, which can be specialized to different SIMD instruction set extensions of modern x86-processors. To soften the limitation to x86-processors, we describe in this paper our vision to integrate GPUs into our abstraction layer by interpreting GPUs as virtual vector engines. Moreover, we present some initial evaluation results to determine a reasonable virtual vector size. We conclude the paper with an outlook on our ongoing research in that direction.

1. INTRODUCTION
Analytical database queries typically access a small number of columns, but a high number of rows, and are thus most efficiently answered by column-store database systems [1]. Since the amount of data is still growing, these systems constantly adapt to novel hardware features to satisfy the requirements of high query throughput and low query latency [2, 3, 4].
From the hardware perspective, we see that Moore's Law is still valid and the transistors on a chip double about every two years [5]. Unfortunately, we also see an end of Dennard scaling, so that not all transistors can be active due to power constraints [5]. To deal with that, vectorization, parallelization, specialization, and heterogeneity are key approaches for hardware designers [5].

For this reason, vectorization based on the Single Instruction Multiple Data (SIMD) parallel paradigm has established itself as a core query optimization technique in column-store systems [1, 6, 7]. SIMD increases the single-thread performance by executing a single operation on multiple data elements in a vector register simultaneously (data parallelism) [8]. Such SIMD capabilities are common in today's mainstream x86-processors using specific SIMD instruction set extensions, whereas a current hardware trend is that these extensions are growing not only in terms of the complexity of the provided instructions but also in the size of the vector registers (number of data elements processed in parallel). To tackle the evolving SIMD-specific diversity, we developed a novel abstraction layer called Template Vector Library (TVL) for in-memory column-stores [6]. Using that TVL, we are able to implement hardware-oblivious vectorized query operators, which can be specialized to different SIMD instruction set extensions at query compile-time [6].

Besides vectorization, hardware also shifts from homogeneous x86-processors towards heterogeneous systems with different computing units (CUs) [5]. In this context, there is already a huge number of research works that deal with the use of different CUs such as GPUs or FPGAs for an efficient analytical query processing [2, 3, 4, 9, 10, 11]. In general, these works have shown the great potential, but all these approaches have a common shortcoming: for each CU, a separate hardware-conscious and hand-tuned code base for query operators has to be implemented and maintained using different programming concepts. From a performance perspective, this approach works very well, but the efforts for implementation and maintenance often outweigh the benefits. To overcome that, our overall vision is to enhance our SIMD abstraction layer TVL to cover different heterogeneous CUs as well.

Our Contribution and Outline. In this paper, we describe our vision and present some initial steps to integrate GPUs into our abstraction layer by interpreting GPUs as virtual vector engines. Generally, GPUs use a Single Instruction Multiple Threads (SIMT) execution model, which can also be interpreted as SIMD combined with multi-threading. Based on that, GPUs seem like a perfect match for our TVL to soften the limitation to x86-processors. Thus, our contributions in this paper are the following:

1. We start with an introduction to our SIMD abstraction layer TVL as well as an architectural description of NVIDIA GPUs in Section 2.
2. Then, we present our general idea of SIMDization of GPUs in Section 3. To interpret a GPU as a virtual vector engine, we have to determine a reasonable virtual vector size, which can be most efficiently processed in parallel. The vector size is important because it is an essential part of SIMD and an integral part of our TVL. For that, we present some TVL-oriented experiments. Based on the results, we derive a virtual vector size.
3. In Section 4, we summarize our lessons learned and describe our ongoing activities in that direction.
4. We close the paper with related work in Section 5 and a short summary in Section 6.

Figure 1: Architectures of the Template Vector Library (TVL) (a) and of NVIDIA GPUs (b).

2. BACKGROUND
In this section, we introduce our Template Vector Library as a SIMD abstraction layer for column-stores. Moreover, we briefly describe the GPU architecture and execution model.

2.1 Template Vector Library
Vectorization is a state-of-the-art query optimization technique in in-memory column-stores, because all recent x86-processors offer powerful SIMD extensions [1, 7, 12, 13, 14]. SIMD provides data parallelism by executing a single instruction on multiple data elements simultaneously [8]. To achieve the best performance, explicit vectorization using SIMD intrinsics is still the best way [6, 13, 14], whereas intrinsics are functions wrapping the underlying machine calls. However, these SIMD extensions are increasingly diverse in terms of (i) the number of available vector instructions, (ii) the vector length, and (iii) the granularity of the bit-level parallelism, i.e., on which data widths the vector instructions are executable [6]. To hide this heterogeneity, we developed a specific abstraction layer called Template Vector Library (TVL) for column-stores [6].

Our abstraction approach follows a separation-of-concerns concept as shown in Fig. 1(a). On the one hand, it offers hardware-oblivious but column-store specific primitives, which are similar to intrinsics. The primitives are derived from state-of-the-art vectorized columnar query operators. For a better organization, we grouped these primitives into seven self-descriptive classes like load/store (L/S) or an arithmetic class, including a unified interface per class [6]. On the other hand, our TVL is also responsible for mapping the provided hardware-oblivious primitives to different SIMD extensions. For this mapping, our TVL includes a plug-in concept and each plug-in has to provide a hardware-conscious implementation for all primitives.

From an implementation perspective, our abstraction concept is realized as a header-only library, where the hardware-oblivious primitives abstract from SIMD intrinsics. These primitives are generic functions representing a unified interface for all SIMD architectures. In addition to the primitives, we introduced generic data types:

base_t: The base type can be any scalar type.
vector_t: The vector type contains one or more values of the same base type.
mask_t: A mask is a scalar value, which is large enough to store one bit for each element in a vector.

Using the provided primitives and the data types, we can implement columnar query operators in a hardware-oblivious way. For the hardware-conscious mapping, we use template metaprogramming, requiring hardware-conscious implementations of all primitives for all underlying SIMD extensions. This function template specialization has to be implemented, whereby the implementation depends on the available functionality of the SIMD extension. In the best case, we can directly map a TVL primitive to a SIMD intrinsic. However, if the necessary SIMD intrinsic is not available, we are able to implement an efficient workaround in a hardware-conscious way. This implementation is independent of any query operator and must be done only once for a specific SIMD extension.
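To make this mapping more concrete, the following sketch shows how a hardware-oblivious primitive could be declared once and then mapped to one SIMD extension via function template specialization. It is an illustration only: the processing-style struct, its member names, and the primitive signature are simplified placeholders and not the actual TVL interface; only the mapping idea (one primitive, one intrinsic per extension) is taken from the text above.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Simplified processing style describing AVX2 with 64-bit unsigned integer elements.
struct avx2_uint64 {
    using base_t   = uint64_t;
    using vector_t = __m256i;
    static constexpr size_t vector_element_count = 4;  // 256 bit / 64 bit
};

// Hardware-oblivious primitive: declared once for all processing styles.
template< class ps >
typename ps::vector_t add( typename ps::vector_t a, typename ps::vector_t b );

// Hardware-conscious specialization: direct mapping to an AVX2 intrinsic.
template<>
avx2_uint64::vector_t add< avx2_uint64 >( avx2_uint64::vector_t a, avx2_uint64::vector_t b ) {
    return _mm256_add_epi64( a, b );
}

If an extension lacks a suitable intrinsic, the specialization would instead contain an efficient workaround, as described above; in both cases, it is written once per extension and reused by every operator.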
Figure 2 illustrates what a hardware-oblivious vectorized operator using our TVL looks like and how it can be called. We show a simple aggregation (summation) operator consisting of four TVL primitives:

set1: fills a vector register with a given value
load: loads multiple consecutive data elements into a vector register
add: executes an element-wise addition on the data elements in two vector registers
hadd: executes a horizontal addition on the data elements in one vector register

The aggregation operator assumes that the number of data elements is a multiple of the number of elements per vector and works as follows: One vector register called resVec is filled with zeros. Afterwards, the operator iterates over the input column in and loads a number of consecutive data elements into a second vector register called dataVec. Then, the data elements in both vector registers are added element-wise and the result is stored in resVec. When all elements of the input column have been processed, the horizontal aggregation hadd is carried out to determine the final sum result.

To specialize this hardware-oblivious operator implementation during query compile-time, we use three template parameters called processingStyle (ps): (i) the vector extension (e.g., SSE, AVX, NEON, or scalar), (ii) the vector size in bit, and (iii) the base data type with bit granularity (e.g., int8, int64, float). The definition of ps is shown in Figure 2, where it is used to call the aggregation operator.

// Aggregation operator definition.
template< class ps >  // processing style
base_t agg( const base_t* in, size_t elCount ) {
    // For simplicity, we assume that elCount is a multiple of the number of
    // data elements per vector register.
    const size_t vecCount = elCount / ps::vector_element_count;
    // Initialize running sum to zero.
    vector_t resVec = tvl::set1( 0 );
    // Add all input data elements to the running sum.
    for( size_t i = 0; i < vecCount; ++i ) {
        vector_t dataVec = tvl::load( in );
        resVec = tvl::add( resVec, dataVec );
        in += ps::vector_element_count;
    }
    // Calculate the final sum using horizontal summation of the vector elements.
    return tvl::hadd( resVec );
}

// Calling the operator.
using ps = tvl::avx2< tvl::v256< uint64_t > >;  // for example
size_t elemCount = 1024;
uint64_t* array = generate_data( elemCount );
uint64_t sum = agg< ps >( array, elemCount );

Figure 2: A simple sum-aggregation operator using the TVL [6].
2.2 GPUs
Graphics Processing Units (GPUs) are increasingly used for large-scale query processing in database systems [2, 3, 4, 15]. Specifically, their hardware parallelism and memory access bandwidths contribute to considerable speedups. Figure 1(b) depicts a simplified architecture of an NVIDIA GPU. Generally, a modern GPU consists of (i) a large global main memory with a memory bandwidth of up to 1.2 TB/s and (ii) a number of compute units called Streaming Multiprocessors (SMs). Each SM has a number of simplistic cores, a fixed set of registers, and shared memory. This shared memory serves as a scratchpad and can be accessed by all cores in the SM. Moreover, the GPU has an on-chip L2 cache, which is shared across all SMs, and optionally, each SM may have a local L1 cache. The number of SMs, the number of cores per SM, the size of the global memory, the size of the L2 cache, etc. vary across GPU products.

The execution model of GPUs is called Single Instruction Multiple Threads (SIMT), whereas SIMT is very similar to SIMD. While multiple data elements are processed by a single instruction in SIMD, multiple threads are processed by a single instruction in lock-step in SIMT. That means, each thread in SIMT executes the same instruction, but on different data. While a thread switch is very costly on CPUs, GPUs can handle thread switching with more ease. GPUs feature fast thread switching based on groups of 32 threads called warps. The warp scheduler issues instructions to warps available on an SM beyond the number of physical cores to hide latency [16]. Thus, it is encouraged to create more threads than there are physical cores. Then, this overload can be used to schedule threads for execution while others wait for a memory transfer. This is especially important since databases are more likely I/O-bound than CPU-bound, and it is thus one of the most important features for implementing query operators on GPUs [4].

For the GPU implementation, general-purpose parallel programming models such as CUDA [17] or OpenCL [18] have to be used, whereas CUDA generates better performance results. The CUDA programming model consists of two code parts: host code and GPU code. The host code runs in a CPU process and is responsible for setting up the environment, memory transfers between CPU and GPU, and executing kernels on the GPU. On the GPU, a function called kernel is executed in parallel by a number of threads and blocks. Blocks consist of a number of threads and are assigned to one streaming multiprocessor. Threads are assigned to a block. Thus, executing a kernel requires at least two parameters: the first parameter defines how many blocks will be spawned, the second one sets the number of threads per block. The total number of GPU threads is the product of both parameters. Depending on the data size and the number of GPU threads, the number of elements processed per thread can vary.
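As a minimal illustration of this execution model, the following sketch shows an element-wise addition kernel and its launch with the two configuration parameters mentioned above. It is an assumption-based example, not the paper's actual kernel: the kernel name, the grid-stride loop, and the chosen launch values are ours; the grid-stride loop merely reflects that the number of elements processed per thread varies with the configuration.

#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// GPU code: the kernel is executed in parallel by blocks * threadsPerBlock threads.
__global__ void add_kernel( const uint64_t* a, const uint64_t* b, uint64_t* out, size_t n ) {
    // Each thread starts at its global index and strides by the total number of threads,
    // so the number of elements per thread depends on n and the launch configuration.
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for( size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride ) {
        out[i] = a[i] + b[i];
    }
}

// Host code: the first launch parameter is the number of blocks,
// the second one the number of threads per block.
void launch_add( const uint64_t* d_a, const uint64_t* d_b, uint64_t* d_out, size_t n ) {
    const unsigned threadsPerBlock = 256;   // example configuration
    const unsigned blocks = 72;             // example configuration
    add_kernel<<< blocks, threadsPerBlock >>>( d_a, d_b, d_out, n );
    cudaDeviceSynchronize();
}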
3. GPU-SIMDIZATION EXPERIMENTS
Our overall vision is to fully support GPUs in our TVL as shown in Figure 1(a). For that, we have to create a GPU-specific hardware-conscious implementation of the hardware-oblivious interface. To achieve that, we interpret GPUs as virtual vector engines and call this interpretation SIMDization. For an optimal SIMD processing, we have to find a reasonable vector size. This vector size is virtual because we want to interpret regular arrays as virtual vector registers. For example, on Intel x86-processors, the different SIMD instruction set extensions have a vector length of either 128, 256, or 512 bit. If we assume 64 bit per data element, we can simultaneously process 2 elements in a vector register of size 128 bit. The wider the vector registers, the more data can be processed in parallel. We are not aware of any work that has ever determined a vector size for GPUs.

3.1 Vector Length Evaluation
To determine a reasonable virtual vector size for our SIMDization approach for GPUs, several experiments were executed. A first set of experiments was conducted on an NVIDIA Quadro RTX 8000 GPU. This GPU has a global main memory size of 48 GiB, a memory bandwidth of 672 GB/s, an L2 cache with a size of 6 MiB, and 72 SMs. Each SM has 64 cores, resulting in a total of 4,608 cores.

For the hardware-conscious implementation, we can distinguish three main groups of hardware-oblivious SIMD primitives across all TVL classes: (i) load/store primitives, (ii) element-wise primitives, and (iii) horizontal primitives. Element-wise primitives are characterized by the feature that they do not introduce dependencies between the elements of the same vector register, e.g., element-wise arithmetic, comparisons, or boolean logic. In contrast to that, horizontal primitives do not treat the elements of a vector independently. An example is the horizontal addition. As shown in our example sum-aggregation operator (cf. Figure 2), different primitives from all groups are used to implement a vectorized query operator. In this operator, we use an element-wise and a horizontal add primitive.

In our first experiments, we evaluated the element-wise and the horizontal addition on the GPU. For that, we implemented simple vectorized CUDA kernels and executed these kernels using different vector sizes. For the horizontal addition, we use the function cub::DeviceReduce from the CUDA SDK 11.2, the most efficient available implementation, as a foundation. Since our GPU has 4,608 cores and usually more threads than cores are used for good performance, we investigated vector sizes in the range from 256 KiB to 1 GiB. In terms of the number of elements, we evaluated the range from 2^13 to 2^27 64-bit data elements. For each vector size, different CUDA configurations with blocks and threads per block are possible, and we tested a large number of different configurations in a systematic way.

Element-wise Addition: This kernel was performed on two input columns A and B. Each column was a sequence of randomly generated unsigned 64-bit integer values with a size of 1 GiB. The result was written back into column B. Within a loop, our vectorized kernel is called until the whole columns have been processed. Figure 3 shows the best achieved throughput over all configurations for each vector size. As we can see, small vector sizes negatively affect the performance. For a vector size of 256 KiB, a throughput of 248.87 GiB/s is achieved. The best performance is gained by using a vector size of 1 GiB, which results in a throughput of 530.99 GiB/s. A vector size of 8 MiB is 3.8% slower than a vector size of 1 GiB. We conclude that higher vector sizes lead to higher throughput and, in the best case, the vector size corresponds to the column size.

Figure 3: Evaluation of element-wise primitive add.
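The loop-wise processing described above can be pictured as follows: the columns are interpreted as a sequence of virtual vector registers of a given number of elements, and the kernel is launched once per such chunk. This is a sketch under our own assumptions (it reuses the hypothetical add_kernel from the previous sketch and an arbitrary launch configuration); the paper's actual driver code is not shown here.

#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Process two columns chunk by chunk; each chunk acts as one virtual vector register.
// The result is written back into column B, as in the experiment.
void elementwise_add_columns( const uint64_t* d_a, uint64_t* d_b,
                              size_t colElems, size_t vecElems,
                              unsigned blocks, unsigned threadsPerBlock ) {
    for( size_t off = 0; off < colElems; off += vecElems ) {
        const size_t n = ( off + vecElems <= colElems ) ? vecElems : ( colElems - off );
        add_kernel<<< blocks, threadsPerBlock >>>( d_a + off, d_b + off, d_b + off, n );
    }
    cudaDeviceSynchronize();
}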
Horizontal Addition: This kernel adds all elements in a vector register together. The kernel is based on the cub::DeviceReduce function shipped as part of the CUDA SDK. For our evaluation, we generated different columns where the column size corresponds to the vector size. Again, the data elements are 64-bit unsigned integer values. Figure 4 shows the best achieved throughput over all configurations for each vector size. Similar to the element-wise addition, higher vector sizes lead to higher throughput.

Figure 4: Evaluation of horizontal primitive hadd.
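For reference, the following sketch shows how a horizontal add over one virtual vector register can be expressed with cub::DeviceReduce::Sum, which the evaluation above builds on. The two-phase call (first querying the temporary storage size, then running the reduction) follows the CUB API; the wrapper function, its name, and the buffer handling are our own illustrative assumptions.

#include <cstdint>
#include <cub/cub.cuh>

// Sum all elements of a virtual vector register residing in GPU memory.
uint64_t horizontal_add( const uint64_t* d_in, uint64_t* d_out, int numItems ) {
    void*  d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;
    // First call only determines the required temporary storage size.
    cub::DeviceReduce::Sum( d_temp_storage, temp_storage_bytes, d_in, d_out, numItems );
    cudaMalloc( &d_temp_storage, temp_storage_bytes );
    // Second call performs the actual reduction.
    cub::DeviceReduce::Sum( d_temp_storage, temp_storage_bytes, d_in, d_out, numItems );
    uint64_t sum = 0;
    cudaMemcpy( &sum, d_out, sizeof( uint64_t ), cudaMemcpyDeviceToHost );
    cudaFree( d_temp_storage );
    return sum;
}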
Aggregation Operator: Based on these evaluations, we could draw the conclusion that the best vector size should be the column size. To validate our hypothesis, we next evaluated the sum-aggregation operator consisting of both primitives. For this evaluation, we generated a single column with a size of 1 GiB and varied the vector size from 256 KiB to 1 GiB. As illustrated in Figure 5, we obtain a completely different result. The throughput increases up to a vector size of 2 MiB. Then, the throughput decreases and stabilizes at a low level. The main difference to our previous experiments is that a single vector register or array is now the main driver of the processing. As shown in Figure 2, one vector register resVec is filled with zeros at the beginning of the aggregation operator. Afterwards, we repeatedly load a chunk of the column into a second vector dataVec and conduct an element-wise add between both. The result is stored in the resVec vector. This resVec vector is frequently accessed and thus should be kept in cache.

Figure 5: Evaluation of Operator Aggregation.

To evaluate this cache-fitting in more detail, we slightly modified our experiment. We generated a new input column of size 1.5 GiB containing 64-bit unsigned integers and varied the vector size according to the on-chip L2 cache size: 1/4, 1/3, 1/2, 1, and 2 times the cache size. Figure 6 depicts the resulting throughputs. As we can see, we obtain the best performance when our vector size is a third of the L2 cache size. Larger vector sizes lead to a lower throughput.

Figure 6: Evaluation of cache size-fitting.
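Putting the two primitives together, the vectorized aggregation described above can be sketched as follows: a resVec array of the chosen virtual vector size is reused across all chunks (and should therefore stay in the L2 cache), each chunk of the input column plays the role of dataVec, and a final horizontal add produces the sum. This is our reading of the text, expressed with the hypothetical add_kernel and horizontal_add helpers from the earlier sketches; it is not the paper's implementation and it assumes the column size is a multiple of the vector size.

#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

uint64_t vectorized_agg( const uint64_t* d_in, size_t colElems, size_t vecElems,
                         unsigned blocks, unsigned threadsPerBlock ) {
    // resVec: one virtual vector register, filled with zeros (set1(0)).
    uint64_t* d_resVec = nullptr;
    cudaMalloc( &d_resVec, vecElems * sizeof( uint64_t ) );
    cudaMemset( d_resVec, 0, vecElems * sizeof( uint64_t ) );
    // Element-wise add of every chunk (dataVec) onto resVec.
    for( size_t off = 0; off < colElems; off += vecElems ) {
        add_kernel<<< blocks, threadsPerBlock >>>( d_resVec, d_in + off, d_resVec, vecElems );
    }
    // hadd: horizontal summation over the virtual vector register.
    uint64_t* d_sum = nullptr;
    cudaMalloc( &d_sum, sizeof( uint64_t ) );
    const uint64_t sum = horizontal_add( d_resVec, d_sum, (int)vecElems );
    cudaFree( d_sum );
    cudaFree( d_resVec );
    return sum;
}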
3.2 Comparing with Native CUDA
As shown above, we are able to determine a reasonable virtual vector size providing the best performance for our vectorized aggregation on the GPU. In this section, we present evaluation results comparing the vectorized aggregation with the native CUDA aggregation. For that, we generated various columns (sequences of 64-bit integer values) with increasing sizes from 2 MiB to 8 GiB. For the vectorized aggregation, we applied the best performing vector size of 2 MiB in all experiments. For the native CUDA aggregation, we used the function cub::DeviceReduce from the CUDA SDK, as already done for our horizontal aggregation.

The results are depicted in Figure 7. The throughput of the native CUDA aggregation is slightly higher than that of our vectorized aggregation. This is especially true for large columns, which is not particularly surprising. Nevertheless, the result is promising enough to specialize our hardware-oblivious vectorized query operators to GPUs and to obtain a reasonably good performance. We hope to increase the throughput of our vectorized execution with additional GPU-specific optimization techniques as discussed in Section 4.

Figure 7: Comparing vectorized versus CUDA native aggregation.

3.3 Validation
We also executed all our experiments on a second NVIDIA GPU, namely an NVIDIA GTX 1070 Ti. This GPU provides a global main memory size of 8 GiB, a memory bandwidth of 256 GB/s, an L2 cache of 2 MB size, and 19 SMs. Each SM has 128 cores, resulting in a total of 2,432 cores. In general, we observed a behavior similar to that of the NVIDIA Quadro RTX 8000. In particular, we executed the cache-fitting experiment to determine the best virtual vector size for the aggregation operator. In contrast to the RTX 8000 GPU, we obtain the best performance when our vector size is a fourth of the L2 cache size (512 KiB), as shown in Figure 8.

Figure 8: Validation on the NVIDIA GTX 1070 Ti.

4. FUTURE WORK
Based on the previous section, we conclude that SIMD processing using a virtual vector model is generally possible on a GPU. Choosing the right vector size and configuration is critical to achieve a good performance. Sub-optimal configurations reduce the performance by more than one order of magnitude. Overall, we obtain a good performance, which is, however, slower than native CUDA implementations. For the aggregation operator using the ideal vector size, our approach is 37.4% slower than a native CUDA approach in the worst case (8 GiB data size) and 10.8% slower in the best case (8 MiB data size). By validating our approach on a different GPU, we have also shown that our conclusions are more generally applicable beyond a single GPU model. Moreover, our hardware-oblivious query operators can be specialized to SIMD extensions as well as to GPUs in a unified way.

To summarize, our results are promising and our ongoing research in that direction will focus on the following aspects:

(1) Virtual Vector Size: To extend our work, we plan to evaluate other query operators. We seek to explore whether optimal vector sizes and configurations depend on the query operators.

(2) Implementation: Completing the GPU TVL requires implementing all primitives. To optimize the hardware-conscious implementations, we want to explore the usage of shared memory, registers, and persistent caching.

(3) Optimization: Besides an optimal hardware-conscious implementation for GPUs, we want to investigate more mapping strategies for a broader optimization.

Currently, the TVL has a 1:1 mapping of hardware-oblivious primitives to hardware-conscious implementations. As shown in our experiments, the CUDA native aggregation outperforms the vectorized approach. To improve the performance of our vectorized approach, an idiom-based mapping strategy could be helpful. By replacing a vectorized computation with semantically equivalent but more performant code at query compile-time, a speedup can possibly be achieved. For that, it is necessary to identify frequently used idioms that lose performance by vectorization and to replace them by idiomatic implementations.

5. RELATED WORK
To address the portability of code across heterogeneous computing units, OpenCL [18] and Intel's OneAPI [19] are general-purpose parallel programming language approaches. However, their genericity is a major drawback from a performance point of view. In contrast, Pirk et al. [20] presented a more database-specific approach called Voodoo to execute single-source query operators on different (co-)processors. Voodoo is a declarative intermediate algebra that abstracts the detailed architectural properties of the hardware without losing the ability to generate highly tuned code. The proposed algebra consists of a collection of declarative and vector-oriented operations. Operators described in this algebra are compiled to OpenCL. The drawbacks of this approach are: (i) operators have to be described with a new algebra, (ii) a specialized compiler is required, and (iii) the overhead of OpenCL. Another abstraction concept was proposed by Heimel et al. [21]. They developed a hardware-oblivious parallel library for query operators, so that these operators can be mapped to a variety of parallel processing architectures like many-core CPUs or GPUs. However, the approach is mainly based on OpenCL and they do not support SIMD on CPUs.

In a recent work, Shanbhag et al. [15] introduced a new processing model for an efficient query processing on GPUs called the tile-based execution model. This processing model extends the SIMD-based processing on CPUs, where each thread processes a vector at a time, to GPUs. Based on that SIMD extension concept, they introduced a CUDA-like library called Crystal consisting of data processing primitives that can be composed in order to implement queries on the GPU. Thus, this approach has a lot in common with our idea, but it is limited to GPUs. It would be interesting to implement our TVL hardware-conscious plug-in for GPUs using the Crystal library.

6. SUMMARY
In this paper, we evaluated the integration of GPUs in our SIMD abstraction layer TVL by interpreting GPUs as virtual vector engines. By conducting a number of experiments, we have shown that our approach is promising. While the observed throughput does not outperform native CUDA implementations, our approach allows developers to use fast CUDA-based GPU primitives without requiring knowledge of GPU implementations. Our vectorized approach still achieves reasonable performance that is not an order of magnitude slower than native CUDA implementations. However, tuning the primitives and operators by choosing the right virtual vector size and configuration is critical for achieving good performance.

Acknowledgments
This work was partly funded by the German Research Foundation (DFG) within the RTG 1907 (RoSI).
7. REFERENCES
[1] D. Abadi, P. A. Boncz, S. Harizopoulos, S. Idreos, and S. Madden, "The design and implementation of modern column-oriented database systems," Found. Trends Databases, vol. 5, no. 3, pp. 197–280, 2013.
[2] H. Funke, S. Breß, S. Noll, V. Markl, and J. Teubner, "Pipelined query processing in coprocessor environments," in SIGMOD, 2018, pp. 1603–1618.
[3] T. Karnagel, D. Habich, and W. Lehner, "Adaptive work placement for query processing on heterogeneous computing resources," Proc. VLDB Endow., vol. 10, no. 7, pp. 733–744, 2017.
[4] Y. Yuan, R. Lee, and X. Zhang, "The yin and yang of processing data warehousing queries on GPU devices," Proc. VLDB Endow., vol. 6, no. 10, pp. 817–828, 2013.
[5] H. Esmaeilzadeh, E. R. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," IEEE Micro, vol. 32, no. 3, pp. 122–134, 2012.
[6] A. Ungethüm, J. Pietrzyk, P. Damme, A. Krause, D. Habich, W. Lehner, and E. Focht, "Hardware-oblivious SIMD parallelism for in-memory column-stores," in CIDR, www.cidrdb.org, 2020.
[7] J. Zhou and K. A. Ross, "Implementing database operations using SIMD instructions," in SIGMOD, 2002, pp. 145–156.
[8] C. J. Hughes, Single-Instruction Multiple-Data Execution, ser. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2015.
[9] M. Xue, Q. Xing, C. Feng, F. Yu, and Z. Ma, "FPGA-accelerated hash join operation for relational databases," IEEE Trans. Circuits Syst. II Express Briefs, vol. 67-II, no. 10, pp. 1919–1923, 2020.
[10] S. Jha, B. He, M. Lu, X. Cheng, and H. P. Huynh, "Improving main memory hash joins on Intel Xeon Phi processors: An experimental approach," Proc. VLDB Endow., vol. 8, no. 6, pp. 642–653, 2015.
[11] J. Pietrzyk, D. Habich, P. Damme, E. Focht, and W. Lehner, "Evaluating the vector supercomputer SX-Aurora TSUBASA as a co-processor for in-memory database systems," Datenbank-Spektrum, vol. 19, no. 3, pp. 183–197, 2019.
[12] D. J. Abadi, P. A. Boncz, and S. Harizopoulos, "Column oriented database systems," Proc. VLDB Endow., vol. 2, no. 2, pp. 1664–1665, 2009.
[13] P. Damme, A. Ungethüm, J. Pietrzyk, A. Krause, D. Habich, and W. Lehner, "MorphStore: Analytical query engine with a holistic compression-enabled processing model," Proc. VLDB Endow., vol. 13, no. 11, pp. 2396–2410, 2020.
[14] O. Polychroniou and K. A. Ross, "VIP: A SIMD vectorized analytical query engine," VLDB J., vol. 29, no. 6, pp. 1243–1261, 2020.
[15] A. Shanbhag, S. Madden, and X. Yu, "A study of the fundamental performance characteristics of GPUs and CPUs for database analytics," in SIGMOD, 2020, pp. 1617–1632.
[16] Turing Tuning Guide, CUDA Toolkit Documentation, https://docs.nvidia.com/cuda/turing-tuning-guide/index.html.
[17] CUDA C Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[18] OpenCL, https://www.khronos.org/opencl/.
[19] OneAPI, https://www.oneapi.com.
[20] H. Pirk, O. R. Moll, M. Zaharia, and S. Madden, "Voodoo - A vector algebra for portable database performance on modern hardware," Proc. VLDB Endow., vol. 9, no. 14, pp. 1707–1718, 2016.
[21] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl, "Hardware-oblivious parallelism for in-memory column-stores," Proc. VLDB Endow., vol. 6, no. 9, pp. 709–720, 2013.