Towards Porting Hardware-Oblivious Vectorized Query Operators to GPUs

Johannes Fett, Annett Ungethüm, Dirk Habich, Wolfgang Lehner
Database Systems Group, Technische Universität Dresden, Dresden, Germany
{johannes.fett,annett.ungethuem,dirk.habich,wolfgang.lehner}@tu-dresden.de

32nd GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), September 01-03, 2021, Munich, Germany. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Nowadays, query processing in column-store database systems is highly tuned to the underlying (co-)processors. This approach works very well from a performance perspective, but has several shortcomings from a conceptual perspective. For example, this tuning introduces high implementation as well as maintenance cost and one implementation cannot be ported to other (co-)processors. To overcome that, we developed a column-store specific abstraction layer for hardware-driven vectorization based on the Single Instruction Multiple Data (SIMD) parallel paradigm. Thus, we are able to implement vectorized query operators in a hardware-oblivious manner, which can be specialized to different SIMD instruction set extensions of modern x86-processors. To soften the limitation to x86-processors, we describe in this paper our vision to integrate GPUs into our abstraction layer by interpreting GPUs as virtual vector engines. Moreover, we present some initial evaluation results to determine a reasonable virtual vector size. We conclude the paper with an outlook on our ongoing research in that direction.

1. INTRODUCTION
Analytical database queries typically access a small number of columns, but a high number of rows, and are thus most efficiently answered by column-store database systems [1]. Since the amount of data is still growing, these systems constantly adapt to novel hardware features to satisfy the requirements of high query throughput and low query latency [2, 3, 4].
From the hardware perspective, we see that Moore's Law is still valid and the transistors on a chip double about every two years [5]. Unfortunately, we also see an end of Dennard scaling, so that not all transistors can be active due to power constraints [5]. To deal with that, vectorization, parallelization, specialization, and heterogeneity are key approaches for hardware designers [5].

For this reason, vectorization based on the Single Instruction Multiple Data (SIMD) parallel paradigm has established itself as a core query optimization technique in column-store systems [1, 6, 7]. SIMD increases the single-thread performance by executing a single operation on multiple data elements in a vector register simultaneously (data parallelism) [8]. Such SIMD capabilities are common in today's mainstream x86-processors using specific SIMD instruction set extensions, whereas a current hardware trend is that these extensions are growing not only in terms of the complexity of the provided instructions but also in the size of the vector registers (number of data elements processed in parallel). To tackle the evolving SIMD-specific diversity, we developed a novel abstraction layer called Template Vector Library (TVL) for in-memory column-stores [6]. Using that TVL, we are able to implement hardware-oblivious vectorized query operators, which can be specialized to different SIMD instruction set extensions at query compile-time [6].

Besides vectorization, hardware also shifts from homogeneous x86-processors towards heterogeneous systems with different computing units (CUs) [5]. In this context, there is already a huge number of research works that deal with the use of different CUs such as GPUs or FPGAs for an efficient analytical query processing [2, 3, 4, 9, 10, 11]. In general, these works have shown the great potential, but all these approaches have a common shortcoming: for each CU, a separate hardware-conscious and hand-tuned code base for query operators has to be implemented and maintained using different programming concepts. From a performance perspective, this approach works very well, but the efforts for implementation and maintenance often outweigh the benefits. To overcome that, our overall vision is to enhance our SIMD abstraction layer TVL to cover different heterogeneous CUs as well.

Our Contribution and Outline. In this paper, we describe our vision and present some initial steps to integrate GPUs into our abstraction layer by interpreting GPUs as virtual vector engines. Generally, GPUs use a Single Instruction Multiple Threads (SIMT) execution model, which can also be interpreted as SIMD combined with multi-threading. Based on that, GPUs seem like a perfect match for our TVL to soften the limitation to x86-processors. Thus, our contributions in this paper are the following:

1. We start with an introduction to our SIMD abstraction layer TVL as well as an architectural description of NVIDIA GPUs in Section 2.
2. Then, we present our general idea of SIMDization of GPUs in Section 3. To interpret a GPU as a virtual vector engine, we have to determine a reasonable virtual vector size, which can be most efficiently processed in parallel. The vector size is important because it is an essential part of SIMD and an integral part of our TVL. For that, we present some TVL-oriented experiments. Based on the results, we derive a virtual vector size.
3. In Section 4, we summarize our lessons learned and describe our ongoing activities in that direction.
4. We close the paper with related work in Section 5 and a short summary in Section 6.

Figure 1: Architectures of the Template Vector Library (TVL) (a) and of NVIDIA GPUs (b).

2. BACKGROUND
In this section, we introduce our Template Vector Library as a SIMD abstraction layer for column-stores. Moreover, we briefly describe the GPU architecture and execution model.

2.1 Template Vector Library
Vectorization is a state-of-the-art query optimization technique in in-memory column-stores, because all recent x86-processors offer powerful SIMD extensions [1, 7, 12, 13, 14]. SIMD provides data parallelism by executing a single instruction on multiple data elements simultaneously [8]. To achieve the best performance, explicit vectorization using SIMD intrinsics is still the best way [6, 13, 14], whereas intrinsics are functions wrapping the underlying machine calls. However, these SIMD extensions are increasingly diverse in terms of (i) the number of available vector instructions, (ii) the vector length, and (iii) the granularity of the bit-level parallelism, i.e., on which data widths the vector instructions are executable [6]. To hide this heterogeneity, we developed a specific abstraction layer called Template Vector Library (TVL) for column-stores [6].

Our abstraction approach follows a separation-of-concerns concept as shown in Fig. 1(a). On the one hand, it offers hardware-oblivious but column-store specific primitives, which are similar to intrinsics. The primitives are derived from state-of-the-art vectorized columnar query operators. For a better organization, we grouped these primitives into seven self-descriptive classes like load/store (L/S) or an arithmetic class, including a unified interface per class [6]. On the other hand, our TVL is also responsible for mapping the provided hardware-oblivious primitives to different SIMD extensions. For this mapping, our TVL includes a plug-in concept and each plug-in has to provide a hardware-conscious implementation for all primitives.

From an implementation perspective, our abstraction concept is realized as a header-only library, where the hardware-oblivious primitives abstract from SIMD intrinsics. These primitives are generic functions representing a unified interface for all SIMD architectures. In addition to the primitives, we introduced generic data types:

base_t: The base type can be any scalar type.
vector_t: The vector type contains one or more values of the same base type.
mask_t: A mask is a scalar value, which is large enough to store one bit for each element in a vector.

Using the provided primitives and the data types, we can implement columnar query operators in a hardware-oblivious way. For the hardware-conscious mapping, we use template metaprogramming, requiring hardware-conscious implementations of all primitives for all underlying SIMD extensions. This function template specialization has to be implemented, whereby the implementation depends on the available functionality of the SIMD extension. In the best case, we can directly map a TVL primitive to a SIMD intrinsic. However, if the necessary SIMD intrinsic is not available, we are able to implement an efficient workaround in a hardware-conscious way. This implementation is independent of any query operator and must be done only once for a specific SIMD extension.
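To make this mapping more concrete, the following sketch shows how a hardware-oblivious primitive could be declared once and then mapped to one SIMD extension via function template specialization. It is an illustration only: the processing-style struct, its member names, and the primitive signature are simplified placeholders and not the actual TVL interface; only the mapping idea (one primitive, one intrinsic per extension) is taken from the text above.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Simplified processing style describing AVX2 with 64-bit unsigned integer elements.
struct avx2_uint64 {
    using base_t   = uint64_t;
    using vector_t = __m256i;
    static constexpr size_t vector_element_count = 4;  // 256 bit / 64 bit
};

// Hardware-oblivious primitive: declared once for all processing styles.
template< class ps >
typename ps::vector_t add( typename ps::vector_t a, typename ps::vector_t b );

// Hardware-conscious specialization: direct mapping to an AVX2 intrinsic.
template<>
avx2_uint64::vector_t add< avx2_uint64 >( avx2_uint64::vector_t a, avx2_uint64::vector_t b ) {
    return _mm256_add_epi64( a, b );
}

If an extension lacks a suitable intrinsic, the specialization would instead contain an efficient workaround, as described above; in both cases, it is written once per extension and reused by every operator.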
Figure 2 illustrates what a hardware-oblivious vectorized operator using our TVL looks like and how it can be called. We show a simple aggregation (summation) operator consisting of four TVL primitives:

set1: fills a vector register with a given value
load: loads multiple consecutive data elements into a vector register
add: executes an element-wise addition on the data elements in two vector registers
hadd: executes a horizontal addition on the data elements in one vector register

The aggregation operator assumes that the number of data elements is a multiple of the number of elements per vector and works as follows: One vector register called resVec is filled with zeros. Afterwards, the operator iterates over the input column in and loads a number of consecutive data elements into a second vector register called dataVec. Then, the data elements in both vector registers are added element-wise and the result is stored in resVec. When all elements of the input column have been processed, the horizontal aggregation hadd is carried out to determine the final sum result.

To specialize this hardware-oblivious operator implementation during query compile-time, we use three template parameters called processingStyle (ps): (i) the vector extension (e.g., SSE, AVX, NEON, or scalar), (ii) the vector size in bit, and (iii) the base data type with bit granularity (e.g., int8, int64, float). The definition of ps is shown in Figure 2, where it is used to call the aggregation operator.

// Aggregation operator definition.
template< class ps >  // processing style
base_t agg( const base_t* in, size_t elCount ) {
    // For simplicity, we assume that elCount is a multiple of the number of
    // data elements per vector register.
    const size_t vecCount = elCount / ps::vector_element_count;
    // Initialize running sum to zero.
    vector_t resVec = tvl::set1( 0 );
    // Add all input data elements to the running sum.
    for( size_t i = 0; i < vecCount; ++i ) {
        vector_t dataVec = tvl::load( in );
        resVec = tvl::add( resVec, dataVec );
        in += ps::vector_element_count;
    }
    // Calculate the final sum using horizontal summation of the vector elements.
    return tvl::hadd( resVec );
}

// Calling the operator.
using ps = tvl::avx2< tvl::v256< uint64_t > >;  // for example
size_t elemCount = 1024;
uint64_t* array = generate_data( elemCount );
uint64_t sum = agg< ps >( array, elemCount );

Figure 2: A simple sum-aggregation operator using the TVL [6].
2.2 GPUs
Graphics Processing Units (GPUs) are increasingly used for large-scale query processing in database systems [2, 3, 4, 15]. Specifically, their hardware parallelism and memory access bandwidths contribute to considerable speedups. Figure 1(b) depicts a simplified architecture of an NVIDIA GPU. Generally, a modern GPU consists of (i) a large global main memory with a memory bandwidth of up to 1.2 TB/s and (ii) a number of compute units called Streaming Multiprocessors (SMs). Each SM has a number of simplistic cores, a fixed set of registers, and shared memory. This shared memory serves as a scratchpad and can be accessed by all cores in the SM. Moreover, the GPU has an on-chip L2 cache, which is shared across all SMs, and optionally, each SM may have a local L1 cache. The number of SMs, the number of cores per SM, the size of the global memory, the size of the L2 cache, etc. vary across GPU products.

The execution model of GPUs is called Single Instruction Multiple Threads (SIMT), whereas SIMT is very similar to SIMD. While multiple data elements are processed by a single instruction in SIMD, multiple threads are processed by a single instruction in lock-step in SIMT. That means, each thread in SIMT executes the same instruction, but on different data. While a thread switch is very costly on CPUs, GPUs can handle thread switching with more ease. GPUs feature fast thread switching based on groups of 32 threads called warps. The warp scheduler issues instructions to warps available on an SM beyond the number of physical cores to hide latency [16]. Thus, it is encouraged to create more threads than there are physical cores. Then, this overload can be used to schedule threads for execution while others wait for a memory transfer. This is especially important since databases are more likely I/O-bound than CPU-bound, and it is thus one of the most important features for implementing query operators on GPUs [4].

For the GPU implementation, general-purpose parallel programming models such as CUDA [17] or OpenCL [18] have to be used, whereas CUDA generates better performance results. The CUDA programming model consists of two code parts: host code and GPU code. The host code runs in a CPU process and is responsible for setting up the environment, memory transfers between CPU and GPU, and executing kernels on the GPU. On the GPU, a function called kernel is executed in parallel by a number of threads and blocks. Blocks consist of a number of threads and are assigned to one streaming multiprocessor. Threads are assigned to a block. Thus, executing a kernel requires at least two parameters: the first parameter defines how many blocks will be spawned, the second one sets the number of threads per block. The total number of GPU threads is the product of both parameters. Depending on the data size and the number of GPU threads, the number of elements processed per thread can vary.
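As a minimal illustration of this execution model, the following sketch shows an element-wise addition kernel and its launch with the two configuration parameters mentioned above. It is an assumption-based example, not the paper's actual kernel: the kernel name, the grid-stride loop, and the chosen launch values are ours; the grid-stride loop merely reflects that the number of elements processed per thread varies with the configuration.

#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// GPU code: the kernel is executed in parallel by blocks * threadsPerBlock threads.
__global__ void add_kernel( const uint64_t* a, const uint64_t* b, uint64_t* out, size_t n ) {
    // Each thread starts at its global index and strides by the total number of threads,
    // so the number of elements per thread depends on n and the launch configuration.
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for( size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride ) {
        out[i] = a[i] + b[i];
    }
}

// Host code: the first launch parameter is the number of blocks,
// the second one the number of threads per block.
void launch_add( const uint64_t* d_a, const uint64_t* d_b, uint64_t* d_out, size_t n ) {
    const unsigned threadsPerBlock = 256;   // example configuration
    const unsigned blocks = 72;             // example configuration
    add_kernel<<< blocks, threadsPerBlock >>>( d_a, d_b, d_out, n );
    cudaDeviceSynchronize();
}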
3. GPU-SIMDIZATION EXPERIMENTS
Our overall vision is to fully support GPUs in our TVL as shown in Figure 1(a). For that, we have to create a GPU-specific hardware-conscious implementation of the hardware-oblivious interface. To achieve that, we interpret GPUs as virtual vector engines and call this interpretation SIMDization. For an optimal SIMD processing, we have to find a reasonable vector size. This vector size is virtual because we want to interpret regular arrays as virtual vector registers. For example, on Intel x86-processors, the different SIMD instruction set extensions have a vector length of either 128, 256, or 512 bit. If we assume 64 bit per data element, we can simultaneously process 2 elements in a vector register of size 128 bit. The wider the vector registers, the more data can be processed in parallel. We are not aware of any work that has ever determined a vector size for GPUs.

3.1 Vector Length Evaluation
To determine a reasonable virtual vector size for our SIMDization approach for GPUs, several experiments were executed. A first set of experiments was conducted on an NVIDIA Quadro RTX 8000 GPU. This GPU has a global main memory size of 48 GiB, a memory bandwidth of 672 GB/s, an L2 cache with a size of 6 MiB, and 72 SMs. Each SM has 64 cores, resulting in a total of 4,608 cores.

For the hardware-conscious implementation, we can distinguish three main groups of hardware-oblivious SIMD primitives across all TVL classes: (i) load/store primitives, (ii) element-wise primitives, and (iii) horizontal primitives. Element-wise primitives are characterized by the feature that they do not introduce dependencies between the elements of the same vector register, e.g., element-wise arithmetic, comparisons, or boolean logic. In contrast to that, horizontal primitives do not treat the elements of a vector independently. An example is the horizontal addition. As shown in our example sum-aggregation operator (cf. Figure 2), different primitives from all groups are used to implement a vectorized query operator. In this operator, we use an element-wise and a horizontal add primitive.

In our first experiments, we evaluated the element-wise and the horizontal addition on the GPU. For that, we implemented simple vectorized CUDA kernels and executed these kernels using different vector sizes. For the horizontal addition, we use the function cub::DeviceReduce from the CUDA SDK 11.2, the most efficient available implementation, as a foundation. Since our GPU has 4,608 cores and usually more threads than cores are used for good performance, we investigated vector sizes in the range from 256 KiB to 1 GiB. In terms of the number of elements, we evaluated the range from 2^13 to 2^27 64-bit data elements. For each vector size, different CUDA configurations with blocks and threads per block are possible, and we tested a large number of different configurations in a systematic way.

Element-wise Addition: This kernel was performed on two input columns A and B. Each column was a sequence of randomly generated unsigned 64-bit integer values with a size of 1 GiB. The result was written back into column B. Within a loop, our vectorized kernel is called until the whole columns have been processed. Figure 3 shows the best achieved throughput over all configurations for each vector size. As we can see, small vector sizes negatively affect the performance. For a vector size of 256 KiB, a throughput of 248.87 GiB/s is achieved. The best performance is gained by using a vector size of 1 GiB, which results in a throughput of 530.99 GiB/s. A vector size of 8 MiB is 3.8% slower than a vector size of 1 GiB. We conclude that higher vector sizes lead to higher throughput and, in the best case, the vector size corresponds to the column size.

Figure 3: Evaluation of element-wise primitive add.
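The loop-wise processing described above can be pictured as follows: the columns are interpreted as a sequence of virtual vector registers of a given number of elements, and the kernel is launched once per such chunk. This is a sketch under our own assumptions (it reuses the hypothetical add_kernel from the previous sketch and an arbitrary launch configuration); the paper's actual driver code is not shown here.

#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Process two columns chunk by chunk; each chunk acts as one virtual vector register.
// The result is written back into column B, as in the experiment.
void elementwise_add_columns( const uint64_t* d_a, uint64_t* d_b,
                              size_t colElems, size_t vecElems,
                              unsigned blocks, unsigned threadsPerBlock ) {
    for( size_t off = 0; off < colElems; off += vecElems ) {
        const size_t n = ( off + vecElems <= colElems ) ? vecElems : ( colElems - off );
        add_kernel<<< blocks, threadsPerBlock >>>( d_a + off, d_b + off, d_b + off, n );
    }
    cudaDeviceSynchronize();
}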
Horizontal Addition: This kernel adds all elements in a vector register together. The kernel is based on the cub::DeviceReduce function shipped as part of the CUDA SDK. For our evaluation, we generated different columns where the column size corresponds to the vector size. Again, the data elements are 64-bit unsigned integer values. Figure 4 shows the best achieved throughput over all configurations for each vector size. Similar to the element-wise addition, higher vector sizes lead to higher throughput.

Figure 4: Evaluation of horizontal primitive hadd.
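For reference, the following sketch shows how a horizontal add over one virtual vector register can be expressed with cub::DeviceReduce::Sum, which the evaluation above builds on. The two-phase call (first querying the temporary storage size, then running the reduction) follows the CUB API; the wrapper function, its name, and the buffer handling are our own illustrative assumptions.

#include <cstdint>
#include <cub/cub.cuh>

// Sum all elements of a virtual vector register residing in GPU memory.
uint64_t horizontal_add( const uint64_t* d_in, uint64_t* d_out, int numItems ) {
    void*  d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;
    // First call only determines the required temporary storage size.
    cub::DeviceReduce::Sum( d_temp_storage, temp_storage_bytes, d_in, d_out, numItems );
    cudaMalloc( &d_temp_storage, temp_storage_bytes );
    // Second call performs the actual reduction.
    cub::DeviceReduce::Sum( d_temp_storage, temp_storage_bytes, d_in, d_out, numItems );
    uint64_t sum = 0;
    cudaMemcpy( &sum, d_out, sizeof( uint64_t ), cudaMemcpyDeviceToHost );
    cudaFree( d_temp_storage );
    return sum;
}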
Aggregation Operator: Based on these evaluations, we could draw the conclusion that the best vector size should be the column size. To validate our hypothesis, we next evaluated the sum-aggregation operator consisting of both primitives. For this evaluation, we generated a single column with a size of 1 GiB and varied the vector size from 256 KiB to 1 GiB. As illustrated in Figure 5, we obtain a completely different result. The throughput increases up to a vector size of 2 MiB. Then, the throughput decreases and stabilizes at a low level. The main difference to our previous experiments is that a single vector register or array is now the main driver of the processing. As shown in Figure 2, one vector register resVec is filled with zeros at the beginning of the aggregation operator. Afterwards, we repeatedly load a chunk of the column into a second vector dataVec and conduct an element-wise add between both. The result is stored in the resVec vector. This resVec vector is frequently accessed and thus should be kept in cache.

Figure 5: Evaluation of Operator Aggregation.

To evaluate this cache-fitting in more detail, we slightly modified our experiment. We generated a new input column of size 1.5 GiB containing 64-bit unsigned integers and varied the vector size according to the on-chip L2 cache size: 1/4, 1/3, 1/2, 1, and 2 times the cache size. Figure 6 depicts the resulting throughputs. As we can see, we obtain the best performance when our vector size is a third of the L2 cache size. Larger vector sizes lead to a lower throughput.

Figure 6: Evaluation of cache size-fitting.
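Putting the two primitives together, the vectorized aggregation described above can be sketched as follows: a resVec array of the chosen virtual vector size is reused across all chunks (and should therefore stay in the L2 cache), each chunk of the input column plays the role of dataVec, and a final horizontal add produces the sum. This is our reading of the text, expressed with the hypothetical add_kernel and horizontal_add helpers from the earlier sketches; it is not the paper's implementation and it assumes the column size is a multiple of the vector size.

#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

uint64_t vectorized_agg( const uint64_t* d_in, size_t colElems, size_t vecElems,
                         unsigned blocks, unsigned threadsPerBlock ) {
    // resVec: one virtual vector register, filled with zeros (set1(0)).
    uint64_t* d_resVec = nullptr;
    cudaMalloc( &d_resVec, vecElems * sizeof( uint64_t ) );
    cudaMemset( d_resVec, 0, vecElems * sizeof( uint64_t ) );
    // Element-wise add of every chunk (dataVec) onto resVec.
    for( size_t off = 0; off < colElems; off += vecElems ) {
        add_kernel<<< blocks, threadsPerBlock >>>( d_resVec, d_in + off, d_resVec, vecElems );
    }
    // hadd: horizontal summation over the virtual vector register.
    uint64_t* d_sum = nullptr;
    cudaMalloc( &d_sum, sizeof( uint64_t ) );
    const uint64_t sum = horizontal_add( d_resVec, d_sum, (int)vecElems );
    cudaFree( d_sum );
    cudaFree( d_resVec );
    return sum;
}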
3.2 Comparing with Native CUDA
As shown above, we are able to determine a reasonable virtual vector size providing the best performance for our vectorized aggregation on the GPU. In this section, we present evaluation results comparing the vectorized aggregation with the native CUDA aggregation. For that, we generated various columns (sequences of 64-bit integer values) with increasing sizes from 2 MiB to 8 GiB. For the vectorized aggregation, we applied the best performing vector size of 2 MiB in all experiments. For the native CUDA aggregation, we used the function cub::DeviceReduce from the CUDA SDK, as already done for our horizontal aggregation.

The results are depicted in Figure 7. The throughput of the native CUDA aggregation is slightly higher than that of our vectorized aggregation. This is especially true for large columns, which is not particularly surprising. Nevertheless, the result is promising enough to specialize our hardware-oblivious vectorized query operators to GPUs and to obtain a reasonably good performance. We hope to increase the throughput of our vectorized execution with additional GPU-specific optimization techniques as discussed in Section 4.

Figure 7: Comparing vectorized versus CUDA native aggregation.

3.3 Validation
We also executed all our experiments on a second NVIDIA GPU, namely an NVIDIA GTX 1070 Ti. This GPU provides a global main memory size of 8 GiB, a memory bandwidth of 256 GB/s, an L2 cache of 2 MB size, and 19 SMs. Each SM has 128 cores, resulting in a total of 2,432 cores. In general, we observed a behavior similar to that of the NVIDIA Quadro RTX 8000. In particular, we executed the cache-fitting experiment to determine the best virtual vector size for the aggregation operator. In contrast to the RTX 8000 GPU, we obtain the best performance when our vector size is a fourth of the L2 cache size (512 KiB), as shown in Figure 8.

Figure 8: Validation on the NVIDIA GTX 1070 Ti.

4. FUTURE WORK
Based on the previous section, we conclude that SIMD processing using a virtual vector model is generally possible on a GPU. Choosing the right vector size and configuration is critical to achieve a good performance. Sub-optimal configurations reduce the performance by more than one order of magnitude. Overall, we obtain a good performance, which is, however, slower than native CUDA implementations. For the aggregation operator using the ideal vector size, our approach is 37.4% slower than a native CUDA approach in the worst case (8 GiB data size) and 10.8% slower in the best case (8 MiB data size). By validating our approach on a different GPU, we have also shown that our conclusions are more generally applicable beyond a single GPU model. Moreover, our hardware-oblivious query operators can be specialized to SIMD extensions as well as to GPUs in a unified way.

To summarize, our results are promising and our ongoing research in that direction will focus on the following aspects:

(1) Virtual Vector Size: To extend our work, we plan to evaluate other query operators. We seek to explore whether optimal vector sizes and configurations depend on the query operators.

(2) Implementation: Completing the GPU TVL requires implementing all primitives. To optimize the hardware-conscious implementations, we want to explore the usage of shared memory, registers, and persistent caching.

(3) Optimization: Besides an optimal hardware-conscious implementation for GPUs, we want to investigate more mapping strategies for a broader optimization.

Currently, the TVL has a 1:1 mapping of hardware-oblivious primitives to hardware-conscious implementations. As shown in our experiments, the CUDA native aggregation outperforms the vectorized approach. To improve the performance of our vectorized approach, an idiom-based mapping strategy could be helpful. By replacing a vectorized computation with semantically equivalent but more performant code at query compile-time, a speedup can possibly be achieved. For that, it is necessary to identify frequently used idioms that lose performance by vectorization and to replace them by idiomatic implementations.

5. RELATED WORK
To address the portability of code across heterogeneous computing units, OpenCL [18] and Intel's OneAPI [19] are general-purpose parallel programming language approaches. However, their genericity is a major drawback from a performance point of view. In contrast, Pirk et al. [20] presented a more database-specific approach called Voodoo to execute single-source query operators on different (co-)processors. Voodoo is a declarative intermediate algebra that abstracts the detailed architectural properties of the hardware without losing the ability to generate highly tuned code. The proposed algebra consists of a collection of declarative and vector-oriented operations. Operators described in this algebra are compiled to OpenCL. The drawbacks of this approach are: (i) operators have to be described with a new algebra, (ii) a specialized compiler is required, and (iii) the overhead of OpenCL. Another abstraction concept was proposed by Heimel et al. [21]. They developed a hardware-oblivious parallel library for query operators, so that these operators can be mapped to a variety of parallel processing architectures like many-core CPUs or GPUs. However, the approach is mainly based on OpenCL and they do not support SIMD on CPUs.

In a recent work, Shanbhag et al. [15] introduced a new processing model for an efficient query processing on GPUs called the tile-based execution model. This processing model extends the SIMD-based processing on CPUs, where each thread processes a vector at a time, to GPUs. Based on that SIMD extension concept, they introduced a CUDA-like library called Crystal consisting of data processing primitives that can be composed in order to implement queries on the GPU. Thus, this approach has a lot in common with our idea, but it is limited to GPUs. It would be interesting to implement our TVL hardware-conscious plug-in for GPUs using the Crystal library.

6. SUMMARY
In this paper, we evaluated the integration of GPUs in our SIMD abstraction layer TVL by interpreting GPUs as virtual vector engines. By conducting a number of experiments, we have shown that our approach is promising. While the observed throughput does not outperform native CUDA implementations, our approach allows developers to use fast CUDA-based GPU primitives without requiring knowledge of GPU implementations. Our vectorized approach still achieves reasonable performance that is not an order of magnitude slower than native CUDA implementations. However, tuning the primitives and operators by choosing the right virtual vector size and configuration is critical for achieving good performance.

Acknowledgments
This work was partly funded by the German Research Foundation (DFG) within the RTG 1907 (RoSI).
7. REFERENCES
[1] D. Abadi, P. A. Boncz, S. Harizopoulos, S. Idreos, and S. Madden, "The design and implementation of modern column-oriented database systems," Found. Trends Databases, vol. 5, no. 3, pp. 197–280, 2013.
[2] H. Funke, S. Breß, S. Noll, V. Markl, and J. Teubner, "Pipelined query processing in coprocessor environments," in SIGMOD, 2018, pp. 1603–1618.
[3] T. Karnagel, D. Habich, and W. Lehner, "Adaptive work placement for query processing on heterogeneous computing resources," Proc. VLDB Endow., vol. 10, no. 7, pp. 733–744, 2017.
[4] Y. Yuan, R. Lee, and X. Zhang, "The yin and yang of processing data warehousing queries on GPU devices," Proc. VLDB Endow., vol. 6, no. 10, pp. 817–828, 2013.
[5] H. Esmaeilzadeh, E. R. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," IEEE Micro, vol. 32, no. 3, pp. 122–134, 2012.
[6] A. Ungethüm, J. Pietrzyk, P. Damme, A. Krause, D. Habich, W. Lehner, and E. Focht, "Hardware-oblivious SIMD parallelism for in-memory column-stores," in CIDR, www.cidrdb.org, 2020.
[7] J. Zhou and K. A. Ross, "Implementing database operations using SIMD instructions," in SIGMOD, 2002, pp. 145–156.
[8] C. J. Hughes, Single-Instruction Multiple-Data Execution, ser. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2015.
[9] M. Xue, Q. Xing, C. Feng, F. Yu, and Z. Ma, "FPGA-accelerated hash join operation for relational databases," IEEE Trans. Circuits Syst. II Express Briefs, vol. 67-II, no. 10, pp. 1919–1923, 2020.
[10] S. Jha, B. He, M. Lu, X. Cheng, and H. P. Huynh, "Improving main memory hash joins on Intel Xeon Phi processors: An experimental approach," Proc. VLDB Endow., vol. 8, no. 6, pp. 642–653, 2015.
[11] J. Pietrzyk, D. Habich, P. Damme, E. Focht, and W. Lehner, "Evaluating the vector supercomputer SX-Aurora TSUBASA as a co-processor for in-memory database systems," Datenbank-Spektrum, vol. 19, no. 3, pp. 183–197, 2019.
[12] D. J. Abadi, P. A. Boncz, and S. Harizopoulos, "Column oriented database systems," Proc. VLDB Endow., vol. 2, no. 2, pp. 1664–1665, 2009.
[13] P. Damme, A. Ungethüm, J. Pietrzyk, A. Krause, D. Habich, and W. Lehner, "MorphStore: Analytical query engine with a holistic compression-enabled processing model," Proc. VLDB Endow., vol. 13, no. 11, pp. 2396–2410, 2020.
[14] O. Polychroniou and K. A. Ross, "VIP: A SIMD vectorized analytical query engine," VLDB J., vol. 29, no. 6, pp. 1243–1261, 2020.
[15] A. Shanbhag, S. Madden, and X. Yu, "A study of the fundamental performance characteristics of GPUs and CPUs for database analytics," in SIGMOD, 2020, pp. 1617–1632.
[16] Turing Tuning Guide, CUDA Toolkit Documentation, https://docs.nvidia.com/cuda/turing-tuning-guide/index.html.
[17] CUDA C Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[18] OpenCL, https://www.khronos.org/opencl/.
[19] OneAPI, https://www.oneapi.com.
[20] H. Pirk, O. R. Moll, M. Zaharia, and S. Madden, "Voodoo - A vector algebra for portable database performance on modern hardware," Proc. VLDB Endow., vol. 9, no. 14, pp. 1707–1718, 2016.
[21] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl, "Hardware-oblivious parallelism for in-memory column-stores," Proc. VLDB Endow., vol. 6, no. 9, pp. 709–720, 2013.