<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating SIMD Compiler-Intrinsics for Database Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lawrence Benson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Ebeling</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tilmann Rabl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hasso Plattner Institute, University of Potsdam</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Modern query engines often use SIMD instructions to speed up query performance. As these instructions are heavily CPU-specific, developers must write multiple variants of the same code to support multiple target platforms such as AVX2, AVX512, and ARM NEON. This process leads to logical code duplication, which is cumbersome, hard to test, and hard to benchmark. In this paper, we make the case for writing less platform-specific SIMD code by leveraging the compiler's own platform-independent SIMD vector abstraction. This allows developers to write a single code variant for all platforms as with a SIMD library, without the library's redundant layers of abstraction. Clang and GCC implement the platforms' SIMD intrinsics on top of their own abstraction, so code written in it is optimized for the underlying vector instructions by the compiler. We conduct four database operation microbenchmarks based on code in real systems on x86 and ARM and show that compiler-intrinsic variants achieve the same or even better performance than platform-intrinsics in most cases. In addition, we completely replace the SIMD library in the state-of-the-art query engine Velox with compiler-intrinsics. Our results show that query engines can achieve the same performance with platform-independent code while requiring significantly less SIMD code and fewer variants.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Numerous databases and query processing engines make use of vector instructions (SIMD = single instruction, multiple data) to speed up query processing. By performing the same operation on multiple data items at once, SIMD instructions increase the performance of these systems significantly [1, 2, 3]. However, SIMD instructions are CPU-specific and come with various register widths and capabilities. Most modern x86 servers support SSE, AVX, and AVX2, but while a wide range of Intel servers also support AVX-512, only the newest AMD servers support it. With the rise of ARM in the cloud, query engines now also target NEON vector instructions and will soon likely target the new scalable vector extension (SVE), which is available in AWS' latest ARM servers. Supporting all of these instruction sets requires many set-specific code variants, resulting in significant logical code duplication and checks to select the correct version. Optimizing performance-critical SIMD code in this setup is cumbersome, hard to benchmark, and hard to test.</p>
      <p>Various SIMD libraries exist to reduce the complexity of supporting multiple vector instruction sets [4, 5, 6]. These libraries often provide a thin abstraction on top of the platform-specific SIMD intrinsics (platform-intrinsics). However, the explicit use of intrinsics directly or via a library can hinder compiler optimizations, as we show in our benchmarks. If an operation cannot be expressed using the library's abstractions, systems still require multiple platform-specific methods to handle special cases and optimizations. This is the case, e.g., in the query engine Velox [7], which uses a SIMD library to structure the vectorized code, but still contains dozens of platform-specific functions for AVX2 and NEON that implement the same logic but for different types and sizes.</p>
      <p>In this paper, we make the case for writing less platform-specific SIMD code in databases while also removing redundant layers of abstraction. We show the redundant layers of abstraction by inspecting how the x86 and NEON platform-intrinsics that add two 128-bit vectors of four 32-bit integers (_mm_add_epi32 and vaddq_s32) are implemented in compilers and wrapped by SIMD libraries. Both Clang and GCC provide platform-independent vector intrinsics (compiler-intrinsics) [8, 9], based on C/C++'s primitive types and compiler attributes. The vector-addition platform-intrinsics are implemented on top of these compiler-intrinsics, as outlined below.
// Simplified from Clang's &lt;emmintrin.h&gt; header.
// Public 16-byte __m128i type in x86 SIMD API.
typedef long __m128i __attribute__((vector_size(16)));
// Internal 16-byte vector of four integers.
typedef int __v4su __attribute__((vector_size(16)));
__m128i _mm_add_epi32(__m128i __a, __m128i __b) {
  return (__m128i)((__v4su)__a + (__v4su)__b);
}

// Simplified from Clang's &lt;arm_neon.h&gt; header.
// int32x4_t is defined analogously to __v4su.
int32x4_t vaddq_s32(int32x4_t __p0, int32x4_t __p1) {
  int32x4_t __ret;
  __ret = __p0 + __p1;
  return __ret;
}</p>
      <p>We see that x86's 16-byte SIMD type __m128i is defined using Clang's vector type via vector_size. The internal 16-byte __v4su vector of four integers is defined in the same way, and the addition itself is expressed with the compiler's + operator on these vector types; the NEON intrinsic vaddq_s32 likewise adds its operands with the + operator. SIMD libraries, in turn, wrap these platform-intrinsics behind their own vector abstraction, as sketched below.
1 template &lt;typename T&gt; struct vec {
2   vec&lt;T&gt; operator+(vec&lt;T&gt; other);
3 }
4 #if __x86_64__ // x86 platform
5 vec&lt;T&gt; vec&lt;T&gt;::operator+(vec&lt;T&gt; other) {
6   return _mm_add_epi32(data, other.data);
7 }
8 #elif __aarch64__ // ARM NEON platform
9 vec&lt;T&gt; vec&lt;T&gt;::operator+(vec&lt;T&gt; other) {
10   return vaddq_s32(data, other.data);
11 }
12 #else ...</p>
      <sec id="sec-1-1">
        <title>Based on a generic vector type (Lines 1–3), libraries</title>
        <p>implement C++ operators. In Lines 6 and 10, we show
how the platform-intrinsics are wrapped depending on
the used platform (x86 or NEON) to provide the library’s
C++ type abstraction and platform-independence. SIMD
libraries are abstractions on top of platform-intrinsics,
which in turn are abstractions on top of
compiler-intrinsics. Instead of an additional API layer around specific
functions, we argue that compiler-intrinsics can be used
directly to structure SIMD code while also providing a
wide range of operations common across SIMD
instruction sets. For our vector-addition code example,
compilerintrinsics could be used as follows.
1 template &lt;typename T&gt; // Templated 16-byte vector.
2 using vec __attribute__((vector_size(16))) = T;
3
4 // Code structured with compiler-intrinsics.
5 vec&lt;T&gt; foo(vec&lt;T&gt; a, vec&lt;T&gt; b) {
6 vec&lt;T&gt; result = a + b; // Use standard + operator.
7 // ...
8 return result;
9 }</p>
        <p>Compiler-intrinsics allow developers to express vector
code once instead of having to implement it  times for
 platforms. Developers can leverage them to write a
single templatable, platform-independent version while
still benefiting from the platform’s vector execution
capabilities. By removing platform-intrinsics, developing and
testing also becomes easier because the same code can
be run on diefrent machines, regardless of whether they
support specific instructions or not. Any operation
deifned on a compiler-vector is guaranteed to be compiled
correctly. As platform-intrinsics are implemented on top
of compiler-intrinsics, both can be used interchangeably
in case specific instructions are required. We argue that
developers should express the logical vector operations
that they need and let the compiler determine the correct
1) We compare the performance of compiler-intrinsics
to scalar, auto-vectorized, and platform-specific SIMD
code in four database operation microbenchmarks.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2) We replace all platform-specific SIMD code in the</title>
        <p>query engine Velox with compiler-intrinsics and show
the impact on end-to-end TPC-H workloads.</p>
      </sec>
      <sec id="sec-1-3">
        <title>3) Based on our results, we make the case for platformindependent SIMD code in databases and discuss open challenges.</title>
      </sec>
      <sec id="sec-1-4">
        <title>In Section 2, we discuss vectorized processing concepts,</title>
        <p>before introducing compiler-intrinsics in Section 3. In
Sections 4 and 5, we present our benchmark results on
various machines, compiled with Clang and GCC. We
discuss our findings in Section 6. We conclude our work
in Section 8 after presenting related work in Section 7.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Vectorized Processing</title>
      <sec id="sec-2-1">
        <p>In this section, we give a short overview of SIMD operations commonly used in databases. We then discuss the usage of SIMD instructions via intrinsics in C++ code. Finally, we briefly lay out auto-vectorization efforts in modern compilers.</p>
        <p>SIMD. When using SIMD operations, the program
code operates on vectors of elements and operations
are applied to all elements of a vector. For example,
binary operations such as addition are applied pairwise to
the elements of the vectors. Since multiple vector
elements are processed in a single operation, this yields a
higher throughput in comparison to scalar operations.</p>
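        <p>As a minimal sketch (our own illustration, using the GCC/Clang vector attribute discussed later in this paper), the following adds four 32-bit integers pairwise in a single logical operation:
#include &lt;cstdint&gt;
// Four 32-bit lanes in one 16-byte vector value.
using Vec4i __attribute__((vector_size(16))) = int32_t;
Vec4i add_pairwise(Vec4i a, Vec4i b) {
  return a + b; // Applied element-wise to all four lanes.
}</p>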
        <p>Databases often utilize this to speed up computations.</p>
        <p>In some cases, specialized data structures or layouts can help to utilize SIMD operations, e.g., columnar data storage or hash table buckets [10].</p>
        <p>template &lt;typename InT&gt;
__m128i x86_half(__m128i data);
template &lt;&gt; __m128i
x86_half&lt;uint32_t&gt;(__m128i data) {
  return _mm_cvtepu32_epi64(data);
}
template &lt;&gt; __m128i
x86_half&lt;int32_t&gt;(__m128i data) {
  return _mm_cvtepi32_epi64(data);
}
(a) x86 C++

#define VEC_SIZE(n) \
  __attribute__((vector_size(n)))
template &lt;typename T&gt;
using Vec VEC_SIZE(16) = T;
template &lt;typename T&gt;
using HalfVec VEC_SIZE(8) = T;
template &lt;typename Out, typename In&gt;
Vec&lt;Out&gt; vec_half(Vec&lt;In&gt; data) {
  return __builtin_convertvector(
    (HalfVec&lt;In&gt;&amp;)data, Vec&lt;Out&gt;);
}
(b) compiler-intrinsics C++

uint64x2_t
neon_half(uint32x4_t data) {
  return vmovl_u32(
    vget_low_u32(data));
}
int64x2_t
neon_half(int32x4_t data) {
  return vmovl_s32(
    vget_low_s32(data));
}
(c) NEON C++

vec/x86_half&lt;unsigned int, unsigned long&gt;(long long __vector(2)):
  ; Note the z for zero-extend
  pmovzxdq xmm0, xmm0
  ret
vec/x86_half&lt;int, long&gt;(long long __vector(2)):
  ; Note the s for sign-extend
  pmovsxdq xmm0, xmm0
  ret
(d) x86 assembly

vec/neon_half(__Uint32x4_t):
  ; Note the u for zero-extend
  ushll v0.2d, v0.2s, #0
  ret
vec/neon_half(__Int32x4_t):
  ; Note the s for sign-extend
  sshll v0.2d, v0.2s, #0
  ret
(e) NEON assembly

Listing 1: Code examples for x86, compiler-intrinsics, and NEON to extract the lower two 32-bit values from a 128-bit register and either sign- or zero-extend them to two 64-bit values in a 128-bit output register. Compiler-intrinsics produce the same assembly while using type- and platform-independent code (https://godbolt.org/z/4aoxEv14b). We use godbolt.org to support our claims with generated assembly.</p>
        <p>In hardware, a vector is stored as contiguous bits in a register, so element boundaries inside the vector depend on the operation. This allows for scenarios where input data is first shuffled as 8-bit values and then treated as 32-bit integers in the next operation. Common operations in database contexts are load and store, arithmetic, comparison, shuffling, gather/scatter, and widening/narrowing. For example, a column scan could load a vector of attribute values, compare them according to a filter predicate, and extract the indices of matching elements.</p>
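        <p>To illustrate this reinterpretation, the following sketch (our own example, using the compiler vector types discussed in Section 3) treats the same 16 bytes either as sixteen 8-bit lanes or as four 32-bit lanes:
#include &lt;cstdint&gt;
#include &lt;cstring&gt;
using Vec16u8 __attribute__((vector_size(16))) = uint8_t;
using Vec4u32 __attribute__((vector_size(16))) = uint32_t;

// Bit-preserving reinterpretation of the same 128-bit value.
Vec4u32 as_u32_lanes(Vec16u8 bytes) {
  Vec4u32 out;
  std::memcpy(&amp;out, &amp;bytes, sizeof(out));
  return out;
}</p>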
        <p>Platform-Intrinsics. There are multiple extensions to the x86 instruction set that add vector operations: SSE (Streaming SIMD Extensions) with four major versions introducing 128-bit registers, followed by AVX (Advanced Vector Extensions) with 256-bit registers, AVX2, and most recently AVX-512 with 512-bit registers. AVX-512 has a modular design based on a foundation specification and various sub-extensions adding new instructions. For all these extensions, the specification defines a C API with mnemonic function names (intrinsics) that allow using the vector instructions in higher-level languages. The Intel intrinsics documentation currently lists 6251 intrinsics from the SSE and AVX instruction families [11]. On ARM, the NEON vector extensions allow vector operations on 128-bit registers. Similar to x86, they define a C API. The vector types include the element width for integer types (e.g., uint32x4_t), but due to missing overloading in C, functions still encode the element type in their name, e.g., vaddvq_u32 and vaddvq_s32. The ARM intrinsics guide lists 4344 NEON intrinsics [12].</p>
        <p>Auto-Vectorization. Apart from designated vector operations in the source code, major compilers such as GCC, Clang, and ICX also perform auto-vectorization as an optimization step. They attempt to detect patterns in scalar code that can be vectorized and replace the code with a vectorized version [13]. Auto-vectorization may fail, typically either because the pattern is not supported or because some prerequisite for vectorization cannot be satisfied. There are commonly supported and documented approaches to circumvent these problems and help with auto-vectorization [14].</p>
      </sec>
      <sec id="sec-2-2">
        <title>3. Compiler-Intrinsics</title>
        <p>A disadvantage of platform-intrinsics is that they encode the type in the function name, as C does not support function overloading. This makes it hard to generically implement SIMD operations, as developers must use different functions for, e.g., int and long. Additionally, using platform-intrinsics does not guarantee that the corresponding instruction is actually chosen by the compiler. For example, Clang further merges instructions and performs constant propagation, even when using explicit x86 intrinsics (https://godbolt.org/z/M8zaqzj3r).</p>
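        <p>As a small sketch of the difference (our own illustration, mirroring the templated Vec alias used below): with platform-intrinsics, 32-bit and 64-bit additions require the distinct functions _mm_add_epi32 and _mm_add_epi64, whereas a compiler-vector version is written once and instantiated per type.
template &lt;typename T&gt;
using Vec __attribute__((vector_size(16))) = T;

// One definition covers int32_t, int64_t, uint32_t, ...;
// the compiler picks the matching add instruction per element type.
template &lt;typename T&gt;
Vec&lt;T&gt; add(Vec&lt;T&gt; a, Vec&lt;T&gt; b) {
  return a + b;
}</p>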
        <p>GCC and Clang provide vector extensions that allow declaring a vector type of n elements of a language base type, e.g., int [8, 9]. A code example that extracts the lower two 32-bit integers from a 128-bit register and sign- or zero-extends them to two 64-bit values is shown in Listing 1. For both x86 and NEON, we have to define one function per type, as the intrinsics differ for signed and unsigned integers. Overall, we end up with four distinct implementations. With compiler-intrinsics, we use templating and the compiler's ability to choose appropriate instructions for each platform. This allows for a single implementation that can be used on both platforms.</p>
        <p>Vectors are defined using GCC-style __attribute__(), specifying the size in bytes and the element's type. In this example, the vectors are 16 bytes for Vec and the template type T, i.e., a vector of 16/sizeof(T) elements. The input vector is cast to a HalfVec to extract the lower 2 values. We use the compiler's builtin convert method to widen the values to the requested output type, e.g., (u)int64_t in our example (specified in the OutT template argument). In Listing 1 d) and e), we see that the compiler-intrinsics produce the same assembly as the platform-intrinsics, without type- or platform-dependent code.</p>
        <p>In Listing 2, we show a more complete code example from our filtering scan benchmark (Section 4.5). This code scans an integer column, filters by less-than some value, and writes matching row ids to an output array.
1  // 16-Byte vector of 4x uint32_t.
2  using Vec __attribute__((vector_size(16))) = uint32_t;
3  // Same as Vec but without 16-byte alignment.
4  using UnalignedVec __attribute__((aligned(1))) = Vec;
5  // Vector of 4 bools (only available in LLVM).
6  using BitVec __attribute__((ext_vector_type(4))) = bool;
7
8  // Scan integer column and write matching row ids.
9  uint32_t dense_column_scan(uint32_t* column,
10                            uint32_t filter_val,
11                            uint32_t* __restrict out) {
12   uint32_t num_matches = 0;
13   for (uint32_t row = 0; row &lt; NUM_ROWS; row += 4) {
14     // Load and compare row.
15     Vec values = *(Vec*)(column + row);
16     Vec matches = values &lt; filter_val;
17
18     // Convert comparison result to a bitmask.
19     BitVec bitvec = __builtin_convertvector(matches, BitVec);
20
21     // Upper bits may contain random data
22     uint8_t bitmask = ((uint8_t&amp;) bitvec) &amp; 0xf;
23
24     // Get row offsets using lookup table.
25     Vec row_offsets = *(Vec*)
26         MATCHES_TO_ROW_OFFSETS[bitmask];
27
28     Vec compressed_rows = row + row_offsets;
29
30     // Write matching rows to output.
31     auto* out_vec = (UnalignedVec*)(out + num_matches);
32     *out_vec = compressed_rows;
33     num_matches += std::popcount(bitmask);
34   }
35   return num_matches;
36 }
Listing 2: Compiler-intrinsics code example for our filtering integer column scan. Evaluated in Section 4.5 as vec-add.</p>
        <p>In Line 2, we first define a vector of four 32-bit integers. As vectors assume alignment equal to their size, we specify an unaligned version in Line 4, which we use for unaligned stores. Clang supports special Boolean vectors, where each entry is only 1 bit (Line 6). Vectors are loaded and stored using casts (Line 15). Operations are expressed using common C++ operators, such as arithmetic or comparison operators (+, &lt;, ==, ...), on these types. Operators also support scalar operands, e.g., &lt; filter_val in Line 16. The ternary operator can be used for element-wise selection. For some complex operations that have no matching C++ operator, Clang and GCC provide helper functions, e.g., __builtin_convertvector() for narrowing or widening (Line 19). We cast the Boolean vector to a scalar bitmask and mask off the high bits (Line 22). We then use that bitmask to get row offsets from a lookup table and add them to the base row id (Lines 25–28). Finally, we use a cast to perform an unaligned store of the matching row ids (Lines 31–33).</p>
        <p>The compiler lowers these operations to the target platform. The goal of these compiler-intrinsics is to allow for platform-independent source code that performs well on all target platforms. Note that with minor templating, we could extend this method to support arbitrary vector sizes while maintaining only a single implementation. Depending on the given size, the compiler would generate the appropriate instructions for us. If developers want to use explicit platform-intrinsics, Clang allows mixing them with compiler-vector types. Operations that are not natively supported on the target platform can still be expressed in a canonical fashion. For example, up to AVX2, there is no intrinsic for the multiplication of vectors with 64-bit integer elements. Here, programmers need to fall back to multiplying and combining 32-bit halves of the input numbers. With compiler-intrinsics, programmers can use the C++ multiplication operator and the compiler inserts the required logic.</p>
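        <p>As a minimal sketch of this point (our own example, not benchmark code), a 64-bit element-wise multiplication can be written directly, and the compiler lowers it to a single instruction where the target supports it or to a 32-bit-halves sequence otherwise:
#include &lt;cstdint&gt;
using Vec4u64 __attribute__((vector_size(32))) = uint64_t;

// 64-bit element-wise multiply; no platform intrinsic needed.
Vec4u64 mul64(Vec4u64 a, Vec4u64 b) {
  return a * b;
}</p>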
      </sec>
      <sec id="sec-2-2b">
        <title>4. Benchmarks</title>
        <p>In this section, we perform four microbenchmarks to compare platform-specific SIMD code with scalar, auto-vectorized, and platform-independent variants. We evaluate multiply-shift hashing (Section 4.2), a hash bucket key lookup (Section 4.3), bit-packed decompression (Section 4.4), and an integer column filter scan (Section 4.5). We then present the impact of platform- and compiler-intrinsics in end-to-end TPC-H benchmarks in Velox, a state-of-the-art columnar query engine (Section 4.6). Our results show that in most benchmarks, vector code is needed to achieve peak performance and that when using vector code, compiler-intrinsics perform at least as well as platform-intrinsics. All of our code and all results can be found in our GitHub repository (https://github.com/hpides/autovec-db).</p>
        <p>4.1. Setup and Methodology</p>
        <p>The results presented in this section are based on an x86 Intel Icelake CPU (Xeon Platinum 8352Y) and on an M1 MacBook Pro 14" laptop with ARM NEON. We validate our results on various other x86 and ARM platforms and discuss this in Section 5. All experiments are performed using Clang/LLVM 15 (we refer to Clang for the C++ frontend and to LLVM for backend optimizations) with -march/-mtune=native optimizations and -O3. All experiments are run single-threaded and repeated ten times. We report the average runtime; the variance for all benchmarks is below 5%. We show the relative speedup of each variant over a naive scalar implementation without explicit vectorization efforts. For all microbenchmarks, we show the naive baseline (naive), a scalar code version that is optimized for auto-vectorization (autovec), our compiler-intrinsics versions (vec-*), and a platform-intrinsics version (sse4-*, avx2-*, avx512-*, neon-*). The platform-specific variants use manually selected intrinsics and also represent the usage of a SIMD library, which provides a thin abstraction on top of platform-intrinsics.</p>
        <p>In some experiments, the platform-specific variants perform worse than the ones using compiler-intrinsics. We note that it is nearly always possible to adapt the platform-intrinsics to match the compiler's code to achieve the same performance. However, we keep these worse results to demonstrate the complexity of manually selecting instructions from thousands of available ones. In these cases, we explicitly discuss the responsible instructions.</p>
        <p>4.2. Multiply-Shift Hashing</p>
        <p>Our first microbenchmark computes the hash values of the input numbers using multiply-shift hashing [15]. This is done, e.g., to determine the bucket in a hash index [16]. Each 64-bit input number is multiplied by a 64-bit constant, the result is truncated to 64-bits, and then shifted to the right. As the hash computations of the individual values are independent across loop iterations, programmers likely expect auto-vectorization to work well here. LLVM auto-vectorizes our naive implementation on x86, so we include an additional scalar bar that shows the naive implementation with disabled auto-vectorization. The results are shown in Figure 1.
[Figure 1: 64-bit multiply-shift hashing; speedup over scalar on a) x86 Icelake and b) M1 for the variants scalar, naive, autovec, vec-128, vec-256, vec-512, sse4-128, avx2-256, avx512, and neon.]</p>
        <p>On x86, the naive, autovec, and vec-512 variants produce optimal vectorized code. For smaller explicit vector widths, the performance degrades as we require more operations to process the same input. With 128-bit vectors, we observe a speedup of only ~10% over scalar.</p>
        <p>The avx512 variant does not achieve peak performance because we use the shifting intrinsic _mm512_srli_epi64 (compiled to vpsrlq), which requires an immediate operand that is known at compile-time. Compile-time constants can generally be used for more aggressive optimization. However, for the vec-512 variant, the compiler instead chooses the _mm512_srlv_epi64 (vpsrlvq) instruction that shifts all elements in input vector 1 by the amount specified by the corresponding element in input vector 2. The second operand needed in the hot loop is moved outside the loop, and in this case, the instruction that the compiler chooses is ~20% faster than the instruction with the immediate operand.</p>
        <p>With the avx2 and sse4 variants, we observe worse performance than with the corresponding compiler-variants using the same vector width. Our implementations only use instructions available in the instruction set that the code is targeted for. Vector multiplication with 64-bit integer elements was first introduced in AVX-512, so our code performs manual multiplication of 32-bit halves of the input using the approach that is also used in the vectorclass SIMD library [17]. LLVM is unable to optimize this to the proper multiplication instruction on the target architecture. This shows a core disadvantage of platform-specific code. Implementations tailored to a specific microarchitecture cannot benefit from new instructions and optimizations, while generic code can. When repeating the measurements on a Skylake CPU without AVX-512 support, we observe that the autovec, vec-256, and vec-512 variants are still 25% faster than the handwritten AVX2 implementation using the multiplication from the vectorclass library. The generated code only differs in how the multiplication is performed. In this case, using a specialized SIMD library even results in worse performance than the naive implementation.</p>
        <p>On M1, the LLVM cost model does not consider loop vectorization desirable for the naive implementation. For the autovec variant, we explicitly instruct the compiler to vectorize the loop via an annotation. LLVM generates very similar code for the vec-* variants, with nearly identical performance.</p>
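        <p>For reference, the core multiply-shift computation can be expressed with compiler-vectors roughly as follows; this is our own sketch, and the multiplier and shift amount are placeholders rather than the exact constants used in the benchmark.
#include &lt;cstdint&gt;
using Vec8u64 __attribute__((vector_size(64))) = uint64_t;

// Multiply-shift hash for 8 keys at once.
Vec8u64 multiply_shift(Vec8u64 keys) {
  const uint64_t multiplier = 0x9E3779B97F4A7C15ull; // placeholder constant
  return (keys * multiplier) &gt;&gt; 32;                  // placeholder shift
}</p>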
        <p>In this benchmark, we measure the performance of
finding an entry in a hash bucket that stores additional
fingerprints in a metadata block at the beginning of the hash
bucket. A fingerprint is a 1-byte hash value of a key. The
block of fingerprints can then be used to quickly skip
over keys that do not match the search key. This design is
common in index structures and used in, e.g., Facebook’s
F14 hash table, Google’s Abseil hash map, Velox, Dash,
and ART [18, 19, 7, 20, 21]. With vector operations, all
fingerprints can be compared to the fingerprint of the
search key in a single operation, allowing a direct jump
to the first key with a fingerprint match. Our hash bucket
holds 15 entries, each with a fingerprint, an 8-byte key,
and an 8-byte value. Fingerprints are stored in a
contiguous 16-byte array that is processed as a 128-bit vector.
The benchmark repeatedly performs key lookups in the
hash bucket with a 50% chance of a successful lookup.</p>
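        <p>The fingerprint probe of the *-bytemask variants can be sketched as follows; the bucket layout and helper names are our own illustration rather than code from F14, Abseil, or Velox.
#include &lt;cstdint&gt;
#include &lt;cstring&gt;
using FpVec __attribute__((vector_size(16))) = uint8_t;

// Compare all 16 fingerprint slots at once, then only inspect keys
// whose fingerprint byte matched (caller verifies the full key).
int find_candidate(const uint8_t* fingerprints, uint8_t probe_fp) {
  FpVec fps;
  std::memcpy(&amp;fps, fingerprints, sizeof(fps));
  auto matches = fps == probe_fp;   // 0xFF where equal, 0x00 otherwise
  for (int i = 0; i &lt; 15; ++i) {    // slot 16 is unused in our bucket
    if (matches[i]) return i;
  }
  return -1;
}</p>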
        <p>Our benchmark results are shown in Figure 2. The
naive-key-only variant ignores the fingerprints and
simply loops over the keys to find a match. The *-bytemask
variants perform a vector-comparison on the fingerprints
array, resulting in a vector of 16 mask-bytes with either
all one-bits or all zero-bits. They then loop over the
all-one-bytes in this vector and compare the
corresponding key. The *-bitmask variants first narrow the 16-byte
mask to a bitmask of 16 bits, allowing the result to fit in a
general-purpose register and simplifying the loop logic.</p>
        <p>On both x86 and M1, LLVM correctly auto-vectorizes
the comparison of the 15 fingerprints. However, it
generates suboptimal code when attempting to extract the
16 bytes as a __uint128_t value that could then be used
with the bytemask-logic [22]. The approach of narrowing
the 16 bytes to 16 bits also results in suboptimal code [23].</p>
        <p>Overall, while the fingerprint comparison is vectorized as
expected, we are unable to use the comparison result
efficiently, so the autovec implementation does not achieve
competitive performance.</p>
        <p>With explicit vector operations, the bitmask approach
is better on x86 due to simpler loop logic and a saved instruction.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4.4. Bit-Packed Integer Decompression</title>
        <p>In this benchmark, we evaluate the performance of decompressing 100k bit-packed integers. The variants are
based on SIMD-Scan [3] and an implementation in Velox
[26]. The input consists of 9-bit packed values, which
expand to 32-bit in the output vector, as in the SIMD-Scan
paper. The results are shown in Figure 3.</p>
        <p>The naive variant operates on 9 bits at a time,
expanding and storing them, and then shifts in the next 9 bits.</p>
        <p>The autovec variant performs a scalar expand+store, but
in a loop over 32 elements without loop-carried
dependencies, allowing the compiler to auto-vectorize
operations. The vec/neon/sse4-* variants perform the
SIMD-Scan algorithm for 128 bits as in the original paper. It
loads a 128-bit vector, i) shuffles to move the required bytes into the lanes, ii) shifts right to move the relevant bits for each number to the beginning of the lane, and iii) masks off leftover bits using bitwise-and. The extended
256 and 512-bit variants use the same approach with
wider registers. The avx2-256 variant uses two 128-bit
registers since 256-bit registers are separated into two
128-bit lanes with no support for bytewise shuffles across
lanes. The pdep variant implements the algorithm used
5https://godbolt.org/z/hTjnWMr3n
6https://godbolt.org/z/PeWGW6rsj
7Merged, available upstream since cd68e17.
in Velox with the x86 pdep instruction [27] to extract multiple compressed n-bit values out of 64-bit data.
Figure 4: Filtering integer column scan (32 000 values).</p>
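        <p>The scalar kernel behind the autovec variant can be sketched as follows; this is our own illustration (hypothetical function name, little-endian layout assumed), not the Velox or SIMD-Scan code.
#include &lt;cstdint&gt;
#include &lt;cstring&gt;

// Unpack 32 consecutive 9-bit values into 32-bit integers. Each output
// depends only on its own bit range, so there are no loop-carried
// dependencies and the compiler may auto-vectorize the loop.
void unpack9_block(const uint8_t* in, uint32_t* out) {
  for (int i = 0; i &lt; 32; ++i) {
    uint32_t bit = 9u * i;
    uint16_t window;
    std::memcpy(&amp;window, in + bit / 8, sizeof(window)); // 9 bits always fit in 2 bytes
    out[i] = (window &gt;&gt; (bit % 8)) &amp; 0x1FFu;
  }
}</p>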
        <p>LLVM auto-vectorizes better for x86 than for NEON8, combines them with a platform-independent
optimizaalthough both achieve the same ~5× speedup. It detects tmioanp. tHoodwifeerevnetr,LwLhVeMn oups-incogdtehse, wNhEiOchNa-rinetnriontsdiceste,cthteedse
the pattern on x86 and auto-vectorizes it with gather, in this optimization step, resulting in worse code. This
shift, and bitwise-and instructions. For NEON, LLVM shows that the compiler can detect the optimization for
emits unrolled scalar code.</p>
        <p>On x86, comparing the vec variants with the platform- two shifts in general, but it currently cannot detect it
when the developer explicitly requests NEON-intrinsics.</p>
        <p>The 256-bit variant is compiled to two 128-bit
operations with equal performance, while the 512-bit
variant produces ineficient scalar code and performs
significantly worse on NEON.
intrinsic variants shows multiple things. The 128-bit
variants perform equally, as they produce identical assembly.
vec-256 is faster than vec-128 due to higher parallelism.</p>
        <p>It is also faster than our avx2 variant, as the compiler
selects more eficient instructions than we do. The
avx2256 variant processes two 128-bit registers to solve the
problem that no cross-lane byte-level shufle instruction 4.5. Filtering Integer Column Scan
is available for 256-bit registers in AVX2. For this reason, In our final microbenchmark, we show the performance
it performs the shifting and bit-masking twice as often of an integer scan with a filter on 32k values, inspired
compared to an algorithm using a 256-bit register. The by implementations in Velox [28] and Hyrise [29]. All
vec-256 variant, on the other hand, is compiled to use values in the column are represented as 32-bit integers
256-bit registers by first performing a cross-lane immedi- and we filter by  &lt;  with a selectivity of 50%. The
ate permute operation and then using an in-lane shufle. matching 32-bit RowIDs are written to an output vector.
Here, LLVM utilizes the fact that for each of our shufle The results are shown in Figure 4.
operations, all selected indices are from a contiguous 16- The naive variant implements a for-loop and writes to
byte range, so it is able to store all required source-bytes the output vector if the filter matches. The autovec
variin both 128-bit lanes. Overall, both variants require the ant does the same but uses a branchless implementation,
same number of instructions for shufling, but the vec-256 i.e., out_idx += val &lt; x. This always writes to the
outvariant uses half as many for shifting and masking. put but overwrites the current value in the next iteration</p>
        <p>The avx512 variant outperforms its vec-512 counter- if it did not match. The *-shufle variants create a bitmask
part, as Clang inserts an and-instruction to mask the (4-bit in 128, 16-bit in 512) from the comparison result
shufle input in the frontend, which the backend cannot vector and perform a table lookup (24 or 216 entries)
remove, resulting in more instructions in the tight loop. with that mask to retrieve the indices with which the
The vec-256 and avx512 variants perform equally, as the current iteration’s RowIDs are shufled. If positions 1 and
CPU retires twice as many 256-bit instructions per cycle 3 match, (ids[1], ids[3], ∘ , ∘ ) is written to the output,
as 512-bit, compensating for the lower parallelism. where the ∘ ’s are overwritten in the next iteration. The</p>
        <p>The vec implementation of SIMD-Scan outperforms
sse4-pdep found in Velox, while not using any platform- *-add variants do not shufle but use the bitmask to look
up which RowIDs must be added to the current iteration’s
start RowID . If positions 1 and 3 match, ( + 1,  + 3, ∘ ,
specific code. Overall, we again see that the vec-* variants
perform on par with the platform-intrinsic variants.</p>
        <p>With NEON, we see that the 128- and 256-bit compiler- ∘ ) is written. This saves one shufle instruction but only
works for contiguous input RowIDs. The vpcompressd
intrinsics outperform the NEON-intrinsics, as the com- variants use the AVX-512 vpcompressd [30] instruction
piler combines the two shift instructions with compile- that shifts all matching values to the beginning of the
time constants that are executed independently in the SIMD register and stores them to memory. We also
evalneon variant9. During instruction selection, LLVM recog- uate a version with the SSE4 pext [31] instruction, but
nizes the vector shift operations in our vec variant and this is slower than the more general approaches.
Unlike the other benchmarks, auto-vectorization and compiler-intrinsics do not achieve the performance of platform-intrinsics on x86. The special vpcompressd instruction performs the required logic and achieves significantly better results by combining the lookup and shuffle in a single instruction. However, there is currently no way to make LLVM produce this instruction using auto-vectorization or compiler-intrinsics [32, 33]. autovec significantly outperforms naive because of branchless execution. The compiler unrolls the naive loop five times but converts the write inside an if-branch into five branches in assembly, resulting in high runtime costs for branch mispredictions10.
8https://godbolt.org/z/ef7WMPMT4
9https://godbolt.org/z/5bsv3qWT6</p>
        <p>[Figure 5: End-to-end TPC-H benchmark (19 supported queries) with SF1 from Parquet in Velox; runtime in ms per query for vec and xsimd on a) x86 Icelake and b) M1.]</p>
        <p>For the *-shuffle/add variants, the 512-bit variants outperform the 128-bit ones by only 2×. The higher paral
lelism is counterbalanced by the need for a bigger lookup
table of up to 4 MB that causes cache misses. With the time difers by 1.3% on Icelake and around 0.13% on M1.
compress variants, we observe a 3.4× speedup from 128- However, the diference on Icelake is not caused by SIMD
to 512-bits, as there is no lookup table and the 128- and code changes but by code alignment issues [36].
Explic512-bit instructions have the same latency and through- itly adding eight nop padding bytes before a single hot
put [30], benefiting the higher parallelism of the 512-bit loop in a scan operator reduces the performance
difervariant. For the 128-bit add-based variants, LLVM gen- ence to 0.1%. Code-alignment optimizations are highly
erates slightly better assembly with three fewer instruc- platform-specific and out of scope of this work. We use
tions when using compiler-intrinsics instead of SSE. them to illustrate that performance diferences at this</p>
        <p>For NEON, we see that the vec-128 variants achieve the level have various causes that must be considered when
same performance as the NEON implementation with optimizing. Overall, there is no performance impact
our patched LLVM [25]. Without the patch, both sufer when switching from platform- to compiler-intrinsics.
from the conversion-to-bitmask problem as in the hash We also perform a run without any explicit SIMD
probucket lookup (Section 4.3). For the shufle variant, LLVM cessing by replacing all compiler-intrinsics with
fixedalso generates worse shufle code that uses element-wise sized array operations. This decreases the overall
perforinsertion. We submitted an LLVM patch to fix this using mance by 4%, i.e., explicit SIMD code in Velox achieves
NEON’s TBL instruction11 (dashed part of bar) [34]. As 4% performance gains for TPC-H on average. However,
NEON does not have 512-bit registers, the improved con- these gains are not distributed uniformly. While some
version to bitmask and shufle cannot be used in vec-512, queries are not afected at all, one drops by 24.9%. In
resulting in worse assembly. Table 1, we show Q15, Q16, and Q18, which exhibit the
largest diferences to the xsimd baseline when disabling
all SIMD code (novec).</p>
        <sec id="sec-2-3-1">
          <title>4.6. SIMD in Velox</title>
          <p>Q15
Q16</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>In this benchmark, we evaluate the performance of com</title>
        <p>piler-intrinsics in an end-to-end system setup, running novec vec novec vec vec
TPC-H queries in Velox [7]. To show the diference x86 –5.0% –3.4%* –3.8% +0.2% –0.1%
between platform- and compiler-intrinsics, we replace M1 +0.1% +0.2% –2.5% +0.2% –0.2%
Velox’ xsimd [4] dependency with Clang’s vector intrin- Table 1: Performance diference from xsimd TPC-H baseline
sics. We run 18 out of 19 supported TPC-H queries in to no SIMD code (novec) and compiler-intrinsics (vec).
Velox12 with scale factor 1. All queries are executed ten
times with Velox’s default TPC-H configuration of four For Q15, vectorization does not impact M1, i.e., all
threads and we report the average runtime. We compile three variants perform similarly. On x86, the vec
variwith Clang/LLVM 15 on x86 and with our patched ver- ant is 3.4% slower with the default code alignment but
sion on M1, using -O3 and -march/mtune=native flags. only 0.3% slower with the hot loop padding (marked * in
The results are shown in Figure 5. Table 1). For Q16, novec is worse than xsimd, while vec</p>
        <p>For both x86 and M1, our vec implementation performs performs marginally better on both x86 and M1. Q18 is a
equally to Velox’s original xsimd code. The average run- query with high vectorization benefits. On both x86 and
M1, vec achieves the same performance as the
handwrit1110hMtetprgs:e/d/g, oadvbaoillatb.olergu/pz/shtr9e6aPm7ss1i7n3ce 267d6d6. ten xsimd variant, closing the 25% and 14% gaps. Overall,
12We omit Q21 as Velox produces wrong results with scale factor vec performs better than xsimd in 7/18 queries on x86
10 [35]. Overall, SF10 shows the same performance trends as SF1. (with padding), and in 5/18 on M1.</p>
        <p>Q18
novec
–24.9%
–14.3%</p>
        <p>In the process of replacing xsimd, we removed 54
platform-specific functions across ten function groups,
guarded via #ifdef &lt;PLATFORM&gt; directives. Of these ten
groups, five had multiple implementations for diferent
data types (e.g., int32_t and int64_t), averaging at 3.6
type-specific implementations per group. Overall, our
templatable compiler-intrinsics code removed hundreds
of lines of code and most type-specific variants while
achieving the same performance. We also discovered
a bug, in which an optimized x86 bit-packing decoder
algorithm was not used instead of a scalar fallback, as a
platform-specific macro was not included correctly [ 37].</p>
        <p>This highlights the complexity of correctly handling
multiple platforms and target CPUs in query engines, which
can be avoided with compiler-intrinsics.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Performance Generalizability</title>
      <p>In Section 4, we focus on two machines and LLVM to
perform our evaluation in depth. To support our claims and
show the applicability of our approach to a wider range
of setups, we show performance results across various
x86 and ARM machines in Section 5.1. In Section 5.2, we
show results for GCC as an additional major compiler
that offers compiler-intrinsics.
we occasionally encountered performance outliers of up
to 100%. Our microbenchmarks contain very tight loops
with only a few instructions, so code alignment has a
large performance impact. In one case, the loop was
aligned to 32 bytes before a change and 16 bytes after
a change. Due to this alignment, Intel’s micro-op cache
could not fully cache the hot loop, leading to higher
pressure on the instruction decoder.</p>
      <p>These alignment issues are highly system-specific [ 36]
and such large outliers could invalidate our conclusions,
so we run the benchmarks on multiple x86 and multiple
ARM servers for validation. For x86, we run the
experiments on the Intel Icelake server mentioned in Section 4.1,
on an Intel Cascadelake CPU (Xeon Gold 5220R), on an
Intel Skylake laptop CPU with AVX2 (i5-6200U), and on
an AMD Rome CPU with AVX2 (EPYC 7742). For NEON,
we run on a M1 MacBook Pro (see Section 4.1), a Graviton
2 CPU (t4g.xlarge), a Graviton 3 CPU (c7g.xlarge), and
a Raspberry Pi 4 (ARM Cortex-A72). All NEON
benchmarks are compiled with a current LLVM version (April
’23, commit cd68e17), which contains our patches. We
show the results for the bit-unpacking benchmark in
Table 2. The other benchmarks show similar trends, so we
omit them for space reasons.</p>
      <p>For x86, we see that the performance trends discussed
for Icelake also hold for other x86 machines. Across all
systems, we see that most vec-* variants achieve the same
or better performance than the x86-specific ones.
Systems that do not have the necessary AVX512 instructions
perform poorly with vec-512, as the logic is not easily
translatable to AVX2 instructions. On all servers,
explicit vectorization outperforms naive and autovec. On
AMD Rome, the sse4-pdep variant, as used in Velox,
performs significantly worse than any other variant. This is
caused by the pdep instruction being significantly slower
on AMD than on Intel CPUs [38].</p>
      <p>On ARM, we see a wider absolute performance range.
The Raspberry Pi naive baseline is 4× slower than M1,
while the Graviton CPUs are 1.5–2× slower. Regardless
of absolute performance, all machines benefit from
vectorization. As NEON does not have 256-bit registers, the
vec-256 code is split into multiple 128-bit instructions.
Graviton 2 does not scale well here, as it has only two
SIMD units per core, while M1 and Graviton 3 have four.</p>
      <p>Overall, the results show that our insights and
conclusions hold on a wider range of systems. We see that
in nearly all cases, our compiler-intrinsics approach
performs on par with or outperforms the platform-intrinsics.</p>
      <sec id="sec-3-1">
        <title>5.2. Across Compilers</title>
      </sec>
      <sec id="sec-3-2">
        <title>5.1. Across Platforms</title>
        <sec id="sec-3-2-1">
          <title>In our evaluation, we focus on Clang/LLVM. While many</title>
          <p>In this section, we discuss the generalizability of our projects focus on one compiler, it is common to use
mulresults across multiple platforms to avoid the impact of tiple compilers for testing. If features and built-in
funcmeasurement bias. While developing the benchmarks, tions diverge, developers have to write multiple
compiler</p>
          <p>M1 Graviton 3 Also, GCC produces slower code for the vec-256 variants
1 (110 µs) on the ARM platforms.</p>
          <p>1.2x On Icelake, the 512-bit variants vec-512 and avx512
7.2x perform 20% worse than vec-256. Code inspection shows
1.1x that these variants produce identical assembly except for
0.9x vector width and ofset values. Thus, we attribute the
- lower performance of the 512-bit variants to
microarchi- tectural details. With LLVM, these variants are compiled
- to a slightly diferent instruction order, where vec-256
- and avx-512 achieve similar performance.</p>
          <p>6.5x On Rome, vec-128 performs a bit worse than sse4-128.
(a) Bit-packed integer decompression. This is caused by difering instruction selection. In
vec128, we use a shift-operation on vectors to align the
num</p>
          <p>IcIenltaekle RAoMmDe M1 Grav. 3 sbheirfst iinnsstirduecttihoenl.aInnehsa,nwdhwircihtteGnCcCodceo,
mthpisilceosrtroesapovnedctsotronaive 1 (156 µs) 1 (143 µs) 1 (97 µs) 1 (99 µs) the _mm_sllv_epi32 instruction, which is only available
autovec 8.2x 7.4x 8.6x 5.0x in AVX2. So we instead use a multiplication instruction
vec-128-shufle 7.9x 8.1x 9.1x 3.9x (_mm_mullo_epi32) in sse4-128, which is slightly faster.</p>
          <p>vec-128-add 8.1x 8.7x 10.2x 4.4x An improvement of GCC’s instruction selection would
vec-512-shufle 7.8x 1.9x 3.4x 1.9x remove this diference.</p>
          <p>vec-512-add 7.9x 4.7x 5.6x 2.5x For the filtering column scan, computing the
matchsse4ss-1e248-1-s2h8-uaflded 141.40.x1x 181.90.x8x -- -- ing element bitmask is a fundamental operation. As
vpcompressd-512 38.2x - - - described above, GCC does not support vector-bitmask
neon-shufle - - 12.3x 6.0x types, and thus does not allow expressing
bitmask-conneon-add - - 12.8x 6.0x versions. We circumvent this by accessing all
bitmaskrelated logic through a helper function that uses the
(b) Filtering column scan. NEON-approach of bitwise masking and horizontal
comTable 3: Results for compilation with GCC on multiple x86 bining if compiled with GCC. In isolation, this approach
and NEON platforms. Speedup relative to naive. does not result in good performance on x86 platforms that
have native movemask operations13. In a performance
critdependent versions instead of platform-dependent ones, ical path, programmers would have to implement bitmask
reducing the benefit of compiler-intrinsics. In this sec- conversion for each platform.
tion, we discuss the portability across compilers. The code generated by our generic implementation</p>
          <p>The vector extensions using the vector_size attribute causes a significant performance drop compared to the
were initially specified by GCC and later adopted by platform-specific variants. On x86, the sse4-128 variants
Clang. Additionally, Clang supports a ext_vector_type outperform the vec-128 variants by 1.7× on Icelake and
attribute that allows specifying bitmask types. As of June 2.2× on Rome. Our function to compute bitmasks scales
2023 (version 19.35), Microsoft’s MSVC compiler does in complexity with the number of vector elements. For
not support the vector_size attribute. Intel’s ICX com- a 128-bit vector of 32-bit integers, GCC compiles it to 8
piler is based on LLVM and supports both vector_size instructions, while requiring 40 for a 512-bit vector. Due
and ext_vector_type. As of June 2023 (version 13.1), to this additional overhead, the vec-512 variants do not
GCC does not support the ext_vector_type attribute outperform the vec-128 ones.
and thus ofers no way to natively express conversions On ARM, the compiler-intrinsics and
platform-intrinfrom or to bitmasks. sics difer by a smaller, but still significant amount. The</p>
          <p>We repeat our benchmarks using GCC 12.2 to test if handwritten NEON-variants are 30% and 40% faster on
other performance problems occur. The results of the M1 and Graviton 3, respectively. GCC correctly converts
integer decompression and filtering column scan bench- the shufle to a TBL instruction, but struggles with the
marks for a selection of platforms are shown in Table 3. bitmask, as for x86.</p>
          <p>The other benchmarks exhibit similar characteristics, so
we omit them here for space reasons. 6. Discussion</p>
          <p>With the bit-packed integer decompression
benchmark, the results are similar to the results observed with Our results show that in most of the evaluated database
LLVM with a few subtle diferences. GCC’s auto-vectori- operations, platform-specific SIMD code does not achieve
zation performs worse on Rome and the ARM platforms.
1 template &lt;typename T, typename FilterOp&gt;
better performance than platform-independent SIMD 2 void filter(T* in, T filter_val, T* out, FilterOp op) {
code. In 7/8 microbenchmarks, compiler-intrinsics per- 3 int num_elements = sizeof(Vec&lt;T&gt;) / sizeof(T);
form on par with or better than platform-intrinsics. Auto- 4 for (int i = 0; i &lt; N; i += num_elements) {
vectorization alone does not achieve this, providing good 5 // Vec&lt;T&gt; is a templated version of Vec (Listing 2)
wpeitrhfosrtmataen-ocfe-tihneo-anrltyc2o/m8 pbielenrcsh, mexaprlkicsi.t WSIeMcDonccolduedies tnheact- 876 /VV/eecc&lt;&lt;FTTi&gt;&gt;ltmveaartlcsihs=esp(la=Vtefoco&lt;pr(Tv&gt;ma&amp;-l)i,nidfne[ipile]tn;edre_nvtala)n;d templatable.
essary to achieve high performance, and we show that 9 // n variants only needed for special instructions.
this is possible without implementing multiple versions 10 uint64_t mask = compress_store(vals, matches, out);
of the same code for diferent platforms and data types. 11 out += std::popcount(mask);
12 }
13 }</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>6.1. Using Compiler-Intrinsics 14</title>
        <p>There are cases in which platform-intrinsics are neces- 1165 t#eimfpdleafte__&lt;AtVyX5p1e2naFm_e_ T&gt;//uiWnet6c4a_ntuse compressstore.
sary. Therefore, we recommend approaching writing 17 compress_store(Vec&lt;T&gt; vals, Vec&lt;T&gt; matches, T* out) {
SIMD code as follows. First, developers should try to 18 uint64_t mask;
rely on auto-vectorization, because this code is the most 19 // When T is 32-bit
portable and often easiest to understand. If this does not 20 if constexpr (sizeof(Vec&lt;T&gt;) == 16) {
yield good results or is not possible, compiler-intrinsics 2212 m_amsmk_m=as_km_mc_ommopvreepsis3s2t_omraesuk_(empait3c2(hoeus)t,; mask, vals);
should be used to a) structure the overall SIMD code and 23 } else if (sizeof(Vec&lt;T&gt;) == 32) { // _mm256...
b) write portable SIMD algorithms. Only in cases where 24 } else if (sizeof(Vec&lt;T&gt;) == 64) { // _mm512...
the desired logic cannot be expressed portably, devel- 25 } return mask;
opers should use platform-intrinsics as small, localized 26 }
performance fixes. 2287 t#eemlpsleate//&lt;tWyepecnaan'met Tu&gt;seuicnotm6p4r_etssstore.</p>
        <p>We show this approach in a small example in Listing 3, 29 compress_store(Vec&lt;T&gt; vals, Vec&lt;T&gt; matches, T* out) {
in which we store matching values based on a filter pred- 30 // Fallback shuffle + store similar to Listing 2.
icate, similar to the benchmark in Section 4.5. In our 31 ...
benchmarks, we observe that AVX512’s compressstore 32 }
is very eficient in storing only matching values. To lever- 33 #endif
age this via compiler-intrinsics, developers can write a Listing 3: Use compiler-intrinsics to structure code, represent
single platform- and type-independent templated method vectors, and for common operations. Only switch to localized
fpoler,ththeegseanmerealf ufiltnecrtiniognlocgainc b(Leinuessed1–fo13r)i.nItnegoeurrse, xfloaamts-, vpalartiafonrtmsf-oinrttrhinessicasmfeorfisltpeercinifgicloingsitcr.uctions without multiple
and longs in 16-, 32-, or 64-byte vectors on x86, ARM, and
other platforms. We distinguish between platforms and to test the generic 512-bit “AVX512” vector logic on an
types only for the special compressstore instruction AVX2 or even an ARM server. While the generated code
in Line 10. Depending on the platform, we choose the may be ineficient, the compiler guarantees to generate
AVX512 implementation (Lines 16–26) or a fallback im- valid code for the given input regardless of the available
plementation (Lines 28–32). Even in this small example, instructions. Second, as our benchmarks show, staying
we see that supporting all variants of compressstore in the compiler’s own type domain allows for better
opleads to many code paths, i.e., three vector sizes and mul- timization in some cases. Third, with minimal code
overtiple element sizes. However, this complexity is localized head, developers can structure SIMD code according to
and does not leak into the filtering operation. their needs without adapting to the interface given by
the library or relying on a third-party dependency at all.
6.2. Advantages of Compiler-Intrinsics As SIMD code is written for high performance,
develThe approach described in Section 6.1 is also achievable opers still need to benchmark and evaluate their code
via SIMD libraries due to their vector abstractions and continuously. Even with compiler-intrinsics, the
comimplemented as such, e.g., in Velox. However, relying piler should be instructed to generate optimized code via
on compiler-intrinsics over libraries has three key ad- -march/mtune flags, and the code should be profiled and
vantages. First, development and testing becomes easier. updated if new compiler versions ofer new functionality.
When relying on a SIMD library that wraps platform- This is also the case when using platform-intrinsics but
intrinsics, functions can only be compiled on machines without the benefit of automatically adapted instructions
that support these intrinsics, e.g., testing AVX512 code on new platforms.
on an AVX2 server is not possible in general. When
expressing the logic via compiler-intrinsics, it is possible</p>
      </sec>
      <sec id="sec-3-4">
        <title>6.3. Open Challenges</title>
        <sec id="sec-3-4-1">
          <title>Compiler-intrinsics perform well in our experiments and</title>
          <p>Velox, but a few things still require improvement. GCC’s
missing support for vector-to-bitmask conversion forces
us to use workarounds with suboptimal performance.</p>
          <p>LLVM does not generate good vectorized code for some
common (x86-) patterns. For example, an x86 gather
is detected in some cases, but not in others, while an
analogous scatter is not detected at all14. Clang
supports the ternary operator for element-wise selection
but LLVM does not yet combine it with mask-able
instructions available in, e.g., AVX51215. Overall,
compilerintrinsics support x86 better than NEON due to wider
adoption. The documentation of compiler-intrinsics is
not as broad as that of platform-intrinsics, as they are
mostly used internally. Better instruction selection and
documentation as well as more features will increase the
applicability of compiler-intrinsics, further reducing the
need for platform-intrinsics.</p>
          <p>SVE and RISC-V V implement vector length
agnostic programming approaches where vector sizes may be
determined at runtime. C++ code using the proposed
compiler-intrinsics can conceptually be written to
support arbitrary vector sizes. However, when this is done
using templating, the compiler still typically computes
constants such as loop increments at compile time. The
programming model of variable length vectors also
emphasizes usage of predicate/mask registers to decouple
the available input element count from actual vector sizes.</p>
          <p>This allows removing any epilogue logic handling
leftover elements. To our knowledge, there is currently no
way to express runtime dynamic vector sizes or
predicated vector operations with compiler intrinsics in GCC 8. Conclusion
and Clang. With wider adoption of variable length vector
instructions, we expect compilers to add more support In this paper, we make the case for writing less
platformfor these concepts. specific SIMD code in databases. Instead of requiring</p>
        <p>Platform-independent vector extensions provided by major compilers are not widely known in the database community, and thus, they are not widely used. We believe that higher adoption will lead to more features and better instruction selection, solving some of the open challenges. Our results show that platform-dependent SIMD code is often not needed, so we encourage database developers to use compiler-intrinsics in their code.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Related Work</title>
      <p>In this section, we briefly discuss related work on SIMD programming in databases and on SIMD libraries.</p>
      <p>SIMD in Databases. There is a wide range of research on vectorization in databases [5, 1, 39, 3, 40, 41, 2, 42]. This research shows that vectorization is beneficial and serves as a motivation for our work. Based on some of it, we conduct microbenchmarks and show that they can often be implemented with platform-independent SIMD code. In general, research on database vectorization is orthogonal to our work, and we think that it should consider platform-independent optimizations in the future.</p>
      <p>Some work makes use of many platform-specific instructions [43, 44], making it hard to fit into our platform-independent approach. There are cases where this is beneficial, but we recommend using these platform-specific algorithms only if they have been shown to be better.</p>
      <p>SIMD Libraries. Various SIMD libraries exist that help developers write platform-independent code by hiding the platform-specific intrinsics behind an abstraction layer [4, 45, 46, 5, 6, 17]. To varying degrees, they cover the standard operations supported on all platforms and offer additional interfaces for platform-specific instructions. Some of these libraries also provide higher-level functions that combine instructions to offer more features, which is orthogonal to our work. However, all of these libraries add a layer of abstraction on top of the compiler’s platform-intrinsics. In contrast, we propose to remove a layer of abstraction and work directly with compiler-intrinsics instead. Our approach reduces code complexity and occasionally benefits from better compiler optimizations.</p>
      <p>Another approach is to add a SIMD abstraction to the programming language’s standard library [47, 48, 49]. This supports our claim that SIMD programming should be platform-independent.</p>
    </sec>
    <sec id="sec-5">
      <title>8. Conclusion</title>
      <p>In this paper, we make the case for writing less platform-specific SIMD code in databases. Instead of requiring multiple platform-specific implementations for the same logic, compiler vector intrinsics allow developers to write a single version that is optimized by the compiler for every target platform. Our evaluation shows that this approach achieves the same or better performance in most of our microbenchmarks and a real system. While challenges remain, adopting compiler-intrinsics in databases leads to less logical code duplication and fewer platform-specific SIMD variants.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially funded by the German Ministry for Education and Research (ref. 01IS18025A and ref. 01IS18037A), the German Research Foundation (ref. 414984028), and the European Union’s Horizon 2020 research and innovation programme (ref. 957407).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[26] Meta Inc., Velox bitpackdecoder [source code], https://github.com/facebookincubator/velox/blob/3f3aa92255aa514e6995ce8ba0d0e849a8beccdc/velox/dwio/common/BitPackDecoder.h#L315, 2023.</p>
      <p>[27] Intel Intrinsics Guide, Intel x86 pdep instruction, https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=pdep, 2023.</p>
      <p>[28] Meta Inc., Velox dictionary column visitor [source code], https://github.com/facebookincubator/velox/blob/52869442bd3710255df01c20914b65dcf842116b/velox/dwio/common/ColumnVisitors.h#L784, 2023.</p>
      <p>[29] M. Dreseler, Hyrise table scan [source code], https://github.com/hyrise/hyrise/blob/f7ad570cc30208c71593c0d7f609e34c0f60decb/src/lib/operators/table_scan/abstract_table_scan_impl.hpp#L83, 2019.</p>
      <p>[30] Intel Intrinsics Guide, Intel x86 vpcompressd instruction, https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vpcompressd, 2023.</p>
      <p>[31] Intel Intrinsics Guide, Intel x86 pext instruction, https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=pext, 2023.</p>
      <p>[32] Clang Team, Clang language extensions: Vectors and extended vectors, https://clang.llvm.org/docs/LanguageExtensions.html#builtin-functions, 2023.</p>
      <p>[33] D. Bolvanský, Missed ’compress’ codegen opportunity, https://github.com/llvm/llvm-project/issues/42210, 2019.</p>
      <p>[34] L. Benson, [aarch64] use neon’s tbl1 for 16xi8 build vector with mask, https://reviews.llvm.org/D146212, 2023.</p>
      <p>[35] L. Benson, Tpc-h q21 wrong results with sf10, https://github.com/facebookincubator/velox/issues/4312, 2023.</p>
      <p>[36] T. Mytkowicz, A. Diwan, M. Hauswirth, P. F. Sweeney, Producing wrong data without doing anything obviously wrong!, ACM SIGPLAN Notices 44 (2009) 265–276. URL: https://dl.acm.org/doi/10.1145/1508284.1508275. doi:10.1145/1508284.1508275.</p>
      <p>[37] L. Benson, Bitpackdecoder not using avx2 code, https://github.com/facebookincubator/velox/issues/4313, 2023.</p>
      <p>[38] A. Abel, uops info pdep, https://uops.info/table.html?search=pdep&amp;cb_lat=on&amp;cb_tp=on&amp;cb_uops=on&amp;cb_ICL=on&amp;cb_ZEN2=on&amp;cb_measurements=on&amp;cb_base=on&amp;cb_bmi=on, 2023.</p>
      <p>[39] O. Polychroniou, K. A. Ross, VIP: A SIMD vectorized analytical query engine, The VLDB Journal 29 (2020) 1243–1261. URL: https://doi.org/10.1007/s00778-020-00621-w. doi:10.1007/s00778-020-00621-w.</p>
      <p>[40] C. Balkesen, J. Teubner, G. Alonso, T. Ozsu, Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware, in: ICDE ’13, IEEE, 2013, pp. 362–373. URL: https://doi.org/10.1109/ICDE.2013.6544839. doi:10.1109/ICDE.2013.6544839.</p>
      <p>[41] C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. Di Blas, P. Dubey, Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs, Proceedings of the VLDB Endowment 2 (2009) 1378–1389. URL: https://dl.acm.org/doi/10.14778/1687553.1687564. doi:10.14778/1687553.1687564.</p>
      <p>[42] A. Afroozeh, P. Boncz, The FastLanes Compression Layout: Decoding &gt;100 billion integers per second with scalar code, PVLDB 16 (2023) 2132–2144. URL: https://ir.cwi.nl/pub/32992.</p>
      <p>[43] A. Ungethum, J. Pietrzyk, P. Damme, D. Habich, W. Lehner, Conflict Detection-Based Run-Length Encoding - AVX-512 CD Instruction Set in Action, in: 2018 IEEE 34th International Conference on Data Engineering Workshops (ICDEW), 2018, pp. 96–101. doi:10.1109/ICDEW.2018.00023, ISSN: 2473-3490.</p>
      <p>[44] M. Dreseler, J. Kossmann, J. Frohnhofen, M. Uflacker, H. Plattner, Fused Table Scans: Combining AVX-512 and JIT to Double the Performance of Multi-Predicate Scans, in: 2018 IEEE 34th International Conference on Data Engineering Workshops (ICDEW), IEEE, Paris, 2018, pp. 102–109. URL: https://ieeexplore.ieee.org/document/8402027/. doi:10.1109/ICDEW.2018.00024.</p>
      <p>[45] M. Kretz, Vc: Vector classes for C++ [source code], https://github.com/VcDevel/Vc, 2023.</p>
      <p>[46] I. Yermalayeu, Simd library [source code], https://github.com/ermig1979/Simd, 2023.</p>
      <p>[47] M. Kretz, C++ simd library, https://en.cppreference.com/w/cpp/experimental/simd, 2023.</p>
      <p>[48] M. Kretz, Data-parallel vector types &amp; operations, https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r9.pdf, 2018.</p>
      <p>[49] H. Sivonen, Portable packed simd vector types, https://github.com/rust-lang/rfcs/pull/2948, 2020.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>