<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating SIMD Compiler-Intrinsics for Database Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lawrence Benson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Ebeling</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tilmann Rabl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hasso Plattner Institute, University of Potsdam</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Modern query engines often use SIMD instructions to speed up query performance. As these instructions are heavily CPU-specific, developers must write multiple variants of the same code to support multiple target platforms such as AVX2, AVX512, and ARM NEON. This process leads to logical code duplication, which is cumbersome, hard to test, and hard to benchmark. In this paper, we make the case for writing less platform-specific SIMD code by leveraging the compiler's own platform-independent SIMD vector abstraction. This allows developers to write a single code variant for all platforms as with a SIMD library, without the library's redundant layers of abstraction. Clang and GCC implement the platforms' SIMD intrinsics on top of their own abstraction, so code written in it is optimized for the underlying vector instructions by the compiler. We conduct four database operation microbenchmarks based on code in real systems on x86 and ARM and show that compiler-intrinsic variants achieve the same or even better performance than platform-intrinsics in most cases. In addition, we completely replace the SIMD library in the state-of-the-art query engine Velox with compiler-intrinsics. Our results show that query engines can achieve the same performance with platform-independent code while requiring significantly less SIMD code and fewer variants.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Numerous databases and query processing engines make use of vector instructions (SIMD = single instruction, multiple data) to speed up query processing. By performing the same operation on multiple data items at once, SIMD instructions increase the performance of these systems significantly [1, 2, 3]. However, SIMD instructions are CPU-specific and come with various register widths and capabilities. Most modern x86 servers support SSE, AVX, and AVX2, but while a wide range of Intel servers also support AVX-512, only the newest AMD servers support it. With the rise of ARM in the cloud, query engines now also target NEON vector instructions and will soon likely target the new scalable vector extension (SVE), which is available in AWS' latest ARM servers. Supporting all of these instruction sets requires many set-specific code variants, resulting in significant logical code duplication and checks to select the correct version. Optimizing performance-critical SIMD code in this setup is cumbersome, hard to benchmark, and hard to test.</p>
      <p>Various SIMD libraries exist to reduce the complexity of supporting multiple vector instruction sets [4, 5, 6]. These libraries often provide a thin abstraction on top of the platform-specific SIMD intrinsics (platform-intrinsics). However, the explicit use of intrinsics directly or via a library can hinder compiler optimizations, as we show in our benchmarks. If an operation cannot be expressed using the library's abstractions, systems still require multiple platform-specific methods to handle special cases and optimizations. This is the case, e.g., in the query engine Velox [7], which uses a SIMD library to structure the vectorized code, but still contains dozens of platform-specific functions for AVX2 and NEON that implement the same logic but for different types and sizes.</p>
      <p>In this paper, we make the case for writing less platform-specific SIMD code in databases while also removing redundant layers of abstraction. We show the redundant layers of abstraction by inspecting how the x86 and NEON platform-intrinsics that add two 128-bit vectors of four 32-bit integers (_mm_add_epi32 and vaddq_s32) are implemented in compilers and wrapped by SIMD libraries. Both Clang and GCC provide platform-independent vector intrinsics (compiler-intrinsics) [8, 9], based on C/C++'s primitive types and compiler attributes. The vector-addition platform-intrinsics are implemented on top of these compiler-intrinsics, as outlined below.
// Simplified from Clang's &lt;emmintrin.h&gt; header.
// Public 16-byte __m128i type in x86 SIMD API.
typedef long __m128i __attribute__((vector_size(16)));
// Internal 16-byte vector of four integers.
typedef int __v4su __attribute__((vector_size(16)));
__m128i _mm_add_epi32(__m128i __a, __m128i __b) {
  return (__m128i)((__v4su)__a + (__v4su)__b);
}

// Simplified from Clang's &lt;arm_neon.h&gt; header.
// int32x4_t is defined analogously to __v4su.
int32x4_t vaddq_s32(int32x4_t __p0, int32x4_t __p1) {
  int32x4_t __ret;
  __ret = __p0 + __p1;
  return __ret;
}</p>
      <p>We see that x86's 16-byte SIMD type __m128i is defined using Clang's vector type via vector_size. The internal 16-byte __v4su vector of four integers is defined in the same way, and the addition itself is expressed with the compiler's + operator on these vector types; the NEON intrinsic vaddq_s32 likewise adds its operands with the + operator. SIMD libraries, in turn, wrap these platform-intrinsics behind their own vector abstraction, as sketched below.
1 template &lt;typename T&gt; struct vec {
2   vec&lt;T&gt; operator+(vec&lt;T&gt; other);
3 }
4 #if __x86_64__ // x86 platform
5 vec&lt;T&gt; vec&lt;T&gt;::operator+(vec&lt;T&gt; other) {
6   return _mm_add_epi32(data, other.data);
7 }
8 #elif __aarch64__ // ARM NEON platform
9 vec&lt;T&gt; vec&lt;T&gt;::operator+(vec&lt;T&gt; other) {
10   return vaddq_s32(data, other.data);
11 }
12 #else ...</p>
      <sec id="sec-1-1">
        <title>Based on a generic vector type (Lines 1–3), libraries</title>
        <p>implement C++ operators. In Lines 6 and 10, we show
how the platform-intrinsics are wrapped depending on
the used platform (x86 or NEON) to provide the library’s
C++ type abstraction and platform-independence. SIMD
libraries are abstractions on top of platform-intrinsics,
which in turn are abstractions on top of
compiler-intrinsics. Instead of an additional API layer around specific
functions, we argue that compiler-intrinsics can be used
directly to structure SIMD code while also providing a
wide range of operations common across SIMD
instruction sets. For our vector-addition code example,
compilerintrinsics could be used as follows.
1 template &lt;typename T&gt; // Templated 16-byte vector.
2 using vec __attribute__((vector_size(16))) = T;
3
4 // Code structured with compiler-intrinsics.
5 vec&lt;T&gt; foo(vec&lt;T&gt; a, vec&lt;T&gt; b) {
6 vec&lt;T&gt; result = a + b; // Use standard + operator.
7 // ...
8 return result;
9 }</p>
        <p>Compiler-intrinsics allow developers to express vector
code once instead of having to implement it  times for
 platforms. Developers can leverage them to write a
single templatable, platform-independent version while
still benefiting from the platform’s vector execution
capabilities. By removing platform-intrinsics, developing and
testing also becomes easier because the same code can
be run on diefrent machines, regardless of whether they
support specific instructions or not. Any operation
deifned on a compiler-vector is guaranteed to be compiled
correctly. As platform-intrinsics are implemented on top
of compiler-intrinsics, both can be used interchangeably
in case specific instructions are required. We argue that
developers should express the logical vector operations
that they need and let the compiler determine the correct
1) We compare the performance of compiler-intrinsics
to scalar, auto-vectorized, and platform-specific SIMD
code in four database operation microbenchmarks.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2) We replace all platform-specific SIMD code in the</title>
        <p>query engine Velox with compiler-intrinsics and show
the impact on end-to-end TPC-H workloads.</p>
      </sec>
      <sec id="sec-1-3">
        <title>3) Based on our results, we make the case for platformindependent SIMD code in databases and discuss open challenges.</title>
      </sec>
      <sec id="sec-1-4">
        <title>In Section 2, we discuss vectorized processing concepts,</title>
        <p>before introducing compiler-intrinsics in Section 3. In
Sections 4 and 5, we present our benchmark results on
various machines, compiled with Clang and GCC. We
discuss our findings in Section 6. We conclude our work
in Section 8 after presenting related work in Section 7.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Vectorized Processing</title>
      <sec id="sec-2-1">
        <p>In this section, we give a short overview of SIMD operations commonly used in databases. We then discuss the usage of SIMD instructions via intrinsics in C++ code. Finally, we briefly lay out auto-vectorization efforts in modern compilers.</p>
        <p>SIMD. When using SIMD operations, the program
code operates on vectors of elements and operations
are applied to all elements of a vector. For example,
binary operations such as addition are applied pairwise to
the elements of the vectors. Since multiple vector
elements are processed in a single operation, this yields a
higher throughput in comparison to scalar operations.</p>
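        <p>As a minimal sketch (our own illustration, using the GCC/Clang vector attribute discussed later in this paper), the following adds four 32-bit integers pairwise in a single logical operation:
#include &lt;cstdint&gt;
// Four 32-bit lanes in one 16-byte vector value.
using Vec4i __attribute__((vector_size(16))) = int32_t;
Vec4i add_pairwise(Vec4i a, Vec4i b) {
  return a + b; // Applied element-wise to all four lanes.
}</p>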
        <p>Databases often utilize this to speed up computations.</p>
        <p>In some cases, specialized data structures or layouts can help to utilize SIMD operations, e.g., columnar data storage or hash table buckets [10].</p>
        <p>template &lt;typename InT&gt;
__m128i x86_half(__m128i data);
template &lt;&gt; __m128i
x86_half&lt;uint32_t&gt;(__m128i data) {
  return _mm_cvtepu32_epi64(data);
}
template &lt;&gt; __m128i
x86_half&lt;int32_t&gt;(__m128i data) {
  return _mm_cvtepi32_epi64(data);
}
(a) x86 C++

#define VEC_SIZE(n) \
  __attribute__((vector_size(n)))
template &lt;typename T&gt;
using Vec VEC_SIZE(16) = T;
template &lt;typename T&gt;
using HalfVec VEC_SIZE(8) = T;
template &lt;typename Out, typename In&gt;
Vec&lt;Out&gt; vec_half(Vec&lt;In&gt; data) {
  return __builtin_convertvector(
    (HalfVec&lt;In&gt;&amp;)data, Vec&lt;Out&gt;);
}
(b) compiler-intrinsics C++

uint64x2_t
neon_half(uint32x4_t data) {
  return vmovl_u32(
    vget_low_u32(data));
}
int64x2_t
neon_half(int32x4_t data) {
  return vmovl_s32(
    vget_low_s32(data));
}
(c) NEON C++

vec/x86_half&lt;unsigned int, unsigned long&gt;(long long __vector(2)):
  ; Note the z for zero-extend
  pmovzxdq xmm0, xmm0
  ret
vec/x86_half&lt;int, long&gt;(long long __vector(2)):
  ; Note the s for sign-extend
  pmovsxdq xmm0, xmm0
  ret
(d) x86 assembly

vec/neon_half(__Uint32x4_t):
  ; Note the u for zero-extend
  ushll v0.2d, v0.2s, #0
  ret
vec/neon_half(__Int32x4_t):
  ; Note the s for sign-extend
  sshll v0.2d, v0.2s, #0
  ret
(e) NEON assembly

Listing 1: Code examples for x86, compiler-intrinsics, and NEON to extract the lower two 32-bit values from a 128-bit register and either sign- or zero-extend them to two 64-bit values in a 128-bit output register. Compiler-intrinsics produce the same assembly while using type- and platform-independent code (https://godbolt.org/z/4aoxEv14b). We use godbolt.org to support our claims with generated assembly.</p>
        <p>In hardware, a vector is stored as contiguous bits in a register, so element boundaries inside the vector depend on the operation. This allows for scenarios where input data is first shuffled as 8-bit values and then treated as 32-bit integers in the next operation. Common operations in database contexts are load and store, arithmetic, comparison, shuffling, gather/scatter, and widening/narrowing. For example, a column scan could load a vector of attribute values, compare them according to a filter predicate, and extract the indices of matching elements.</p>
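        <p>To illustrate this reinterpretation, the following sketch (our own example, using the compiler vector types discussed in Section 3) treats the same 16 bytes either as sixteen 8-bit lanes or as four 32-bit lanes:
#include &lt;cstdint&gt;
#include &lt;cstring&gt;
using Vec16u8 __attribute__((vector_size(16))) = uint8_t;
using Vec4u32 __attribute__((vector_size(16))) = uint32_t;

// Bit-preserving reinterpretation of the same 128-bit value.
Vec4u32 as_u32_lanes(Vec16u8 bytes) {
  Vec4u32 out;
  std::memcpy(&amp;out, &amp;bytes, sizeof(out));
  return out;
}</p>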
        <p>Platform-Intrinsics. There are multiple extensions to the x86 instruction set that add vector operations: SSE (Streaming SIMD Extensions) with four major versions introducing 128-bit registers, followed by AVX (Advanced Vector Extensions) with 256-bit registers, AVX2, and most recently AVX-512 with 512-bit registers. AVX-512 has a modular design based on a foundation specification and various sub-extensions adding new instructions. For all these extensions, the specification defines a C API with mnemonic function names (intrinsics) that allow using the vector instructions in higher-level languages. The Intel intrinsics documentation currently lists 6251 intrinsics from the SSE and AVX instruction families [11]. On ARM, the NEON vector extensions allow vector operations on 128-bit registers. Similar to x86, they define a C API. The vector types include the element width for integer types (e.g., uint32x4_t), but due to missing overloading in C, functions still encode the element type in their name, e.g., vaddvq_u32 and vaddvq_s32. The ARM intrinsics guide lists 4344 NEON intrinsics [12].</p>
        <p>Auto-Vectorization. Apart from designated vector operations in the source code, major compilers such as GCC, Clang, and ICX also perform auto-vectorization as an optimization step. They attempt to detect patterns in scalar code that can be vectorized and replace the code with a vectorized version [13]. Auto-vectorization may fail, typically either because the pattern is not supported or because some prerequisite for vectorization cannot be satisfied. There are commonly supported and documented approaches to circumvent these problems and help with auto-vectorization [14].</p>
      </sec>
      <sec id="sec-2-2">
        <title>3. Compiler-Intrinsics</title>
        <p>A disadvantage of platform-intrinsics is that they encode the type in the function name, as C does not support function overloading. This makes it hard to generically implement SIMD operations, as developers must use different functions for, e.g., int and long. Additionally, using platform-intrinsics does not guarantee that the corresponding instruction is actually chosen by the compiler. For example, Clang further merges instructions and performs constant propagation, even when using explicit x86 intrinsics (https://godbolt.org/z/M8zaqzj3r).</p>
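        <p>As a small sketch of the difference (our own illustration, mirroring the templated Vec alias used below): with platform-intrinsics, 32-bit and 64-bit additions require the distinct functions _mm_add_epi32 and _mm_add_epi64, whereas a compiler-vector version is written once and instantiated per type.
template &lt;typename T&gt;
using Vec __attribute__((vector_size(16))) = T;

// One definition covers int32_t, int64_t, uint32_t, ...;
// the compiler picks the matching add instruction per element type.
template &lt;typename T&gt;
Vec&lt;T&gt; add(Vec&lt;T&gt; a, Vec&lt;T&gt; b) {
  return a + b;
}</p>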
        <p>GCC and Clang provide vector extensions that allow declaring a vector type of n elements of a language base type, e.g., int [8, 9]. A code example that extracts the lower two 32-bit integers from a 128-bit register and sign- or zero-extends them to two 64-bit values is shown in Listing 1. For both x86 and NEON, we have to define one function per type, as the intrinsics differ for signed and unsigned integers. Overall, we end up with four distinct implementations. With compiler-intrinsics, we use templating and the compiler's ability to choose appropriate instructions for each platform. This allows for a single implementation that can be used on both platforms.</p>
        <p>Vectors are defined using GCC-style __attribute__(), specifying the size in bytes and the element's type. In this example, the vectors are 16 bytes for Vec and the template type T, i.e., a vector of 16/sizeof(T) elements. The input vector is cast to a HalfVec to extract the lower 2 values. We use the compiler's builtin convert method to widen the values to the requested output type, e.g., (u)int64_t in our example (specified in the OutT template argument). In Listing 1 d) and e), we see that the compiler-intrinsics produce the same assembly as the platform-intrinsics, without type- or platform-dependent code.</p>
        <p>In Listing 2, we show a more complete code example from our filtering scan benchmark (Section 4.5). This code scans an integer column, filters by less-than some value, and writes matching row ids to an output array.
1  // 16-Byte vector of 4x uint32_t.
2  using Vec __attribute__((vector_size(16))) = uint32_t;
3  // Same as Vec but without 16-byte alignment.
4  using UnalignedVec __attribute__((aligned(1))) = Vec;
5  // Vector of 4 bools (only available in LLVM).
6  using BitVec __attribute__((ext_vector_type(4))) = bool;
7
8  // Scan integer column and write matching row ids.
9  uint32_t dense_column_scan(uint32_t* column,
10                            uint32_t filter_val,
11                            uint32_t* __restrict out) {
12   uint32_t num_matches = 0;
13   for (uint32_t row = 0; row &lt; NUM_ROWS; row += 4) {
14     // Load and compare row.
15     Vec values = *(Vec*)(column + row);
16     Vec matches = values &lt; filter_val;
17
18     // Convert comparison result to a bitmask.
19     BitVec bitvec = __builtin_convertvector(matches, BitVec);
20
21     // Upper bits may contain random data
22     uint8_t bitmask = ((uint8_t&amp;) bitvec) &amp; 0xf;
23
24     // Get row offsets using lookup table.
25     Vec row_offsets = *(Vec*)
26         MATCHES_TO_ROW_OFFSETS[bitmask];
27
28     Vec compressed_rows = row + row_offsets;
29
30     // Write matching rows to output.
31     auto* out_vec = (UnalignedVec*)(out + num_matches);
32     *out_vec = compressed_rows;
33     num_matches += std::popcount(bitmask);
34   }
35   return num_matches;
36 }
Listing 2: Compiler-intrinsics code example for our filtering integer column scan. Evaluated in Section 4.5 as vec-add.</p>
        <p>In Line 2, we first define a vector of four 32-bit integers. As vectors assume alignment equal to their size, we specify an unaligned version in Line 4, which we use for unaligned stores. Clang supports special Boolean vectors, where each entry is only 1 bit (Line 6). Vectors are loaded and stored using casts (Line 15). Operations are expressed using common C++ operators, such as arithmetic or comparison operators (+, &lt;, ==, ...), on these types. Operators also support scalar operands, e.g., &lt; filter_val in Line 16. The ternary operator can be used for element-wise selection. For some complex operations that have no matching C++ operator, Clang and GCC provide helper functions, e.g., __builtin_convertvector() for narrowing or widening (Line 19). We cast the Boolean vector to a scalar bitmask and mask off the high bits (Line 22). We then use that bitmask to get row offsets from a lookup table and add them to the base row id (Lines 25–28). Finally, we use a cast to perform an unaligned store of the matching row ids (Lines 31–33).</p>
        <p>The compiler lowers these operations to the target platform. The goal of these compiler-intrinsics is to allow for platform-independent source code that performs well on all target platforms. Note that with minor templating, we could extend this method to support arbitrary vector sizes while maintaining only a single implementation. Depending on the given size, the compiler would generate the appropriate instructions for us. If developers want to use explicit platform-intrinsics, Clang allows mixing them with compiler-vector types. Operations that are not natively supported on the target platform can still be expressed in a canonical fashion. For example, up to AVX2, there is no intrinsic for the multiplication of vectors with 64-bit integer elements. Here, programmers need to fall back to multiplying and combining 32-bit halves of the input numbers. With compiler-intrinsics, programmers can use the C++ multiplication operator and the compiler inserts the required logic.</p>
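        <p>As a minimal sketch of this point (our own example, not benchmark code), a 64-bit element-wise multiplication can be written directly, and the compiler lowers it to a single instruction where the target supports it or to a 32-bit-halves sequence otherwise:
#include &lt;cstdint&gt;
using Vec4u64 __attribute__((vector_size(32))) = uint64_t;

// 64-bit element-wise multiply; no platform intrinsic needed.
Vec4u64 mul64(Vec4u64 a, Vec4u64 b) {
  return a * b;
}</p>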
      </sec>
      <sec id="sec-2-2b">
        <title>4. Benchmarks</title>
        <p>In this section, we perform four microbenchmarks to compare platform-specific SIMD code with scalar, auto-vectorized, and platform-independent variants. We evaluate multiply-shift hashing (Section 4.2), a hash bucket key lookup (Section 4.3), bit-packed decompression (Section 4.4), and an integer column filter scan (Section 4.5). We then present the impact of platform- and compiler-intrinsics in end-to-end TPC-H benchmarks in Velox, a state-of-the-art columnar query engine (Section 4.6). Our results show that in most benchmarks, vector code is needed to achieve peak performance and that when using vector code, compiler-intrinsics perform at least as well as platform-intrinsics. All of our code and all results can be found in our GitHub repository (https://github.com/hpides/autovec-db).</p>
        <p>4.1. Setup and Methodology</p>
        <p>The results presented in this section are based on an x86 Intel Icelake CPU (Xeon Platinum 8352Y) and on an M1 MacBook Pro 14" laptop with ARM NEON. We validate our results on various other x86 and ARM platforms and discuss this in Section 5. All experiments are performed using Clang/LLVM 15 (we refer to Clang for the C++ frontend and to LLVM for backend optimizations) with -march/-mtune=native optimizations and -O3. All experiments are run single-threaded and repeated ten times. We report the average runtime; the variance for all benchmarks is below 5%. We show the relative speedup of each variant over a naive scalar implementation without explicit vectorization efforts. For all microbenchmarks, we show the naive baseline (naive), a scalar code version that is optimized for auto-vectorization (autovec), our compiler-intrinsics versions (vec-*), and a platform-intrinsics version (sse4-*, avx2-*, avx512-*, neon-*). The platform-specific variants use manually selected intrinsics and also represent the usage of a SIMD library, which provides a thin abstraction on top of platform-intrinsics.</p>
        <p>In some experiments, the platform-specific variants perform worse than the ones using compiler-intrinsics. We note that it is nearly always possible to adapt the platform-intrinsics to match the compiler's code to achieve the same performance. However, we keep these worse results to demonstrate the complexity of manually selecting instructions from thousands of available ones. In these cases, we explicitly discuss the responsible instructions.</p>
        <p>4.2. Multiply-Shift Hashing</p>
        <p>Our first microbenchmark computes the hash values of the input numbers using multiply-shift hashing [15]. This is done, e.g., to determine the bucket in a hash index [16]. Each 64-bit input number is multiplied by a 64-bit constant, the result is truncated to 64-bits, and then shifted to the right. As the hash computations of the individual values are independent across loop iterations, programmers likely expect auto-vectorization to work well here. LLVM auto-vectorizes our naive implementation on x86, so we include an additional scalar bar that shows the naive implementation with disabled auto-vectorization. The results are shown in Figure 1.
[Figure 1: 64-bit multiply-shift hashing; speedup over scalar on a) x86 Icelake and b) M1 for the variants scalar, naive, autovec, vec-128, vec-256, vec-512, sse4-128, avx2-256, avx512, and neon.]</p>
        <p>On x86, the naive, autovec, and vec-512 variants produce optimal vectorized code. For smaller explicit vector widths, the performance degrades as we require more operations to process the same input. With 128-bit vectors, we observe a speedup of only ~10% over scalar.</p>
        <p>The avx512 variant does not achieve peak performance because we use the shifting intrinsic _mm512_srli_epi64 (compiled to vpsrlq), which requires an immediate operand that is known at compile-time. Compile-time constants can generally be used for more aggressive optimization. However, for the vec-512 variant, the compiler instead chooses the _mm512_srlv_epi64 (vpsrlvq) instruction that shifts all elements in input vector 1 by the amount specified by the corresponding element in input vector 2. The second operand needed in the hot loop is moved outside the loop, and in this case, the instruction that the compiler chooses is ~20% faster than the instruction with the immediate operand.</p>
        <p>With the avx2 and sse4 variants, we observe worse performance than with the corresponding compiler-variants using the same vector width. Our implementations only use instructions available in the instruction set that the code is targeted for. Vector multiplication with 64-bit integer elements was first introduced in AVX-512, so our code performs manual multiplication of 32-bit halves of the input using the approach that is also used in the vectorclass SIMD library [17]. LLVM is unable to optimize this to the proper multiplication instruction on the target architecture. This shows a core disadvantage of platform-specific code. Implementations tailored to a specific microarchitecture cannot benefit from new instructions and optimizations, while generic code can. When repeating the measurements on a Skylake CPU without AVX-512 support, we observe that the autovec, vec-256, and vec-512 variants are still 25% faster than the handwritten AVX2 implementation using the multiplication from the vectorclass library. The generated code only differs in how the multiplication is performed. In this case, using a specialized SIMD library even results in worse performance than the naive implementation.</p>
        <p>On M1, the LLVM cost model does not consider loop vectorization desirable for the naive implementation. For the autovec variant, we explicitly instruct the compiler to vectorize the loop via an annotation. LLVM generates very similar code for the vec-* variants, with nearly identical performance.</p>
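        <p>For reference, the core multiply-shift computation can be expressed with compiler-vectors roughly as follows; this is our own sketch, and the multiplier and shift amount are placeholders rather than the exact constants used in the benchmark.
#include &lt;cstdint&gt;
using Vec8u64 __attribute__((vector_size(64))) = uint64_t;

// Multiply-shift hash for 8 keys at once.
Vec8u64 multiply_shift(Vec8u64 keys) {
  const uint64_t multiplier = 0x9E3779B97F4A7C15ull; // placeholder constant
  return (keys * multiplier) &gt;&gt; 32;                  // placeholder shift
}</p>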
        <p>In this benchmark, we measure the performance of
finding an entry in a hash bucket that stores additional
fingerprints in a metadata block at the beginning of the hash
bucket. A fingerprint is a 1-byte hash value of a key. The
block of fingerprints can then be used to quickly skip
over keys that do not match the search key. This design is
common in index structures and used in, e.g., Facebook’s
F14 hash table, Google’s Abseil hash map, Velox, Dash,
and ART [18, 19, 7, 20, 21]. With vector operations, all
fingerprints can be compared to the fingerprint of the
search key in a single operation, allowing a direct jump
to the first key with a fingerprint match. Our hash bucket
holds 15 entries, each with a fingerprint, an 8-byte key,
and an 8-byte value. Fingerprints are stored in a
contiguous 16-byte array that is processed as a 128-bit vector.
The benchmark repeatedly performs key lookups in the
hash bucket with a 50% chance of a successful lookup.</p>
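        <p>The fingerprint probe of the *-bytemask variants can be sketched as follows; the bucket layout and helper names are our own illustration rather than code from F14, Abseil, or Velox.
#include &lt;cstdint&gt;
#include &lt;cstring&gt;
using FpVec __attribute__((vector_size(16))) = uint8_t;

// Compare all 16 fingerprint slots at once, then only inspect keys
// whose fingerprint byte matched (caller verifies the full key).
int find_candidate(const uint8_t* fingerprints, uint8_t probe_fp) {
  FpVec fps;
  std::memcpy(&amp;fps, fingerprints, sizeof(fps));
  auto matches = fps == probe_fp;   // 0xFF where equal, 0x00 otherwise
  for (int i = 0; i &lt; 15; ++i) {    // slot 16 is unused in our bucket
    if (matches[i]) return i;
  }
  return -1;
}</p>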
        <p>Our benchmark results are shown in Figure 2. The
naive-key-only variant ignores the fingerprints and
simply loops over the keys to find a match. The *-bytemask
variants perform a vector-comparison on the fingerprints
array, resulting in a vector of 16 mask-bytes with either
all one-bits or all zero-bits. They then loop over the
all-one-bytes in this vector and compare the
corresponding key. The *-bitmask variants first narrow the 16-byte
mask to a bitmask of 16 bits, allowing the result to fit in a
general-purpose register and simplifying the loop logic.</p>
        <p>On both x86 and M1, LLVM correctly auto-vectorizes
the comparison of the 15 fingerprints. However, it
generates suboptimal code when attempting to extract the
16 bytes as a __uint128_t value that could then be used
with the bytemask-logic [22]. The approach of narrowing
the 16 bytes to 16 bits also results in suboptimal code [23].</p>
        <p>Overall, while the fingerprint comparison is vectorized as
expected, we are unable to use the comparison result
efficiently, so the autovec implementation does not achieve
competitive performance.</p>
        <p>With explicit vector operations, the bitmask approach
is better on x86 due to simpler loop logic and a saved instruction.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4.4. Bit-Packed Integer Decompression</title>
        <p>In this benchmark, we evaluate the performance of decompressing 100k bit-packed integers. The variants are
based on SIMD-Scan [3] and an implementation in Velox
[26]. The input consists of 9-bit packed values, which
expand to 32-bit in the output vector, as in the SIMD-Scan
paper. The results are shown in Figure 3.</p>
        <p>The naive variant operates on 9 bits at a time,
expanding and storing them, and then shifts in the next 9 bits.</p>
        <p>The autovec variant performs a scalar expand+store, but
in a loop over 32 elements without loop-carried
dependencies, allowing the compiler to auto-vectorize
operations. The vec/neon/sse4-* variants perform the
SIMD-Scan algorithm for 128 bits as in the original paper. It
loads a 128-bit vector, i) shuffles to move the required bytes into the lanes, ii) shifts right to move the relevant bits for each number to the beginning of the lane, and iii) masks off leftover bits using bitwise-and. The extended
256 and 512-bit variants use the same approach with
wider registers. The avx2-256 variant uses two 128-bit
registers since 256-bit registers are separated into two
128-bit lanes with no support for bytewise shuffles across
lanes. The pdep variant implements the algorithm used
5https://godbolt.org/z/hTjnWMr3n
6https://godbolt.org/z/PeWGW6rsj
7Merged, available upstream since cd68e17.
in Velox with the x86 pdep instruction [27] to extract multiple compressed n-bit values out of 64-bit data.
Figure 4: Filtering integer column scan (32 000 values).</p>
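        <p>The scalar kernel behind the autovec variant can be sketched as follows; this is our own illustration (hypothetical function name, little-endian layout assumed), not the Velox or SIMD-Scan code.
#include &lt;cstdint&gt;
#include &lt;cstring&gt;

// Unpack 32 consecutive 9-bit values into 32-bit integers. Each output
// depends only on its own bit range, so there are no loop-carried
// dependencies and the compiler may auto-vectorize the loop.
void unpack9_block(const uint8_t* in, uint32_t* out) {
  for (int i = 0; i &lt; 32; ++i) {
    uint32_t bit = 9u * i;
    uint16_t window;
    std::memcpy(&amp;window, in + bit / 8, sizeof(window)); // 9 bits always fit in 2 bytes
    out[i] = (window &gt;&gt; (bit % 8)) &amp; 0x1FFu;
  }
}</p>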
        <p>LLVM auto-vectorizes better for x86 than for NEON8, combines them with a platform-independent
optimizaalthough both achieve the same ~5× speedup. It detects tmioanp. tHoodwifeerevnetr,LwLhVeMn oups-incogdtehse, wNhEiOchNa-rinetnriontsdiceste,cthteedse
the pattern on x86 and auto-vectorizes it with gather, in this optimization step, resulting in worse code. This
shift, and bitwise-and instructions. For NEON, LLVM shows that the compiler can detect the optimization for
emits unrolled scalar code.</p>
        <p>On x86, comparing the vec variants with the platform- two shifts in general, but it currently cannot detect it
when the developer explicitly requests NEON-intrinsics.</p>
        <p>The 256-bit variant is compiled to two 128-bit
operations with equal performance, while the 512-bit
variant produces ineficient scalar code and performs
significantly worse on NEON.
intrinsic variants shows multiple things. The 128-bit
variants perform equally, as they produce identical assembly.
vec-256 is faster than vec-128 due to higher parallelism.</p>
        <p>It is also faster than our avx2 variant, as the compiler
selects more eficient instructions than we do. The
avx2256 variant processes two 128-bit registers to solve the
problem that no cross-lane byte-level shufle instruction 4.5. Filtering Integer Column Scan
is available for 256-bit registers in AVX2. For this reason, In our final microbenchmark, we show the performance
it performs the shifting and bit-masking twice as often of an integer scan with a filter on 32k values, inspired
compared to an algorithm using a 256-bit register. The by implementations in Velox [28] and Hyrise [29]. All
vec-256 variant, on the other hand, is compiled to use values in the column are represented as 32-bit integers
256-bit registers by first performing a cross-lane immedi- and we filter by  &lt;  with a selectivity of 50%. The
ate permute operation and then using an in-lane shufle. matching 32-bit RowIDs are written to an output vector.
Here, LLVM utilizes the fact that for each of our shufle The results are shown in Figure 4.
operations, all selected indices are from a contiguous 16- The naive variant implements a for-loop and writes to
byte range, so it is able to store all required source-bytes the output vector if the filter matches. The autovec
variin both 128-bit lanes. Overall, both variants require the ant does the same but uses a branchless implementation,
same number of instructions for shufling, but the vec-256 i.e., out_idx += val &lt; x. This always writes to the
outvariant uses half as many for shifting and masking. put but overwrites the current value in the next iteration</p>
        <p>The avx512 variant outperforms its vec-512 counter- if it did not match. The *-shufle variants create a bitmask
part, as Clang inserts an and-instruction to mask the (4-bit in 128, 16-bit in 512) from the comparison result
shufle input in the frontend, which the backend cannot vector and perform a table lookup (24 or 216 entries)
remove, resulting in more instructions in the tight loop. with that mask to retrieve the indices with which the
The vec-256 and avx512 variants perform equally, as the current iteration’s RowIDs are shufled. If positions 1 and
CPU retires twice as many 256-bit instructions per cycle 3 match, (ids[1], ids[3], ∘ , ∘ ) is written to the output,
as 512-bit, compensating for the lower parallelism. where the ∘ ’s are overwritten in the next iteration. The</p>
        <p>The vec implementation of SIMD-Scan outperforms
sse4-pdep found in Velox, while not using any platform- *-add variants do not shufle but use the bitmask to look
up which RowIDs must be added to the current iteration’s
start RowID . If positions 1 and 3 match, ( + 1,  + 3, ∘ ,
specific code. Overall, we again see that the vec-* variants
perform on par with the platform-intrinsic variants.</p>
        <p>With NEON, we see that the 128- and 256-bit compiler- ∘ ) is written. This saves one shufle instruction but only
works for contiguous input RowIDs. The vpcompressd
intrinsics outperform the NEON-intrinsics, as the com- variants use the AVX-512 vpcompressd [30] instruction
piler combines the two shift instructions with compile- that shifts all matching values to the beginning of the
time constants that are executed independently in the SIMD register and stores them to memory. We also
evalneon variant9. During instruction selection, LLVM recog- uate a version with the SSE4 pext [31] instruction, but
nizes the vector shift operations in our vec variant and this is slower than the more general approaches.
Unlike the other benchmarks, auto-vectorization and compiler-intrinsics do not achieve the performance of platform-intrinsics on x86. The special vpcompressd instruction performs the required logic and achieves significantly better results by combining the lookup and shuffle in a single instruction. However, there is currently no way to make LLVM produce this instruction using auto-vectorization or compiler-intrinsics [32, 33]. autovec significantly outperforms naive because of branchless execution. The compiler unrolls the naive loop five times but converts the write inside an if-branch into five branches in assembly, resulting in high runtime costs for branch mispredictions10.
8https://godbolt.org/z/ef7WMPMT4
9https://godbolt.org/z/5bsv3qWT6</p>
        <p>[Figure 5: End-to-end TPC-H benchmark (19 supported queries) with SF1 from Parquet in Velox; runtime in ms per query for vec and xsimd on a) x86 Icelake and b) M1.]</p>
        <p>For the *-shuffle/add variants, the 512-bit variants outperform the 128-bit ones by only 2×. The higher paral
lelism is counterbalanced by the need for a bigger lookup
table of up to 4 MB that causes cache misses. With the time difers by 1.3% on Icelake and around 0.13% on M1.
compress variants, we observe a 3.4× speedup from 128- However, the diference on Icelake is not caused by SIMD
to 512-bits, as there is no lookup table and the 128- and code changes but by code alignment issues [36].
Explic512-bit instructions have the same latency and through- itly adding eight nop padding bytes before a single hot
put [30], benefiting the higher parallelism of the 512-bit loop in a scan operator reduces the performance
difervariant. For the 128-bit add-based variants, LLVM gen- ence to 0.1%. Code-alignment optimizations are highly
erates slightly better assembly with three fewer instruc- platform-specific and out of scope of this work. We use
tions when using compiler-intrinsics instead of SSE. them to illustrate that performance diferences at this</p>
        <p>For NEON, we see that the vec-128 variants achieve the level have various causes that must be considered when
same performance as the NEON implementation with optimizing. Overall, there is no performance impact
our patched LLVM [25]. Without the patch, both sufer when switching from platform- to compiler-intrinsics.
from the conversion-to-bitmask problem as in the hash We also perform a run without any explicit SIMD
probucket lookup (Section 4.3). For the shufle variant, LLVM cessing by replacing all compiler-intrinsics with
fixedalso generates worse shufle code that uses element-wise sized array operations. This decreases the overall
perforinsertion. We submitted an LLVM patch to fix this using mance by 4%, i.e., explicit SIMD code in Velox achieves
NEON’s TBL instruction11 (dashed part of bar) [34]. As 4% performance gains for TPC-H on average. However,
NEON does not have 512-bit registers, the improved con- these gains are not distributed uniformly. While some
version to bitmask and shufle cannot be used in vec-512, queries are not afected at all, one drops by 24.9%. In
resulting in worse assembly. Table 1, we show Q15, Q16, and Q18, which exhibit the
largest diferences to the xsimd baseline when disabling
all SIMD code (novec).</p>
        <sec id="sec-2-3-1">
          <title>4.6. SIMD in Velox</title>
          <p>Q15
Q16</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>In this benchmark, we evaluate the performance of com</title>
        <p>piler-intrinsics in an end-to-end system setup, running novec vec novec vec vec
TPC-H queries in Velox [7]. To show the diference x86 –5.0% –3.4%* –3.8% +0.2% –0.1%
between platform- and compiler-intrinsics, we replace M1 +0.1% +0.2% –2.5% +0.2% –0.2%
Velox’ xsimd [4] dependency with Clang’s vector intrin- Table 1: Performance diference from xsimd TPC-H baseline
sics. We run 18 out of 19 supported TPC-H queries in to no SIMD code (novec) and compiler-intrinsics (vec).
Velox12 with scale factor 1. All queries are executed ten
times with Velox’s default TPC-H configuration of four For Q15, vectorization does not impact M1, i.e., all
threads and we report the average runtime. We compile three variants perform similarly. On x86, the vec
variwith Clang/LLVM 15 on x86 and with our patched ver- ant is 3.4% slower with the default code alignment but
sion on M1, using -O3 and -march/mtune=native flags. only 0.3% slower with the hot loop padding (marked * in
The results are shown in Figure 5. Table 1). For Q16, novec is worse than xsimd, while vec</p>
        <p>For both x86 and M1, our vec implementation performs performs marginally better on both x86 and M1. Q18 is a
equally to Velox’s original xsimd code. The average run- query with high vectorization benefits. On both x86 and
M1, vec achieves the same performance as the
handwrit1110hMtetprgs:e/d/g, oadvbaoillatb.olergu/pz/shtr9e6aPm7ss1i7n3ce 267d6d6. ten xsimd variant, closing the 25% and 14% gaps. Overall,
12We omit Q21 as Velox produces wrong results with scale factor vec performs better than xsimd in 7/18 queries on x86
10 [35]. Overall, SF10 shows the same performance trends as SF1. (with padding), and in 5/18 on M1.</p>
        <p>Q18
novec
–24.9%
–14.3%</p>
        <p>In the process of replacing xsimd, we removed 54
platform-specific functions across ten function groups,
guarded via #ifdef &lt;PLATFORM&gt; directives. Of these ten
groups, five had multiple implementations for diferent
data types (e.g., int32_t and int64_t), averaging at 3.6
type-specific implementations per group. Overall, our
templatable compiler-intrinsics code removed hundreds
of lines of code and most type-specific variants while
achieving the same performance. We also discovered
a bug, in which an optimized x86 bit-packing decoder
algorithm was not used instead of a scalar fallback, as a
platform-specific macro was not included correctly [ 37].</p>
        <p>This highlights the complexity of correctly handling
multiple platforms and target CPUs in query engines, which
can be avoided with compiler-intrinsics.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Performance Generalizability</title>
      <p>In Section 4, we focus on two machines and LLVM to
perform our evaluation in depth. To support our claims and
show the applicability of our approach to a wider range
of setups, we show performance results across various
x86 and ARM machines in Section 5.1. In Section 5.2, we
show results for GCC as an additional major compiler
that offers compiler-intrinsics.
we occasionally encountered performance outliers of up
to 100%. Our microbenchmarks contain very tight loops
with only a few instructions, so code alignment has a
large performance impact. In one case, the loop was
aligned to 32 bytes before a change and 16 bytes after
a change. Due to this alignment, Intel’s micro-op cache
could not fully cache the hot loop, leading to higher
pressure on the instruction decoder.</p>
      <p>These alignment issues are highly system-specific [ 36]
and such large outliers could invalidate our conclusions,
so we run the benchmarks on multiple x86 and multiple
ARM servers for validation. For x86, we run the
experiments on the Intel Icelake server mentioned in Section 4.1,
on an Intel Cascadelake CPU (Xeon Gold 5220R), on an
Intel Skylake laptop CPU with AVX2 (i5-6200U), and on
an AMD Rome CPU with AVX2 (EPYC 7742). For NEON,
we run on a M1 MacBook Pro (see Section 4.1), a Graviton
2 CPU (t4g.xlarge), a Graviton 3 CPU (c7g.xlarge), and
a Raspberry Pi 4 (ARM Cortex-A72). All NEON
benchmarks are compiled with a current LLVM version (April
’23, commit cd68e17), which contains our patches. We
show the results for the bit-unpacking benchmark in
Table 2. The other benchmarks show similar trends, so we
omit them for space reasons.</p>
      <p>For x86, we see that the performance trends discussed
for Icelake also hold for other x86 machines. Across all
systems, we see that most vec-* variants achieve the same
or better performance than the x86-specific ones.
Systems that do not have the necessary AVX512 instructions
perform poorly with vec-512, as the logic is not easily
translatable to AVX2 instructions. On all servers,
explicit vectorization outperforms naive and autovec. On
AMD Rome, the sse4-pdep variant, as used in Velox,
performs significantly worse than any other variant. This is
caused by the pdep instruction being significantly slower
on AMD than on Intel CPUs [38].</p>
      <p>On ARM, we see a wider absolute performance range.
The Raspberry Pi naive baseline is 4× slower than M1,
while the Graviton CPUs are 1.5–2× slower. Regardless
of absolute performance, all machines benefit from
vectorization. As NEON does not have 256-bit registers, the
vec-256 code is split into multiple 128-bit instructions.
Graviton 2 does not scale well here, as it has only two
SIMD units per core, while M1 and Graviton 3 have four.</p>
      <p>Overall, the results show that our insights and
conclusions hold on a wider range of systems. We see that
in nearly all cases, our compiler-intrinsics approach
performs on par with or outperforms the platform-intrinsics.</p>
      <sec id="sec-3-1">
        <title>5.2. Across Compilers</title>
      </sec>
      <sec id="sec-3-2">
        <title>5.1. Across Platforms</title>
        <sec id="sec-3-2-1">
          <title>In our evaluation, we focus on Clang/LLVM. While many</title>
          <p>In this section, we discuss the generalizability of our projects focus on one compiler, it is common to use
mulresults across multiple platforms to avoid the impact of tiple compilers for testing. If features and built-in
funcmeasurement bias. While developing the benchmarks, tions diverge, developers have to write multiple
compiler</p>
          <p>M1 Graviton 3 Also, GCC produces slower code for the vec-256 variants
1 (110 µs) on the ARM platforms.</p>
          <p>1.2x On Icelake, the 512-bit variants vec-512 and avx512
7.2x perform 20% worse than vec-256. Code inspection shows
1.1x that these variants produce identical assembly except for
0.9x vector width and ofset values. Thus, we attribute the
- lower performance of the 512-bit variants to
microarchi- tectural details. With LLVM, these variants are compiled
- to a slightly diferent instruction order, where vec-256
- and avx-512 achieve similar performance.</p>
          <p>6.5x On Rome, vec-128 performs a bit worse than sse4-128.
(a) Bit-packed integer decompression. This is caused by difering instruction selection. In
vec128, we use a shift-operation on vectors to align the
num</p>
          <p>IcIenltaekle RAoMmDe M1 Grav. 3 sbheirfst iinnsstirduecttihoenl.aInnehsa,nwdhwircihtteGnCcCodceo,
mthpisilceosrtroesapovnedctsotronaive 1 (156 µs) 1 (143 µs) 1 (97 µs) 1 (99 µs) the _mm_sllv_epi32 instruction, which is only available
autovec 8.2x 7.4x 8.6x 5.0x in AVX2. So we instead use a multiplication instruction
vec-128-shufle 7.9x 8.1x 9.1x 3.9x (_mm_mullo_epi32) in sse4-128, which is slightly faster.</p>
          <p>vec-128-add 8.1x 8.7x 10.2x 4.4x An improvement of GCC’s instruction selection would
vec-512-shufle 7.8x 1.9x 3.4x 1.9x remove this diference.</p>
          <p>vec-512-add 7.9x 4.7x 5.6x 2.5x For the filtering column scan, computing the
matchsse4ss-1e248-1-s2h8-uaflded 141.40.x1x 181.90.x8x -- -- ing element bitmask is a fundamental operation. As
vpcompressd-512 38.2x - - - described above, GCC does not support vector-bitmask
neon-shufle - - 12.3x 6.0x types, and thus does not allow expressing
bitmask-conneon-add - - 12.8x 6.0x versions. We circumvent this by accessing all
bitmaskrelated logic through a helper function that uses the
(b) Filtering column scan. NEON-approach of bitwise masking and horizontal
comTable 3: Results for compilation with GCC on multiple x86 bining if compiled with GCC. In isolation, this approach
and NEON platforms. Speedup relative to naive. does not result in good performance on x86 platforms that
have native movemask operations13. In a performance
critdependent versions instead of platform-dependent ones, ical path, programmers would have to implement bitmask
reducing the benefit of compiler-intrinsics. In this sec- conversion for each platform.
tion, we discuss the portability across compilers. The code generated by our generic implementation</p>
          <p>The vector extensions using the vector_size attribute causes a significant performance drop compared to the
were initially specified by GCC and later adopted by platform-specific variants. On x86, the sse4-128 variants
Clang. Additionally, Clang supports a ext_vector_type outperform the vec-128 variants by 1.7× on Icelake and
attribute that allows specifying bitmask types. As of June 2.2× on Rome. Our function to compute bitmasks scales
2023 (version 19.35), Microsoft’s MSVC compiler does in complexity with the number of vector elements. For
not support the vector_size attribute. Intel’s ICX com- a 128-bit vector of 32-bit integers, GCC compiles it to 8
piler is based on LLVM and supports both vector_size instructions, while requiring 40 for a 512-bit vector. Due
and ext_vector_type. As of June 2023 (version 13.1), to this additional overhead, the vec-512 variants do not
GCC does not support the ext_vector_type attribute outperform the vec-128 ones.
and thus ofers no way to natively express conversions On ARM, the compiler-intrinsics and
platform-intrinfrom or to bitmasks. sics difer by a smaller, but still significant amount. The</p>
          <p>We repeat our benchmarks using GCC 12.2 to test if handwritten NEON-variants are 30% and 40% faster on
other performance problems occur. The results of the M1 and Graviton 3, respectively. GCC correctly converts
integer decompression and filtering column scan bench- the shufle to a TBL instruction, but struggles with the
marks for a selection of platforms are shown in Table 3. bitmask, as for x86.</p>
          <p>The other benchmarks exhibit similar characteristics, so
we omit them here for space reasons. 6. Discussion</p>
          <p>With the bit-packed integer decompression
benchmark, the results are similar to the results observed with Our results show that in most of the evaluated database
LLVM with a few subtle diferences. GCC’s auto-vectori- operations, platform-specific SIMD code does not achieve
zation performs worse on Rome and the ARM platforms.
1 template &lt;typename T, typename FilterOp&gt;
better performance than platform-independent SIMD 2 void filter(T* in, T filter_val, T* out, FilterOp op) {
code. In 7/8 microbenchmarks, compiler-intrinsics per- 3 int num_elements = sizeof(Vec&lt;T&gt;) / sizeof(T);
form on par with or better than platform-intrinsics. Auto- 4 for (int i = 0; i &lt; N; i += num_elements) {
vectorization alone does not achieve this, providing good 5 // Vec&lt;T&gt; is a templated version of Vec (Listing 2)
wpeitrhfosrtmataen-ocfe-tihneo-anrltyc2o/m8 pbielenrcsh, mexaprlkicsi.t WSIeMcDonccolduedies tnheact- 876 /VV/eecc&lt;&lt;FTTi&gt;&gt;ltmveaartlcsihs=esp(la=Vtefoco&lt;pr(Tv&gt;ma&amp;-l)i,nidfne[ipile]tn;edre_nvtala)n;d templatable.
essary to achieve high performance, and we show that 9 // n variants only needed for special instructions.
this is possible without implementing multiple versions 10 uint64_t mask = compress_store(vals, matches, out);
of the same code for diferent platforms and data types. 11 out += std::popcount(mask);
12 }
13 }</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>6.1. Using Compiler-Intrinsics 14</title>
        <p>There are cases in which platform-intrinsics are neces- 1165 t#eimfpdleafte__&lt;AtVyX5p1e2naFm_e_ T&gt;//uiWnet6c4a_ntuse compressstore.
sary. Therefore, we recommend approaching writing 17 compress_store(Vec&lt;T&gt; vals, Vec&lt;T&gt; matches, T* out) {
SIMD code as follows. First, developers should try to 18 uint64_t mask;
rely on auto-vectorization, because this code is the most 19 // When T is 32-bit
portable and often easiest to understand. If this does not 20 if constexpr (sizeof(Vec&lt;T&gt;) == 16) {
yield good results or is not possible, compiler-intrinsics 2212 m_amsmk_m=as_km_mc_ommopvreepsis3s2t_omraesuk_(empait3c2(hoeus)t,; mask, vals);
should be used to a) structure the overall SIMD code and 23 } else if (sizeof(Vec&lt;T&gt;) == 32) { // _mm256...
b) write portable SIMD algorithms. Only in cases where 24 } else if (sizeof(Vec&lt;T&gt;) == 64) { // _mm512...
the desired logic cannot be expressed portably, devel- 25 } return mask;
opers should use platform-intrinsics as small, localized 26 }
performance fixes. 2287 t#eemlpsleate//&lt;tWyepecnaan'met Tu&gt;seuicnotm6p4r_etssstore.</p>
        <p>We show this approach in a small example in Listing 3, 29 compress_store(Vec&lt;T&gt; vals, Vec&lt;T&gt; matches, T* out) {
in which we store matching values based on a filter pred- 30 // Fallback shuffle + store similar to Listing 2.
icate, similar to the benchmark in Section 4.5. In our 31 ...
benchmarks, we observe that AVX512’s compressstore 32 }
is very eficient in storing only matching values. To lever- 33 #endif
age this via compiler-intrinsics, developers can write a Listing 3: Use compiler-intrinsics to structure code, represent
single platform- and type-independent templated method vectors, and for common operations. Only switch to localized
fpoler,ththeegseanmerealf ufiltnecrtiniognlocgainc b(Leinuessed1–fo13r)i.nItnegoeurrse, xfloaamts-, vpalartiafonrtmsf-oinrttrhinessicasmfeorfisltpeercinifgicloingsitcr.uctions without multiple
and longs in 16-, 32-, or 64-byte vectors on x86, ARM, and
other platforms. We distinguish between platforms and to test the generic 512-bit “AVX512” vector logic on an
types only for the special compressstore instruction AVX2 or even an ARM server. While the generated code
in Line 10. Depending on the platform, we choose the may be ineficient, the compiler guarantees to generate
AVX512 implementation (Lines 16–26) or a fallback im- valid code for the given input regardless of the available
plementation (Lines 28–32). Even in this small example, instructions. Second, as our benchmarks show, staying
we see that supporting all variants of compressstore in the compiler’s own type domain allows for better
opleads to many code paths, i.e., three vector sizes and mul- timization in some cases. Third, with minimal code
overtiple element sizes. However, this complexity is localized head, developers can structure SIMD code according to
and does not leak into the filtering operation. their needs without adapting to the interface given by
the library or relying on a third-party dependency at all.
6.2. Advantages of Compiler-Intrinsics As SIMD code is written for high performance,
develThe approach described in Section 6.1 is also achievable opers still need to benchmark and evaluate their code
via SIMD libraries due to their vector abstractions and continuously. Even with compiler-intrinsics, the
comimplemented as such, e.g., in Velox. However, relying piler should be instructed to generate optimized code via
on compiler-intrinsics over libraries has three key ad- -march/mtune flags, and the code should be profiled and
vantages. First, development and testing becomes easier. updated if new compiler versions ofer new functionality.
When relying on a SIMD library that wraps platform- This is also the case when using platform-intrinsics but
intrinsics, functions can only be compiled on machines without the benefit of automatically adapted instructions
that support these intrinsics, e.g., testing AVX512 code on new platforms.
on an AVX2 server is not possible in general. When
expressing the logic via compiler-intrinsics, it is possible</p>
      </sec>
      <sec id="sec-3-4">
        <title>6.3. Open Challenges</title>
        <sec id="sec-3-4-1">
          <title>Compiler-intrinsics perform well in our experiments and</title>
          <p>Velox, but a few things still require improvement. GCC’s
missing support for vector-to-bitmask conversion forces
us to use workarounds with suboptimal performance.</p>
          <p>LLVM does not generate good vectorized code for some
common (x86-) patterns. For example, an x86 gather
is detected in some cases, but not in others, while an
analogous scatter is not detected at all14. Clang
supports the ternary operator for element-wise selection
but LLVM does not yet combine it with mask-able
instructions available in, e.g., AVX51215. Overall,
compilerintrinsics support x86 better than NEON due to wider
adoption. The documentation of compiler-intrinsics is
not as broad as that of platform-intrinsics, as they are
mostly used internally. Better instruction selection and
documentation as well as more features will increase the
applicability of compiler-intrinsics, further reducing the
need for platform-intrinsics.</p>
          <p>SVE and RISC-V V implement vector length
agnostic programming approaches where vector sizes may be
determined at runtime. C++ code using the proposed
compiler-intrinsics can conceptually be written to
support arbitrary vector sizes. However, when this is done
using templating, the compiler still typically computes
constants such as loop increments at compile time. The
programming model of variable length vectors also
emphasizes usage of predicate/mask registers to decouple
the available input element count from actual vector sizes.</p>
          <p>This allows removing any epilogue logic handling
leftover elements. To our knowledge, there is currently no
way to express runtime dynamic vector sizes or
predicated vector operations with compiler intrinsics in GCC 8. Conclusion
and Clang. With wider adoption of variable length vector
instructions, we expect compilers to add more support In this paper, we make the case for writing less
platformfor these concepts. specific SIMD code in databases. Instead of requiring</p>
        <p>Platform-independent vector extensions provided by major compilers are not widely known in the database community, and thus, they are not widely used. We believe that higher adoption will lead to more features and better instruction selection, solving some of the open challenges. Our results show that platform-dependent SIMD code is often not needed, so we encourage database developers to use compiler-intrinsics in their code.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Related Work</title>
      <p>In this section, we briefly discuss related work on SIMD programming in databases and on SIMD libraries.</p>
      <p>SIMD in Databases. There is a wide range of research on vectorization in databases [5, 1, 39, 3, 40, 41, 2, 42]. This research shows that vectorization is beneficial and serves as a motivation for our work. Based on some of it, we conduct microbenchmarks and show that they can often be implemented with platform-independent SIMD code. In general, research on database vectorization is orthogonal to our work, and we think that it should consider platform-independent optimizations in the future.</p>
      <p>Some work makes use of many platform-specific instructions [43, 44], making it hard to fit into our platform-independent approach. There are cases where this is beneficial, but we recommend using these platform-specific algorithms only if they have been shown to be better.</p>
      <p>SIMD Libraries. Various SIMD libraries exist that help developers write platform-independent code by hiding the platform-specific intrinsics behind an abstraction layer [4, 45, 46, 5, 6, 17]. To varying degrees, they cover the standard operations supported on all platforms and offer additional interfaces for platform-specific instructions. Some of these libraries also provide higher-level functions that combine instructions to offer more features, which is orthogonal to our work. However, all of these libraries add a layer of abstraction on top of the compiler’s platform-intrinsics. In contrast, we propose to remove a layer of abstraction and work directly with compiler-intrinsics instead. Our approach reduces code complexity and occasionally benefits from better compiler optimizations.</p>
      <p>Another approach is to add a SIMD abstraction to the programming language’s standard library [47, 48, 49]. This supports our claim that SIMD programming should be platform-independent.</p>
    </sec>
    <sec id="sec-5">
      <title>8. Conclusion</title>
      <p>In this paper, we make the case for writing less platform-specific SIMD code in databases. Instead of requiring multiple platform-specific implementations for the same logic, compiler vector intrinsics allow developers to write a single version that is optimized by the compiler for every target platform. Our evaluation shows that this approach achieves the same or better performance in most of our microbenchmarks and a real system. While challenges remain, adopting compiler-intrinsics in databases leads to less logical code duplication and fewer platform-specific SIMD variants.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially funded by the German Ministry for Education and Research (ref. 01IS18025A and ref. 01IS18037A), the German Research Foundation (ref. 414984028), and the European Union’s Horizon 2020 research and innovation programme (ref. 957407).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[26] Meta Inc., Velox bitpackdecoder [source code], https://github.com/facebookincubator/velox/blob/3f3aa92255aa514e6995ce8ba0d0e849a8beccdc/velox/dwio/common/BitPackDecoder.h#L315, 2023.</p>
      <p>[27] Intel Intrinsics Guide, Intel x86 pdep instruction, https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=pdep, 2023.</p>
      <p>[28] Meta Inc., Velox dictionary column visitor [source code], https://github.com/facebookincubator/velox/blob/52869442bd3710255df01c20914b65dcf842116b/velox/dwio/common/ColumnVisitors.h#L784, 2023.</p>
      <p>[29] M. Dreseler, Hyrise table scan [source code], https://github.com/hyrise/hyrise/blob/f7ad570cc30208c71593c0d7f609e34c0f60decb/src/lib/operators/table_scan/abstract_table_scan_impl.hpp#L83, 2019.</p>
      <p>[30] Intel Intrinsics Guide, Intel x86 vpcompressd instruction, https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vpcompressd, 2023.</p>
      <p>[31] Intel Intrinsics Guide, Intel x86 pext instruction, https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=pext, 2023.</p>
      <p>[32] Clang Team, Clang language extensions: Vectors and extended vectors, https://clang.llvm.org/docs/LanguageExtensions.html#builtin-functions, 2023.</p>
      <p>[33] D. Bolvanský, Missed ’compress’ codegen opportunity, https://github.com/llvm/llvm-project/issues/42210, 2019.</p>
      <p>[34] L. Benson, [aarch64] use neon’s tbl1 for 16xi8 build vector with mask, https://reviews.llvm.org/D146212, 2023.</p>
      <p>[35] L. Benson, Tpc-h q21 wrong results with sf10, https://github.com/facebookincubator/velox/issues/4312, 2023.</p>
      <p>[36] T. Mytkowicz, A. Diwan, M. Hauswirth, P. F. Sweeney, Producing wrong data without doing anything obviously wrong!, ACM SIGPLAN Notices 44 (2009) 265–276. URL: https://dl.acm.org/doi/10.1145/1508284.1508275. doi:10.1145/1508284.1508275.</p>
      <p>[37] L. Benson, Bitpackdecoder not using avx2 code, https://github.com/facebookincubator/velox/issues/4313, 2023.</p>
      <p>[38] A. Abel, uops info pdep, https://uops.info/table.html?search=pdep&amp;cb_lat=on&amp;cb_tp=on&amp;cb_uops=on&amp;cb_ICL=on&amp;cb_ZEN2=on&amp;cb_measurements=on&amp;cb_base=on&amp;cb_bmi=on, 2023.</p>
      <p>[39] O. Polychroniou, K. A. Ross, VIP: A SIMD vectorized analytical query engine, The VLDB Journal 29 (2020) 1243–1261. URL: https://doi.org/10.1007/s00778-020-00621-w. doi:10.1007/s00778-020-00621-w.</p>
      <p>[40] C. Balkesen, J. Teubner, G. Alonso, T. Ozsu, Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware, in: ICDE ’13, IEEE, 2013, pp. 362–373. URL: https://doi.org/10.1109/ICDE.2013.6544839. doi:10.1109/ICDE.2013.6544839.</p>
      <p>[41] C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. Di Blas, P. Dubey, Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs, Proceedings of the VLDB Endowment 2 (2009) 1378–1389. URL: https://dl.acm.org/doi/10.14778/1687553.1687564. doi:10.14778/1687553.1687564.</p>
      <p>[42] A. Afroozeh, P. Boncz, The FastLanes Compression Layout: Decoding &gt;100 billion integers per second with scalar code, PVLDB 16 (2023) 2132–2144. URL: https://ir.cwi.nl/pub/32992.</p>
      <p>[43] A. Ungethum, J. Pietrzyk, P. Damme, D. Habich, W. Lehner, Conflict Detection-Based Run-Length Encoding - AVX-512 CD Instruction Set in Action, in: 2018 IEEE 34th International Conference on Data Engineering Workshops (ICDEW), 2018, pp. 96–101. doi:10.1109/ICDEW.2018.00023, ISSN: 2473-3490.</p>
      <p>[44] M. Dreseler, J. Kossmann, J. Frohnhofen, M. Uflacker, H. Plattner, Fused Table Scans: Combining AVX-512 and JIT to Double the Performance of Multi-Predicate Scans, in: 2018 IEEE 34th International Conference on Data Engineering Workshops (ICDEW), IEEE, Paris, 2018, pp. 102–109. URL: https://ieeexplore.ieee.org/document/8402027/. doi:10.1109/ICDEW.2018.00024.</p>
      <p>[45] M. Kretz, Vc: Vector classes for C++ [source code], https://github.com/VcDevel/Vc, 2023.</p>
      <p>[46] I. Yermalayeu, Simd library [source code], https://github.com/ermig1979/Simd, 2023.</p>
      <p>[47] M. Kretz, C++ simd library, https://en.cppreference.com/w/cpp/experimental/simd, 2023.</p>
      <p>[48] M. Kretz, Data-parallel vector types &amp; operations, https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r9.pdf, 2018.</p>
      <p>[49] H. Sivonen, Portable packed simd vector types, https://github.com/rust-lang/rfcs/pull/2948, 2020.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>