<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adrian Riedl</string-name>
          <email>adrian.riedl@in.tum.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Fent</string-name>
          <email>fent@in.tum.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maximilian Bandle</string-name>
          <email>bandle@in.tum.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Neumann</string-name>
          <email>neumann@in.tum.de</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <issue>2023</issue>
      <abstract>
        <p>Efficiently evaluating text pattern matching is one of the most common computationally expensive tasks in data processing pipelines. Especially when dealing with text-heavy real-world data, evaluating even simple LIKE predicates is costly. Despite the abundance of text and the frequency of string-handling expressions in real-world queries, text processing is an afterthought for most systems. We argue that we must instead properly integrate text processing into the flow of DBMS query execution. In this work, we propose a code generation approach that specifically tailors the generated code to the given pattern and matching algorithm and integrates cleanly into DBMS query compilation. In addition, we introduce a generalized SSE Search algorithm that uses a sequence of SSE instructions to compare packed strings in the generated code to efficiently locate longer input patterns. Our approach of generating specialized code for each pattern eliminates the overhead of interpreting the pattern for each tuple. As a result, we improve the performance of LIKE pattern matching by up to 2.5×, demonstrating that code generation can significantly improve the efficiency of LIKE predicate evaluation in DBMSs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR Joint Workshops at 49th International Conference on Very Large Data Bases (VLDBW'23) - Workshop on Accelerating Analytics and Data</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License.</p>
      <p>We can also inline the generated code in a larger processing function by inlining the patterns and shift tables.</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Modern data processing systems offer outstanding
performance on simple data, which makes them an essential
component of efficient data processing pipelines.
However, these systems are still lacking in compute-heavy
string processing, which is common in real-world
applications. Tableau’s research shows that approximately
50% of all attributes use text-based data types, even when
there are more suitable data types [1]. Thus, database
systems need to focus on efficient text operations such
as LIKE expressions.</p>
      <p>In current systems, a common technique to process
text is to use a third-party library that focuses on
matching string patterns, often offering advanced features such
as SIMD acceleration [2, 3]. Unfortunately, these do
not integrate well with DBMS-specific text
representations. For example, DBMSs commonly use special string
storage formats, e.g., with parts of the string inlined or
lightweight compressed [4, 5]. However, to use external
text libraries, these systems need expensive string
conversions before they can invoke a matching function, while
a better integrated approach could allow just-in-time
decompression. In addition, string search libraries optimize
for finding patterns in a large continuous text corpus,
e.g., a text document. In contrast, for database systems,
the per-tuple overhead to interpret the string pattern or
transition tables leaves significant performance on the
table.</p>
      <sec id="sec-2-3">
        <title>Stream</title>
        <sec id="sec-2-3-2">
          <title>Generated</title>
          <p>Preprocessed: The pattern is preprocessed only once
before the first search starts, and the result of this
preprocessing (e.g., a transition table) is stored. For each tuple,
the database engine still calls a generic pattern-matching
function, but this function reuses the stored information
in the search.</p>
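          <p>As an illustration of this split (not Umbra’s actual code), C++17’s std::boyer_moore_searcher can play the role of the stored preprocessing result: it is built once per query, and a generic per-tuple call only reuses it. The struct and function names here are hypothetical.</p>

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch: the searcher object is the stored preprocessing
// result (e.g., a transition table), built once at query compile time.
struct PreprocessedPattern {
    std::string pattern;
    std::boyer_moore_searcher<std::string::const_iterator> searcher;
    explicit PreprocessedPattern(std::string p)
        : pattern(std::move(p)), searcher(pattern.begin(), pattern.end()) {}
};

// Generic per-tuple matching function: it reuses the stored searcher
// instead of re-deriving the shift tables for every tuple.
bool matchesLike(const std::string& tuple, const PreprocessedPattern& pre) {
    return std::search(tuple.begin(), tuple.end(), pre.searcher) != tuple.end();
}
```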
          <p>In contrast to these approaches, we need to integrate
LIKE pattern matching deeper into the data processing
engine. Code generation allows specializing the match
kernel, e.g., using data-centric code generation [6], or
in just-in-time compiled vectorized functions. In this
work, we integrate common algorithms such as
Knuth-Morris-Pratt [7], Boyer-Moore [8, 9, 10], Two-Way [11],
and SIMD-optimized routines [12]:</p>
          <p>Generated: During query compilation, we preprocess
the pattern once. Then, we generate code for the entire
search process using as many preprocessed results as
possible. The entire matching process is performed in
the generated code to avoid repeated function calls.</p>
          <p>In this paper, we focus on how to generate code for the
most common subset of regular expressions in SQL: LIKE
expressions. Section 3 introduces the string-matching
algorithms and outlines how our Generated approach
utilizes a code generation framework to rebuild each
matching function and specialize the code for the given pattern.
Furthermore, we present a generalized SSE Search
algorithm that generates highly efficient code for longer
patterns by utilizing SSE instructions. In Section 4, we
evaluate these algorithms alongside the mentioned
options in our code-generating DBMS Umbra [13]. This
provides a fair analysis of the different options to evaluate
LIKE predicates within a single system. While patterns
are typically short, we also analyze the performance with
longer patterns, reaching up to nearly 300 characters.
Additionally, we compare our Generated approach with
the publicly available systems Postgres, DuckDB, Hyper,
and ClickHouse.</p>
          <p>2. Related work</p>
          <p>Kemper and Neumann introduced HyPer, a pure in-memory
database for both OLTP and OLAP workloads [14], with
a data-centric approach for generating and compiling
compact and efficient machine code using LLVM [6].</p>
          <p>The idea of code generation has become widely
accepted among database researchers and developers [15,
16, 17, 18]. Umbra [13], the research successor of the
HyPer system, introduced a type-safe code generation
framework with a tailored intermediate representation to
compile directly into machine code, which makes Umbra
applicable for low-latency applications [19]. Together
with adaptive execution, this allows us to switch between
prioritizing low latency or high throughput in
execution [20].</p>
          <p>In general, regular expression matching can benefit
from using just-in-time code generation. Thompson
introduced the concept in 1968 to produce IBM 7094
programs for locating characters by regular expressions [21].
Today, many state-of-the-art libraries use code generation
to convert the given regular expression pattern into
an internal representation of bytecode: Python’s re
module generates bytecode for regular expressions and uses
an internal C engine for efficient execution [22]; Google’s
re2 represents the regular expression as an automaton in
bytecode and passes it for interpretation to an execution
engine [23]; Microsoft’s .NET framework provides both
a bytecode interpreter and a just-in-time compiler that
converts the expression to native machine code [24].</p>
          <p>However, only a few projects aim to generate machine
code for regular expression matching at runtime, and no
database system actively combines code generation and
pattern matching.</p>
          <p>3. Implementation</p>
          <p>This section outlines how pattern-matching algorithms
can use parts of code-generation frameworks to be
integrated into compiling database engines. The concept
of code generation for queries, as used in HyPer [14, 6]
or Umbra [13], presents a novel opportunity for
evaluating LIKE expressions in relational database systems
by generating code for the matching process instead of
interpreting the pattern. We start with Naïve, which uses
a hand-written function for pattern matching. It is called
for every tuple in the generated code by the database
system. We can already improve its performance by
preprocessing the pattern passed to the function. However,
we aim to entirely replace the function by generating
code specifically for the pattern and algorithm. The
generated code is then directly embedded in the
surrounding generated function code. Our current focus is
on constant LIKE patterns without any underscores or
collations. Thus, a bytewise comparison between the
pattern and input text is feasible, allowing us also to handle
non-ASCII characters. Within Umbra, we are using our
type-safe code-generation framework, which allows us to
pass the generated code in static single assignment (SSA)
form on to Umbra’s different execution backends [19].
Throughout this chapter, let us consider the following
query that filters the uni relation and counts how many
names contain the relatively short pattern 'TUM':</p>
          <p>select count(*) from uni where name like '%TUM%';</p>
          <p>In the upcoming Section 3.1 and Section 3.2, we discuss
the Knuth-Morris-Pratt (KMP) and Boyer-Moore (BM)
algorithms and explain in detail how they integrate into
the code generation process. Section 3.3 briefly
introduces the Two-Way (TW) search algorithm that
combines KMP and BM. Moving away from the well-known
pattern matching algorithms, Section 3.4 presents the
Hybrid-Search (HS) algorithm, which uses an SSE
instruction to match patterns up to a certain size. In Section 3.5,
we show how blockwise processing can improve
performance in finding possible occurrences of the pattern.
Finally, Section 3.6 presents the SSE Search algorithm,
which uses SSE instructions to perform pattern matching
for longer patterns. It is important to note
that this algorithm can only be implemented effectively
in a code-generating database engine. This is due to the
diverse nature of input patterns, where the flexibility
offered by a code-generating process surpasses that of an
interpreting algorithm.</p>
          <p>3.1. Knuth-Morris-Pratt Algorithm</p>
          <p>In 1977, Knuth, Morris, and Pratt introduced an algorithm
for performing exact string pattern matching without the
need to backtrack in the input text by preprocessing the
pattern [7]. The algorithm builds a table with pattern
length + 1 entries that point to where, in the pattern, we
continue the search after a mismatch. Thus, the table
holds information about the longest proper prefix that is
also a proper suffix of the pattern (lps). For the prefix of
length 0, the table stores the value −1, indicating that if a
mismatch occurs, the pattern can be shifted one position
to the right since no suffix exists at that point.</p>
          <p>1 KMP(text, pattern):
2 lpsTable = preprocess(pattern)
3 pPos = 0; pSize = pattern.size()
4 tPos = 0; tSize = text.size()
5 while (tPos - pPos + pSize &lt;= tSize)
6 if (pattern[pPos] == text[tPos])
7 pPos++; tPos++
8 if (pPos == pSize) return true
9 else
10 shift = lpsTable[pPos]
11 if (shift &lt; 0) pPos = 0; tPos++
12 else pPos = shift
13 return false</p>
          <p>Listing 1: Pseudocode for the Knuth-Morris-Pratt algorithm</p>
3.1. Knuth-Morris-Pratt Algorithm
In 1977, Knuth, Morris, and Pratt introduced an algorithm
for performing exact string pattern matching without the
need to backtrack in the input text by preprocessing the
pattern [7]. The algorithm builds a table with pattern
length + 1 entries that point to where, in the pattern, we
continue the search after a mismatch. Thus, the table
holds information about the longest proper prefix that is
also a proper sufix of the pattern ( lps). For the prefix of
length 0, the table stores the value −1, indicating that if a
mismatch occurs, the pattern can be shifted one position
to the right since no sufix exists at that point.
1 KMP(text, pattern):
2 lpsTable = preprocess(pattern)
3 pPos = 0; pSize = pattern.size()
4 tPos = 0; tSize = text.size()
5 while (tPos - pPos + pSize &lt;= tSize)
6 if (pattern[pPos] == text[tPos])
7 pPos++; tPos++
8 if (pPos == pSize) return true
9 else
10 shift = lpsTable[pPos]
11 if (shift &lt; 0) pPos = 0; tPos++
12 else pPos = shift
13 return false
whileLoopHeader:
check tPos - pPos + 6 1 ≤ text.size()
check pPos = 0
3.1.1. Preprocessed approach
To prevent the pattern from being processed repeatedly
for each tuple, we preprocess the pattern during code
Listing 1: Pseudocode for the Knuth-Morris-Pratt generation time to get the lps table and then save this
algorithm table along with the pattern in the generated program.</p>
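          <p>A runnable C++ transcription of Listing 1, including the early-return check in the loop header; buildLps is a standard border-table construction standing in for preprocess(pattern), which the paper does not spell out:</p>

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Border table: lps[p] is the length of the longest proper prefix of
// pattern[0..p-1] that is also a suffix; lps[0] = -1 as in the paper.
std::vector<int64_t> buildLps(const std::string& pattern) {
    int64_t m = pattern.size();
    std::vector<int64_t> lps(m + 1);
    lps[0] = -1;
    int64_t k = -1;
    for (int64_t i = 0; i < m; i++) {
        while (k >= 0 && pattern[k] != pattern[i]) k = lps[k];
        k++;
        lps[i + 1] = k;
    }
    return lps;
}

bool kmpSearch(const std::string& text, const std::string& pattern) {
    std::vector<int64_t> lps = buildLps(pattern);
    int64_t pPos = 0, pSize = pattern.size();
    int64_t tPos = 0, tSize = text.size();
    if (pSize == 0) return true;            // empty pattern matches everywhere
    while (tPos - pPos + pSize <= tSize) {  // early return (line 5)
        if (pattern[pPos] == text[tPos]) {
            pPos++; tPos++;
            if (pPos == pSize) return true;
        } else {
            int64_t shift = lps[pPos];
            if (shift < 0) { pPos = 0; tPos++; }  // restart at next text char
            else pPos = shift;                    // reuse matched prefix
        }
    }
    return false;
}
```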
          <p>Listing 1 presents the pseudocode of the KMP
algorithm. Line 2 preprocesses the pattern and initializes two
position counters for text and pattern. When the
characters at these indexes match (lines 6 to 8), the function
increments both position counters. When reaching the
pattern end, a match is found.</p>
          <p>If the characters do not match (lines 9 to 12), the
function reads the optimal shift value from the lps table based
on the pattern position. A negative value indicates that
there is no proper suffix in the pattern that is also a prefix
of the pattern. Thus, we need to restart the comparison
from the following text character. Otherwise, the
function updates the pattern position to the lps table value
and continues comparing at the current text index.</p>
          <p>In line 5 of Listing 1, we introduce an optimization
for the KMP algorithm called the early return, which we
use in all our variants. This optimization checks every
iteration to determine whether the end of the pattern lies
within the text. If that is not the case, we stop the
comparison, as it is impossible to find another match.</p>
          <p>3.1.1. Preprocessed approach</p>
          <p>To prevent the pattern from being processed repeatedly
for each tuple, we preprocess the pattern during code
generation time to get the lps table and then save this
table along with the pattern in the generated program.
We do not need to store any additional information for
the lps table because its size depends on the pattern size.
When executing the generated code, we still make a
function call to perform the search, but instead of
recalculating the table every time, we reuse the stored lps table.
So, we omit the preprocessing in line 2 of Listing 1.</p>
          <p>3.1.2. Generated approach</p>
          <p>We replace all calls to the matching function with
specifically generated code for the query’s pattern. So, we
embed the search phase algorithm entirely within the
generated code. Figure 2 shows how we can reconstruct
the Knuth-Morris-Pratt algorithm using the example we
discussed earlier to search for the pattern 'TUM' in the
input text.</p>
          <p>We begin the algorithm in the whileLoopHeader block
by checking if the remaining length of the input text is
sufficient. In this check, we can inline the size of the
pattern into the arithmetic expression ①. Then, we move to
the correct pattern position to continue the comparisons
from this position. We generate the comparisons out of
the pattern from left to right ②. If the characters match,
we proceed directly to the next character, but if there is a
mismatch, we jump to the performShift block. In this
block, we choose the shift value from the inlined lps
table ③ based on the position of the failed comparison. We
then determine how to proceed and continue the search
by jumping back to the whileLoopHeader.</p>
          <p>3.2. Boyer-Moore Algorithm</p>
          <p>In 1977, Boyer and Moore presented a pattern-matching
algorithm that iterates backwards over the pattern to
search it in the input text [8]. We focus on their fast
implementation shown in Listing 2.</p>
          <p>Before the search phase begins, the algorithm requires
two preprocessing steps, namely the Bad Character
Heuristic (BCH) in line 6 and the Good Suffix Heuristic (GSH) in
line 7. The BCH ensures that the text’s letter at which
the mismatch occurred aligns with its rightmost
occurrence in the pattern. Alternatively, the GSH shifts the
pattern based on the longest suffix of the matched input
text. This part is aligned with the rightmost occurrence
of that character sequence in the pattern (except for the
suffix of the pattern itself). Both heuristics precalculate
shift values for the pattern and store them in tables. The
original papers contain a more detailed explanation of
the heuristics and their effects [8, 9].</p>
          <p>The fast implementation requires a third table, δ0, which
is essentially a copy of the result of the BCH but holds the
value Ψ (called large in [8]) for the last character of the
pattern. This value needs to be greater than the sum of
the lengths of all possible input texts and patterns. At the
beginning of the search phase, the pattern is aligned with
the start of the input text, and the comparison starts from
the rightmost character. The implementation looks up
the value in δ0 and either shifts the pattern to the right or
adds Ψ to the current position (line 11 of Listing 2). This
allows us to scan through the input text, and once we
add Ψ to the current position, we know the last character
was found. Thus, the algorithm recalculates the index
for the second last character and starts comparing the
pattern from right to left with the input text (lines 16
to 20). In case of a mismatch during this comparison, the
algorithm applies the maximum of the shifts according
to the heuristics (line 21) before the search for the last
character of the pattern restarts.</p>
          <p>3.2.1. Preprocessed approach</p>
          <p>As the BCH table contains 256 values and the size of the
GSH table matches the pattern length, repetitive
preprocessing is quite expensive. Like the KMP algorithm, we move
all preprocessing steps to code generation and store the
resulting tables directly along with their corresponding
pattern. When storing the tables, we do not require
additional information since the size of the tables is either
known beforehand or can be derived from the pattern
size. We replace the calls to the preprocessing functions
(lines 6 to 7) with pointers to the corresponding table.</p>
          <p>As the only difference between δ0 and δ1 is the value
for the pattern’s last character, we do not copy the table
but instead modify the code in the search phase loop to
actively add the correct value, so either the value Ψ or
the value from δ1.</p>
          <p>3.2.2. Generated approach</p>
          <p>1 BM(text, pattern):
2 pSize = pattern.size()
3 tSize = text.size()
4 // Ψ &gt; tSize + pSize for all inputs
5 Ψ = 1 &lt;&lt; 48; pPos = pSize - 1; tPos = pPos
6 δ1 = preprocessBadCharacterHeuristic(pattern)
7 δ2 = preprocessGoodSuffixHeuristic(pattern)
8 δ0 = δ1
9 δ0[pattern[pSize - 1]] = Ψ
10 while (tPos &lt; tSize)
11 tPos += δ0[text[tPos]]
12 if (tPos &gt;= Ψ)
13 tPos = tPos - Ψ - 1
14 if (pSize == 1) return true
15 else
16 pPos = pSize - 2
17 while (pPos &amp;&amp; text[tPos] == pattern[pPos])
18 pPos--; tPos--
19 if (!pPos &amp;&amp; text[tPos] == pattern[pPos])
20 return true
21 tPos += max(δ1[text[tPos]], δ2[pPos])
22 return false</p>
          <p>Listing 2: Pseudocode for the Boyer-Moore algorithm</p>
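          <p>The fast loop of Listing 2 can be sketched in C++. This simplified version keeps δ0 with the Ψ sentinel but, for brevity, verifies candidates with memcmp and shifts by a single position instead of applying max(δ1, δ2), so it is the skeleton of Listing 2 rather than the full algorithm:</p>

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <string>

// Simplified Boyer-Moore fast loop: delta0 is the bad-character table
// with the sentinel PSI for the pattern's last character. On a candidate,
// the rest of the pattern is verified with memcmp, and the shift after a
// failed verification is 1 (a sketch, not the full good-suffix logic).
bool bmSearch(const std::string& text, const std::string& pattern) {
    int64_t pSize = pattern.size(), tSize = text.size();
    if (pSize == 0) return true;
    if (pSize > tSize) return false;
    const int64_t PSI = int64_t(1) << 48;  // > tSize + pSize for all inputs
    std::array<int64_t, 256> delta0;
    delta0.fill(pSize);                    // char absent: skip pattern length
    for (int64_t i = 0; i < pSize - 1; i++)
        delta0[(unsigned char)pattern[i]] = pSize - 1 - i;
    delta0[(unsigned char)pattern[pSize - 1]] = PSI;
    int64_t tPos = pSize - 1;
    while (tPos < tSize) {
        tPos += delta0[(unsigned char)text[tPos]];
        if (tPos >= PSI) {
            tPos -= PSI;                   // last pattern character matched here
            if (std::memcmp(text.data() + tPos - (pSize - 1),
                            pattern.data(), pSize - 1) == 0)
                return true;
            tPos += 1;                     // simplified shift by one position
        }
    }
    return false;
}
```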
          <p>Similar to the KMP algorithm, we rebuild the
Boyer-Moore algorithm to get pattern-specific code. Figure 3
presents the conceptual control flow for searching the
pattern 'TUM' in the input text. We check whether enough
characters are left in the input string before we start the
matching process in the whileLoopHeader block. If so,
we get the shift value from the inlined δ0 table ① and add
it to the pattern position. If that value is smaller than Ψ,
we continue in this loop by going to the whileLoopHeader
block. Otherwise, we know the input text contains the
last character of the pattern, so further checks are
required. Before the checks, we recalculate the
character index aligned with the second last character in the
pattern. In order to proceed, we check the text based
on the reversed pattern ②. In case of a mismatch, we
jump to the performShift block. In this block, we
determine goodShift from δ2 ③ based on the preceding
matching text block and badShift from δ1 ④ based on the
mismatching character. We then add the maximum of
both shift values to the text position before returning
to the whileLoopHeader to continue with the matching
process.</p>
          <p>Figure 3 (excerpt): the whileLoopHeader block checks
tPos &lt; text.size(); the performShift block computes
goodShift = δ2[…] (here [5, 4]) ③ and
badShift = match text[tPos] { 'T': 2, 'U': 1, 'M': 0, _: 3 } ④,
and then sets tPos = tPos + max(badShift, goodShift).</p>
          <p>When analyzing the instructions in the performShift
block of Figure 3, one may mistakenly perceive the
inclusion of the inlined δ1 table for determining the shift
caused by the BCH as unnecessary. This impression
arises from the fact that the minimum shift resulting
from the GSH is always greater than the maximum
possible shift caused by the BCH. Consequently, the
maximum shift is consistently determined by the good suffix
heuristic. It is important to note, however, that this
observation cannot be universally applied to all patterns.
To address this, we have implemented an optimization
in the code generation process, which generates code for
determining the BCH shift only when it is truly required.</p>
          <p>3.3. Two-Way Algorithm</p>
          <p>As an alternative to the Knuth-Morris-Pratt and
Boyer-Moore algorithms, Crochemore and Perrin presented the
Two-Way String-Matching algorithm (TW), which
combines both previous algorithms into one [11]. To achieve
this, the algorithm first splits the pattern according to
the known Critical Factorization Theorem [25]. With
that, the pattern is theoretically split into a left and a right
part. In the search phase, the right half is compared from
left to right first; if all characters match, then the left
half is compared from right to left. If any mismatches
occur during the comparisons, the pattern is shifted by a
certain number of positions. After a close analysis of the
interpreting Two-Way algorithm, its functionality can be
rebuilt using the available code generation framework as
for the other algorithms.</p>
          <p>Again, we implement a Naïve, Preprocessed, and
Generated version for the Two-Way algorithm. In the Naïve
version, the Critical Factorization preprocessing step is
performed repeatedly for each input text. In the
Preprocessed version, we store the necessary preprocessed
result along with the pattern in the data section of the
generated code. This value is then loaded along with the
pattern when needed and used in an interpretive
algorithm. The Generated version of the algorithm depends
on the output of the preprocessing function. It generates
the relevant part of the Two-Way algorithm based on the
outcome of the Critical Factorization step and inlines as
much of the information as possible into the algorithm.</p>
          <p>3.4. Hybrid-Search Algorithm</p>
          <p>Sitaridi et al. have introduced an algorithm which uses
the SSE 4.2 SIMD instruction set, which comprises
instructions that efficiently accelerate string and text
processing [12]. According to Intel, these instructions are
designed to enhance the performance of databases or
complex searching and pattern matching algorithms [26].
However, the presented algorithm is restricted to patterns
that fit into a 128-bit SIMD register.</p>
          <p>1 HS(text, pattern):
2 if (pattern.size() &lt;= 16 &amp;&amp; text.size() &gt;= 16)
3 iter = text.begin(); end = text.end()
4 safeMatch = 17 - pattern.size()
5 pattern16 = load16(pattern)
6 while ((iter + 16) &lt; end)
7 match = pcmpistri(pattern16, load16(iter))
8 if (match &lt; safeMatch) return true
9 iter += safeMatch
10 if (iter &lt; end)
11 match = pcmpistri(pattern16, load16(end - 16))
12 return match &lt; safeMatch
13 return false
14 return TW(text, pattern)</p>
          <p>Listing 3: Pseudocode for the Hybrid Search algorithm</p>
          <p>With our Hybrid Search, we extend this algorithm to
handle any size of input text and pattern, as presented
in Listing 3: For patterns up to the length of a vector
register, we use the pcmpistri instruction, given that
the input text is at least 16 bytes long. In such cases, we
process 16 bytes of the input text at once until less than
16 bytes are left (lines 6 to 9). To check the end of the
input text, we load the last 16 bytes of the input text,
which is safe since we know that the input text is long
enough (lines 10 to 12). However, if either of the input
parameters does not meet its length criterion, we resort
to a default string search algorithm. In our case, it is
the Two-Way algorithm (line 14). Considering that the
best-suited algorithm depends on various factors, such
as the pattern and workload, it would be beneficial to
implement multiple fallback algorithms, allowing the
selection of the most appropriate one.</p>
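          <p>Listing 3’s control flow can be mirrored portably. Here cmpistri is a scalar stand-in for the pcmpistri intrinsic (equal-ordered mode), and std::string::find replaces the Two-Way fallback, so this is a behavioral sketch rather than the SIMD implementation:</p>

```cpp
#include <algorithm>
#include <cstring>
#include <string>

// Scalar stand-in for pcmpistri (equal-ordered): returns the first offset
// in the 16-byte block where the pattern begins to match (possibly only
// as a prefix reaching the block end), or 16 if no position matches.
int cmpistri(const char* pattern, size_t pSize, const char* block) {
    for (int i = 0; i < 16; i++) {
        size_t n = std::min(pSize, size_t(16 - i));
        if (std::memcmp(block + i, pattern, n) == 0) return i;
    }
    return 16;
}

bool hybridSearch(const std::string& text, const std::string& pattern) {
    size_t pSize = pattern.size(), tSize = text.size();
    if (pSize <= 16 && tSize >= 16) {
        size_t safeMatch = 17 - pSize;       // match below this is definite
        const char* iter = text.data();
        const char* end = text.data() + tSize;
        while (iter + 16 < end) {
            int match = cmpistri(pattern.data(), pSize, iter);
            if ((size_t)match < safeMatch) return true;
            iter += safeMatch;               // all earlier starts are ruled out
        }
        if (iter < end) {                    // check the final 16-byte block
            int match = cmpistri(pattern.data(), pSize, end - 16);
            return (size_t)match < safeMatch;
        }
        return false;
    }
    // fallback for long patterns or short texts (the paper uses Two-Way)
    return text.find(pattern) != std::string::npos;
}
```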
          <p>3.4.1. Preprocessed approach</p>
          <p>Based on the chosen fallback algorithm, one might
consider how to include the corresponding Preprocessed
function of the chosen algorithm. To match the chosen
Two-Way algorithm as the default fallback algorithm in the Naïve
approach, we chose the Preprocessed version for this
approach.</p>
          <p>1 uint64_t block = loadNext8Bytes(...)
2 // broadcast 'T' to each byte: 0x5454545454545454ull
3 uint64_t searchedChar = broadcast('T')
4 const uint64_t high = 0x8080808080808080ull
5 const uint64_t low = ~high
6 uint64_t lowChars = (~block) &amp; high
7 uint64_t cleared = (block &amp; low) ^ searchedChar
8 uint64_t found = ~((cleared + low) &amp; high)
9 uint64_t matches = found &amp; lowChars
10 bool matchFound = matches != 0</p>
          <p>Listing 4: Blockwise search for ASCII character 'T'</p>
          <p>3.4.2. Generated approach</p>
          <p>To generate code for our Hybrid Search, we extended
the custom code generation framework of Umbra to
support the necessary SSE instruction for comparing packed
strings. This enables us to generate the parts of the
algorithm needed for the specific pattern. For patterns that
are longer than 12 bytes, we only generate the code for
the default matching algorithm, as we do not use the
SSE instruction for that kind of pattern. For shorter
patterns, we generate both the part using the SSE
instruction and the default fallback. While executing the code,
we determine which part of the algorithm to use based
on the length of the input text. The decision to set the
limit to 12 bytes is guided by the fact that this still allows
performant shifting of the input pattern (cf. safeMatch,
measured in Figure 9 in Section 4.3.1).</p>
          <p>3.5. Blockwise Processing Optimization</p>
          <p>Blockwise Processing can improve the initial pattern
search. It draws inspiration from SIMD within a register
(SWAR) [27] and enhances the efficiency of character
search in an input text over the naïve idea, which would
involve iterating over the text and examining each
character, wasting cycles simply searching
for the desired character. Blockwise processing
can instead rapidly locate the first character
of the pattern and then continue with the chosen
pattern-matching algorithm. Listing 4 demonstrates the
algorithm to detect the presence of the ASCII character
'T' in a block. We read the next eight bytes from the
input text into a register. With another register holding
the character broadcast to each byte, we perform
various bitwise operations between the registers and specific
constants. After these operations, we get a value back
which is either 0, meaning 'T' could not be found in the block,
or in which the highest bit of the byte at which the character
appeared is set. This code can also be adjusted for non-ASCII
characters, which have the highest bit set. While certain SSE
instructions may provide similar functionality, our
objective is to present a versatile approach that is not limited
to any specific hardware support.</p>
          <p>3.6. SSE-Search Algorithm</p>
          <p>The Hybrid Search algorithm employs an SSE instruction
for pattern matching, limiting the pattern length to at
most 16 bytes. For longer patterns, the algorithm
provides an alternative approach that does not use the SSE
instruction. Expanding the algorithm to handle longer
patterns with SSE instructions is feasible but significantly
increases its complexity. However, the emergence of
code-generating database engines has opened up new
possibilities for generating code optimized with SSE
instructions that is specifically tailored to long patterns.
In Figure 4, we present the conceptual design of the
generated code for searching the pattern 'Technical
University of Munich'. Similar to the KMP algorithm, we
initially check if the pattern can fit within the remaining
text by directly including the pattern length ①. If there
are enough characters remaining, we proceed with the
comparison.</p>
          <p>The search algorithm aims to locate the starting
position of the pattern within the input text. Once the start
position is found, we continue comparing subsequent
parts of the pattern and text sequentially from that
position. To achieve this, we extract the first 16 bytes of the
pattern and load the next 16 bytes from the text. Using
the SSE instruction pcmpistri, we search for the start of
the pattern in the input text ②. If no match is found, we
shift the text position to the right and restart the overall
search for the pattern start in the input text. In the case
of a match, we enter the generated code, which loads
the next 16 bytes from the input text and compares them
to the corresponding part of the pattern. This
comparison can be performed using either the SSE instruction
pcmpistri or another binary comparison function for
vector registers. Since we know that the subsequent
pattern blocks must follow the previous ones, we can easily
generate code to handle this logic. This is repeated until
less than 16 bytes of the pattern are left.</p>
          <p>Handling the remaining bytes of the pattern requires
special handling, as both the pattern and input text may
not fully occupy an SSE register. Therefore, we load a
maximum of 16 bytes from the input text and also return
how many bytes were read. With the loaded block and
the number of read bytes, we employ the SSE
instruction pcmpestri ③. This instruction requires explicit
specification of the length of the input data as additional
arguments. If there is a mismatch between one of the blocks of the
pattern and the corresponding text block, we stop the
comparison and go to the performShift block. Within
this block, we apply a shift heuristic that moves the
pattern as far to the right as feasible. Following the pattern
shift, we return to the whileLoopHeader to resume the
matching process by checking the remaining length of
the text.</p>
          <p>Shift heuristics. For shift heuristics, we have two
options: a simple shift to the right by one position or a
more advanced KMP-like heuristic. The latter relies on
identifying the longest suffix of the already matched
pattern that is also a proper prefix. This operation results
in no additional runtime overhead since it can be
preprocessed during code generation and is directly written to
the code.</p>
          <p>Size of start block. Figure 4 presents the version of
the algorithm which directly loads the first 16 bytes
of the pattern into a vector register. However, when
fully utilizing the vector register, one can only shift one
position to the right if the start of the pattern does not
match the loaded text block. To increase the possible
shift in case this part is not found in the loaded input
text, we can reduce the number of bytes loaded from the
pattern.</p>
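          <p>The block chain described above could, for instance, unroll to the following scalar sketch for this specific pattern; memcmp stands in for the 16-byte SIMD comparisons, and the shift heuristic is the simple shift-by-one (both assumptions for illustration):</p>

```cpp
#include <cstring>
#include <string>

// Hypothetical shape of generated code for the long pattern
// "Technical University of Munich" (30 bytes): find a candidate start
// block, then compare the remaining pattern bytes as a tail block.
bool sseSearchSketch(const std::string& text) {
    static const char pattern[] = "Technical University of Munich";
    const size_t pSize = 30;
    size_t tPos = 0;
    while (tPos + pSize <= text.size()) {            // whileLoopHeader (1)
        // locate a candidate start: the first 16 bytes of the pattern (2)
        if (std::memcmp(text.data() + tPos, pattern, 16) != 0) {
            tPos++;                                  // simple shift heuristic
            continue;
        }
        // remaining 14 bytes, compared as one tail block (3)
        if (std::memcmp(text.data() + tPos + 16, pattern + 16, pSize - 16) == 0)
            return true;
        tPos++;
    }
    return false;
}
```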
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Evaluation</title>
      <p>In order to check our implementations on a more realistic
dataset, we use ClickBench¹. It includes typical modern
workloads and queries used in ad-hoc analytics and
real-time dashboards. The data used in the benchmark is
collected from a real-world web analytics platform. While
it is anonymized, it retains the essential distributions of
the data, including non-ASCII characters. For our
experiments, we used queries 20, 21, 22, and 23 from
the ClickBench benchmark, which contain the following
LIKE predicates:</p>
      <p>Q 20, 21, 23: url like '%google%'
Q 22: title like '%Google%' and url not like '%.google.%'</p>
      <p>Query 20 scans the relation hits and counts how many
tuples fulfill the predicate; the other queries involve more
operators like aggregates or sorting, so the overall
performance is not entirely dominated by the pattern matching
algorithm.</p>
      <p>Since the patterns mentioned above are shorter than
the length of a vector register, we classify them as short
patterns. To evaluate the effects of longer patterns, we
increased the pattern length for Q 20 to 31, 160, and 291
characters, categorizing them as long patterns.
We run the microbenchmarks on an Intel i9-7900X
CPU (Skylake-X, 3.3-4.5 GHz) with 10 cores and 128 GB
4-channel DDR4-2133 memory, running Ubuntu 22.10
(Kernel 5.19, gcc 12.2), and repeat all measurements five
times.</p>
      <p>4.1. Full System Comparison</p>
      <p>We compared our Generated approaches for pattern
matching with other popular database systems, namely
Postgres, DuckDB, Hyper, and ClickHouse. Figure 5
demonstrates that our approach of generating pattern-specific
code performs better than the other databases. While
Umbra outperforms the other databases for Query 21
and 22, Hyper is slightly faster than our Boyer-Moore
algorithm for Query 20 but slower than the other three
algorithms. However, Hyper uses a pattern matching
algorithm which is quite similar to our Hybrid-Search
algorithm, also using an SSE instruction to search for
the given pattern. As our pattern is relatively short, we
observe that in nearly all cases we benefit from
generating pattern-specific code. Due to the shortness of the
pattern, one can observe that the Hybrid-Search benefits
from the SSE instruction, as it clearly dominates the other
algorithms, especially for Query 20.</p>
      <p>1 https://benchmark.clickhouse.com</p>
      <p>Nave</p>
      <p>Q 22
Postgres
ClickHouse</p>
      <p>DuckDB
Hyper</p>
      <p>KMP
BM</p>
      <p>TW</p>
      <p>HS</p>
      <p>Non-Blockwise
(Section 3.1)</p>
      <p>Blockwise
(Section 3.5)
not found. Consequently, the Preprocessed approach is
observe that in nearly all cases we benefit from gener- faster than the Naïve one since it doesn’t need to access
ating pattern specific code. Due to the shortness of the the lps table as often. Still, the Generated approach
repattern, one can observe that the Hybrid-Search benefits mains superior to the other alternatives. Based on these
from the SSE instruction, as it clearly dominates the other findings, we directly focus further analysis on the KMP
algorithms, especially for Query 20. algorithm with blockwise processing.
4.2. Short Pattern Microbenchmark
4.2.2. Algorithm Comparison
We continue our comparison within our database system Figure 7 illustrates the results of comparing the matching
Umbra to analyze the diferences between the algorithms algorithms discussed in Section 3 for Q 20 and Q 21 of the
in detail. In this section, we will focus on the more com- ClickBench benchmark. The Preprocessed and Generated
mon case of short patterns. approaches outperform the interpreting approaches.
This is because the preprocessing phases of the KMP
4.2.1. Blockwise Processing and BM algorithms generate large lookup tables. By
employing code generation capabilities for the
PreproOur initial investigation evaluates the eficacy of using cessed approach, we can avoid redundant preprocessing
Blockwise Processing in combination with KMP as ex- of the pattern. Storing the tables in the data section of
plained in Sections 3.1 and 3.5. If a mismatch occurs, we the generated program leads to a substantial
improvecheck how the KMP algorithm would shift the pattern ment in throughput. As the Generated approach suggests,
according to its preprocessing. In case the pattern would generating highly specialized code further enhances
perbe shifted by one character, we switch back to blockwise formance. However, for the Boyer-Moore algorithm, the
processing and restart the search for the first charac- performance improves less in Query 20. According to
ter of the pattern. Figure 6 illustrates the advantages further analysis, this is due to many branches in the
of blockwise processing compared to the non-blockwise generated code, which can result in mispredictions and
approach for Query 20 on the ClickBench dataset. By overall slow performance.
applying this optimization, larger blocks of the input text For Query 21, we observe similar behavior as for Query
can be processed at once instead of reading byte by byte. 20. However, the generated code for the Boyer-Moore</p>
      <p>In the non-blockwise case, both the Naïve and Prepro- algorithm appears marginally diferent, leading to higher
cessed versions show similar throughputs. After conduct- performance improvement for the Generated approach
ing a performance analysis, we identified that loading over the Preprocessed version.
the lps table values from the data section in the Prepro- The preprocessing function of the Two-Way algorithm
cessed approach yields performance similar to repeatedly only returns a number, so it does not have to generate a
preprocessing the relatively short pattern in the Naïve table as the other two algorithms. Consequently, the
Genapproach. By using the Generated approach, we can com- erated approach achieves a higher throughput compared
pletely avoid any indirections, resulting in the highest to the Naïve approach.
throughput for the KMP algorithm. When it comes to the Hybrid-Search algorithm, one</p>
      <p>In the blockwise algorithms, larger blocks of the input can see the benefit of using SSE instructions to search
text can be skipped if the first character of the pattern is
]
s
/
se50 M
l
p
u
[T 0
t
u
p
gh0.1 G
u
o
rh 0
T</p>
      <p>Preprocessed</p>
      <p>Q 20
KMP BM TW HS
(Sections (Section 3.2) (Section 3.3) (Section 3.4)
3.1 + 3.5)
12 4 6 8 10
#threads
Nave
15
KMP</p>
      <p>BM
TW
HS</p>
      <p>SSE
for a pattern in an input text. For each approach, this
algorithm dominates all the others. Additionally, it benefits
from avoiding repetitive calls to the matching function,
resulting in the throughput of the Generated version being
nearly 2.5× that of the Naïve approach.</p>
      <p>4.2.3. Multithreading</p>
      <p>The final interesting aspect of our microbenchmarks is
how throughput develops when running queries with multiple
threads. We expect the throughput to scale linearly with an
increasing number of threads, as Umbra uses morsel-driven
parallelism [28]. As shown in Figure 8, this expectation
aligns with the observed results. Once hyperthreading is
reached, the throughput still increases, but at a different
rate than before.</p>
      <p>Our different degrees of code generation provide the
most benefit to the KMP algorithm. In the Generated version,
we can nearly double the throughput compared to the Naïve
approach, while the Preprocessed version sits in the
middle.</p>
      <p>The Boyer-Moore algorithm also benefits from the
code-generation approaches. However, the Preprocessed and
Generated versions are much closer together, and under
hyperthreading, both versions approach each other.
Nevertheless, the main problem with this algorithm is the
higher number of branches required to derive the correct shift
from the BCH table; in the Preprocessed approach, this is just
a memory lookup. With more branches, more mispredictions
happen, causing a generic function to outperform specifically
generated code.</p>
      <p>Lastly, the Generated version of the Two-Way algorithm
is also slightly faster than its Naïve version. Due to the
lower complexity of the preprocessing phase, the difference
between both versions is small. Still, the Generated version
has higher throughput.</p>
      <p>Analyzing the Hybrid-Search algorithm is a bit more
involved, as it depends on both the pattern and the input text
which specific part of the algorithm is executed. In the
experiment, the pattern falls within the length limit and the
input texts are, on average, also large enough. Therefore, the
Hybrid-Search algorithm predominantly executes the search with
the SSE instruction and rarely falls back to the default
algorithm. This method of utilizing SSE instructions for
pattern matching shows a significant performance improvement,
even with the Naïve approach. By applying the Generated
approach and eliminating the overhead of the repeated function
calls, the throughput further increases. However, beyond eight
threads, the throughput levels out because it approaches the
memory limit of the machine.</p>
      <p>Table 1 presents the execution times of the different
approaches for Query 20 using 20 threads. One can observe
that, with the Generated approach, the SSE Search algorithm is
slightly slower than the Hybrid-Search algorithm. This can be
attributed to the SSE Search algorithm's need for specialized
handling of short patterns and input texts, resulting in more
complex code and slower execution.</p>
      <p>4.2.4. Compilation Overhead</p>
      <p>When dealing with code-generating database engines, it
is essential to consider the overhead of compilation. In
Table 1, we also show the compilation times for Query 20 using
the LLVM backend of Umbra. The data reveals that as we move
from the Naïve to the Generated approach, the compilation
times for the algorithms increase marginally. Nevertheless,
this increase is balanced out by the reduction in execution
time.</p>
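      <p>The trade-off between the Naïve and Preprocessed approaches above can be made concrete with a compact KMP sketch (our own illustration, not Umbra's generated code): the Naïve approach rebuilds the lps table for every tuple, while the Preprocessed approach computes it once, as if stored in the program's data section.</p>

```c
#include <stddef.h>
#include <string.h>

/* Build the KMP "longest proper prefix that is also a suffix" table.
 * The Preprocessed/Generated approaches run this once at code
 * generation time; the Naive approach re-runs it for every tuple. */
static void build_lps(const char *pat, size_t m, size_t *lps) {
    lps[0] = 0;
    size_t len = 0;
    for (size_t i = 1; i < m; i++) {
        while (len > 0 && pat[i] != pat[len]) len = lps[len - 1];
        if (pat[i] == pat[len]) len++;
        lps[i] = len;
    }
}

/* KMP substring search using a prebuilt lps table; returns the
 * offset of the first match, or -1 if the pattern does not occur. */
static ptrdiff_t kmp_find(const char *text, size_t n,
                          const char *pat, size_t m, const size_t *lps) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        while (j > 0 && text[i] != pat[j]) j = lps[j - 1];
        if (text[i] == pat[j]) j++;
        if (j == m) return (ptrdiff_t)(i + 1 - m);
    }
    return -1;
}
```

      <p>The Generated approach goes one step further than hoisting build_lps out of the per-tuple loop: the lps values become constants in the emitted code, so even the table lookup disappears.</p>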
      <p>Moreover, it is worth mentioning that Umbra could
employ the Flying Start technique or the FireARM backend. This
means that the compilation overhead could be concealed by
using a specific backend until the compiled LLVM function
becomes available [19, 29]. However, in our experiments, we
did not employ this option, as only 0.5% of the tuples could
have been processed during compilation when running the
Hybrid-Search algorithm fully multi-threaded.</p>
      <p>4.3. Long Pattern Microbenchmark</p>
      <p>As the final part of the evaluation, we look at the
effect of long patterns. We classify a pattern which exceeds
the length of a single vector register (16 bytes) as a long
pattern. For our experiments, we use three patterns: pattern A
with 31 characters, pattern B with 160 characters, and
pattern C, a combination of three long patterns totaling
291 characters.</p>
      <p>4.3.1. Size of start block</p>
      <p>Figure 9 presents the results of varying the number of
characters in the start block, which is employed to locate a
potential pattern start. The top plots visualize the
performance using only one thread, while the bottom ones show
performance with 20 threads.</p>
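      <p>The start-block idea can be sketched as follows (our illustrative reconstruction: the function name and the memcmp-based comparison are ours, while the generated code compares the start block in SSE registers): only the first k bytes of the pattern are scanned for, and the full pattern is verified only at the surviving candidate positions.</p>

```c
#include <stddef.h>
#include <string.h>

/* Locate a long pattern by scanning for its start block (the
 * first k bytes, k <= pattern length) and verifying the full
 * pattern only at candidate positions. memcmp stands in for the
 * SSE register comparison used by the generated code. */
static ptrdiff_t find_with_start_block(const char *text, size_t n,
                                       const char *pat, size_t m,
                                       size_t k) {
    if (m > n) return -1;  /* early return: pattern no longer fits */
    for (size_t i = 0; i + m <= n; i++) {
        if (memcmp(text + i, pat, k) != 0)
            continue;      /* start block absent: keep scanning */
        if (memcmp(text + i + k, pat + k, m - k) == 0)
            return (ptrdiff_t)i;  /* candidate confirmed */
    }
    return -1;
}
```

      <p>The size of k is exactly the knob varied in Figure 9: a larger start block filters out more false candidates but makes the localization phase itself more expensive.</p>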
      <p>When executing with a single thread, the algorithm
achieves peak performance when three bytes of the pattern are
employed in the localization phase. This size allows for
sufficient shifting of the pattern to the right while
minimizing false positives. When utilizing 20 threads, the
performance remains largely unaffected by the size of the
start block. The limiting factor in this scenario is the
available memory bandwidth, which operates at 68 GB/s and is
utilized to over 90%.</p>
      <p>Moreover, the combination of longer patterns and the
early-return implementation proves advantageous, leading to
increased throughput with larger pattern sizes.</p>
      <p>[Figure 9: throughput in tuples/s over start-block
sizes; top: 1 thread, bottom: 20 threads.]</p>
      <p>4.3.2. Algorithm overview</p>
      <p>Finally, the comparison of the different code-generating
algorithms using 20 threads on long patterns is illustrated in
Figure 10. For the SSE Search algorithms, we have chosen the
start size with the highest performance. Since the fallback
option of the Hybrid-Search algorithm for long patterns is the
Two-Way algorithm, both algorithms show similar
performance.</p>
      <p>For all patterns, the SSE Search algorithm, which
generates pattern-specific code for the matching process,</p>
      <p>Each matching algorithm combined with a code-generation
approach offers different performance benefits and use cases,
but also challenges.</p>
      <p>Knuth-Morris-Pratt is relatively straightforward to
implement in all three approaches. It can be enhanced by
adding the blockwise optimization with just a few
modifications. Our experiments demonstrated that the code</p>
    </sec>
    <sec id="sec-4">
      <title>5. Lessons Learned</title>
      <p>generating approach significantly improved the
performance of the KMP algorithm. The blockwise version of KMP
is particularly effective when the first character of the
pattern has a low frequency of occurrence in the input text,
as this allows efficient consumption of large portions of the
text. Applying the early-return optimization further enhances
the performance: it discards longer patterns faster once they
no longer fit in the remaining input text. Additionally, the
KMP algorithm iterates over the input text from left to right.
Thus, the algorithm can also process texts with Unicode
characters based on their codepoints rather than only their
bytewise representation.</p>
      <p>Boyer-Moore is also easy to realize in the Naïve and
Preprocessed approaches. The Generated approach requires more
bookkeeping during code generation for correct SSA form. The
experiments show that Generated is superior, while
Preprocessed is better in the case of hyperthreading. This
algorithm is more effective when the last pattern character
occurs less frequently than the first character. It also works
better with longer patterns due to early rejection once they
exceed the input text.</p>
      <p>Two-Way is complex to implement in both approaches. In
our experiments, Generated is slightly faster than Naïve, due
to its less costly preprocessing function. However, the
performance varies depending on the pattern factorization. The
pattern of the experiment was not optimally factorized,
leading to a performance similar to that of the KMP algorithm.
With a pattern better suited for factorization, performance
improves.</p>
      <p>Hybrid-Search's complexity is relatively low when using
the Naïve approach, and it depends solely on the chosen
default algorithm. Implementing the SSE search component is a
simple task and completely decoupled from the fallback
algorithm. However, integrating the Generated approach into
Umbra and its backends requires more effort. This is because
we needed to introduce a new internal instruction for the SSE
string comparison function, which then maps to the
corresponding function for each backend. Nevertheless, this
algorithm shows the most promise and consistently outperforms
the other algorithms in all three approaches. The Generated
approach is only limited by the memory speed of the machine.
Since the SSE part has a pattern-length restriction, this
algorithm is particularly suitable for short patterns. For
longer patterns, it is necessary to carefully investigate the
selected default algorithm.</p>
      <p>SSE Search introduces an innovative method for
generating specialized code for long patterns by leveraging
SSE vector instructions. Incorporating this algorithm into a
code-generating database engine is straightforward. The only
challenge is adding the necessary SSE instructions to the
backends. Furthermore, this approach facilitates the seamless
implementation of various shift heuristics and dynamic
adjustment of the size of the start block. The performance of
the algorithm surpasses that of alternative methods across all
three versions, consistently delivering superior results.
Moreover, the performance is mostly bound by the available
memory bandwidth.</p>
      <p>Ultimately, tuning the performance of the
pattern-matching algorithms in a code-generating database
system is still a trade-off. The Preprocessed approach is
sufficient to improve performance compared to the classic
Naïve approach while keeping the complexity of the generated
code low. However, generating pattern-specific code for the
matching process further improves overall query throughput at
the cost of increased code complexity. Utilizing specific SSE
instructions for pattern matching offers the dual advantage of
enhancing performance and reducing code complexity in certain
aspects of the matching algorithm. Based on our experimental
findings, it can be inferred that when the required SSE
instructions are not supported by the hardware, no single
matching algorithm exhibits consistently superior performance
across all patterns. However, if hardware support is
available, we can conclude that for short patterns the
Hybrid-Search algorithm is superior, while for long patterns,
the new SSE Search algorithm is more effective. Integrating
both the Hybrid-Search and the SSE Search algorithm into a
code-generating database engine can be achieved seamlessly by
designating the SSE Search algorithm as the default fallback.
Furthermore, employing algorithms that utilize SSE
instructions offers an additional advantage: the code required
in the database engine for these algorithms is relatively
small and straightforward to maintain, and the generated code
is clear and well-structured, facilitating easy verification
and debugging.</p>
      <p>6. Conclusion</p>
      <p>This paper demonstrates the effectiveness of code
generation for pattern-matching algorithms to evaluate LIKE
predicates. The performance increases by up to a factor of two
compared to function calls and outperforms state-of-the-art
systems like Postgres, DuckDB, Hyper, or ClickHouse on
text-heavy datasets like ClickBench. The results indicate that
replacing generic function calls with pattern-specific
generated code significantly increases the throughput. We have
also demonstrated that using SSE instructions to compare
packed strings has a positive impact on overall performance,
particularly when incorporating them into the generated code.
Additionally, we have presented a generalized algorithm which
uses multiple SSE instructions to perform efficient pattern
matching for long patterns as a favorable alternative to
classic algorithms.</p>
      <p>References</p>
      <p>[1] A. Vogelsgesang, M. Haubenschild, J. Finis,
A. Kemper, V. Leis, T. Mühlbauer, T. Neumann, M. Then, Get
real: How benchmarks fail to represent the real world, in:
DBTest@SIGMOD, ACM, 2018, pp. 1:1–1:6.</p>
      <p>[2] D. Sidler, Z. István, M. Owaida, G. Alonso,
Accelerating pattern matching queries in hybrid CPU-FPGA
architectures, in: SIGMOD Conference, ACM, 2017,
pp. 403–415.</p>
      <p>[3] X. Wang, Y. Hong, H. Chang, K. Park, G. Langdale,
J. Hu, H. Zhu, Hyperscan: A fast multi-pattern regex matcher
for modern cpus, in: NSDI, USENIX Association, 2019,
pp. 631–648.</p>
      <p>[4] J. Hildebrandt, D. Habich, P. Damme, W. Lehner,
Compression-aware in-memory query processing: Vision, system
design and beyond, in: ADMS/IMDM@VLDB, volume 10195 of Lecture
Notes in Computer Science, Springer, 2016, pp. 40–56.</p>
      <p>[5] P. A. Boncz, T. Neumann, V. Leis, FSST: fast random
access string compression, Proc. VLDB Endow. 13 (2020)
2649–2661.</p>
      <p>[6] T. Neumann, Efficiently compiling efficient query
plans for modern hardware, Proc. VLDB Endow. 4 (2011)
539–550.</p>
      <p>[7] D. E. Knuth, J. H. Morris Jr., V. R. Pratt, Fast
pattern matching in strings, SIAM J. Comput. 6 (1977)
323–350.</p>
      <p>[8] R. S. Boyer, J. S. Moore, A fast string searching
algorithm, Commun. ACM 20 (1977) 762–772.</p>
      <p>[9] W. Rytter, A correct preprocessing algorithm for
boyer-moore string-searching, SIAM J. Comput. 9 (1980)
509–512.</p>
      <p>[10] R. N. Horspool, Practical fast searching in
strings, Softw. Pract. Exp. 10 (1980) 501–506.</p>
      <p>[11] M. Crochemore, D. Perrin, Two-way string matching,
J. ACM 38 (1991) 651–675.</p>
      <p>[12] E. A. Sitaridi, O. Polychroniou, K. A. Ross,
Simd-accelerated regular expression matching, in: DaMoN, ACM,
2016, pp. 8:1–8:7.</p>
      <p>[13] T. Neumann, M. J. Freitag, Umbra: A disk-based
system with in-memory performance, in: CIDR, www.cidrdb.org,
2020.</p>
      <p>[14] A. Kemper, T. Neumann, Hyper: A hybrid
oltp&amp;olap main memory database system based on virtual
memory snapshots, in: ICDE, IEEE Computer Society, 2011,
pp. 195–206.</p>
      <p>[15] S. Agarwal, D. Liu, R. Xin, Apache Spark as a
Compiler: Joining a Billion Rows per Second on a Laptop, 2016.
URL: https://www.databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html.</p>
      <p>[16] C. Diaconu, C. Freedman, E. Ismert, P. Larson,
P. Mittal, R. Stonecipher, N. Verma, M. Zwilling, Hekaton: SQL
server's memory-optimized OLTP engine, in: SIGMOD Conference,
ACM, 2013, pp. 1243–1254.</p>
      <p>[17] T. Hoff, Code Generation: The Inner Sanctum Of
Database Performance, 2016. URL:
http://highscalability.com/blog/2016/9/7/code-generation-the-inner-sanctum-of-database-performance.html.</p>
      <p>[18] S. Wanderman-Milne, N. Li, Runtime code generation
in cloudera impala, IEEE Data Eng. Bull. 37 (2014) 31–37.</p>
      <p>[19] T. Kersten, V. Leis, T. Neumann, Tidy tuples and
flying start: fast compilation and fast execution of
relational queries in umbra, VLDB J. 30 (2021) 883–905.</p>
      <p>[20] A. Kohn, V. Leis, T. Neumann, Adaptive execution of
compiled queries, in: ICDE, IEEE Computer Society, 2018,
pp. 197–208.</p>
      <p>[21] K. Thompson, Regular expression search algorithm,
Commun. ACM 11 (1968) 419–422.</p>
      <p>[22] A. Kuchling, Regular Expression HOWTO,
https://docs.python.org/3/howto/regex.html, 2023.</p>
      <p>[23] P. Wankadia, Glossary,
https://github.com/google/re2/wiki/Glossary, 2021.</p>
      <p>[24] .NET: Compilation and Reuse in Regular Expressions,
https://learn.microsoft.com/en-us/dotnet/standard/base-types/compilation-and-reuse-in-regular-expressions, 2021.</p>
      <p>[25] M. Lothaire, Combinatorics on words, Second
Edition, Cambridge mathematical library, Cambridge University
Press, 1997.</p>
      <p>[26] R. Ramanathan, Extending the world's most popular
processor architecture, 2006. URL:
https://web.archive.org/web/20090823145906/http://download.intel.com/technology/architecture/new-instructions-paper.pdf.</p>
      <p>[27] L. Lamport, Multiple byte processing with full-word
instructions, Commun. ACM 18 (1975) 471–475.</p>
      <p>[28] V. Leis, P. A. Boncz, A. Kemper, T. Neumann,
Morsel-driven parallelism: a numa-aware query evaluation
framework for the many-core age, in: SIGMOD Conference, ACM,
2014, pp. 743–754.</p>
      <p>[29] F. Gruber, M. Bandle, A. Engelke, T. Neumann,
J. Giceva, Bringing compiling databases to RISC architectures,
Proc. VLDB Endow. 16 (2023) 1222–1234.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>