<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adrian Riedl</string-name>
          <email>adrian.riedl@in.tum.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Fent</string-name>
          <email>fent@in.tum.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maximilian Bandle</string-name>
          <email>bandle@in.tum.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Neumann</string-name>
          <email>neumann@in.tum.de</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <issue>2023</issue>
      <abstract>
        <p>Efficiently evaluating text pattern matching is one of the most common computationally expensive tasks in data processing pipelines. Especially when dealing with text-heavy real-world data, evaluating even simple LIKE predicates is costly. Despite the abundance of text and the frequency of string-handling expressions in real-world queries, text processing is an afterthought for most systems. We argue that we must instead properly integrate text processing into the flow of DBMS query execution. In this work, we propose a code generation approach that specifically tailors the generated code to the given pattern and matching algorithm and integrates cleanly into DBMS query compilation. In addition, we introduce a generalized SSE Search algorithm that uses a sequence of SSE instructions to compare packed strings in the generated code to efficiently locate longer input patterns. Our approach of generating specialized code for each pattern eliminates the overhead of interpreting the pattern for each tuple. As a result, we improve the performance of LIKE pattern matching by up to 2.5×, demonstrating that code generation can significantly improve the efficiency of LIKE predicate evaluation in DBMSs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR Joint Workshops at 49th International Conference on Very Large Data Bases (VLDBW'23) - Workshop on Accelerating Analytics and Data</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License.</p>
      <p>We can also inline the generated code in a larger processing function by inlining the patterns and shift tables.</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Modern data processing systems offer outstanding
performance on simple data, which makes them an essential
component of efficient data processing pipelines.
However, these systems are still lacking in compute-heavy
string processing, which is common in real-world
applications. Tableau’s research shows that approximately
50% of all attributes use text-based data types, even when
there are more suitable data types [1]. Thus, database
systems need to focus on efficient text operations such
as LIKE expressions.</p>
      <p>In current systems, a common technique to process
text is to use a third-party library that focuses on
matching string patterns, often offering advanced features such
as SIMD acceleration [2, 3]. Unfortunately, these do
not integrate well with DBMS-specific text
representations. For example, DBMSs commonly use special string
storage formats, e.g., with parts of the string inlined or
lightweight compressed [4, 5]. However, to use external
text libraries, these systems need expensive string
conversions before they can invoke a matching function, while
a better integrated approach could allow just-in-time
decompression. In addition, string search libraries optimize
for finding patterns in a large continuous text corpus,
e.g., a text document. In contrast, for database systems,
the per-tuple overhead to interpret the string pattern or
transition tables leaves significant performance on the
table.</p>
      <sec id="sec-2-3">
        <title>Stream</title>
        <sec id="sec-2-3-2">
          <title>Generated</title>
          <p>Preprocessed: The pattern is preprocessed only once
before the first search starts, and the result of this
preprocessing (e.g., a transition table) is stored. For each tuple,
the database engine still calls a generic pattern-matching
function, but this function reuses the stored information
in the search.</p>
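          <p>As an illustration of this split (not Umbra’s actual code), C++17’s std::boyer_moore_searcher can play the role of the stored preprocessing result: it is built once per query, and a generic per-tuple call only reuses it. The struct and function names here are hypothetical.</p>

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch: the searcher object is the stored preprocessing
// result (e.g., a transition table), built once at query compile time.
struct PreprocessedPattern {
    std::string pattern;
    std::boyer_moore_searcher<std::string::const_iterator> searcher;
    explicit PreprocessedPattern(std::string p)
        : pattern(std::move(p)), searcher(pattern.begin(), pattern.end()) {}
};

// Generic per-tuple matching function: it reuses the stored searcher
// instead of re-deriving the shift tables for every tuple.
bool matchesLike(const std::string& tuple, const PreprocessedPattern& pre) {
    return std::search(tuple.begin(), tuple.end(), pre.searcher) != tuple.end();
}
```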
          <p>In contrast to these approaches, we need to integrate
LIKE pattern matching deeper into the data processing
engine. Code generation allows specializing the match
kernel, e.g., using data-centric code generation [6], or
in just-in-time compiled vectorized functions. In this
work, we integrate common algorithms such as
Knuth-Morris-Pratt [7], Boyer-Moore [8, 9, 10], Two-Way [11],
and SIMD-optimized routines [12]:</p>
          <p>Generated: During query compilation, we preprocess
the pattern once. Then, we generate code for the entire
search process using as many preprocessed results as
possible. The entire matching process is performed in
the generated code to avoid repeated function calls.</p>
          <p>In this paper, we focus on how to generate code for the
most common subset of regular expressions in SQL: LIKE
expressions. Section 3 introduces the string-matching
algorithms and outlines how our Generated approach
utilizes a code generation framework to rebuild each
matching function and specialize the code for the given pattern.
Furthermore, we present a generalized SSE Search
algorithm that generates highly efficient code for longer
patterns by utilizing SSE instructions. In Section 4, we
evaluate these algorithms alongside the mentioned
options in our code-generating DBMS Umbra [13]. This
provides a fair analysis of the different options to evaluate
LIKE predicates within a single system. While patterns
are typically short, we also analyze the performance with
longer patterns, reaching up to nearly 300 characters.
Additionally, we compare our Generated approach with
the publicly available systems Postgres, DuckDB, Hyper,
and ClickHouse.</p>
          <p>2. Related work</p>
          <p>Kemper and Neumann introduced HyPer, a pure in-memory
database for both OLTP and OLAP workloads [14], with
a data-centric approach for generating and compiling
compact and efficient machine code using LLVM [6].</p>
          <p>The idea of code generation has become widely
accepted among database researchers and developers [15,
16, 17, 18]. Umbra [13], the research successor of the
HyPer system, introduced a type-safe code generation
framework with a tailored intermediate representation to
compile directly into machine code, which makes Umbra
applicable for low-latency applications [19]. Together
with adaptive execution, this allows us to switch between
prioritizing low latency or high throughput in
execution [20].</p>
          <p>In general, regular expression matching can benefit
from using just-in-time code generation. Thompson
introduced the concept in 1968 to produce IBM 7094
programs for locating characters by regular expressions [21].
Today, many state-of-the-art libraries use code generation
to convert the given regular expression pattern into
an internal representation of bytecode: Python’s re
module generates bytecode for regular expressions and uses
an internal C engine for efficient execution [22]; Google’s
re2 represents the regular expression as an automaton in
bytecode and passes it for interpretation to an execution
engine [23]; Microsoft’s .NET framework provides both
a bytecode interpreter and a just-in-time compiler that
converts the expression to native machine code [24].</p>
          <p>However, only a few projects aim to generate machine
code for regular expression matching at runtime, and no
database system actively combines code generation and
pattern matching.</p>
          <p>3. Implementation</p>
          <p>This section outlines how pattern-matching algorithms
can use parts of code-generation frameworks to be
integrated into compiling database engines. The concept
of code generation for queries, as used in HyPer [14, 6]
or Umbra [13], presents a novel opportunity for
evaluating LIKE expressions in relational database systems
by generating code for the matching process instead of
interpreting the pattern. We start with Naïve, which uses
a hand-written function for pattern matching. It is called
for every tuple in the generated code by the database
system. We can already improve its performance by
preprocessing the pattern passed to the function. However,
we aim to entirely replace the function by generating
code specifically for the pattern and algorithm. The
generated code is then directly embedded in the
surrounding generated function code. Our current focus is
on constant LIKE patterns without any underscores or
collations. Thus, a bytewise comparison between the
pattern and input text is feasible, allowing us also to handle
non-ASCII characters. Within Umbra, we are using our
type-safe code-generation framework, which allows us to
pass the generated code in static single assignment (SSA)
form on to Umbra’s different execution backends [19].
Throughout this chapter, let us consider the following
query that filters the uni relation and counts how many
names contain the relatively short pattern 'TUM':</p>
          <p>select count(*) from uni where name like '%TUM%';</p>
          <p>In the upcoming Section 3.1 and Section 3.2, we discuss
the Knuth-Morris-Pratt (KMP) and Boyer-Moore (BM)
algorithms and explain in detail how they integrate into
the code generation process. Section 3.3 briefly
introduces the Two-Way (TW) search algorithm that
combines KMP and BM. Moving away from the well-known
pattern matching algorithms, Section 3.4 presents the
Hybrid-Search (HS) algorithm, which uses an SSE
instruction to match patterns up to a certain size. In Section 3.5,
we show how blockwise processing can improve
performance in finding possible occurrences of the pattern.
Finally, Section 3.6 presents the SSE Search algorithm,
which uses SSE instructions to perform pattern matching
for longer patterns. It is important to note
that this algorithm can only be implemented effectively
in a code-generating database engine. This is due to the
diverse nature of input patterns, where the flexibility
offered by a code-generating process surpasses that of an
interpreting algorithm.</p>
          <p>3.1. Knuth-Morris-Pratt Algorithm</p>
          <p>In 1977, Knuth, Morris, and Pratt introduced an algorithm
for performing exact string pattern matching without the
need to backtrack in the input text by preprocessing the
pattern [7]. The algorithm builds a table with pattern
length + 1 entries that point to where, in the pattern, we
continue the search after a mismatch. Thus, the table
holds information about the longest proper prefix that is
also a proper suffix of the pattern (lps). For the prefix of
length 0, the table stores the value −1, indicating that if a
mismatch occurs, the pattern can be shifted one position
to the right since no suffix exists at that point.</p>
          <p>1 KMP(text, pattern):
2 lpsTable = preprocess(pattern)
3 pPos = 0; pSize = pattern.size()
4 tPos = 0; tSize = text.size()
5 while (tPos - pPos + pSize &lt;= tSize)
6 if (pattern[pPos] == text[tPos])
7 pPos++; tPos++
8 if (pPos == pSize) return true
9 else
10 shift = lpsTable[pPos]
11 if (shift &lt; 0) pPos = 0; tPos++
12 else pPos = shift
13 return false</p>
          <p>Listing 1: Pseudocode for the Knuth-Morris-Pratt algorithm</p>
3.1. Knuth-Morris-Pratt Algorithm
In 1977, Knuth, Morris, and Pratt introduced an algorithm
for performing exact string pattern matching without the
need to backtrack in the input text by preprocessing the
pattern [7]. The algorithm builds a table with pattern
length + 1 entries that point to where, in the pattern, we
continue the search after a mismatch. Thus, the table
holds information about the longest proper prefix that is
also a proper sufix of the pattern ( lps). For the prefix of
length 0, the table stores the value −1, indicating that if a
mismatch occurs, the pattern can be shifted one position
to the right since no sufix exists at that point.
1 KMP(text, pattern):
2 lpsTable = preprocess(pattern)
3 pPos = 0; pSize = pattern.size()
4 tPos = 0; tSize = text.size()
5 while (tPos - pPos + pSize &lt;= tSize)
6 if (pattern[pPos] == text[tPos])
7 pPos++; tPos++
8 if (pPos == pSize) return true
9 else
10 shift = lpsTable[pPos]
11 if (shift &lt; 0) pPos = 0; tPos++
12 else pPos = shift
13 return false
whileLoopHeader:
check tPos - pPos + 6 1 ≤ text.size()
check pPos = 0
3.1.1. Preprocessed approach
To prevent the pattern from being processed repeatedly
for each tuple, we preprocess the pattern during code
Listing 1: Pseudocode for the Knuth-Morris-Pratt generation time to get the lps table and then save this
algorithm table along with the pattern in the generated program.</p>
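          <p>A runnable C++ transcription of Listing 1, including the early-return check in the loop header; buildLps is a standard border-table construction standing in for preprocess(pattern), which the paper does not spell out:</p>

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Border table: lps[p] is the length of the longest proper prefix of
// pattern[0..p-1] that is also a suffix; lps[0] = -1 as in the paper.
std::vector<int64_t> buildLps(const std::string& pattern) {
    int64_t m = pattern.size();
    std::vector<int64_t> lps(m + 1);
    lps[0] = -1;
    int64_t k = -1;
    for (int64_t i = 0; i < m; i++) {
        while (k >= 0 && pattern[k] != pattern[i]) k = lps[k];
        k++;
        lps[i + 1] = k;
    }
    return lps;
}

bool kmpSearch(const std::string& text, const std::string& pattern) {
    std::vector<int64_t> lps = buildLps(pattern);
    int64_t pPos = 0, pSize = pattern.size();
    int64_t tPos = 0, tSize = text.size();
    if (pSize == 0) return true;            // empty pattern matches everywhere
    while (tPos - pPos + pSize <= tSize) {  // early return (line 5)
        if (pattern[pPos] == text[tPos]) {
            pPos++; tPos++;
            if (pPos == pSize) return true;
        } else {
            int64_t shift = lps[pPos];
            if (shift < 0) { pPos = 0; tPos++; }  // restart at next text char
            else pPos = shift;                    // reuse matched prefix
        }
    }
    return false;
}
```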
          <p>Listing 1 presents the pseudocode of the KMP
algorithm. Line 2 preprocesses the pattern and initializes two
position counters for text and pattern. When the
characters at these indexes match (lines 6 to 8), the function
increments both position counters. When reaching the
pattern end, a match is found.</p>
          <p>If the characters do not match (lines 9 to 12), the
function reads the optimal shift value from the lps table based
on the pattern position. A negative value indicates that
there is no proper suffix in the pattern that is also a prefix
of the pattern. Thus, we need to restart the comparison
from the following text character. Otherwise, the
function updates the pattern position to the lps table value
and continues comparing at the current text index.</p>
          <p>In line 5 of Listing 1, we introduce an optimization
for the KMP algorithm called the early return, which we
use in all our variants. This optimization checks every
iteration to determine whether the end of the pattern lies
within the text. If that is not the case, we stop the
comparison, as it is impossible to find another match.</p>
          <p>3.1.1. Preprocessed approach</p>
          <p>To prevent the pattern from being processed repeatedly
for each tuple, we preprocess the pattern during code
generation time to get the lps table and then save this
table along with the pattern in the generated program.
We do not need to store any additional information for
the lps table because its size depends on the pattern size.
When executing the generated code, we still make a
function call to perform the search, but instead of
recalculating the table every time, we reuse the stored lps table.
So, we omit the preprocessing in line 2 of Listing 1.</p>
          <p>3.1.2. Generated approach</p>
          <p>We replace all calls to the matching function with
specifically generated code for the query’s pattern. So, we
embed the search phase algorithm entirely within the
generated code. Figure 2 shows how we can reconstruct
the Knuth-Morris-Pratt algorithm using the example we
discussed earlier to search for the pattern 'TUM' in the
input text.</p>
          <p>We begin the algorithm in the whileLoopHeader block
by checking if the remaining length of the input text is
sufficient. In this check, we can inline the size of the
pattern into the arithmetic expression ①. Then, we move to
the correct pattern position to continue the comparisons
from this position. We generate the comparisons out of
the pattern from left to right ②. If the characters match,
we proceed directly to the next character, but if there is a
mismatch, we jump to the performShift block. In this
block, we choose the shift value from the inlined lps
table ③ based on the position of the failed comparison. We
then determine how to proceed and continue the search
by jumping back to the whileLoopHeader.</p>
          <p>3.2. Boyer-Moore Algorithm</p>
          <p>In 1977, Boyer and Moore presented a pattern-matching
algorithm that iterates backwards over the pattern to
search it in the input text [8]. We focus on their fast
implementation shown in Listing 2.</p>
          <p>Before the search phase begins, the algorithm requires
two preprocessing steps, namely the Bad Character
Heuristic (BCH) in line 6 and the Good Suffix Heuristic (GSH) in
line 7. The BCH ensures that the text’s letter at which
the mismatch occurred aligns with its rightmost
occurrence in the pattern. Alternatively, the GSH shifts the
pattern based on the longest suffix of the matched input
text. This part is aligned with the rightmost occurrence
of that character sequence in the pattern (except for the
suffix of the pattern itself). Both heuristics precalculate
shift values for the pattern and store them in tables. The
original papers contain a more detailed explanation of
the heuristics and their effects [8, 9].</p>
          <p>The fast implementation requires a third table, δ0, which
is essentially a copy of the result of the BCH but holds the
value Ψ (called large in [8]) for the last character of the
pattern. This value needs to be greater than the sum of
the lengths of all possible input texts and patterns. At the
beginning of the search phase, the pattern is aligned with
the start of the input text, and the comparison starts from
the rightmost character. The implementation looks up
the value in δ0 and either shifts the pattern to the right or
adds Ψ to the current position (line 11 of Listing 2). This
allows us to scan through the input text, and once we
add Ψ to the current position, we know the last character
was found. Thus, the algorithm recalculates the index
for the second last character and starts comparing the
pattern from right to left with the input text (lines 16
to 20). In case of a mismatch during this comparison, the
algorithm applies the maximum of the shifts according
to the heuristics (line 21) before the search for the last
character of the pattern restarts.</p>
          <p>3.2.1. Preprocessed approach</p>
          <p>As the BCH table contains 256 values and the size of the
GSH table matches the pattern length, repetitive
preprocessing is quite expensive. Like the KMP algorithm, we move
all preprocessing steps to code generation and store the
resulting tables directly along with their corresponding
pattern. When storing the tables, we do not require
additional information since the size of the tables is either
known beforehand or can be derived from the pattern
size. We replace the calls to the preprocessing functions
(lines 6 to 7) with pointers to the corresponding table.</p>
          <p>As the only difference between δ0 and δ1 is the value
for the pattern’s last character, we do not copy the table
but instead modify the code in the search phase loop to
actively add the correct value, so either the value Ψ or
the value from δ1.</p>
          <p>3.2.2. Generated approach</p>
          <p>1 BM(text, pattern):
2 pSize = pattern.size()
3 tSize = text.size()
4 // Ψ &gt; tSize + pSize for all inputs
5 Ψ = 1 &lt;&lt; 48; pPos = pSize - 1; tPos = pPos
6 δ1 = preprocessBadCharacterHeuristic(pattern)
7 δ2 = preprocessGoodSuffixHeuristic(pattern)
8 δ0 = δ1
9 δ0[pattern[pSize - 1]] = Ψ
10 while (tPos &lt; tSize)
11 tPos += δ0[text[tPos]]
12 if (tPos &gt;= Ψ)
13 tPos = tPos - Ψ - 1
14 if (pSize == 1) return true
15 else
16 pPos = pSize - 2
17 while (pPos &amp;&amp; text[tPos] == pattern[pPos])
18 pPos--; tPos--
19 if (!pPos &amp;&amp; text[tPos] == pattern[pPos])
20 return true
21 tPos += max(δ1[text[tPos]], δ2[pPos])
22 return false</p>
          <p>Listing 2: Pseudocode for the Boyer-Moore algorithm</p>
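          <p>The fast loop of Listing 2 can be sketched in C++. This simplified version keeps δ0 with the Ψ sentinel but, for brevity, verifies candidates with memcmp and shifts by a single position instead of applying max(δ1, δ2), so it is the skeleton of Listing 2 rather than the full algorithm:</p>

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <string>

// Simplified Boyer-Moore fast loop: delta0 is the bad-character table
// with the sentinel PSI for the pattern's last character. On a candidate,
// the rest of the pattern is verified with memcmp, and the shift after a
// failed verification is 1 (a sketch, not the full good-suffix logic).
bool bmSearch(const std::string& text, const std::string& pattern) {
    int64_t pSize = pattern.size(), tSize = text.size();
    if (pSize == 0) return true;
    if (pSize > tSize) return false;
    const int64_t PSI = int64_t(1) << 48;  // > tSize + pSize for all inputs
    std::array<int64_t, 256> delta0;
    delta0.fill(pSize);                    // char absent: skip pattern length
    for (int64_t i = 0; i < pSize - 1; i++)
        delta0[(unsigned char)pattern[i]] = pSize - 1 - i;
    delta0[(unsigned char)pattern[pSize - 1]] = PSI;
    int64_t tPos = pSize - 1;
    while (tPos < tSize) {
        tPos += delta0[(unsigned char)text[tPos]];
        if (tPos >= PSI) {
            tPos -= PSI;                   // last pattern character matched here
            if (std::memcmp(text.data() + tPos - (pSize - 1),
                            pattern.data(), pSize - 1) == 0)
                return true;
            tPos += 1;                     // simplified shift by one position
        }
    }
    return false;
}
```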
          <p>Similar to the KMP algorithm, we rebuild the
Boyer-Moore algorithm to get pattern-specific code. Figure 3
presents the conceptual control flow for searching the
pattern 'TUM' in the input text. We check whether enough
characters are left in the input string before we start the
matching process in the whileLoopHeader block. If so,
we get the shift value from the inlined δ0 table ① and add
it to the pattern position. If that value is smaller than Ψ,
we continue in this loop by going to the whileLoopHeader
block. Otherwise, we know the input text contains the
last character of the pattern, so further checks are
required. Before the checks, we recalculate the
character index aligned with the second last character in the
pattern. In order to proceed, we check the text based
on the reversed pattern ②. In case of a mismatch, we
jump to the performShift block. In this block, we
determine goodShift from δ2 ③ based on the preceding
matching text block and badShift from δ1 ④ based on the
mismatching character. We then add the maximum of
both shift values to the text position before returning
to the whileLoopHeader to continue with the matching
process.</p>
          <p>Figure 3 (excerpt): the whileLoopHeader block checks
tPos &lt; text.size(); the performShift block computes
goodShift = δ2[…] (here [5, 4]) ③ and
badShift = match text[tPos] { 'T': 2, 'U': 1, 'M': 0, _: 3 } ④,
and then sets tPos = tPos + max(badShift, goodShift).</p>
          <p>When analyzing the instructions in the performShift
block of Figure 3, one may mistakenly perceive the
inclusion of the inlined δ1 table for determining the shift
caused by the BCH as unnecessary. This impression
arises from the fact that the minimum shift resulting
from the GSH is always greater than the maximum
possible shift caused by the BCH. Consequently, the
maximum shift is consistently determined by the good suffix
heuristic. It is important to note, however, that this
observation cannot be universally applied to all patterns.
To address this, we have implemented an optimization
in the code generation process, which generates code for
determining the BCH shift only when it is truly required.</p>
          <p>3.3. Two-Way Algorithm</p>
          <p>As an alternative to the Knuth-Morris-Pratt and
Boyer-Moore algorithms, Crochemore and Perrin presented the
Two-Way String-Matching algorithm (TW), which
combines both previous algorithms into one [11]. To achieve
this, the algorithm first splits the pattern according to
the known Critical Factorization Theorem [25]. With
that, the pattern is theoretically split into a left and a right
part. In the search phase, the right half is compared from
left to right first; if all characters match, then the left
half is compared from right to left. If any mismatches
occur during the comparisons, the pattern is shifted by a
certain number of positions. After a close analysis of the
interpreting Two-Way algorithm, its functionality can be
rebuilt using the available code generation framework as
for the other algorithms.</p>
          <p>Again, we implement a Naïve, Preprocessed, and
Generated version for the Two-Way algorithm. In the Naïve
version, the Critical Factorization preprocessing step is
performed repeatedly for each input text. In the
Preprocessed version, we store the necessary preprocessed
result along with the pattern in the data section of the
generated code. This value is then loaded along with the
pattern when needed and used in an interpretive
algorithm. The Generated version of the algorithm depends
on the output of the preprocessing function. It generates
the relevant part of the Two-Way algorithm based on the
outcome of the Critical Factorization step and inlines as
much of the information as possible into the algorithm.</p>
          <p>3.4. Hybrid-Search Algorithm</p>
          <p>Sitaridi et al. have introduced an algorithm which uses
the SSE 4.2 SIMD instruction set, which comprises
instructions that efficiently accelerate string and text
processing [12]. According to Intel, these instructions are
designed to enhance the performance of databases or
complex searching and pattern matching algorithms [26].
However, the presented algorithm is restricted to patterns
that fit into a 128-bit SIMD register.</p>
          <p>1 HS(text, pattern):
2 if (pattern.size() &lt;= 16 &amp;&amp; text.size() &gt;= 16)
3 iter = text.begin(); end = text.end()
4 safeMatch = 17 - pattern.size()
5 pattern16 = load16(pattern)
6 while ((iter + 16) &lt; end)
7 match = pcmpistri(pattern16, load16(iter))
8 if (match &lt; safeMatch) return true
9 iter += safeMatch
10 if (iter &lt; end)
11 match = pcmpistri(pattern16, load16(end - 16))
12 return match &lt; safeMatch
13 return false
14 return TW(text, pattern)</p>
          <p>Listing 3: Pseudocode for the Hybrid Search algorithm</p>
          <p>With our Hybrid Search, we extend this algorithm to
handle any size of input text and pattern, as presented
in Listing 3: For patterns up to the length of a vector
register, we use the pcmpistri instruction, given that
the input text is at least 16 bytes long. In such cases, we
process 16 bytes of the input text at once until less than
16 bytes are left (lines 6 to 9). To check the end of the
input text, we load the last 16 bytes of the input text,
which is safe since we know that the input text is long
enough (lines 10 to 12). However, if either of the input
parameters does not meet its length criterion, we resort
to a default string search algorithm. In our case, it is
the Two-Way algorithm (line 14). Considering that the
best-suited algorithm depends on various factors, such
as the pattern and workload, it would be beneficial to
implement multiple fallback algorithms, allowing the
selection of the most appropriate one.</p>
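          <p>Listing 3’s control flow can be mirrored portably. Here cmpistri is a scalar stand-in for the pcmpistri intrinsic (equal-ordered mode), and std::string::find replaces the Two-Way fallback, so this is a behavioral sketch rather than the SIMD implementation:</p>

```cpp
#include <algorithm>
#include <cstring>
#include <string>

// Scalar stand-in for pcmpistri (equal-ordered): returns the first offset
// in the 16-byte block where the pattern begins to match (possibly only
// as a prefix reaching the block end), or 16 if no position matches.
int cmpistri(const char* pattern, size_t pSize, const char* block) {
    for (int i = 0; i < 16; i++) {
        size_t n = std::min(pSize, size_t(16 - i));
        if (std::memcmp(block + i, pattern, n) == 0) return i;
    }
    return 16;
}

bool hybridSearch(const std::string& text, const std::string& pattern) {
    size_t pSize = pattern.size(), tSize = text.size();
    if (pSize <= 16 && tSize >= 16) {
        size_t safeMatch = 17 - pSize;       // match below this is definite
        const char* iter = text.data();
        const char* end = text.data() + tSize;
        while (iter + 16 < end) {
            int match = cmpistri(pattern.data(), pSize, iter);
            if ((size_t)match < safeMatch) return true;
            iter += safeMatch;               // all earlier starts are ruled out
        }
        if (iter < end) {                    // check the final 16-byte block
            int match = cmpistri(pattern.data(), pSize, end - 16);
            return (size_t)match < safeMatch;
        }
        return false;
    }
    // fallback for long patterns or short texts (the paper uses Two-Way)
    return text.find(pattern) != std::string::npos;
}
```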
          <p>3.4.1. Preprocessed approach</p>
          <p>Based on the chosen fallback algorithm, one might
consider how to include the corresponding Preprocessed
function of the chosen algorithm. To match the chosen
Two-Way algorithm as the default fallback algorithm in the Naïve
approach, we chose the Preprocessed version for this
approach.</p>
          <p>1 uint64_t block = loadNext8Bytes(...)
2 // broadcast 'T' to each byte: 0x5454545454545454ull
3 uint64_t searchedChar = broadcast('T')
4 const uint64_t high = 0x8080808080808080ull
5 const uint64_t low = ~high
6 uint64_t lowChars = (~block) &amp; high
7 uint64_t cleared = (block &amp; low) ^ searchedChar
8 uint64_t found = ~((cleared + low) &amp; high)
9 uint64_t matches = found &amp; lowChars
10 bool matchFound = matches != 0</p>
          <p>Listing 4: Blockwise search for ASCII character 'T'</p>
          <p>3.4.2. Generated approach</p>
          <p>To generate code for our Hybrid Search, we extended
the custom code generation framework of Umbra to
support the necessary SSE instruction for comparing packed
strings. This enables us to generate the parts of the
algorithm needed for the specific pattern. For patterns that
are longer than 12 bytes, we only generate the code for
the default matching algorithm, as we do not use the
SSE instruction for that kind of pattern. For shorter
patterns, we generate both the part using the SSE
instruction and the default fallback. While executing the code,
we determine which part of the algorithm to use based
on the length of the input text. The decision to set the
limit to 12 bytes is guided by the fact that this still allows
performant shifting of the input pattern (cf. safeMatch,
measured in Figure 9 in Section 4.3.1).</p>
          <p>3.5. Blockwise Processing Optimization</p>
          <p>Blockwise Processing can improve the initial pattern
search. It draws inspiration from SIMD within a register
(SWAR) [27] and enhances the efficiency of character
search in an input text over the naïve idea, which would
involve iterating over the text and examining each
character, wasting cycles simply searching
for the desired character. Blockwise processing
can instead rapidly locate the first character
of the pattern and then continue with the chosen
pattern-matching algorithm. Listing 4 demonstrates the
algorithm to detect the presence of the ASCII character
'T' in a block. We read the next eight bytes from the
input text into a register. With another register holding
the character broadcast to each byte, we perform
various bitwise operations between the registers and specific
constants. After these operations, we get a value back
which is either 0, meaning 'T' could not be found in the block,
or in which the highest bit of the byte at which the character
appeared is set. This code can also be adjusted for non-ASCII
characters, which have the highest bit set. While certain SSE
instructions may provide similar functionality, our
objective is to present a versatile approach that is not limited
to any specific hardware support.</p>
          <p>3.6. SSE-Search Algorithm</p>
          <p>The Hybrid Search algorithm employs an SSE instruction
for pattern matching, limiting the pattern length to at
most 16 bytes. For longer patterns, the algorithm
provides an alternative approach that does not use the SSE
instruction. Expanding the algorithm to handle longer
patterns with SSE instructions is feasible but significantly
increases its complexity. However, the emergence of
code-generating database engines has opened up new
possibilities for generating code optimized with SSE
instructions that is specifically tailored to long patterns.
In Figure 4, we present the conceptual design of the
generated code for searching the pattern 'Technical
University of Munich'. Similar to the KMP algorithm, we
initially check if the pattern can fit within the remaining
text by directly including the pattern length ①. If there
are enough characters remaining, we proceed with the
comparison.</p>
          <p>The search algorithm aims to locate the starting
position of the pattern within the input text. Once the start
position is found, we continue comparing subsequent
parts of the pattern and text sequentially from that
position. To achieve this, we extract the first 16 bytes of the
pattern and load the next 16 bytes from the text. Using
the SSE instruction pcmpistri, we search for the start of
the pattern in the input text ②. If no match is found, we
shift the text position to the right and restart the overall
search for the pattern start in the input text. In the case
of a match, we enter the generated code, which loads
the next 16 bytes from the input text and compares them
to the corresponding part of the pattern. This
comparison can be performed using either the SSE instruction
pcmpistri or another binary comparison function for
vector registers. Since we know that the subsequent
pattern blocks must follow the previous ones, we can easily
generate code to handle this logic. This is repeated until
less than 16 bytes of the pattern are left.</p>
          <p>Handling the remaining bytes of the pattern requires
special handling, as both the pattern and input text may
not fully occupy an SSE register. Therefore, we load a
maximum of 16 bytes from the input text and also return
how many bytes were read. With the loaded block and
the number of read bytes, we employ the SSE
instruction pcmpestri ③. This instruction requires explicit
specification of the length of the input data as additional
arguments. If there is a mismatch between one of the blocks of the
pattern and the corresponding text block, we stop the
comparison and go to the performShift block. Within
this block, we apply a shift heuristic that moves the
pattern as far to the right as feasible. Following the pattern
shift, we return to the whileLoopHeader to resume the
matching process by checking the remaining length of
the text.</p>
          <p>Shift heuristics. For shift heuristics, we have two
options: a simple shift to the right by one position or a
more advanced KMP-like heuristic. The latter relies on
identifying the longest suffix of the already matched
pattern that is also a proper prefix. This operation results
in no additional runtime overhead since it can be
preprocessed during code generation and is directly written to
the code.</p>
          <p>Size of start block. Figure 4 presents the version of
the algorithm which directly loads the first 16 bytes
of the pattern into a vector register. However, when
fully utilizing the vector register, one can only shift one
position to the right if the start of the pattern does not
match the loaded text block. To increase the possible
shift in case this part is not found in the loaded input
text, we can reduce the number of bytes loaded from the
pattern.</p>
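          <p>The block chain described above could, for instance, unroll to the following scalar sketch for this specific pattern; memcmp stands in for the 16-byte SIMD comparisons, and the shift heuristic is the simple shift-by-one (both assumptions for illustration):</p>

```cpp
#include <cstring>
#include <string>

// Hypothetical shape of generated code for the long pattern
// "Technical University of Munich" (30 bytes): find a candidate start
// block, then compare the remaining pattern bytes as a tail block.
bool sseSearchSketch(const std::string& text) {
    static const char pattern[] = "Technical University of Munich";
    const size_t pSize = 30;
    size_t tPos = 0;
    while (tPos + pSize <= text.size()) {            // whileLoopHeader (1)
        // locate a candidate start: the first 16 bytes of the pattern (2)
        if (std::memcmp(text.data() + tPos, pattern, 16) != 0) {
            tPos++;                                  // simple shift heuristic
            continue;
        }
        // remaining 14 bytes, compared as one tail block (3)
        if (std::memcmp(text.data() + tPos + 16, pattern + 16, pSize - 16) == 0)
            return true;
        tPos++;
    }
    return false;
}
```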
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Evaluation</title>
      <p>In order to check our implementations on a more realistic
dataset, we use ClickBench¹. It includes typical modern
workloads and queries used in ad-hoc analytics and
real-time dashboards. The data used in the benchmark is
collected from a real-world web analytics platform. While
it is anonymized, it retains the essential distributions of
the data, including non-ASCII characters. For our
experiments, we used queries 20, 21, 22, and 23 from
the ClickBench benchmark, which contain the following
LIKE predicates:</p>
      <p>Q 20, 21, 23: url like '%google%'
Q 22: title like '%Google%' and url not like '%.google.%'</p>
      <p>Query 20 scans the relation hits and counts how many
tuples fulfill the predicate; the other queries involve more
operators like aggregates or sorting, so the overall
performance is not entirely dominated by the pattern matching
algorithm.</p>
      <p>Since the patterns mentioned above are shorter than
the length of a vector register, we classify them as short
patterns. To evaluate the effects of longer patterns, we
increased the pattern length for Q 20 to 31, 160, and 291
characters, categorizing them as long patterns.
We run the microbenchmarks on an Intel i9-7900X
CPU (Skylake-X, 3.3-4.5 GHz) with 10 cores and 128 GB
4-channel DDR4-2133 memory, running Ubuntu 22.10
(Kernel 5.19, gcc 12.2), and repeat all measurements five
times.</p>
      <p>4.1. Full System Comparison</p>
      <p>We compared our Generated approaches for pattern
matching with other popular database systems, namely
Postgres, DuckDB, Hyper, and ClickHouse. Figure 5
demonstrates that our approach of generating pattern-specific
code performs better than the other databases. While
Umbra outperforms the other databases for Query 21
and 22, Hyper is slightly faster than our Boyer-Moore
algorithm for Query 20 but slower than the other three
algorithms. However, Hyper uses a pattern matching
algorithm which is quite similar to our Hybrid-Search
algorithm, also using an SSE instruction to search for
the given pattern. As our pattern is relatively short, we
observe that in nearly all cases we benefit from
generating pattern-specific code. Due to the shortness of the
pattern, one can observe that the Hybrid-Search benefits
from the SSE instruction, as it clearly dominates the other
algorithms, especially for Query 20.</p>
      <p>1 https://benchmark.clickhouse.com</p>
      <p>Nave</p>
      <p>Q 22
Postgres
ClickHouse</p>
      <p>DuckDB
Hyper</p>
      <p>KMP
BM</p>
      <p>TW</p>
      <p>HS</p>
      <p>Non-Blockwise
(Section 3.1)</p>
      <p>Blockwise
(Section 3.5)
not found. Consequently, the Preprocessed approach is
observe that in nearly all cases we benefit from gener- faster than the Naïve one since it doesn’t need to access
ating pattern specific code. Due to the shortness of the the lps table as often. Still, the Generated approach
repattern, one can observe that the Hybrid-Search benefits mains superior to the other alternatives. Based on these
from the SSE instruction, as it clearly dominates the other findings, we directly focus further analysis on the KMP
algorithms, especially for Query 20. algorithm with blockwise processing.
4.2. Short Pattern Microbenchmark
4.2.2. Algorithm Comparison
We continue our comparison within our database system Figure 7 illustrates the results of comparing the matching
Umbra to analyze the diferences between the algorithms algorithms discussed in Section 3 for Q 20 and Q 21 of the
in detail. In this section, we will focus on the more com- ClickBench benchmark. The Preprocessed and Generated
mon case of short patterns. approaches outperform the interpreting approaches.
This is because the preprocessing phases of the KMP
4.2.1. Blockwise Processing and BM algorithms generate large lookup tables. By
employing code generation capabilities for the
PreproOur initial investigation evaluates the eficacy of using cessed approach, we can avoid redundant preprocessing
Blockwise Processing in combination with KMP as ex- of the pattern. Storing the tables in the data section of
plained in Sections 3.1 and 3.5. If a mismatch occurs, we the generated program leads to a substantial
improvecheck how the KMP algorithm would shift the pattern ment in throughput. As the Generated approach suggests,
according to its preprocessing. In case the pattern would generating highly specialized code further enhances
perbe shifted by one character, we switch back to blockwise formance. However, for the Boyer-Moore algorithm, the
processing and restart the search for the first charac- performance improves less in Query 20. According to
ter of the pattern. Figure 6 illustrates the advantages further analysis, this is due to many branches in the
of blockwise processing compared to the non-blockwise generated code, which can result in mispredictions and
approach for Query 20 on the ClickBench dataset. By overall slow performance.
applying this optimization, larger blocks of the input text For Query 21, we observe similar behavior as for Query
can be processed at once instead of reading byte by byte. 20. However, the generated code for the Boyer-Moore</p>
      <p>In the non-blockwise case, both the Naïve and Prepro- algorithm appears marginally diferent, leading to higher
cessed versions show similar throughputs. After conduct- performance improvement for the Generated approach
ing a performance analysis, we identified that loading over the Preprocessed version.
the lps table values from the data section in the Prepro- The preprocessing function of the Two-Way algorithm
cessed approach yields performance similar to repeatedly only returns a number, so it does not have to generate a
preprocessing the relatively short pattern in the Naïve table as the other two algorithms. Consequently, the
Genapproach. By using the Generated approach, we can com- erated approach achieves a higher throughput compared
pletely avoid any indirections, resulting in the highest to the Naïve approach.
throughput for the KMP algorithm. When it comes to the Hybrid-Search algorithm, one</p>
      <p>In the blockwise algorithms, larger blocks of the input can see the benefit of using SSE instructions to search
text can be skipped if the first character of the pattern is
]
s
/
se50 M
l
p
u
[T 0
t
u
p
gh0.1 G
u
o
rh 0
T</p>
      <p>Preprocessed</p>
      <p>Q 20
KMP BM TW HS
(Sections (Section 3.2) (Section 3.3) (Section 3.4)
3.1 + 3.5)
12 4 6 8 10
#threads
Nave
15
KMP</p>
      <p>BM
TW
HS</p>
      <p>SSE
for a pattern in an input text. For each approach, this
algorithm dominates all the others. Additionally, it benefits
from avoiding repetitive calls to the matching function,
resulting in the throughput of the Generated version being
nearly 2.5× that of the Naïve approach.</p>
      <p>4.2.3. Multithreading</p>
      <p>The final interesting aspect of our microbenchmarks is
how throughput develops when running queries with multiple
threads. We expect the throughput to scale linearly with an
increasing number of threads, as Umbra uses morsel-driven
parallelism [28]. As shown in Figure 8, this expectation
aligns with the observed results. Once hyperthreading is
reached, the throughput still increases, but at a different
rate than before.</p>
      <p>Our different degrees of code generation provide the
most benefit to the KMP algorithm. In the Generated version,
we can nearly double the throughput compared to the Naïve
approach, while the Preprocessed version sits in the
middle.</p>
      <p>The Boyer-Moore algorithm also benefits from the
code-generation approaches. However, the Preprocessed and
Generated versions are much closer together, and under
hyperthreading, both versions approach each other.
Nevertheless, the main problem with this algorithm is the
higher number of branches required to derive the correct shift
from the BCH table; in the Preprocessed approach, this is just
a memory lookup. With more branches, more mispredictions
happen, causing a generic function to outperform specifically
generated code.</p>
      <p>Lastly, the Generated version of the Two-Way algorithm
is also slightly faster than its Naïve version. Due to the
lower complexity of the preprocessing phase, the difference
between both versions is small. Still, the Generated version
has higher throughput.</p>
      <p>Analyzing the Hybrid-Search algorithm is a bit more
involved, as it depends on both the pattern and the input text
which specific part of the algorithm is executed. In the
experiment, the pattern falls within the length limit and the
input texts are, on average, also large enough. Therefore, the
Hybrid-Search algorithm predominantly executes the search with
the SSE instruction and rarely falls back to the default
algorithm. This method of utilizing SSE instructions for
pattern matching shows a significant performance improvement,
even with the Naïve approach. By applying the Generated
approach and eliminating the overhead of the repeated function
calls, the throughput further increases. However, beyond eight
threads, the throughput levels out because it approaches the
memory limit of the machine.</p>
      <p>Table 1 presents the execution times of the different
approaches for Query 20 using 20 threads. One can observe
that, with the Generated approach, the SSE Search algorithm is
slightly slower than the Hybrid-Search algorithm. This can be
attributed to the SSE Search algorithm's need for specialized
handling of short patterns and input texts, resulting in more
complex code and slower execution.</p>
      <p>4.2.4. Compilation Overhead</p>
      <p>When dealing with code-generating database engines, it
is essential to consider the overhead of compilation. In
Table 1, we also show the compilation times for Query 20 using
the LLVM backend of Umbra. The data reveals that as we move
from the Naïve to the Generated approach, the compilation
times for the algorithms increase marginally. Nevertheless,
this increase is balanced out by the reduction in execution
time.</p>
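      <p>The trade-off between the Naïve and Preprocessed approaches above can be made concrete with a compact KMP sketch (our own illustration, not Umbra's generated code): the Naïve approach rebuilds the lps table for every tuple, while the Preprocessed approach computes it once, as if stored in the program's data section.</p>

```c
#include <stddef.h>
#include <string.h>

/* Build the KMP "longest proper prefix that is also a suffix" table.
 * The Preprocessed/Generated approaches run this once at code
 * generation time; the Naive approach re-runs it for every tuple. */
static void build_lps(const char *pat, size_t m, size_t *lps) {
    lps[0] = 0;
    size_t len = 0;
    for (size_t i = 1; i < m; i++) {
        while (len > 0 && pat[i] != pat[len]) len = lps[len - 1];
        if (pat[i] == pat[len]) len++;
        lps[i] = len;
    }
}

/* KMP substring search using a prebuilt lps table; returns the
 * offset of the first match, or -1 if the pattern does not occur. */
static ptrdiff_t kmp_find(const char *text, size_t n,
                          const char *pat, size_t m, const size_t *lps) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        while (j > 0 && text[i] != pat[j]) j = lps[j - 1];
        if (text[i] == pat[j]) j++;
        if (j == m) return (ptrdiff_t)(i + 1 - m);
    }
    return -1;
}
```

      <p>The Generated approach goes one step further than hoisting build_lps out of the per-tuple loop: the lps values become constants in the emitted code, so even the table lookup disappears.</p>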
      <p>Moreover, it is worth mentioning that Umbra could
employ the Flying Start technique or the FireARM backend. This
means that the compilation overhead could be concealed by
using a specific backend until the compiled LLVM function
becomes available [19, 29]. However, in our experiments, we
did not employ this option, as only 0.5% of the tuples could
have been processed during compilation when running the
Hybrid-Search algorithm fully multi-threaded.</p>
      <p>4.3. Long Pattern Microbenchmark</p>
      <p>As the final part of the evaluation, we look at the
effect of long patterns. We classify a pattern which exceeds
the length of a single vector register (16 bytes) as a long
pattern. For our experiments, we use three patterns: pattern A
with 31 characters, pattern B with 160 characters, and
pattern C, a combination of three long patterns totaling
291 characters.</p>
      <p>4.3.1. Size of start block</p>
      <p>Figure 9 presents the results of varying the number of
characters in the start block, which is employed to locate a
potential pattern start. The top plots visualize the
performance using only one thread, while the bottom ones show
performance with 20 threads.</p>
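      <p>The start-block idea can be sketched as follows (our illustrative reconstruction: the function name and the memcmp-based comparison are ours, while the generated code compares the start block in SSE registers): only the first k bytes of the pattern are scanned for, and the full pattern is verified only at the surviving candidate positions.</p>

```c
#include <stddef.h>
#include <string.h>

/* Locate a long pattern by scanning for its start block (the
 * first k bytes, k <= pattern length) and verifying the full
 * pattern only at candidate positions. memcmp stands in for the
 * SSE register comparison used by the generated code. */
static ptrdiff_t find_with_start_block(const char *text, size_t n,
                                       const char *pat, size_t m,
                                       size_t k) {
    if (m > n) return -1;  /* early return: pattern no longer fits */
    for (size_t i = 0; i + m <= n; i++) {
        if (memcmp(text + i, pat, k) != 0)
            continue;      /* start block absent: keep scanning */
        if (memcmp(text + i + k, pat + k, m - k) == 0)
            return (ptrdiff_t)i;  /* candidate confirmed */
    }
    return -1;
}
```

      <p>The size of k is exactly the knob varied in Figure 9: a larger start block filters out more false candidates but makes the localization phase itself more expensive.</p>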
      <p>When executing with a single thread, the algorithm
achieves peak performance when three bytes of the pattern are
employed in the localization phase. This size allows for
sufficient shifting of the pattern to the right while
minimizing false positives. When utilizing 20 threads, the
performance remains largely unaffected by the size of the
start block. The limiting factor in this scenario is the
available memory bandwidth, which operates at 68 GB/s and is
utilized to over 90%.</p>
      <p>Moreover, the combination of longer patterns and the
early-return implementation proves advantageous, leading to
increased throughput with larger pattern sizes.</p>
      <p>[Figure 9: throughput in tuples/s over start-block
sizes; top: 1 thread, bottom: 20 threads.]</p>
      <p>4.3.2. Algorithm overview</p>
      <p>Finally, the comparison of the different code-generating
algorithms using 20 threads on long patterns is illustrated in
Figure 10. For the SSE Search algorithms, we have chosen the
start size with the highest performance. Since the fallback
option of the Hybrid-Search algorithm for long patterns is the
Two-Way algorithm, both algorithms show similar
performance.</p>
      <p>For all patterns, the SSE Search algorithm, which
generates pattern-specific code for the matching process,</p>
      <p>Each matching algorithm combined with a code-generation
approach offers different performance benefits and use cases,
but also challenges.</p>
      <p>Knuth-Morris-Pratt is relatively straightforward to
implement in all three approaches. It can be enhanced by
adding the blockwise optimization with just a few
modifications. Our experiments demonstrated that the code</p>
    </sec>
    <sec id="sec-4">
      <title>5. Lessons Learned</title>
      <p>generating approach significantly improved the
performance of the KMP algorithm. The blockwise version of KMP
is particularly effective when the first character of the
pattern has a low frequency of occurrence in the input text,
as this allows efficient consumption of large portions of the
text. Applying the early-return optimization further enhances
the performance: it discards longer patterns faster once they
no longer fit in the remaining input text. Additionally, the
KMP algorithm iterates over the input text from left to right.
Thus, the algorithm can also process texts with Unicode
characters based on their codepoints rather than only their
bytewise representation.</p>
      <p>Boyer-Moore is also easy to realize in the Naïve and
Preprocessed approaches. The Generated approach requires more
bookkeeping during code generation for correct SSA form. The
experiments show that Generated is superior, while
Preprocessed is better in the case of hyperthreading. This
algorithm is more effective when the last pattern character
occurs less frequently than the first character. It also works
better with longer patterns due to early rejection once they
exceed the input text.</p>
      <p>Two-Way is complex to implement in both approaches. In
our experiments, Generated is slightly faster than Naïve, due
to its less costly preprocessing function. However, the
performance varies depending on the pattern factorization. The
pattern of the experiment was not optimally factorized,
leading to a performance similar to that of the KMP algorithm.
With a pattern better suited for factorization, performance
improves.</p>
      <p>Hybrid-Search's complexity is relatively low when using
the Naïve approach, and it depends solely on the chosen
default algorithm. Implementing the SSE search component is a
simple task and completely decoupled from the fallback
algorithm. However, integrating the Generated approach into
Umbra and its backends requires more effort. This is because
we needed to introduce a new internal instruction for the SSE
string comparison function, which then maps to the
corresponding function for each backend. Nevertheless, this
algorithm shows the most promise and consistently outperforms
the other algorithms in all three approaches. The Generated
approach is only limited by the memory speed of the machine.
Since the SSE part has a pattern-length restriction, this
algorithm is particularly suitable for short patterns. For
longer patterns, it is necessary to carefully investigate the
selected default algorithm.</p>
      <p>SSE Search introduces an innovative method for
generating specialized code for long patterns by leveraging
SSE vector instructions. Incorporating this algorithm into a
code-generating database engine is straightforward. The only
challenge is adding the necessary SSE instructions to the
backends. Furthermore, this approach facilitates the seamless
implementation of various shift heuristics and dynamic
adjustment of the size of the start block. The performance of
the algorithm surpasses that of alternative methods across all
three versions, consistently delivering superior results.
Moreover, the performance is mostly bound by the available
memory bandwidth.</p>
      <p>Ultimately, tuning the performance of the
pattern-matching algorithms in a code-generating database
system is still a trade-off. The Preprocessed approach is
sufficient to improve performance compared to the classic
Naïve approach while keeping the complexity of the generated
code low. However, generating pattern-specific code for the
matching process further improves overall query throughput at
the cost of increased code complexity. Utilizing specific SSE
instructions for pattern matching offers the dual advantage of
enhancing performance and reducing code complexity in certain
aspects of the matching algorithm. Based on our experimental
findings, it can be inferred that when the required SSE
instructions are not supported by the hardware, no single
matching algorithm exhibits consistently superior performance
across all patterns. However, if hardware support is
available, we can conclude that for short patterns the
Hybrid-Search algorithm is superior, while for long patterns,
the new SSE Search algorithm is more effective. Integrating
both the Hybrid-Search and the SSE Search algorithm into a
code-generating database engine can be achieved seamlessly by
designating the SSE Search algorithm as the default fallback.
Furthermore, employing algorithms that utilize SSE
instructions offers an additional advantage: the code required
in the database engine for these algorithms is relatively
small and straightforward to maintain, and the generated code
is clear and well-structured, facilitating easy verification
and debugging.</p>
      <p>6. Conclusion</p>
      <p>This paper demonstrates the effectiveness of code
generation for pattern-matching algorithms to evaluate LIKE
predicates. The performance increases by up to a factor of two
compared to function calls and outperforms state-of-the-art
systems like Postgres, DuckDB, Hyper, or ClickHouse on
text-heavy datasets like ClickBench. The results indicate that
replacing generic function calls with pattern-specific
generated code significantly increases the throughput. We have
also demonstrated that using SSE instructions to compare
packed strings has a positive impact on overall performance,
particularly when incorporating them into the generated code.
Additionally, we have presented a generalized algorithm which
uses multiple SSE instructions to perform efficient pattern
matching for long patterns as a favorable alternative to
classic algorithms.</p>
      <p>References</p>
      <p>[1] A. Vogelsgesang, M. Haubenschild, J. Finis,
A. Kemper, V. Leis, T. Mühlbauer, T. Neumann, M. Then, Get
real: How benchmarks fail to represent the real world, in:
DBTest@SIGMOD, ACM, 2018, pp. 1:1–1:6.</p>
      <p>[2] D. Sidler, Z. István, M. Owaida, G. Alonso,
Accelerating pattern matching queries in hybrid CPU-FPGA
architectures, in: SIGMOD Conference, ACM, 2017,
pp. 403–415.</p>
      <p>[3] X. Wang, Y. Hong, H. Chang, K. Park, G. Langdale,
J. Hu, H. Zhu, Hyperscan: A fast multi-pattern regex matcher
for modern cpus, in: NSDI, USENIX Association, 2019,
pp. 631–648.</p>
      <p>[4] J. Hildebrandt, D. Habich, P. Damme, W. Lehner,
Compression-aware in-memory query processing: Vision, system
design and beyond, in: ADMS/IMDM@VLDB, volume 10195 of Lecture
Notes in Computer Science, Springer, 2016, pp. 40–56.</p>
      <p>[5] P. A. Boncz, T. Neumann, V. Leis, FSST: fast random
access string compression, Proc. VLDB Endow. 13 (2020)
2649–2661.</p>
      <p>[6] T. Neumann, Efficiently compiling efficient query
plans for modern hardware, Proc. VLDB Endow. 4 (2011)
539–550.</p>
      <p>[7] D. E. Knuth, J. H. Morris Jr., V. R. Pratt, Fast
pattern matching in strings, SIAM J. Comput. 6 (1977)
323–350.</p>
      <p>[8] R. S. Boyer, J. S. Moore, A fast string searching
algorithm, Commun. ACM 20 (1977) 762–772.</p>
      <p>[9] W. Rytter, A correct preprocessing algorithm for
boyer-moore string-searching, SIAM J. Comput. 9 (1980)
509–512.</p>
      <p>[10] R. N. Horspool, Practical fast searching in
strings, Softw. Pract. Exp. 10 (1980) 501–506.</p>
      <p>[11] M. Crochemore, D. Perrin, Two-way string matching,
J. ACM 38 (1991) 651–675.</p>
      <p>[12] E. A. Sitaridi, O. Polychroniou, K. A. Ross,
Simd-accelerated regular expression matching, in: DaMoN, ACM,
2016, pp. 8:1–8:7.</p>
      <p>[13] T. Neumann, M. J. Freitag, Umbra: A disk-based
system with in-memory performance, in: CIDR, www.cidrdb.org,
2020.</p>
      <p>[14] A. Kemper, T. Neumann, Hyper: A hybrid
oltp&amp;olap main memory database system based on virtual
memory snapshots, in: ICDE, IEEE Computer Society, 2011,
pp. 195–206.</p>
      <p>[15] S. Agarwal, D. Liu, R. Xin, Apache Spark as a
Compiler: Joining a Billion Rows per Second on a Laptop, 2016.
URL: https://www.databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html.</p>
      <p>[16] C. Diaconu, C. Freedman, E. Ismert, P. Larson,
P. Mittal, R. Stonecipher, N. Verma, M. Zwilling, Hekaton: SQL
server's memory-optimized OLTP engine, in: SIGMOD Conference,
ACM, 2013, pp. 1243–1254.</p>
      <p>[17] T. Hoff, Code Generation: The Inner Sanctum Of
Database Performance, 2016. URL:
http://highscalability.com/blog/2016/9/7/code-generation-the-inner-sanctum-of-database-performance.html.</p>
      <p>[18] S. Wanderman-Milne, N. Li, Runtime code generation
in cloudera impala, IEEE Data Eng. Bull. 37 (2014) 31–37.</p>
      <p>[19] T. Kersten, V. Leis, T. Neumann, Tidy tuples and
flying start: fast compilation and fast execution of
relational queries in umbra, VLDB J. 30 (2021) 883–905.</p>
      <p>[20] A. Kohn, V. Leis, T. Neumann, Adaptive execution of
compiled queries, in: ICDE, IEEE Computer Society, 2018,
pp. 197–208.</p>
      <p>[21] K. Thompson, Regular expression search algorithm,
Commun. ACM 11 (1968) 419–422.</p>
      <p>[22] A. Kuchling, Regular Expression HOWTO,
https://docs.python.org/3/howto/regex.html, 2023.</p>
      <p>[23] P. Wankadia, Glossary,
https://github.com/google/re2/wiki/Glossary, 2021.</p>
      <p>[24] .NET: Compilation and Reuse in Regular Expressions,
https://learn.microsoft.com/en-us/dotnet/standard/base-types/compilation-and-reuse-in-regular-expressions, 2021.</p>
      <p>[25] M. Lothaire, Combinatorics on words, Second
Edition, Cambridge mathematical library, Cambridge University
Press, 1997.</p>
      <p>[26] R. Ramanathan, Extending the world's most popular
processor architecture, 2006. URL:
https://web.archive.org/web/20090823145906/http://download.intel.com/technology/architecture/new-instructions-paper.pdf.</p>
      <p>[27] L. Lamport, Multiple byte processing with full-word
instructions, Commun. ACM 18 (1975) 471–475.</p>
      <p>[28] V. Leis, P. A. Boncz, A. Kemper, T. Neumann,
Morsel-driven parallelism: a numa-aware query evaluation
framework for the many-core age, in: SIGMOD Conference, ACM,
2014, pp. 743–754.</p>
      <p>[29] F. Gruber, M. Bandle, A. Engelke, T. Neumann,
J. Giceva, Bringing compiling databases to RISC architectures,
Proc. VLDB Endow. 16 (2023) 1222–1234.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>