-

SHaMBa: Reducing Bloom Filter Overhead in LSM Trees

Zichen Zhu

0 0 supervised by Prof. Manos Athanassoulis Boston University , MA , USA

Bloom Filters (BFs) are typically employed to alleviate unnecessary disk accesses to facilitate point lookup in LSM trees. They are particularly beneficial when there is a significant performance diference between probing a Bloom filter (hashing and accessing memory) and accessing data (on secondary storage). However, this gap is decreasing as SSDs and NVMs have increasingly lower latency, to the point that the cost of accessing data can be comparable to that of hashing and filter probing , especially for large key sizes that results in high hashing cost. In addition, BFs are beneficial for empty queries while they are a burden for positive queries (i.e., on existing keys). Also, with larger datasets, the total consumed memory also increases, making it less feasible to keep all BFs in memory. Coupling this, with the increasing price of memory and the need to reduce the memory-to-data ratio in many practical deployments, we are seeing an increased memory pressure. In this setting, fewer BF blocks are cached, thus causing additional storage accesses, since they have to be fetched in memory to answer a query. In this PhD work, we introduce SHaMBa (Shared Hash Modular Bloom Filter), a new LSM-based key-value engine that addresses both (a) the increasing hashing overhead and (b) the sub-optimal performance when BFs do not fit in memory. First, SHaMBa decouples the hashing cost from the data size by sharing a single hash digest across diferent levels. Second, SHaMBa applies a workload-aware BF skipping policy based on Modular Bloom Filter (i.e., a set of mini-BFs that replace a single large BF) to avoid accessing BFs when they are not useful. Our evaluation shows that SHaMBa reduces the CPU cost for BF probing, and substantially outperforms the state of the art under memory pressure.

eol>LSM trees Bloom Filter compaction policy storage memory pressure

1. Introduction key. In addition, every file is built with an index block (also termed fence pointers) that maintains the min-max LSM-based Key-Value Stores. Log-Structured Merge- range and the ofset for each data block (or disk page), trees (LSM trees) [1] are widely adopted as one of the which ensures that at most one data block (or disk page) core data structures in modern NoSQL storage engines is retrieved when probing a file. including LevelDB [2], RocksDB [3], and WiredTiger [4]. Problem 1: The Benefit of BFs Shrinks For Faster This is because LSM trees ofer high write throughput by Storage. Contrary to common perception, BFs are not employing out-of-place ingestion. In LSM trees, incoming always beneficial. The rationale behind the ubiquitous entries (inserts, updates, and deletes) are bufered within use of BFs in LSM trees is that there is a considerable main memory. Once the write bufer becomes full, the cost diference between accessing a BF (in memory) and contained entries are sorted and flushed to disk as a sorted accessing data (on disk). As new storage devices like run. The disk-resident sorted runs are organized into a SSDs and non-volatile memories (NVMs) emerge, the number of levels of increasing sizes. In practice, a sorted latency gap between memory and storage narrows, and run may consist of one or more immutable Sorted-String thus the advantages of using BFs weaken. If the data is Tables (or SST files). To bound the number of files that already cached in main memory, BFs are even detrimental. a point lookup needs to probe, runs of similar sizes in Experiments show that MurmurHash64 calculation (used the same level are sort-merged and pushed to the next in production systems [3]) is ∼ 1.47× more expensive (deeper) level when the accumulated bytes of similarly than accessing a memory page, thereby, making the use sized runs reach a predefined capacity. To avoid unnec- of a BF detrimental. The LSM hashing overhead is further essary accesses for point lookups, LSM trees typically exacerbated as multiple BFs are queried per lookup (at construct a Bloom Filter (BF) for every file that probabilis- least one per level), and repeated hash calculations turn tically allows to skip a file if it does not contain the target querying over fast storage (or cached data) into a CPUintensive operation.

VLDB 2023 PhD Workshop, co-located with the 49th International Problem 2: Read Performance in LSM trees DeVCaonncfeoruevnecre,ConanVaedray Large Data Bases (VLDB 2023), August 28, 2023, grades With Limited Memory. While computing, $ zczhu@bu.edu (Z. Zhu) memory, and storage prices decrease and allow us to https://cs-people.bu.edu/zczhu/ (Z. Zhu) facilitate more data, in the last few years, the price drop 0000-0002-9197-4649 (Z. Zhu) in memory has been slower than what has been for com© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License puting and storage, making it hard to maintain the same CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org) memory-to-data ratio. For example, since 2010, the price of SSDs has decreased by a factor of sixty, whereas the price of memory has only decreased by a factor of ten [5].

As a result, BFs may not always be in memory when com- L3 peting for the resource with index blocks and data blocks, and when there is a cache miss for BFs, a significant number of I/Os may be spent on fetching them. LLL210 SMSemTBAuf SST B gSeStT(kC) k hhh SSSSTT CBBBBFFF DDDaaatttaaa k h BBBFFF DDDaaaSSStttSSSaaaTTT ACB

(a) Query path (b) Hashing in a query (c) Shared hashing in a query Figure 1: Hash sharing across BFs of diferent levels.

SHaMBa: Less Hashing on Modular Bloom Filters.

To address these challenges, we propose Shared Hashing and skipping-based Modular Bloom Filters - two techniques that reduce the BF overhead in LSM trees and we integrate them into our system, SHaMBa. Specifically, we (a) Uniform (b) Zipfian ifrst propose a shared hashing technique [6] that shares a single hash digest across all the levels in an LSM tree Figure 2: Hash sharing reduces hashing overhead. The reducto alleviate the unnecessary cost of re-hashing for every tion is more pronounced for larger keys and skew workload. level. Shared hashing decouples the aggregated hashing an optimization reduces the CPU cost by a factor of cost from data size. With SHaMBa, regardless of the num- when probing a single BF. Here, we apply a form of hash ber of LSM tree levels (which depends on the data size, sharing across multiple BFs from diferent LSM levels. the size ratio, and the memory bufer size), the hashing cost is constant. We then identify that not all the BFs are equally important, and based on this observation, we propose a skipping mechanism [7] based on the Modular Bloom Filter (MBF) design that allows us to load part of the filter to alleviate the memory pressure. Our evaluation shows that hash sharing can lead to 20% higher lookup performance when using a state-of-the-art PCIe SSD, and the skipping mechanism in MBF increases read throughput by more than 50% compared with the state of the art when memory is constrained to 10% of the total size of Bloom Filters.

Contributions. Our contributions are as follows: Hash Sharing Across Levels. The key observation is that for a specific query, the same hash digest calculation is repeated across levels when fence pointers cannot filter the query. The BFs are diferent across levels (they have indexed diferent elements), however, in order to probe them for a given key, the same hash digest is calculated for every level until the key is found or the tree is entirely searched. To mitigate this overhead, we share the hash digest calculation across levels by re-engineering the BF implementation and allowing the BFs residing in diferent levels to work collaboratively during the execution of a single query (Fig. 1). As a result, the hashing cost remains constant regardless of the number of levels. • We identify that BFs dominate LSM query latency for Evaluation. We build an in-house LSM tree prototype, fast storage and high hashing cost, and we decouple which uses RocksDB’s fast local Bloom Filter and Murthe amount of hashing from data size (height of LSM murHash64. We bulk load our LSM tree with 22GB of tree) by shared hashing across diferent levels. key-value pairs (entry size is fixed as 2KB), and report • We propose a skipping mechanism based on Modular the latency of empty point queries. The experiments are Bloom Filter (MBF) that reduces the memory footprint running with a state-of-the-art PCIe SSD that ofers 10 s without sacrificing performance. latency for 4KB page access. As shown in Figure 2a, the • We integrate Shared Hashing and Modular Bloom Fil- hashing cost increases for both approaches as the key ters in the state-of-the-art LSM-engine RocksDB, and size grows, however, shared hashing has a performance we show through extensive experiments that our pro- gain of up to 23% (blue line). The time breakdown shows posed techniques reduce the hashing cost, and out- where this benefit is coming from. The time spent in BFs perform the state of the art under memory pressure. (both hashing and probing) is drastically reduced for the hash sharing approach, while the cost for accessing data, as well as the other costs (e.g., binary search in fence pointers), virtually remains the same. In addition, larger 2. Shared Hashing key sizes have higher hashing cost, hence, hash sharing Classical BFs rely on independent hash functions, is more beneficial for them. Besides, when the query which results in high CPU overhead when probing a workload becomes skewed, we further observe the gain BF. Practical implementations [3] use a single hash di- steeply increases for 1KB keys to more than 60% in Figgest and generates − 1 indexes by bit rotation. This ure 2b. This is because the skewed workload has fewer optimization is based on the double hashing scheme [8], data accesses due to fewer false positives, and as a result, for which it is shown that it can achieve nearly the same hashing becomes a bottleneck for skewed point queries. accuracy obtained by independent hash functions. Such More experiments can be found in our full paper [6]. Note that since the modules are accessed sequentially, the decision to skip the -th module afects the remaining In this section, we discuss how we achieve a lower mem- modules. We also allow the algorithm to use a diferent ory footprint with our skipping mechanism [7] for MBFs, threshold per module slot for more flexibility. and we show that, under memory pressure, our skipping algorithm achieves higher read throughput compared to QueryMBF (key , ,) the state of the art. for = 1, ≤ number of modules, ++ do Modular Bloom Filters (MBFs). We first present Mod- ,, = , · (1 − ,) · (−1 − ) ular Bloom Filters (MBFs) that divide a normal Bloom if //ski=p=pitnrugem||odu,le,b y<rtehtruersnhionldgptohsietnive Filter into multiple modules. MBF is a generalized ver- return true; sion of ElasticBF [9] since MBF allows the size of each else module to be diferent. A Modular Bloom filter (MBF) //// tphriosbmeathyecmauosdeualne lIi/kOe iaf mit iinsinBoFt cached uses bits to index elements in each of modules. if QueryModule(, module,,) == false then Each module uses bits such that ∑︀=1 = . Es- return false; sentially, an MBF is a collection of Bloom filters, and end every membership test tries to sequentially go through end all the modules before it concludes with a positive result. end A negative response at any module terminates the query return true; without further probing the remaining modules.

k end Algorithm 1: MBF decides to skip a module based on its utility (along with threshold). 3. Skipping-Based Modular BFs 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 1 Updates and Deletes. In an update-heavy or delete(a) A standard Bloom filter. heavy setting, our skipping algorithm still works by ack tively maintaining ,, , for each SST file and inheriting these values when building new SST files during compactions. For example, if point queries mainly target 0 0 1m0od0ul0e #011 0 0 0 0 0m0od0ul1e #020 1 0 0 0 1m0od0ul0e #031 0 0 frequently updated/deleted keys, these queries mostly (b) A modular Bloom filter with three modules of equal size. terminate in shallower levels, leading to fewer accesses Figure 3: Modular Bloom filters split the physical representa- to the files in deeper levels that contain obsolete entries tion of a BF into multiple independent modules. ( , of the files in deeper levels decreases). Therefore, according to the utility definition in Eq. (1), ,, also deFigure 3 compares an MBF (using three modules) with a creases, which possibly leads to skipping modules when standard BF. By design, a point lookup can use all or any other queries access these files. On the other hand, if subset of the modules without re-indexing. Thus, MBF point queries concentrate on a few infrequently updatcan navigate the tradeof between accuracy and memory ed/deleted keys (in deeper levels), , of the files that footprint without re-calculating the filter. contain these keys may increase since other point queries Skipping Modules. To fully exploit MBFs, we quantify that target frequently updated/deleted keys terminate in the utility of each module and design a module skipping shallower levels, which again yields lower utility because mechanism. We define the utility as follows. most of the queries for these SST files are likely to be nonempty queries. As a result, for non-empty queries, BFs ,, = , · (1 − ,) · ︁( −1 − ︁) (1) can be skipped, and data blocks can be directly loaded, thus saving memory space and I/Os for BFs. where , ( ,) represents the point lookup frequency Evaluation. We integrate our skipping algorithm (the ratio of true positive point queries, respectively) of and MBF into RocksDB with two equal-size modules the ℎ ifle at level , ,, and is the false positive (with thresholds, respectively, threshold1 = 0.02 and rate when using the first modules. By definition, ,, threshold2 = 0.01) to showcase the benefit of our skipquantifies how many I/Os (on average) can be avoided ping mechanism. We stress-test our approach with diferusing the ℎ module. Based on the utility, we propose ent workloads: 1) vary the point query patterns to follow to skip probing modules if the utility is lower than a a uniform or Zipfian distribution; 2) vary the proportion certain threshold (threshold). Algorithm 1 shows how of existing point lookups ( ). We report the average lato query an MBF. The core idea is to skip a module if it tency per lookup, as shown in Figure 4. The experimental does not lead to a reduction of I/Os. In other words, if we results show that our skipping algorithm efectively reanticipate that querying a module leads anyway to an I/O, duces the lookup latency when there is memory pressure, we will skip the rest of the MBF as an I/O is inevitable. and the benefits persist for both empty ( = 0.0) and (t)sceaynLm213 (t)secaynLm213 (t)sceaynLm213 taMelgrBmoFr,iutwhnemdeasrrtelhitmaatritcgeaedntinmngaevmaigsoaerttye.otfIhfhewoeleinsttaiilrcleomwdeemseiagocnrhysm-ptuaocndeiunolgef 0 10 M4e0mory70bud1g0e0t (%) 150 0 10 M4e0mory70bud1g0e0t (%) 150 0 10 M4e0mory70bud1g0e0t (%) 150 to have a diferent BPK and allow each file to have a (a) = 0, uniform. (b) = 12 , uniform. (c) = 1, uniform. varying number of modules, we create a more expansive t)(secaynLm0110....5050 10 M4e0mory70RMMbouBBcdFFk1gs0Dwwe0t//Bo(S%Skki)pip150 t()sceaynLm0110....5050 10 M4e0mory70bud1g0e0t (%) 150 t()sceaynLm0110....5050 10 M4e0mory70bud1g0e0t (%) 150 fceaduhmlelrasotpinwhgtgyeniinrnpcgagooceintncahtetceilnhleoturomoautktamoeulddpmufsiolenecrmattMnoohreBhhyFaafso,vvilezealeonlaawdfrodigprniefeogBrrieFnBtnswtP,tKoqfilBeusPwfeoKwrariyiewttsshh:ietcmih(araono)firursBebttye (d) = 0, Zipfian. (e) = 21 , Zipfian. (f) = 1, Zipfian. modules while smaller BPK are assigned for the first Figure 4: Our skipping algorithm reduces the lookup latency modules of other files. In this way, most empty point tohfeRpoecrkcseDnBtaguendoferthmeetmotoarlyBFprseizsesu(froerwehxaemrepxle-a,x1i0s0r%epmreesmeonrtys lookups can be blocked by the first modules due to a budget is 20 MB when 224 keys are inserted). lower false positive rate. (b) The above design requires compaction to re-construct the MBF. If each file can have non-empty queries ( = 1.0). More experiments (e.g., three or more modules, we may have higher flexibility varying the number of modules, and allowing modules when deciding how many modules are required when to be diferent-sized) can be found in our full paper [7]. answering point queries, even before compaction occurs. However, more modules also indicate multiple BF probes for non-empty queries. Our goal is to create a workload4. Research Plan aware solution that leverages the above trade-of and navigates the design space of MBF to achieve minimum point query cost.

State-of-the-art LSM trees employ a static memory allocation paradigm across levels and files, which leads to all files having BFs with the same bits-per-key (BPK, the ratio between BF size in bits and the number of keys). A 5. Conclusion larger BPK indicates larger space to hash keys and thus a lower false positive ratio. Notably, to achieve minimal In this PhD work, we propose SHaMBa, a novel LSMread cost without changing the overall memory space based key-value engine that addresses two key challenges. of BFs, Monkey [10] proposes to allocate more BPK to First, the fact that as we move to faster storage devices, shallower levels. Our main goal moving forward is to use hashing for BFs in LSM trees becomes bottleneck, and, detailed access pattern information of the workload to second, the fact that the benefit of BFs diminishes under decide the exact BPK at the file and the module level. memory pressure. Our evaluation shows that SHaMBa [Short-Term] Dynamic BPK Allocation. We are work- can reduce the fraction of time spent on hashing during ing towards a dynamic BPK allocation strategy for all lookups, and it can also exploit the available memory to the BFs in LSM trees, which allows diferent files to have ofer better performance than the state of the art under diferent BPK (diferent false positive ratio). The deci- memory pressure. The long-term goal of this PhD work is sion of the BPK per file is implemented at compaction to introduce hardware/workload-aware BF management time. Unlike prior work that assumes a predefined static policy to facilitate point queries in LSM trees, and to workload [10], we will employ machine learning tech- study data systems under memory pressure. niques (e.g., kernel density estimation) to estimate read access statistics (empty and non-empty queries per level), References and this will allow us to identify the best BPK alloca- [1] P. E. O’Neil, et al., The log-structured merge-tree (LSM-tree), Acta Informatica tion strategy at compaction-time, thus generalizing prior 33 (1996) 351–385. approaches. Our earlier work [11] has shown that the [[23]] FGaocoegbloeo,kL,eRveolcDkBsD,hBt,tphstt:/p/sg:i/t/hguitbh.ucobm.co/gmo/ofagclee/bleovoekl/drbo/ck(2s0d2b1()2.021). average compaction latency is mostly afected by mov- [4] WiredTiger, Source Code, https://github.com/wiredtiger/wiredtiger (2021). ing data, with the creation of BFs at compaction time [5] Jh.ttCps.:/M/jccmCiatl.nluemt/,memH20is1t5o.rhitcmal(2C0o2s2t). of Computer Memory and Storage, being a low-overhead process. In other words, we can [6] SZt.oZrhague,eDteavl.i,ceRse,dDuAciMngOBNlo(o2m021F)il.ter CPU Overhead in LSM-Trees on Modern implement a better BPK re-allocation without any visible [7] J. H. Mun, et al., LSM-Tree Under (Memory) Pressure, ADMS (2022). increase in compaction latency. Notably, the dynamic [8] PM.eCth.DodilsliinngCero,mP.pMutaenro-Alioidse,dBlDoeosmigFnil(t2e0r0s4i)n. Probabilistic Verification, Formal BPK re-allocation strategy can also be potentially applied [9] Y. Zhang, et al., ElasticBF: Fine-grained and Elastic Bloom Filter Towards when other types of filters (e.g., Cuckoo Filter [ 12] and [10] ENfic.iDenatyaRne,aedtfaolr., LMSoMn-ktreeye:-ObapsteimdaKlVNaSvtoigraebs,leHKoetyS-tVoaralugeeS(t2o0r1e8,)S.IGMOD (2017). XOR Filter [13]) are employed, by simply replacing the [11] S. Sarkar, et al., Constructing and Analyzing the LSM Compaction Design false positive rate calculation in our cost function. [12] SBp.aFcaen,, PetVaLlD., BCu14ck(o20o2F1i)lt2e2r1:6P–r2a2ct2i9c.ally Better Than Bloom, CoNEXT (2014). [Long-Term] Holistic Memory Tuning. In the long [13] PC.oCR.RD2il1li0n3g.0e2r,5S1.5W(2a0l2ze1r),. Ribbon filter: practically smaller than Bloom and Xor,