<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SHaMBa: Reducing Bloom Filter Overhead in LSM Trees</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zichen Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>supervised by Prof. Manos Athanassoulis Boston University</institution>
          ,
          <addr-line>MA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Bloom Filters (BFs) are typically employed to alleviate unnecessary disk accesses to facilitate point lookup in LSM trees. They are particularly beneficial when there is a significant performance diference between probing a Bloom filter (hashing and accessing memory) and accessing data (on secondary storage). However, this gap is decreasing as SSDs and NVMs have increasingly lower latency, to the point that the cost of accessing data can be comparable to that of hashing and filter probing , especially for large key sizes that results in high hashing cost. In addition, BFs are beneficial for empty queries while they are a burden for positive queries (i.e., on existing keys). Also, with larger datasets, the total consumed memory also increases, making it less feasible to keep all BFs in memory. Coupling this, with the increasing price of memory and the need to reduce the memory-to-data ratio in many practical deployments, we are seeing an increased memory pressure. In this setting, fewer BF blocks are cached, thus causing additional storage accesses, since they have to be fetched in memory to answer a query. In this PhD work, we introduce SHaMBa (Shared Hash Modular Bloom Filter), a new LSM-based key-value engine that addresses both (a) the increasing hashing overhead and (b) the sub-optimal performance when BFs do not fit in memory. First, SHaMBa decouples the hashing cost from the data size by sharing a single hash digest across diferent levels. Second, SHaMBa applies a workload-aware BF skipping policy based on Modular Bloom Filter (i.e., a set of mini-BFs that replace a single large BF) to avoid accessing BFs when they are not useful. Our evaluation shows that SHaMBa reduces the CPU cost for BF probing, and substantially outperforms the state of the art under memory pressure.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;LSM trees</kwd>
        <kwd>Bloom Filter</kwd>
        <kwd>compaction policy</kwd>
        <kwd>storage</kwd>
        <kwd>memory pressure</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1. Introduction
key. In addition, every file is built with an index block
(also termed fence pointers) that maintains the min-max
LSM-based Key-Value Stores. Log-Structured Merge- range and the ofset for each data block (or disk page),
trees (LSM trees) [1] are widely adopted as one of the which ensures that at most one data block (or disk page)
core data structures in modern NoSQL storage engines is retrieved when probing a file.
including LevelDB [2], RocksDB [3], and WiredTiger [4]. Problem 1: The Benefit of BFs Shrinks For Faster
This is because LSM trees ofer high write throughput by Storage. Contrary to common perception, BFs are not
employing out-of-place ingestion. In LSM trees, incoming always beneficial. The rationale behind the ubiquitous
entries (inserts, updates, and deletes) are bufered within use of BFs in LSM trees is that there is a considerable
main memory. Once the write bufer becomes full, the cost diference between accessing a BF (in memory) and
contained entries are sorted and flushed to disk as a sorted accessing data (on disk). As new storage devices like
run. The disk-resident sorted runs are organized into a SSDs and non-volatile memories (NVMs) emerge, the
number of levels of increasing sizes. In practice, a sorted latency gap between memory and storage narrows, and
run may consist of one or more immutable Sorted-String thus the advantages of using BFs weaken. If the data is
Tables (or SST files). To bound the number of files that already cached in main memory, BFs are even detrimental.
a point lookup needs to probe, runs of similar sizes in Experiments show that MurmurHash64 calculation (used
the same level are sort-merged and pushed to the next in production systems [3]) is ∼ 1.47× more expensive
(deeper) level when the accumulated bytes of similarly than accessing a memory page, thereby, making the use
sized runs reach a predefined capacity. To avoid unnec- of a BF detrimental. The LSM hashing overhead is further
essary accesses for point lookups, LSM trees typically exacerbated as multiple BFs are queried per lookup (at
construct a Bloom Filter (BF) for every file that probabilis- least one per level), and repeated hash calculations turn
tically allows to skip a file if it does not contain the target querying over fast storage (or cached data) into a
CPUintensive operation.</p>
      <p>VLDB 2023 PhD Workshop, co-located with the 49th International Problem 2: Read Performance in LSM trees
DeVCaonncfeoruevnecre,ConanVaedray Large Data Bases (VLDB 2023), August 28, 2023, grades With Limited Memory. While computing,
$ zczhu@bu.edu (Z. Zhu) memory, and storage prices decrease and allow us to
 https://cs-people.bu.edu/zczhu/ (Z. Zhu) facilitate more data, in the last few years, the price drop
0000-0002-9197-4649 (Z. Zhu) in memory has been slower than what has been for
com© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License puting and storage, making it hard to maintain the same
CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org)
memory-to-data ratio. For example, since 2010, the price
of SSDs has decreased by a factor of sixty, whereas the
price of memory has only decreased by a factor of ten [5].</p>
      <p>As a result, BFs may not always be in memory when com- L3
peting for the resource with index blocks and data blocks,
and when there is a cache miss for BFs, a significant
number of I/Os may be spent on fetching them.
LLL210 SMSemTBAuf SST B gSeStT(kC) k hhh SSSSTT CBBBBFFF DDDaaatttaaa k h BBBFFF DDDaaaSSStttSSSaaaTTT ACB</p>
      <p>(a) Query path (b) Hashing in a query (c) Shared hashing in a query
Figure 1: Hash sharing across BFs of diferent levels.</p>
      <p>SHaMBa: Less Hashing on Modular Bloom Filters.</p>
      <p>To address these challenges, we propose Shared
Hashing and skipping-based Modular Bloom Filters - two
techniques that reduce the BF overhead in LSM trees and we
integrate them into our system, SHaMBa. Specifically, we (a) Uniform (b) Zipfian
ifrst propose a shared hashing technique [6] that shares
a single hash digest across all the levels in an LSM tree Figure 2: Hash sharing reduces hashing overhead. The
reducto alleviate the unnecessary cost of re-hashing for every tion is more pronounced for larger keys and skew workload.
level. Shared hashing decouples the aggregated hashing an optimization reduces the CPU cost by a factor of 
cost from data size. With SHaMBa, regardless of the num- when probing a single BF. Here, we apply a form of hash
ber of LSM tree levels (which depends on the data size, sharing across multiple BFs from diferent LSM levels.
the size ratio, and the memory bufer size), the hashing
cost is constant. We then identify that not all the BFs
are equally important, and based on this observation, we
propose a skipping mechanism [7] based on the Modular
Bloom Filter (MBF) design that allows us to load part of
the filter to alleviate the memory pressure. Our
evaluation shows that hash sharing can lead to 20% higher
lookup performance when using a state-of-the-art PCIe
SSD, and the skipping mechanism in MBF increases read
throughput by more than 50% compared with the state of
the art when memory is constrained to 10% of the total
size of Bloom Filters.</p>
      <p>Contributions. Our contributions are as follows:
Hash Sharing Across Levels. The key observation is
that for a specific query, the same hash digest calculation
is repeated across levels when fence pointers cannot filter
the query. The BFs are diferent across levels (they have
indexed diferent elements), however, in order to probe
them for a given key, the same hash digest is calculated
for every level until the key is found or the tree is entirely
searched. To mitigate this overhead, we share the hash
digest calculation across levels by re-engineering the BF
implementation and allowing the BFs residing in diferent
levels to work collaboratively during the execution of a
single query (Fig. 1). As a result, the hashing cost remains
constant regardless of the number of levels.
• We identify that BFs dominate LSM query latency for Evaluation. We build an in-house LSM tree prototype,
fast storage and high hashing cost, and we decouple which uses RocksDB’s fast local Bloom Filter and
Murthe amount of hashing from data size (height of LSM murHash64. We bulk load our LSM tree with 22GB of
tree) by shared hashing across diferent levels. key-value pairs (entry size is fixed as 2KB), and report
• We propose a skipping mechanism based on Modular the latency of empty point queries. The experiments are
Bloom Filter (MBF) that reduces the memory footprint running with a state-of-the-art PCIe SSD that ofers 10  s
without sacrificing performance. latency for 4KB page access. As shown in Figure 2a, the
• We integrate Shared Hashing and Modular Bloom Fil- hashing cost increases for both approaches as the key
ters in the state-of-the-art LSM-engine RocksDB, and size grows, however, shared hashing has a performance
we show through extensive experiments that our pro- gain of up to 23% (blue line). The time breakdown shows
posed techniques reduce the hashing cost, and out- where this benefit is coming from. The time spent in BFs
perform the state of the art under memory pressure. (both hashing and probing) is drastically reduced for the
hash sharing approach, while the cost for accessing data,
as well as the other costs (e.g., binary search in fence
pointers), virtually remains the same. In addition, larger
2. Shared Hashing key sizes have higher hashing cost, hence, hash sharing
Classical BFs rely on  independent hash functions, is more beneficial for them. Besides, when the query
which results in high CPU overhead when probing a workload becomes skewed, we further observe the gain
BF. Practical implementations [3] use a single hash di- steeply increases for 1KB keys to more than 60% in
Figgest and generates  − 1 indexes by bit rotation. This ure 2b. This is because the skewed workload has fewer
optimization is based on the double hashing scheme [8], data accesses due to fewer false positives, and as a result,
for which it is shown that it can achieve nearly the same hashing becomes a bottleneck for skewed point queries.
accuracy obtained by independent  hash functions. Such More experiments can be found in our full paper [6].
Note that since the modules are accessed sequentially,
the decision to skip the -th module afects the remaining
In this section, we discuss how we achieve a lower mem- modules. We also allow the algorithm to use a diferent
ory footprint with our skipping mechanism [7] for MBFs, threshold per module slot for more flexibility.
and we show that, under memory pressure, our skipping
algorithm achieves higher read throughput compared to QueryMBF (key , ,)
the state of the art. for  = 1,  ≤ number of modules, ++ do
Modular Bloom Filters (MBFs). We first present Mod- ,, =  , · (1 −  ,) · (−1 − )
ular Bloom Filters (MBFs) that divide a normal Bloom if //ski=p=pitnrugem||odu,le,b y&lt;rtehtruersnhionldgptohsietnive
Filter into multiple modules. MBF is a generalized ver- return true;
sion of ElasticBF [9] since MBF allows the size of each else
module to be diferent. A Modular Bloom filter (MBF) //// tphriosbmeathyecmauosdeualne lIi/kOe iaf mit iinsinBoFt cached
uses  bits to index  elements in each of  modules. if QueryModule(, module,,) == false then
Each module uses  bits such that ∑︀=1  = . Es- return false;
sentially, an MBF is a collection of  Bloom filters, and end
every membership test tries to sequentially go through end
all the modules before it concludes with a positive result. end
A negative response at any module terminates the query return true;
without further probing the remaining modules.</p>
      <p>k
end
Algorithm 1: MBF decides to skip a module based
on its utility (along with threshold).
3. Skipping-Based Modular BFs
0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 1 Updates and Deletes. In an update-heavy or
delete(a) A standard Bloom filter. heavy setting, our skipping algorithm still works by
ack tively maintaining  ,,  , for each SST file and
inheriting these values when building new SST files during
compactions. For example, if point queries mainly target
0 0 1m0od0ul0e #011 0 0 0 0 0m0od0ul1e #020 1 0 0 0 1m0od0ul0e #031 0 0 frequently updated/deleted keys, these queries mostly
(b) A modular Bloom filter with three modules of equal size. terminate in shallower levels, leading to fewer accesses
Figure 3: Modular Bloom filters split the physical representa- to the files in deeper levels that contain obsolete entries
tion of a BF into multiple independent modules. ( , of the files in deeper levels decreases). Therefore,
according to the utility definition in Eq. (1), ,, also
deFigure 3 compares an MBF (using three modules) with a creases, which possibly leads to skipping modules when
standard BF. By design, a point lookup can use all or any other queries access these files. On the other hand, if
subset of the  modules without re-indexing. Thus, MBF point queries concentrate on a few infrequently
updatcan navigate the tradeof between accuracy and memory ed/deleted keys (in deeper levels),  , of the files that
footprint without re-calculating the filter. contain these keys may increase since other point queries
Skipping Modules. To fully exploit MBFs, we quantify that target frequently updated/deleted keys terminate in
the utility of each module and design a module skipping shallower levels, which again yields lower utility because
mechanism. We define the utility as follows. most of the queries for these SST files are likely to be
nonempty queries. As a result, for non-empty queries, BFs
,, =  , · (1 −  ,) · ︁( −1 − ︁) (1) can be skipped, and data blocks can be directly loaded,
thus saving memory space and I/Os for BFs.
where  , ( ,) represents the point lookup frequency Evaluation. We integrate our skipping algorithm
(the ratio of true positive point queries, respectively) of and MBF into RocksDB with two equal-size modules
the ℎ ifle at level , ,, and  is the false positive (with thresholds, respectively, threshold1 = 0.02 and
rate when using the first  modules. By definition, ,, threshold2 = 0.01) to showcase the benefit of our
skipquantifies how many I/Os (on average) can be avoided ping mechanism. We stress-test our approach with
diferusing the ℎ module. Based on the utility, we propose ent workloads: 1) vary the point query patterns to follow
to skip probing modules if the utility is lower than a a uniform or Zipfian distribution; 2) vary the proportion
certain threshold (threshold). Algorithm 1 shows how of existing point lookups ( ). We report the average
lato query an MBF. The core idea is to skip a module if it tency per lookup, as shown in Figure 4. The experimental
does not lead to a reduction of I/Os. In other words, if we results show that our skipping algorithm efectively
reanticipate that querying a module leads anyway to an I/O, duces the lookup latency when there is memory pressure,
we will skip the rest of the MBF as an I/O is inevitable. and the benefits persist for both empty ( = 0.0) and
(t)sceaynLm213 (t)secaynLm213 (t)sceaynLm213 taMelgrBmoFr,iutwhnemdeasrrtelhitmaatritcgeaedntinmngaevmaigsoaerttye.otfIhfhewoeleinsttaiilrcleomwdeemseiagocnrhysm-ptuaocndeiunolgef
0 10 M4e0mory70bud1g0e0t (%) 150 0 10 M4e0mory70bud1g0e0t (%) 150 0 10 M4e0mory70bud1g0e0t (%) 150 to have a diferent BPK and allow each file to have a
(a)  = 0, uniform. (b)  = 12 , uniform. (c)  = 1, uniform. varying number of modules, we create a more expansive
t)(secaynLm0110....5050 10 M4e0mory70RMMbouBBcdFFk1gs0Dwwe0t//Bo(S%Skki)pip150 t()sceaynLm0110....5050 10 M4e0mory70bud1g0e0t (%) 150 t()sceaynLm0110....5050 10 M4e0mory70bud1g0e0t (%) 150 fceaduhmlelrasotpinwhgtgyeniinrnpcgagooceintncahtetceilnhleoturomoautktamoeulddpmufsiolenecrmattMnoohreBhhyFaafso,vvilezealeonlaawdfrodigprniefeogBrrieFnBtnswtP,tKoqfilBeusPwfeoKwrariyiewttsshh:ietcmih(araono)firursBebttye
(d)  = 0, Zipfian. (e)  = 21 , Zipfian. (f)  = 1, Zipfian. modules while smaller BPK are assigned for the first
Figure 4: Our skipping algorithm reduces the lookup latency modules of other files. In this way, most empty point
tohfeRpoecrkcseDnBtaguendoferthmeetmotoarlyBFprseizsesu(froerwehxaemrepxle-a,x1i0s0r%epmreesmeonrtys lookups can be blocked by the first modules due to a
budget is 20 MB when 224 keys are inserted). lower false positive rate. (b) The above design requires
compaction to re-construct the MBF. If each file can have
non-empty queries ( = 1.0). More experiments (e.g., three or more modules, we may have higher flexibility
varying the number of modules, and allowing modules when deciding how many modules are required when
to be diferent-sized) can be found in our full paper [7]. answering point queries, even before compaction occurs.
However, more modules also indicate multiple BF probes
for non-empty queries. Our goal is to create a
workload4. Research Plan aware solution that leverages the above trade-of and
navigates the design space of MBF to achieve minimum
point query cost.</p>
      <p>State-of-the-art LSM trees employ a static memory
allocation paradigm across levels and files, which leads to
all files having BFs with the same bits-per-key (BPK, the
ratio between BF size in bits and the number of keys). A 5. Conclusion
larger BPK indicates larger space to hash keys and thus
a lower false positive ratio. Notably, to achieve minimal In this PhD work, we propose SHaMBa, a novel
LSMread cost without changing the overall memory space based key-value engine that addresses two key challenges.
of BFs, Monkey [10] proposes to allocate more BPK to First, the fact that as we move to faster storage devices,
shallower levels. Our main goal moving forward is to use hashing for BFs in LSM trees becomes bottleneck, and,
detailed access pattern information of the workload to second, the fact that the benefit of BFs diminishes under
decide the exact BPK at the file and the module level. memory pressure. Our evaluation shows that SHaMBa
[Short-Term] Dynamic BPK Allocation. We are work- can reduce the fraction of time spent on hashing during
ing towards a dynamic BPK allocation strategy for all lookups, and it can also exploit the available memory to
the BFs in LSM trees, which allows diferent files to have ofer better performance than the state of the art under
diferent BPK (diferent false positive ratio). The deci- memory pressure. The long-term goal of this PhD work is
sion of the BPK per file is implemented at compaction to introduce hardware/workload-aware BF management
time. Unlike prior work that assumes a predefined static policy to facilitate point queries in LSM trees, and to
workload [10], we will employ machine learning tech- study data systems under memory pressure.
niques (e.g., kernel density estimation) to estimate read
access statistics (empty and non-empty queries per level), References
and this will allow us to identify the best BPK alloca- [1] P. E. O’Neil, et al., The log-structured merge-tree (LSM-tree), Acta Informatica
tion strategy at compaction-time, thus generalizing prior 33 (1996) 351–385.
approaches. Our earlier work [11] has shown that the [[23]] FGaocoegbloeo,kL,eRveolcDkBsD,hBt,tphstt:/p/sg:i/t/hguitbh.ucobm.co/gmo/ofagclee/bleovoekl/drbo/ck(2s0d2b1()2.021).
average compaction latency is mostly afected by mov- [4] WiredTiger, Source Code, https://github.com/wiredtiger/wiredtiger (2021).
ing data, with the creation of BFs at compaction time [5] Jh.ttCps.:/M/jccmCiatl.nluemt/,memH20is1t5o.rhitcmal(2C0o2s2t). of Computer Memory and Storage,
being a low-overhead process. In other words, we can [6] SZt.oZrhague,eDteavl.i,ceRse,dDuAciMngOBNlo(o2m021F)il.ter CPU Overhead in LSM-Trees on Modern
implement a better BPK re-allocation without any visible [7] J. H. Mun, et al., LSM-Tree Under (Memory) Pressure, ADMS (2022).
increase in compaction latency. Notably, the dynamic [8] PM.eCth.DodilsliinngCero,mP.pMutaenro-Alioidse,dBlDoeosmigFnil(t2e0r0s4i)n. Probabilistic Verification, Formal
BPK re-allocation strategy can also be potentially applied [9] Y. Zhang, et al., ElasticBF: Fine-grained and Elastic Bloom Filter Towards
when other types of filters (e.g., Cuckoo Filter [ 12] and [10] ENfic.iDenatyaRne,aedtfaolr., LMSoMn-ktreeye:-ObapsteimdaKlVNaSvtoigraebs,leHKoetyS-tVoaralugeeS(t2o0r1e8,)S.IGMOD (2017).
XOR Filter [13]) are employed, by simply replacing the [11] S. Sarkar, et al., Constructing and Analyzing the LSM Compaction Design
false positive rate calculation in our cost function. [12] SBp.aFcaen,, PetVaLlD., BCu14ck(o20o2F1i)lt2e2r1:6P–r2a2ct2i9c.ally Better Than Bloom, CoNEXT (2014).
[Long-Term] Holistic Memory Tuning. In the long [13] PC.oCR.RD2il1li0n3g.0e2r,5S1.5W(2a0l2ze1r),. Ribbon filter: practically smaller than Bloom and Xor,</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>