=Paper= {{Paper |id=Vol-1882/paper02 |storemode=property |title=Spatio-Temporal Locality in Hash Tables |pdfUrl=https://ceur-ws.org/Vol-1882/paper02.pdf |volume=Vol-1882 |authors=Matthew A. Pugh |dblpUrl=https://dblp.org/rec/conf/vldb/Pugh17 }} ==Spatio-Temporal Locality in Hash Tables== https://ceur-ws.org/Vol-1882/paper02.pdf

Spatio-Temporal Locality in Hash Tables

Matt A. Pugh
Supervised by Prof. Stratis Viglas
University of Edinburgh
10 Crichton St.
Scotland
matt.pugh@ed.ac.uk

ABSTRACT load factor) of a HT is 0 ≥ L ≥ 11 . The benefit of a HT is
The overall theme of this Ph.D. is looking at ways to use that the associated value’s position i in memory is calculated
emerging NVM (Non-Volatile Memory) technologies in real- directly using a hash function h(k), i = h(k) mod M .
world data-science scenarios. It is hoped that the exploita-
tion of the characteristics of the technology will result in
1.1 Self-Reorganising Data-Structures
performance improvements, defined as being either/or an A ST (Splay Tree) is a self-adjusting binary search tree
increase in computational throughput and energy-use reduc- introduced by [6] that has the property that the most fre-
tion. Primarily, this has been through the inclusion of tem- quently accessed node is near the head of the tree. STs have
poral locality into HopH (Hopscotch Hashing) by [2]. The the amortised time-complexity for Get() and Insert() op-
problem of highly-skewed access patterns affecting lookup erations of O(log n), and the goal of their invention was to
time and required computation is shown through a simple minimise average time of lookups in worst-case scenario in
model. A simulator is then used to measure the expected real-time terms. Their solution to achieving this is to min-
performance gains of incorporating temporal locality, given imise the number of node traversals by ensuring the most-
different HT (Hash Table) configurations. This work was frequently-accessed nodes are as close to the head as pos-
originally motivated by NVM, as a way to mask the extra sible, a property which is closely tied to temporal locality,
latency anticipated, but the work is applicable to HTs in although not exactly the same thing. [6] acknowledge that
DRAM (Dynamic Random-Access Memory) also. The sec- the computational cost of ongoing reordering is high, and
ond area of interest in the Ph.D. is looking at exploiting may be problematic given the target application workload.
the characteristics of NVM for different families of machine This is clear when considering the rotation of sub-trees in
learning algorithms, though this paper focuses solely on the a Splay() operation – there is no spatial locality, or cache-
former. friendliness, guaranteed by the structure of the tree in mem-
ory, as such pointer-chasing in rotations is inevitable.
In databases, [3] describe cracking; a method of self or-
1. INTRODUCTION ganisation data by query workload. This process is tuned
for a RDBMS in which a column of interest ACRK is re-
At this stage, the primary focus has been at introducing
tained as a copy, and is continuously reorganised based on
spatial and temporal locality into HopH – taking the concept
the queries touching it. This is analogous to data hotness
of minimising amortised average lookup time into HopH will
as we wish to view it, and can be considered a cache-tuning
provide spatio-temporal locality with an amortised complex-
problem, similar to a LRU (Least Recently Used) cache. As
ity of O(1) for the typical HT operations; Get(), Insert(),
ACRK is a column, they are able to effectively partition the
and Delete(). Details on the approaches undertaken are
data itself to improve probing response times significantly.
provided in Section 3. This is motivated by skewed access
Partitioning within a HopH context is less intuitive as there
patterns, following Zipf’s law, and high-performance sys-
are, at any position, H overlapping neighbourhoods.
tems, such as RDBMS (Relational DataBase Management
System) join operators.
In its simplest form, a Hash Table (HT) T is an M -bucket 2. MODEL AND SIMULATOR
long associative array-like data structure that maps a key k The structure of a HT is entirely dependent on the order in
of any hashable type to value v. A given bucket Bi in T which keys were inserted into it. For example, the hottest
can have S slots for key/value tuples. The occupancy (or element of the neighbourhood, or even entire table, over
some given time period, may be in the last available bucket
in a neighbourhood, leading to multiple evaluations before
finding the tuple required.

2.1 Model
Assume a constructed Hash Table T , of which we have a
Proceedings of the VLDB 2017 PhD Workshop, August 28, 2017. set of all keys K in T . Let X be the sequence of accesses,
Munich, Germany. 1
Copyright (C) 2017 for this paper by its authors. Copying permitted for Assuming no chained-structures resulting in L > 1 load
private and academic purposes. factor
99.22% of cycles would be saved versus T with X. This ap-
proach has the problem that manually attempting to cover
#reads=
T[0] - T[H-1]
a number of configurations of T to find the expected benefit
of rearrangement would be tedious. To deal with this, we
k1 k2 k3 k4
...
employ a simulator.
i= 0 1 ... H-2 H-1

2.2 Simulator

Redistribute()
The developed simulator provides the estimated perfor-
#reads=
mance benefit of performing reordering for varying table
T[0] - T[H-1] sizes, load factors and skews. This is done using a num-
ber of discrete probability distributions to obtain different
k4 k3 k1 k2
...
configurations of T , in order to explore the problem space
i= 0 1 ... H-2 H-1
we are interested in:

C C C δC C
1. The number of slots that each bucket will have popu-
Figure 1: This figure illustrates buckets (coloured blocks) lated in the simulator’s construction is sampled from
in neighbourhoods (colours) over some contiguous range of a Multinomial distribution, for bucket Bi this is ςi .
buckets within the table T . For a given neighbourhood, ac-
cess patterns may naturally fall as the above distribution,
2. N = bS×M ×Lc samples are drawn from a Zipfian dis-
where C is the cache-line size and each blue block represents
tribution Z(α). These are the random hotness values
a constant number of accesses to the associated key ki . Sup-
that are then inserted into each bucket Bi ∈ T over ςi
pose a request is made for the hottest item k4 , there will be
slots in each bucket. This approach gives a load factor
3 + δ cache-invalidations before hitting the hottest data k4
L and skew α over the data. T is then randomly per-
if T [h(k4 )] is not already cached. Reordering the data such
muted to distribute the remaining (for L < 1) empty
that k4 is in position 0, minimising cache-invalidations, is the
buckets throughout the table.
optimal distribution. Note that the distribution of colours
in T does not change.
3. As the number of buckets occupied within a neigh-
bourhood νi is not a constant, either bξc or dξe buck-
such that every element x ∈ X exists in K. We define a ets per neighbourhood are allocated with probability
simple cost model Φ(T, X) that gives a unit cost of a bucket πβ = dξe − bξc, where ξ is the expected number of el-
traversal as β, and a slot traversal and key comparison as ements in a neighbourhood, given in [5]. πβ must be
σ unit cost, where α and γ are the number of respective within the [0, 1] interval, and is interpreted as the prob-
operations. ability that the neighbourhood contains the upper-
bound of entries. At each neighbourhood root, neigh-
bourhood occupancy is sampled from a Binomial dis-
Φ(T, X) = αβ + γσ (1) tribution with probability πβ .
It is simple to use this model with a sufficient access-skew
in X to show good potential gains in the worst case of the 4. Finding the locations within the neighbourhood for oc-
configuration of T . Consider a RO (Read-Only) T where cupancy should not be done linearly, and must be at
some key k has hashed into bucket Bi , but the only available least partly stochastic to emulate the chronological in-
slot in the otherwise fully-occupied neighbourhood νi is at sertions over multiple neighbourhoods. Approaching
Bi+H−1 , slot S. Assume that X is completely skewed, such this in a linear manner would construct T such that
that there is only one k present, repeated A times. In this all insertions happened to populate the table in exactly
case, the cost of accesses is: linear order; this behaviour is not realistic. Instead,
we randomly draw samples from a Poisson distribu-
Φ(T, X) = AH(β) + ASH(σ) tion that has a mean λ = 2, in order that the mass of
the distribution is towards the beginning of the neigh-
In this example, it is clear to see that minimising the bourhood, but may be further on.
values of α = AH and/or γ = ASH are the only areas of
movement upon which we can optimise, if we assign some
real-world values, where S = 3, H = 64, σ = 15, β = 15, A =
2.2.1 Output & Discussion
1 × 109 , we obtain a cost of: Table 1 shows the output obtained thus far from the sim-
ulator, this shows that even on a coarse reordering policy,
we begin to approx a factor of 2 improvement in terms of
Φ(T, X) ' 3.84 × 1012 work performed. Performing fine reordering over the same
When we assume that X 0 the hottest element k0 in the configuration invariably leads to better results than coarse.
first slot of the first bucket in a neighbourhood, the cost is A key caveat is that this metric concerns itself only with
naturally far lower: the hottest element in the table, extensions are underway to
take a more wholistic view of potential performance gains.
The fact that as L → 1, gains appear to disappear is entirely
Φ(T, X 0 ) = A(β + σ) ' 3.0 × 1010
expected. This is due to the fact that as a table becomes
This is the lowest possible cost incurred for T and X 0 , and full, it is to be expected that most neighbourhoods will only
is therefore referred to as the oracle. With this reordering, have one bucket in T , therefore reordering is not possible.
Table 1: Simulator results showing the oracle measurements 3.1.1 Fine - Intra-Bucket (FIntrB)
for the simulator, in terms of percentage of instructions
This method simply sorts the tuples within a bucket Bi
avoided.
(that must be of the same neighbourhood) by hotness. This
Size M Load L Skew α Coarse (%) Fine (%) is performed using a priority queue, inserting the tuples
1.00E+03 0.7 1.1 30.59 92.1 and ordering by their number of accesses, before reinsert-
1.00E+03 0.7 2.1 31.49 99.53 ing them into Bi . As we know that |Bi | ≤ C, there are no
1.00E+03 0.7 3.1 1.13 3.53 further bucket traversals required, and any potential gains
1.00E+03 0.8 1.1 27.86 83.61 are purely in terms of operations performed within Bi for a
1.00E+03 0.8 2.1 44.04 96.3 Contains() or Get() and, as such, will be minimal.
1.00E+03 0.8 3.1 0.58 2.08
1.00E+03 0.9 1.1 47.5 95.7 3.1.2 Coarse Inter-Bucket (CIB)
1.00E+03 0.9 2.1 48.43 99.04 We can express the overall hotness of a bucket Bi by
1.00E+03 0.9 3.1 0.83 1.71 the summation of accesses to all tuples contained within
1.00E+03 1 1.1 0.99 2.64 it. With this method, we do not care about the order of the
1.00E+03 1 2.1 9.36 14.9 tuples within slots, but simply that the most-frequently-hit
1.00E+03 1 3.1 0.59 1.1 bucket is closest to the neighbourhood root. A clear com-
1.00E+05 0.7 1.1 29.86 82.3 promise of this strategy is that it does not care about the
1.00E+05 0.7 2.1 35.39 98.92 distribution of accesses within Bi ; in the example where
1.00E+05 0.7 3.1 35.74 95.81 S = 3 and Bi has one element with many accesses, but
1.00E+05 0.8 1.1 32.82 93.67 two without, and Bj has a uniform distribution whose total
1.00E+05 0.8 2.1 33.68 99.64 (summed) access is more than Bi , Bj will be promoted first.
1.00E+05 0.8 3.1 32.64 95.69
1.00E+05 0.9 1.1 32.97 99.33 3.1.3 Fine Inter-Bucket (FIB)
1.00E+05 0.9 2.1 35.43 97.79 By far the most expensive operation, this seeks to redis-
1.00E+05 0.9 3.1 31.42 97.17 tribute the neighbourhood at a tuple-level using a priority
1.00E+05 1 1.1 0.01 0.02 queue to order tuples by hotness, before reinserting them
1.00E+05 1 2.1 0.65 1.96 into all buckets within the neighbourhood. The positions of
1.00E+05 1 3.1 0.01 0.01 buckets within the neighbourhood do not change. In terms
1.00E+07 0.7 1.1 34.16 97.81 of memory use, this strategy is the most demanding. Poten-
1.00E+07 0.7 2.1 34.14 99.2 tially, if every possible bucket in a neighbourhood νi belongs
1.00E+07 0.7 3.1 33.8 97.37 to that neighbourhood, there will be many elements to copy
1.00E+07 0.8 1.1 33.03 98.04 and reinsert over H buckets. Although the memory traversal
1.00E+07 0.8 2.1 32.52 96.91 will be sequential, this will involve H −1 cache-invalidations.
1.00E+07 0.8 3.1 31.81 94.19 The trade-off for this cost is that we are guaranteed to have
1.00E+07 0.9 1.1 31.87 95.88 all tuples in correct order, with none of the compromises of
1.00E+07 0.9 2.1 26.05 77.39 FIB (Fine Inter-Bucket) or CIB (Coarse Inter-Bucket).
1.00E+07 0.9 3.1 33.89 99.79
1.00E+07 1 1.1 0 0 3.1.4 Heuristic
1.00E+07 1 2.1 0 0
The simplest approach is a direct-swap heuristic, which
1.00E+07 1 3.1 0 0
simply compares the current tuple τi with the first tuple of
the neighbourhood, τr , if the τi is the hotter of the two. This
approach should not have a high computational overhead,
3. METHODS as in traversing the neighbourhood to find τi , we already
stored a reference to τr upon first encountering it. If τi is
HopH is used as the basis for the solution. The argument
hotter than τr , swap them and their distribution entries.
of reordering based on the hotness of the data itself given by
[1] is highly aligned with the goals for this work. This work
differs as the specific objective is achieving spatio-temporal
3.1.5 Approximate Sorting
locality, in order to minimise average lookup times. The This approach exploits the spatial locality afforded by
value of S is selected such that a bucket Bi fits within a HopH; we are guaranteed that all H buckets within neigh-
cache-line size C = 64B. In order that the size of the bucket bourhood νi are in the same, homogeneous region of mem-
|Bi | ≤ C, we choose S, where the size of a tuple |τ | = ory. Once these pages are in the cache-hierarchy, latencies
8 + 8 = 16B, in accessing them are reduced and, depending on the cache
j andk µ is the size of any required meta-data,
layer, very efficient. As the Get() or Contains() opera-
to be S = C−µ |τ |
. Different access patterns are simulated
tion traverses νi , a Bubblesort-like operation can be applied
by drawing samples from a Zipfian distribution Z, whose based on hotness.
parameter α affects the skew of the samples obtained.
3.2 Epochs & Squashing
3.1 Reordering Strategies In order to be adaptive over time, there must be a series of
This section describes a number of re-ordering strategies rules that govern how the hotness of data changes over times
for the placement of data in a neighbourhood. These meth- and accesses. For approaches where there is an absolute trig-
ods look at coarse, down to fine tuple-level, and heuristic ger for reordering, an epoch is defined at a neighbourhood
reordering operations. level, and is its state before a reordering method is invoked.
For non-triggered reordering methods, there must be a con- 4. EXPERIMENTS
tinual state of adaptation for the associated meta-data.
Once tuples in νi have been rearranged, and the numeric 1. The base experiment looks at the Avg.CPU (Aver-
values of tuple accesses have been reduced in some manner, age CPU Cycles) metric of creating tables based on
the hottest element in νi should remain so (preservation of the same input data, and performing the same set of
hotness). For epoch-based approaches, this means the skew queries, in the same order, amongst different table con-
and shape of the distribution should be roughly equal after figurations and comparing their results. Input data is
an epoch, but smaller in magnitude. For non-epoch-based randomly generated, and HopH is used as a baseline.
approaches, the distribution should be more fluid, dynami-
cally changing all values in νi regularly. Those tuples in νi 2. A SHJ (Symmetric Hash Join) is constructed of two
that have not had many accesses should have historic ac- tables. A HotH configuration is compared with one
cess counts reduced, until sufficient epochs have passed that backed by HopH, to measure the difference the adap-
they no longer have any hotness in epoch-based reordering tive approach we propose makes to realistic scenario.
(cooling). Experiments evaluating SHJ performance will use ex-
isting data-sets, over different relation sizes.
3.3 Deletions
Handling deletions in HotH (Hotscotch Hashing) should 5. DISCUSSION
follow two principles further to a baseline HopH deletion: Implementations of all methods described have been com-
plete, with results expected soon. As the overarching theme
1. For per-slot level meta-data, the distribution entry
of this thesis is NVM, work will be conducted to exploit its
(number of reads) for the tuple to be deleted should
characteristics to aid HotH. An approach by [4] provides a
also be deleted from any per-bucket counter if present,
way for leaf nodes being stored in a DRAM/NVM hybrid
before being erased itself.
B+Tree, where the system prefers Contains() operations
2. If any slot that is not the first slot of a bucket contains to Get(), as keys are stored in quick DRAM and values in
the tuple to be deleted, subsequent slots should be NVM. Following a similar methodology for hybrid NVM /
shift towards the first slot, to minimise unnecessary DRAM should provide a fast key lookup for Contains(), as
traversal gaps in the bucket structure. we can now fit many keys within C, but also gives slower
reordering due to the extra cost of moving around NVM
After this, should the bucket be empty after tuple dele- vs. DRAM. Unlike the problem solved by [4], we must still
tion, the standard HopH process is followed. ensure spatial and temporal locality. Intuitively, this could
mean analog structures of M elements in DRAM and NVM
3.4 Invocation with a translation function t(i, s) → j that takes the bucket
Finding the balance between the severity of the reordering and slot numbers i, s respectively, and maps it to a position
strategy employed, and how and when to trigger it are key (offset) j in the NVM table.
points in this work. We explore a number of invocation
strategies, and the various forms of meta-data required to 6. ACKNOWLEDGMENTS
permit them. This work was supported in part by the EPSRC Centre
3.4.1 Trigger-based for Doctoral Training in Data Science, funded by the UK
Engineering and Physical Sciences Research Council (grant
The simplest method of invocation is that of setting a min- EP/L016427/1) and the University of Edinburgh.
imum number of accesses to a bucket or neighbourhood, de-
pendant upon the reordering strategy used, where a counter
ρ = 0 is meta-data within the bucket structure. Upon every 7. REFERENCES
Get() or Contains() call, ρ is incremented and evaluated. [1] S. Albers and M. Karpinski. Randomized splay trees:
Once ρ exceeds some instantiation-defined value σ, the re- Theoretical and experimental results. Information
ordering is invoked. Processing Letters, 81(4):213–221, 2002.
[2] M. Herlihy. Hopscotch Hashing. pages 0–15.
3.4.2 Probabilistic-based [3] S. Idreos, M. Kersten, and S. Manegold. Database
Instead of storing any meta-data at a bucket/neighbour- Cracking. CIDR ’07: 3rd Biennial Conference on
hood level, we can simply state that with some probability Innovative Data Systems Research, pages 68–78, 2007.
πr = P(Perform Redistribution), we will perform a reorder- [4] I. Oukid, J. Lasperas, A. Nica, T. Willhalm, and
ing. A key point in using this method is to avoid paying W. Lehner. FPTree: A Hybrid SCM-DRAM Persistent
the cost of the given reordering strategy often; however the and Concurrent B-Tree for Storage Class Memory.
generation and evaluation of PRN (Pseudo Random Num- Proceedings of the 2016 International Conference on
ber)s is itself a non-zero cost. Conclusions drawn from ex- Management of Data - SIGMOD ’16, pages 371–386,
perimentation using a number of PRNG (Pseudo Random 2016.
Number Generator)s show that the xorshift a good candi- [5] R. Pagh and F. F. Rodler. Cuckoo Hashing. Journal of
date. Further work is looking into simpler masking / shifting Algorithms, 51(2):122–144, 2004.
of an integer value, as statistically-sound randomness isn’t [6] D. D. Sleator and R. E. Tarjan. Self-adjusting Binary
mandatory.
Search Trees. Journal of the ACM, 32(3):652–686, 1985.