Spatio-Temporal Locality in Hash Tables

Matt A. Pugh
Supervised by Prof. Stratis Viglas
University of Edinburgh
10 Crichton St.
Scotland
matt.pugh@ed.ac.uk

ABSTRACT
The overall theme of this Ph.D. is looking at ways to use emerging NVM (Non-Volatile Memory) technologies in real-world data-science scenarios. It is hoped that exploiting the characteristics of the technology will result in performance improvements, defined as an increase in computational throughput, a reduction in energy use, or both. Primarily, this has been through the inclusion of temporal locality into HopH (Hopscotch Hashing) [2]. The problem of highly-skewed access patterns affecting lookup time and required computation is shown through a simple model. A simulator is then used to measure the expected performance gains of incorporating temporal locality, given different HT (Hash Table) configurations. This work was originally motivated by NVM, as a way to mask the extra latency anticipated, but it is applicable to HTs in DRAM (Dynamic Random-Access Memory) as well. The second area of interest in the Ph.D. is exploiting the characteristics of NVM for different families of machine learning algorithms, though this paper focuses solely on the former.

1. INTRODUCTION
At this stage, the primary focus has been on introducing spatial and temporal locality into HopH: carrying the concept of minimising amortised average lookup time into HopH will provide spatio-temporal locality with an amortised complexity of O(1) for the typical HT operations Get(), Insert(), and Delete(). Details on the approaches undertaken are provided in Section 3. This is motivated by skewed access patterns, following Zipf's law, and by high-performance systems such as RDBMS (Relational DataBase Management System) join operators.

In its simplest form, a Hash Table (HT) T is an M-bucket long associative array-like data structure that maps a key k of any hashable type to a value v. A given bucket Bi in T can have S slots for key/value tuples. The occupancy (or load factor) of a HT is 0 ≤ L ≤ 1, assuming no chained structures that would allow L > 1. The benefit of a HT is that the associated value's position i in memory is calculated directly using a hash function h(k), i = h(k) mod M.
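To make this layout concrete, the following is a minimal sketch, not the implementation used in this work; the type and constant names (Slot, Bucket, HashTable, S) are illustrative assumptions:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Minimal sketch of an M-bucket table in which each bucket Bi holds S
// key/value slots, and a key's home bucket is computed directly as
// i = h(k) mod M.
constexpr std::size_t S = 3; // slots per bucket (see Section 3 for how S is chosen)

struct Slot {
    std::uint64_t key = 0;
    std::uint64_t value = 0;
    bool occupied = false;
};

struct Bucket {
    Slot slots[S];
};

struct HashTable {
    std::vector<Bucket> buckets; // M buckets
    explicit HashTable(std::size_t M) : buckets(M) {}

    // Home bucket index for key k: i = h(k) mod M.
    std::size_t home(std::uint64_t k) const {
        return std::hash<std::uint64_t>{}(k) % buckets.size();
    }
};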
1.1 Self-Reorganising Data-Structures
A ST (Splay Tree) is a self-adjusting binary search tree introduced by [6] that has the property that the most frequently accessed node is near the head of the tree. STs have an amortised time-complexity for Get() and Insert() operations of O(log n), and the goal of their invention was to minimise the average lookup time in worst-case scenarios in real-time terms. Their solution is to minimise the number of node traversals by ensuring the most-frequently-accessed nodes are as close to the head as possible, a property which is closely tied to temporal locality, although not exactly the same thing. [6] acknowledge that the computational cost of ongoing reordering is high, and may be problematic given the target application workload. This is clear when considering the rotation of sub-trees in a Splay() operation: there is no spatial locality, or cache-friendliness, guaranteed by the structure of the tree in memory, so pointer-chasing in rotations is inevitable.

In databases, [3] describe cracking, a method of self-organising data by query workload. This process is tuned for an RDBMS in which a column of interest ACRK is retained as a copy, and is continuously reorganised based on the queries touching it. This is analogous to data hotness as we wish to view it, and can be considered a cache-tuning problem, similar to an LRU (Least Recently Used) cache. As ACRK is a column, they are able to effectively partition the data itself to improve probing response times significantly. Partitioning within a HopH context is less intuitive, as there are, at any position, H overlapping neighbourhoods.

2. MODEL AND SIMULATOR
The structure of a HT is entirely dependent on the order in which keys were inserted into it. For example, the hottest element of a neighbourhood, or even of the entire table, over some given time period may be in the last available bucket of that neighbourhood, leading to multiple evaluations before the required tuple is found.

2.1 Model
Assume a constructed Hash Table T, of which we have the set of all keys K in T. Let X be the sequence of accesses, such that every element x ∈ X exists in K. We define a simple cost model Φ(T, X) that gives a unit cost of a bucket traversal as β, and of a slot traversal and key comparison as σ, where α and γ are the number of respective operations:

Φ(T, X) = αβ + γσ    (1)

It is simple to use this model with a sufficient access-skew in X to show good potential gains in the worst case of the configuration of T. Consider an RO (Read-Only) T where some key k has hashed into bucket Bi, but the only available slot in the otherwise fully-occupied neighbourhood νi is at bucket Bi+H−1, slot S. Assume that X is completely skewed, such that only this one k is present, repeated A times. In this case, the cost of accesses is:

Φ(T, X) = AHβ + ASHσ

In this example, it is clear that minimising the values of α = AH and/or γ = ASH are the only areas of movement upon which we can optimise. If we assign some real-world values, where S = 3, H = 64, σ = 15, β = 15, A = 1 × 10⁹, we obtain a cost of:

Φ(T, X) ≈ 3.84 × 10¹²

When we instead assume a sequence X′ whose hottest element k′ sits in the first slot of the first bucket of the neighbourhood, the cost is naturally far lower:

Φ(T, X′) = A(β + σ) ≈ 3.0 × 10¹⁰

This is the lowest possible cost incurred for T and X′, and is therefore referred to as the oracle. With this reordering, 99.22% of cycles would be saved versus T with X.
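For concreteness, both figures follow directly from Equation (1) with the stated values:

Φ(T, X) = AHβ + ASHσ = 10⁹ · 64 · 15 + 10⁹ · 3 · 64 · 15 = 9.6 × 10¹¹ + 2.88 × 10¹² = 3.84 × 10¹²

Φ(T, X′) = A(β + σ) = 10⁹ · (15 + 15) = 3.0 × 10¹⁰

and the saving is 1 − (3.0 × 10¹⁰ / 3.84 × 10¹²) ≈ 99.22%.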
This approach has the problem that manually attempting to cover a number of configurations of T to find the expected benefit of rearrangement would be tedious. To deal with this, we employ a simulator.

[Figure 1: diagram omitted in this text-only version.] Figure 1: Buckets (coloured blocks) in neighbourhoods (colours) over some contiguous range of buckets within the table T. For a given neighbourhood, access patterns may naturally fall as the above distribution, where C is the cache-line size and each blue block represents a constant number of accesses to the associated key ki. Suppose a request is made for the hottest item k4: there will be 3 + δ cache-invalidations before hitting the hottest data k4 if T[h(k4)] is not already cached. Reordering the data such that k4 is in position 0, minimising cache-invalidations, is the optimal distribution. Note that the distribution of colours in T does not change.

2.2 Simulator
The developed simulator provides the estimated performance benefit of performing reordering for varying table sizes, load factors and skews. This is done using a number of discrete probability distributions to obtain different configurations of T, in order to explore the problem space we are interested in (a sketch of these sampling steps follows the list):

1. The number of slots that each bucket will have populated in the simulator's construction is sampled from a Multinomial distribution; for bucket Bi this is ςi.

2. N = ⌊S × M × L⌋ samples are drawn from a Zipfian distribution Z(α). These are the random hotness values that are then inserted into each bucket Bi ∈ T over the ςi slots in each bucket. This approach gives a load factor L and skew α over the data. T is then randomly permuted to distribute the remaining (for L < 1) empty buckets throughout the table.

3. As the number of buckets occupied within a neighbourhood νi is not constant, either ⌊ξ⌋ or ⌈ξ⌉ buckets per neighbourhood are allocated, where ξ is the expected number of elements in a neighbourhood, given in [5]. The probability πβ = ξ − ⌊ξ⌋ (the fractional part of ξ) lies within the [0, 1] interval, and is interpreted as the probability that the neighbourhood contains the upper bound of entries. At each neighbourhood root, neighbourhood occupancy is sampled from a Binomial distribution with probability πβ.

4. Finding the locations within the neighbourhood for occupancy should not be done linearly, and must be at least partly stochastic to emulate chronological insertions over multiple neighbourhoods. Approaching this in a linear manner would construct T such that all insertions happened to populate the table in exactly linear order; this behaviour is not realistic. Instead, we randomly draw samples from a Poisson distribution with mean λ = 2, so that the mass of the distribution is towards the beginning of the neighbourhood, but may be further on.
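The following is a minimal sketch of steps 2 and 4 above, assuming a finite-support inverse-CDF Zipf sampler; all names and constants are illustrative, not the simulator's actual code:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

int main() {
    std::mt19937_64 rng(42);
    const double alpha = 1.1;    // Zipf skew parameter
    const std::size_t N = 1000;  // N = floor(S * M * L) tuples to place
    const int H = 64;            // neighbourhood size

    // Step 2: finite Zipf(alpha) sampler over ranks 1..R via inverse-CDF lookup.
    const std::size_t R = 10000;
    std::vector<double> cdf(R);
    double norm = 0.0;
    for (std::size_t r = 1; r <= R; ++r) norm += std::pow(double(r), -alpha);
    double acc = 0.0;
    for (std::size_t r = 1; r <= R; ++r) {
        acc += std::pow(double(r), -alpha) / norm;
        cdf[r - 1] = acc;
    }
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    auto zipf_rank = [&]() -> std::size_t {
        return std::size_t(std::lower_bound(cdf.begin(), cdf.end(), uni(rng)) -
                           cdf.begin()) + 1;
    };

    // Step 4: a Poisson(lambda = 2) offset biases placement towards the
    // neighbourhood root while still allowing later buckets.
    std::poisson_distribution<int> pois(2.0);
    for (std::size_t n = 0; n < N; ++n) {
        std::size_t hotness = zipf_rank();        // hot tuples get low ranks
        int offset = std::min(pois(rng), H - 1);  // bucket within neighbourhood
        (void)hotness; (void)offset;              // insertion into T elided
    }
    return 0;
}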
2.2.1 Output & Discussion
Table 1 shows the output obtained thus far from the simulator. It shows that even under a coarse reordering policy, we begin to approach a factor-of-2 improvement in terms of work performed. Performing fine reordering over the same configuration invariably leads to better results than coarse. A key caveat is that this metric concerns itself only with the hottest element in the table; extensions are underway to take a more holistic view of potential performance gains. The fact that gains appear to vanish as L → 1 is entirely expected: as a table becomes full, most neighbourhoods will have only one bucket available in T, so reordering is not possible.

Table 1: Simulator results showing the oracle measurements, in terms of the percentage of instructions avoided.

Size M    Load L  Skew α  Coarse (%)  Fine (%)
1.00E+03  0.7     1.1     30.59       92.1
1.00E+03  0.7     2.1     31.49       99.53
1.00E+03  0.7     3.1     1.13        3.53
1.00E+03  0.8     1.1     27.86       83.61
1.00E+03  0.8     2.1     44.04       96.3
1.00E+03  0.8     3.1     0.58        2.08
1.00E+03  0.9     1.1     47.5        95.7
1.00E+03  0.9     2.1     48.43       99.04
1.00E+03  0.9     3.1     0.83        1.71
1.00E+03  1       1.1     0.99        2.64
1.00E+03  1       2.1     9.36        14.9
1.00E+03  1       3.1     0.59        1.1
1.00E+05  0.7     1.1     29.86       82.3
1.00E+05  0.7     2.1     35.39       98.92
1.00E+05  0.7     3.1     35.74       95.81
1.00E+05  0.8     1.1     32.82       93.67
1.00E+05  0.8     2.1     33.68       99.64
1.00E+05  0.8     3.1     32.64       95.69
1.00E+05  0.9     1.1     32.97       99.33
1.00E+05  0.9     2.1     35.43       97.79
1.00E+05  0.9     3.1     31.42       97.17
1.00E+05  1       1.1     0.01        0.02
1.00E+05  1       2.1     0.65        1.96
1.00E+05  1       3.1     0.01        0.01
1.00E+07  0.7     1.1     34.16       97.81
1.00E+07  0.7     2.1     34.14       99.2
1.00E+07  0.7     3.1     33.8        97.37
1.00E+07  0.8     1.1     33.03       98.04
1.00E+07  0.8     2.1     32.52       96.91
1.00E+07  0.8     3.1     31.81       94.19
1.00E+07  0.9     1.1     31.87       95.88
1.00E+07  0.9     2.1     26.05       77.39
1.00E+07  0.9     3.1     33.89       99.79
1.00E+07  1       1.1     0           0
1.00E+07  1       2.1     0           0
1.00E+07  1       3.1     0           0

3. METHODS
HopH is used as the basis for the solution. The argument for reordering based on the hotness of the data itself, given by [1], is highly aligned with the goals of this work. This work differs in that the specific objective is achieving spatio-temporal locality, in order to minimise average lookup times. The value of S is selected such that a bucket Bi fits within a cache line of size C = 64B: given a tuple size of |τ| = 8 + 8 = 16B and µ bytes of any required meta-data, we choose S = ⌊(C − µ) / |τ|⌋ so that |Bi| ≤ C.
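As a concrete instantiation (µ = 8B is an assumed illustrative value, not one taken from the implementation): with C = 64B and |τ| = 16B, S = ⌊(64 − 8) / 16⌋ = ⌊3.5⌋ = 3, consistent with the S = 3 used in the cost example of Section 2.1.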
Different access patterns are simulated by drawing samples from a Zipfian distribution Z, whose parameter α affects the skew of the samples obtained.

3.1 Reordering Strategies
This section describes a number of reordering strategies for the placement of data in a neighbourhood. These methods range from coarse bucket-level, down to fine tuple-level, and heuristic reordering operations.

3.1.1 Fine Intra-Bucket (FIntrB)
This method simply sorts the tuples within a bucket Bi (which must be of the same neighbourhood) by hotness. This is performed using a priority queue, inserting the tuples and ordering by their number of accesses, before reinserting them into Bi. As we know that |Bi| ≤ C, no further bucket traversals are required, and any potential gains are purely in terms of operations performed within Bi for a Contains() or Get() and, as such, will be minimal.

3.1.2 Coarse Inter-Bucket (CIB)
We can express the overall hotness of a bucket Bi as the sum of accesses to all tuples contained within it. With this method, we do not care about the order of the tuples within slots, but simply that the most-frequently-hit bucket is closest to the neighbourhood root. A clear compromise of this strategy is that it ignores the distribution of accesses within Bi: in an example where S = 3 and Bi has one element with many accesses but two without, while Bj has a uniform distribution whose total (summed) access count is greater than Bi's, Bj will be promoted first.

3.1.3 Fine Inter-Bucket (FIB)
By far the most expensive operation, this seeks to redistribute the neighbourhood at tuple level using a priority queue to order tuples by hotness, before reinserting them into all buckets within the neighbourhood. The positions of buckets within the neighbourhood do not change. In terms of memory use, this strategy is the most demanding: potentially, if every bucket in a neighbourhood νi belongs to that neighbourhood, there will be many elements to copy and reinsert over H buckets. Although the memory traversal will be sequential, it will involve H − 1 cache-invalidations. The trade-off for this cost is that we are guaranteed to have all tuples in the correct order, with none of the compromises of FIntrB (Fine Intra-Bucket) or CIB (Coarse Inter-Bucket).

3.1.4 Heuristic
The simplest approach is a direct-swap heuristic, which compares the current tuple τi with the first tuple of the neighbourhood, τr, and swaps them, along with their distribution entries, if τi is the hotter of the two. This approach should not have a high computational overhead, as in traversing the neighbourhood to find τi, we already stored a reference to τr upon first encountering it.

3.1.5 Approximate Sorting
This approach exploits the spatial locality afforded by HopH; we are guaranteed that all H buckets within neighbourhood νi are in the same, homogeneous region of memory. Once these pages are in the cache hierarchy, latencies in accessing them are reduced and, depending on the cache layer, access is very efficient. As a Get() or Contains() operation traverses νi, a Bubblesort-like pass can be applied based on hotness; a sketch is given below.
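A minimal sketch of such a pass, with assumed names and simplifications (per-tuple access counters as hotness meta-data; hopscotch neighbourhood-bitmap maintenance elided); this is not the implementation evaluated here:

#include <cstdint>
#include <utility>

// One neighbourhood of a hopscotch table, flattened to a run of slots.
// A single forward pass swaps adjacent occupied slots whenever a hotter
// tuple sits further from the neighbourhood root than a colder one,
// drifting hot tuples towards the root over repeated lookups.
struct Tuple {
    std::uint64_t key = 0;
    std::uint64_t value = 0;
    std::uint64_t accesses = 0; // hotness: per-tuple read counter
    std::uint32_t home = 0;     // root bucket the key hashed to
    bool occupied = false;
};

void approximate_sort_pass(Tuple* slots, int n_slots) {
    for (int i = 0; i + 1 < n_slots; ++i) {
        Tuple& a = slots[i];
        Tuple& b = slots[i + 1];
        // Only swap tuples with the same home bucket, so that both keys
        // remain inside their own neighbourhoods after the move.
        if (a.occupied && b.occupied && a.home == b.home &&
            b.accesses > a.accesses) {
            std::swap(a, b);
        }
    }
}

In a full hopscotch implementation, the per-bucket neighbourhood bitmaps would also need updating on each swap; the pass is cheap because it piggybacks on slots the lookup has already brought into cache.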
3.2 Epochs & Squashing
In order to be adaptive over time, there must be a series of rules that govern how the hotness of data changes over time and accesses. For approaches where there is an absolute trigger for reordering, an epoch is defined at the neighbourhood level, and is the neighbourhood's state before a reordering method is invoked. For non-triggered reordering methods, there must be a continual state of adaptation of the associated meta-data. Once tuples in νi have been rearranged, and the numeric values of tuple accesses have been reduced in some manner, the hottest element in νi should remain so (preservation of hotness). For epoch-based approaches, this means the skew and shape of the distribution should be roughly equal after an epoch, but smaller in magnitude. For non-epoch-based approaches, the distribution should be more fluid, dynamically changing all values in νi regularly. Those tuples in νi that have not had many accesses should have their historic access counts reduced, until sufficient epochs have passed that they no longer have any hotness under epoch-based reordering (cooling).

3.3 Deletions
Handling deletions in HotH (Hotscotch Hashing) should follow two principles further to a baseline HopH deletion:

1. For per-slot meta-data, the distribution entry (number of reads) for the tuple to be deleted should also be subtracted from any per-bucket counter, if present, before the entry itself is erased.

2. If the tuple to be deleted occupies any slot other than the first slot of a bucket, subsequent slots should be shifted towards the first slot, to minimise unnecessary traversal gaps in the bucket structure.

After this, should the bucket be empty following the tuple's deletion, the standard HopH process is followed.

3.4 Invocation
Finding the balance between the severity of the reordering strategy employed, and how and when to trigger it, are key points in this work. We explore a number of invocation strategies, and the various forms of meta-data required to permit them.

3.4.1 Trigger-based
The simplest method of invocation is to set a minimum number of accesses to a bucket or neighbourhood, dependent upon the reordering strategy used, where a counter ρ, initialised to 0, is kept as meta-data within the bucket structure. Upon every Get() or Contains() call, ρ is incremented and evaluated. Once ρ exceeds some instantiation-defined value σ, the reordering is invoked.

3.4.2 Probabilistic-based
Instead of storing any meta-data at the bucket/neighbourhood level, we can simply state that with some probability πr = P(Perform Redistribution), we will perform a reordering. A key point in using this method is to avoid paying the cost of the given reordering strategy too often; however, the generation and evaluation of PRNs (Pseudo Random Numbers) is itself a non-zero cost. Conclusions drawn from experimentation with a number of PRNGs (Pseudo Random Number Generators) show that xorshift is a good candidate. Further work is looking into simpler masking/shifting of an integer value, as statistically-sound randomness isn't mandatory.

4. EXPERIMENTS
1. The base experiment looks at the Avg.CPU (Average CPU Cycles) metric when creating tables from the same input data and performing the same set of queries, in the same order, amongst different table configurations, and compares their results. Input data is randomly generated, and HopH is used as a baseline.

2. A SHJ (Symmetric Hash Join) is constructed over two tables. A HotH configuration is compared with one backed by HopH, to measure the difference the adaptive approach we propose makes in a realistic scenario. Experiments evaluating SHJ performance will use existing data-sets, over different relation sizes.

5. DISCUSSION
Implementations of all methods described are complete, with results expected soon. As the overarching theme of this thesis is NVM, work will be conducted to exploit its characteristics to aid HotH. An approach by [4] provides a way for leaf nodes to be stored in a DRAM/NVM hybrid B+Tree, where the system prefers Contains() operations to Get(), as keys are stored in quick DRAM and values in NVM. Following a similar methodology for a hybrid NVM/DRAM layout should provide a fast key lookup for Contains(), as we can now fit many keys within C, but also gives slower reordering due to the extra cost of moving data around in NVM vs. DRAM. Unlike the problem solved by [4], we must still ensure spatial and temporal locality. Intuitively, this could mean analogous structures of M elements in DRAM and NVM with a translation function t(i, s) → j that takes the bucket and slot numbers i and s respectively, and maps them to a position (offset) j in the NVM table.

6. ACKNOWLEDGMENTS
This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.

7. REFERENCES
[1] S. Albers and M. Karpinski. Randomized splay trees: Theoretical and experimental results. Information Processing Letters, 81(4):213–221, 2002.
[2] M. Herlihy, N. Shavit, and M. Tzafrir. Hopscotch hashing. In Proceedings of the 22nd International Symposium on Distributed Computing (DISC), 2008.
[3] S. Idreos, M. Kersten, and S. Manegold. Database cracking. In CIDR '07: 3rd Biennial Conference on Innovative Data Systems Research, pages 68–78, 2007.
[4] I. Oukid, J. Lasperas, A. Nica, T. Willhalm, and W. Lehner. FPTree: A hybrid SCM-DRAM persistent and concurrent B-tree for storage class memory. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16), pages 371–386, 2016.
[5] R. Pagh and F. F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122–144, 2004.
[6] D. D. Sleator and R. E. Tarjan. Self-adjusting binary search trees. Journal of the ACM, 32(3):652–686, 1985.