1. INTRODUCTION

PhD Workshop, August

Spatio-Temporal Locality in Hash Tables

Matt A. Pugh

matt.pugh@ed.ac.uk 0

Supervised by Stratis Viglas

0 0 University of Edinburgh 10 Crichton St. Scotland

2017

28 2017

The overall theme of this Ph.D. is looking at ways to use emerging NVM (Non-Volatile Memory) technologies in realworld data-science scenarios. It is hoped that the exploitation of the characteristics of the technology will result in performance improvements, de ned as being either/or an increase in computational throughput and energy-use reduction. Primarily, this has been through the inclusion of temporal locality into HopH (Hopscotch Hashing) by [2]. The problem of highly-skewed access patterns a ecting lookup time and required computation is shown through a simple model. A simulator is then used to measure the expected performance gains of incorporating temporal locality, given di erent HT (Hash Table) con gurations. This work was originally motivated by NVM, as a way to mask the extra latency anticipated, but the work is applicable to HTs in DRAM (Dynamic Random-Access Memory) also. The second area of interest in the Ph.D. is looking at exploiting the characteristics of NVM for di erent families of machine learning algorithms, though this paper focuses solely on the former.

1. INTRODUCTION

At this stage, the primary focus has been at introducing spatial and temporal locality into HopH { taking the concept of minimising amortised average lookup time into HopH will provide spatio-temporal locality with an amortised complexity of O(1) for the typical HT operations; Get(), Insert(), and Delete(). Details on the approaches undertaken are provided in Section 3. This is motivated by skewed access patterns, following Zipf's law, and high-performance systems, such as RDBMS (Relational DataBase Management System) join operators.

In its simplest form, a Hash Table (HT) T is an M -bucket long associative array-like data structure that maps a key k of any hashable type to value v. A given bucket Bi in T can have S slots for key/value tuples. The occupancy (or load factor) of a HT is 0 L 11. The bene t of a HT is that the associated value's position i in memory is calculated directly using a hash function h(k), i = h(k) mod M . 1.1

A ST (Splay Tree) is a self-adjusting binary search tree introduced by [ 6 ] that has the property that the most frequently accessed node is near the head of the tree. STs have the amortised time-complexity for Get() and Insert() operations of O(log n), and the goal of their invention was to minimise average time of lookups in worst-case scenario in real-time terms. Their solution to achieving this is to minimise the number of node traversals by ensuring the mostfrequently-accessed nodes are as close to the head as possible, a property which is closely tied to temporal locality, although not exactly the same thing. [ 6 ] acknowledge that the computational cost of ongoing reordering is high, and may be problematic given the target application workload. This is clear when considering the rotation of sub-trees in a Splay() operation { there is no spatial locality, or cachefriendliness, guaranteed by the structure of the tree in memory, as such pointer-chasing in rotations is inevitable.

In databases, [ 3 ] describe cracking; a method of self organisation data by query workload. This process is tuned for a RDBMS in which a column of interest ACRK is retained as a copy, and is continuously reorganised based on the queries touching it. This is analogous to data hotness as we wish to view it, and can be considered a cache-tuning problem, similar to a LRU (Least Recently Used) cache. As ACRK is a column, they are able to e ectively partition the data itself to improve probing response times signi cantly. Partitioning within a HopH context is less intuitive as there are, at any position, H overlapping neighbourhoods. 2.

MODEL AND SIMULATOR

The structure of a HT is entirely dependent on the order in which keys were inserted into it. For example, the hottest element of the neighbourhood, or even entire table, over some given time period, may be in the last available bucket in a neighbourhood, leading to multiple evaluations before nding the tuple required. 2.1

Model

Assume a constructed Hash Table T , of which we have a set of all keys K in T . Let X be the sequence of accesses, 1Assuming no chained-structures resulting in L > 1 load factor

C C C δC C Figure 1: This gure illustrates buckets (coloured blocks) in neighbourhoods (colours) over some contiguous range of buckets within the table T . For a given neighbourhood, access patterns may naturally fall as the above distribution, where C is the cache-line size and each blue block represents a constant number of accesses to the associated key ki. Suppose a request is made for the hottest item k4, there will be 3 + cache-invalidations before hitting the hottest data k4 if T [h(k4)] is not already cached. Reordering the data such that k4 is in position 0, minimising cache-invalidations, is the optimal distribution. Note that the distribution of colours in T does not change. such that every element x 2 X exists in K. We de ne a simple cost model (T; X) that gives a unit cost of a bucket traversal as , and a slot traversal and key comparison as unit cost, where and are the number of respective operations.

(T; X) = + (1)

It is simple to use this model with a su cient access-skew in X to show good potential gains in the worst case of the con guration of T . Consider a RO (Read-Only) T where some key k has hashed into bucket Bi, but the only available slot in the otherwise fully-occupied neighbourhood i is at Bi+H 1, slot S. Assume that X is completely skewed, such that there is only one k present, repeated A times. In this case, the cost of accesses is:

(T; X) = AH( ) + ASH( )

In this example, it is clear to see that minimising the values of = AH and/or = ASH are the only areas of movement upon which we can optimise, if we assign some real-world values, where S = 3; H = 64; = 15; = 15; A = 1 109, we obtain a cost of: (T; X) ' 3:84 1012 When we assume that X0 the hottest element k0 in the rst slot of the rst bucket in a neighbourhood, the cost is naturally far lower: (T; X0) = A( + ) ' 3:0 1010

This is the lowest possible cost incurred for T and X0, and is therefore referred to as the oracle. With this reordering, 99:22% of cycles would be saved versus T with X. This approach has the problem that manually attempting to cover a number of con gurations of T to nd the expected bene t of rearrangement would be tedious. To deal with this, we employ a simulator. 2.2

Simulator

The developed simulator provides the estimated performance bene t of performing reordering for varying table sizes, load factors and skews. This is done using a number of discrete probability distributions to obtain di erent con gurations of T , in order to explore the problem space we are interested in: 1. The number of slots that each bucket will have populated in the simulator's construction is sampled from a Multinomial distribution, for bucket Bi this is &i. 2. N = bS M Lc samples are drawn from a Zip an distribution Z( ). These are the random hotness values that are then inserted into each bucket Bi 2 T over &i slots in each bucket. This approach gives a load factor L and skew over the data. T is then randomly permuted to distribute the remaining (for L < 1) empty buckets throughout the table. 3. As the number of buckets occupied within a neighbourhood i is not a constant, either b c or d e buckets per neighbourhood are allocated with probability = d e b c, where is the expected number of elements in a neighbourhood, given in [ 5 ]. must be within the [0; 1] interval, and is interpreted as the probability that the neighbourhood contains the upperbound of entries. At each neighbourhood root, neighbourhood occupancy is sampled from a Binomial distribution with probability . 4. Finding the locations within the neighbourhood for occupancy should not be done linearly, and must be at least partly stochastic to emulate the chronological insertions over multiple neighbourhoods. Approaching this in a linear manner would construct T such that all insertions happened to populate the table in exactly linear order; this behaviour is not realistic. Instead, we randomly draw samples from a Poisson distribution that has a mean = 2, in order that the mass of the distribution is towards the beginning of the neighbourhood, but may be further on. 2.2.1

Output & Discussion

Table 1 shows the output obtained thus far from the simulator, this shows that even on a coarse reordering policy, we begin to approx a factor of 2 improvement in terms of work performed. Performing ne reordering over the same con guration invariably leads to better results than coarse. A key caveat is that this metric concerns itself only with the hottest element in the table, extensions are underway to take a more wholistic view of potential performance gains. The fact that as L ! 1, gains appear to disappear is entirely expected. This is due to the fact that as a table becomes full, it is to be expected that most neighbourhoods will only have one bucket in T , therefore reordering is not possible.

METHODS

HopH is used as the basis for the solution. The argument of reordering based on the hotness of the data itself given by [ 1 ] is highly aligned with the goals for this work. This work di ers as the speci c objective is achieving spatio-temporal locality, in order to minimise average lookup times. The value of S is selected such that a bucket Bi ts within a cache-line size C = 64B. In order that the size of the bucket jBij C, we choose S, where the size of a tuple j j = 8 + 8 = 16B, and is the size of any required meta-data, to be S = j C k. Di erent access patterns are simulated j j by drawing samples from a Zip an distribution Z, whose parameter a ects the skew of the samples obtained. 3.1

Reordering Strategies

This section describes a number of re-ordering strategies for the placement of data in a neighbourhood. These methods look at coarse, down to ne tuple-level, and heuristic reordering operations. 3.1.1

Fine - Intra-Bucket (FIntrB)

This method simply sorts the tuples within a bucket Bi (that must be of the same neighbourhood) by hotness. This is performed using a priority queue, inserting the tuples and ordering by their number of accesses, before reinserting them into Bi. As we know that jBij C, there are no further bucket traversals required, and any potential gains are purely in terms of operations performed within Bi for a Contains() or Get() and, as such, will be minimal. 3.1.2

Coarse Inter-Bucket (CIB)

We can express the overall hotness of a bucket Bi by the summation of accesses to all tuples contained within it. With this method, we do not care about the order of the tuples within slots, but simply that the most-frequently-hit bucket is closest to the neighbourhood root. A clear compromise of this strategy is that it does not care about the distribution of accesses within Bi; in the example where S = 3 and Bi has one element with many accesses, but two without, and Bj has a uniform distribution whose total (summed) access is more than Bi, Bj will be promoted rst. 3.1.3

Fine Inter-Bucket (FIB)

By far the most expensive operation, this seeks to redistribute the neighbourhood at a tuple-level using a priority queue to order tuples by hotness, before reinserting them into all buckets within the neighbourhood. The positions of buckets within the neighbourhood do not change. In terms of memory use, this strategy is the most demanding. Potentially, if every possible bucket in a neighbourhood i belongs to that neighbourhood, there will be many elements to copy and reinsert over H buckets. Although the memory traversal will be sequential, this will involve H 1 cache-invalidations. The trade-o for this cost is that we are guaranteed to have all tuples in correct order, with none of the compromises of FIB (Fine Inter-Bucket) or CIB (Coarse Inter-Bucket). 3.1.4

Heuristic

The simplest approach is a direct-swap heuristic, which simply compares the current tuple i with the rst tuple of the neighbourhood, r, if the i is the hotter of the two. This approach should not have a high computational overhead, as in traversing the neighbourhood to nd i, we already stored a reference to r upon rst encountering it. If i is hotter than r, swap them and their distribution entries. 3.1.5

Approximate Sorting

This approach exploits the spatial locality a orded by HopH; we are guaranteed that all H buckets within neighbourhood i are in the same, homogeneous region of memory. Once these pages are in the cache-hierarchy, latencies in accessing them are reduced and, depending on the cache layer, very e cient. As the Get() or Contains() operation traverses i, a Bubblesort-like operation can be applied based on hotness. 3.2

Epochs & Squashing

In order to be adaptive over time, there must be a series of rules that govern how the hotness of data changes over times and accesses. For approaches where there is an absolute trigger for reordering, an epoch is de ned at a neighbourhood level, and is its state before a reordering method is invoked. For non-triggered reordering methods, there must be a continual state of adaptation for the associated meta-data.

Once tuples in i have been rearranged, and the numeric values of tuple accesses have been reduced in some manner, the hottest element in i should remain so (preservation of hotness). For epoch-based approaches, this means the skew and shape of the distribution should be roughly equal after an epoch, but smaller in magnitude. For non-epoch-based approaches, the distribution should be more uid, dynamically changing all values in i regularly. Those tuples in i that have not had many accesses should have historic access counts reduced, until su cient epochs have passed that they no longer have any hotness in epoch-based reordering (cooling). 3.3

Deletions

Handling deletions in HotH (Hotscotch Hashing) should follow two principles further to a baseline HopH deletion: 1. For per-slot level meta-data, the distribution entry (number of reads) for the tuple to be deleted should also be deleted from any per-bucket counter if present, before being erased itself. 2. If any slot that is not the rst slot of a bucket contains the tuple to be deleted, subsequent slots should be shift towards the rst slot, to minimise unnecessary traversal gaps in the bucket structure.

After this, should the bucket be empty after tuple deletion, the standard HopH process is followed. 3.4

Invocation

Finding the balance between the severity of the reordering strategy employed, and how and when to trigger it are key points in this work. We explore a number of invocation strategies, and the various forms of meta-data required to permit them. 3.4.1

Trigger-based

The simplest method of invocation is that of setting a minimum number of accesses to a bucket or neighbourhood, dependant upon the reordering strategy used, where a counter = 0 is meta-data within the bucket structure. Upon every Get() or Contains() call, is incremented and evaluated. Once exceeds some instantiation-de ned value , the reordering is invoked. 3.4.2

Probabilistic-based

Instead of storing any meta-data at a bucket/neighbourhood level, we can simply state that with some probability r = P(Perform Redistribution), we will perform a reordering. A key point in using this method is to avoid paying the cost of the given reordering strategy often; however the generation and evaluation of PRN (Pseudo Random Number)s is itself a non-zero cost. Conclusions drawn from experimentation using a number of PRNG (Pseudo Random Number Generator)s show that the xorshift a good candidate. Further work is looking into simpler masking / shifting of an integer value, as statistically-sound randomness isn't mandatory.

EXPERIMENTS

DISCUSSION

Implementations of all methods described have been complete, with results expected soon. As the overarching theme of this thesis is NVM, work will be conducted to exploit its characteristics to aid HotH. An approach by [ 4 ] provides a way for leaf nodes being stored in a DRAM/NVM hybrid B+Tree, where the system prefers Contains() operations to Get(), as keys are stored in quick DRAM and values in NVM. Following a similar methodology for hybrid NVM / DRAM should provide a fast key lookup for Contains(), as we can now t many keys within C, but also gives slower reordering due to the extra cost of moving around NVM vs. DRAM. Unlike the problem solved by [ 4 ], we must still ensure spatial and temporal locality. Intuitively, this could mean analog structures of M elements in DRAM and NVM with a translation function t(i; s) ! j that takes the bucket and slot numbers i; s respectively, and maps it to a position (o set) j in the NVM table.

ACKNOWLEDGMENTS

This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.

[1]

Albers and

Karpinski . Randomized splay trees: Theoretical and experimental results . Information Processing Letters , 81 ( 4 ): 213 { 221 , 2002 .

[2]

Herlihy . Hopscotch Hashing . pages 0 { 15 .

[3]

Idreos ,

Kersten , and

S. Manegold. Database

Cracking . CIDR '07: 3rd Biennial Conference on Innovative Data Systems Research , pages 68 { 78 , 2007 .

[4]

Oukid ,

Lasperas ,

Nica ,

Willhalm , and

Lehner. FPTree: A Hybrid SCM-DRAM Persistent and Concurrent B-Tree for Storage Class Memory . Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16 , pages 371 { 386 , 2016 .

[5]

Pagh and

F. F.

Rodler . Cuckoo Hashing. Journal of Algorithms , 51 ( 2 ): 122 { 144 , 2004 .

[6]

D. D.

Sleator and

R. E.

Tarjan . Self-adjusting Binary Search Trees . Journal of the ACM , 32 ( 3 ): 652 { 686 , 1985 .