=Paper=
{{Paper
|id=None
|storemode=property
|title=Revision of Relational Joins for Multi-Core and Many-Core Architectures
|pdfUrl=https://ceur-ws.org/Vol-706/paper16.pdf
|volume=Vol-706
|dblpUrl=https://dblp.org/rec/conf/dateso/KrulisY11
}}
==Revision of Relational Joins for Multi-Core and Many-Core Architectures==
Martin Kruliš and Jakub Yaghob⋆
Department of Software Engineering, Faculty of Mathematics and Physics,
Charles University in Prague
Malostranské nám. 25, Prague, Czech Republic
{krulis,yaghob}@ksi.mff.cuni.cz
http://www.ksi.mff.cuni.cz/
Abstract. The current trend set by CPU manufacturers and recent developments in the field of graphics processing units (GPUs) have brought the computational power of multi-core and many-core architectures to common users. Database applications can benefit greatly from parallelism; however, many algorithms need to be redesigned and many technical issues need to be solved. In this paper, we focus on the standard relational join problem from the perspective of current highly parallel architectures. We present a comparison of different approaches and propose algorithm adaptations that better utilize multiple computational cores.
Key words: relational, join, intersection, parallel, multi-core, many-
core, GPGPU
1 Introduction
Joins in relational databases have been studied thoroughly since they are one
of the most essential operations. Even though the algorithms are not likely to
be improved anymore, the implementation details need to be revised every few
years as they are sensitive to many aspects of hardware architectures, which are
changing constantly.
In the past few years, parallel architectures have become available to common users. Central processing units with up to 12 logical cores occupy the mainstream segment of the market, and graphics cards containing tens or even hundreds of processing units are present in almost every PC. Furthermore, the field of general purpose computing has undergone several major changes in both software accessibility (e.g., the release of the OpenCL framework) and hardware design.
Unfortunately, the parallelism sometimes cannot be exploited as easily as hardware manufacturers suggest. There are many concerns that need to be taken into account. In the case of multi-core CPUs, there are issues regarding memory
⋆ This paper was supported by the grant SVV-2011-263312 and by the Grant Agency of Charles University (GAUK), project no. 277911.
V. Snášel, J. Pokorný, K. Richta (Eds.): Dateso 2011, pp. 229–240, ISBN 978-80-248-2391-1.
access, namely the cache coherency maintenance across multiple cores, the bottleneck created by multiple cores accessing memory via a single bus, and the differences in memory access latency caused by the NUMA [1] factor. In the case of many-core GPUs, the memory access problems become even more severe. The first problem is that the GPU has its own memory, which is connected to the rest of the system via the external PCI-Express bus. The second problem is that the caches of a GPU are much smaller than the caches of an ordinary CPU, therefore access to the memory must be carefully optimized. In addition, the GPU architecture differs from the CPU architecture as it exploits the single instruction, multiple data (SIMD) paradigm. In this paper, we address these issues and present some solutions.
For the sake of simplicity, we have narrowed the problem of join operations, since they have many variations based on their purpose, data types, existence of indices, etc. Henceforth, we define the join operation as a simple intersection of two key sets. Both sets have unique keys, which are numbers of a fixed size (e.g., 32-bit integers). We also assume that the keys are distributed almost uniformly over their domain.
The paper is organized as follows. Section 2 reviews related work. Section 3 briefly summarizes standard serial algorithms for the join problem. Section 4 presents and compares possible parallelization techniques for the serial algorithms. Section 5 addresses the problems of many-core architectures (such as GPU cards) and revises the parallel algorithms from their perspective. Finally, Section 6 presents experimental results and Section 7 concludes.
2 Related Work
Relational joins [2] are one of the most important database operations. There are many papers related to the subject of parallel join processing, taking many different views. Liu et al. [3] focused on the pipelined parallelism of multi-join queries. In contrast, we focus on accelerating the processing of a single join query. Lu et al. [4] compared four hash-based join algorithms on a multiprocessor system. Schneider et al. [5] studied join algorithms in shared-nothing systems. Cieslewicz et al. [6] implemented a highly parallel hash join on the Cray MTA-2 architecture. Recently, Kim et al. [7] reviewed hash and sort join algorithms from the perspective of modern multi-core CPUs.
Modern many-core architectures, such as GPUs [8], have become available to programmers thanks to GPGPU languages and frameworks such as CUDA [9] and OpenCL [10]. GPGPU techniques are summarized in the survey by Owens et al. [11]. Many tasks used in query processing, such as sorting [12, 13], have already been adapted to the GPU. Bakkum et al. [14] implemented an SQL query processor accelerated by a GPU. He et al. [15] studied several types of join algorithms using low-level data primitives, such as split or gather/scatter, implemented on GPUs. We will reflect some of their findings in our work, but we use a different approach to the problem that exploits the hardware features and architectures of today.
3 Serial Joins
3.1 Brief Overview
In this section, we briefly review existing join algorithms, so that we can refer to them later. We also pinpoint their strengths and weaknesses with respect to current hardware properties.
There are two major approaches to solving the join problem. The first widely used approach was the merge join. Both joined sets are sorted first and then merged in a single pass. Originally, it was designed for systems with a small amount of internal memory, since the sorting can be performed by external-memory algorithms and the merge phase requires little additional memory. At present, the merge join algorithm is used only in special cases. For instance, if the joined sets are (or can be) yielded by previous stages of the query execution pipeline already sorted, the sorting phase may be omitted.
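To make the merge phase concrete, the following is a minimal C++ sketch of the sorted-set intersection it reduces to (function and type names are ours, purely for illustration):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Merge phase of a merge join over two sorted, duplicate-free key sets:
  // advance the cursor with the smaller key and emit keys present in both.
  std::vector<uint32_t> merge_intersect(const std::vector<uint32_t>& a,
                                        const std::vector<uint32_t>& b) {
      std::vector<uint32_t> result;
      std::size_t i = 0, j = 0;
      while (i < a.size() && j < b.size()) {
          if (a[i] < b[j])      ++i;
          else if (b[j] < a[i]) ++j;
          else { result.push_back(a[i]); ++i; ++j; }   // match found
      }
      return result;
  }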
As the amount of internal random access memory in computers grew, database systems replaced the merge join with the hash join algorithm. The hash join stores one of the sets in a hash table and then looks up each item of the second set in the table. If the item is found there, it is included in the result. Hash join principles are used in the state-of-the-art algorithms of today.
A special case of the hash join algorithm uses a bitmap as the hash table. If the key domain is sufficiently small, we can create a bit field where each bit corresponds to one item of the domain. In our case, such a field requires 2^32 bits (512 MB). When a bit is set, the corresponding key is present in the hashed set, and vice versa.
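A single-threaded C++ sketch of this bitmap variant might look as follows (names are ours; the flat bit field over the full 32-bit key domain occupies exactly 512 MB):

  #include <cstdint>
  #include <vector>

  // Mark every key of the first (build) set in a 2^32-bit field, then probe
  // the field with the keys of the second set.
  std::vector<uint32_t> bitmap_join(const std::vector<uint32_t>& build,
                                    const std::vector<uint32_t>& probe) {
      std::vector<uint64_t> bits((1ULL << 32) / 64, 0);        // 512 MB bit field
      for (uint32_t k : build)
          bits[k >> 6] |= 1ULL << (k & 63);                    // set bit of key k
      std::vector<uint32_t> result;
      for (uint32_t k : probe)
          if (bits[k >> 6] & (1ULL << (k & 63)))               // key in build set?
              result.push_back(k);
      return result;
  }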
3.2 Considering Hardware Aspects
There are two important aspects of current CPU architectures that affect join performance [7]:
• CPU caches,
• and virtual memory translation.
The CPU uses two or three levels of cache to reduce memory access latency. The size of the cache is much smaller than the size of main memory, thus accessing large memory areas randomly is not very efficient. Furthermore, the processor employs prefetch mechanisms that try to identify memory which is likely to be needed soon and load it into the cache in advance. These mechanisms work best when memory is accessed sequentially.
The second important issue is virtual memory translation. On the IA-32 architecture, the translation is performed through page tables [16]. Depending on the virtual address space and some other details, the address is translated by looking up records in 3 or 4 tables. Therefore, each virtual memory access leads to 4 or 5 physical memory accesses. In order to reduce this overhead, modern CPUs implement a TLB cache for translations. However, the TLB cache is limited in size,
thus if we access a larger number of virtual pages randomly, TLB misses become more frequent and memory latency rises.
In light of current CPU architectures, we can redesign both the merge and hash join algorithms. The sorting algorithm, which takes more than 90% of the merge join time, can be optimized [17], and the hash join can compact its memory access pattern by introducing partitioning techniques [7] that we call bucketing. These optimizations are beyond the scope of this paper, but we will exploit bucketing for parallelization later.
4 Multi-Core Implementation
In this section, we consider multi-core CPU architectures to increase the performance of join algorithms. We examine the strengths and weaknesses of the merge and hash join algorithms from the parallel point of view and then improve them with bucket partitioning.
4.1 Direct Approach
Merge Join can be parallelized quite easily, as the sorting algorithms are known to scale well [17]. The final merge can be parallelized as well by separating one of the sorted sets into fragments and processing each fragment concurrently. The only complication is that we need to find the corresponding fragments in the second set so that matching fragments can be merged; however, these fragments are easily identified by a simple application of the binary search algorithm.
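The fragment lookup could be sketched as follows (our own illustration using std::lower_bound/std::upper_bound; each returned fragment pair can then be merged independently by a separate thread):

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Split sorted set 'a' into roughly equal fragments and locate, by binary
  // search, the range of sorted set 'b' that covers the same key interval.
  struct Fragment { std::size_t aBegin, aEnd, bBegin, bEnd; };

  std::vector<Fragment> make_fragments(const std::vector<uint32_t>& a,
                                       const std::vector<uint32_t>& b,
                                       std::size_t parts) {
      std::vector<Fragment> fragments;
      for (std::size_t p = 0; p < parts; ++p) {
          std::size_t aBegin = a.size() * p / parts;
          std::size_t aEnd   = a.size() * (p + 1) / parts;
          if (aBegin == aEnd) continue;
          // binary search for the matching key range in b
          std::size_t bBegin = std::lower_bound(b.begin(), b.end(), a[aBegin]) - b.begin();
          std::size_t bEnd   = std::upper_bound(b.begin(), b.end(), a[aEnd - 1]) - b.begin();
          fragments.push_back({aBegin, aEnd, bBegin, bEnd});
      }
      return fragments;
  }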
Hash join does not scale as well as merge join. We need to synchronize access to the hash table (or the bitmap) by some kind of locking mechanism or atomic operations. The synchronization creates a bottleneck in the algorithm, therefore the speedup is rather small, as we can see in Section 6.
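For illustration, such a synchronized bitmap could be built with atomic OR operations; a sketch using C++ std::atomic (an assumption of ours, not the paper's actual implementation):

  #include <atomic>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Shared bitmap that many threads may fill concurrently; fetch_or makes each
  // bit update atomic, at the cost of contention on shared cache lines.
  class AtomicBitmap {
  public:
      explicit AtomicBitmap(std::size_t bits) : words_(bits / 64) {}
      void insert(uint32_t key) {
          words_[key >> 6].fetch_or(1ULL << (key & 63), std::memory_order_relaxed);
      }
      bool contains(uint32_t key) const {
          return words_[key >> 6].load(std::memory_order_relaxed) & (1ULL << (key & 63));
      }
  private:
      std::vector<std::atomic<uint64_t>> words_;
  };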
4.2 Bucketing Exploitation
The best way to solve any problem in parallel is to create independent tasks so that they can be processed by multiple threads without explicit synchronization. In the case of join algorithms, we can employ the bucketing principle. We use a hashing function to separate the data into buckets, and each bucket can then be processed by a different thread. For the sake of simplicity, we always divide the set among 2^k buckets using the upper k bits of the key. More sophisticated hashing functions can be used in case the key distribution is distorted in any way.
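In code, this bucket index is just a shift over the key (K stands for the parameter k above; a tiny illustrative sketch):

  #include <cstdint>

  constexpr unsigned K = 4;                 // 2^K buckets, e.g. 16
  // Bucket index = upper K bits of a 32-bit key; with uniformly distributed
  // keys, the buckets receive roughly equal shares.
  inline uint32_t bucket_of(uint32_t key) {
      return key >> (32 - K);
  }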
Divide And Conquer. One possible solution is based on the divide and conquer programming paradigm. It requires that the hashing function can be incrementally refined. In our case, we can simply use more bits of the key for finer-grained bucketing. The algorithm is similar to QuickSort [18]. In the first pass, it takes the array representing the key set and reorders the keys so that those with the
uppermost bit equal to 0 are on the left and keys starting with 1 are on the right. Then both parts of the array are processed recursively using the two uppermost bits, etc. The recursion stops at a predefined depth or when the bucket size decreases below a predefined threshold. This approach has several advantages.
• The algorithm does not require any additional memory. We can perform the splitting phases in place, as QuickSort does.
• Recursive calls can be processed by different threads, since they work on disjoint ranges of the array.
• We can use a different granularity for different buckets, thus achieving a more balanced workload even if the keys are not distributed homogeneously.
• Each splitting phase accesses memory sequentially, so it benefits from caches.
After partitioning both sets into buckets, these buckets are processed concurrently. A simple hash join algorithm (using a bitmap as the hash table) is used locally. Thanks to the partitioning, we require only 2^(32−k) bits for the bitmap.
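A minimal sequential sketch of the divide-and-conquer splitting in C++ (std::partition plays the role of the quicksort-like pass; in the parallel version, the two recursive calls would be handed to different threads):

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Recursively partition keys[lo, hi) by bit 'bit' (31 = uppermost): keys with
  // the bit cleared end up on the left, keys with the bit set on the right.
  // The two recursive calls touch disjoint ranges, so they could run in
  // different threads without synchronization.
  void split_buckets(std::vector<uint32_t>& keys, std::size_t lo, std::size_t hi,
                     int bit, int maxDepth) {
      if (maxDepth == 0 || bit < 0 || hi - lo < 2) return;
      std::size_t mid = std::partition(keys.begin() + lo, keys.begin() + hi,
                                       [bit](uint32_t k) { return ((k >> bit) & 1u) == 0; })
                        - keys.begin();
      split_buckets(keys, lo, mid, bit - 1, maxDepth - 1);
      split_buckets(keys, mid, hi, bit - 1, maxDepth - 1);
  }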
Two-Pass Bucketing. The previous approach is quite effective, but it has two flaws. It takes a while before the recursion reaches a depth sufficient to fully exploit parallelism, and we need to process each item O(k) times. We can achieve bucketing with a one-pass algorithm if we put items directly into their buckets. The problem is that if we want to run such an algorithm concurrently, access to the buckets needs to be synchronized, which creates additional overhead. We propose a lock-free adaptation, which requires two passes, but it can run concurrently on all available cores from the beginning.
We create a matrix of buckets T × b, where T is the number of available threads and b is the number of buckets, so that each thread has its own set of buckets. We also expect that T ≤ b, otherwise the following stages of the algorithm cannot fully utilize all the threads. For the sake of simplicity, we describe the algorithm for T = b. The algorithm proceeds in two phases: (1) split the keys into buckets, (2) hash all buckets.
In the first phase, each thread takes its equal share of the input set and divides it into its local buckets. When the keys have been scattered into the buckets, we prepare a
vector of b hash tables (in our case, bitmaps). Each thread takes one column of the matrix (i.e., the group of buckets containing keys with the same prefix of length k) and stores its keys in the corresponding hash table. Threads can access the hash tables without locking, since each thread works on a separate bucket group and each group has its own table. Finally, the second key set is processed concurrently. The proper bucket is determined for each key in the set and the key is looked up in its hash table. No synchronization is required, as the hash tables are used only for reading.
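The first pass of the two-pass scheme could be sketched as follows (a simplified illustration with std::thread, assuming T = b = 2^K; the second pass, in which thread t hashes column t of the matrix into its own bitmap and probes it with the second set, follows the same pattern):

  #include <cstddef>
  #include <cstdint>
  #include <thread>
  #include <vector>

  constexpr unsigned K = 4;                         // 2^K buckets; threads T = buckets b
  constexpr std::size_t B = std::size_t(1) << K;

  // Pass 1: thread t scatters its share of 'keys' into row t of the pre-sized
  // B x B bucket matrix, so no synchronization is needed.
  void scatter(const std::vector<uint32_t>& keys,
               std::vector<std::vector<std::vector<uint32_t>>>& matrix) {
      std::vector<std::thread> workers;
      for (std::size_t t = 0; t < B; ++t)
          workers.emplace_back([&keys, &matrix, t] {
              std::size_t begin = keys.size() * t / B;
              std::size_t end   = keys.size() * (t + 1) / B;
              for (std::size_t i = begin; i < end; ++i)
                  matrix[t][keys[i] >> (32 - K)].push_back(keys[i]);
          });
      for (auto& w : workers) w.join();
  }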
5 Adaptation for Many-Core Architectures
Before we start designing join algorithms for many-core GPUs, we first recall the most important facts about the architecture. These facts need to be considered carefully, since they affect performance significantly.
5.1 GPU Architecture
Kernel Execution. There are two main concerns about the GPU architecture. The first is its rather specific program execution model. Portions of code which are to be executed on the GPU are called kernels. A kernel is a function that is invoked multiple times – once for each working thread. Each thread gets the same set of calling arguments and a unique identifier from the range 0 to N − 1, where N is the number of invoked threads. The kernel uses the thread identifier to select the proper part of the parallel work. The thread identifiers may also be organized in a two- or three-dimensional space for the programmer's convenience.
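Conceptually, every thread runs the same function parameterized by its identifier; a host-side C++ emulation of the idea (purely illustrative – on the device, the runtime launches the N hardware threads instead of the loop below):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // "Kernel": each thread uses its unique id to pick its element of the work.
  void add_one_kernel(std::size_t globalId, std::vector<uint32_t>& data) {
      if (globalId < data.size())            // guard against surplus threads
          data[globalId] += 1;
  }

  // Host-side emulation of launching n threads of the kernel.
  void launch(std::vector<uint32_t>& data, std::size_t n) {
      for (std::size_t id = 0; id < n; ++id)
          add_one_kernel(id, data);
  }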
The thread management and context switching capabilities of a GPU are very advanced. Therefore, it is usually better to create a huge number of threads, even if they execute only a few instructions each. More threads are better for load balancing, and the fast context switching can be used to hide the latency of accessing global memory.
Fig. 1. Thread organization (thread groups with their local memories connected to the shared global memory)
Threads are aggregated into small bundles called groups. A group usually contains tens to hundreds of threads, which execute the kernel in SIMD (Single Instruction, Multiple Data) or virtual
SIMD fashion. Every thread executes the same instruction, but it has its own set of registers, thus working on a different portion of the data. SIMD suffers from the branching problem – different threads in the group may choose different branches of ‘if’ statements. To execute conditional code properly, all branches must be executed by all threads, and each thread masks instruction execution according to the local result of the condition. On the other hand, the SIMD approach eases synchronization within the group, and threads can collaborate through shared local memory.
Finally, we have to point out that most graphics hardware is not capable of executing different kernels simultaneously, even if they do not occupy all the processing units. The only architecture currently capable of simultaneous kernel execution is the newest NVIDIA Fermi [8].
Memory Organization. We have four different memory address spaces to
work with when programming GPUs:
• host memory,
• global memory,
• local memory,
• and private memory.
The host memory deserves extra attention, since it is not directly accessible from the processing units. Input data need to be transferred from host memory to the graphics device memory, and the results need to be transferred back to host memory when the computation terminates. Furthermore, these transfers use the PCI-Express bus, which is rather slow (in comparison with internal memory buses).
The global memory is directly accessible from the GPU cores. Input data and computed results of a kernel are stored here. The global memory has both high latency and high bandwidth. In order to access the global memory optimally, threads in one group are encouraged to use coalesced loads. A coalesced load is performed when all the threads in the group load or store a contiguous block of aligned memory, so that each thread processes one 4-byte word (see Figure 2).
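Expressed as index arithmetic, the difference between a coalesced and a strided pattern is only the addressing formula (a conceptual sketch with our own names; groupBase and threadId mimic the group offset and the local thread index):

  #include <cstddef>
  #include <cstdint>

  // Coalesced: thread i of a group reads word groupBase + i, so the group
  // touches one contiguous, aligned block and the loads are combined.
  inline uint32_t coalesced_load(const uint32_t* global, std::size_t groupBase,
                                 std::size_t threadId) {
      return global[groupBase + threadId];
  }

  // Strided: thread i reads word groupBase + i * stride; for stride > 1 the
  // group's accesses scatter over several memory transactions.
  inline uint32_t strided_load(const uint32_t* global, std::size_t groupBase,
                               std::size_t threadId, std::size_t stride) {
      return global[groupBase + threadId * stride];
  }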
Fig. 2. Coalesced load (each thread of a group accesses one 4-byte word of an aligned block in global memory)
The local memory is shared among the threads of one group. It is very small (on the order of tens of kB) but almost as fast as the GPU registers. Local memory can play the role of a program-managed cache for global memory, or the threads may store partial results here while they cooperate on a task. The memory is
separated into several (usually 16) banks. Two consecutive 4-byte words are stored in two consecutive banks (modulo the number of banks). When two threads access the same bank (unless they read the same address), the memory operations are serialized.
Finally, the private memory is private to each thread and corresponds to the processor registers. The private memory size is very limited (tens to hundreds of words), therefore it is suitable for just a few local variables.
5.2 Algorithm Design
There are three different approaches to implementing a join algorithm on the GPU. The first exploits the GPU's power for sorting. Sorting algorithms for GPUs have already been thoroughly studied [12, 13], but our preliminary results indicated that the sorting operation takes too much precious time and the join algorithm can be implemented better. The second is to implement bucketing on the GPU and create buckets small enough to fit into local memory. This is also a promising way; however, fine-grained bucketing also takes quite a large amount of time.
In our work, we examined the third possibility – applying a hash join that uses a bitmap as the hash table on the GPU. Unfortunately, the bitmap would require a contiguous memory block of 512 MB, which cannot be allocated on present GPU hardware due to some technical limitations. We have applied lightweight bucketing in order to reduce the bitmap size so that it fits into GPU memory. The algorithm is designed as follows:
for both input sets do
calculate histogram (number of items in each bucket)
perform prefix sum on the histogram (compute starting offsets)
split keys into the buckets
end for
for each bucket do
prepare empty bitmap
for each key x of the first set in the bucket do
set bit corresponding to x to 1
end for
for each key y of the second set in the bucket do
if bit corresponding to y is 1 then
include y to the result
end if
end for
end for
The histogram is computed in a highly parallel fashion using atomic increment operations. The splitting algorithm works similarly to the histogram computation, but it uses local memory as an output cache, so that the data are written to global memory in larger blocks. We have chosen a total of 16 buckets for the current hardware. This way, the size of the bitmap required for hashing is reduced to 32 MB, which is feasible.
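For clarity, the bucketing preprocessing (histogram, prefix sum, split) can be written down sequentially in C++; on the GPU, the loops over the keys are executed by many threads and the commented updates become atomic operations (names and the value of K are our own):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  constexpr unsigned K = 4;                          // 2^K = 16 buckets
  constexpr std::size_t BUCKETS = std::size_t(1) << K;

  // Reorder 'keys' so that each bucket occupies a contiguous range; 'offsets'
  // receives the starting offset of every bucket (prefix sum of the histogram).
  std::vector<uint32_t> split_to_buckets(const std::vector<uint32_t>& keys,
                                         std::vector<std::size_t>& offsets) {
      std::vector<std::size_t> histogram(BUCKETS, 0);
      for (uint32_t k : keys)                        // GPU: atomic increments
          ++histogram[k >> (32 - K)];
      offsets.assign(BUCKETS, 0);
      for (std::size_t b = 1; b < BUCKETS; ++b)      // exclusive prefix sum
          offsets[b] = offsets[b - 1] + histogram[b - 1];
      std::vector<uint32_t> out(keys.size());
      std::vector<std::size_t> cursor = offsets;     // GPU: atomic slot allocation
      for (uint32_t k : keys)
          out[cursor[k >> (32 - K)]++] = k;
      return out;
  }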
When the bitmap is being filled, a thread is created for each item x in the bucket. Bits are written using an atomic OR operation, thus all conflicts are avoided. After these threads complete, a thread is created for each key y. Threads in one group use local memory as a cache for keys that have been included in the result. When the local cache is full, it is spilled into global memory, using an atomic add operation to allocate a position in the output buffer.
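The per-bucket hashing phase follows the same logic; a sequential C++ sketch (on the GPU, each iteration of the loops is one thread, the bit writes are atomic OR operations, and the result positions are allocated with atomic adds, as described above):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  constexpr unsigned PREFIX_BITS = 4;                          // matches the 16 buckets

  // Join one bucket: mark the keys of the first set in a 2^(32-PREFIX_BITS)-bit
  // bitmap (32 MB for 16 buckets), then probe it with keys of the second set.
  void join_bucket(const std::vector<uint32_t>& first,
                   const std::vector<uint32_t>& second,
                   std::vector<uint32_t>& result) {
      std::vector<uint64_t> bits((1ULL << (32 - PREFIX_BITS)) / 64, 0);
      const uint32_t mask = (1u << (32 - PREFIX_BITS)) - 1;     // strip bucket prefix
      for (uint32_t x : first) {
          uint32_t local = x & mask;
          bits[local >> 6] |= 1ULL << (local & 63);             // GPU: atomic OR
      }
      for (uint32_t y : second) {
          uint32_t local = y & mask;
          if (bits[local >> 6] & (1ULL << (local & 63)))
              result.push_back(y);                              // GPU: atomic add allocates slot
      }
  }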
6 Experimental Results
6.1 Methodology, Hardware, and Testing Data
Each test was performed 10 times and the presented values are the averages of the measured times. All times were well within a ±10% limit of the presented value.
Scalability tests were performed on a Dell M910 server with four six-core Intel Xeon E7540 processors with hyper-threading (i.e., 48 logical cores) clocked at 2.0 GHz. The server was equipped with 128 GB of RAM organized as a 4-node NUMA. GPU tests were conducted on a common PC with a six-core Intel Core i7 870 CPU with hyper-threading clocked at 2.93 GHz. The machine was equipped with 16 GB of RAM and one NVIDIA GTX 580 GPU card with 512 CUDA cores and 1.5 GB of RAM. Red Hat Enterprise Linux (version 6) was used as the operating system on both machines.
We used three data sets for testing – pairs of 4M, 8M, and 16M key sets. Each set was generated randomly with a uniform distribution over the 32-bit key domain. We did not use larger data as they would not fit into the GPU memory.
6.2 Scalability on Multi-Core CPUs
First, we test the scalability of the multi-core CPU algorithms. The horizontal axis shows the number of active threads and the vertical axis represents time in ms. The merge series denotes the merge join algorithm, hash is the hash join with a bitmap hash table employing atomic operations for synchronization, split stands for the divide and conquer recursive bucketing algorithm, and finally bucket represents the two-pass bucketing algorithm.
Figure 3 shows several things. First, the algorithms do not scale very well beyond 8 cores. This is most likely caused by the relatively low memory bandwidth. As multiple cores share the same memory controllers and caches, they slow each other down. Second, the best algorithm is obviously the bucketing algorithm, as we expected [7]. Finally, even though the merge join algorithm does not perform very well on a single core, its scalability makes it a worthy adversary on eight and more cores.
Figures 4 and 5 show how the algorithms perform on smaller datasets. The merge join is better suited for smaller sets as it can better utilize the CPU caches and the TLB. On the other hand, the hash join performs worse on smaller data sets as it requires the initialization of a large bitmap whose size is relative to the size of the key domain, not the size of the joined sets.
Fig. 3. Join of two sets, 16M keys each (time in ms vs. number of threads: 1, 4, 8, 16, 24, 48; series: merge, hash, split, bucket)
Fig. 4. Join of two sets, 4M keys each (time in ms vs. number of threads: 1, 4, 8, 16, 24, 48; series: merge, hash, split, bucket)
Fig. 5. Join of two sets, 8M keys each (time in ms vs. number of threads: 1, 4, 8, 16, 24, 48; series: merge, hash, split, bucket)
6.3 Performance on GPU
We have compared our GPU algorithm to both the serial and parallel algorithms on the CPU. The vertical axis represents time in ms. The hash(1T) series denotes the hash join algorithm executed in a single thread, bucket(12T) is the two-pass bucketing algorithm executed on 12 cores, GPU comp. stands for the GPU algorithm (computation time only), and finally GPU total represents the GPU computation including data transfers between host and device memory. The hash(1T) and bucket(12T) were selected as the best serial and parallel representatives on the Core i7 CPU.
Fig. 6. Comparison of GPU and CPU algorithms (time in ms for 4M, 8M, and 16M keys; series: hash (1T), bucket (12T), GPU comp., GPU total)
The GPU algorithm runs more than 10× faster than the fully parallelized bucketing algorithm running on 12 cores, and still more than 5× faster if we take the time for data transfers into account. We can also observe that the transfer of data to the graphics card and back takes almost the same amount of time as the computation itself.
7 Conclusions
This paper presents a comparison of join algorithms and their modifications from the perspective of multi-core CPUs. Partitioning the joined sets into buckets improves performance significantly, and it can be easily adapted for lock-free concurrent processing. We have also presented an adaptation of the join algorithm that works well on many-core GPU hardware and achieves more than 10× speedup over the parallel CPU algorithms.
There is still much work to be done. We are going to compare our GPU algorithm with other possible implementations and perform a more thorough analysis of them. We are also planning to compare our join algorithm with standard in-memory database systems. In our future work, we will focus on exploiting many-core GPU hardware to accelerate common database operations and index searching.
References
1. NUMA: Non-Uniform Memory Architecture. http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access
2. Mishra, P., Eich, M.: Join processing in relational databases. ACM Computing
Surveys (CSUR) 24(1) (1992) 63–113
3. Liu, B., Rundensteiner, E.: Revisiting pipelined parallelism in multi-join query
processing. In: Proceedings of the 31st international conference on Very large data
bases, VLDB Endowment (2005) 829–840
4. Lu, H., Tan, K., Shan, M.: Hash-based join algorithms for multiprocessor com-
puters. In: Proceedings of the 16th International Conference on Very Large Data
Bases, Morgan Kaufmann Publishers Inc. (1990) 198–209
5. Schneider, D., DeWitt, D.: A performance evaluation of four parallel join algo-
rithms in a shared-nothing multiprocessor environment. ACM SIGMOD Record
18(2) (1989) 110–121
6. Cieslewicz, J., Berry, J., Hendrickson, B., Ross, K.: Realizing parallelism in
database operations: insights from a massively multithreaded architecture. In:
Proceedings of the 2nd international workshop on Data management on new hard-
ware, ACM (2006) 4
7. Kim, C., Kaldewey, T., Lee, V., Sedlar, E., Nguyen, A., Satish, N., Chhugani, J.,
Di Blas, A., Dubey, P.: Sort vs. Hash revisited: fast join implementation on modern
multi-core CPUs. Proceedings of the VLDB Endowment 2(2) (2009) 1378–1389
8. NVIDIA: Fermi Architecture. http://www.nvidia.com/object/fermi_architecture.html
9. NVIDIA: CUDA. http://www.nvidia.com/object/cuda_home_new.html
10. Khronos: OpenCL – The open standard for parallel programming of heterogeneous
systems. http://www.khronos.org/opencl/
11. Owens, J., Luebke, D., Govindaraju, N., Harris, M., Kruger, J., Lefohn, A., Pur-
cell, T.: A Survey of General-Purpose Computation on Graphics Hardware. In:
Computer graphics forum. Volume 26., Wiley Online Library (2007) 80–113
12. Sintorn, E., Assarsson, U.: Fast parallel GPU-sorting using a hybrid algorithm.
Journal of Parallel and Distributed Computing 68(10) (2008) 1381–1388
13. Greß, A., Zachmann, G.: GPU-ABiSort: Optimal parallel sorting on stream archi-
tectures. In: Proceedings of the 20th IEEE International Parallel and Distributed
Processing Symposium, Citeseer (2006) 45
14. Bakkum, P., Skadron, K.: Accelerating SQL database operations on a GPU with
CUDA. In: Proceedings of the 3rd Workshop on General-Purpose Computation
on Graphics Processing Units, ACM (2010) 94–103
15. He, B., Yang, K., Fang, R., Lu, M., Govindaraju, N., Luo, Q., Sander, P.: Rela-
tional joins on graphics processors. In: Proceedings of the 2008 ACM SIGMOD
international conference on Management of data, ACM (2008) 511–524
16. Page Tables. http://en.wikipedia.org/wiki/Page_table
17. Chhugani, J., Nguyen, A., Lee, V., Macy, W., Hagog, M., Chen, Y., Baransi, A.,
Kumar, S., Dubey, P.: Efficient implementation of sorting on multi-core SIMD
CPU architecture. Proceedings of the VLDB Endowment 1(2) (2008) 1313–1324
18. Hoare, C.: Quicksort. The Computer Journal 5(1) (1962) 10