=Paper= {{Paper |id=Vol-480/paper-4 |storemode=property |title=Sorting using BItonic netwoRk wIth CUDA |pdfUrl=https://ceur-ws.org/Vol-480/paper4.pdf |volume=Vol-480 |dblpUrl=https://dblp.org/rec/conf/sigir/CapanniniSBN09 }} ==Sorting using BItonic netwoRk wIth CUDA== https://ceur-ws.org/Vol-480/paper4.pdf
                      Sorting using BItonic netwoRk wIth CUDA

                                          Ranieri Baraglia                      Gabriele Capannini
                                              ISTI - CNR                               ISTI - CNR
                                           Via G. Moruzzi, 1                        Via G. Moruzzi, 1
                                           56124 Pisa, Italy                        56124 Pisa, Italy
                                      r.baraglia@isti.cnr.it                  g.capannini@isti.cnr.it
                                      Franco Maria Nardini                       Fabrizio Silvestri
                                              ISTI - CNR                               ISTI - CNR
                                           Via G. Moruzzi, 1                        Via G. Moruzzi, 1
                                           56124 Pisa, Italy                        56124 Pisa, Italy
                                        f.nardini@isti.cnr.it                   f.silvestri@isti.cnr.it

ABSTRACT
Novel “manycore” architectures, such as graphics processors, are highly parallel, high-performance shared-memory architectures [7] born to solve specific problems, such as graphical ones. Those architectures can be exploited to solve a wider range of problems by designing the related algorithms for such architectures. We present a fast sorting algorithm implementing an efficient bitonic sorting network. This algorithm is highly suitable for information retrieval applications. Sorting is a fundamental and universal problem in computer science. Even if sorting has been extensively addressed by many research works, it still remains an interesting challenge to make it faster by exploiting novel technologies. In this light, this paper shows how to use graphics processors as coprocessors to speed up sorting while allowing the CPU to perform other tasks. Our new algorithm exploits a memory-efficient data access pattern that maintains the minimum number of accesses to the memory out of the chip. We introduce an efficient instruction dispatch mechanism to improve the overall sorting performance. We also present a cache-based computational model for graphics processors. Experimental results highlight remarkable improvements over prior CPU-based sorting methods, and a significant improvement over previous GPU-based sorting algorithms.

Copyright © 2009 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. Re-publication of material from this volume requires permission by the copyright owners. This volume is published by its editors.
LSDS-IR Workshop. July 2009. Boston, USA.

1. INTRODUCTION
   Every day people use Web Search Engines as a tool for accessing information, sites, and services on the Web. Information retrieval has to face the issues posed by the growing amount of information on the web, as well as by the growing number of new users. Creating a Web Search Engine which scales even to today’s Web contents presents many challenges. A fast crawling technology is needed to gather web documents and keep them up to date. Storage space must be used efficiently to store indexes and documents. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second. All these services run on clusters of homogeneous PCs. The choice of the PCs in these clusters depends upon price, CPU speed, memory and disk size, heat output, and physical size [3]. Nowadays these characteristics can also be found in other commodity hardware originally designed for specific graphics computations. Many special processor architectures have been proposed to exploit data parallelism for data-intensive computations, and graphics processors (GPUs) are one of those. For example, the scientific community uses GPUs for general-purpose computation. The results obtained, in terms of computational latency, outperform those achievable on classical processors. Unfortunately, such programs must rely on APIs to access the hardware, for example OpenGL or DirectX. These APIs are over-specified, forcing the programmer to manipulate data that is not directly relevant, and their drivers make critical policy decisions, such as deciding where data resides in memory and when it is copied.
   In the last years, due to the growing trend of the media market, the demand for rendering algorithms has been rapidly evolving. For the companies producing hardware, this means designing again and again new hardware able both to run the novel algorithms and to provide a higher rate of computations per second. Such processors require significant design effort and are thus difficult to change as applications and algorithms evolve. The request for flexibility in media processing motivates the use of programmable processors, and the existing need for non-graphical APIs pushed the same companies into creating new abstractions designed to last.
   Finally, according to what Moore’s law foresees, transistor density doubles every two years. Furthermore, mainly due to power-related issues, new generation processors (such as traditional CPUs) tend to incorporate an ever-increasing number of processors (also called cores) on the same chip [9]. The result is that nowadays the market proposes low-cost commodity hardware that is able to execute heavy loads of computations. In order to enable developers to leverage the power of such architectures, vendors usually make ad-hoc programming tools available.
   For the reasons listed so far, these architectures are ideal
candidates for the efficient implementation of those components of a Large-Scale Search Engine that are eager for computational power.
   This paper focuses on using a GPU as a co-processor for sorting. Sorting is a core problem in computer science that has been extensively researched over the last five decades and still remains a bottleneck in many applications involving large volumes of data. One could argue why efficient sorting is related to LSDS-IR. First of all, sorting is a basic operation for indexing. We will show in Section 3 how many indexing algorithms are basically a sorting operation over integer sequences. Large-scale indexing, thus, requires scalable sorting. Second, the technique we are introducing here is of crucial importance for Distributed Systems for IR, since it is designed to run on GPUs, which are considered by many as a basic building block for future generation data-centers [4]. Our bitonic sorting network can be seen as a viable alternative for sorting large quantities of data on graphics processors. In the last years general-purpose processors have been specialized by adopting mechanisms that make their operation more flexible. Such facilities (i.e. more levels of caches, the out-of-order execution paradigm, and branch prediction techniques) bring the theoretical performance of CPUs closer to the actually achievable one. On the other side, specialized processors, like GPUs, expose lower flexibility at design phase, but are able to reach higher computational power by providing more computational cores than the other classes of processors. We map a bitonic sorting network on the GPU exploiting its high-bandwidth memory interface. Our novel data partitioning improves GPU cache efficiency and minimizes data transfers between on-chip and off-chip memories.
   This paper is organized as follows. Section 2 discusses related works, while Section 3 introduces some relevant characteristics about the applicability of GPU-based sorting in Web Search Engines. Section 4 presents some issues arising from the stream programming model and the single-instruction multiple-data (SIMD) architecture. The central part is devoted to exposing our solution, and the computational model used to formalize Graphics Processing Units. Section 6 presents some results obtained in a preliminary evaluation phase. Section 7 discusses how to evolve and improve this research activity.

2. RELATED WORK
  Since most sorting algorithms are memory bandwidth bound, it is no surprise that there is currently a big interest in sorting on the high-bandwidth GPUs.
  Purcell et al. [24] presented an implementation of bitonic merge sort on GPUs based on an implementation by Kapasi et al. [17]. The authors used that approach to sort photons into a spatial data structure providing an efficient search mechanism for GPU-based photon mapping. Comparator stages were entirely realized in the fragment units1, including arithmetic, logical and texture operations. The authors reported their implementation to be compute-bound rather than bandwidth-bound, and they achieved a throughput far below the theoretical optimum of the target architecture.
1
 In addition to computational functionality, fragment units also provide an efficient memory interface to server-side data, i.e. texture maps and frame buffer objects.
  Kipfer et al. [19, 20] showed an improved version of the bitonic sort as well as an odd-even merge sort. They presented an improved bitonic sort routine that achieves a performance gain by minimizing both the number of instructions to be executed in the fragment program and the number of texture operations.
  Zachmann et al. [14] presented a novel approach for parallel sorting on stream processing architectures based on an adaptive bitonic sorting [6]. They presented an implementation based on modern programmable graphics hardware showing that their approach is competitive with common sequential sorting algorithms not only from a theoretical viewpoint, but also from a practical one. Good results are achieved by using efficient linear stream memory accesses and by combining the optimal time approach with efficient algorithms.
  Govindaraju et al. [13] implemented sorting as the main computational component for histogram approximation. This solution is based on the periodic balanced sorting network method by Dowd et al. [10]. In order to achieve high computational performance on the GPUs, they used a sorting-network-based algorithm in which each stage is computed using rasterization. They later presented a hybrid bitonic-radix sort that is able to sort vast quantities of data [12], called GPUTeraSort. This algorithm was proposed to sort records contained in databases using a GPU. This approach uses the data and task parallelism of the GPU to perform memory-intensive and compute-intensive tasks, while the CPU is used to perform I/O and resource management.
  Sengupta et al. [25] presented a radix sort and a Quicksort implementation based on segmented scan primitives. The authors presented new approaches to implementing several classic applications using these primitives, and showed that they are an excellent match for a broad set of problems on parallel hardware.
  Recently, Sintorn et al. [28] presented a sorting algorithm that combines bucket sort with merge sort. In addition, the authors show that this new GPU sorting method sorts in O(n log n) time.
  Cederman et al. [8] showed that their GPU-Quicksort is a viable sorting alternative. The algorithm recursively partitions the sequence to be sorted with respect to a pivot. This is done in parallel by each GPU thread until the entire sequence has been sorted. In each partition iteration a new pivot value is picked and, as a result, two new subsequences are created that can be sorted independently, and each thread block can be assigned one of them. Finally, experiments pointed out that GPU-Quicksort can outperform other GPU-based sorting algorithms.

3. APPLICATIONS TO INDEXING
   A modern search engine must scale even with the growth of today’s Web contents. Large-scale and distributed applications in Information Retrieval such as crawling, indexing, and query processing can exploit the computational power of new GPU architectures to keep up with this exponential growth.
   We consider here one of the core components of a large-scale search engine: the indexer. In the indexing phase, each crawled document is converted into a set of word occurrences called hits. For each word, the hits record the frequency, the position in the document, and some other information. Indexing, then, can be considered as a “sort” operation on a set of records representing term occurrences [2]. Records represent distinct occurrences of each term in each distinct document. Sorting these records efficiently, using a good balance of memory and disk usage, is a very challenging operation.
   In the last years it has been shown that sort-based approaches [29], or single-pass algorithms [21], are efficient in several scenarios where indexing of a large amount of data is performed with limited resources.
   A sort-based approach first makes a pass through the collection assembling all term-docID pairs. Then it sorts the pairs with the term as the dominant key and the docID as the secondary key. Finally, it organizes the docIDs for each term into a postings list (it also computes statistics like term and document frequency). For small collections, all this can be done in memory.
   When memory is not sufficient, we need to use an external sorting algorithm [22]. The main requirement of such an algorithm is that it minimizes the number of random disk seeks during sorting. One solution is the Blocked Sort-Based Indexing algorithm (BSBI). BSBI segments the collection into parts of equal size, then sorts the termID-docID pairs of each part in memory, and finally stores the intermediate sorted results on disk. When all the segments are sorted, it merges all the intermediate results into the final index.
   A more scalable alternative is Single-Pass In-Memory Indexing (SPIMI). SPIMI uses terms instead of termIDs, writes each block’s dictionary to disk, and then starts a new dictionary for the next block. SPIMI can index collections of any size as long as there is enough disk space available. The algorithm parses documents and turns them into a stream of term-docID pairs, called tokens. Tokens are then processed one by one. For each token, SPIMI adds a posting directly to its postings list. Instead of first collecting all termID-docID pairs and then sorting them (as BSBI does), each postings list is dynamic. This means that its size is adjusted as it grows. This has two advantages: it is faster because no sorting is required, and it saves memory because it keeps track of the term a postings list belongs to, so the termIDs of postings need not be stored.
   When memory is exhausted, SPIMI writes the index of the block (which consists of the dictionary and the postings lists) to disk. Before doing this, SPIMI sorts the terms to facilitate the final merging step: if each block’s postings lists were written in unsorted order, merging blocks could not be accomplished by a simple linear scan through each block. The last step of SPIMI is then to merge the blocks into the final inverted index.
   SPIMI, whose time complexity is lower because no sorting of tokens is required, is usually preferred to BSBI, which has a higher time complexity.
   In both of the methods presented for indexing, sorting is involved: BSBI sorts the termID-docID pairs of all parts in memory, while SPIMI sorts the terms to facilitate the final merging step [22].
   In order to efficiently evaluate these two approaches on a heterogeneous cluster we have to compare “standard” SPIMI performance with the performance of a BSBI-based sorter implemented by us. Moreover, to fully analyze the indexing phase, we need a GPU-based string sorter able to outperform CPUs as well as our sorter for integers does. In this way we have the possibility to compare both solutions on all architectures, and then to choose the best combination. Having all possible implementations available, a flexible execution of indexing running on various hardware can be imagined. This option is even more important if the allocation of the tasks is scheduled dynamically, as can be done depending on the workload of the single resources.
   The state of the art in string sorting lacks solutions for GPU architectures: nowadays we are not aware of solutions for parallel SIMD processors. In the literature, this problem is efficiently solved by using different approaches. The most interesting and suitable for us seems to be Burstsort [27]. It is a technique that uses the burst trie [15] to distribute string items into small buckets whose contents are then sorted with standard (string) sorting algorithms. Subsequently, Sinha et al. [26] introduced improvements that reduce by a significant margin the memory requirements of Burstsort. This aspect is even more relevant for GPU architectures, which have small-sized on-chip memories.

4. SORTING WITH GPUS
   This section gives a brief overview of GPUs, highlighting the features that make them useful for sorting. GPUs are designed to execute geometric transformations that generate a data stream of display pixels. A data stream is processed by a program running on multiple SIMD processors, which are capable of data-intensive computations. The output, then, is written back to the memory.

4.1 SIMD Architecture
   SIMD machines are classified as processor-array machines: a SIMD machine basically consists of an array of computational units connected together by a simple network topology [23]. This processor array is connected to a control processor, which is responsible for fetching and interpreting instructions. The control processor issues arithmetic and data processing instructions to the processor array, and handles any control flow or serial computation that cannot be parallelized. Processing elements can be individually disabled for conditional execution: this option gives more flexibility during the design of an algorithm.
   Although SIMD machines are very effective for certain classes of problems, the architecture is specifically tailored for data-computation-intensive work, and it turns out to be quite “inflexible” on some classes of problems2.
2
 For example, these architectures cannot efficiently run control-flow-dominated code.

4.2 Stream Programming Model
   A stream program [18] organizes data as streams and expresses all computation as kernels. A stream is a sequence of similar data elements defined by a regular access pattern. A kernel typically loops through all the input stream elements, performing a sequence of operations on each element and appending the results to an output stream. These operations exhibit a high degree of instruction-level parallelism. Moreover, these operations cannot access arbitrary memory locations: they keep all the intermediate values locally, inside the kernel. Since each element of the input stream can be processed simultaneously, kernels also expose large amounts of data-level parallelism. Furthermore, stream memory transfers can be executed concurrently with kernels, thereby exposing task-level parallelism in the stream program. Some other important characteristics common to all stream-processing applications are: (i) elements are read once from memory, (ii) elements are not visited twice, and
(iii) the application requires a high number of arithmetic operations per memory reference, i.e. it is computationally intensive.

4.3 Nvidia’s CUDA
   CUDA [1], an acronym for Compute Unified Device Architecture, is defined as an architecture built around a scalable array of multithreaded streaming multiprocessors (MPs). Each MP is defined as a unit comprising one instruction unit, a collection of eight single-precision pipelines also called cores, a functional unit for special arithmetical operations, and a 16 KB local store also called shared memory. In practice, this means that each MP is a SIMD processor, whose cores form an arithmetic pipeline that executes scalar instructions. All MPs create, manage, and execute concurrent threads in hardware with zero scheduling overhead, and implement a barrier for thread synchronization. Nevertheless, only threads concurrently running on the same MP can be synchronized.
   The CUDA model also introduces the entity warp as a group of 32 parallel scalar threads, and reports that each warp executes one common instruction at a time. This is another way of saying that a warp is a stream of vector instructions: scalar threads are then vector elements. But, unlike other SIMD instruction sets, such as Intel’s SSE, a particular value of the vector length is not specified. A thread block is defined as a collection of warps that run on the same MP and share a partition of the local store. The number of warps in a thread block is configurable.

4.3.1    CUDA SDK
   The SDK provided by Nvidia for its GPUs consists of a large collection of code examples, compilers and run-time libraries. Clearly the CUDA model is “restricted” to Nvidia products, mainly for efficiency reasons, and it conforms to the stream programming model. Threads and thread blocks can be created only by invoking a parallel kernel, not from within a parallel kernel. Task parallelism can be expressed at the thread-block level, but block-wide barriers are not well suited for supporting task parallelism among threads in a block. To enable CUDA programs to run on any number of processors, communication between different blocks of threads is not allowed, so blocks must execute independently. Since CUDA requires that thread blocks be independent and allows blocks to be executed in any order, combining results generated by multiple blocks must in general be done by launching a second kernel. However, multiple thread blocks can coordinate their work using atomic operations on the external (off-chip) memory.
   Recursive function calls are not allowed in CUDA kernels. Recursion is, in fact, unattractive in a massively parallel kernel: providing stack space for all the active threads would require substantial amounts of memory. To support a heterogeneous system architecture combining a CPU and a GPU, each with its own memory system, CUDA programs must copy data and results between host memory and device memory. The overhead of CPU/GPU interaction and data transfers is minimized by using DMA block-transfer engines and fast interconnects.

4.4 Cache-oblivious algorithms
   The cost of communication can be up to an order of magnitude larger than the cost of the pure computation on such architectures. Our idea is to model the proposed solution as a cache-oblivious algorithm. The model underlying this type of algorithm is not directly applicable to the GPU’s parallel architecture, which is equipped with local memory instead of a cache. Adopting the local-memory approach, the programmer has to bear the effort of synchronizing, sizing, and scheduling the computation of data and its movement between the off-chip memory and the on-chip one. On the other hand, in cache-based architectures this aspect is automatically managed by the underlying support. A local-memory approach permits moving data located at different addresses so as to compose a specific access pattern. This capability is impossible to realize with caches, where the hardware hides this operation by automatically replacing missed cache lines.
   Frigo et al. [11] present cache-oblivious algorithms that perform both an asymptotically optimal amount of work and an asymptotically optimal number of transfers among multiple levels of cache. An algorithm is cache-oblivious if no program variable dependent on hardware configuration parameters, such as cache size and cache-line length, needs to be tuned to minimize the number of cache misses. The authors introduce the (Z, L) ideal-cache model to study the cache complexity of algorithms. This model describes a computer with a two-level memory hierarchy consisting of an ideal data cache of Z words and an arbitrarily large main memory. The cache is partitioned into cache lines, each consisting of L consecutive words that are always moved together between cache and main memory.
   The processor can only reference words that reside in the cache. If the referenced word belongs to a line already in cache, a cache hit occurs, and the word is delivered to the processor. Otherwise, a cache miss occurs, and the line is fetched into the cache. If the cache is full, a cache line must be evicted. An algorithm with an input of size n is measured in the ideal-cache model in terms of its work complexity W(n) and its cache complexity Q(n, Z, L), that is, the number of cache misses it incurs as a function of the size Z and line length L of the ideal cache.
   The metrics used to measure cache-oblivious algorithms need to be reconsidered in order to be used with GPUs, which are parallel architectures. To do that, W() has to be defined taking care of the level of parallelism exposed by GPUs. Evaluating Q(), we must translate the concepts of Z and L, which refer to cache characteristics. More precisely, a GPU provides p MPs, each one with a local memory. We can abstract such an architectural organization by considering each local memory as one cache line, and the union of all local memories as the entire cache, taking no care of which processor is using the data. From this point of view, if the shared memory of each MP is 16 KB, we obtain L = 16 KB and Z = 16 · p KB.

5.   BITONIC SORT
   To design our sorting algorithm in the stream programming model, we started from the popular Bitonic Sort (BS) network and extended it to adapt it to our specific architecture. Specifically, BS is one of the fastest sorting networks [5]. A sorting network is a special kind of sorting algorithm, where the sequence of comparisons does not depend on the order with which the data is presented; see Figure 1. This makes sorting networks suitable for implementation on GPUs. In particular, the regularity of the schema used to compare the elements to sort makes this kind of sorting network particularly suitable for partitioning the elements in
the stream programming fashion, as GPUs require. Finally, we compared the theoretical results with the ones resulting from the tests, in order to assess whether the adapted ideal-cache model is useful to abstract GPUs.

Figure 1: Bitonic sorting networks for 16 elements. Each step is completed when all the comparisons involved are computed. In the figure each comparison is represented by a vertical line that links two elements, which are represented by horizontal lines.

   The main issue to address is to define an efficient schema to map all the comparisons involved in the BS onto the elements composing the invoked streams.
   The first constraint is that the elements composing each stream must be “distinct”. This means that each item in the input has to be included in exactly one element of the stream. From this point of view, each stream of elements defines a partition of the input array to be sorted. This characteristic is necessary due to the runtime support, because it does not permit any type of data exchange, or synchronization, between two elements (Figure 2). Any such exchange would have to pass through the off-chip memory. In the ideal-cache computational model, it corresponds to a cache-miss event, which wastes the performance of the algorithm.
   Summing up, performing several network steps in the same kernel has the double effect of reducing the number of cache misses, i.e. improving the Q() metric, and of increasing the number of arithmetic operations per memory reference. The unique constraint is that the computation of an element has to be independent of the computation of any other element in the same stream.

Figure 3: Increasing the number of steps covered by the partition, the number of items included doubles. A, B and C are partitions for local memories of 2, 4 and 8 locations, respectively.

   Let us introduce our solution. First of all, we need to establish the number of consecutive steps to be executed by one kernel. We must consider that, for each step assigned to a kernel, in order to keep all the stream elements independent, the number of memory locations needed by the relative partition doubles; see Figure 3. So, the number of steps a kernel can cover is bounded by the number of items that it is possible to include in a stream element. Furthermore, the number of items is bounded by the size of the shared memory available for each SIMD processor.

                                                                  Algorithm 1 Bitonic Sort algorithm.
                                                                   a ← array to sort
                                                                   for s = 1 to log2 |a| do
                        element                                      for c = s − 1 down to 0 do
        kernel stream                                                   for r = 0 to |a| − 1 do
                                                                          if 2rc ≡ 2rs (mod 2) ∧ a[r] > a[r ⊕ 2c ] then
Figure 2: Example of a kernel stream comprising                              a[r] ⇆ a[r ⊕ 2c ]
more sorting network steps. The subset of items                           end if
composing each element must perform comparison                          end for
only inside itself.                                                  end for
                                                                   end for
   The second aspect to optimize is the partitioning of the
steps composing the sorting network. Since a BS is com-              More precisely, to know how many steps can be included
posed by several steps (Figure 1), we have to map the execu-      in one partition, we have to count how many distinct values c
tion of all steps into a sequence of independent runs, each of    assumes, see Algorithm 1. Due to the fixed size of memory
them corresponding to the invocation of a kernel. Since each      locations, i.e. 16 KB, we can specifies partition of SH =
kernel invocation implies a communication phase, such map-        4 K items, for items of 32 bits. Moreover such partition is
ping should be done in order to reduce the number of these        able to cover “at least” sh = log(SH) = 12 steps. From
invocations, thus the communication overhead. Specifically,       this evaluation it is also possible to estimate the size of the
this overhead is generated whenever the SIMD processor be-        kernel stream: if a partition representing an element of the
gins the execution of a new stream element. In that case, the     stream contains SH items, and the array a to sort contains
processor needs to flush the results contained in the proper      N = 2n items, then the stream contains b = N/SH = 2n−sh
shared memory, then to fetch the new data from the off-chip       elements.
An important consideration must be made for the first kernel invoked by the algorithm: as long as the variable s in Algorithm 1 is not greater than sh, the computation of the first several steps can be done with this first kernel. This is because the values assumed by c remain in a range of sh distinct values. More precisely, the first kernel computes the first sh·(sh+1)/2 steps (Figure 3).

This access pattern schema can be translated into a function ℓ_c(i) that, for the current kernel stream and the current value of c, defines the subset of a to be assigned to the i-th stream element. In other words, ℓ describes a method to enumerate the set of indexes of a on which the i-th element of the kernel stream has to operate. Formally, it is defined as:

    ℓ_c : i ∈ [0, b − 1] → Π ⊆ π_sh(a)

where π_sh(a) is a partition of a, namely a set of nonempty subsets of a such that every element of a is in exactly one of these subsets, and each subset contains exactly 2^sh elements of a.

Let us assume n = 32 and a shared memory that can contain 4 K items, so we have |a| = 2^32 and b = 2^(n−sh) = 2^(32−12) = 2^20 elements for each stream. Basically, we need 32 bits to address an element of a, and log(b) = 20 bits to identify the i-th partition among the b existing ones. The value i is used to build a bit-mask that is equal for each address produced by ℓ_c(i). Such a mask sets log(b) bits of the 32 bits composing an index of a. The missing sh bits are generated by using a variable x to enumerate all the values in the range X = [0, 2^sh − 1] and by inserting each of them at the c-th position of the mask i. This composition leads to a set of addresses of n bits whose corresponding items compose the i-th partition. Formally, we obtain:

    ℓ_c(i) = {x ∈ X : i_[31...c] ◦ x ◦ i_[c+1...0]}

where the notation i_[31...c] identifies the leftmost bits, namely from the 31st bit down to the c-th one, and "◦" is the concatenation operator.

The rule to compose the elements of ℓ_c(i) is simple, but in some cases it leads to an exception. When c < sh, x is divided into two parts, x_L and x_R, which are inserted at the c-th position and at the c′-th position of i, respectively. In particular, the condition c < sh occurs whenever the middle loop of Algorithm 1 ends and c starts a new loop with the new value of s, denoted with c′. Specifically, depending on the current value of c, the algorithm needs to make two insertions: to add x_R at position c, and to add x_L at position c′. Formally, when c < sh, we have to define a second function ℓ_{c,c′}() as follows:

    ℓ_{c,c′}(i) = {x ∈ X : i_[31...c̄] ◦ x_L ◦ i_[c′+sh−c...c] ◦ x_R}

where x_L = x_[sh−1...c], x_R = x_[c−1...0], and c′ = s + 1.

5.1 Evaluation

For our solution we obtain that W(n, p), where p indicates the number of MPs and n the size of a, is equal to the computational cost of the sequential BS divided by p, specifically W(n, p) = O(n/p · log^2 n). To know Q(n, Z, L) we must estimate the number of cache misses. Assuming |a| = n, we obtain that the sorting network is made of σ = (log^2(n) + log(n))/2 steps. Furthermore, let us assume L = SH, Z = SH · p, and that each kernel covers sh steps, except the first kernel, which covers σ′ = (sh^2 + sh)/2 steps. Then the number of cache misses is ⌈(σ − σ′)/sh⌉.

The last consideration regards the W(n, p) measure, which should be estimated considering that each MP is a SIMD processor. In fact, each MP reaches its maximum performance whenever the data flow permits the control unit to issue the same instruction for all the cores. In this way such an instruction is executed in parallel on different data in a SIMD fashion.

In order to compare our bitonic sort with the quick sort proposed by Cederman et al., we tried to extend the analysis of the ideal-cache model metrics to their solution. Unfortunately, their solution cannot be measured as accurately as ours. In particular, it is possible to estimate the number of transfers among multiple levels of cache, but quick sort uses off-chip memory also to implement the prefix-sum for each stream element run. In particular, quick sort splits the input array in two parts with respect to a pivot by invoking a procedure on the GPU, and recursively repeats this operation until each part can be entirely contained in the shared memory. In the optimistic case, assuming |a| = n and a shared memory of size L = SH, this operation is invoked log(n) − log(L) times, which is also the number of cache misses for quick sort, namely Q(). This value is sensibly lower than the Q() measured for bitonic sort, but the evaluation of the number of such cache misses should be proportionally augmented due to the prefix-sum procedure. Regarding W(), the authors report a computational complexity equal to the one obtained for the sequential case, i.e. O(n · log(n)), without referring to the parallelism of the underlying hardware. However, an optimistic evaluation of such a parallel version should be, also in this case, lower than the W() computed for bitonic sort.

6. RESULTS

The experiments have been conducted on an Ubuntu Linux desktop with an Nvidia 8800GT, namely a GPU provided with 14 SIMD processors, and CUDA SDK 2.1. To generate the arrays for these preliminary tests we used uniform, gaussian and zipfian distributions. Specifically, for each distribution we generated 20 different arrays. Figure 4 shows the mean, the standard deviation, the maximum and the minimum of the elapsed times for all the runs of each distribution. The tests involved the CPU-based quick sort provided with the C standard library, our solution, and the one proposed by Cederman et al. [8]. In that paper, the GPU-based quick sort resulted the most performing algorithm in the literature, so our preliminary tests take into consideration only such a GPU-based algorithm.

Figure 4 shows that the GPU-based solutions are able to outperform the CPU's performance on the sorting problem. The same figure also shows that our solution is not competitive with respect to the one of Cederman until the number of items to sort reaches 8M. One more consideration is that the GPU-based quick sort is not able to successfully conclude the tests for arrays larger than 16M. Further analysis pointed out that the bitonic algorithm spends the main part of the elapsed time in the data-transfer phase of some specific kernel instances. Since the element streams were always of the same size, we deduced that the number of transfers is not the only aspect to take into consideration to minimize the communication overhead, as the metrics of the ideal-cache model suggest.
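The gap between the two theoretical estimates can be made concrete with a few lines of Python. This is a sketch based on the formulas of Section 5.1; the function names are ours, sh = 12 as above, and the quick-sort count is the optimistic one that ignores the prefix-sum traffic:

```python
from math import ceil, log2

SH_BITS = 12  # sh = log(SH): SH = 4 K items of 32 bits in 16 KB of shared memory

def bitonic_misses(n, sh=SH_BITS):
    """Cache misses ⌈(σ − σ′)/sh⌉, with σ = (log²n + log n)/2
    total steps and σ′ = (sh² + sh)/2 steps covered by the first kernel."""
    sigma = (log2(n) ** 2 + log2(n)) / 2
    sigma_first = (sh ** 2 + sh) / 2
    return ceil((sigma - sigma_first) / sh)

def quicksort_misses(n, sh=SH_BITS):
    """Optimistic count log(n) − log(L) for the GPU quick sort, with L = SH."""
    return int(log2(n)) - sh

for m in (20, 23, 26):  # 1M, 8M and 64M items
    n = 1 << m
    print(f"2^{m} items: bitonic {bitonic_misses(n)}, quick sort {quicksort_misses(n)}")
```

Over the input sizes used in the tests, the bitonic estimate stays roughly twice the quick-sort one, which is the theoretical disadvantage discussed above; the measured runtimes nevertheless favor bitonic sort on the larger inputs.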
[Figure 4 appears here: three plots, for a) gaussian, b) uniform and c) zipfian distributions, showing the elapsed time (sec/1000) of bitonic sort, gpu quick sort and cpu quick sort for input sizes from 1M to 64M.]

Figure 4: Squared white areas represent the variance obtained over several runs for each size of the problem. Vertical lines point out the maximum and the minimum values obtained.

As suggested by Helman et al. [16], a deeper evaluation of the algorithm can be conducted by using arrays generated with different distributions. This type of test puts in evidence the behavior of the algorithms with regard to the variance of the times obtained in different contexts. For the distributions tested, the bitonic sort algorithm has a better behavior with respect to variance. Obviously, this result is caused by the type of algorithm used: quick sort is a data-dependent approach, whereas sorting networks are based on a fixed schema of comparisons that does not vary with the data distribution.

Considerations on the results obtained from these preliminary tests suggest that the ideal-cache model does not seem sufficiently accurate to abstract the GPU architecture. Although the theoretical results indicate better performance for the GPU-based quick sort, the tests conducted show that bitonic sort has a better performance trend as the size of the problem increases. This consideration is reinforced by the analysis of the data transfer: we strongly believe that by improving the data-transfer bandwidth, bitonic sort can reach better results without increasing the theoretical W() and Q() metrics.

7. FUTURE WORKS

Preliminary results show that the number of transfers is not the only aspect to take into consideration for minimizing the communication overhead. Another important factor is the transfer bandwidth, which is relevant to achieving better results. The results show that the ideal-cache model is not able to fully describe and capture all the aspects determining the performance of such kind of architectures. Probably different kinds of performance metrics are needed to evaluate algorithms on these novel hardware resources.

Furthermore, since indexing is a tuple-sorting operation, we will extend our solution to include the sorting of tuples of integers. In this paper, in fact, we assume the tuples are sorted by using multiple passes over the dataset. We reserve to future work the extension to tuples.

Of course, we have to run more tests to reinforce the results obtained and to analyze more in depth the causes of the waste of performance that affects our algorithm.

8. ACKNOWLEDGMENTS

This research has been supported by the Action IC0805: Open European Network for High Performance Computing on Complex Environments.

9. REFERENCES
 [1] NVIDIA CUDA Compute Unified Device Architecture - Programming Guide, 2007.
 [2] R. Baeza-Yates, C. Castillo, F. Junqueira, V. Plachouras, and F. Silvestri. Challenges on distributed web retrieval. In ICDE. IEEE, 2007.
 [3] L. A. Barroso, J. Dean, and U. Hölzle. Web search for a planet: The Google cluster architecture. IEEE Micro, 23(2):22–28, 2003.
 [4] L. A. Barroso and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool, San Rafael, CA, USA, 2009.
 [5] K. E. Batcher. Sorting networks and their applications. In AFIPS '68 (Spring): Proceedings of the April 30–May 2, 1968, Spring Joint Computer Conference, pages 307–314, New York, NY, USA, 1968. ACM.
 [6] G. Bilardi and A. Nicolau. Adaptive bitonic sorting: An optimal parallel algorithm for shared memory machines. Technical report, Ithaca, NY, USA, 1986.
 [7] J. Bovay, B. H. Brent, H. Lin, and K. Wadleigh. Accelerators for high performance computing
investigation. White paper, High Performance Computing Division - Hewlett-Packard Company, 2007.
 [8] D. Cederman and P. Tsigas. A practical quicksort algorithm for graphics processors. In ESA '08: Proceedings of the 16th Annual European Symposium on Algorithms, pages 246–258, Berlin, Heidelberg, 2008. Springer-Verlag.
 [9] J. D. Davis, J. Laudon, and K. Olukotun. Maximizing CMP throughput with mediocre cores. In PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, pages 51–62, Washington, DC, USA, 2005. IEEE Computer Society.
[10] M. Dowd, Y. Perl, L. Rudolph, and M. Saks. The periodic balanced sorting network. J. ACM, 36(4):738–757, 1989.
[11] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, Washington, DC, USA, 1999. IEEE Computer Society.
[12] N. K. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High performance graphics coprocessor sorting for large database management. In ACM SIGMOD International Conference on Management of Data, Chicago, United States, June 2006.
[13] N. K. Govindaraju, N. Raghuvanshi, and D. Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 611–622, New York, NY, USA, 2005. ACM.
[14] A. Greß and G. Zachmann. GPU-ABiSort: Optimal parallel sorting on stream architectures. In The 20th IEEE International Parallel and Distributed Processing Symposium (IPDPS '06), page 45, Apr. 2006.
[15] S. Heinz, J. Zobel, and H. E. Williams. Burst tries: A fast, efficient data structure for string keys. ACM Transactions on Information Systems, 20(2):192–223, 2002.
[16] D. R. Helman, D. A. Bader, and J. JáJá. A randomized parallel sorting algorithm with an experimental study. Technical report, Journal of Parallel and Distributed Computing, 1995.
[17] U. J. Kapasi, W. J. Dally, S. Rixner, P. R. Mattson, J. D. Owens, and B. Khailany. Efficient conditional operations for data-parallel architectures. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 159–170. ACM Press, 2000.
[18] B. Khailany, W. J. Dally, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, A. Chang, and S. Rixner. Imagine: Media processing with streams. IEEE Micro, 21(2):35–46, 2001.
[19] P. Kipfer, M. Segal, and R. Westermann. UberFlow: A GPU-based particle engine. In HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 115–122, New York, NY, USA, 2004. ACM Press.
[20] P. Kipfer and R. Westermann. Improved GPU sorting. In M. Pharr, editor, GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, pages 733–746. Addison-Wesley, 2005.
[21] N. Lester. Fast on-line index construction by geometric partitioning. In Proceedings of the 14th ACM Conference on Information and Knowledge Management (CIKM 2005), pages 776–783. ACM Press, 2005.
[22] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[23] M. A. Nichols, H. J. Siegel, H. G. Dietz, R. W. Quong, and W. G. Nation. Eliminating memory for fragmentation within partitionable SIMD/SPMD machines. IEEE Trans. Parallel Distrib. Syst., 2(3):290–303, 1991.
[24] T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. In Proceedings of the ACM SIGGRAPH Conference on Graphics Hardware, pages 41–50. Eurographics Association, 2003.
[25] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In GH '07: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pages 97–106, Aire-la-Ville, Switzerland, 2007. Eurographics Association.
[26] R. Sinha and A. Wirth. Engineering burstsort: Towards fast in-place string sorting. In WEA, pages 14–27, 2008.
[27] R. Sinha, J. Zobel, and D. Ring. Cache-efficient string sorting using copying. ACM Journal of Experimental Algorithmics, 2006.
[28] E. Sintorn and U. Assarsson. Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, in press, corrected proof.
[29] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, San Francisco, CA, 1999.