=Paper= {{Paper |id=Vol-480/paper-4 |storemode=property |title=Sorting using BItonic netwoRk wIth CUDA |pdfUrl=https://ceur-ws.org/Vol-480/paper4.pdf |volume=Vol-480 |dblpUrl=https://dblp.org/rec/conf/sigir/CapanniniSBN09 }} ==Sorting using BItonic netwoRk wIth CUDA== https://ceur-ws.org/Vol-480/paper4.pdf
                      Sorting using BItonic netwoRk wIth CUDA

                                          Ranieri Baraglia                      Gabriele Capannini
                                              ISTI - CNR                               ISTI - CNR
                                           Via G. Moruzzi, 1                        Via G. Moruzzi, 1
                                           56124 Pisa, Italy                        56124 Pisa, Italy
                                      r.baraglia@isti.cnr.it                  g.capannini@isti.cnr.it
                                      Franco Maria Nardini                       Fabrizio Silvestri
                                              ISTI - CNR                               ISTI - CNR
                                           Via G. Moruzzi, 1                        Via G. Moruzzi, 1
                                           56124 Pisa, Italy                        56124 Pisa, Italy
                                        f.nardini@isti.cnr.it                   f.silvestri@isti.cnr.it

ABSTRACT
Novel “manycore” architectures, such as graphics processors, are highly parallel, high-performance shared-memory architectures [7] born to solve specific problems, such as graphical ones. Those architectures can be exploited to solve a wider range of problems by designing the related algorithms for such architectures. We present a fast sorting algorithm implementing an efficient bitonic sorting network. This algorithm is highly suitable for information retrieval applications. Sorting is a fundamental and universal problem in computer science. Even if sorting has been extensively addressed by many research works, it still remains an interesting challenge to make it faster by exploiting novel technologies. In this light, this paper shows how to use graphics processors as coprocessors to speed up sorting while allowing the CPU to perform other tasks. Our new algorithm exploits a memory-efficient data access pattern that maintains the minimum number of accesses to the memory out of the chip. We introduce an efficient instruction dispatch mechanism to improve the overall sorting performance. We also present a cache-based computational model for graphics processors. Experimental results highlight remarkable improvements over prior CPU-based sorting methods, and a significant improvement over previous GPU-based sorting algorithms.

Copyright © 2009 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. Re-publication of material from this volume requires permission by the copyright owners. This volume is published by its editors.
LSDS-IR Workshop. July 2009. Boston, USA.

1. INTRODUCTION
   Every day people use Web Search Engines as a tool for accessing information, sites, and services on the Web. Information retrieval has to face the issues posed by the growing amount of information on the web, as well as by the growing number of new users. Creating a Web Search Engine which scales even to today’s Web contents presents many challenges. A fast crawling technology is needed to gather web documents and keep them up to date. Storage space must be used efficiently to store indexes and documents. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second. All these services run on clusters of homogeneous PCs. The choice of the PCs in these clusters depends upon price, CPU speed, memory and disk size, heat output, and physical size [3]. Nowadays these characteristics can also be found in other commodity hardware originally designed for specific graphics computations. Many special processor architectures have been proposed to exploit data parallelism for data-intensive computations, and graphics processors (GPUs) are one of those. For example, the scientific community uses GPUs for general-purpose computation. The results obtained, in terms of computational latency, outperform those achievable on classical processors. Unfortunately, such programs must rely on APIs to access the hardware, for example OpenGL or DirectX. These APIs are over-specified, forcing the programmer to manipulate data that is not directly relevant, and their drivers make critical policy decisions, such as deciding where data resides in memory and when it is copied.
   In the last years, due to the growing trend of the media market, the demand for rendering algorithms has been rapidly evolving. For the companies producing hardware, this means designing again and again new hardware able both to run the novel algorithms and to provide a higher rate of computations per second. Such processors require significant design effort and are thus difficult to change as applications and algorithms evolve. The request for flexibility in media processing motivates the use of programmable processors, and the existing need for non-graphical APIs pushed the same companies into creating new abstractions designed to last.
   Finally, according to what Moore’s law foresees, transistor density doubles every two years. Furthermore, mainly due to power-related issues, new generation processors (such as traditional CPUs) tend to incorporate an ever-increasing number of processors (also called cores) on the same chip [9]. The result is that nowadays the market proposes low-cost commodity hardware that is able to execute heavy loads of computations. In order to enable developers to leverage the power of such architectures, vendors usually make ad-hoc programming tools available.
   For the reasons listed so far, these architectures are ideal
candidates for the efficient implementation of those components of a Large-Scale Search Engine that are eager for computational power.
   This paper focuses on using a GPU as a co-processor for sorting. Sorting is a core problem in computer science that has been extensively researched over the last five decades and still remains a bottleneck in many applications involving large volumes of data. One could argue why efficient sorting is related to LSDS-IR. First of all, sorting is a basic operation for indexing. We will show in Section 3 how many indexing algorithms are basically a sorting operation over integer sequences. Large-scale indexing, thus, requires scalable sorting. Second, the technique we are introducing here is of crucial importance for Distributed Systems for IR, since it is designed to run on GPUs, which are considered by many as a basic building block for future generation data-centers [4]. Our bitonic sorting network can be seen as a viable alternative for sorting large quantities of data on graphics processors. In the last years general-purpose processors have been specialized by adopting mechanisms that make their operation more flexible. Such facilities (i.e. more levels of caches, the out-of-order execution paradigm, and branch prediction techniques) bring the theoretical performance of CPUs closer to the actually achievable one. On the other side, specialized processors, like GPUs, expose lower flexibility at design phase, but are able to reach higher computational power by providing more computational cores than the other classes of processors. We map a bitonic sorting network on the GPU exploiting its high-bandwidth memory interface. Our novel data partitioning improves GPU cache efficiency and minimizes data transfers between on-chip and off-chip memories.
   This paper is organized as follows. Section 2 discusses related works, while Section 3 introduces some relevant characteristics about the applicability of GPU-based sorting in Web Search Engines. Section 4 presents some issues arising from the stream programming model and the single-instruction multiple-data (SIMD) architecture. The central part is devoted to exposing our solution, and the computational model used to formalize Graphics Processing Units. Section 6 presents some results obtained in a preliminary evaluation phase. Section 7 discusses how to evolve and improve this research activity.

2. RELATED WORK
  Since most sorting algorithms are memory bandwidth bound, it is no surprise that there is currently a big interest in sorting on the high-bandwidth GPUs.
  Purcell et al. [24] presented an implementation of bitonic merge sort on GPUs based on an implementation by Kapasi et al. [17]. The authors used that approach to sort photons into a spatial data structure providing an efficient search mechanism for GPU-based photon mapping. Comparator stages were entirely realized in the fragment units1, including arithmetic, logical and texture operations. The authors reported their implementation to be compute-bound rather than bandwidth-bound, and they achieved a throughput far below the theoretical optimum of the target architecture.
1
 In addition to computational functionality, fragment units also provide an efficient memory interface to server-side data, i.e. texture maps and frame buffer objects.
  Kipfer et al. [19, 20] showed an improved version of the bitonic sort as well as an odd-even merge sort. They presented an improved bitonic sort routine that achieves a performance gain by minimizing both the number of instructions to be executed in the fragment program and the number of texture operations.
  Zachmann et al. [14] presented a novel approach for parallel sorting on stream processing architectures based on an adaptive bitonic sorting [6]. They presented an implementation based on modern programmable graphics hardware showing that their approach is competitive with common sequential sorting algorithms not only from a theoretical viewpoint, but also from a practical one. Good results are achieved by using efficient linear stream memory accesses and by combining the optimal time approach with efficient algorithms.
  Govindaraju et al. [13] implemented sorting as the main computational component for histogram approximation. This solution is based on the periodic balanced sorting network method by Dowd et al. [10]. In order to achieve high computational performance on the GPUs, they used a sorting-network-based algorithm in which each stage is computed using rasterization. They later presented a hybrid bitonic-radix sort that is able to sort vast quantities of data [12], called GPUTeraSort. This algorithm was proposed to sort records contained in databases using a GPU. This approach uses the data and task parallelism of the GPU to perform memory-intensive and compute-intensive tasks, while the CPU is used to perform I/O and resource management.
  Sengupta et al. [25] presented a radix sort and a Quicksort implementation based on segmented scan primitives. The authors presented new approaches to implementing several classic applications using these primitives, and showed that they are an excellent match for a broad set of problems on parallel hardware.
  Recently, Sintorn et al. [28] presented a sorting algorithm that combines bucket sort with merge sort. In addition, the authors show that this new GPU sorting method sorts in O(n log n) time.
  Cederman et al. [8] showed that their GPU-Quicksort is a viable sorting alternative. The algorithm recursively partitions the sequence to be sorted with respect to a pivot. This is done in parallel by each GPU thread until the entire sequence has been sorted. In each partition iteration a new pivot value is picked and, as a result, two new subsequences are created that can be sorted independently, and each thread block can be assigned one of them. Finally, experiments pointed out that GPU-Quicksort can outperform other GPU-based sorting algorithms.

3. APPLICATIONS TO INDEXING
   A modern search engine must scale even with the growth of today’s Web contents. Large-scale and distributed applications in Information Retrieval such as crawling, indexing, and query processing can exploit the computational power of new GPU architectures to keep up with this exponential growth.
   We consider here one of the core components of a large-scale search engine: the indexer. In the indexing phase, each crawled document is converted into a set of word occurrences called hits. For each word, the hits record the frequency, the position in the document, and some other information. Indexing, then, can be considered as a “sort” operation on a set of records representing term occurrences [2]. Records represent distinct occurrences of each term in each distinct document. Sorting these records efficiently, using a good balance of memory and disk usage, is a very challenging operation.
   In the last years it has been shown that sort-based approaches [29], or single-pass algorithms [21], are efficient in several scenarios where indexing of a large amount of data is performed with limited resources.
   A sort-based approach first makes a pass through the collection assembling all term-docID pairs. Then it sorts the pairs with the term as the dominant key and the docID as the secondary key. Finally, it organizes the docIDs for each term into a postings list (it also computes statistics like term and document frequency). For small collections, all this can be done in memory.
   When memory is not sufficient, we need to use an external sorting algorithm [22]. The main requirement of such an algorithm is that it minimizes the number of random disk seeks during sorting. One solution is the Blocked Sort-Based Indexing algorithm (BSBI). BSBI segments the collection into parts of equal size, then sorts the termID-docID pairs of each part in memory, and finally stores the intermediate sorted results on disk. When all the segments are sorted, it merges all the intermediate results into the final index.
   A more scalable alternative is Single-Pass In-Memory Indexing (SPIMI). SPIMI uses terms instead of termIDs, writes each block’s dictionary to disk, and then starts a new dictionary for the next block. SPIMI can index collections of any size as long as there is enough disk space available. The algorithm parses documents and turns them into a stream of term-docID pairs, called tokens. Tokens are then processed one by one. For each token, SPIMI adds a posting directly to its postings list. Instead of first collecting all termID-docID pairs and then sorting them (as BSBI does), each postings list is dynamic. This means that its size is adjusted as it grows. This has two advantages: it is faster because no sorting is required, and it saves memory because it keeps track of the term a postings list belongs to, so the termIDs of postings need not be stored.
   When memory is exhausted, SPIMI writes the index of the block (which consists of the dictionary and the postings lists) to disk. Before doing this, SPIMI sorts the terms to facilitate the final merging step: if each block’s postings lists were written in unsorted order, merging blocks could not be accomplished by a simple linear scan through each block. The last step of SPIMI is then to merge the blocks into the final inverted index.
   SPIMI, whose time complexity is lower because no sorting of tokens is required, is usually preferred to BSBI, which has a higher time complexity.
   In both of the methods presented for indexing, sorting is involved: BSBI sorts the termID-docID pairs of all parts in memory, while SPIMI sorts the terms to facilitate the final merging step [22].
   In order to efficiently evaluate these two approaches on a heterogeneous cluster we have to compare “standard” SPIMI performance with the performance of a BSBI-based sorter implemented by us. Moreover, to fully analyze the indexing phase, we need a GPU-based string sorter able to outperform CPUs as well as our sorter for integers does. In this way we have the possibility to compare both solutions on all architectures, and then to choose the best combination. Having all possible implementations available, a flexible execution of indexing running on various hardware can be imagined. This option is even more important if the allocation of the tasks is scheduled dynamically, as can be done depending on the workload of the single resources.
   The state of the art in string sorting lacks solutions for GPU architectures: nowadays we are not aware of solutions for parallel SIMD processors. In the literature, this problem is efficiently solved by using different approaches. The most interesting and suitable for us seems to be Burstsort [27]. It is a technique that uses the burst trie [15] to distribute string items into small buckets whose contents are then sorted with standard (string) sorting algorithms. Subsequently, Sinha et al. [26] introduced improvements that reduce by a significant margin the memory requirements of Burstsort. This aspect is even more relevant for GPU architectures, which have small-sized on-chip memories.

4. SORTING WITH GPUS
   This section gives a brief overview of GPUs, highlighting the features that make them useful for sorting. GPUs are designed to execute geometric transformations that generate a data stream of display pixels. A data stream is processed by a program running on multiple SIMD processors, which are capable of data-intensive computations. The output, then, is written back to the memory.

4.1 SIMD Architecture
   SIMD machines are classified as processor-array machines: a SIMD machine basically consists of an array of computational units connected together by a simple network topology [23]. This processor array is connected to a control processor, which is responsible for fetching and interpreting instructions. The control processor issues arithmetic and data processing instructions to the processor array, and handles any control flow or serial computation that cannot be parallelized. Processing elements can be individually disabled for conditional execution: this option gives more flexibility during the design of an algorithm.
   Although SIMD machines are very effective for certain classes of problems, the architecture is specifically tailored for data-computation-intensive work, and it turns out to be quite “inflexible” on some classes of problems2.
2
 For example, these architectures cannot efficiently run control-flow-dominated code.

4.2 Stream Programming Model
   A stream program [18] organizes data as streams and expresses all computation as kernels. A stream is a sequence of similar data elements defined by a regular access pattern. A kernel typically loops through all the input stream elements, performing a sequence of operations on each element and appending the results to an output stream. These operations exhibit a high degree of instruction-level parallelism. Moreover, these operations cannot access arbitrary memory locations: they keep all the intermediate values locally, inside the kernel. Since each element of the input stream can be processed simultaneously, kernels also expose large amounts of data-level parallelism. Furthermore, stream memory transfers can be executed concurrently with kernels, thereby exposing task-level parallelism in the stream program. Some other important characteristics common to all stream-processing applications are: (i) elements are read once from memory, (ii) elements are not visited twice, and
(iii) the application requires a high number of arithmetic operations per memory reference, i.e. it is computationally intensive.

4.3 Nvidia’s CUDA
   CUDA [1], an acronym for Compute Unified Device Architecture, is defined as an architecture built around a scalable array of multithreaded streaming multiprocessors (MPs). Each MP is defined as a unit comprising one instruction unit, a collection of eight single-precision pipelines also called cores, a functional unit for special arithmetical operations, and a 16 KB local store also called shared memory. In practice, this means that each MP is a SIMD processor, whose cores form an arithmetic pipeline that executes scalar instructions. All MPs create, manage, and execute concurrent threads in hardware with zero scheduling overhead, and implement a barrier for thread synchronization. Nevertheless, only threads concurrently running on the same MP can be synchronized.
   The CUDA model also introduces the entity warp as a group of 32 parallel scalar threads, and reports that each warp executes one common instruction at a time. This is another way of saying that a warp is a stream of vector instructions: scalar threads are then vector elements. But, unlike other SIMD instruction sets, such as Intel’s SSE, a particular value of the vector length is not specified. A thread block is defined as a collection of warps that run on the same MP and share a partition of the local store. The number of warps in a thread block is configurable.

4.3.1    CUDA SDK
   The SDK provided by Nvidia for its GPUs consists of a large collection of code examples, compilers and run-time libraries. Clearly the CUDA model is “restricted” to Nvidia products, mainly for efficiency reasons, and it conforms to the stream programming model. Threads and thread blocks can be created only by invoking a parallel kernel, not from within a parallel kernel. Task parallelism can be expressed at the thread-block level, but block-wide barriers are not well suited for supporting task parallelism among threads in a block. To enable CUDA programs to run on any number of processors, communication between different blocks of threads is not allowed, so blocks must execute independently. Since CUDA requires that thread blocks be independent and allows blocks to be executed in any order, combining results generated by multiple blocks must in general be done by launching a second kernel. However, multiple thread blocks can coordinate their work using atomic operations on the external (off-chip) memory.
   Recursive function calls are not allowed in CUDA kernels. Recursion is, in fact, unattractive in a massively parallel kernel: providing stack space for all the active threads would require substantial amounts of memory. To support a heterogeneous system architecture combining a CPU and a GPU, each with its own memory system, CUDA programs must copy data and results between host memory and device memory. The overhead of CPU/GPU interaction and data transfers is minimized by using DMA block-transfer engines and fast interconnects.

4.4 Cache-oblivious algorithms
   The cost of communication can be up to an order of magnitude larger than the cost of the pure computation on such architectures. Our idea is to model the proposed solution as a cache-oblivious algorithm. The model underlying this type of algorithm is not directly applicable to the GPU’s parallel architecture, which is equipped with local memory instead of a cache. Adopting the local-memory approach, the programmer has to bear the effort of synchronizing, sizing, and scheduling the computation of data and its movement between the off-chip memory and the on-chip one. On the other hand, in cache-based architectures this aspect is automatically managed by the underlying support. A local-memory approach permits moving data located at different addresses so as to compose a specific access pattern. This capability is impossible to realize with caches, where the hardware hides this operation by automatically replacing missed cache lines.
   Frigo et al. [11] present cache-oblivious algorithms that perform both an asymptotically optimal amount of work and an asymptotically optimal number of transfers among multiple levels of cache. An algorithm is cache-oblivious if no program variable dependent on hardware configuration parameters, such as cache size and cache-line length, needs to be tuned to minimize the number of cache misses. The authors introduce the (Z, L) ideal-cache model to study the cache complexity of algorithms. This model describes a computer with a two-level memory hierarchy consisting of an ideal data cache of Z words and an arbitrarily large main memory. The cache is partitioned into cache lines, each consisting of L consecutive words that are always moved together between cache and main memory.
   The processor can only reference words that reside in the cache. If the referenced word belongs to a line already in cache, a cache hit occurs, and the word is delivered to the processor. Otherwise, a cache miss occurs, and the line is fetched into the cache. If the cache is full, a cache line must be evicted. An algorithm with an input of size n is measured in the ideal-cache model in terms of its work complexity W(n) and its cache complexity Q(n, Z, L), that is, the number of cache misses it incurs as a function of the size Z and line length L of the ideal cache.
   The metrics used to measure cache-oblivious algorithms need to be reconsidered in order to be used with GPUs, which are parallel architectures. To do that, W() has to be defined taking care of the level of parallelism exposed by GPUs. Evaluating Q(), we must translate the concepts of Z and L, which refer to cache characteristics. More precisely, a GPU provides p MPs, each one with a local memory. We can abstract such an architectural organization by considering each local memory as one cache line, and the union of all local memories as the entire cache, taking no care of which processor is using the data. From this point of view, if the shared memory of each MP is 16 KB, we obtain L = 16 KB and Z = 16 · p KB.

5.   BITONIC SORT
   To design our sorting algorithm in the stream programming model, we started from the popular Bitonic Sort (BS) network and extended it to adapt it to our specific architecture. Specifically, BS is one of the fastest sorting networks [5]. A sorting network is a special kind of sorting algorithm, where the sequence of comparisons does not depend on the order with which the data is presented; see Figure 1. This makes sorting networks suitable for implementation on GPUs. In particular, the regularity of the schema used to compare the elements to sort makes this kind of sorting network particularly suitable for partitioning the elements in
the stream programming fashion, as GPUs require. Finally, we compared the theoretical results with the ones resulting from the tests, in order to assess whether the adapted ideal-cache model is useful to abstract GPUs.

Figure 1: Bitonic sorting networks for 16 elements. Each step is completed when all the comparisons involved are computed. In the figure each comparison is represented by a vertical line that links two elements, which are represented by horizontal lines.

   The main issue to address is to define an efficient schema to map all the comparisons involved in the BS onto the elements composing the invoked streams.
   The first constraint is that the elements composing each stream must be “distinct”. This means that each item in the input has to be included in exactly one element of the stream. From this point of view, each stream of elements defines a partition of the input array to be sorted. This characteristic is necessary due to the runtime support, because it does not permit any type of data exchange, or synchronization, between two elements (Figure 2). Any such exchange would have to pass through the off-chip memory. In the ideal-cache computational model, it corresponds to a cache-miss event, which wastes the performance of the algorithm.
   Summing up, performing several network steps in the same kernel has the double effect of reducing the number of cache misses, i.e. improving the Q() metric, and of increasing the number of arithmetic operations per memory reference. The unique constraint is that the computation of an element has to be independent of the computation of any other element in the same stream.

Figure 3: Increasing the number of steps covered by the partition, the number of items included doubles. A, B and C are partitions for local memories of 2, 4 and 8 locations, respectively.

   Let us introduce our solution. First of all, we need to establish the number of consecutive steps to be executed by one kernel. We must consider that, for each step assigned to a kernel, in order to keep all the stream elements independent, the number of memory locations needed by the relative partition doubles; see Figure 3. So, the number of steps a kernel can cover is bounded by the number of items that it is possible to include in a stream element. Furthermore, the number of items is bounded by the size of the shared memory available for each SIMD processor.

                                                                  Algorithm 1 Bitonic Sort algorithm.
                                                                   a ← array to sort
                                                                   for s = 1 to log2 |a| do
                        element                                      for c = s − 1 down to 0 do
        kernel stream                                                   for r = 0 to |a| − 1 do
                                                                          if 2rc ≡ 2rs (mod 2) ∧ a[r] > a[r ⊕ 2c ] then
Figure 2: Example of a kernel stream comprising                              a[r] ⇆ a[r ⊕ 2c ]
more sorting network steps. The subset of items                           end if
composing each element must perform comparison                          end for
only inside itself.                                                  end for
                                                                   end for
   The second aspect to optimize is the partitioning of the
steps composing the sorting network. Since a BS is com-              More precisely, to know how many steps can be included
posed by several steps (Figure 1), we have to map the execu-      in one partition, we have to count how many distinct values c
tion of all steps into a sequence of independent runs, each of    assumes, see Algorithm 1. Due to the fixed size of memory
them corresponding to the invocation of a kernel. Since each      locations, i.e. 16 KB, we can specifies partition of SH =
kernel invocation implies a communication phase, such map-        4 K items, for items of 32 bits. Moreover such partition is
ping should be done in order to reduce the number of these        able to cover “at least” sh = log(SH) = 12 steps. From
invocations, thus the communication overhead. Specifically,       this evaluation it is also possible to estimate the size of the
this overhead is generated whenever the SIMD processor be-        kernel stream: if a partition representing an element of the
gins the execution of a new stream element. In that case, the     stream contains SH items, and the array a to sort contains
processor needs to flush the results contained in the proper      N = 2n items, then the stream contains b = N/SH = 2n−sh
shared memory, then to fetch the new data from the off-chip       elements.
An important consideration must be made for the first kernel invoked by the algorithm: as long as the variable s in Algorithm 1 is not greater than sh, the computation of the first several steps can be done with this first kernel. This is because the values assumed by c remain in a range of sh distinct values. More precisely, the first kernel computes the first sh·(sh+1)/2 steps (Figure 3).

This access pattern schema can be translated into a function ℓ_c(i) that, for the current kernel stream and the current value of c, defines the subset of a to be assigned to the i-th stream element. In other words, ℓ describes a method to enumerate the set of indexes of a on which the i-th element of the kernel stream has to operate. Formally, it is defined as:

    ℓ_c : i ∈ [0, b − 1] → Π ⊆ π_sh(a)

where π_sh(a) is a partition of a, namely a set of nonempty subsets of a such that every element of a is in exactly one of these subsets, and each subset contains exactly 2^sh elements of a.

Let us assume n = 32 and a shared memory that can contain 4 K items, so we have |a| = 2^32 and b = 2^(n−sh) = 2^(32−12) = 2^20 elements for each stream. Basically, we need 32 bits to address an element of a, and log(b) = 20 bits to identify the i-th partition among the b existing ones. The value i is used to build a bit-mask that is equal for each address produced by ℓ_c(i). Such a mask sets log(b) bits of the 32 bits composing an index of a. The missing sh bits are generated by using a variable x to enumerate all the values in the range X = [0, 2^sh − 1] and by inserting each of them at the c-th position of the mask i. This composition leads to a set of addresses of n bits whose corresponding items compose the i-th partition. Formally, we obtain:

    ℓ_c(i) = {x ∈ X : i_[31...c] ◦ x ◦ i_[c+1...0]}

where the notation i_[31...c] identifies the leftmost bits, namely from the 31st bit down to the c-th one, and "◦" is the concatenation operator.

The rule to compose the elements of ℓ_c(i) is simple, but in some cases it leads to an exception. When c < sh, x is divided into two parts, x_L and x_R, which are inserted at the c-th position and at the c′-th position of i, respectively. In particular, the condition c < sh occurs whenever the middle loop of Algorithm 1 ends and c starts a new loop with the new value of s, denoted with c′. Specifically, depending on the current value of c, the algorithm needs to make two insertions: to add x_R at position c, and to add x_L at position c′. Formally, when c < sh, we have to define a second function ℓ_{c,c′}() as follows:

    ℓ_{c,c′}(i) = {x ∈ X : i_[31...c̄] ◦ x_L ◦ i_[c′+sh−c...c] ◦ x_R}

where x_L = x_[sh−1...c], x_R = x_[c−1...0], and c′ = s + 1.

5.1 Evaluation

For our solution we obtain that W(n, p), where p indicates the number of MPs and n the size of a, is equal to the computational cost of the sequential BS divided by p, specifically W(n, p) = O(n/p · log^2 n). To know Q(n, Z, L) we must estimate the number of cache misses. Assuming |a| = n, we obtain that the sorting network is made of σ = (log^2(n) + log(n))/2 steps. Furthermore, let us assume L = SH, Z = SH · p, and that each kernel covers sh steps, except the first kernel, which covers σ′ = (sh^2 + sh)/2 steps. Then the number of cache misses is ⌈(σ − σ′)/sh⌉.

The last consideration regards the W(n, p) measure, which should be estimated considering that each MP is a SIMD processor. In fact, each MP reaches its maximum performance whenever the data flow permits the control unit to issue the same instruction for all the cores. In this way such an instruction is executed in parallel on different data in a SIMD fashion.

In order to compare our bitonic sort with the quick sort proposed by Cederman et al., we tried to extend the analysis of the ideal-cache model metrics to their solution. Unfortunately, their solution cannot be measured as accurately as ours. In particular, it is possible to estimate the number of transfers among multiple levels of cache, but quick sort uses off-chip memory also to implement the prefix-sum for each stream element run. In particular, quick sort splits the input array in two parts with respect to a pivot by invoking a procedure on the GPU, and recursively repeats this operation until each part can be entirely contained in the shared memory. In the optimistic case, assuming |a| = n and a shared memory of size L = SH, this operation is invoked log(n) − log(L) times, which is also the number of cache misses for quick sort, namely Q(). This value is sensibly lower than the Q() measured for bitonic sort, but the evaluation of the number of such cache misses should be proportionally augmented due to the prefix-sum procedure. Regarding W(), the authors report a computational complexity equal to the one obtained for the sequential case, i.e. O(n · log(n)), without referring to the parallelism of the underlying hardware. However, an optimistic evaluation of such a parallel version should be, also in this case, lower than the W() computed for bitonic sort.

6. RESULTS

The experiments have been conducted on an Ubuntu Linux desktop with an Nvidia 8800GT, namely a GPU provided with 14 SIMD processors, and CUDA SDK 2.1. To generate the arrays for these preliminary tests we used uniform, gaussian and zipfian distributions. Specifically, for each distribution we generated 20 different arrays. Figure 4 shows the mean, the standard deviation, the maximum and the minimum of the elapsed times for all the runs of each distribution. The tests involved the CPU-based quick sort provided with the C standard library, our solution, and the one proposed by Cederman et al. [8]. In that paper, the GPU-based quick sort resulted the most performing algorithm in the literature, so our preliminary tests take into consideration only such a GPU-based algorithm.

Figure 4 shows that the GPU-based solutions are able to outperform the CPU's performance on the sorting problem. The same figure also shows that our solution is not competitive with respect to the one of Cederman until the number of items to sort reaches 8M. One more consideration is that the GPU-based quick sort is not able to successfully conclude the tests for arrays larger than 16M. Further analysis pointed out that the bitonic algorithm spends the main part of the elapsed time in the data-transfer phase of some specific kernel instances. Since the element streams were always of the same size, we deduced that the number of transfers is not the only aspect to take into consideration to minimize the communication overhead, as the metrics of the ideal-cache model suggest.
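The gap between the two theoretical estimates can be made concrete with a few lines of Python. This is a sketch based on the formulas of Section 5.1; the function names are ours, sh = 12 as above, and the quick-sort count is the optimistic one that ignores the prefix-sum traffic:

```python
from math import ceil, log2

SH_BITS = 12  # sh = log(SH): SH = 4 K items of 32 bits in 16 KB of shared memory

def bitonic_misses(n, sh=SH_BITS):
    """Cache misses ⌈(σ − σ′)/sh⌉, with σ = (log²n + log n)/2
    total steps and σ′ = (sh² + sh)/2 steps covered by the first kernel."""
    sigma = (log2(n) ** 2 + log2(n)) / 2
    sigma_first = (sh ** 2 + sh) / 2
    return ceil((sigma - sigma_first) / sh)

def quicksort_misses(n, sh=SH_BITS):
    """Optimistic count log(n) − log(L) for the GPU quick sort, with L = SH."""
    return int(log2(n)) - sh

for m in (20, 23, 26):  # 1M, 8M and 64M items
    n = 1 << m
    print(f"2^{m} items: bitonic {bitonic_misses(n)}, quick sort {quicksort_misses(n)}")
```

Over the input sizes used in the tests, the bitonic estimate stays roughly twice the quick-sort one, which is the theoretical disadvantage discussed above; the measured runtimes nevertheless favor bitonic sort on the larger inputs.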
[Figure 4 appears here: three plots, for a) gaussian, b) uniform and c) zipfian distributions, showing the elapsed time (sec/1000) of bitonic sort, gpu quick sort and cpu quick sort for input sizes from 1M to 64M.]

Figure 4: Squared white areas represent the variance obtained over several runs for each size of the problem. Vertical lines point out the maximum and the minimum values obtained.

As suggested by Helman et al. [16], a deeper evaluation of the algorithm can be conducted by using arrays generated with different distributions. This type of test puts in evidence the behavior of the algorithms with regard to the variance of the times obtained in different contexts. For the distributions tested, the bitonic sort algorithm has a better behavior with respect to variance. Obviously, this result is caused by the type of algorithm used: quick sort is a data-dependent approach, whereas sorting networks are based on a fixed schema of comparisons that does not vary with the data distribution.

Considerations on the results obtained from these preliminary tests suggest that the ideal-cache model does not seem sufficiently accurate to abstract the GPU architecture. Although the theoretical results indicate better performance for the GPU-based quick sort, the tests conducted show that bitonic sort has a better performance trend as the size of the problem increases. This consideration is reinforced by the analysis of the data transfer: we strongly believe that by improving the data-transfer bandwidth, bitonic sort can reach better results without increasing the theoretical W() and Q() metrics.

7. FUTURE WORKS

Preliminary results show that the number of transfers is not the only aspect to take into consideration for minimizing the communication overhead. Another important factor is the transfer bandwidth, which is relevant to achieving better results. The results show that the ideal-cache model is not able to fully describe and capture all the aspects determining the performance of such kind of architectures. Probably different kinds of performance metrics are needed to evaluate algorithms on these novel hardware resources.

Furthermore, since indexing is a tuple-sorting operation, we will extend our solution to include the sorting of tuples of integers. In this paper, in fact, we assume the tuples are sorted by using multiple passes over the dataset. We reserve to future work the extension to tuples.

Of course, we have to run more tests to reinforce the results obtained and to analyze more in depth the causes of the waste of performance that affects our algorithm.

8. ACKNOWLEDGMENTS

This research has been supported by the Action IC0805: Open European Network for High Performance Computing on Complex Environments.

9. REFERENCES
 [1] NVIDIA CUDA Compute Unified Device Architecture - Programming Guide, 2007.
 [2] R. Baeza-Yates, C. Castillo, F. Junqueira, V. Plachouras, and F. Silvestri. Challenges on distributed web retrieval. In ICDE. IEEE, 2007.
 [3] L. A. Barroso, J. Dean, and U. Hölzle. Web search for a planet: The Google cluster architecture. IEEE Micro, 23(2):22–28, 2003.
 [4] L. A. Barroso and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool, San Rafael, CA, USA, 2009.
 [5] K. E. Batcher. Sorting networks and their applications. In AFIPS '68 (Spring): Proceedings of the April 30–May 2, 1968, Spring Joint Computer Conference, pages 307–314, New York, NY, USA, 1968. ACM.
 [6] G. Bilardi and A. Nicolau. Adaptive bitonic sorting: An optimal parallel algorithm for shared memory machines. Technical report, Ithaca, NY, USA, 1986.
 [7] J. Bovay, B. H. Brent, H. Lin, and K. Wadleigh. Accelerators for high performance computing
investigation. White paper, High Performance Computing Division - Hewlett-Packard Company, 2007.
 [8] D. Cederman and P. Tsigas. A practical quicksort algorithm for graphics processors. In ESA '08: Proceedings of the 16th Annual European Symposium on Algorithms, pages 246–258, Berlin, Heidelberg, 2008. Springer-Verlag.
 [9] J. D. Davis, J. Laudon, and K. Olukotun. Maximizing CMP throughput with mediocre cores. In PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, pages 51–62, Washington, DC, USA, 2005. IEEE Computer Society.
[10] M. Dowd, Y. Perl, L. Rudolph, and M. Saks. The periodic balanced sorting network. J. ACM, 36(4):738–757, 1989.
[11] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, Washington, DC, USA, 1999. IEEE Computer Society.
[12] N. K. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High performance graphics coprocessor sorting for large database management. In ACM SIGMOD International Conference on Management of Data, Chicago, United States, June 2006.
[13] N. K. Govindaraju, N. Raghuvanshi, and D. Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 611–622, New York, NY, USA, 2005. ACM.
[14] A. Greß and G. Zachmann. GPU-ABiSort: Optimal parallel sorting on stream architectures. In The 20th IEEE International Parallel and Distributed Processing Symposium (IPDPS '06), page 45, Apr. 2006.
[15] S. Heinz, J. Zobel, and H. E. Williams. Burst tries: A fast, efficient data structure for string keys. ACM Transactions on Information Systems, 20(2):192–223, 2002.
[16] D. R. Helman, D. A. Bader, and J. JáJá. A randomized parallel sorting algorithm with an experimental study. Technical report, Journal of Parallel and Distributed Computing, 1995.
[17] U. J. Kapasi, W. J. Dally, S. Rixner, P. R. Mattson, J. D. Owens, and B. Khailany. Efficient conditional operations for data-parallel architectures. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 159–170. ACM Press, 2000.
[18] B. Khailany, W. J. Dally, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, A. Chang, and S. Rixner. Imagine: Media processing with streams. IEEE Micro, 21(2):35–46, 2001.
[19] P. Kipfer, M. Segal, and R. Westermann. UberFlow: A GPU-based particle engine. In HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 115–122, New York, NY, USA, 2004. ACM Press.
[20] P. Kipfer and R. Westermann. Improved GPU sorting. In M. Pharr, editor, GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, pages 733–746. Addison-Wesley, 2005.
[21] N. Lester. Fast on-line index construction by geometric partitioning. In Proceedings of the 14th ACM Conference on Information and Knowledge Management (CIKM 2005), pages 776–783. ACM Press, 2005.
[22] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[23] M. A. Nichols, H. J. Siegel, H. G. Dietz, R. W. Quong, and W. G. Nation. Eliminating memory for fragmentation within partitionable SIMD/SPMD machines. IEEE Trans. Parallel Distrib. Syst., 2(3):290–303, 1991.
[24] T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. In Proceedings of the ACM SIGGRAPH Conference on Graphics Hardware, pages 41–50. Eurographics Association, 2003.
[25] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In GH '07: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pages 97–106, Aire-la-Ville, Switzerland, 2007. Eurographics Association.
[26] R. Sinha and A. Wirth. Engineering burstsort: Towards fast in-place string sorting. In WEA, pages 14–27, 2008.
[27] R. Sinha, J. Zobel, and D. Ring. Cache-efficient string sorting using copying. ACM Journal of Experimental Algorithmics, 2006.
[28] E. Sintorn and U. Assarsson. Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, in press, corrected proof.
[29] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, San Francisco, CA, 1999.