<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sorting using BItonic netwoRk wIth CUDA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ranieri Baraglia</string-name>
          <email>r.baraglia@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franco Maria Nardini</string-name>
          <email>f.nardini@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriele Capannini</string-name>
          <email>g.capannini@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Silvestri</string-name>
          <email>f.silvestri@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ISTI - CNR</institution>
          ,
          <addr-line>Via G. Moruzzi, 1, 56124 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Novel “manycore” architectures, such as graphics processors, are highly parallel and high-performance shared-memory architectures [7] born to solve specific problems such as graphical ones. These architectures can be exploited to solve a wider range of problems by designing the appropriate algorithms for them. We present a fast sorting algorithm implementing an efficient bitonic sorting network. This algorithm is highly suitable for information retrieval applications. Sorting is a fundamental and universal problem in computer science. Even though sorting has been extensively addressed by many research works, it still remains an interesting challenge to make it faster by exploiting novel technologies. In this light, this paper shows how to use graphics processors as coprocessors to speed up sorting while allowing the CPU to perform other tasks. Our new algorithm exploits a memory-efficient data access pattern that maintains the minimum number of accesses to off-chip memory. We introduce an efficient instruction dispatch mechanism to improve the overall sorting performance. We also present a cache-based computational model for graphics processors. Experimental results highlight remarkable improvements over prior CPU-based sorting methods, and a significant improvement over previous GPU-based sorting algorithms.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
<p>Every day people use Web Search Engines as a tool for
accessing information, sites, and services on the Web.
Information retrieval has to face the issues raised by the growing
amount of information on the Web, as well as by the growing number of
users. Creating a Web Search Engine that scales even
to today’s Web content presents many challenges. A fast
crawling technology is needed to gather web documents and
keep them up to date. Storage space must be used efficiently
to store indexes and documents. The indexing system must
process hundreds of gigabytes of data efficiently. Queries
must be handled quickly, at a rate of hundreds to thousands
per second. All these services run on clusters of
homogeneous PCs. The choice of the PCs in these clusters depends upon price, CPU
speed, memory and disk size, heat output, and physical size
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Nowadays these characteristics can also be found in other
commodity hardware originally designed for specific
graphics computations. Many special processor architectures have
been proposed to exploit data parallelism for data-intensive
computations, and graphics processors (GPUs) are one of
them. For example, the scientific community uses GPUs
for general-purpose computation. The results obtained, in
terms of computational latency, outperform the times
achieved on classical processors. Unfortunately, such
programs must rely on APIs, such as OpenGL or DirectX, to access
the hardware. These APIs are simultaneously
overspecified, forcing the programmer to manipulate data that is not
directly relevant, while the drivers make critical
policy decisions, such as deciding where data resides in memory
and when it is copied.
      </p>
      <p>Copyright © 2009 for the individual papers by the papers’ authors.
Copying permitted for private and academic purposes. Re-publication of material
from this volume requires permission by the copyright owners. This volume
is published by its editors. LSDS-IR Workshop. July 2009. Boston, USA.</p>
<p>In recent years, due to the growing trend of the media market,
the demand for rendering algorithms has been rapidly evolving. For
the companies producing hardware, this means repeatedly
designing new hardware that is both able to run novel algorithms
and able to provide a higher rate of computations per
second. Such processors require significant design effort and
are thus difficult to change as applications and algorithms
evolve. The demand for flexibility in media processing
motivates the use of programmable processors, and the existing
need for non-graphical APIs pushed the same companies into
creating new abstractions designed to last.</p>
<p>
        Finally, according to Moore’s law, transistor
density doubles every two years.
Furthermore, mainly due to power-related issues, new-generation
processors (such as traditional CPUs) tend to incorporate
an ever-increasing number of processors (also called cores)
on the same chip [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The result is that nowadays the
market offers low-cost commodity hardware that is able to
execute heavy computational loads. In order to enable
developers to leverage the power of such architectures, vendors
usually provide ad-hoc programming tools for them.
      </p>
<p>For the reasons listed so far, these architectures are ideal
candidates for the efficient implementation of those
components of a large-scale search engine that are hungry for
computational power.</p>
<p>
        This paper focuses on using a GPU as a co-processor for
sorting. Sorting is a core problem in computer science that
has been extensively researched over the last five decades,
yet it still remains a bottleneck in many applications
involving large volumes of data. One could ask how efficient
sorting relates to LSDS-IR. First of all, sorting is a
basic operation for indexing. We will show in Section 3
how many indexing algorithms are basically a sorting
operation over integer sequences. Large-scale indexing, thus,
requires scalable sorting. Second, the technique we are
introducing here is of crucial importance for Distributed
Systems for IR, since it is designed to run on GPUs, which are
considered by many as a basic building block for future
generation data-centers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Our bitonic sorting network can
be seen as a viable alternative for sorting large quantities
of data on graphics processors. In recent years general-purpose
processors have been specialized by adopting
mechanisms that make their operation more flexible. Such facilities (i.e.
more levels of caches, the out-of-order execution paradigm, and
branch prediction techniques) bring the
theoretical performance of CPUs closer to the really achievable one.
On the other side, specialized processors, like GPUs,
expose lower flexibility at design time, but are able to reach
higher computational power by providing more computational
cores than the other class of processors. We map
a bitonic sorting network onto the GPU, exploiting its high-bandwidth
memory interface. Our novel data partitioning
improves GPU cache efficiency and minimizes data transfers
between on-chip and off-chip memories.
      </p>
<p>This paper is organized as follows. Section 2 discusses
related work, while Section 3 introduces some relevant
considerations about the applicability of GPU-based
sorting in Web Search Engines. Section 4 presents some issues
arising from the stream programming model and the
single-instruction multiple-data (SIMD) architecture. The central
part of the paper is devoted to presenting our solution, and the
computational model used to formalize Graphics Processing Units.
Section 6 presents some results obtained in a preliminary
evaluation phase. Section 7 discusses how to evolve and
improve this research activity.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
<p>Since most sorting algorithms are memory-bandwidth bound,
it is no surprise that there is currently great interest in
sorting on high-bandwidth GPUs.</p>
<p>
        Purcell et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] presented an implementation of bitonic
merge sort on GPUs based on an implementation by
Kapasi et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The authors used that approach to sort photons
into a spatial data structure providing an efficient search
mechanism for GPU-based photon mapping. Comparator
stages were entirely realized in the fragment units (which, in
addition to computational functionality, also provide an
efficient memory interface to server-side data, i.e. texture maps
and frame buffer objects), including arithmetic, logical and
texture operations. The authors
reported their implementation to be compute-bound rather
than bandwidth-bound, and they achieved a throughput far
below the theoretical optimum of the target architecture.
      </p>
<p>
        Kipfer et al. [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ] showed an improved version of the
bitonic sort as well as an odd-even merge sort. They
presented an improved bitonic sort routine that achieves a
performance gain by minimizing both the number of
instructions to be executed in the fragment program and the
number of texture operations.
      </p>
<p>
        Zachmann et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] presented a novel approach to
parallel sorting on stream processing architectures based on
adaptive bitonic sorting [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. They presented an
implementation based on modern programmable graphics hardware
showing that their approach is competitive with common
sequential sorting algorithms not only from a theoretical
viewpoint, but also from a practical one. Good results are
achieved by using efficient linear stream memory accesses
and by combining the optimal-time approach with
algorithms suited to the architecture.
      </p>
<p>
        Govindaraju et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] implemented sorting as the main
computational component for histogram approximation. This
solution is based on the periodic balanced sorting network
method by Dowd et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In order to achieve high
computational performance on the GPU, they used a
sorting-network-based algorithm in which each stage is computed using
rasterization. They later presented a hybrid bitonic-radix
sort that is able to sort vast quantities of data [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], called
GPUTeraSort. This algorithm was proposed to sort records
contained in databases using a GPU. This approach uses
data and task parallelism on the GPU to perform
memory-intensive and compute-intensive tasks, while the CPU is used
to perform I/O and resource management.
      </p>
<p>
        Sengupta et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] presented a radix sort and a
Quicksort implementation based on segmented scan primitives.
The authors presented new approaches to implementing several
classic applications using these primitives, and showed that the
primitives are an excellent match for a broad set of problems
on parallel hardware.
      </p>
<p>
        Recently, Sintorn et al. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] presented a sorting algorithm
that combines bucket sort with merge sort. In addition, the
authors show that this new GPU sorting method sorts in
O(n log n) time.
      </p>
<p>
        Cederman et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] showed that their GPU-Quicksort is
a viable sorting alternative. The algorithm recursively
partitions the sequence to be sorted with respect to a pivot.
This is done in parallel by each GPU thread until the
entire sequence has been sorted. In each partition iteration
a new pivot value is picked and, as a result, two new
subsequences are created that can be sorted independently, and
each thread block can be assigned one of them. Finally,
experiments pointed out that GPU-Quicksort can outperform
other GPU-based sorting algorithms.
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPLICATIONS TO INDEXING</title>
<p>A modern search engine must scale with the growth
of today’s Web content. Large-scale and distributed
applications in Information Retrieval such as crawling, indexing,
and query processing can exploit the computational power
of new GPU architectures to keep up with this exponential
growth.</p>
<p>
        We consider here one of the core components of a
large-scale search engine: the indexer. In the indexing phase, each
crawled document is converted into a set of word occurrences
called hits. For each word, the hits record: frequency,
position in the document, and some other information. Indexing,
then, can be considered as a “sort” operation on a set of
records representing term occurrences [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Records
represent distinct occurrences of each term in each distinct
document. Sorting these records efficiently, using a good balance
of memory and disk usage, is a very challenging operation.
      </p>
<p>
        In recent years it has been shown that sort-based
approaches [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], or single-pass algorithms [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], are efficient in
several scenarios where indexing of a large amount of data
is performed with limited resources.
      </p>
      <p>A sort-based approach first makes a pass through the
collection assembling all term-docID pairs. Then it sorts the
pairs with the term as the dominant key and docID as the
secondary key. Finally, it organizes the docIDs for each term
into a postings list (it also computes statistics like term and
document frequency). For small collections, all this can be
done in memory.</p>
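<p>As a concrete sketch, the in-memory sort-based pass just described can be illustrated in a few lines of Python (a toy illustration only; the docs variable and the build_index function are hypothetical names, not part of any BSBI implementation):</p>
      <preformat>
```python
from itertools import groupby

def build_index(docs):
    # Sort-based inversion: collect (term, docID) pairs, sort them with the
    # term as the dominant key and docID as the secondary key, then build
    # a postings list for each term.
    pairs = [(term, doc_id) for doc_id, text in docs for term in text.split()]
    pairs.sort()
    index = {}
    for term, group in groupby(pairs, key=lambda p: p[0]):
        # organize the docIDs of each term into a postings list
        index[term] = sorted({doc_id for _, doc_id in group})
    return index

docs = [(1, "web search engine"), (2, "search the web")]
index = build_index(docs)
```
      </preformat>
<p>Statistics such as document frequency can be read directly off each postings list (its length).</p>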
<p>
        When memory is not sufficient, we need to use an external
sorting algorithm [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. The main requirement of such an
algorithm is that it minimizes the number of random disk seeks
during sorting. One solution is the Blocked Sort-Based
Indexing algorithm (BSBI). BSBI segments the collection into
parts of equal size, then sorts the termID-docID pairs of
each part in memory, and finally stores the intermediate sorted
results on disk. When all the segments are sorted, it merges
all intermediate results into the final index.
      </p>
<p>A more scalable alternative is Single-Pass In-Memory
Indexing (SPIMI). SPIMI uses terms instead of termIDs, writes
each block’s dictionary to disk, and then starts a new
dictionary for the next block. SPIMI can index collections of
any size as long as there is enough disk space available. The
algorithm parses documents and turns them into a stream of
term-docID pairs, called tokens. Tokens are then processed
one by one. For each token, SPIMI adds a posting directly to
its postings list. Instead of first collecting all termID-docID
pairs and then sorting them (as BSBI does), SPIMI grows each
postings list dynamically: its size is adjusted as the list
grows. This has two advantages: it is faster because no sorting
is required, and it saves memory because the algorithm keeps
track of the term a postings list belongs to, so the termIDs
of postings need not be stored.</p>
<p>When memory is exhausted, SPIMI writes the index of the
block (which consists of the dictionary and the postings lists)
to disk. Before doing this, SPIMI sorts the terms to
facilitate the final merging step: if each block’s postings lists
were written in unsorted order, merging blocks could not be
accomplished by a simple linear scan through each block.
The last step of SPIMI is then to merge the blocks into the
final inverted index.</p>
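<p>The construction of one SPIMI block can be sketched as follows (a minimal single-block illustration; spimi_invert is a hypothetical name, and the disk write is omitted):</p>
      <preformat>
```python
def spimi_invert(token_stream):
    # One SPIMI block: each posting is added directly to a dynamically
    # growing postings list; no sorting of tokens is required.
    dictionary = {}
    for term, doc_id in token_stream:            # tokens processed one by one
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    # sort the terms before writing the block, to ease the final merge
    return sorted(dictionary.items())

tokens = [("web", 1), ("search", 1), ("web", 2), ("search", 2)]
block = spimi_invert(tokens)
```
      </preformat>
<p>The final sort is over terms only, not over all tokens, which is where SPIMI gains over BSBI.</p>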
<p>SPIMI, whose time complexity is lower because no sorting
of tokens is required, is usually preferred over BSBI,
which presents a higher time complexity.</p>
<p>
        Both of the indexing methods presented involve sorting:
BSBI sorts the termID-docID pairs of all parts in
memory, while SPIMI sorts the terms to facilitate the final
merging step [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
<p>In order to efficiently evaluate these two approaches on a
heterogeneous cluster, we have to compare “standard” SPIMI
performance with the performance of a BSBI-based sorter
implemented by us. Moreover, to fully analyze the indexing
phase, we need a GPU-based string sorter able to outperform
CPUs, as our sorter for integers does. In this way
we have the possibility to compare both solutions, on all
architectures, and then to choose the best combination. Having
all possible implementations available, a flexible execution
of indexing running on various hardware can be imagined.
This option is even more important if the allocation of the
task is scheduled dynamically, which can be done depending
on the workload of the individual resources.</p>
<p>
        The state of the art in string sorting lacks solutions for GPU
architectures: we are not aware of any solution for
parallel SIMD processors. In the literature, this problem is
efficiently solved by using different approaches. The most
interesting and suitable for us seems to be Burstsort [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
It is a technique that uses the burst trie [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] to
distribute string items into small buckets whose contents are
then sorted with standard (string) sorting algorithms.
Subsequently, Sinha et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] introduced improvements that
reduce the memory requirements of Burstsort by a significant
margin. This aspect is even more relevant for GPU
architectures, which have small on-chip memories.
      </p>
    </sec>
    <sec id="sec-4">
      <title>SORTING WITH GPUS</title>
<p>This section gives a brief overview of GPUs, highlighting
the features that make them useful for sorting. GPUs are
designed to execute geometric transformations that generate a
data stream of display pixels. A data stream is processed by
a program running on multiple SIMD processors, which are
capable of data-intensive computations. The output is then
written back to memory.</p>
    </sec>
    <sec id="sec-5">
      <title>SIMD Architecture</title>
<p>
        SIMD machines are classified as processor-array machines:
a SIMD machine basically consists of an array of
computational units connected together by a simple network
topology [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. This processor array is connected to a control
processor, which is responsible for fetching and interpreting
instructions. The control processor issues arithmetic and data
processing instructions to the processor array, and handles
any control flow or serial computation that cannot be
parallelized. Processing elements can be individually disabled
for conditional execution: this option gives more flexibility
in the design of an algorithm.
      </p>
<p>Although SIMD machines are very effective for certain
classes of problems, the architecture is specifically tailored
for computation-intensive data work, and it turns out to be
quite “inflexible” on some classes of problems; for example,
these architectures cannot efficiently run control-flow
dominated code.</p>
    </sec>
    <sec id="sec-6">
      <title>Stream Programming Model</title>
<p>
        A stream program [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] organizes data as streams and
expresses all computation as kernels. A stream is a sequence
of similar data elements defined by a regular
access pattern. A kernel typically loops through all the input
stream elements, performing a sequence of operations on
each element, and appending results to an output stream.
These operations exhibit high instruction-level
parallelism. Moreover, these operations cannot access
arbitrary memory locations; they keep all the intermediate
values locally, inside kernels. Since each element of the
input stream can be processed simultaneously, kernels also
expose large amounts of data-level parallelism. Furthermore,
stream memory transfers can be executed concurrently with
kernels, thereby exposing task-level parallelism in the stream
program. Some other important characteristics common to
all stream-processing applications are: (i) elements are read
once from memory, (ii) elements are not visited twice, and
(iii) the application requires a high number of arithmetic
operations per memory reference, i.e. it is computationally
intensive.
      </p>
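<p>The kernel-over-stream idea can be sketched as follows (plain sequential Python, used only to illustrate the model; run_kernel is a hypothetical name, and on real stream hardware the elements are processed in parallel):</p>
      <preformat>
```python
def run_kernel(kernel, in_stream):
    # A kernel loops through the input stream, processing each element
    # independently and appending results to an output stream; intermediate
    # values stay local to the kernel, and each element is read exactly once.
    out_stream = []
    for element in in_stream:     # iterations could run in parallel
        out_stream.append(kernel(element))
    return out_stream

doubled = run_kernel(lambda x: 2 * x, [1, 2, 3])
```
      </preformat>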
    </sec>
    <sec id="sec-7">
      <title>Nvidia’s CUDA</title>
<p>
        CUDA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], an acronym for Compute Unified Device
Architecture, is defined as an architecture built around a scalable
array of multithreaded streaming multiprocessors (MPs).
Each MP is defined as a unit comprising one instruction
unit, a collection of eight single-precision pipelines also called
cores, a functional unit for special arithmetic operations,
and a 16 KB local store also called shared memory. In
practice, this means that each MP is a SIMD processor, whose
cores form an arithmetic pipeline that executes scalar
instructions. MPs create, manage, and execute concurrent
threads in hardware with zero scheduling overhead, and
implement a barrier for thread synchronization.
Nevertheless, only threads concurrently running on the same MP can
be synchronized.
      </p>
<p>The CUDA model also introduces the entity warp as a group
of 32 parallel scalar threads, and specifies that each warp
executes one common instruction at a time. This is another
way of saying that a warp is a stream of vector instructions,
with scalar threads as vector elements. But, unlike other
SIMD instruction sets, such as Intel’s SSE, a particular value
of the vector length is not specified. A thread block is defined
as a collection of warps that run on the same MP and share
a partition of the local store. The number of warps in a thread
block is configurable.</p>
      <p>CUDA SDK</p>
<p>The SDK provided by Nvidia for its GPUs consists of a
large collection of code examples, compilers and run-time
libraries. Clearly the CUDA model is “restricted” to Nvidia
products, mainly for efficiency reasons, and it conforms to
the stream programming model. Threads and thread blocks
can be created only by invoking a parallel kernel, not from
within a parallel kernel. Task parallelism can be expressed
at the thread-block level, but block-wide barriers are not
well suited for supporting task parallelism among threads
in a block. To enable CUDA programs to run on any
number of processors, communication between different blocks of
threads is not allowed, so they must execute independently.
Since CUDA requires that thread blocks are independent
and allows blocks to be executed in any order,
combining results generated by multiple blocks must in general be
done by launching a second kernel. However, multiple thread
blocks can coordinate their work using atomic operations on
the external (off-chip) memory.</p>
<p>Recursive function calls are not allowed in CUDA kernels.
Recursion is, in fact, unattractive in a massively parallel
kernel: providing stack space for all the active threads
would require substantial amounts of memory. To support
a heterogeneous system architecture combining a CPU and
a GPU, each with its own memory system, CUDA programs
must copy data and results between host memory and device
memory. The overhead of CPU/GPU interaction and data
transfers is minimized by using DMA block-transfer engines
and fast interconnects.</p>
    </sec>
    <sec id="sec-8">
      <title>Cache-oblivious algorithms</title>
<p>On such architectures, the cost of communication can be up
to an order of magnitude larger than the cost of the pure
computation. Our idea is to model the proposed solution as a
cache-oblivious algorithm. The model underlying this type
of algorithm is not directly applicable to the GPU’s parallel
architecture, which is equipped with local memory instead of
a cache. With the local-memory approach, the programmer
has to bear the effort of synchronizing, sizing, and
scheduling the computation of data and its movement between the
off-chip memory and the on-chip one. On the other hand, in
cache-based architectures this aspect is automatically
managed by the underlying support. A local-memory approach
makes it possible to move data located at different addresses,
composing a specific access pattern. This capability is impossible to
realize with caches, where the hardware hides this operation
by automatically replacing missed cache lines.</p>
<p>
        Frigo et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] present cache-oblivious algorithms that
perform both an asymptotically optimal amount of work and an
asymptotically optimal number of transfers among multiple levels
of cache. An algorithm is cache-oblivious if no program
variable dependent on hardware configuration parameters,
such as cache size or cache-line length, needs to be tuned
to minimize the number of cache misses. The authors introduce
the (Z, L) ideal-cache model to study the cache complexity
of algorithms. This model describes a computer with a
two-level memory hierarchy consisting of an ideal data cache of
Z words and an arbitrarily large main
memory. The cache is partitioned into cache lines, each
consisting of L consecutive words that are always moved together
between cache and main memory.
      </p>
      </p>
      <p>The processor can only reference words that reside in the
cache. If the referenced word belongs to a line already in
cache, a cache hit occurs, and the word is delivered to the
processor. Otherwise, a cache-miss occurs, and the line is
fetched into the cache. If the cache is full, a cache line
must be evicted. An algorithm with an input of size n is
measured in the ideal-cache model in terms of its work
complexity W (n) and its cache complexity Q(n, Z, L), that is
the number of cache misses it incurs as a function of the size
Z and line length L of the ideal cache.</p>
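<p>The cache complexity Q can be illustrated with a toy trace simulator (a sketch under stated assumptions: LRU replacement stands in for the ideal replacement policy, and cache_misses is a hypothetical name):</p>
      <preformat>
```python
from collections import OrderedDict

def cache_misses(addresses, Z, L):
    # Count the misses incurred by a (Z, L) cache on a word-address trace.
    lines = OrderedDict()            # resident cache lines, most recent last
    capacity = Z // L                # number of cache lines
    misses = 0
    for addr in addresses:
        line = addr // L             # L consecutive words move together
        if line in lines:
            lines.move_to_end(line)  # cache hit
        else:
            misses += 1              # cache miss: fetch the line
            lines[line] = True
            if len(lines) > capacity:
                lines.popitem(last=False)  # evict when the cache is full
    return misses

# a sequential scan of n words incurs n / L misses
q = cache_misses(range(4096), 512, 16)
```
      </preformat>
<p>As expected, the scan of 4096 words with L = 16 yields 4096 / 16 = 256 misses.</p>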
<p>The metrics used to measure cache-oblivious algorithms
need to be reconsidered in order to be used with GPUs, which
are parallel architectures. To do that, W () has to be defined
taking into account the level of parallelism exposed by GPUs.
Evaluating Q(), we must translate the concepts of Z and L,
which refer to cache characteristics. More precisely, a GPU
provides p MPs, each one with a local memory. We
can abstract such an architectural organization by considering
each local memory as one cache line, and the union of all
local memories as the entire cache, taking no care of which
processor is using the data. From this point of view, if the shared
memory of each MP is 16 KB, we obtain L = 16 KB and
Z = 16 · p KB.</p>
    </sec>
    <sec id="sec-9">
      <title>BITONIC SORT</title>
<p>
        To design our sorting algorithm in the stream
programming model, we started from the popular Bitonic Sort (BS)
network and extended it to adapt it to our specific
architecture. Specifically, BS is one of the fastest sorting networks
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A sorting network is a special kind of sorting
algorithm, where the sequence of comparisons does not depend
on the order in which the data is presented, see Figure
1. This makes sorting networks suitable for implementation
on GPUs. In particular, the regularity of the schema used
to compare the elements to be sorted makes this kind of sorting
network particularly suitable for partitioning the elements in
the stream programming fashion, as GPUs require. Finally,
we compared the theoretical results with the ones resulting from
the tests, in order to assess whether the adapted ideal-cache model is
useful for abstracting GPUs.
      </p>
<p>[Figure 1: the steps of a bitonic sorting network.]</p>
<p>The main issue to address is defining an efficient schema
to map all the comparisons involved in the BS onto the elements
composing the streams involved.</p>
<p>The first constraint is that the elements composing each
stream must be “distinct”. This means that each item in
the input has to be included in exactly one element of the
stream. From this point of view, each stream of elements
defines a partition of the input array to be sorted. This
characteristic is required by the runtime support, which
does not permit any type of data exchange, or
synchronization, between two elements (Figure 2).</p>
<p>[Figure 2: a kernel stream; each element is an independent partition of the input array.]</p>
<p>The second aspect to optimize is the partitioning of the
steps composing the sorting network. Since a BS is
composed of several steps (Figure 1), we have to map the
execution of all steps into a sequence of independent runs, each of
them corresponding to the invocation of a kernel. Since each
kernel invocation implies a communication phase, such a
mapping should be done so as to reduce the number of these
invocations, and thus the communication overhead. Specifically,
this overhead is generated whenever the SIMD processor
begins the execution of a new stream element. In that case, the
processor needs to flush the results contained in its
shared memory, and then to fetch the new data from the off-chip
memory. In the ideal-cache computational model, this
corresponds to a cache-miss event, which degrades the performance
of the algorithm.</p>
<p>Summing up, performing several network steps in the same
kernel has the double effect of reducing the number of
cache misses, i.e. improving the Q() metric, and of increasing the number
of arithmetic operations per memory reference. The only
constraint is that the computation of an element has to be
independent of that of any other element in the same
stream.</p>
<p>[Figure 3: grouping consecutive network steps into a single kernel.]</p>
<p>Let us introduce our solution. First of all, we need to
establish the number of consecutive steps to be executed by
one kernel. We must consider that, for each additional step assigned
to a kernel, in order to keep all the stream elements
independent, the number of memory locations needed by the
corresponding partition doubles, see Figure 3. So, the number
of steps a kernel can cover is bounded by the number of
items that it is possible to include in a stream element.
Furthermore, the number of items is bounded by the size of
the shared memory available on each SIMD processor.</p>
      <p>Algorithm 1: Bitonic Sort algorithm.</p>
      <p>a ← array to sort
for s = 1 to log2 |a| do
for c = s − 1 down to 0 do
for r = 0 to |a| − 1 do
if ⌊r/2^c⌋ ≡ ⌊r/2^s⌋ (mod 2) ∧ a[r] &gt; a[r ⊕ 2^c] then</p>
      <p>a[r] ⇆ a[r ⊕ 2^c]
end if
end for
end for
end for</p>
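      <p>As a concrete illustration, Algorithm 1 can be transcribed almost literally into executable form. The following Python sketch reproduces only the sequential network, not the CUDA kernel itself; integer division is used so the guard matches the ⌊r/2^c⌋ and ⌊r/2^s⌋ terms of the pseudocode:</p>

```python
def bitonic_sort(a):
    """In-place bitonic sort following Algorithm 1; len(a) must be a power of two."""
    n = len(a)
    log_n = n.bit_length() - 1
    assert n == 2 ** log_n, "length must be a power of two"
    for s in range(1, log_n + 1):            # outer loop over network stages
        for c in range(s - 1, -1, -1):       # steps within a stage
            for r in range(n):
                partner = r ^ 2 ** c         # r XOR 2^c
                # guard: floor(r / 2^c) mod 2 == floor(r / 2^s) mod 2
                if (r // 2 ** c) % 2 == (r // 2 ** s) % 2 and a[r] > a[partner]:
                    a[r], a[partner] = a[partner], a[r]
    return a
```

      <p>Exactly one index of each compared pair satisfies the guard, so the sequential loop over r faithfully emulates the parallel compare-exchange step.</p>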
      <p>More precisely, to know how many steps can be included
in one partition, we have to count how many distinct values c
assumes, see Algorithm 1. Given the fixed size of the shared
memory, i.e. 16 KB, we can specify partitions of SH =
4 K items, for items of 32 bits. Moreover, such a partition
is able to cover at least sh = log(SH) = 12 steps. From
this evaluation it is also possible to estimate the size of the
kernel stream: if a partition representing an element of the
stream contains SH items, and the array a to sort contains
N = 2^n items, then the stream contains b = N/SH = 2^(n−sh)
elements.</p>
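      <p>The sizing argument above can be checked with a few lines of arithmetic. The constants below (16 KB of shared memory per SIMD processor, 32-bit items) are the figures stated in the text; the function name is ours:</p>

```python
# Figures taken from the text: 16 KB shared memory, 32-bit items.
SHARED_MEM_BYTES = 16 * 1024
ITEM_BYTES = 4

SH = SHARED_MEM_BYTES // ITEM_BYTES   # items per stream element: 4096
sh = SH.bit_length() - 1              # steps coverable per kernel: log2(SH) = 12

def stream_elements(n):
    """Number of stream elements b = N / SH for an array of N = 2^n items."""
    return 2 ** (n - sh)
```

      <p>For the n = 32 example used later in the text this gives b = 2^20 stream elements.</p>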
      <p>An important consideration must be made for the first
kernel invoked by the algorithm: as long as the variable s in
Algorithm 1 does not exceed sh, the computation of the
first several steps can be done by this first kernel. This is
because the values assumed by c remain within a range of sh
distinct values. More precisely, the first kernel computes the
first sh·(sh+1)/2 steps (Figure 3).</p>
      <p>This access pattern can be translated into a
function ℓc(i) that, for the current kernel stream and given the
current value of c, defines the subset of a to be assigned
to the i-th stream element. In other words, ℓ describes a
method to enumerate the set of indexes of a that the i-th
element of the kernel stream has to process. Formally, it is
defined as:</p>
      <p>ℓc : i ∈ [0, b − 1] → Π ⊆ πsh(a)
where πsh(a) is a partition of a, namely a set of nonempty
subsets of a such that every element in a is in exactly one of
these subsets, and each subset contains exactly 2^sh elements
of a.</p>
      <p>Let us assume n = 32 and that the shared memory
can contain 4 K items, so we have |a| = 2^32 and b = 2^(n−sh) =
2^(32−12) = 2^20 elements for each stream. Basically, we need
32 bits to address an element of a, and log(b) = 20 bits to
identify the i-th partition among the b existing ones. The value
i is used to build a bit-mask that is equal for each address
produced by ℓc(i). Such a mask sets log(b) bits of the 32 bits
composing an index of a. The missing sh bits are generated
by using a variable x to enumerate all values in the range
X = [0, 2^sh − 1] and by inserting each of them at the c-th position
of the mask i. This composition leads to a set of addresses
of n bits whose corresponding items compose the i-th partition.
Formally, we obtain:</p>
      <p>ℓc(i) = {x ∈ X : i[31...c] ◦ x ◦ i[c−1...0]}
where the notation i[31...c] identifies the leftmost bits, namely
from the 31st bit down to the c-th one, and “◦” is the
concatenation operator.</p>
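      <p>A minimal sketch of ℓc(i) in Python, assuming the mask construction described above: the bits of i below position c are kept in place, the sh bits of x are inserted at position c, and the remaining bits of i are shifted up. Arithmetic operations are used in place of bit manipulation to mirror the ⌊·⌋ notation:</p>

```python
def ell(i, c, sh):
    """Addresses of array a assigned to stream element i: the sh free bits x
    are inserted at bit position c of the mask built from i."""
    lo = i % 2 ** c            # low part of the mask: i[c-1...0]
    hi = i // 2 ** c           # high part of the mask
    return [hi * 2 ** (c + sh) + x * 2 ** c + lo for x in range(2 ** sh)]
```

      <p>On toy sizes one can verify that the images of ℓc(i) for i = 0, ..., b − 1 are pairwise disjoint and cover all of a, i.e. they form the partition πsh(a); the addresses of one element differ only in bits c to c + sh − 1, exactly the positions exercised by the sh steps the kernel covers.</p>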
      <p>The rule to compose the elements of ℓc(i) is simple, but in
some cases it leads to an exception. When c &lt; sh, x is
divided into two parts, xL and xR, which are inserted at the
c′-th position and at the c-th position of i respectively. In
particular, the case c &lt; sh occurs whenever the middle loop
of Algorithm 1 ends and c starts a new iteration with the new
value of s, denoted with c′. Specifically, depending on the
current value of c, the algorithm needs to make two
insertions: xR at position c, and xL at position c′. Formally, when
c &lt; sh, we define a second function ℓc,c′ () as follows:</p>
      <p>ℓc,c′ (i) = {x ∈ X : i[31...c̄] ◦ xL ◦ i[c′+sh−c...c] ◦ xR}
where xL = x[sh−1...c], xR = x[c−1...0], and c′ = s + 1.
</p>
    </sec>
    <sec id="sec-10">
      <title>Evaluation</title>
      <p>For our solution we obtain that W (n, p), where p
indicates the number of MPs and n the size of a, is equal
to the computational cost of the sequential BS divided by p,
specifically W (n, p) = O((n/p) · log^2 n). To know Q(n, Z, L)
we must estimate the number of cache-misses. Assuming
|a| = n, we obtain that the sorting network is made of
σ = (log^2(n) + log(n))/2 steps. Furthermore, let us assume
L = SH, Z = SH · p, and that each kernel covers sh steps,
except the first one, which covers σ′ = (sh^2 + sh)/2 steps. Then the
number of cache-misses is ⌈(σ − σ′)/sh⌉.</p>
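      <p>The cache-miss estimate can be reproduced numerically; this sketch simply evaluates the formulas above (σ, σ′, and ⌈(σ − σ′)/sh⌉) for a given input size, with sh = 12 as assumed throughout the text:</p>

```python
from math import ceil, log2

SH_STEPS = 12   # sh = log2(SH): steps covered by each (non-first) kernel

def cache_misses(n, sh=SH_STEPS):
    """Cache-miss estimate from the text's formulas: sigma total network
    steps, sigma' covered by the first kernel, one miss per sh-step kernel."""
    logn = int(log2(n))
    sigma = (logn ** 2 + logn) // 2        # (log^2 n + log n) / 2
    sigma_first = (sh ** 2 + sh) // 2      # (sh^2 + sh) / 2
    return ceil((sigma - sigma_first) / sh)
```

      <p>For n = 2^32 this gives σ = 528, σ′ = 78, and ⌈450/12⌉ = 38 cache-misses.</p>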
      <p>A final consideration regards the W (n, p) measure, which should
be estimated considering that each MP is a SIMD processor.
In fact, each MP reaches its maximum performance
whenever the data-flow permits the control unit to issue the
same instruction for all cores. In this way such an instruction
is executed in parallel on different data in a SIMD fashion.</p>
      <p>In order to compare our bitonic sort with the quick sort
proposed by Cederman et al., we tried to extend the analysis
of the ideal-cache model metrics to their solution. Unfortunately,
their solution cannot be measured as accurately as
ours. In particular, it is possible to estimate the
number of transfers among multiple levels of cache, but quick
sort also uses off-chip memory to implement the prefix-sum run
for each stream element. In particular, quick sort splits the
input array into two parts with respect to a pivot by invoking
a procedure on the GPU, and recursively repeats this
operation until each part can be entirely contained in the shared
memory. In the optimistic case, assuming |a| = n and a
shared memory of size L = SH, this operation is invoked
log(n) − log(L) times, which is also the number of cache-misses
for quick sort, namely Q(). This value is noticeably lower than
the Q() measured for bitonic sort, but the evaluation of the
number of such cache-misses should be proportionally
increased due to the prefix-sum procedure. Regarding W (),
the authors report a computational complexity equal to the
one obtained for the sequential case, i.e. O(n · log(n)), without
referring to the parallelism of the underlying hardware.
However, an optimistic evaluation of such a parallel version should
be, also in this case, lower than the W () computed for bitonic
sort.
</p>
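      <p>The optimistic transfer count for GPU quick sort can likewise be sketched in one line; the function name is ours, and L defaults to the SH = 4096 items assumed for bitonic sort:</p>

```python
from math import log2

def quicksort_invocations(n, L=4096):
    """Optimistic count of GPU quick sort partitioning rounds (= its Q()):
    split around a pivot until every part fits in shared memory of L items."""
    return int(log2(n)) - int(log2(L))

# For n = 2^32 this gives 20 rounds, versus the 38 cache-misses produced by
# the text's bitonic-sort formula for the same n and sh = 12 (before the
# extra prefix-sum traffic of quick sort is accounted for).
```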
    </sec>
    <sec id="sec-11">
      <title>RESULTS</title>
      <p>
        The experiments have been conducted on an Ubuntu Linux
desktop with an Nvidia 8800GT, namely a GPU provided
with 14 SIMD processors, and CUDA SDK 2.1. To
generate the arrays for these preliminary tests we used uniform,
Gaussian and Zipfian distributions. Specifically, for each
distribution 20 different arrays were generated. Figure 4 shows
the mean, the standard deviation, the maximum and the
minimum of the times elapsed over all runs of each
distribution. The tests involved the CPU-based quick sort provided
by the C standard library, our solution, and the one proposed
by Cederman et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In that paper, GPU-based
quick sort was shown to be the best performing algorithm in the
literature, so our preliminary tests consider only
that GPU-based algorithm.
      </p>
      <p>Figure 4 shows that GPU-based solutions are able to
outperform the CPU for the sorting problem. The
same figure also shows that our solution is not competitive
with respect to the one of Cederman until the number of
items to sort reaches 8 MB. A further observation is that
GPU-based quick sort is not able to successfully complete
the tests for arrays greater than 16 MB. Further analysis
pointed out that the bitonic algorithm spends most of
its elapsed time in the data-transfer phase of some specific
kernel instances. Since element streams were always of the
same size, we deduced that the number of transfers is not
the only aspect to take into consideration to minimize the
communication overhead, as the metrics of the ideal-cache model
suggest.</p>
      <p>
        As suggested by Helman et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], a deeper evaluation of
the algorithm can be conducted by using arrays generated
from different distributions. This type of test highlights
the behavior of the algorithms with respect to the
variance of the times obtained in different contexts. For the
distributions tested, the bitonic sort algorithm shows a better
behavior with respect to variance. Obviously, this result is
explained by the type of algorithm used. Quick sort is a
data-dependent approach, whereas sorting networks are based on a
fixed schema of comparisons that does not vary with the
data distribution.
      </p>
      <p>Considerations on the results obtained from these
preliminary tests suggest that the ideal-cache model does not seem
sufficiently accurate to abstract the GPU architecture. While
the theoretical results lead to better performance for GPU-based
quick sort, the tests conducted show that bitonic
sort has a better performance trend as the size of
the problem increases. This consideration is reinforced by the analysis
of the data-transfer: we strongly believe that by improving
the data-transfer bandwidth, bitonic sort can reach better
results without increasing the theoretical W () and Q() metrics.
</p>
    </sec>
    <sec id="sec-12">
      <title>FUTURE WORKS</title>
      <p>Preliminary results show that the number of transfers is
not the only aspect to take into consideration for
minimizing communication overhead. Another important factor is
the transfer bandwidth, which is relevant to achieving better
results. The results show that the ideal-cache model is not able
to fully describe and capture all the aspects determining the
performance of such architectures. Probably
different kinds of performance metrics are needed to evaluate
algorithms on these novel hardware resources.</p>
      <p>Furthermore, since indexing is a tuple-sorting operation,
we will extend our solution to include the sorting of tuples
of integers. In this paper, in fact, we assume the tuples are
sorted by using multiple passes over the dataset. We leave
the extension to tuples to future work.</p>
      <p>Of course, we have to run more tests to strengthen the results
obtained and to analyze in more depth the causes of the
performance loss affecting our algorithm.
</p>
    </sec>
    <sec id="sec-13">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research has been supported by the Action IC0805:
Open European Network for High Performance Computing
on Complex Environments.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>NVIDIA</surname>
          </string-name>
          .
          <source>CUDA Compute Unified Device Architecture - Programming Guide</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Castillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Junqueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          .
          <article-title>Challenges on distributed web retrieval</article-title>
          .
          <source>In ICDE. IEEE</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Barroso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U.</given-names>
            <surname>Hölzle</surname>
          </string-name>
          .
          <article-title>Web search for a planet: The google cluster architecture</article-title>
          .
          <source>Micro</source>
          , IEEE,
          <volume>23</volume>
          (
          <issue>2</issue>
          ):
          <fpage>22</fpage>
          -
          <lpage>28</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Barroso</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Hölzle</surname>
          </string-name>
          .
          <article-title>The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines</article-title>
          . Morgan &amp; Claypool, San Rafael, CA, USA,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Batcher</surname>
          </string-name>
          .
          <article-title>Sorting networks and their applications</article-title>
          .
          <source>In AFIPS '68 (Spring): Proceedings of the April 30-May 2, 1968, Spring Joint Computer Conference</source>
          , pages
          <fpage>307</fpage>
          -
          <lpage>314</lpage>
          , New York, NY, USA,
          <year>1968</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bilardi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Nicolau</surname>
          </string-name>
          .
          <article-title>Adaptive bitonic sorting: An optimal parallel algorithm for shared memory machines</article-title>
          .
          <source>Technical report</source>
          , Ithaca, NY, USA,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bovay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. H.</given-names>
            <surname>Brent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Wadleigh</surname>
          </string-name>
          .
          <article-title>Accelerators for high performance computing investigation. White paper, High Performance Computing Division - Hewlett-</article-title>
          <string-name>
            <surname>Packard</surname>
            <given-names>Company</given-names>
          </string-name>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cederman</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Tsigas</surname>
          </string-name>
          .
          <article-title>A practical quicksort algorithm for graphics processors</article-title>
          .
          <source>In ESA '08: Proceedings of the 16th annual European symposium on Algorithms</source>
          , pages
          <fpage>246</fpage>
          -
          <lpage>258</lpage>
          , Berlin, Heidelberg,
          <year>2008</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Laudon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Olukotun</surname>
          </string-name>
          .
          <article-title>Maximizing cmp throughput with mediocre cores</article-title>
          .
          <source>In PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques</source>
          , pages
          <fpage>51</fpage>
          -
          <lpage>62</lpage>
          , Washington, DC, USA,
          <year>2005</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dowd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Perl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rudolph</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Saks</surname>
          </string-name>
          .
          <article-title>The periodic balanced sorting network</article-title>
          .
          <source>J. ACM</source>
          ,
          <volume>36</volume>
          (
          <issue>4</issue>
          ):
          <fpage>738</fpage>
          -
          <lpage>757</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Frigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Leiserson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Prokop</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramachandran</surname>
          </string-name>
          .
          <article-title>Cache-oblivious algorithms</article-title>
          .
          <source>In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science</source>
          , Washington, DC, USA,
          <year>1999</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Govindaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Manocha</surname>
          </string-name>
          .
          <article-title>Gputerasort: High performance graphics coprocessor sorting for large database management</article-title>
          .
          <source>In ACM SIGMOD International Conference on Management of Data</source>
          , Chicago, United States,
          <year>June 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Govindaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Raghuvanshi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Manocha</surname>
          </string-name>
          .
          <article-title>Fast and approximate stream mining of quantiles and frequencies using graphics processors</article-title>
          .
          <source>In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data</source>
          , pages
          <fpage>611</fpage>
          -
          <lpage>622</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Greß</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Zachmann.</surname>
          </string-name>
          <article-title>GPU-ABiSort: Optimal parallel sorting on stream architectures</article-title>
          .
          <source>In The 20th IEEE International Parallel and Distributed Processing Symposium (IPDPS '06)</source>
          , page 45, Apr.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Heinz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. E.</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Burst tries: A fast, efficient data structure for string keys</article-title>
          .
          <source>ACM Transactions on Information Systems</source>
          ,
          <volume>20</volume>
          (
          <issue>2</issue>
          ):
          <fpage>192</fpage>
          -
          <lpage>223</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Helman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Bader</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>JáJá</surname>
          </string-name>
          .
          <article-title>A randomized parallel sorting algorithm with an experimental study</article-title>
          .
          <source>Technical report, Journal of Parallel and Distributed Computing</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>U. J.</given-names>
            <surname>Kapasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Dally</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rixner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Mattson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Owens</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Khailany</surname>
          </string-name>
          .
          <article-title>Efficient conditional operations for data-parallel architectures</article-title>
          .
          <source>In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture</source>
          , pages
          <fpage>159</fpage>
          -
          <lpage>170</lpage>
          . ACM Press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B.</given-names>
            <surname>Khailany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Dally</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. J.</given-names>
            <surname>Kapasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mattson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Namkoong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Owens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Towles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rixner</surname>
          </string-name>
          .
          <article-title>Imagine: Media processing with streams</article-title>
          .
          <source>IEEE Micro</source>
          ,
          <volume>21</volume>
          (
          <issue>2</issue>
          ):
          <fpage>35</fpage>
          -
          <lpage>46</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kipfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Segal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Westermann</surname>
          </string-name>
          .
          <article-title>Uberflow: a gpu-based particle engine</article-title>
          .
          <source>In HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware</source>
          , pages
          <fpage>115</fpage>
          -
          <lpage>122</lpage>
          , New York, NY, USA,
          <year>2004</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kipfer</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Westermann</surname>
          </string-name>
          .
          <article-title>Improved GPU sorting</article-title>
          . In M. Pharr, editor,
          <source>GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation</source>
          , pages
          <fpage>733</fpage>
          -
          <lpage>746</lpage>
          . Addison-Wesley,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lester</surname>
          </string-name>
          .
          <article-title>Fast on-line index construction by geometric partitioning</article-title>
          .
          <source>In Proceedings of the 14th ACM Conference on Information and Knowledge Management (CIKM 2005)</source>
          , pages
          <fpage>776</fpage>
          -
          <lpage>783</lpage>
          . ACM Press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          . An Introduction to Information Retrieval.
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Nichols</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Siegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. G.</given-names>
            <surname>Dietz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Quong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. G.</given-names>
            <surname>Nation</surname>
          </string-name>
          .
          <article-title>Eliminating memory for fragmentation within partitionable simd/spmd machines</article-title>
          .
          <source>IEEE Trans. Parallel Distrib. Syst.</source>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ):
          <fpage>290</fpage>
          -
          <lpage>303</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Purcell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Donner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cammarano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Jensen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Hanrahan</surname>
          </string-name>
          .
          <article-title>Photon mapping on programmable graphics hardware</article-title>
          .
          <source>In Proceedings of the ACM SIGGRAPH Conference on Graphics Hardware</source>
          , pages
          <fpage>41</fpage>
          -
          <lpage>50</lpage>
          . Eurographics Association,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sengupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Owens</surname>
          </string-name>
          .
          <article-title>Scan primitives for GPU computing</article-title>
          .
          <source>In GH '07: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware</source>
          , pages
          <fpage>97</fpage>
          -
          <lpage>106</lpage>
          ,
          Aire-la-Ville, Switzerland,
          <year>2007</year>
          . Eurographics Association.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sinha</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Wirth</surname>
          </string-name>
          .
          <article-title>Engineering burstsort: Towards fast in-place string sorting</article-title>
          .
          <source>In WEA</source>
          , pages
          <fpage>14</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Ring</surname>
          </string-name>
          .
          <article-title>Cache-efficient string sorting using copying</article-title>
          .
          <source>ACM Journal of Experimental Algorithmics</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>E.</given-names>
            <surname>Sintorn</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Assarsson</surname>
          </string-name>
          .
          <article-title>Fast parallel GPU-sorting using a hybrid algorithm</article-title>
          .
          <source>Journal of Parallel and Distributed Computing</source>
          , In Press, Corrected Proof.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Bell</surname>
          </string-name>
          .
          <source>Managing Gigabytes: Compressing and Indexing Documents and Images</source>
          . Morgan Kaufmann Publishers, San Francisco, CA,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>