<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Critical Evaluation of Existing External Sorting Methods in the Perspective of Modern Hardware</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Krulis</string-name>
          <email>krulis@ksi.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Parallel Architectures/Applications/Algorithms Research Group, Faculty of Mathematics and Physics, Charles University in Prague, Malostranske nam. 25, Prague, Czech Republic</institution>
        </aff>
      </contrib-group>
      <fpage>77</fpage>
      <lpage>88</lpage>
      <abstract>
        <p>External sorting methods, which are designed to order large amounts of data stored in persistent memory, have been known for decades. These methods were originally designed for systems with a small amount of operating (internal) memory and magnetic tapes used as external memory. Data on magnetic tapes have to be accessed in a strictly serial manner, and this limitation shaped the external sorting algorithms. In time, magnetic tapes were replaced with hard drives, which are now being replaced with solid state drives. Furthermore, the amount of operating memory in mainstream servers has increased by orders of magnitude, and the future may hold even more impressive innovations such as nonvolatile memories. As a result, most of the assumptions of the external sorting algorithms are no longer valid, and these methods need to be revised to better reflect the hardware of the day. In this work, we critically evaluate the original assumptions in an empirical manner and propose possible improvements.</p>
      </abstract>
      <kwd-group>
        <kwd>sorting</kwd>
        <kwd>external sorting</kwd>
        <kwd>hard disk</kwd>
        <kwd>parallel</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Sorting is perhaps the most fundamental algorithm which reaches well beyond
computer science. Along with relational JOIN operation, it is one of the most
often employed data processing steps in most database systems. Despite the fact
that the amount of operating memory of current desktop and server computers is
increasing steadily, many problems still exist where the sorting operation cannot
be performed entirely in the internal memory. In such situations, the data has to
be stored in persistent storage such as hard drives and external sorting algorithms
must be employed.</p>
      <p>External sorting algorithms were designed rather long ago, when
magnetic tapes were typical representatives of external memory. Even though
magnetic tapes are rarely used at present (except for specialized applications
such as data archiving), the fundamental principles of these algorithms are still
being used. These principles are based on assumptions which no longer hold,
most importantly the following:</p>
      <p>This paper was supported by Czech Science Foundation (GACR) project P103/14/14292P.</p>
      <p>M. Necasky, J. Pokorny, P. Moravec (Eds.): Dateso 2015, pp. 77–88, CEUR-WS.org/Vol-1343.</p>
      <p>External memory has to be accessed sequentially, or the sequential access is
significantly more efficient.</p>
      <p>There is only a small amount of internal memory, and the external memory
is several orders of magnitude larger.</p>
      <p>External memory is slower than the internal memory by many orders of
magnitude.</p>
      <p>In this work, we subject these assumptions to an empirical
evaluation in order to critically assess their validity. We have performed a number
of tests designed to determine the performance properties of mainstream systems,
hard drives, and solid state drives. Based on the results of these experiments,
we propose modifications of existing methods in order to improve their
performance.</p>
      <p>In the remainder of the paper, we will assume that the data being sorted
are represented as a sequence of items, where all items have the same size.
An item has a value called the key, which is used for sorting. We have used only
numerical keys in our experiments; however, we make no assumptions
about the keys other than their relative comparability. Furthermore, we expect that
a sorting algorithm produces a sequence of items where the keys are arranged in
ascending order.</p>
      <p>The paper is organized as follows. Section 2 summarizes related work.
The traditional approach to external sorting is revisited in Section 3. Section 4 presents
the experimental results that assess the individual operations required for
external sorting. Finally, Section 5 proposes modifications to existing methods.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The external sorting (i.e., sorting of data that does not fit the internal
memory and has to be stored in external persistent memory) is one of the key
problems of data processing. Sorting operations are widely employed in various
domains [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and they also form essential parts of other algorithms. Perhaps the
first study of problems that require intensive utilization of external memory was
presented in the thesis of Demuth [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] more than 50 years ago, which focused mainly
on data sorting.
      </p>
      <p>
        A rather broad study of various problem instances and of various sorting
algorithms was conducted by Knuth [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] in the 1970s. This study is presented in the
Art of Computer Programming books, which are still considered to be among the
best educational materials regarding algorithms and data structures. Volume 3
is dedicated to sorting and searching, and it describes commonly used methods of
external sorting, such as multiway merging, polyphase merging, and various
improvements. Many derived algorithms and methods for external data sorting [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
have been published since then, but they all share the fundamental assumptions
laid down by Knuth.
      </p>
      <p>
        The data communication between the fast internal memory and the slower external
memory is still considered the bottleneck of large data processing [
        <xref ref-type="bibr" rid="ref11 ref16 ref2">2, 11, 16</xref>
        ]. Most
algorithms attempt to avoid this bottleneck by minimizing the total number of
input-output operations, employing various strategies for internal memory
data caching [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Another widely utilized technique is to perform the
external memory operations asynchronously, so they can overlap with internal data
processing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Originally, external sorting employed magnetic tapes as external memory.
When magnetic disks arrived, they allowed (almost) random access to the data,
so that one disk can store multiple data streams being merged. However,
magnetic drives also have their limits, and sequential access to the data is still
preferable. Hence, it might be beneficial to employ multiple disks, either
managed by the sorting algorithm directly or using striping to divide data among
the disks evenly (e.g., by RAID 0 technology) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The newest advances in
hardware development, especially the emergence of Solid State Drives (SSD) that
employ FLASH memory instead of traditional magnetic recording, have been
studied from the perspective of sorting algorithms. The SSD drives are
particularly interesting from the energy consumption point of view [
        <xref ref-type="bibr" rid="ref1 ref14">1, 14</xref>
        ]; however,
the cited papers follow the original approach to external sorting, and their
main objective is the minimization of I/O operations.
      </p>
      <p>
        Sorting of large datasets can also be accelerated by employing multiple
computers interconnected with some networking technology. Clusters of computers
may have more combined internal memory than individual servers, so a
distributed solution may even achieve superlinear speedup in some cases. On the
other hand, distributed sorting methods introduce new problems, such as
communication overhead or load balancing [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Furthermore, distributed sorting
is an integral part of the Map-Reduce algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which is widely used for
distributed data processing on a large scale.
      </p>
      <p>
        Internal sorting methods are an integral part of external sorting, since
it may be quite beneficial to pre-sort the data partially, by chunks that fit the
internal memory. Initially, the internal sorting methods were summarized
by Knuth in the Art of Computer Programming (vol. 3) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. More recently,
Larson et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] studied several internal sorting methods and assessed their
applicability for the initial part of external sorting.
      </p>
      <p>
        In the past decade, hardware development has employed parallel
processing on many levels. A practical combination of internal sorting methods
which takes advantage of the parallel features of modern CPUs was proposed by
Chhugani [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We have also observed the emergence of platforms for massively
parallel data processing (like GPGPUs), and so several different algorithms were
developed for manycore GPU accelerators [
        <xref ref-type="bibr" rid="ref10 ref13">13, 10</xref>
        ]. Finally, we would like to point
out the work of Falt et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which proposes an adaptation of Mergesort for
in-memory parallel data streaming systems. This work is particularly interesting,
since we believe that a similar approach can be adopted for external sorting.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Traditional Approach to External Sorting</title>
      <p>The traditional algorithms of external sorting have two major parts. The first
part employs methods of internal sorting to form runs – long sequences of ordered
values. The second part merges the runs into one final run, which is also the final
result of the sorting problem. The algorithms that employ this approach may
vary in details, such as how the runs are generated or how many runs are being
merged in each step of the second part.</p>
      <sec id="sec-3-1">
        <title>Forming Runs</title>
        <p>
          Perhaps the most direct method for creating a run is the utilization of an internal
sorting algorithm, like Quicksort [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for instance. The application allocates as
large a memory buffer as possible and fills it with data from the disk. The data
in the buffer are sorted by an internal sorting algorithm and written back to
the hard drive as a single run. These steps are repeated until all input data are
pre-sorted into runs.
        </p>
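        <p>This first phase can be sketched as follows. The fragment below is a minimal Python illustration (the paper's own experiments use C++; lists stand in for disk files): the input is consumed in buffer-sized chunks, each chunk is sorted internally and emitted as one run.</p>

```python
def form_runs(items, buffer_size):
    """Pre-sort the input into runs, each at most buffer_size items long.

    Lists stand in for the input file and the runs written to disk;
    a real implementation would read and write blocks of items instead.
    """
    runs = []
    for start in range(0, len(items), buffer_size):
        # fill the memory buffer, sort it internally, emit it as one run
        runs.append(sorted(items[start:start + buffer_size]))
    return runs

runs = form_runs([5, 1, 4, 2, 8, 7, 3, 6], buffer_size=3)
# every run is sorted and no run exceeds the buffer size
```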
        <p>
          Direct application of internal sorting generates runs whose length
corresponds to the internal memory size. Even though intuition suggests that
this is the optimal solution, Knuth [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] describes an improvement which can be
used to increase the average length of the runs. This improvement is based on
an adjusted version of the Heapsort [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] algorithm, i.e., an internal sorting algorithm
that employs regular heaps (sometimes also denoted as priority queues).
        </p>
        <p>The algorithm also utilizes as much internal memory as possible. At the
beginning, input data are loaded into the internal memory buffer and a d-regular
heap is constructed in-place using the bottom-up algorithm. Then the algorithm
iteratively performs the following steps:
1. The minimum m (top of the heap) is written to the currently generated run (but
not removed from the heap yet).
2. A new item x is loaded from the input data file. If key(m) ≤ key(x), then
x replaces the top of the heap (item m) and the heap is reorganized as if
the increase-key operation had been performed on the top. Otherwise, the
remove-top operation is performed on the heap, so its size is reduced by 1,
and the item x is stored at the position in the main memory buffer vacated
by the heap.
3. If the main heap becomes empty in the previous step, the currently generated
run is completed and a new run is started. At the same time, the main
memory buffer is already filled with input data values, so a heap structure
is constructed from them.</p>
        <p>When the input file ends, the algorithm empties the heap into the current run
by repeating the remove-top step. After that, the remaining data outside of the
heap are sorted by internal sorting and saved as another run.</p>
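        <p>The two-heap scheme above can be sketched in Python (an illustration only, not the paper's implementation; the standard binary heap of heapq stands in for the d-regular heap, and a plain list collects the items placed outside the shrinking heap):</p>

```python
import heapq

def replacement_selection(stream, buffer_size):
    """Generate runs via the two-heap scheme: an item smaller than the
    last key written is held back (outside the heap) for the next run."""
    it = iter(stream)
    heap = [x for _, x in zip(range(buffer_size), it)]  # fill the buffer
    heapq.heapify(heap)
    runs = []
    while heap:
        current, pending = [], []
        while heap:
            m = heapq.heappop(heap)      # minimum, written to current run
            current.append(m)
            x = next(it, None)           # load a replacement item
            if x is not None:
                if x >= m:
                    heapq.heappush(heap, x)  # may still join the current run
                else:
                    pending.append(x)        # must wait for the next run
        runs.append(current)             # heap empty: run completed
        heap = pending                   # held-back items start the next heap
        heapq.heapify(heap)
    return runs
```

For uniformly distributed keys the average run length produced this way approaches twice the buffer size, matching the 2N result discussed in this section.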
        <p>Let us emphasize two important implementation details. Despite the fact that the
algorithm operates on individual items, the external I/O operations are always
performed on larger blocks of items, and thus both data input and data output
are buffered. Furthermore, the explicit construction of a new heap (in step 3) can
be amortized into step 2, so that when an item x is placed outside the heap, one
step of the bottom-up construction algorithm is performed. This way a secondary
heap is constructed incrementally as the first heap shrinks.</p>
        <p>
          If the keys exhibit a uniform distribution, the two-heap algorithm can generate
runs with an average length of 2N, where N is the number of items stored in the
internal memory buffer. This fact is explained in detail by Knuth [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and we
have also verified it empirically.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Merging Runs</title>
        <p>The second part of external sorting performs the merging of the runs. The
main problem is that the number of runs being merged simultaneously may be
limited. In the past, computers had a limited number of magnetic tapes which
could be operated simultaneously. At present, operating systems have a limited
number of simultaneously opened files, and we can perform only a limited number of
concurrent I/O operations on the same disk. For these reasons, we can read or
write only N data streams (i.e., runs) at once. In the following, we will use the
term data stream as an abstraction, which could be related to a magnetic tape
or to a file on a hard disk, in order to simplify technical details such as opening
and closing files.</p>
        <p>The most direct approach to this problem is called two-phase merging. N−1
data streams are used as inputs and one data stream is used as output. At
the beginning, the runs are distributed evenly among the input streams. The
algorithm then alternates two phases (depicted in Figure 1), until there is only
one run in a single data stream. In the first phase, the runs are merged from
the input streams into the output stream. In the second phase, the runs from the
output stream are distributed evenly among the input streams.</p>
        <p>Fig. 1. Two-phase merging: sorted runs from N−1 input streams are merged into the output stream, and then distributed back among the input streams.</p>
        <p>The merging phase takes one run from each of the N−1 input streams and
merges them together into one run that is written to the output stream. The
merging of the runs is performed in the internal memory again. It can be
implemented hierarchically or by using a priority queue, for instance.</p>
        <p>The main disadvantage of two-phase merging is the necessity of distributing
the runs in the second phase. We can integrate the distribution phase with the
merging phase as follows. If the data streams cannot be easily replaced (e.g.,
in the case of magnetic tapes), we can reduce the number of input streams
to N/2 and utilize the remaining N/2 streams as output streams. Hence, the
merged runs are distributed in a round-robin manner among the output streams
as they are created. Unfortunately, the number of simultaneously merged runs
is reduced from N−1 to N/2, which increases the total number of iterations
performed by the algorithm.</p>
        <p>If the data streams can be replaced effectively and efficiently (e.g., a file
can be closed and another file can be opened), the distribution of the runs can
also be performed in the merging phase whilst N−1 input streams are used.
Furthermore, it might be possible to utilize more than N−1 input streams, if
the streams can be opened/closed or sought fast enough.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Polyphase Merging</title>
        <p>There is yet another way to amortize the distribution phase in the merging
phase. This method is called polyphase merging, since it repeats only the merging
step and the distribution of the runs is performed naturally. The merging of the
runs is also performed from N−1 input streams into one output stream, but
the initial distribution of the runs is not even. When one of the input streams
is emptied, this stream becomes the new output stream and the old output stream is
included among the input streams.</p>
        <p>The distribution of the runs has to be carefully planned in order to achieve
an optimal workload. The optimal run distribution for the last merging step is that
each input stream has exactly one run. One step before the last, the optimal
distribution is that N−2 input streams have two runs and the remaining input
stream has one run. Using this pattern, the initial distribution can be planned
retrospectively. An example of the merging plan for 21 runs and three streams
(N = 3) is presented in Table 1.</p>
        <p>Table 1. Polyphase merging plan for 21 runs and three streams (N = 3)
          beginning  #1  #2  #3  #4  #5  #6
stream 1      13      5   0   3   1   0   1
stream 2       8      0   5   2   0   1   0
stream 3       0      8   3   0   2   1   0</p>
        <p>At the beginning, the first stream holds 13 runs and the second 8 runs. The
third stream is used as the output stream. Each phase merges as many runs as
possible (e.g., 8 runs are merged in the first step). We may
observe that if there are two input streams, the distribution of the runs follows
the Fibonacci sequence. If the total number of runs is not suitable for an optimal
distribution, some streams may be padded with virtual runs which do not require
merging.</p>
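        <p>The retrospective planning can be sketched as follows (a Python illustration with names of our own choosing): start from the final state of one run in one stream, and repeatedly undo a merge phase. Undoing a phase empties the stream that was the output and adds its run count to every other stream, since each input stream must have held that many more runs before the merge.</p>

```python
def initial_distribution(n_streams, phases):
    """Plan polyphase merging backwards from the final state."""
    dist = [1] + [0] * (n_streams - 1)      # final state: a single sorted run
    for _ in range(phases):
        j = dist.index(max(dist))           # stream that was the last output
        t = dist[j]                         # runs produced by that phase
        dist = [runs + t for runs in dist]  # each input held t more runs...
        dist[j] = 0                         # ...and the output stream was empty
    return dist

initial_distribution(3, 6)   # [13, 8, 0] -- the 21 runs of Table 1
```

For N = 3 the successive nonzero counts walk up the Fibonacci sequence, as noted above.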
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>In this section, we present performance experiments that assess the individual
operations of external sorting, especially the internal sorting, the performance of
the heap data structure, and the disk I/O operations.</p>
      <p>The internal sorting experiments were conducted on a high-end desktop PC
with an Intel Core i7 870 CPU comprising 8 logical cores clocked at 2.93 GHz and
16 GB of RAM. Additional tests were performed on a 4-way cache coherent NUMA
server with four Intel Xeon E7540 CPUs comprising 12 logical cores clocked at
2.0 GHz and 128 GB RAM. RHEL (Red Hat Enterprise Linux) 7 was used
as the operating system on both machines.</p>
      <p>The I/O tests were conducted on three storages. Tests denoted HDD were
measured using one Seagate Barracuda HDD (2 TB, 7,200 rpm) connected via
a SATA II interface. Tests denoted SSD were conducted on 6 solid state drives
Samsung 840 Pro (512 GB each) in a software RAID 0 configuration. Finally, tests
denoted array used an Infortrend ESDS 3060 enterprise disk array which comprised
2 SSD disks (400 GB each) and 14 HDDs (4 TB, 7,200 rpm) connected in RAID
6. The array was connected via dual-link 16 Gbit Fibre Channel. Again,
RHEL 7 was used as the operating system and XFS was used as the file system.</p>
      <sec id="sec-4-1">
        <title>Internal Sorting</title>
        <p>In order to assess the benefits of the heap data structure for both generating the
runs and for multiway merging, we compare Heapsort with conventional sorting
methods. Before we can do that, we need to select the optimal heap degree (i.e.,
the branching degree of the tree that represents the heap) for this task.</p>
        <p>[Figure 2: Heapsort throughput in millions of items per second for various heap degrees, measured on the desktop PC and the NUMA server, for 32-bit and 64-bit keys with and without payload.]
Figure 2 presents the measured throughput of the Heapsort algorithm with
various heap degrees on 1 GB of data. The experiments were conducted using four
different item/key sizes: 32-bit (integer) keys, 32-bit integers with an additional
32-bit payload, 64-bit integer keys, and 64-bit keys with a 64-bit payload. The
payload simulates an additional index or pointer associated with the key, which
refers to additional (possibly large) data properties of the item.</p>
        <p>The results indicate that 2-regular heaps (which are often used as a default)
have poor performance. Degrees between 3 and 5, which reduce the height of the
tree, are a much better choice. On the other hand, the largest (16 B) items
exhibited a slightly different behaviour, which we have been unable to explain so far.
A more extensive study of this data structure has yet to be conducted. In the
following experiments, we have used the optimal heap degree for each situation.</p>
        <p>[Figure 3: Sorting throughput in millions of items per second (log scale) versus total data size (64M to 8G) for the std, tbb, and heap sorts; panels for 32-bit keys, 32-bit keys with 32-bit payload, 64-bit keys, and 64-bit keys with 64-bit payload.]</p>
        <p>In the internal sorting experiments, we compare the Heapsort algorithm with
the serial Quicksort implemented in the C++ STL library (std::sort) and the parallel
sorting algorithm implemented in the Intel Threading Building Blocks (TBB)
library. These algorithms may not be the fastest possible, but they serve as a
baseline which can be used for further comparisons.</p>
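        <p>The flavour of this comparison can be reproduced in miniature (a Python stand-in of our own, not the paper's C++/TBB benchmark): the built-in sort takes the role of std::sort, and a heap-based sort is timed against it.</p>

```python
import heapq
import random
import time

def heapsort(items):
    """Sort by building a binary heap and repeatedly removing the top."""
    heap = list(items)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

random.seed(42)
data = [random.random() for _ in range(100_000)]

t0 = time.perf_counter()
a = sorted(data)       # built-in sort, standing in for std::sort
t1 = time.perf_counter()
b = heapsort(data)     # heap-based sort
t2 = time.perf_counter()
assert a == b
# On typical machines the heap-based sort is noticeably slower, mirroring
# the Heapsort vs. std::sort gap visible in Figure 3.
```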
        <p>The results depicted in Figure 3 show that the Heapsort is several times
slower than the two other methods, and its performance degrades more rapidly on
larger datasets. Furthermore, the TBB sort is expected to scale with an increasing
number of cores, whereas no parallel implementation of Heapsort exists to the best
of our knowledge, and we believe that one is not feasible on current CPU architectures.</p>
        <p>We have conducted the experiments on the NUMA server as well. The server
CPUs have lower serial throughput, which is compensated by the number of
cores. Hence, the sorting throughput of the TBB algorithm is comparable, but
std::sort and the Heapsort exhibit significantly lower performance.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Sequential Data Reads</title>
        <p>The existing external sorting methods are designed to access data sequentially.
Hence, our first set of experiments measures the sequential read operations. Let us
emphasize that we use sequential access to the file that holds the data; however,
the underlying storage device may fragment the file or otherwise scatter the data.</p>
        <p>The following experiments perform sequential reads from binary data files
using blocks of fixed size. The data are read using 1-8 threads concurrently, where
each thread has its own file in order to avoid explicit synchronization at the
application level. In every case, the application reads a total amount of 64 GB (each
thread has an equal share). We have used three APIs for file management –
the traditional blocking API (sync), the asynchronous API (async), and memory mapped
files (mmap) – in order to assess their overhead in different situations.</p>
        <p>[Figure 4: Reading time in seconds versus block size (256k to 64M) for sequential reads.]</p>
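        <p>The sync and mmap access patterns can be illustrated as follows (a Python sketch of our own; the experiments themselves used native OS APIs). Both variants read a file front to back, either through blocking read calls on fixed-size blocks or through a memory-mapped view of the file.</p>

```python
import mmap

BLOCK = 1 << 20  # 1 MB block, one of the sizes swept in the experiments

def read_sync(path, block_size=BLOCK):
    """Sequential read through the blocking API, one block at a time."""
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)  # blocking read of one block
            if not block:
                break
            total += len(block)
    return total

def read_mmap(path):
    """Sequential read through a memory-mapped view of the file."""
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            for off in range(0, len(view), BLOCK):
                total += len(view[off:off + BLOCK])
    return total
```

An asynchronous variant would issue several block reads in flight at once so that I/O overlaps with processing, which is harder to express briefly and is omitted here.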
        <p>Fig. 5. Reading 64 GB sequentially from 6 SSD disks in RAID 0</p>
        <p>[Figure 6: Reading time versus block size (256k to 64M) on the enterprise disk array.]</p>
        <p>The last set of experiments, conducted on an enterprise disk array, is depicted
in Figure 6. The RAID nature of the array provides much better performance than
a single drive, but the additional overhead imposed by the data redundancy and
internal storage control makes the array less efficient than the SSD drives. The
array is also much less predictable, since the I/O scheduling algorithm of the
host system is probably clashing with the internal scheduler of the array.</p>
        <p>Fig. 7. Comparison of processing times of 1 GB of data (i.e., 256M of 32-bit items,
128M of 32+32-bit and 64-bit items, and 64M of 64+64-bit items): read times for the HDD,
SSD, and array storages, and std/tbb sorting times for each item size.</p>
        <p>Finally, let us compare the data reading times with the internal sorting times.
Figure 7 summarizes the results of reading and sorting operations performed on
1 GB of data. In general, the internal sorting is slower than reading from
persistent storage. Furthermore, the reading operation is expected to scale linearly
with the data size, while the sorting has time complexity O(N log N), and so it
will get even slower (relative to reading) when larger data buffers are used.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Random Data Access</title>
        <p>In the merging phase, the data streams are usually loaded from the same storage
device. In such a case, the device must load data from multiple locations in short
order. Hence, we would like to compare random access with sequential access.</p>
        <p>[Random access results: reading time versus block size (256k to 64M) for a single HDD, 6 SSDs in RAID 0, and the enterprise disk array.]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>The experimental results indicate that using heaps to increase the length
of initial runs has a negative impact on the performance. The internal sorting
takes more time than the data loads, thus the optimization of the internal sorting
methods becomes a priority. Another important fact yielded by the empirical
evaluation is that random access can be nearly as fast as sequential access
when the data are transmitted in large blocks. Hence, we can place as many
streams as required on the same storage (even in the same file) and access them
simultaneously. Based on this evidence, we propose the following:
The internal sorting used for generating runs should utilize the fastest
(parallel) algorithm possible. The length of the runs is no longer the first priority.
Modern servers have hundreds of GB of operating memory and tens
of TB of storage capacity. Hence, if the sorted data fit in the persistent storage,
the first phase will generate hundreds of runs at most.</p>
      <p>The merging phase should process all generated runs in one step. Current
operating systems can work with hundreds of separate files, and the storage can
handle simultaneous (and thus random) access to all these streams without
a significant drop in performance.</p>
      <p>
        We have established that the traditional methods of external sorting presented by
Knuth are obsolete. New methods should focus more on optimizing the internal
sorting algorithms and on efficient hierarchical merging than on the number of I/O
operations. In future work, we would like to adapt the data stream sorting
algorithm of Falt et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for external merging.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Beckmann</surname>
          </string-name>
          , A., Meyer, U.,
          <string-name>
            <surname>Sanders</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singler</surname>
          </string-name>
          , J.:
          <article-title>Energy-efficient sorting using solid state disks</article-title>
          .
          <source>Sustainable Computing: Informatics and Systems</source>
          <volume>1</volume>
          (
          <issue>2</issue>
          ),
          <volume>151</volume>
          –
          <fpage>163</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bertasi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bressan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peserico</surname>
          </string-name>
          , E.:
          <article-title>psort, yet another fast stable sorting software</article-title>
          .
          <source>Journal of Experimental Algorithmics (JEA) 16</source>
          ,
          <issue>2</issue>
          –
          <issue>4</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chhugani</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>A.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>V.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macy</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagog</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baransi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubey</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>Efficient implementation of sorting on multicore SIMD CPU architecture</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>1</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1313</fpage>
          –
          <lpage>1324</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>MapReduce: simplified data processing on large clusters</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>51</volume>
          (
          <issue>1</issue>
          ),
          <fpage>107</fpage>
          –
          <lpage>113</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Demuth</surname>
            ,
            <given-names>H.B.</given-names>
          </string-name>
          :
          <article-title>Electronic data sorting</article-title>
          . Ph.D. thesis, Dept. of Electrical Engineering (
          <year>1956</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Falt</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bulanek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yaghob</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>On parallel sorting of data streams</article-title>
          .
          <source>In: Advances in Databases and Information Systems</source>
          . pp.
          <fpage>69</fpage>
          –
          <lpage>77</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hoare</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Quicksort</article-title>
          .
          <source>The Computer Journal</source>
          <volume>5</volume>
          (
          <issue>1</issue>
          ),
          <fpage>10</fpage>
          (
          <year>1962</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Knuth</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          :
          <article-title>Sorting and Searching</article-title>
          .
          Addison-Wesley
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Larson</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          :
          <article-title>External sorting: Run formation revisited</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>15</volume>
          (
          <issue>4</issue>
          ),
          <fpage>961</fpage>
          –
          <lpage>972</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Merrill</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grimshaw</surname>
            ,
            <given-names>A.S.:</given-names>
          </string-name>
          <article-title>Revisiting sorting for GPGPU stream architectures</article-title>
          .
          <source>In: Proceedings of the 19th international conference on Parallel architectures and compilation techniques</source>
          . pp.
          <fpage>545</fpage>
          –
          <lpage>546</lpage>
          .
          ACM
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Nyberg</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barclay</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cvetanovic</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lomet</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>AlphaSort: A cache-sensitive parallel external sort</article-title>
          .
          <source>The VLDB Journal – The International Journal on Very Large Data Bases</source>
          <volume>4</volume>
          (
          <issue>4</issue>
          ),
          <fpage>603</fpage>
          –
          <lpage>628</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Rasmussen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madhyastha</surname>
            ,
            <given-names>H.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mysore</surname>
            ,
            <given-names>R.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pucher</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vahdat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>TritonSort: A balanced large-scale sorting system</article-title>
          .
          <source>In: Proceedings of the 8th USENIX conference on Networked systems design and implementation</source>
          . pp.
          <fpage>3</fpage>
          –
          <lpage>3</lpage>
          .
          USENIX Association
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Satish</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harris</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garland</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Designing efficient sorting algorithms for manycore GPUs</article-title>
          .
          <source>In: IEEE International Symposium on Parallel &amp; Distributed Processing (IPDPS 2009)</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>10</lpage>
          .
          IEEE
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Vasudevan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaminsky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozuch</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andersen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pillai</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>FAWNSort: Energy-efficient sorting of 10GB</article-title>
          . Sort Benchmark final (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Vengroff</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vitter</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          :
          <article-title>Supporting I/O-efficient scientific computation in TPIE</article-title>
          .
          <source>In: Proceedings of the Seventh IEEE Symposium on Parallel and Distributed Processing</source>
          . pp.
          <fpage>74</fpage>
          –
          <lpage>77</lpage>
          .
          IEEE
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Vitter</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          :
          <article-title>External memory algorithms and data structures: Dealing with massive data</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>33</volume>
          (
          <issue>2</issue>
          ),
          <fpage>209</fpage>
          –
          <lpage>271</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>J.W.J.</given-names>
          </string-name>
          :
          <article-title>Algorithm 232: Heapsort</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>7</volume>
          (
          <issue>6</issue>
          ),
          <fpage>347</fpage>
          –
          <lpage>348</lpage>
          (
          <year>1964</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>