=Paper=
{{Paper
|id=Vol-2399/paper13
|storemode=property
|title=Comprehensive Framework for Sorting Benchmarks
|pdfUrl=https://ceur-ws.org/Vol-2399/paper13.pdf
|volume=Vol-2399
|authors=Sergey Madaminov
|dblpUrl=https://dblp.org/rec/conf/vldb/Madaminov19
}}
==Comprehensive Framework for Sorting Benchmarks==
Sergey Madaminov
Department of Computer Science
Stony Brook University
New Computer Science Building
Stony Brook, New York 11794-2424
smadaminov@cs.stonybrook.edu
supervised by Michael Ferdman
Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los Angeles, California. Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

In the early days, sorting accounted for almost 25% of all cycles that computers were spending. That led to the development of a variety of sorting algorithms and their implementations, as well as the creation of sorting benchmarks. However, those benchmarks do not account well for the increasing variability in the nature of data, and they also fail to assess the architectural features of different computer systems depending on the choice of sorting algorithm. This work proposes the development of a comprehensive sorting benchmark framework to address those issues and to help with the evaluation of sorting algorithms from both software and hardware perspectives.

1. INTRODUCTION

Sorting is an important operation that computers have been performing since the early days [18]. This led to the development of various sorting algorithms. As sorting has proved to be important at datacenter scale [11, 12] and targets different scenarios and systems, various algorithms were developed for general-purpose sorting using CPUs [22, 15, 5], for sorting suitable for highly parallel systems [2], and for sorting using other types of architectures [10, 19]. However, with the rapidly increasing scale of the sorting problem, the question of which algorithm to choose remains persistent. To answer this question, one needs a sorting benchmark capable of providing enough information for analyzing the needs and efficiency of available and proposed algorithms for a given purpose.

The idea of having benchmarks is not novel, and there is a body of work on benchmarks for system components such as the CPU [6], for applications such as databases [25], and for systems processing cloud workloads [8]. Some existing studies have targeted sorting specifically [13, 7, 21, 14]. Generally stated, the different types of benchmarks cover different parts of sorting systems, from architectural perspectives as well as algorithmic and software implementations.

In spite of the rich body of knowledge on benchmarks, drastic changes in computing today have made some of the benchmarks obsolete. For instance, benchmarks such as PennySort and TeraByte Sort are deprecated due to the substantial growth in computational power that allows handling much larger data sets [13]. Similarly, the nature of the data itself may also differ: while there is a suggested structure of a record to sort [7] that defines 100-byte records, not all studies follow it [20, 23]. Moreover, the sorting task itself can vary a lot: it can be local to a single machine or distributed among many nodes in a cluster, and it can target different architectures.

This variety of factors makes it unnecessarily complicated to evaluate sorting algorithms and sorting systems and to compare them against each other. Without a defined structure of the data record or a defined distribution, it may become non-trivial to compare different sorting algorithms or their implementations directly. It becomes even more complicated when the targeted systems are FPGAs, as they may be programmed to process a very specific set of data, and changes in the structure of the data may either significantly affect results or make it infeasible to process that data at all.

To effectively analyze the choice of a sorting algorithm or sorting system, one needs to collect both hardware and software statistics for any viable approach. While hardware statistics may include cache performance, branch misprediction, and TLB misses, the software statistics may include running time on a particular system and the scalability of the sorting algorithm with an increasing amount of available parallelism or a growing volume of data.

To overcome the above issues, this work proposes developing a comprehensive framework for sorting benchmarks capable of evaluating various hardware and software aspects of sorting algorithms and sorting systems while maintaining ease of use. This work is structured as follows: Section 2 justifies the development of such a framework, Section 3 discusses the framework architecture, and Section 4 concludes.

2. THE NEED FOR A COMPREHENSIVE FRAMEWORK

Multiple studies have been done on sorting benchmarks [3, 24, 9]. However, we advocate that more research is needed on this topic, for three main reasons: first, existing benchmarks do not consider the variety of input data and its distribution; second, they do not assess hardware statistics; and third, they are not a good fit for the variety of different computer architectures. The last reason is particularly important, as a number of studies target various architectures such as GPUs [10], FPGAs [19], and AVX-based systems [4]. Without a systematic approach, the task of comparing them against each other becomes quite challenging. Comparing different architectures is especially complicated when only part of the sorting algorithm is implemented. For example, some studies targeting FPGAs focus on the merging phase [20, 23]. As such implementations may require data transfer to and from the sorting system and some level of data preparation, or may depend on the problem size, it is unclear how to compare results obtained from different architectures. Thus, the proposed framework should provide a facility to perform a comparison between them. For similar sorting algorithms, this can be achieved by directly comparing similar phases of the algorithms and estimating the remaining phases, which may include potentially required communication such as data transfer over PCIe or another medium.

Many studies related to sorting use the record structure suggested by the Datamation sorting benchmark [7], but it is not universally accepted. Due to variations in record structure, comparing the results of different studies directly is not straightforward. On the other hand, the Datamation sorting benchmark, which defines the structure of a data record as 100 bytes with a ten-byte key and a ninety-byte value, may have become outdated. Current database vendors and users should be surveyed to collect the prevailing structures of records and data distributions. However, as some works may use different input data, it is important to allow variations in the input data. First, it will allow analyzing studies that use different input data. Second, it will enable comparison with prior work.

It is important to understand how sorting algorithms scale with increasing parallelism or volume of data, which requires collecting the corresponding information. To perform a more thorough evaluation of a sorting algorithm, it is crucial to collect system statistics such as memory bandwidth and cache miss rate. While it is possible to use existing profiling tools, that requires the algorithm developer to install and learn a variety of tools; this can be avoided by adding such functionality into the framework itself. Some algorithms exhibit different behavior at the system level; e.g., the Quicksort algorithm is known for good cache behavior and utilization. Gathering more information can help to get a clear picture of a sorting algorithm, which in turn can help to reason about the differences between sorting algorithms. We suggest that the framework should not just provide statistical data as feedback, but also provide an analysis report that identifies weak points of the algorithm and what can potentially be improved. Moreover, modern benchmark systems are not easy to use. Thus, the proposed framework should be user-friendly and should provide reports for further analysis in a readable format.

With a variety of studies on sorting, including recent works exploring new computer architectures such as FPGAs [19, 20, 23] and GPUs [10] and their suitability for sorting, comparing their results becomes a challenging task. The proposed framework will strive to address these challenges and needs while maintaining ease of use. It may still be unclear how to compare different computer architectures, but this work sets resolving this problem as one of its targets.

3. FRAMEWORK ARCHITECTURE

This section provides a brief overview of potential framework components and argues for their need. It discusses various aspects of the proposed framework, such as data distribution, collectible system statistics, and other aspects including record structure.

3.1 Data Distribution

The record generator used in the Sort Benchmark [13] can produce two types of data distribution. Although a bigger variety of data was considered by Helman et al. [14], their work focuses on the structure of the data rather than its distribution. Based on the nature of the data, it is possible to have more options in the data distribution, and the proposed framework should account for both data structure and data distribution. Often, these two features may be independent of each other, so the framework should provide facilities to combine them. Thus, it may become possible to have a staggered data structure with a Gaussian distribution, or any other combination of data structure and data distribution. Table 1 provides a list of some of the possible distributions to account for. However, similar to Datamation [7], a comprehensive list should be compiled using input from database vendors and database users to represent the actual workloads that may be found outside of research groups.

Table 1: Example of the data distribution.
Uniform     Bernoulli
Poisson     Exponential
Gaussian    Log Normal
Gamma       Beta

3.2 System Statistics

Currently, to assess system performance characteristics such as memory bandwidth, the developer has to use tools such as Intel VTune [17]. While in some cases it may be inevitable to use external software, the framework should collect statistics where it can and at least provide a list of the various metrics to account for. Table 2 provides a list of some of the suggested system statistics to collect.

Many modern sorting algorithms have optimal or near-optimal complexity, but real implementations may exhibit noticeable differences between them. Collecting such statistics may help to identify bottlenecks, which may lead to further research on how those bottlenecks can be mitigated. As a naïve example, hugepages may help to reduce TLB misses [16], and using the recently introduced high-bandwidth memories may help to handle the memory-bandwidth-bound parts of sorting algorithms. Moreover, identifying such bottlenecks may steer hardware research: one can imagine building a sorting-specific accelerator to overcome them. For example, it may be an FPGA that accelerates a particular task, or even a special-purpose processor with an ISA targeting the sorting task.

3.3 Miscellaneous

The data distribution and system statistics cover many different aspects of sorting, but there are still some implementation details and guidelines that may prove useful.
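As an illustration of the data-distribution component (Section 3.1), a record generator that combines the Datamation record structure (ten-byte key, ninety-byte value [7]) with a pluggable key distribution from Table 1 might look like the following. This is a minimal sketch, not the framework's actual API; the function names and the Gaussian parameters are illustrative assumptions.

```python
import random

KEY_SIZE = 10        # Datamation-style record [7]: 10-byte key ...
RECORD_SIZE = 100    # ... followed by a 90-byte value

def make_record(key: bytes, value: bytes = b"x" * 90) -> bytes:
    """Pack one fixed-size 100-byte record: key bytes, then value bytes."""
    assert len(key) == KEY_SIZE and len(value) == RECORD_SIZE - KEY_SIZE
    return key + value

def uniform_key(rng: random.Random) -> bytes:
    """Uniformly random 10-byte (80-bit) key."""
    return rng.getrandbits(8 * KEY_SIZE).to_bytes(KEY_SIZE, "big")

def gaussian_key(rng: random.Random, mu: float = 2**39, sigma: float = 2**35) -> bytes:
    """Key drawn from a clamped Gaussian (mu/sigma are illustrative),
    producing a skewed key population instead of a uniform one."""
    v = max(0, min(2**40 - 1, int(rng.gauss(mu, sigma))))
    return v.to_bytes(KEY_SIZE, "big")

def generate(n: int, key_dist, seed: int = 42) -> list:
    """Generate n records whose keys follow the chosen distribution."""
    rng = random.Random(seed)
    return [make_record(key_dist(rng)) for _ in range(n)]
```

Because the key occupies the leading bytes of each record, sorting the resulting list of `bytes` objects lexicographically orders records by key, which keeps the generator directly usable as benchmark input.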
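On the software side of the system statistics (Section 3.2), running time and scalability can be measured portably, whereas the hardware counters of Table 2 would need OS or vendor facilities such as Intel VTune [17]. A minimal sketch of such a measurement harness, with illustrative names and sizes:

```python
import random
import time

def measure_sort_seconds(n: int, seed: int = 1) -> float:
    """Wall-clock running time of sorting n random 80-bit keys:
    one of the software statistics the framework should collect."""
    rng = random.Random(seed)
    data = [rng.getrandbits(80) for _ in range(n)]
    start = time.perf_counter()
    data.sort()
    return time.perf_counter() - start

def scalability_report(sizes=(10_000, 20_000, 40_000)) -> list:
    """Record running time and throughput as the data volume grows,
    the kind of feedback a readable framework report could contain."""
    rows = []
    for n in sizes:
        seconds = measure_sort_seconds(n)
        rows.append({"records": n, "seconds": seconds,
                     "records_per_second": n / seconds})
    return rows
```

Comparing `records_per_second` across the growing sizes gives a first-order view of how an implementation scales with data volume; the same harness shape extends to varying thread counts for parallelism scaling.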
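Section 2's suggestion of comparing similar phases directly and estimating the remaining ones can be made concrete with a simple time model: a partially implemented sorter (e.g., an FPGA merge phase [20, 23]) is charged for the data transfers a full system would require. In this sketch, the default link bandwidth is an assumed effective PCIe figure, not a measurement:

```python
def estimated_total_seconds(bytes_in: int, bytes_out: int,
                            measured_phase_seconds: float,
                            link_gb_per_s: float = 8.0) -> float:
    """End-to-end estimate for a partial implementation: time to move
    the input to the device and the output back over the link, plus
    the measured running time of the implemented phase itself."""
    transfer_seconds = (bytes_in + bytes_out) / (link_gb_per_s * 1e9)
    return transfer_seconds + measured_phase_seconds
```

Two systems implementing different subsets of a sorting algorithm can then be compared on the same estimated end-to-end figure rather than on their measured phases alone.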
For algorithm developers, such guidelines include using custom comparators, avoiding indirect function calls [1], and supporting different record types, with the latter being tightly coupled with custom comparators. Ultimately, the evaluation of sorting algorithms and sorting systems may involve more factors than we have defined so far, and it is important to identify them and leave the framework open to including them.

Table 2: Example of the collectible system statistics.
I/O Intensity        TLB Miss Rate
IPC Intensity        Cache Miss Rate
Cache Utilization    Branch Misprediction
Memory Bandwidth     Memory Peak B/W

4. CONCLUSIONS

This work advocates for the development of a comprehensive framework for sorting benchmarks that accounts for various aspects of sorting algorithms, starting with defining the input data, and measures both their software and hardware statistics. Such a framework may help to create a system that fosters the development of sorting algorithms as well as the design of new computer architectures for sorting. We envision that it will be beneficial for many communities beyond the group of scientists who work on developing new sorting algorithms or modifying existing ones. While the work is in its preliminary stage, many design choices still have to be made, and collecting feedback from database vendors and users is essential for learning which data features and hardware statistics they care about.

5. REFERENCES

[1] V. Agrawal, A. Dabral, T. Palit, Y. Shen, and M. Ferdman. Architectural Support for Dynamic Linking. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 691–702, New York, NY, USA, 2015. ACM.
[2] M. Axtmann, T. Bingmann, P. Sanders, and C. Schulz. Practical Massively Parallel Sorting. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, pages 13–23. ACM Press, June 2015.
[3] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS Parallel Benchmarks - Summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pages 158–165, New York, NY, USA, 1991. ACM.
[4] B. Bramas. A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake. International Journal of Advanced Computer Science and Applications, 8(10), 2017.
[5] C. Bron. Merge Sort Algorithm [M1]. Communications of the ACM, 15(5):357–358, May 1972.
[6] J. Bucek, K.-D. Lange, and J. v. Kistowski. SPEC CPU2017: Next-Generation Compute Benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, pages 41–42, New York, NY, USA, 2018. ACM.
[7] Anon. et al., D. Bitton, M. Brown, R. Catell, S. Ceri, T. Chou, D. DeWitt, D. Gawlick, H. Garcia-Molina, B. Good, J. Gray, P. Homan, B. Jolls, T. Lukes, E. Lazowska, J. Nauman, M. Pong, A. Spector, K. Trieber, H. Sammer, O. Serlin, M. Stonebraker, A. Reuter, and P. Weinberger. A Measure of Transaction Processing Power. Datamation, 31(7):112–118, April 1985.
[8] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 37–48, 2012.
[9] P. Garcia and H. Korth. Multithreaded Architectures and the Sort Benchmark. In Proceedings of the 1st International Workshop on Data Management on New Hardware, New York, NY, USA, 2005. ACM.
[10] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 325–336, New York, NY, USA, 2006. ACM.
[11] G. Graefe. Sorting And Indexing With Partitioned B-Trees. In Proceedings of the 1st International Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 2003.
[12] G. Graefe. Implementing Sorting in Database Systems. ACM Computing Surveys, 38(3), September 2006.
[13] J. Gray, C. Nyberg, M. Shah, and N. Govindaraju. Sorting Benchmark. http://sortbenchmark.org/.
[14] D. R. Helman, D. A. Bader, and J. JáJá. Parallel Algorithms for Personalized Communication and Sorting With an Experimental Study (Extended Abstract). In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 211–222. ACM Press, June 1996.
[15] C. A. R. Hoare. Quicksort. The Computer Journal, 5(1):10–16, January 1962.
[16] J. Hu, X. Bai, S. Sha, Y. Luo, X. Wang, and Z. Wang. HUB: Hugepage Ballooning in Kernel-based Virtual Machines. In Proceedings of the International Symposium on Memory Systems, pages 31–37, New York, NY, USA, 2018. ACM.
[17] Intel. Intel VTune. https://software.intel.com/en-us/vtune.
[18] D. E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Addison-Wesley Professional, 2nd edition, 1998.
[19] D. Koch and J. Torresen. FPGASort. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 45–54. ACM Press, 2011.
[20] S. Mashimo, T. V. Chu, and K. Kise. High-Performance Hardware Merge Sorter. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 1–8. IEEE, April 2017.
[21] C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet. AlphaSort: A Cache-sensitive Parallel External Sort. The VLDB Journal, 4(4):603–628, October 1995.
[22] T. Peters. Timsort. https://bugs.python.org/file4451/timsort.txt.
[23] M. Saitoh, E. A. Elsayed, T. V. Chu, S. Mashimo, and K. Kise. A High-Performance and Cost-Effective Hardware Merge Sorter without Feedback Datapath. In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 197–204. IEEE, April 2018.
[24] K. Thearling and S. Smith. An Improved Supercomputer Sorting Benchmark. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pages 14–19, Los Alamitos, CA, USA, 1992. IEEE Computer Society Press.
[25] TPC. Active TPC Benchmarks. http://www.tpc.org/information/benchmarks.asp.