=Paper= {{Paper |id=Vol-2399/paper13 |storemode=property |title=Comprehensive Framework for Sorting Benchmarks |pdfUrl=https://ceur-ws.org/Vol-2399/paper13.pdf |volume=Vol-2399 |authors=Sergey Madaminov |dblpUrl=https://dblp.org/rec/conf/vldb/Madaminov19 }} ==Comprehensive Framework for Sorting Benchmarks== https://ceur-ws.org/Vol-2399/paper13.pdf




        Comprehensive Framework for Sorting Benchmarks

                                                           Sergey Madaminov
                                                    Department of Computer Science
                                                         Stony Brook University
                                                    New Computer Science Building
                                                   Stony Brook, New York 11794-2424
                                                smadaminov@cs.stonybrook.edu
                                                     supervised by Michael Ferdman

ABSTRACT
In the early days of computing, sorting accounted for almost 25% of all cycles that computers spent. That led to the development of a variety of sorting algorithms and their implementations, as well as to the creation of sorting benchmarks. However, those benchmarks do not account well for the increasing variability in the nature of data, and they also fail to assess the architectural features of different computer systems depending on the choice of sorting algorithm. This work proposes the development of a comprehensive sorting benchmark framework to address those issues and to help with the evaluation of sorting algorithms from both software and hardware perspectives.

1.   INTRODUCTION
   Sorting is an important operation that computers have been performing since their early days [18], which led to the development of various sorting algorithms. As sorting has proved to be important at datacenter scale [11, 12] and targets different scenarios and systems, various algorithms were developed: for general-purpose sorting on CPUs [22, 15, 5], for sorting on highly parallel systems [2], and for sorting on other types of architectures [10, 19]. However, with the rapid growth in the scale of the sorting problem, the question of which algorithm to choose persists. To answer it, one needs a sorting benchmark capable of providing enough information to analyze the requirements and efficiency of available and proposed algorithms for a given purpose.
   The idea of benchmarking is not novel: there is a body of work on benchmarks for system components such as CPUs [6], for applications such as databases [25], and for systems processing cloud workloads [8]. Some existing studies have targeted sorting specifically [13, 7, 21, 14]. Generally stated, the different types of benchmarks cover different parts of sorting systems, from architectural perspectives as well as algorithmic and software implementations.
   In spite of the rich body of knowledge on benchmarks, drastic changes in computing have made some of them obsolete. For instance, benchmarks such as PennySort and TeraByte Sort are deprecated due to the substantial growth in computational power that allows handling much larger data sets [13]. Similarly, the nature of the data itself may differ: while there is a suggested structure for a record to sort [7], which defines 100-byte records, not all studies follow it [20, 23]. Moreover, the sorting task itself can vary widely: it can be local to a single machine or distributed among many nodes in a cluster, and it can target different architectures.
   This variety of factors makes it unnecessarily complicated to evaluate sorting algorithms and sorting systems and to compare them against each other. Without a defined record structure or a defined data distribution, it is non-trivial to compare different sorting algorithms or their implementations directly. It becomes even more complicated when the targeted systems are FPGAs, as they may be programmed to process a very specific kind of data, and changes in the structure of the data may either significantly affect results or make it infeasible to process that data at all.
   To effectively analyze the choice of a sorting algorithm or sorting system, one needs to collect both hardware and software statistics for any viable approach. Hardware statistics may include cache performance, branch misprediction, and TLB misses, while software statistics may include running time on a particular system and the scalability of the sorting algorithm with the increasing amount of available parallelism or the growing volume of data.
   To overcome the above issues, this work proposes developing a comprehensive framework for sorting benchmarks capable of evaluating various hardware and software aspects of sorting algorithms and sorting systems while maintaining ease of use. This work is structured as follows: Section 2 justifies the development of such a framework, Section 3 discusses the framework architecture, and Section 4 concludes.
Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los Angeles, California.
Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.

2.   THE NEED FOR A COMPREHENSIVE FRAMEWORK
   Multiple studies have examined sorting benchmarks [3, 24, 9]. However, we advocate that more research on this topic is needed, for three main reasons: first, existing benchmarks do not consider the variety of input data and its distribution; second, they do not assess hardware statistics; and third, they are not a good fit for the variety of modern computer architectures.
The last point is particularly important, as a number of studies target various architectures such as GPUs [10], FPGAs [19], and AVX-based implementations [4]. Without a systematic approach, the task of comparing them against each other becomes quite challenging. Comparing different architectures is especially complicated when only part of the sorting algorithm is implemented. For example, some studies targeting FPGAs focus only on merging [20, 23]. As such implementations may require data transfer to and from the sorting system or some level of data preparation, and may depend on the problem size, it is unclear how to compare results obtained on different architectures. Thus, the proposed framework should provide a facility for performing such comparisons. For similar sorting algorithms, this can be achieved by directly comparing similar phases of the algorithms and estimating the remaining phases, which may include potentially required communication such as data transfer over PCIe or another medium.
   Many studies related to sorting use the record structure suggested by the Datamation sorting benchmark [7], but it is not universally accepted. Due to variations in record structure, comparing the results of different studies directly is not straightforward. On the other hand, the Datamation sorting benchmark, which defines a data record as 100 bytes with a ten-byte key and a ninety-byte value, may have become outdated. Current database vendors and users should be surveyed to collect the prevailing record structures and data distributions. However, as some works may use different input data, it is important to allow variations in the input data: first, doing so allows analyzing studies that use different input data; second, it enables comparison with prior work.
   It is also important to understand how sorting algorithms scale with an increasing degree of parallelism or volume of data, which requires collecting the corresponding information. To perform a more thorough evaluation of a sorting algorithm, it is crucial to collect system statistics such as memory bandwidth and cache miss rates. While it is possible to use existing profiling tools, doing so requires the algorithm developer to install and learn a variety of tools. This can be avoided by adding such functionality to the framework itself. Some algorithms exhibit distinctive behavior at the system level; e.g., Quicksort is known for good cache behavior and utilization. Gathering more information can help build a clear picture of a sorting algorithm, which in turn helps to reason about the differences between sorting algorithms. We suggest that the framework should not just provide statistical data as feedback, but also produce an analysis report that identifies the weak points of the algorithm and what can potentially be improved. Moreover, modern benchmark systems are not easy to use. Thus, the proposed framework should be user-friendly and should provide reports for further analysis in a readable format.
   With the variety of studies on sorting, including recent work exploring new computer architectures such as FPGAs [19, 20, 23] and GPUs [10] and their suitability for sorting, comparing their results becomes a challenging task. The proposed framework will strive to address these challenges and needs while maintaining ease of use. It may still be unclear how to compare different computer architectures, but this work sets resolving this problem as one of its targets.

3.   FRAMEWORK ARCHITECTURE
   This section provides a brief overview of potential framework components and argues for their necessity. It discusses various aspects of the proposed framework, such as data distribution, collectible system statistics, and other aspects including record structure.

3.1 Data Distribution
   The record generator used in the Sort Benchmark [13] can produce two types of data distribution. Although a bigger variety of data was considered by Helman et al. [14], their work focuses on the structure of the data rather than its distribution. Depending on the nature of the data, more options for the data distribution are possible, and the proposed framework should account for both data structure and data distribution. Often these two features are independent of each other, so the framework should provide facilities to combine them; it then becomes possible to have, for example, a staggered data structure with a Gaussian distribution, or any other combination of data structure and data distribution. Table 1 lists some of the possible distributions to account for. However, as with Datamation [7], a comprehensive list should be compiled using input from database vendors and users, so that it represents the actual workloads found outside of research groups.

       Table 1: Examples of data distributions.
                Uniform      Bernoulli
                Poisson      Exponential
                Gaussian     Log Normal
                Gamma        Beta

3.2 System Statistics
   Currently, to assess system performance characteristics such as memory bandwidth, the developer has to use tools such as Intel VTune [17]. While in some cases using external software may be inevitable, the framework should collect statistics where it can, and at least provide a list of the various metrics to account for. Table 2 lists some of the suggested system statistics to collect.
   Many modern sorting algorithms have optimal or near-optimal complexity, but real implementations may still differ noticeably. Collecting such statistics may help to identify bottlenecks, which may in turn lead to further research on how those bottlenecks can be mitigated. As a naïve example, hugepages may help to reduce TLB misses [16], and recently introduced high-bandwidth memories may help with the memory-bandwidth-bound parts of sorting algorithms. Moreover, identifying such bottlenecks may steer hardware research: one can imagine building a sorting-specific accelerator to overcome them, for example, an FPGA that accelerates a particular task, or even a special-purpose processor with an ISA targeting the sorting task.

3.3 Miscellaneous
   The data distribution and system statistics cover many different aspects of sorting, but there are still implementation details and guidelines that may prove useful.
For algorithm developers, these include using custom comparators, avoiding indirect function calls [1], and supporting different record types, with the latter being tightly coupled with custom comparators. Ultimately, the evaluation of sorting algorithms and sorting systems may involve more factors than we have defined here, and it is important to identify them and leave the framework open to including them.

       Table 2: Examples of collectible system statistics.
            I/O Intensity           TLB Miss Rate
            IPC Intensity           Cache Miss Rate
            Cache Utilization       Branch Misprediction
            Memory Bandwidth        Memory Peak B/W

4.   CONCLUSIONS
   This work advocates for the development of a comprehensive framework for sorting benchmarks that accounts for various aspects of sorting algorithms, starting with defining the input data, and that measures both their software and hardware statistics. Such a framework may help create a system that fosters the development of sorting algorithms as well as the design of new computer architectures for sorting. We envision that it will be beneficial for many communities beyond the group of scientists who work on developing new sorting algorithms or modifying existing ones. While the work is in its preliminary stage, there are many design choices to be made, and collecting feedback from database vendors and users is essential for determining the common data features and hardware statistics they care about.

5.   REFERENCES
 [1] V. Agrawal, A. Dabral, T. Palit, Y. Shen, and M. Ferdman. Architectural Support for Dynamic Linking. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 691–702, New York, NY, USA, 2015. ACM.
 [2] M. Axtmann, T. Bingmann, P. Sanders, and C. Schulz. Practical Massively Parallel Sorting. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, pages 13–23. ACM Press, June 2015.
 [3] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS Parallel Benchmarks - Summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pages 158–165, New York, NY, USA, 1991. ACM.
 [4] B. Bramas. A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake. International Journal of Advanced Computer Science and Applications, 8(10), 2017.
 [5] C. Bron. Merge Sort Algorithm [M1]. Communications of the ACM, 15(5):357–358, May 1972.
 [6] J. Bucek, K.-D. Lange, and J. v. Kistowski. SPEC CPU2017: Next-Generation Compute Benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, pages 41–42, New York, NY, USA, 2018. ACM.
 [7] Anon et al., D. Bitton, M. Brown, R. Catell, S. Ceri, T. Chou, D. DeWitt, D. Gawlick, H. Garcia-Molina, B. Good, J. Gray, P. Homan, B. Jolls, T. Lukes, E. Lazowska, J. Nauman, M. Pong, A. Spector, K. Trieber, H. Sammer, O. Serlin, M. Stonebraker, A. Reuter, and P. Weinberger. A Measure of Transaction Processing Power. Datamation, 31(7):112–118, April 1985.
 [8] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 37–48, 2012.
 [9] P. Garcia and H. Korth. Multithreaded Architectures and the Sort Benchmark. In Proceedings of the 1st International Workshop on Data Management on New Hardware, New York, NY, USA, 2005. ACM.
[10] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 325–336, New York, NY, USA, 2006. ACM.
[11] G. Graefe. Sorting And Indexing With Partitioned B-Trees. In Proceedings of the 1st International Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 2003.
[12] G. Graefe. Implementing Sorting in Database Systems. ACM Computing Surveys, 38(3), September 2006.
[13] J. Gray, C. Nyberg, M. Shah, and N. Govindaraju. Sort Benchmark. http://sortbenchmark.org/.
[14] D. R. Helman, D. A. Bader, and J. JáJá. Parallel Algorithms for Personalized Communication and Sorting With an Experimental Study (Extended Abstract). In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 211–222. ACM Press, June 1996.
[15] C. A. R. Hoare. Quicksort. The Computer Journal, 5(1):10–16, January 1962.
[16] J. Hu, X. Bai, S. Sha, Y. Luo, X. Wang, and Z. Wang. HUB: Hugepage Ballooning in Kernel-based Virtual Machines. In Proceedings of the International Symposium on Memory Systems, pages 31–37, New York, NY, USA, 2018. ACM.
[17] Intel. Intel VTune. https://software.intel.com/en-us/vtune.
[18] D. E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Addison-Wesley Professional, 2nd edition, 1998.
[19] D. Koch and J. Torresen. FPGASort. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 45–54. ACM Press, 2011.
[20] S. Mashimo, T. V. Chu, and K. Kise. High-Performance Hardware Merge Sorter. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 1–8. IEEE, April 2017.
[21] C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet. AlphaSort: A Cache-sensitive Parallel External Sort. The VLDB Journal, 4(4):603–628, October 1995.
[22] T. Peters. Timsort. https://bugs.python.org/file4451/timsort.txt.
[23] M. Saitoh, E. A. Elsayed, T. V. Chu, S. Mashimo, and K. Kise. A High-Performance and Cost-Effective Hardware Merge Sorter without Feedback Datapath. In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 197–204. IEEE, April 2018.
[24] K. Thearling and S. Smith. An Improved Supercomputer Sorting Benchmark. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pages 14–19, Los Alamitos, CA, USA, 1992. IEEE Computer Society Press.
[25] TPC. Active TPC Benchmarks. http://www.tpc.org/information/benchmarks.asp.