Comprehensive Framework for Sorting Benchmarks

Sergey Madaminov (smadaminov@cs.stonybrook.edu), Department of Computer Science, Stony Brook University, New Computer Science Building, Stony Brook, New York 11794-2424

Michael Ferdman, Department of Computer Science, Stony Brook University, New Computer Science Building, Stony Brook, New York 11794-2424


In the early days of computing, sorting accounted for almost 25% of all cycles that computers spent. This led to the development of a variety of sorting algorithms and implementations, as well as the creation of sorting benchmarks. However, those benchmarks do not account well for the increasing variability in the nature of data, and they fail to assess how the architectural features of different computer systems interact with the choice of sorting algorithm. This work proposes the development of a comprehensive sorting benchmark framework to address those issues and to help evaluate sorting algorithms from both software and hardware perspectives.

INTRODUCTION

Sorting is an important operation that computers have been performing from the early days [18], which led to the development of various sorting algorithms. As sorting has proved to be important at datacenter scale [11,12] and targets different scenarios and systems, algorithms were developed for general-purpose sorting on CPUs [22,15,5], for sorting on highly parallel systems [2], and for sorting on other types of architectures [10,19]. However, with the rapid growth in the scale of sorting problems, the question of which algorithm to choose persists. To answer this question, one needs a sorting benchmark that provides enough information to analyze the requirements and efficiency of available and proposed algorithms for a given purpose.

The idea of having benchmarks is not novel, and there is a body of work on benchmarks for system components such as CPUs [6], applications such as databases [25], and systems for processing cloud workloads [8]. Some existing studies have targeted sorting specifically [13,7,21,14]. Generally stated, the different types of benchmarks cover different parts of sorting systems, from architectural perspectives as well as algorithmic and software implementations.

In spite of the rich body of knowledge on benchmarks, drastic changes in computing have made some of the benchmarks obsolete. For instance, benchmarks such as PennySort and TeraByte Sort are deprecated due to the substantial growth in computational power, which allows handling much larger data sets [13]. Similarly, the nature of the data itself may differ: while there is a suggested record structure that defines 100-byte records [7], not all studies follow it [20,23]. Moreover, the sorting task itself can vary considerably: it can be local to a single machine or distributed among many nodes in a cluster, and it can target different architectures.

This variety of factors makes it unnecessarily complicated to evaluate sorting algorithms and sorting systems and to compare them against each other. Without a defined record structure or a defined data distribution, it is non-trivial to compare different sorting algorithms or their implementations directly. The comparison becomes even more complicated when the targeted systems are FPGAs, as they may be programmed to process a very specific kind of data, and changes in the structure of the data may either significantly affect results or make it infeasible to process that data at all.

To effectively analyze the choice of a sorting algorithm or sorting system, one needs to collect both hardware and software statistics for any viable approach. Hardware statistics may include cache performance, branch mispredictions, and TLB misses, while software statistics may include the running time on a particular system and the scalability of the sorting algorithm with increasing available parallelism or growing data volume.

To overcome the above issues, this work proposes developing a comprehensive framework for sorting benchmarks capable of evaluating various hardware and software aspects of sorting algorithms and sorting systems while maintaining ease of use. This work is structured as follows: Section 2 justifies the development of such a framework, Section 3 discusses framework architecture, and Section 4 concludes.

THE NEED FOR COMPREHENSIVE FRAMEWORK

last one is particularly important as there are a number of studies targeting various architectures such as GPUs [10], FPGAs [19], and AVX-based systems [4]. But without a systematic approach, the task of comparing them against each other becomes quite challenging. Comparing different architectures is especially complicated when only part of the sorting algorithm is implemented. For example, some studies targeting FPGAs focus on merging [20,23]. As such implementations may require data transfer to and from the sorting system, may require some level of data preparation, or may depend on the problem size, it is unclear how to compare results obtained from different architectures. Thus, the proposed framework should provide a facility to perform such a comparison. For similar sorting algorithms, this can be achieved by directly comparing similar phases of the algorithms and estimating the remaining phases, which may include required communication such as data transfer over PCIe or another medium.
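The phase-based comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed method: the names `PhaseTimes`, `estimate_transfer`, and `end_to_end` are hypothetical, and the transfer estimate is a simple volume-over-bandwidth model that ignores latency and protocol overhead. The numbers in the usage lines are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PhaseTimes:
    """Per-phase times in seconds; None marks a phase that was not measured."""
    transfer_in: Optional[float]
    sort: float
    transfer_out: Optional[float]


def estimate_transfer(num_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Estimate a one-way transfer time as data volume divided by bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


def end_to_end(phases: PhaseTimes, num_bytes: int,
               bandwidth_bytes_per_s: float) -> float:
    """Total time, filling unmeasured transfer phases with estimates."""
    est = estimate_transfer(num_bytes, bandwidth_bytes_per_s)
    t_in = phases.transfer_in if phases.transfer_in is not None else est
    t_out = phases.transfer_out if phases.transfer_out is not None else est
    return t_in + phases.sort + t_out


# Hypothetical numbers: an accelerator study that reports only its sort
# phase, compared with a CPU run that needs no transfer at all.
accelerator = PhaseTimes(transfer_in=None, sort=0.5, transfer_out=None)
cpu = PhaseTimes(transfer_in=0.0, sort=2.0, transfer_out=0.0)
accelerator_total = end_to_end(accelerator, 8_000_000_000, 16e9)
cpu_total = end_to_end(cpu, 8_000_000_000, 16e9)
```

Keeping measured and estimated phases separate makes it explicit which part of a reported result is an estimate.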

Many studies related to sorting use the record structure suggested by the Datamation sorting benchmark [7], but it is not universally accepted. Due to variations in record structure, comparing the results of different studies directly is not straightforward. Moreover, the Datamation record structure, a 100-byte record with a ten-byte key and a ninety-byte value, may have become outdated. Current database vendors and users should be surveyed to collect the prevailing record structures and data distributions. However, as some works use different input data, it is important to allow variations in the input data. First, this will allow analyzing studies that use different input data. Second, it will enable comparison with prior work.
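To make the record structure concrete, the following minimal sketch generates fixed-size records with a leading key and sorts them by that key. The defaults follow the 100-byte layout with a ten-byte key described above; the function names, the parameterization, and the use of pseudo-random bytes for both key and value are assumptions made for illustration only.

```python
import random

KEY_BYTES = 10    # key length of the 100-byte record layout
VALUE_BYTES = 90  # value (payload) length of the 100-byte record layout


def make_records(count, key_bytes=KEY_BYTES, value_bytes=VALUE_BYTES, seed=0):
    """Generate `count` records of pseudo-random bytes (key followed by value)."""
    rng = random.Random(seed)
    size = key_bytes + value_bytes
    return [bytes(rng.getrandbits(8) for _ in range(size)) for _ in range(count)]


def sort_records(records, key_bytes=KEY_BYTES):
    """Sort records by their leading key bytes, ignoring the value."""
    return sorted(records, key=lambda rec: rec[:key_bytes])


records = make_records(1000)
ordered = sort_records(records)
```

Exposing the key and value sizes as parameters is one way to admit input variations, for example `make_records(1000, key_bytes=4, value_bytes=12)` for a study that uses smaller records.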

It is important to understand how sorting algorithms scale with increasing parallelism or data volume, which requires collecting the corresponding information. To perform a more thorough evaluation of a sorting algorithm, it is crucial to collect system statistics such as memory bandwidth and cache miss rate. While it is possible to use existing profiling tools, doing so requires the algorithm developer to install and learn a variety of tools. This can be avoided by adding such functionality to the framework itself. Some algorithms exhibit distinctive behavior at the system level, e.g., the Quicksort algorithm is known for good cache behavior and utilization. Gathering more information can help to get a clear picture of a sorting algorithm, which in turn can help to reason about the differences between sorting algorithms. We suggest that the framework should not just provide statistical data as feedback, but also provide an analysis report that identifies weak points of the algorithm and what can potentially be improved. Moreover, modern benchmark systems are not easy to use. Thus, the proposed framework should be user-friendly and should provide reports for further analysis in a readable format.
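As an illustration of collecting scaling information and presenting it in a readable form, the sketch below times a sort function over growing input sizes and formats a small report. It is a sketch under assumptions: the helper names are hypothetical, only wall-clock time is measured, and the input is uniformly random floats. A real framework would add hardware statistics and more input variety.

```python
import random
import time


def time_sort(sort_fn, data, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        work = list(data)  # fresh copy so every run sees the same input
        start = time.perf_counter()
        sort_fn(work)
        best = min(best, time.perf_counter() - start)
    return best


def scaling_sweep(sort_fn, sizes, seed=0):
    """Time `sort_fn` on uniformly random inputs of each (positive) size."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        data = [rng.random() for _ in range(n)]
        rows.append((n, time_sort(sort_fn, data)))
    return rows


def format_report(rows):
    """Render (size, seconds) rows as an aligned plain-text table."""
    lines = [f"{'elements':>12}  {'seconds':>12}  {'ns/element':>12}"]
    for n, seconds in rows:
        lines.append(f"{n:>12}  {seconds:>12.6f}  {seconds / n * 1e9:>12.1f}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(scaling_sweep(list.sort, [1_000, 10_000, 100_000])))
```

Reporting time per element alongside total time makes super-linear growth visible at a glance, which is the kind of feedback an analysis report could build on.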

With a variety of studies on sorting, including recent works exploring computer architectures such as FPGAs [19,20,23] and GPUs [10] and their suitability for sorting, comparing their results becomes a challenging task. The proposed framework will strive to address these challenges and needs while maintaining ease of use. It may still be unclear how to compare different computer architectures, but this work sets resolving this problem as one of its targets.

FRAMEWORK ARCHITECTURE

This section provides a brief overview of potential framework components and argues for their need. It discusses data distribution, collectible system statistics, and other aspects of the proposed framework, including record structure.

Data Distribution

The record generator used in the Sort Benchmark [13] can produce two types of data distribution. Although a bigger variety of data is considered by Helman et al. [14], their work focuses on the structure of the data rather than its distribution. Depending on the nature of the data, more options for the data distribution are possible, and the proposed framework should account for both data structure and data distribution. Often, these two features are independent of each other, so the framework should provide facilities to combine them. Thus, it becomes possible to have, for example, a staggered data structure with a Gaussian distribution, or any other combination of data structure and data distribution. Table 1 lists some of the possible distributions to account for. However, as with Datamation [7], a comprehensive list should be compiled using input from database vendors and database users so that it represents the actual workloads found outside of research groups.
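A minimal sketch of combining the two independent features could look as follows: key values are drawn from a chosen distribution, and a separate step imposes a chosen arrangement on them. The registry names, the distribution parameters, and the arrangements shown (`random`, `sorted`, `reversed`, `nearly_sorted`) are illustrative assumptions; in particular, the staggered layout of Helman et al. [14] is not reproduced here.

```python
import random

# Value distributions: each entry draws one key from a seeded generator.
# The parameters are arbitrary choices for illustration.
DISTRIBUTIONS = {
    "uniform": lambda rng: rng.uniform(0.0, 1.0),
    "gaussian": lambda rng: rng.gauss(0.0, 1.0),
    "exponential": lambda rng: rng.expovariate(1.0),
    "lognormal": lambda rng: rng.lognormvariate(0.0, 1.0),
    "gamma": lambda rng: rng.gammavariate(2.0, 1.0),
    "beta": lambda rng: rng.betavariate(2.0, 5.0),
}


def _nearly_sorted(keys, rng):
    """Sort the keys, then swap roughly 1% of positions at random."""
    keys = sorted(keys)
    for _ in range(len(keys) // 100):
        i, j = rng.randrange(len(keys)), rng.randrange(len(keys))
        keys[i], keys[j] = keys[j], keys[i]
    return keys


# Arrangements ("structure"): each entry reorders already-drawn keys.
STRUCTURES = {
    "random": lambda keys, rng: keys,  # independent draws are already unordered
    "sorted": lambda keys, rng: sorted(keys),
    "reversed": lambda keys, rng: sorted(keys, reverse=True),
    "nearly_sorted": _nearly_sorted,
}


def generate(distribution, structure, count, seed=0):
    """Draw `count` keys from a distribution, then impose an arrangement."""
    rng = random.Random(seed)
    draw = DISTRIBUTIONS[distribution]
    keys = [draw(rng) for _ in range(count)]
    return STRUCTURES[structure](keys, rng)


keys = generate("gaussian", "nearly_sorted", 10_000, seed=42)
```

Because the two registries are independent, adding one distribution or one arrangement makes it available in combination with everything already registered.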

System Statistics

Currently, to assess system performance characteristics such as memory bandwidth, the developer has to use tools such as Intel VTune [17]. While in some cases using external software may be unavoidable, the framework should collect statistics where it can and should at least provide the list of metrics to account for. Table 2 lists some of the suggested system statistics to collect.
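One way to read "collect statistics where it can and at least provide the list of metrics" is a statistics record in which every metric has a slot and unfilled slots are reported explicitly. The sketch below is an assumption-laden illustration: the field names are hypothetical, only wall-clock time and peak Python-level allocation are collected using the standard library, and hardware metrics are left as `None` because collecting them would require an external profiler or hardware performance counters.

```python
import time
import tracemalloc
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SortStats:
    """Slots for suggested metrics; None means 'not collected in this run'."""
    wall_time_s: Optional[float] = None
    peak_python_alloc_bytes: Optional[int] = None
    tlb_miss_rate: Optional[float] = None
    cache_miss_rate: Optional[float] = None
    branch_misprediction_rate: Optional[float] = None
    memory_bandwidth_bytes_per_s: Optional[float] = None


def measure(sort_fn, data):
    """Collect the software-visible statistics for one sort function."""
    stats = SortStats()

    # Timing run, without tracing, so tracing overhead does not skew it.
    work = list(data)
    start = time.perf_counter()
    sort_fn(work)
    stats.wall_time_s = time.perf_counter() - start

    # Separate run to record peak memory allocated through Python.
    work = list(data)
    tracemalloc.start()
    sort_fn(work)
    _, stats.peak_python_alloc_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return stats


def uncollected(stats):
    """Names of metrics that still need an external tool."""
    return [f.name for f in fields(stats) if getattr(stats, f.name) is None]


stats = measure(sorted, [5, 3, 8, 1])
missing = uncollected(stats)
```

Listing the uncollected metrics alongside the collected ones tells the developer exactly which measurements still have to come from an external tool.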

Many modern sorting algorithms have optimal or near-optimal complexity, but real implementations may show noticeable differences between them. Collecting such statistics may help to identify bottlenecks, which may lead to further research on how those bottlenecks can be mitigated. As a naïve example, hugepages may help to reduce TLB misses [16], and recently introduced high-bandwidth memories may help with the memory-bandwidth-bound parts of sorting algorithms. Moreover, identifying such bottlenecks may steer hardware research. One can imagine building a sorting-specific accelerator to overcome them. For example, it may be an FPGA that accelerates a particular task or even a special-purpose processor whose ISA targets the sorting task.

Miscellaneous

The data distribution and system statistics cover many different aspects of sorting, but there are still some implementation details and guidelines that may be useful for algorithm developers. They include using custom comparators, avoiding indirect function calls [1], and supporting different record types, with the latter being tightly coupled with custom comparators. Ultimately, the evaluation of sorting algorithms and sorting systems may involve more factors than we have previously defined, and it is important to identify them and to leave the framework open to including them.

CONCLUSIONS

This work advocates for the development of a comprehensive framework for sorting benchmarks, which accounts for various aspects of sorting algorithms, starting with the definition of the input data, and measures both their software and hardware statistics. Such a framework may help to foster the development of sorting algorithms as well as the design of new computer architectures for sorting. We envision that it will be beneficial for many communities beyond the group of scientists who work on developing new sorting algorithms or modifying existing ones. As the work is in its preliminary stage, there are many design choices that have to be made, and collecting feedback from database vendors and users is essential to determine which data features and hardware statistics they care about.

Table 1: Examples of data distributions.
Uniform, Bernoulli, Poisson, Exponential, Gaussian, Log Normal, Gamma, Beta

Table 2: Examples of collectible system statistics.
I/O Intensity, TLB Miss Rate, IPC Intensity, Caches Miss Rate, Cache Utilization, Branch Misprediction, Memory Bandwidth, Memory Peak B/W

REFERENCES

[1] V. Agrawal, A. Dabral, T. Palit, Y. Shen, M. Ferdman. Architectural Support for Dynamic Linking. Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems. New York, NY, USA: ACM, 2015.

[2] M. Axtmann, T. Bingmann, P. Sanders, C. Schulz. Practical Massively Parallel Sorting. Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures. ACM Press, June 2015.

[3] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, S. K. Weeratunga. The NAS Parallel Benchmarks - Summary and Preliminary Results. Proceedings of the 1991 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM, 1991.

[4] B. Bramas. A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake. International Journal of Advanced Computer Science and Applications, 8(10), 2017.

[5] C. Bron. Merge Sort Algorithm [M1]. Communications of the ACM, 15(5), May 1972.

[6] J. Bucek, K.-D. Lange, J. V. Kistowski. SPEC CPU2017: Next-Generation Compute Benchmark. Companion of the 2018 ACM/SPEC International Conference on Performance Engineering. New York, NY, USA: ACM, 2018.

[7] Anon et al. A Measure of Transaction Processing Power. Datamation, 31(7), April 1985.

[8] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, B. Falsafi. Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware. Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2012.

[9] P. Garcia, H. Korth. Multithreaded Architectures and the Sort Benchmark. Proceedings of the 1st International Workshop on Data Management on New Hardware. New York, NY, USA: ACM, 2005.

[10] N. Govindaraju, J. Gray, R. Kumar, D. Manocha. GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management. Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 2006.

[11] G. Graefe. Sorting And Indexing With Partitioned B-Trees. Proceedings of the 1st International Conference on Innovative Data Systems Research. Asilomar, CA, USA, January 2003.

[12] G. Graefe. Implementing Sorting in Database Systems. ACM Computing Surveys, 38(3), September 2006.

[13] J. Gray, C. Nyberg, M. Shah, N. Govindaraju. Sorting Benchmark.

[14] D. R. Helman, D. A. Bader, J. JáJá. Parallel Algorithms for Personalized Communication and Sorting With an Experimental Study (Extended Abstract). Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures. ACM Press, June 1996.

[15] C. A. R. Hoare. Quicksort. The Computer Journal, 5(1), January 1962.

[16] J. Hu, X. Bai, S. Sha, Y. Luo, X. Wang, Z. Wang. HUB: Hugepage Ballooning in Kernel-based Virtual Machines. Proceedings of the International Symposium on Memory Systems. New York, NY, USA: ACM, 2018.

[17] Intel. Intel VTune.

[18] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. 2nd edition. Addison-Wesley Professional, 1998.

[19] D. Koch, J. Torresen. FPGASort. Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM Press, 2011.

[20] S. Mashimo, T. V. Chu, K. Kise. High-Performance Hardware Merge Sorter. IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, April 2017.

[21] C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, D. Lomet. AlphaSort: A Cache-sensitive Parallel External Sort. The VLDB Journal, 4(4), October 1995.

[22] T. Peters. Timsort.

[23] M. Saitoh, E. A. Elsayed, T. V. Chu, S. Mashimo, K. Kise. A High-Performance and Cost-Effective Hardware Merge Sorter without Feedback Datapath. IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, April 2018.

[24] K. Thearling, S. Smith. An Improved Supercomputer Sorting Benchmark. Proceedings of the 1992 ACM/IEEE Conference on Supercomputing. Los Alamitos, CA, USA: IEEE Computer Society Press, 1992.

[25] TPC. Active TPC Benchmarks.