<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Comparative Performance Analysis of Data-intensive Applications for IBM Minsky and Newell Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilya Afanasyev</string-name>
          <email>ilya@icloud.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lomonosov Moscow State University</institution>
          ,
          <country>Russia</country>
        </aff>
      </contrib-group>
      <fpage>40</fpage>
      <lpage>49</lpage>
      <abstract>
        <p>The main goal of this work is to evaluate the performance differences of various data-intensive applications for IBM Minsky and Newell systems. As an important example of data-intensive applications, several fundamental graph algorithms have been selected. Following already existing implementation approaches, efficient implementations of the selected graph algorithms have been developed. The measured performance has been compared between two generations of IBM Power systems, demonstrating 1.5-4 times better performance on the Newell system. In addition, roofline models have been generated for both Power and NVidia implementations, which allowed us to compare the efficiency of the developed implementations on both systems. Based on the results demonstrated in the paper, we conclude that, starting from well-optimised code for the Minsky system, it is relatively easy to obtain significant acceleration with high efficiency on the Newell system, at least for the described class of graph algorithms.</p>
      </abstract>
      <kwd-group>
        <kwd>IBM Power</kwd>
        <kwd>NVidia GPU</kwd>
        <kwd>Data-intensive applications</kwd>
        <kwd>Roofline model</kwd>
        <kwd>Graph algorithms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>New generations of various architectures, systems and platforms appear almost
every six months in modern supercomputing. For many supercomputer users and
software developers, the question of whether it is worth switching to the newer
generation of the platform they are working with is extremely relevant. This issue
is no less relevant for the management of supercomputer centres, who are usually
responsible for hardware updates. Answering each of the following questions can be
very important: what acceleration can be achieved by using the newer generation
of the system? Is it possible to use the newer generation of the system without
significantly changing the program source code? How much effort must one spend
to reach the peak performance capabilities of the new platform?</p>
      <p>Thus, it is very important to study differences between various
implementations on various system generations. One group of data-intensive algorithms
is particularly important and challenging for any computational platform: even when
properly implemented, graph algorithms tend to stress the memory subsystem of the
target platform, demonstrating low performance, high latency and low data locality,
paired with poor cache usage. Graph algorithms represent data-intensive
applications extremely well, since they tend to perform an enormous amount of random
memory accesses, paired with low computational complexity. Moreover, graph
processing is extremely relevant nowadays, since it includes such important
real-world applications as web graph and social network processing.</p>
      <p>IBM Power systems provide an important example of modern
high-performance shared-memory platforms. IBM Power systems are developed in close
cooperation with the NVidia company, which leads to their native integration with
modern NVidia GPUs (for example, via NVLink technology). Consequently,
these systems demonstrate very high performance on various groups of
problems, including computational algebra and deep learning. However,
data-intensive applications are much less studied, despite the fact that IBM platforms
have all the necessary features to execute these applications efficiently. In this
paper we review the two latest generations of IBM Power systems: the
Power S822LC system with codename "Minsky" and the Power AC922 system with
codename "Newell". Using the developed efficient implementations of various
fundamental graph algorithms, we evaluate the performance and
efficiency differences between these two generations of Power systems.</p>
    </sec>
    <sec id="sec-2">
      <title>Target Platforms</title>
      <p>In this paper we review the two latest generations of IBM Power systems. The first
is the Power S822LC system (with codename "Minsky"), which contains two
Power8 CPUs with between eight and ten cores each (10 cores per socket in our
configuration). Each Power8 core is optimised to run up to 8 threads,
which results in 160 threads per system. The frequency of the 12-core
Power8 processor is 4.1 GHz, with a sustainable peak performance of
395 GFLOPs. The "Minsky" platform can be equipped with up to four NVidia Tesla P100
GPUs (2 in the configuration available to us). The P100 GPU is equipped with
3584 lightweight cores, each with a frequency of 1.1 GHz, resulting in
9.3 TFLOPs of single-precision performance. The P100 GPU is also equipped with
12 or 16 GB of device memory with 549 or 732 GB/s peak bandwidth, respectively.
The main feature of the Power8 processor is support for the NVLink bus technology,
which allows up to four GPU devices to be connected directly to the chip, leading to
2.8 times faster communication between host and device compared to traditional
PCI configurations.</p>
      <p>The second generation we review in this paper is the Power AC922 system (with
codename "Newell"), which contains two Power9 CPUs with between 12 and 24 cores
each (16 cores per socket on our system). Each core is optimised to run up to 4 threads,
which results in 128 threads per system. The frequency of the
16-core Power9 processor is 4 GHz, with a sustainable peak performance
of 592 GFLOPs. The platform may also be equipped with 2, 4 or 6 NVidia
Tesla V100 GPUs (4 GPUs in the configuration available to us).</p>
      <p>The V100 GPU of the Volta architecture delivers considerably more
performance: it is equipped with 5120 CUDA cores and 640 Tensor cores (special cores
optimised for deep-learning problems), which results in 15.7 TFLOPs of single-precision
and 7.8 TFLOPs of double-precision performance. Moreover, the Volta
architecture adds many new features compared to the Pascal architecture family:
it introduces a new combined L1 data cache and shared memory subsystem,
which significantly improves performance while also simplifying programming.
Another important improvement is the increased memory bandwidth: the
V100 provides 16 GB of HBM2 memory with 900 GB/s peak memory bandwidth
(1.5x the delivered memory bandwidth of the Pascal GP100, and greater than 95%
memory bandwidth efficiency on many workloads). The interconnect is also
improved and based on the second generation of NVLink (version 2.0), which now
supports CPU mastering and cache coherence capabilities with IBM Power9
CPU-based servers.</p>
      <p>In the memory subsystem, Power8 processors have 3 levels of cache memory:
64 KB of L1 data cache, 512 KB of L2 cache and 8 MB of L3 cache per chip.
Power9 processors have a smaller 32 KB L1 data cache, but a much larger 120 MB
L3 cache per chip.</p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>In order to comprehensively evaluate the performance differences between the
reviewed systems on a class of data-intensive applications, we discuss
three fundamentally different types of implementations: CPU-only, GPU-only
and hybrid. All these types of implementations are suitable for execution
on the target IBM Power systems in different circumstances. For example, when
processing large-scale graphs on GPUs in out-of-core mode (when the graph cannot
be fully placed in device memory), it can be more beneficial to utilise multiple
GPUs using unified memory technology, while for smaller graphs it can be more
efficient to use only a single GPU to avoid unnecessary communications. In this
paper we select 4 fundamental graph-processing problems and corresponding
algorithms, which allow those problems to be solved efficiently on both Power and
NVidia platforms. Where possible, hybrid implementations are developed.</p>
    </sec>
    <sec id="sec-4">
      <title>Problems, Algorithms and Input Data</title>
      <p>
        Thus, we review 4 fundamental graph-processing problems: the single source shortest
paths (SSSP), breadth-first search (BFS), strongly connected components (SCC)
and minimum spanning tree (MST) problems. The SSSP problem implies finding the
shortest paths in an undirected weighted graph from a selected source vertex
to all other vertices. The SCC problem implies partitioning a directed unweighted
graph into disjoint sets of vertices, each representing a strongly connected
component (a group of vertices where each vertex is reachable from every other). The
MST problem implies finding a subset of the edges that connects all the vertices
together, without any cycles and with the minimum possible total edge weight.
The BFS problem implies exploring the whole graph in the following manner:
visiting all of the neighbour nodes at the present depth prior to moving on to
the nodes at the next depth level. In order to solve these problems efficiently
on the target architectures, several fundamental algorithms have been selected. In
the next paragraph, we provide the list of these algorithms together with a brief
description of their distinctive features and possible existing approaches to creating
efficient implementations for both traditional CPUs and NVidia GPUs.
1. In order to solve the SSSP problem, the Bellman-Ford algorithm [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] is selected. From the
computational point of view, the Bellman-Ford algorithm requires both floating-point
and integer arithmetic together with frequent indirect memory
accesses (see the sketch after this list).
2. In order to solve the SCC problem, the Forward-Backward algorithm with a Trim
step [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] is selected. This algorithm requires only integer arithmetic with
frequent indirect memory accesses, and has a more complex nested
parallelism potential.
3. In order to solve the BFS problem, the Direction-Optimising algorithm is used [
        <xref ref-type="bibr" rid="ref5 ref6">5,
6</xref>
        ]. This algorithm has the smallest operation-per-byte ratio, with only
integer arithmetic required. It also has the smallest ratio of computational
complexity to the amount of bytes transferred through the NVLink bus, since it
has a linear-time complexity implementation.
4. Boruvka's algorithm is selected for solving the MST problem [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. This
algorithm requires the use of atomic operations, which can significantly hurt
performance on certain architectures.
      </p>
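      <p>To make the computational profile of the first algorithm concrete, below is a minimal sketch of an edge-list Bellman-Ford relaxation loop (illustrative C++/OpenMP code written for this discussion; the type and function names are ours, not the exact implementation evaluated in this paper):</p>
      <preformat><![CDATA[
// Minimal Bellman-Ford sketch over an edge list. Each sweep streams the
// edge list: floating-point compares plus indirect loads/stores into
// dist[] -- exactly the memory-bound pattern discussed above.
#include <vector>
#include <limits>

struct Edge { int src, dst; float weight; };

std::vector<float> bellman_ford(const std::vector<Edge> &edges,
                                int vertex_count, int source)
{
    std::vector<float> dist(vertex_count,
                            std::numeric_limits<float>::infinity());
    dist[source] = 0.0f;

    for (int sweep = 0; sweep < vertex_count - 1; ++sweep) {
        bool changed = false;
        #pragma omp parallel for schedule(guided, 1024) reduction(||:changed)
        for (long long e = 0; e < (long long)edges.size(); ++e) {
            float candidate = dist[edges[e].src] + edges[e].weight;
            if (candidate < dist[edges[e].dst]) {
                // Simplification: a production version would relax with
                // an atomic minimum to avoid the concurrent-write race.
                dist[edges[e].dst] = candidate;
                changed = true;
            }
        }
        if (!changed) break;   // converged early
    }
    return dist;
}
]]></preformat>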
      <p>
        According to the described implementation approaches, for each of the
selected algorithms an efficient implementation has been developed for both Power
and NVidia architectures, with a hybrid generalisation where it is necessary and
possible (for large-scale graph processing). We also used a variety of both
synthetic and real-world graphs as the input data set. For synthetic graphs, RMAT [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
and SSCA-2 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] graphs have been used. For real-world graphs, we used various
road, social network and web graphs, all available in the following dataset [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Implementations Details</title>
      <p>In order to compare the performance results between the two generations of IBM
Power Systems, we first had to develop efficient implementations for the Power8 &amp;
Pascal architectures. To achieve efficient parallelisation and vectorisation inside
a single Power socket, several OpenMP directives have been inserted. The #pragma omp
parallel for OpenMP directive has been used with the schedule(guided,1024) OpenMP
clause, in order to reduce the overhead of synchronisations between threads.
Moreover, for the Power implementations, vectorisation has been used, which was achieved
with the -O3 -qsimd=auto -qhot -qarch=pwr8 -qtune=pwr8 -qpdf2 compiler flags.
For the Volta and Pascal implementations, various optimisations, including texture cache
usage, have also been applied.</p>
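      <p>As an illustration of this parallelisation pattern, a typical gather-style graph kernel is sketched below (our own simplified code, assuming a CSR-like layout; the guided schedule with a 1024-iteration chunk amortises scheduling synchronisation over the irregular, degree-dependent inner loop):</p>
      <preformat><![CDATA[
// Build (IBM XL, assumed invocation; -qsmp=omp enables OpenMP on top of
// the optimisation flags listed above):
//   xlc++_r -O3 -qsimd=auto -qhot -qarch=pwr8 -qtune=pwr8 -qsmp=omp kernel.cpp

// Gather one value per incident edge of every vertex (CSR layout).
void gather(const long long *offsets, const int *adjacent,
            const float *data, float *result, int vertex_count)
{
    #pragma omp parallel for schedule(guided, 1024)
    for (int v = 0; v < vertex_count; ++v) {
        float sum = 0.0f;
        // Indirect (gather) accesses: the dominant memory pattern
        // of the studied graph kernels.
        for (long long e = offsets[v]; e < offsets[v + 1]; ++e)
            sum += data[adjacent[e]];
        result[v] = sum;
    }
}
]]></preformat>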
      <p>Moreover, for both Power and NVidia architectures the following sorting
strategy has been used in order to improve data locality: the input graph edges have
been sorted in such a way that edges stored in adjacent memory cells point
to adjacent cells of the reachability (or distances) arrays. This reordering makes
gather and scatter vector operations much more efficient for adjacent edges
(since information is gathered from adjacent memory cells of the distances or
reachability array). It is also possible to remove loops and multiple arcs during
this pre-processing stage. The proposed optimisation of storage formats provided
up to an almost 10x performance improvement compared to the implementation
with randomly ordered edges of the input graph.</p>
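      <p>A sketch of this pre-processing step is shown below (reusing the illustrative Edge structure from the SSSP sketch above; the exact sorting key of the evaluated code may differ):</p>
      <preformat><![CDATA[
// Reorder edges so that edges adjacent in memory point to adjacent cells
// of the distances/reachability array, then drop loops and multiple arcs.
#include <algorithm>
#include <vector>

void preprocess(std::vector<Edge> &edges)
{
    // Self-loops can be dropped during the same pass.
    edges.erase(std::remove_if(edges.begin(), edges.end(),
                [](const Edge &e) { return e.src == e.dst; }),
                edges.end());

    // After this sort, consecutive edges gather from consecutive (or
    // nearby) elements of the distances array, so vector gather/scatter
    // operations touch far fewer cache lines per instruction.
    std::sort(edges.begin(), edges.end(),
              [](const Edge &a, const Edge &b)
              { return a.dst != b.dst ? a.dst < b.dst : a.src < b.src; });

    // Multiple arcs are now adjacent and can be removed in one pass.
    edges.erase(std::unique(edges.begin(), edges.end(),
                [](const Edge &a, const Edge &b)
                { return a.src == b.src && a.dst == b.dst; }),
                edges.end());
}
]]></preformat>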
      <p>As a result of the described optimisations, the developed implementations
demonstrated high performance on the Power8 &amp; Pascal architectures. In order
to optimise the developed implementations for the Power9 &amp; Volta architectures,
only minor changes had to be made: most importantly, the graph edge reordering had
to be adjusted in accordance with the cache sizes of the new platforms.</p>
    </sec>
    <sec id="sec-6">
      <title>Performance Evaluation</title>
      <p>The acceleration achieved between the Minsky and Newell systems is demonstrated
in Fig. 1. The acceleration is calculated as the ratio of execution times on the
same input graphs, measured in three different modes: Power9 to Power8
acceleration, and V100 to P100 acceleration with and without data copy time included
in the time measurement. When data copy time is not included, the input graphs
are assumed to be already placed inside device memory. When data copy
time is included, the input graph has to be copied from host to device
memory through NVLink, and the time required for these copies is included in the
measured execution time.</p>
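      <p>The two GPU timing modes can be sketched as follows (CUDA runtime API host code; the buffer names and the commented-out kernel launch are illustrative placeholders, not the paper's code):</p>
      <preformat><![CDATA[
#include <cuda_runtime.h>

// Measure one run either with or without the host-to-device copy
// (performed over NVLink) included in the timed region.
float run_ms(bool include_copy, const void *host_graph, void *dev_graph,
             size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    if (!include_copy)  // "graph already resides in device memory" mode
        cudaMemcpy(dev_graph, host_graph, bytes, cudaMemcpyHostToDevice);

    cudaEventRecord(start);
    if (include_copy)   // copy time counted in the measurement
        cudaMemcpy(dev_graph, host_graph, bytes, cudaMemcpyHostToDevice);
    // graph_kernel<<<blocks, threads>>>(dev_graph);  // placeholder launch
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
]]></preformat>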
      <p>From Fig. 1 it is possible to highlight several trends.
1. Power9 implementations demonstrate higher performance than Power8 ones
on small and medium graphs due to the larger L3 cache, when it
is possible to fully place the arrays with indirect accesses into the L3 cache.
2. Hybrid implementations for large graphs (L-RMAT) demonstrate better
performance on the Newell system due to the higher bandwidth of NVLink 2.0 used as
the interconnect.
3. On very small graphs (for example, small-scale RMAT graphs with a size
of less than 50 MB) V100 implementations demonstrate even lower performance
than P100 implementations, due to the lack of a sufficient amount of
parallelism.</p>
      <p>
        Efficiency evaluation. For the efficiency analysis of the developed implementations, the roofline model
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is used. The roofline model is a visual performance model used to provide
performance estimates of given compute-bound or memory-bound kernels or
applications running on both multi-core and accelerator architectures, by showing
hardware limitations based on peak performance and peak memory bandwidth.
The roofline model provides insights into the system performance capabilities,
shows inherent hardware limitations, and indicates the potential benefit and
priority of optimisations of target applications. In this work the roofline model
is used to evaluate the efficiency of both CPU-only and GPU-only
implementations relative to the peak performance capabilities of the target system.
      </p>
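      <p>In its standard form (generic roofline notation, not specific to this paper), the bound on attainable performance for a kernel with operational intensity I (operations per byte of memory traffic) is
        <disp-formula>
          <tex-math notation="LaTeX">P(I) = \min\left(P_{\mathrm{peak}},\; I \cdot B_{\mathrm{peak}}\right)</tex-math>
        </disp-formula>
        so kernels whose intensity lies to the left of the ridge point P_peak / B_peak are memory-bound, which is the regime the graph kernels studied here fall into.</p>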
      <p>
        There is also an extension of the roofline model called the cache-aware roofline model
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a novel approach that provides more insightful performance modelling of
modern architectures by introducing cache-awareness, thus significantly
improving the guidelines for application optimisation.
      </p>
      <p>
        In order to create a cache-aware roofline model for both Power and NVidia
architectures, peak integer performance, peak memory bandwidth and the peak
bandwidths of all cache levels have to be benchmarked. The peak metrics for
Power systems can be obtained using the [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] benchmark, while the peak metrics for
NVidia architectures can be obtained using [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The collected benchmark
results are shown in Table 1.
In addition to benchmarking peak performance and bandwidths, for each
application or kernel of interest the number of integer operations and the
operation-per-byte ratio have to be collected. To obtain these values on the NVidia
platform, nvprof is used for counting both the bytes requested and the integer
instructions executed. The amount of bytes requested for Power processors can
be obtained using the Linux Perf or PAPI utilities, by measuring specific
memory-related hardware events. However, it is quite problematic to count
the amount of integer operations on modern central processors, since both
modern IBM and Intel processors lack hardware events responsible for the number
of executed integer or floating-point operations. As a result, we had to directly
estimate an approximate amount of operations from the source code of the program. We
understand that this is not a very reliable method, since in complex algorithms (such
as Boruvka's algorithm) it can be very problematic to estimate the number of
integer instructions executed. Thus, the development of a more general method
for creating roofline models is one of the priority areas for our further research.
      </p>
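      <p>Turning the collected counters into roofline coordinates is then straightforward; a small sketch is given below (the inputs are assumed to come from nvprof, Perf/PAPI or source-code counting as described above):</p>
      <preformat><![CDATA[
#include <algorithm>

// One application/kernel measurement placed on the roofline.
struct RooflinePoint {
    double intensity;   // operations per byte
    double achieved;    // measured performance, GOPS
    double bound;       // roofline ceiling at this intensity, GOPS
    double efficiency;  // achieved / bound -- the metric compared in Fig. 3
};

RooflinePoint roofline_point(double ops, double bytes, double seconds,
                             double peak_gops, double peak_gb_per_s)
{
    RooflinePoint p;
    p.intensity  = ops / bytes;
    p.achieved   = ops / seconds / 1e9;                       // measured GOPS
    p.bound      = std::min(peak_gops, p.intensity * peak_gb_per_s);
    p.efficiency = p.achieved / p.bound;
    return p;
}
]]></preformat>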
      <p>Based on the data provided in Table 1, roofline models can be obtained for both
P100/V100 GPUs and P8/P9 processors for 3 of the reviewed graph-processing
problems (the MST problem is currently excluded due to the difficulties described
above). The obtained sample roofline models for the Pascal GPU architecture for
several problems and various input graphs are demonstrated in Fig. 2. The provided
rooflines indicate that the investigated implementations are in general
memory-bound, bottlenecked somewhere between the L1/shared memory and
texture cache roofs, which corresponds to 40-90% of peak performance.</p>
      <p>Data from the generated rooflines can be used to compare the achieved
efficiency of various implementations and input graphs between the V100/P100 and
Power8/Power9 target platforms. This comparison is presented in Fig. 3. Fig. 3
indicates that on average the efficiency between the two reviewed platforms
remains relatively unchanged for the given class of algorithms, even in the absence of
platform-specific optimisations. This leads us to the conclusion that many graph
algorithms can be effectively ported from Minsky to Newell systems without
significant effort being required for algorithm and code modifications.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this paper, a comparative analysis of the acceleration and the overall
efficiency of several implementations of fundamental graph-processing algorithms
has been provided for the IBM Minsky and Newell systems. On various types of
input graphs, the implementations developed for the Newell system demonstrate up
to 6 times acceleration compared to the previous generation of GPU architecture
(Pascal), and up to 12x acceleration compared to the previous generation of Power
architecture (P8).</p>
      <p>In order to evaluate the efficiency of the developed implementations, the
cache-aware roofline model has been utilised for a significant part of the algorithms
and input data. Analysing the absence of changes in roofline efficiency between
the reviewed platforms, it is possible to conclude that the transition from
the Minsky to the Newell architecture does not require complex modifications of the source
code (at least for the reviewed algorithms and implementations), and can be
performed relatively easily for a variety of applications, while obtaining significant
acceleration.</p>
      <p>Future work includes a more detailed performance analysis using other
performance models (which will allow hybrid implementations to be covered), finding a
solution to the described difficulties with the construction of the CPU cache-aware
roofline model, and extending the sets of reviewed problems, algorithms and
input data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Nepomniaschaya</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          (
          <year>2001</year>
          , September).
          <article-title>An associative version of the Bellman-Ford algorithm for finding the shortest paths in directed graphs</article-title>
          .
          <source>In International Conference on Parallel Computing Technologies</source>
          (pp.
          <fpage>285</fpage>
          -
          <lpage>292</lpage>
          ). Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Jeong</surname>
            ,
            <given-names>I. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uddin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>C. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Accelerating a Bellman-Ford Routing Algorithm Using GPU</article-title>
          .
          <source>In Frontier and Innovation in Future Computing and Communications</source>
          (pp.
          <fpage>153</fpage>
          -
          <lpage>160</lpage>
          ). Springer, Dordrecht.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fleischer</surname>
            ,
            <given-names>L. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendrickson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pınar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2000</year>
          , May).
          <article-title>On identifying strongly connected components in parallel</article-title>
          .
          <source>In International Parallel and Distributed Processing Symposium</source>
          (pp.
          <fpage>505</fpage>
          -
          <lpage>511</lpage>
          ). Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Barnat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauch</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brim</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceska</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2011</year>
          , May).
          <article-title>Computing strongly connected components in parallel on CUDA</article-title>
          .
          <source>In 2011 IEEE International Parallel &amp; Distributed Processing Symposium</source>
          (pp.
          <fpage>544</fpage>
          -
          <lpage>555</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Beamer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asanović</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patterson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Direction-optimizing breadth-first search</article-title>
          .
          <source>Scientific Programming</source>
          ,
          <volume>21</volume>
          (
          <issue>3-4</issue>
          ),
          <fpage>137</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dou</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2013</year>
          , November).
          <article-title>Direction-Optimizing Breadth-First Search on CPU-GPU Heterogeneous Platforms</article-title>
          .
          <source>In HPCC/EUC</source>
          (pp.
          <fpage>1064</fpage>
          -
          <lpage>1069</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cong</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bader</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          (
          <year>2007</year>
          , August).
          <article-title>Techniques for designing efficient parallel graph algorithms for SMPs and multicore processors</article-title>
          .
          <source>In International Symposium on Parallel and Distributed Processing and Applications</source>
          (pp.
          <fpage>137</fpage>
          -
          <lpage>147</lpage>
          ). Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Vineet</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harish</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patidar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narayanan</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          (
          <year>2009</year>
          , August).
          <article-title>Fast minimum spanning tree for large graphs on the GPU</article-title>
          .
          <source>In Proceedings of the Conference on High Performance Graphics</source>
          <year>2009</year>
          (pp.
          <fpage>167</fpage>
          -
          <lpage>171</lpage>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>G.</given-names>
            <surname>Ofenbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Steinmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Caparros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Spampinato</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Püschel</surname>
          </string-name>
          ,
          <article-title>"Applying the roofline model,"</article-title>
          <source>2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)</source>
          , Monterey, CA,
          <year>2014</year>
          , pp.
          <fpage>76</fpage>
          -
          <lpage>85</lpage>
          . doi: 10.1109/ISPASS.2014.6844463
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A.</given-names>
            <surname>Ilic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pratas</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Sousa</surname>
          </string-name>
          ,
          <article-title>"Cache-aware Roo ine model: Upgrading the loft," in IEEE Computer Architecture Letters</article-title>
          , vol.
          <volume>13</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>24</lpage>
          , Jan.-June 2014. doi: 10.1109/L-CA.2013.6
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Empirical Roofline Tool, https://bitbucket.org/berkeleylab/cs-roofline-toolkit.
          <source>Last accessed 4 Sep 2018</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Konstantinidis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cotronis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <article-title>"A quantitative performance evaluation of fast on-chip memories of GPUs"</article-title>
          ,
          <source>24th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)</source>
          , Heraklion, Crete, Greece, pp.
          <fpage>448</fpage>
          -
          <lpage>455</lpage>
          ,
          <year>2016</year>
          . doi: 10.1109/PDP.2016.56
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Stanford Large Network Dataset Collection - SNAP: Stanford, https://snap.stanford.edu/data/.
          <source>Last accessed 3 Sep 2018</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faloutsos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2004</year>
          , April).
          <article-title>R-MAT: A recursive model for graph mining</article-title>
          .
          <source>In Proceedings of the 2004 SIAM International Conference on Data Mining</source>
          (pp.
          <fpage>442</fpage>
          -
          <lpage>446</lpage>
          ).
          <source>Society for Industrial and Applied Mathematics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Bader</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madduri</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Gtgraph: A synthetic graph generator suite</article-title>
          . Atlanta, GA, February.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>