The Comparative Performance Analysis of Data-intensive Applications for IBM Minsky and Newell Systems*

Ilya Afanasyev
Lomonosov Moscow State University, Russia
afanasiev ilya@icloud.com

* The results were obtained in Lomonosov Moscow State University with the financial support of the Russian Science Foundation (agreement N 14-11-00190).

Abstract. The main goal of this work is to evaluate the performance differences of various data-intensive applications on the IBM Minsky and Newell systems. Several fundamental graph algorithms have been selected as an important example of data-intensive applications. Following existing implementation approaches, efficient implementations of the selected graph algorithms have been developed. The measured performance has been compared between the two generations of IBM Power systems, demonstrating 1.5-4 times better performance on the Newell system. In addition, roofline models have been generated for both the Power and NVidia implementations, which allowed us to compare the efficiency of the developed implementations on both systems. Based on the results presented in the paper, we conclude that with well-optimised code for the Minsky system it is relatively easy to obtain significant acceleration with high efficiency on the Newell system, at least for the described class of graph algorithms.

Keywords: IBM Power · NVidia GPU · Graph algorithms · Data-intensive applications · Roofline model

1 Introduction

New generations of various architectures, systems and platforms appear roughly every six months in modern supercomputing. For many supercomputer users and software developers, the question of whether it is worth moving to the newer generation of the platform they are working with is extremely relevant. This issue is no less relevant for the management of supercomputer centres, who are usually responsible for hardware updates. Answering each of the following questions can be very important: what acceleration can be achieved by using the newer generation of the system? Is it possible to use the newer generation without significant changes to the program source code? How much effort does one have to spend to reach the peak performance capabilities of the new platform?

Thus, it is very important to study the differences between implementations on different system generations. The group of data-intensive algorithms is important and challenging for any computational platform: even when properly implemented, graph algorithms tend to stress the memory subsystem of the target platform, demonstrating low performance, high latency and low data locality, paired with poor cache usage. Graph algorithms represent data-intensive applications extremely well, since they tend to perform an enormous number of random memory accesses paired with low computational complexity. Moreover, graph processing is highly relevant nowadays, since it includes such important real-world applications as web-graph and social-network processing.

IBM Power systems provide an important example of modern high-performance shared-memory platforms. They are developed in close cooperation with NVidia, which leads to their native integration with modern NVidia GPUs (for example, via NVLink technology). Consequently, these systems demonstrate very high performance on various groups of problems, including computational algebra and deep learning.
However, data-intensive applications are much less studied, despite the fact that IBM platforms have all the necessary features to execute such applications efficiently. In this paper we review the two latest generations of IBM Power systems: the Power S822LC system (codename "Minsky") and the Power AC922 system (codename "Newell"). Using the developed efficient implementations of several fundamental graph algorithms, we evaluate the performance and efficiency difference between these two generations of the Power system.

2 Target Platforms

In this paper we review the two latest generations of IBM systems. The first one is the Power S822LC system (codename "Minsky"), which contains two Power8 CPUs with eight or ten cores each (10 cores per socket in our configuration). Each Power8 core can run up to 8 hardware threads, which results in 160 threads per system. The frequency of the 10-core Power8 processor is 4.1 GHz, with a sustainable peak performance of 395 GFLOP/s. The "Minsky" platform can be equipped with up to four NVidia Tesla P100 GPUs (2 in the configuration available to us). The P100 GPU has 3584 lightweight cores with a frequency of about 1.1 GHz, resulting in 9.3 TFLOP/s of single-precision performance. The P100 GPU is equipped with 12 or 16 GB of device memory with 549 or 732 GB/s peak bandwidth, respectively. The main feature of the Power8 processor is support for the NVLink bus, which allows connecting up to four GPU devices directly to the chip, leading to 2.8 times faster host-device communication compared to traditional PCIe configurations.

The second generation reviewed in this paper is the Power AC922 system (codename "Newell"), which contains two Power9 CPUs with 12 to 24 cores each (16 cores per socket in our system). Each core can run up to 4 hardware threads, which results in 128 threads per system. The frequency of the 16-core Power9 processor is 4 GHz, with a sustainable peak performance of 592 GFLOP/s. The platform may be equipped with 2, 4 or 6 NVidia Tesla V100 GPUs (4 GPUs in the configuration available to us).

The V100 GPU of the Volta architecture delivers considerably more performance: it is equipped with 5120 CUDA cores and 640 Tensor cores (special cores optimised for deep-learning problems), which results in 15.7 TFLOP/s single-precision and 7.8 TFLOP/s double-precision performance. Moreover, the Volta architecture adds many new features compared to the Pascal architecture family: it introduces a new combined L1 data cache and shared memory subsystem, which significantly improves performance while also simplifying programming. Another important improvement is the increased memory bandwidth: the V100 provides 16 GB of HBM2 memory with 900 GB/s peak memory bandwidth (1.5x the delivered memory bandwidth of the Pascal GP100, with greater than 95% memory bandwidth efficiency on many workloads). The interconnect is also improved: it is based on the second generation of NVLink (2.0), which supports CPU mastering and cache coherence with IBM Power9-based servers.

In the memory subsystem, Power8 processors have 3 levels of cache: 64 KB of L1 data cache and 512 KB of L2 cache per core, and 8 MB of L3 cache per core. Power9 processors have a smaller 32 KB L1 data cache, but a much larger L3 cache of 120 MB per chip.
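As a quick sanity check of the quoted GPU peak figures (a back-of-the-envelope estimate we add here; the boost clocks of roughly 1.53 GHz for the V100 and 1.3 GHz for the P100 are our assumptions and are not stated in the paper), single-precision peak performance can be reconstructed from the core count, the two floating-point operations of a fused multiply-add, and the boost clock:

$$ P_{\text{V100}} \approx 5120 \cdot 2 \cdot 1.53\ \text{GHz} \approx 15.7\ \text{TFLOP/s}, \qquad P_{\text{P100}} \approx 3584 \cdot 2 \cdot 1.3\ \text{GHz} \approx 9.3\ \text{TFLOP/s}. $$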
3 Method

In order to comprehensively evaluate the performance differences between the reviewed systems on a class of data-intensive applications, we discuss three fundamentally different types of implementations: CPU-only, GPU-only and hybrid. All these types of implementations are suitable for execution on the target IBM Power systems under different circumstances. For example, when processing large-scale graphs on GPUs in out-of-core mode (when the graph cannot be fully placed in device memory), it can be more beneficial to utilise multiple GPUs using unified memory technology, while for smaller graphs it can be more efficient to use a single GPU in order to avoid unnecessary communication. In this paper we select four fundamental graph-processing problems and the corresponding algorithms, which allow solving these problems efficiently on both Power and NVidia platforms. Where possible, hybrid implementations are developed.

4 Problems, Algorithms and Input Data

We review four fundamental graph-processing problems: single-source shortest paths (SSSP), breadth-first search (BFS), strongly connected components (SCC) and minimum spanning tree (MST). The SSSP problem implies finding the shortest paths in an undirected weighted graph from a selected source vertex to all other vertices. The SCC problem implies partitioning a directed unweighted graph into disjoint sets of vertices, each representing a strongly connected component (a group of vertices in which every vertex is reachable from every other). The MST problem implies finding a subset of the edges that connects all the vertices together, without any cycles and with the minimum possible total edge weight. The BFS problem implies exploring the whole graph in the following manner: visiting all neighbour nodes at the present depth prior to moving on to the nodes at the next depth level.

In order to solve these problems efficiently on the target architectures, several fundamental algorithms have been selected. Below we list these algorithms together with a brief description of their distinctive features and existing approaches to creating efficient implementations for both traditional CPUs and NVidia GPUs.

1. In order to solve the SSSP problem, the Bellman-Ford algorithm [1, 2] is selected. From the computational point of view, the Bellman-Ford algorithm requires both floating-point and integer arithmetic together with frequent indirect memory accesses (a minimal sketch of this relaxation pattern is given after this list).
2. In order to solve the SCC problem, the Forward-Backward algorithm with a Trim step [3, 4] is selected. This algorithm requires only integer arithmetic with frequent indirect memory accesses, and has more complex nested parallelism.
3. In order to solve the BFS problem, the direction-optimising algorithm [5, 6] is used. This algorithm has the smallest operations-per-byte ratio, with only integer arithmetic required. It also has the smallest ratio of computational complexity to the amount of data transferred through the NVLink bus, since it admits a linear-time implementation.
4. Boruvka's algorithm is selected for solving the MST problem [7, 8]. This algorithm requires atomic operations, which may significantly hurt performance on certain architectures.
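The following sketch illustrates the edge-centric relaxation pattern at the core of the Bellman-Ford implementations discussed above. It is our own minimal illustration, not the paper's code: the Edge layout, the function names and the use of an atomic compare-and-swap "min" are assumptions made for the sake of a self-contained example.

    // bellman_ford_sketch.cpp - illustrative only, not the authors' implementation
    #include <atomic>
    #include <cstdint>
    #include <limits>
    #include <vector>

    struct Edge { int src, dst; float weight; };   // graph stored as a flat edge list

    std::vector<float> bellman_ford(const std::vector<Edge>& edges,
                                    int num_vertices, int source)
    {
        const float INF = std::numeric_limits<float>::infinity();
        std::vector<std::atomic<float>> dist(num_vertices);
        for (auto& d : dist) d.store(INF, std::memory_order_relaxed);
        dist[source].store(0.0f, std::memory_order_relaxed);

        // At most |V|-1 rounds; within a round every edge is relaxed in parallel.
        for (int round = 0; round < num_vertices - 1; ++round) {
            bool changed = false;
            #pragma omp parallel for schedule(guided, 1024) reduction(||:changed)
            for (std::int64_t e = 0; e < static_cast<std::int64_t>(edges.size()); ++e) {
                const float du = dist[edges[e].src].load(std::memory_order_relaxed);
                if (du == INF) continue;                    // source side not reached yet
                const float cand = du + edges[e].weight;    // floating-point work
                float cur = dist[edges[e].dst].load(std::memory_order_relaxed);
                while (cand < cur) {                        // atomic "min" via compare-and-swap
                    if (dist[edges[e].dst].compare_exchange_weak(
                            cur, cand, std::memory_order_relaxed)) {
                        changed = true;                     // this thread lowered dist[dst]
                        break;
                    }
                }
            }
            if (!changed) break;                            // distances converged early
        }
        return std::vector<float>(dist.begin(), dist.end()); // copy out of the atomics
    }

Note the indirect accesses dist[edges[e].src] and dist[edges[e].dst]: they are what makes such kernels memory-bound, and they are the accesses the edge reordering described in Section 5 tries to make more local.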
According to the described implementation approaches, for each of the selected algorithms an efficient implementation has been developed for both Power and NVidia architectures, with a hybrid generalisation where it is necessary and possible (for large-scale graph processing). We also used a variety of both synthetic and real-world graphs as the input data set. For synthetic graphs, RMAT [14] and SSCA-2 [15] graphs have been used. For real-world graphs, we used various road, social network and web graphs, all available in the SNAP dataset collection [13].

5 Implementation Details

In order to compare the performance results between the two generations of IBM Power systems, we first had to develop efficient implementations for the Power8 & Pascal architectures. To achieve efficient parallelisation and vectorisation inside a single Power socket, several OpenMP directives have been inserted. The #pragma omp parallel for directive has been used with the schedule(guided, 1024) clause in order to reduce thread synchronisation overheads. Moreover, for the Power implementations vectorisation has been used, which was achieved with the -O3 -qsimd=auto -qhot -qarch=pwr8 -qtune=pwr8 -qpdf2 compiler flags. For the Pascal and Volta implementations, various optimisations, including texture cache usage, have also been applied.

Moreover, for both Power and NVidia architectures the following sorting strategy has been used to improve data locality: the input graph edges have been sorted so that edges stored in adjacent memory cells point to adjacent cells of the reachability (or distances) arrays. This reordering makes gather and scatter vector operations much more efficient for adjacent edges, since information is gathered from adjacent memory cells of the distances or reachability array. Loops and multiple arcs can also be removed during this pre-processing stage. The proposed storage-format optimisation provided up to a 10x performance improvement compared to an implementation with randomly ordered edges (a sketch of one possible realisation of this reordering is given at the end of this section).

As a result of the described optimisations, the developed implementations demonstrated high performance on the Power8 & Pascal architectures. In order to optimise the developed implementations for the Power9 & Volta architectures, only minor changes had to be made: most importantly, the graph edge reordering had to be adjusted to the cache sizes of the new platforms.
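The paper does not show the reordering code itself; the following sketch is our own illustration of how such a locality-oriented reordering and clean-up pass could look (the Edge structure, the function name and the choice of the destination vertex as the sort key are assumptions, not the authors' implementation, which additionally takes cache sizes into account).

    // reorder_sketch.cpp - illustrative pre-processing pass, not the authors' code
    #include <algorithm>
    #include <vector>

    struct Edge { int src, dst; float weight; };

    void reorder_and_clean(std::vector<Edge>& edges)
    {
        // Sort edges by the indirectly accessed index (here: the destination vertex),
        // so that neighbouring edges in memory touch neighbouring cells of the
        // distances/reachability array, making gathers and scatters more local.
        std::sort(edges.begin(), edges.end(),
                  [](const Edge& a, const Edge& b) {
                      return (a.dst != b.dst) ? (a.dst < b.dst) : (a.src < b.src);
                  });

        // The same pass can drop self-loops ...
        edges.erase(std::remove_if(edges.begin(), edges.end(),
                                   [](const Edge& e) { return e.src == e.dst; }),
                    edges.end());

        // ... and multiple arcs (here the first arc of each duplicate group is kept;
        // a weighted-graph implementation would rather keep the minimum-weight arc).
        edges.erase(std::unique(edges.begin(), edges.end(),
                                [](const Edge& a, const Edge& b) {
                                    return a.src == b.src && a.dst == b.dst;
                                }),
                    edges.end());
    }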
6 Performance Evaluation

The acceleration achieved between the Minsky and Newell systems is shown in Fig. 1. The acceleration is calculated as the ratio of execution times on the same input graphs, measured in three different modes: Power9 to Power8 acceleration, and V100 to P100 acceleration with and without data copy time included in the measurement. When data copy time is not included, the input graphs are assumed to be already placed in device memory; when it is included, the input graph has to be copied from host to device memory through NVLink, and the time required for these copies is added to the measured execution time.

Fig. 1. The achieved acceleration between Minsky and Newell systems for various graph problems and different input graphs

From Fig. 1 the following trends can be highlighted.

1. Power9 implementations demonstrate higher performance than Power8 ones on small and medium graphs due to the larger L3 cache, which makes it possible to fully place the arrays accessed indirectly into the L3 cache.
2. Hybrid implementations for large graphs (L-RMAT) demonstrate better performance on the Newell system due to the higher bandwidth of NVLink 2.0 used as the interconnect.
3. On very small graphs (for example, small-scale RMAT graphs with a size of less than 50 MB), V100 implementations demonstrate even lower performance than P100 implementations due to the lack of a sufficient amount of parallelism.

7 Efficiency Evaluation

For the efficiency analysis of the developed implementations, the roofline model [9] is used. The roofline model is a visual performance model that provides performance estimates for compute-bound or memory-bound kernels or applications running on multi-core and accelerator architectures by showing hardware limitations based on peak performance and peak memory bandwidth. The roofline model provides insight into the performance capabilities of a system, shows its inherent hardware limitations, and indicates the potential benefit and priority of optimisations of target applications. In this work the roofline model is used to evaluate the efficiency of both the CPU-only and GPU-only implementations relative to the peak performance capabilities of the target system. There is also an extension called the cache-aware roofline model [10], which provides more insightful performance modelling of modern architectures by introducing cache awareness, thus significantly improving the guidelines for application optimisation.

In order to create a cache-aware roofline model for both the Power and NVidia architectures, the peak integer performance, peak memory bandwidth and peak bandwidths of all cache levels have to be benchmarked. The Power systems can be benchmarked using the Empirical Roofline Tool [11], while peak metrics for the NVidia architectures can be obtained using the benchmarks from [12]. The collected results are shown in Table 1.

Table 1. Peak performance/bandwidth characteristics for various components of IBM Power systems

Metric name                        | Power 8   | Power 9   | P100 GPU    | V100 GPU
Peak integer performance           | 76 GIOP/s | 81 GIOP/s | 2242 GIOP/s | 3631 GIOP/s
L1 cache / shared memory bandwidth | 548 GB/s  | 621 GB/s  | 8502 GB/s   | 14917 GB/s
L2 cache bandwidth                 | 235 GB/s  | 249 GB/s  | 1757 GB/s   | 2795 GB/s
L3 cache bandwidth                 | -         | -         | -           | -
Texture cache bandwidth            | -         | -         | 2629 GB/s   | 3932 GB/s
Memory bandwidth                   | 97 GB/s   | 103 GB/s  | 628 GB/s    | 898 GB/s

In addition to benchmarking peak performance and bandwidths, for each application or kernel of interest the number of integer operations and the operations-per-byte ratio have to be collected. On the NVidia platform, nvprof is used to obtain both the number of bytes requested and the number of integer instructions executed. The number of bytes requested on Power processors can be obtained using the Linux perf or PAPI utilities by measuring specific memory-related hardware events. However, it is quite problematic to count the number of integer operations on modern central processors, since both modern IBM and Intel processors lack hardware events for the number of executed integer or floating-point instructions. As a result, we had to calculate an approximate number of operations directly from the source code of the program. We understand that this is not a very reliable method, since for complex algorithms (such as Boruvka's algorithm) it can be very difficult to estimate the number of integer instructions executed; developing a more general method for creating roofline models is therefore one of the priority areas of our further research.
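For reference (our own restatement of the model, not an equation from the paper), the cache-aware roofline places each measured kernel against the bound

$$ P_{\text{attainable}}(I) \;=\; \min\bigl(P_{\text{peak}},\; I \cdot B_{\text{level}}\bigr), \qquad I = \frac{\text{number of integer operations}}{\text{bytes transferred at the given memory level}}, $$

where P_peak is the peak integer performance and B_level is the bandwidth of the corresponding memory level from Table 1. For example, a hypothetical kernel with I = 0.25 IOP/byte at the DRAM level of the V100 would be limited to about 0.25 × 898 GB/s ≈ 225 GIOP/s, far below the 3631 GIOP/s compute peak, i.e. it is memory-bound.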
Based on the data provided in Table 1, roofline models can be obtained for both the P100/V100 GPUs and the Power8/Power9 processors for three of the reviewed graph-processing problems (MST is currently excluded due to the difficulties described above). A sample cache-aware roofline model for the Pascal GPU architecture for several problems and various input graphs is shown in Fig. 2. The roofline indicates that the investigated implementations are in general memory-bound, being bottlenecked somewhere between the L1/shared memory and texture cache roofs, which corresponds to 40-90% of peak performance.

Fig. 2. An example of the cache-aware roofline model for SSSP and MST operations, P100 GPU

Data from the generated rooflines can be used to compare the achieved efficiency of various implementations and input graphs between the V100/P100 and Power8/Power9 target platforms. This comparison is presented in Fig. 3. It indicates that on average the efficiency remains relatively unchanged between the two reviewed platforms for the given class of algorithms, even in the absence of platform-specific optimisations. This leads us to the conclusion that many graph algorithms can be effectively ported from Minsky to Newell systems without significant effort spent on algorithm and code modifications.

Fig. 3. Efficiency comparison for various problems and input graphs

8 Conclusion

In this paper, a comparative analysis of the acceleration and the overall efficiency of several implementations of fundamental graph-processing algorithms has been provided for the IBM Minsky and Newell systems. On various types of input graphs, the developed implementations for the Newell system demonstrate up to 6 times acceleration compared to the previous generation of the GPU architecture (Pascal), and up to 12 times acceleration compared to the previous generation of the Power architecture (Power8).

In order to evaluate the efficiency of the developed implementations, the cache-aware roofline model has been utilised for a significant part of the algorithms and input data. Since the roofline efficiency remains essentially unchanged between the reviewed platforms, we can conclude that the transition from the Minsky to the Newell architecture does not require complex modifications of the source code (at least for the reviewed algorithms and implementations) and can be performed relatively easily for a variety of applications, while obtaining significant acceleration.

Future work includes a more detailed performance analysis using other performance models (which will allow us to cover hybrid implementations), resolving the described difficulties with the construction of CPU cache-aware roofline models, and extending the set of reviewed problems, algorithms and input data.

References

1. Nepomniaschaya, A. S. (2001, September). An associative version of the Bellman-Ford algorithm for finding the shortest paths in directed graphs. In International Conference on Parallel Computing Technologies (pp. 285-292). Springer, Berlin, Heidelberg.
2. Jeong, I. K., Uddin, J., Kang, M., Kim, C. H., Kim, J. M. (2014). Accelerating a Bellman-Ford Routing Algorithm Using GPU. In Frontier and Innovation in Future Computing and Communications (pp. 153-160). Springer, Dordrecht.
3. Fleischer, L. K., Hendrickson, B., Pınar, A. (2000, May). On identifying strongly connected components in parallel. In International Parallel and Distributed Processing Symposium (pp. 505-511). Springer, Berlin, Heidelberg.
4. Barnat, J., Bauch, P., Brim, L., Ceska, M. (2011, May). Computing strongly connected components in parallel on CUDA. In 2011 IEEE International Parallel & Distributed Processing Symposium (pp. 544-555). IEEE.
5. Beamer, S., Asanović, K., Patterson, D. (2013). Direction-optimizing breadth-first search. Scientific Programming, 21(3-4), 137-148.
6. Zou, D., Dou, Y., Wang, Q., Xu, J., Li, B. (2013, November). Direction-Optimizing Breadth-First Search on CPU-GPU Heterogeneous Platforms. In HPCC/EUC (pp. 1064-1069).
7. Cong, G., Bader, D. A. (2007, August). Techniques for designing efficient parallel graph algorithms for SMPs and multicore processors. In International Symposium on Parallel and Distributed Processing and Applications (pp. 137-147). Springer, Berlin, Heidelberg.
8. Vineet, V., Harish, P., Patidar, S., Narayanan, P. J. (2009, August). Fast minimum spanning tree for large graphs on the GPU. In Proceedings of the Conference on High Performance Graphics 2009 (pp. 167-171). ACM.
9. Ofenbeck, G., Steinmann, R., Caparros, V., Spampinato, D. G., Püschel, M. (2014). Applying the roofline model. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Monterey, CA (pp. 76-85). doi: 10.1109/ISPASS.2014.6844463
10. Ilic, A., Pratas, F., Sousa, L. (2014). Cache-aware Roofline model: Upgrading the loft. IEEE Computer Architecture Letters, 13(1), 21-24. doi: 10.1109/L-CA.2013.6
11. Empirical Roofline Tool, https://bitbucket.org/berkeleylab/cs-roofline-toolkit. Last accessed 4 Sep 2018
12. Konstantinidis, E., Cotronis, Y. (2016). A quantitative performance evaluation of fast on-chip memories of GPUs. In 24th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Heraklion, Crete, Greece (pp. 448-455). doi: 10.1109/PDP.2016.56
13. Stanford Large Network Dataset Collection - SNAP: Stanford, https://snap.stanford.edu/data/. Last accessed 3 Sep 2018
14. Chakrabarti, D., Zhan, Y., Faloutsos, C. (2004, April). R-MAT: A recursive model for graph mining. In Proceedings of the 2004 SIAM International Conference on Data Mining (pp. 442-446). Society for Industrial and Applied Mathematics.
15. Bader, D. A., Madduri, K. (2006). GTgraph: A synthetic graph generator suite. Atlanta, GA, February.