<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Comparative Performance Analysis of Data-intensive Applications for IBM Minsky and Newell Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilya Afanasyev</string-name>
          <email>ilya@icloud.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lomonosov Moscow State University</institution>
          ,
          <country>Russia</country>
        </aff>
      </contrib-group>
      <fpage>40</fpage>
      <lpage>49</lpage>
      <abstract>
        <p>The main goal of this work is to evaluate the performance differences of various data-intensive applications for IBM Minsky and Newell systems. As an important example of data-intensive applications, several fundamental graph algorithms have been selected. Following already existing implementation approaches, efficient implementations of the selected graph algorithms have been developed. The measured performance has been compared between two generations of IBM Power systems, demonstrating 1.5-4 times better performance on the Newell system. In addition, roofline models have been generated for both Power and NVidia implementations, which allowed us to compare the efficiency of the developed implementations on both systems. Based on the results demonstrated in the paper, we conclude that, starting from well-optimised code for the Minsky system, it is relatively easy to obtain significant acceleration with high efficiency on the Newell system, at least for the described class of graph algorithms.</p>
      </abstract>
      <kwd-group>
        <kwd>IBM Power</kwd>
        <kwd>NVidia GPU</kwd>
        <kwd>Data-intensive applications</kwd>
        <kwd>Roofline model</kwd>
        <kwd>Graph algorithms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>New generations of various architectures, systems and platforms appear almost
every six months in modern supercomputing. For many supercomputer users and
software developers, the question of whether it is worth switching to the newer
generation of the platform they are working with is extremely relevant. This issue
is no less relevant for the management of supercomputer centres, who are usually
responsible for hardware updates. Answering each of the following questions can be
very important: what acceleration can be achieved by using the newer generation
of the system? Is it possible to use the newer generation of the system without
significantly changing the program source code? How much effort must one spend
to reach the peak performance capabilities of the new platform?</p>
      <p>Thus, it is very important to study differences between various
implementations on various system generations. One group of data-intensive algorithms
is particularly important and challenging for any computational platform: even when
properly implemented, graph algorithms tend to stress the memory subsystem of the
target platform, demonstrating low performance, high latency and low data locality,
paired with poor cache usage. Graph algorithms represent data-intensive
applications extremely well, since they tend to perform an enormous amount of random
memory accesses, paired with low computational complexity. Moreover, graph
processing is extremely relevant nowadays, since it includes such important
real-world applications as web graph and social network processing.</p>
      <p>IBM Power systems provide an important example of modern
high-performance shared-memory platforms. IBM Power systems are developed in close
cooperation with the NVidia company, which leads to their native integration with
modern NVidia GPUs (for example, via NVLink technology). Consequently,
these systems demonstrate very high performance on various groups of
problems, including computational algebra and deep learning. However,
data-intensive applications are much less studied, despite the fact that IBM platforms
have all the necessary features to execute these applications efficiently. In this
paper we review the two latest generations of IBM Power systems: the
Power S822LC system with codename "Minsky" and the Power AC922 system with
codename "Newell". Using the developed efficient implementations of various
fundamental graph algorithms, we evaluate the performance and
efficiency differences between these two generations of Power systems.</p>
    </sec>
    <sec id="sec-2">
      <title>Target Platforms</title>
      <p>In this paper we review the two latest generations of IBM Power systems. The first
is the Power S822LC system (with codename "Minsky"), which contains two
Power8 CPUs with between eight and ten cores each (10 cores per socket in our
configuration). Each Power8 core is optimised to run up to 8 threads,
which results in 160 threads per system. The frequency of the 12-core
Power8 processor is 4.1 GHz, with a sustainable peak performance of
395 GFLOPs. The "Minsky" platform can be equipped with up to four NVidia Tesla P100
GPUs (2 in the configuration available to us). The P100 GPU is equipped with
3584 lightweight cores, each with a frequency of 1.1 GHz, resulting in
9.3 TFLOPs of single-precision performance. The P100 GPU is also equipped with
12 or 16 GB of device memory with 549 or 732 GB/s peak bandwidth, respectively.
The main feature of the Power8 processor is support for the NVLink bus technology,
which allows up to four GPU devices to be connected directly to the chip, leading to
2.8 times faster communication between host and device compared to traditional
PCI configurations.</p>
      <p>The second generation we review in this paper is the Power AC922 system (with
codename "Newell"), which contains two Power9 CPUs with between 12 and 24 cores
each (16 cores per socket on our system). Each core is optimised to run up to 4 threads,
which results in 128 threads per system. The frequency of the
16-core Power9 processor is 4 GHz, with a sustainable peak performance
of 592 GFLOPs. The platform may also be equipped with 2, 4 or 6 NVidia
Tesla V100 GPUs (4 GPUs in the configuration available to us).</p>
      <p>The V100 GPU of the Volta architecture delivers considerably more
performance: it is equipped with 5120 CUDA cores and 640 Tensor cores (special cores
optimised for deep-learning problems), which results in 15.7 TFLOPs of single-precision
and 7.8 TFLOPs of double-precision performance. Moreover, the Volta
architecture adds many new features compared to the Pascal architecture family:
it introduces a new combined L1 data cache and shared memory subsystem,
which significantly improves performance while also simplifying programming.
Another important improvement is the increased memory bandwidth: the
V100 provides 16 GB of HBM2 memory with 900 GB/s peak memory bandwidth
(1.5x the delivered memory bandwidth of the Pascal GP100, and greater than 95%
memory bandwidth efficiency on many workloads). The interconnect is also
improved and based on the second generation of NVLink (version 2.0), which now
supports CPU mastering and cache coherence capabilities with IBM Power9
CPU-based servers.</p>
      <p>In the memory subsystem, Power8 processors have 3 levels of cache memory:
64 KB of L1 data cache, 512 KB of L2 cache and 8 MB of L3 cache per chip.
Power9 processors have a smaller 32 KB L1 data cache, but a much larger 120 MB
L3 cache per chip.</p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>In order to comprehensively evaluate the performance differences between the
reviewed systems on a class of data-intensive applications, we discuss
three fundamentally different types of implementations: CPU-only, GPU-only
and hybrid. All these types of implementations are suitable for execution
on the target IBM Power systems in different circumstances. For example, when
processing large-scale graphs on GPUs in out-of-core mode (when the graph cannot
be fully placed in device memory), it can be more beneficial to utilise multiple
GPUs using unified memory technology, while for smaller graphs it can be more
efficient to use only a single GPU to avoid unnecessary communications. In this
paper we select 4 fundamental graph-processing problems and corresponding
algorithms, which allow those problems to be solved efficiently on both Power and
NVidia platforms. Where possible, hybrid implementations are developed.</p>
    </sec>
    <sec id="sec-4">
      <title>Problems, Algorithms and Input Data</title>
      <p>
        Thus, we review 4 fundamental graph-processing problems: the single source shortest
paths (SSSP), breadth-first search (BFS), strongly connected components (SCC)
and minimum spanning tree (MST) problems. The SSSP problem implies finding the
shortest paths in an undirected weighted graph from a selected source vertex
to all other vertices. The SCC problem implies partitioning a directed unweighted
graph into disjoint sets of vertices, each representing a strongly connected
component (a group of vertices where each vertex is reachable from every other). The
MST problem implies finding a subset of the edges that connects all the vertices
together, without any cycles and with the minimum possible total edge weight.
The BFS problem implies exploring the whole graph in the following manner:
visiting all of the neighbour nodes at the present depth prior to moving on to
the nodes at the next depth level. In order to solve these problems efficiently
on the target architectures, several fundamental algorithms have been selected. In
the next paragraph, we provide the list of these algorithms together with a brief
description of their distinctive features and possible existing approaches to creating
efficient implementations for both traditional CPUs and NVidia GPUs.
1. In order to solve the SSSP problem, the Bellman-Ford algorithm [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] is selected. From the
computational point of view, the Bellman-Ford algorithm requires both floating-point
and integer arithmetic together with frequent indirect memory
accesses (see the sketch after this list).
2. In order to solve the SCC problem, the Forward-Backward algorithm with a Trim
step [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] is selected. This algorithm requires only integer arithmetic with
frequent indirect memory accesses, and has a more complex nested
parallelism potential.
3. In order to solve the BFS problem, the Direction-Optimising algorithm is used [
        <xref ref-type="bibr" rid="ref5 ref6">5,
6</xref>
        ]. This algorithm has the smallest operation-per-byte ratio, with only
integer arithmetic required. It also has the smallest ratio of computational
complexity to the amount of bytes transferred through the NVLink bus, since it
has a linear-time complexity implementation.
4. Boruvka's algorithm is selected for solving the MST problem [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. This
algorithm requires the use of atomic operations, which can significantly hurt
performance on certain architectures.
      </p>
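      <p>To make the computational profile of the first algorithm concrete, below is a minimal sketch of an edge-list Bellman-Ford relaxation loop (illustrative C++/OpenMP code written for this discussion; the type and function names are ours, not the exact implementation evaluated in this paper):</p>
      <preformat><![CDATA[
// Minimal Bellman-Ford sketch over an edge list. Each sweep streams the
// edge list: floating-point compares plus indirect loads/stores into
// dist[] -- exactly the memory-bound pattern discussed above.
#include <vector>
#include <limits>

struct Edge { int src, dst; float weight; };

std::vector<float> bellman_ford(const std::vector<Edge> &edges,
                                int vertex_count, int source)
{
    std::vector<float> dist(vertex_count,
                            std::numeric_limits<float>::infinity());
    dist[source] = 0.0f;

    for (int sweep = 0; sweep < vertex_count - 1; ++sweep) {
        bool changed = false;
        #pragma omp parallel for schedule(guided, 1024) reduction(||:changed)
        for (long long e = 0; e < (long long)edges.size(); ++e) {
            float candidate = dist[edges[e].src] + edges[e].weight;
            if (candidate < dist[edges[e].dst]) {
                // Simplification: a production version would relax with
                // an atomic minimum to avoid the concurrent-write race.
                dist[edges[e].dst] = candidate;
                changed = true;
            }
        }
        if (!changed) break;   // converged early
    }
    return dist;
}
]]></preformat>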
      <p>
        According to the described implementation approaches, for each of the
selected algorithms an efficient implementation has been developed for both Power
and NVidia architectures, with a hybrid generalisation where it is necessary and
possible (for large-scale graph processing). We also used a variety of both
synthetic and real-world graphs as the input data set. For synthetic graphs, RMAT [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
and SSCA-2 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] graphs have been used. For real-world graphs, we used various
road, social network and web graphs, all available in the following dataset [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Implementations Details</title>
      <p>In order to compare the performance results between the two generations of IBM
Power Systems, we first had to develop efficient implementations for the Power8 &amp;
Pascal architectures. To achieve efficient parallelisation and vectorisation inside
a single Power socket, several OpenMP directives have been inserted. The #pragma omp
parallel for OpenMP directive has been used with the schedule(guided,1024) OpenMP
clause, in order to reduce the overhead of synchronisations between threads.
Moreover, for the Power implementations, vectorisation has been used, which was achieved
with the -O3 -qsimd=auto -qhot -qarch=pwr8 -qtune=pwr8 -qpdf2 compiler flags.
For the Volta and Pascal implementations, various optimisations, including texture cache
usage, have also been applied.</p>
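      <p>As an illustration of this parallelisation pattern, a typical gather-style graph kernel is sketched below (our own simplified code, assuming a CSR-like layout; the guided schedule with a 1024-iteration chunk amortises scheduling synchronisation over the irregular, degree-dependent inner loop):</p>
      <preformat><![CDATA[
// Build (IBM XL, assumed invocation; -qsmp=omp enables OpenMP on top of
// the optimisation flags listed above):
//   xlc++_r -O3 -qsimd=auto -qhot -qarch=pwr8 -qtune=pwr8 -qsmp=omp kernel.cpp

// Gather one value per incident edge of every vertex (CSR layout).
void gather(const long long *offsets, const int *adjacent,
            const float *data, float *result, int vertex_count)
{
    #pragma omp parallel for schedule(guided, 1024)
    for (int v = 0; v < vertex_count; ++v) {
        float sum = 0.0f;
        // Indirect (gather) accesses: the dominant memory pattern
        // of the studied graph kernels.
        for (long long e = offsets[v]; e < offsets[v + 1]; ++e)
            sum += data[adjacent[e]];
        result[v] = sum;
    }
}
]]></preformat>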
      <p>Moreover, for both Power and NVidia architectures the following sorting
strategy has been used in order to improve data locality: the input graph edges have
been sorted in such a way that edges stored in adjacent memory cells point
to adjacent cells of the reachability (or distances) arrays. This reordering makes
gather and scatter vector operations much more efficient for adjacent edges
(since information is gathered from adjacent memory cells of the distances or
reachability array). It is also possible to remove loops and multiple arcs during
this pre-processing stage. The proposed optimisation of storage formats provided
up to an almost 10x performance improvement compared to the implementation
with randomly ordered edges of the input graph.</p>
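      <p>A sketch of this pre-processing step is shown below (reusing the illustrative Edge structure from the SSSP sketch above; the exact sorting key of the evaluated code may differ):</p>
      <preformat><![CDATA[
// Reorder edges so that edges adjacent in memory point to adjacent cells
// of the distances/reachability array, then drop loops and multiple arcs.
#include <algorithm>
#include <vector>

void preprocess(std::vector<Edge> &edges)
{
    // Self-loops can be dropped during the same pass.
    edges.erase(std::remove_if(edges.begin(), edges.end(),
                [](const Edge &e) { return e.src == e.dst; }),
                edges.end());

    // After this sort, consecutive edges gather from consecutive (or
    // nearby) elements of the distances array, so vector gather/scatter
    // operations touch far fewer cache lines per instruction.
    std::sort(edges.begin(), edges.end(),
              [](const Edge &a, const Edge &b)
              { return a.dst != b.dst ? a.dst < b.dst : a.src < b.src; });

    // Multiple arcs are now adjacent and can be removed in one pass.
    edges.erase(std::unique(edges.begin(), edges.end(),
                [](const Edge &a, const Edge &b)
                { return a.src == b.src && a.dst == b.dst; }),
                edges.end());
}
]]></preformat>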
      <p>As a result of the described optimisations, the developed implementations
demonstrated high performance on the Power8 &amp; Pascal architectures. In order
to optimise the developed implementations for the Power9 &amp; Volta architectures,
only minor changes had to be made: most importantly, the graph edge reordering had
to be adjusted in accordance with the cache sizes of the new platforms.</p>
    </sec>
    <sec id="sec-6">
      <title>Performance Evaluation</title>
      <p>The acceleration achieved between the Minsky and Newell systems is demonstrated
in Fig. 1. The acceleration is calculated as the ratio of execution times on the
same input graphs, measured in three different modes: Power9 to Power8
acceleration, and V100 to P100 acceleration with and without data copy time included
in the time measurement. When data copy time is not included, the input graphs
are assumed to be already placed inside device memory. When data copy
time is included, the input graph has to be copied from host to device
memory through NVLink, and the time required for these copies is included in the
measured execution time.</p>
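      <p>The two GPU timing modes can be sketched as follows (CUDA runtime API host code; the buffer names and the commented-out kernel launch are illustrative placeholders, not the paper's code):</p>
      <preformat><![CDATA[
#include <cuda_runtime.h>

// Measure one run either with or without the host-to-device copy
// (performed over NVLink) included in the timed region.
float run_ms(bool include_copy, const void *host_graph, void *dev_graph,
             size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    if (!include_copy)  // "graph already resides in device memory" mode
        cudaMemcpy(dev_graph, host_graph, bytes, cudaMemcpyHostToDevice);

    cudaEventRecord(start);
    if (include_copy)   // copy time counted in the measurement
        cudaMemcpy(dev_graph, host_graph, bytes, cudaMemcpyHostToDevice);
    // graph_kernel<<<blocks, threads>>>(dev_graph);  // placeholder launch
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
]]></preformat>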
      <p>From Fig. 1 it is possible to highlight several trends.
1. Power9 implementations demonstrate higher performance than Power8 ones
on small and medium graphs due to the larger L3 cache, when it
is possible to fully place the arrays with indirect accesses into the L3 cache.
2. Hybrid implementations for large graphs (L-RMAT) demonstrate better
performance on the Newell system due to the higher bandwidth of NVLink 2.0 used as
the interconnect.
3. On very small graphs (for example, small-scale RMAT graphs with a size
of less than 50 MB) V100 implementations demonstrate even lower performance
than P100 implementations, due to the lack of a sufficient amount of
parallelism.</p>
      <p>
        Efficiency evaluation. For the efficiency analysis of the developed implementations, the roofline model
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is used. The roofline model is a visual performance model used to provide
performance estimates of given compute-bound or memory-bound kernels or
applications running on both multi-core and accelerator architectures, by showing
hardware limitations based on peak performance and peak memory bandwidth.
The roofline model provides insights into the system performance capabilities,
shows inherent hardware limitations, and indicates the potential benefit and
priority of optimisations of target applications. In this work the roofline model
is used to evaluate the efficiency of both CPU-only and GPU-only
implementations relative to the peak performance capabilities of the target system.
      </p>
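      <p>In its standard form (generic roofline notation, not specific to this paper), the bound on attainable performance for a kernel with operational intensity I (operations per byte of memory traffic) is
        <disp-formula>
          <tex-math notation="LaTeX">P(I) = \min\left(P_{\mathrm{peak}},\; I \cdot B_{\mathrm{peak}}\right)</tex-math>
        </disp-formula>
        so kernels whose intensity lies to the left of the ridge point P_peak / B_peak are memory-bound, which is the regime the graph kernels studied here fall into.</p>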
      <p>
        There is also an extension of the roofline model called the cache-aware roofline model
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a novel approach that provides more insightful performance modelling of
modern architectures by introducing cache-awareness, thus significantly
improving the guidelines for application optimisation.
      </p>
      <p>
        In order to create a cache-aware roofline model for both Power and NVidia
architectures, peak integer performance, peak memory bandwidth and the peak
bandwidths of all cache levels have to be benchmarked. The peak metrics for
Power systems can be obtained using the [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] benchmark, while the peak metrics for
NVidia architectures can be obtained using [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The collected benchmark
results are shown in Table 1.
In addition to benchmarking peak performance and bandwidths, for each
application or kernel of interest the number of integer operations and the
operation-per-byte ratio have to be collected. To obtain these values on the NVidia
platform, nvprof is used for counting both the bytes requested and the integer
instructions executed. The amount of bytes requested for Power processors can
be obtained using the Linux Perf or PAPI utilities, by measuring specific
memory-related hardware events. However, it is quite problematic to count
the amount of integer operations on modern central processors, since both
modern IBM and Intel processors lack hardware events responsible for the number
of executed integer or floating-point operations. As a result, we had to directly
estimate an approximate amount of operations from the source code of the program. We
understand that this is not a very reliable method, since in complex algorithms (such
as Boruvka's algorithm) it can be very problematic to estimate the number of
integer instructions executed. Thus, the development of a more general method
for creating roofline models is one of the priority areas for our further research.
      </p>
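      <p>Turning the collected counters into roofline coordinates is then straightforward; a small sketch is given below (the inputs are assumed to come from nvprof, Perf/PAPI or source-code counting as described above):</p>
      <preformat><![CDATA[
#include <algorithm>

// One application/kernel measurement placed on the roofline.
struct RooflinePoint {
    double intensity;   // operations per byte
    double achieved;    // measured performance, GOPS
    double bound;       // roofline ceiling at this intensity, GOPS
    double efficiency;  // achieved / bound -- the metric compared in Fig. 3
};

RooflinePoint roofline_point(double ops, double bytes, double seconds,
                             double peak_gops, double peak_gb_per_s)
{
    RooflinePoint p;
    p.intensity  = ops / bytes;
    p.achieved   = ops / seconds / 1e9;                       // measured GOPS
    p.bound      = std::min(peak_gops, p.intensity * peak_gb_per_s);
    p.efficiency = p.achieved / p.bound;
    return p;
}
]]></preformat>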
      <p>Based on the data provided in Table 1, roofline models can be obtained for both
P100/V100 GPUs and P8/P9 processors for 3 of the reviewed graph-processing
problems (the MST problem is currently excluded due to the difficulties described
above). The obtained sample roofline models for the Pascal GPU architecture for
several problems and various input graphs are demonstrated in Fig. 2. The provided
rooflines indicate that the investigated implementations are in general
memory-bound, bottlenecked somewhere between the L1/shared memory and
texture cache roofs, which corresponds to 40-90% of peak performance.</p>
      <p>Data from the generated rooflines can be used to compare the achieved
efficiency of various implementations and input graphs between the V100/P100 and
Power8/Power9 target platforms. This comparison is presented in Fig. 3. Fig. 3
indicates that on average the efficiency between the two reviewed platforms
remains relatively unchanged for the given class of algorithms, even in the absence of
platform-specific optimisations. This leads us to the conclusion that many graph
algorithms can be effectively ported from Minsky to Newell systems without
significant effort being required for algorithm and code modifications.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this paper, a comparative analysis of the acceleration and the overall
efficiency of several implementations of fundamental graph-processing algorithms
has been provided for the IBM Minsky and Newell systems. On various types of
input graphs, the implementations developed for the Newell system demonstrate up
to 6 times acceleration compared to the previous generation of GPU architecture
(Pascal), and up to 12x acceleration compared to the previous generation of Power
architecture (P8).</p>
      <p>In order to evaluate the efficiency of the developed implementations, the
cache-aware roofline model has been utilised for a significant part of the algorithms
and input data. Analysing the absence of changes in roofline efficiency between
the reviewed platforms, it is possible to conclude that the transition from
the Minsky to the Newell architecture does not require complex modifications of the source
code (at least for the reviewed algorithms and implementations), and can be
performed relatively easily for a variety of applications, while obtaining significant
acceleration.</p>
      <p>Future work includes a more detailed performance analysis using other
performance models (which will allow hybrid implementations to be covered), finding a
solution to the described difficulties with the construction of the CPU cache-aware
roofline model, and extending the sets of reviewed problems, algorithms and
input data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Nepomniaschaya</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          (
          <year>2001</year>
          , September).
          <article-title>An associative version of the Bellman-Ford algorithm for finding the shortest paths in directed graphs</article-title>
          .
          <source>In International Conference on Parallel Computing Technologies</source>
          (pp.
          <fpage>285</fpage>
          -
          <lpage>292</lpage>
          ). Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Jeong</surname>
            ,
            <given-names>I. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uddin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>C. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Accelerating a Bellman-Ford Routing Algorithm Using GPU</article-title>
          .
          <source>In Frontier and Innovation in Future Computing and Communications</source>
          (pp.
          <fpage>153</fpage>
          -
          <lpage>160</lpage>
          ). Springer, Dordrecht.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fleischer</surname>
            ,
            <given-names>L. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendrickson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pınar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2000</year>
          , May).
          <article-title>On identifying strongly connected components in parallel</article-title>
          .
          <source>In International Parallel and Distributed Processing Symposium</source>
          (pp.
          <fpage>505</fpage>
          -
          <lpage>511</lpage>
          ). Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Barnat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauch</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brim</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceska</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2011</year>
          , May).
          <article-title>Computing strongly connected components in parallel on CUDA</article-title>
          .
          <source>In 2011 IEEE International Parallel &amp; Distributed Processing Symposium</source>
          (pp.
          <fpage>544</fpage>
          -
          <lpage>555</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Beamer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asanović</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patterson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Direction-optimizing breadth-first search</article-title>
          .
          <source>Scientific Programming</source>
          ,
          <volume>21</volume>
          (
          <issue>3-4</issue>
          ),
          <fpage>137</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dou</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2013</year>
          , November).
          <article-title>Direction-Optimizing Breadth-First Search on CPU-GPU Heterogeneous Platforms</article-title>
          .
          <source>In HPCC/EUC</source>
          (pp.
          <fpage>1064</fpage>
          -
          <lpage>1069</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cong</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bader</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          (
          <year>2007</year>
          , August).
          <article-title>Techniques for designing efficient parallel graph algorithms for SMPs and multicore processors</article-title>
          .
          <source>In International Symposium on Parallel and Distributed Processing and Applications</source>
          (pp.
          <fpage>137</fpage>
          -
          <lpage>147</lpage>
          ). Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Vineet</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harish</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patidar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narayanan</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          (
          <year>2009</year>
          , August).
          <article-title>Fast minimum spanning tree for large graphs on the GPU</article-title>
          .
          <source>In Proceedings of the Conference on High Performance Graphics</source>
          <year>2009</year>
          (pp.
          <fpage>167</fpage>
          -
          <lpage>171</lpage>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>G.</given-names>
            <surname>Ofenbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Steinmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Caparros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Spampinato</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Püschel</surname>
          </string-name>
          ,
          <article-title>"Applying the roofline model,"</article-title>
          <source>2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)</source>
          , Monterey, CA,
          <year>2014</year>
          , pp.
          <fpage>76</fpage>
          -
          <lpage>85</lpage>
          . doi: 10.1109/ISPASS.2014.6844463
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A.</given-names>
            <surname>Ilic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pratas</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Sousa</surname>
          </string-name>
          ,
          <article-title>"Cache-aware Roo ine model: Upgrading the loft," in IEEE Computer Architecture Letters</article-title>
          , vol.
          <volume>13</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>24</lpage>
          , Jan.-June 2014. doi: 10.1109/L-CA.2013.6
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Empirical Roofline Tool, https://bitbucket.org/berkeleylab/cs-roofline-toolkit.
          <source>Last accessed 4 Sep 2018</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Konstantinidis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cotronis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <article-title>"A quantitative performance evaluation of fast on-chip memories of GPUs"</article-title>
          ,
          <source>24th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)</source>
          , Heraklion, Crete, Greece, pp.
          <fpage>448</fpage>
          -
          <lpage>455</lpage>
          ,
          <year>2016</year>
          . doi: 10.1109/PDP.2016.56
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Stanford Large Network Dataset Collection - SNAP: Stanford, https://snap.stanford.edu/data/.
          <source>Last accessed 3 Sep 2018</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faloutsos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2004</year>
          , April).
          <article-title>R-MAT: A recursive model for graph mining</article-title>
          .
          <source>In Proceedings of the 2004 SIAM International Conference on Data Mining</source>
          (pp.
          <fpage>442</fpage>
          -
          <lpage>446</lpage>
          ).
          <source>Society for Industrial and Applied Mathematics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Bader</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madduri</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Gtgraph: A synthetic graph generator suite</article-title>
          . Atlanta, GA, February.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>