<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Methods for High Performance Graph Processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mikhail Chernoskutov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Krasovskii Institute of Mathematics and Mechanics, Ural Federal University</institution>
          ,
          <addr-line>Yekaterinburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>33</fpage>
      <lpage>40</lpage>
      <abstract>
        <p>This paper describes two methods for accelerating the processing of a large graph distributed in the memory of multiple nodes. The first substantially reduces the overheads connected with data transfer between different nodes. The second reduces workload imbalance amongst computational threads. Both methods are integrated into a breadth-first search algorithm and more than triple its performance.</p>
      </abstract>
      <kwd-group>
        <kwd>data intensive applications</kwd>
        <kwd>graph processing</kwd>
        <kwd>parallel algorithms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Graph algorithms are used in various fields of science and engineering.
In many cases, big graphs can be processed in parallel by computational
systems with multiple nodes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, parallelization efficiency is
impaired by intensive memory access and by the fact that the exact location of
data on a distributed-memory system is, in general, not known in advance. These
obstacles turn graph algorithms into typical data-intensive applications [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>An intensive memory-access pattern is a crucial bottleneck in implementations of
parallel graph algorithms because (in many cases) such algorithms perform only a small
number of computational operations. On the other hand, graph algorithms tend to
perform many accesses to small pieces of data scattered across different parts of
memory. Thus, they are less demanding of CPU capabilities but require high
efficiency and bandwidth from the data transfer bus.</p>
      <p>The fact that the distribution of data is not known in advance dramatically
complicates efficient implementation of multi-node parallel implementations of
graph algorithms. In the general case, the program might have to search for the
data on some particular vertex across all the computational nodes.</p>
      <p>
        The object of this research is a parallel breadth-first search algorithm on
graphs with small diameter (generally, no more than 10) and skewed degree
distribution. Such graphs arise, for instance, in social network analysis and in various
mathematical and physical simulation applications [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], etc. The main feature of
these graphs is a relatively small number of vertices with very high degrees and a
large number of vertices with small degrees (only a few incident edges).
      </p>
    </sec>
    <sec id="sec-2">
      <title>Parallel Breadth-First Search</title>
      <p>Breadth-first search may be parallelized with the aid of level-synchronous
algorithms. These algorithms process each level (or iteration) separately and
independently. This means that (in the case of breadth-first search)
processing of level l + 1 begins only after processing of level l has
finished, while each level (i.e., all vertices and edges of the same level)
may be processed in parallel.</p>
      <p>
        Presently, the two most common types of level-synchronous
breadth-first search algorithms are:
– direct traversal (the top-down approach);
– inverse traversal (the bottom-up approach [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]).
      </p>
      <p>The direct graph traversal is the standard version of the breadth-first search
algorithm. It assumes that the vertices that are active on the current
iteration mark all their neighbors. The pseudocode of a parallel version of
breadth-first search with the top-down traversal direction is presented on fig. 1.</p>
      <p>In lines 1–4, the array of distances from the source vertex is initialized. The
initialization is followed by the main loop, which repeats while there exist unmarked
vertices in the graph. First, in lines 6–14, the vertices stored at the current node
are marked. Then, in lines 16 and 17, messages to other nodes are sent and
received (point-to-point); these messages contain the data on vertices that must
be marked on those nodes. Finally, in lines 19–22, the vertices whose data
was received from other nodes are marked. At the end of each iteration, the
level counter in line 23 is incremented by one and then, in line 24, the algorithm
checks whether there are still unmarked vertices.</p>
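      <p>For illustration, the top-down marking procedure of fig. 1 can be sketched as a single-node Python program (a simplification of ours: without MPI, the send/receive steps in lines 16 and 17 collapse into local marking, and the frontier is kept explicitly):</p>

```python
def bfs_top_down(adj, source):
    """Level-synchronous top-down BFS: vertices of the current level
    mark all of their unvisited neighbors (cf. lines 6-14 of fig. 1)."""
    dist = {u: -1 for u in adj}      # lines 1-2: every vertex is unmarked
    pred = {u: None for u in adj}
    dist[source] = 0                 # line 3
    level = 0                        # line 4
    frontier = [source]
    while frontier:                  # line 24: stop when nothing was marked
        next_frontier = []
        for vert in frontier:        # this loop is parallel in the paper
            for neighb in adj[vert]:
                if dist[neighb] == -1:           # lines 10-12
                    dist[neighb] = level + 1
                    pred[neighb] = vert
                    next_frontier.append(neighb)
        frontier = next_frontier
        level += 1                   # line 23
    return dist, pred

# usage on a small graph with a skewed degree distribution
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
dist, pred = bfs_top_down(adj, 0)
# dist -> {0: 0, 1: 1, 2: 1, 3: 1, 4: 2}
```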
      <p>The bottom-up implementation of the breadth-first search algorithm assumes
that inactive vertices look for active vertices amongst their neighbors.
If such a neighbor is found on the current iteration, the vertex
marks itself. The pseudocode of parallel breadth-first search with inverse
traversal is presented on fig. 2.</p>
      <p>Like in the previous pseudocode, in lines 1–4, the breadth-first search is
initialized. In lines 18 and 19, the level number is updated and the termination
condition is tested. However, there is a big difference in the vertex
marking procedure: in the case of bottom-up traversal, it is crucial to know the
current state of all active vertices in the graph. It is most convenient to do
this by means of a bitmap whose length equals the number of vertices in
the graph. Therefore, at each iteration, the data is synchronized by
updating the bitmap through collective MPI communications (lines 15 and 16).
Fig. 1. Parallel top-down breadth-first search:
1  for each u in dist
2      dist[u] := -1
3  dist[s] := 0
4  level := 0
5  do
6    parallel for each vert in V.thisnode
7      if dist[vert] = level
8        for each neighb in vert.neighbors
9          if neighb in V.thisnode
10           if dist[neighb] = -1
11             dist[neighb] := level + 1
12             pred[neighb] := vert
13         else
14           vert_batch_to_send.push(neighb)
15
16   send(vert_batch_to_send)
17   receive(vert_batch_to_receive)
18
19   parallel for each vert in vert_batch_to_receive
20     if dist[vert] = -1
21       dist[vert] := level + 1
22       pred[vert] := vert.pred
23   level++
24 while (!check_end())
Fig. 2. Marking procedure of parallel bottom-up breadth-first search:
parallel for each vert in V.thisnode
  if dist[vert] = -1
    for each neighb in vert.neighbors
      if bitmap_current.neighb = 1
        dist[vert] := level + 1
        pred[vert] := neighb
        bitmap_next.vert := 1
        break
allgather(bitmap_next)
swap(bitmap_current, bitmap_next)</p>
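      <p>A single bottom-up iteration of fig. 2 can likewise be sketched in Python (a single-node sketch of ours; the allgather and swap of the bitmaps are omitted, and naming follows the pseudocode):</p>

```python
def bfs_bottom_up_step(adj, dist, pred, bitmap_current, level):
    """One bottom-up iteration (cf. fig. 2): every unmarked vertex scans
    its neighbors for a member of the current frontier bitmap and, if one
    is found, marks itself."""
    bitmap_next = [0] * len(adj)
    for vert in adj:                 # parallel in the paper
        if dist[vert] == -1:
            for neighb in adj[vert]:
                if bitmap_current[neighb] == 1:
                    dist[vert] = level + 1
                    pred[vert] = neighb
                    bitmap_next[vert] = 1
                    break            # one frontier neighbor is enough
    # on a cluster, bitmap_next would now be allgather-ed and swapped in
    return bitmap_next

# usage: a triangle with vertex 0 as the only active vertex
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
dist = {0: 0, 1: -1, 2: -1}
pred = {0: None, 1: None, 2: None}
bitmap = bfs_bottom_up_step(adj, dist, pred, [1, 0, 0], 0)
# dist -> {0: 0, 1: 1, 2: 1}; bitmap -> [0, 1, 1]
```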
    </sec>
    <sec id="sec-3">
      <title>Performance Engineering of Parallel Breadth-First Search</title>
      <p>To increase the performance of parallel breadth-first search algorithms, we
suggest the following two methods:
– hybrid graph traversal;
– workload distribution across computational threads.</p>
      <sec id="sec-3-1">
        <title>Hybrid Traversal</title>
        <p>This method combines different types of the parallel breadth-first search
algorithm on different iterations. In particular, the top-down approach is
characterized by a large computational workload and data transfer overheads on
the middle iterations, while its first and last iterations execute faster and
involve almost no data transfers. The situation
is different with the bottom-up approach due to its use of collective MPI data
transfer operations: its data transfer overheads are almost the same on
every iteration. However, marking the vertices on the first iterations takes much
more time than on the middle or later iterations.</p>
        <p>Taking into account the fact that the data produced on each iteration is the
same regardless of traversal direction, we suggest combining different types of
graph traversal (choosing the one with the smallest data transfer overheads on each
iteration) to achieve maximal performance. In this study, we propose the following scheme:
– top-down on the first two iterations;
– bottom-up on the next three iterations;
– top-down on all other iterations.</p>
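        <p>Assuming iterations are numbered from zero, the proposed schedule can be expressed as a trivial helper (the function name is ours):</p>

```python
def traversal_direction(iteration):
    """Direction schedule proposed above: top-down on the first two
    iterations, bottom-up on the next three, top-down afterwards."""
    if iteration in (2, 3, 4):
        return "bottom-up"
    return "top-down"

print([traversal_direction(i) for i in range(7)])
# -> ['top-down', 'top-down', 'bottom-up', 'bottom-up', 'bottom-up',
#     'top-down', 'top-down']
```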
      </sec>
      <sec id="sec-3-2">
        <title>Workload Distribution</title>
        <p>When processing graphs with skewed degree distribution, it is not known in
advance (in the general case) which vertices will be processed by a particular
computational thread; the only thing that may be known in advance is, for
instance, the total number of vertices that have to be processed by the current
thread. However, the total workload is determined not by the number of active
vertices but by the number of edges incident to them. This leads to workload
imbalance amongst threads and large overheads during the runtime of a parallel
level-synchronous algorithm.</p>
        <p>To avoid workload imbalance, we suggest a transition from “looking” through
a vertex array to “looking” through an array of edges. For this purpose, we
logically divide the edge array into equal pieces holding max_edges elements.
Each thread determines the corresponding vertex for all edges in every part of
the edge array by using the part_column array, which contains the numbers of the
vertices incident to the first edge in the corresponding part of the edge array.
The pseudocode for parallel filling of the part_column array is presented on fig. 3:
1  parallel for each i in V.thisnode
2    first := V.thisnode[i]
3    last := V.thisnode[i+1]
4    index := round_up(first / max_edges)
5    current := index * max_edges
6    while (current &lt; last)
7      part_column[index] := i
8      current := current + max_edges
9      index++</p>
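        <p>The procedure of fig. 3 can be sketched in Python for a graph in CSR form (our naming: `offsets` plays the role of V.thisnode, i.e., the row-offset array of the adjacency structure):</p>

```python
def fill_part_column(offsets, max_edges):
    """Fill part_column (cf. fig. 3): entry j holds the vertex whose
    adjacency list contains the first edge of chunk j, where the edge
    array is cut into chunks of max_edges elements."""
    n_edges = offsets[-1]
    n_chunks = (n_edges + max_edges - 1) // max_edges
    part_column = [0] * n_chunks
    for i in range(len(offsets) - 1):              # parallel over vertices
        first, last = offsets[i], offsets[i + 1]
        index = (first + max_edges - 1) // max_edges   # round_up
        current = index * max_edges
        while last > current:        # every chunk starting inside vertex i
            part_column[index] = i
            current += max_edges
            index += 1
    return part_column

# usage: vertices with 3, 1 and 4 edges, chunks of 4 edges
print(fill_part_column([0, 3, 4, 8], 4))
# -> [0, 2]  (chunk 0 starts in vertex 0, chunk 1 starts in vertex 2)
```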
        <p>Pseudocode of a new version of the breadth-first search algorithm that uses the
part_column array is presented on fig. 4 (the top-down approach) and fig. 5 (the
bottom-up approach).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Benchmarking</title>
      <p>
        Both of the aforementioned methods were incorporated into a custom
implementation of the Graph500 benchmark [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The kernel of this benchmark performs a
parallel breadth-first search on a big graph distributed in the memory of multiple
nodes. The size of the graph is determined by the scale parameter, which is the
base-2 logarithm of the number of vertices in the graph. The average degree of
the vertices is 16. The main performance metric of this benchmark is the number
of edges traversed per second (TEPS).
      </p>
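      <p>These parameters translate into concrete problem sizes as follows (a small helper of our own; the edge count is taken as 16 times the vertex count, per the average degree stated above):</p>

```python
def graph500_size(scale, avg_degree=16):
    """Problem size for a given scale: 2**scale vertices and
    avg_degree * 2**scale edges."""
    n_vertices = 2 ** scale
    n_edges = avg_degree * n_vertices
    return n_vertices, n_edges

def teps(edges_traversed, seconds):
    """The TEPS metric: traversed edges per second."""
    return edges_traversed / seconds

print(graph500_size(20))          # -> (1048576, 16777216)
print(teps(16777216, 2.0))        # -> 8388608.0
```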
      <p>We developed a custom implementation that uses MPI (one process per
computational node) for sending and receiving messages across all nodes and
OpenMP (eight threads per node) to deal with shared memory within each
node.</p>
      <p>Benchmarking was carried out for graphs of different sizes on 1-, 2-, 4-,
and 8-node configurations of the Uran supercomputer (located at the Krasovskii
Institute of Mathematics and Mechanics). Each node had an Intel Xeon X5675 CPU
and 46 GB of DRAM. The performance of the custom implementation was compared
with the following reference implementations provided by Graph500:
– simple (represents top-down breadth-first search);
– replicated (represents bottom-up breadth-first search).</p>
      <p>Benchmarking results are presented on fig. 6. As seen on the figure, our
custom implementation substantially outperforms the simple and replicated
implementations. In addition, in the case of 8 nodes, it is clearly seen that the
custom implementation scales much better than its counterparts (the scalability of
all implementations is presented on fig. 7).
Fig. 4. Top-down breadth-first search over equal-sized edge chunks:
1  // preparation ...
2  parallel for each i in part_column
3    first_edge := i * max_edges
4    last_edge := (i+1) * max_edges
5    curr_vert := part_column[i]
6    for each edge in [first_edge; last_edge)
7      if neighbors of curr_vert in [first_edge; last_edge)
8        if dist[curr_vert] = level
9          for each k in neighbors of curr_vert
10           if dist[k] = -1
11             dist[k] := level + 1
12             pred[k] := curr_vert
13      curr_vert++
14 // data synchronization ...
[Fig. 6, panel “2 nodes”: performance rate (MTEPS) vs. scale (20–25) for the custom, replicated, and simple implementations.]</p>
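      <p>A Python sketch of the edge-chunk traversal of fig. 4 (for a CSR graph given by `offsets`/`targets`, names ours; the test “neighbors of curr_vert in [first_edge; last_edge)” is rendered here as advancing curr_vert whenever an edge lies past the end of its adjacency list, which is one possible reading of the pseudocode):</p>

```python
def top_down_step_by_chunks(offsets, targets, part_column, max_edges,
                            dist, pred, level):
    """One top-down iteration over equal-sized edge chunks (cf. fig. 4):
    each thread starts from the vertex recorded in part_column, so every
    chunk carries the same number of edges regardless of vertex degrees."""
    for chunk, start_vert in enumerate(part_column):   # parallel in the paper
        first_edge = chunk * max_edges
        last_edge = min((chunk + 1) * max_edges, len(targets))
        curr_vert = start_vert
        for edge in range(first_edge, last_edge):
            # advance to the vertex that owns this edge
            while edge >= offsets[curr_vert + 1]:
                curr_vert += 1
            if dist[curr_vert] == level:
                k = targets[edge]
                if dist[k] == -1:
                    dist[k] = level + 1
                    pred[k] = curr_vert

# usage: same CSR graph as in the fig. 3 sketch, one step from vertex 0
dist = [0, -1, -1, -1]
pred = [None] * 4
top_down_step_by_chunks([0, 3, 4, 8], [1, 2, 3, 0, 0, 1, 3, 3],
                        [0, 2], 4, dist, pred, 0)
print(dist)                        # -> [0, 1, 1, 1]
```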
      <p>[Fig. 6 (continued), panels “4 nodes” and “8 nodes”: performance rate (MTEPS) vs. scale (20–25) for the custom, replicated, and simple implementations. Fig. 7: performance rate (MTEPS) vs. number of nodes (1–8) for all three implementations.]</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Attempts at efficient parallelization of the breadth-first search algorithm on
graphs with skewed degree distribution are hampered by workload imbalance amongst
computational threads and by large amounts of data transfer on a few select
iterations of the algorithm. This forms a bottleneck that makes it much
harder to produce a high-performance implementation of this algorithm.</p>
      <p>In this paper, we suggest two methods, workload balancing and
traversal hybridization, which more than triple the performance of the
parallel level-synchronous breadth-first search in
comparison with the reference top-down and bottom-up implementations.</p>
      <p>In our future work, we intend to focus on studying the scalability of the
suggested algorithm and on testing it on graphs obtained from real-world
applications. Another important task is to modify our custom implementation to use
computational accelerators to further improve its performance.</p>
      <p>Acknowledgments. The reported study was partially supported by RFBR,
research project No. 14-07-00435. The experiment was performed on the Uran
supercomputer.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Mark E. J.</given-names>
            <surname>Newman</surname>
          </string-name>
          <article-title>The structure and function of complex networks</article-title>
          .
          <source>SIAM review</source>
          ,
          <volume>45</volume>
          (
          <issue>2</issue>
          ):
          <fpage>167</fpage>
          –
          <lpage>256</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Lumsdaine</surname>
          </string-name>
          , Douglas Gregor, Bruce Hendrickson, and
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Berry</surname>
          </string-name>
          .
          <article-title>Challenges in parallel graph processing</article-title>
          .
          <source>Parallel Processing Letters</source>
          ,
          <volume>17</volume>
          (
          <issue>01</issue>
          ):
          <fpage>5</fpage>
          –
          <lpage>20</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Bruce</given-names>
            <surname>Hendrickson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jonathan W.</given-names>
            <surname>Berry</surname>
          </string-name>
          .
          <article-title>Graph analysis with high-performance computing</article-title>
          .
          <source>Computing in Science and Engg.</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ):
          <fpage>14</fpage>
          –
          <lpage>19</lpage>
          ,
          March
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Scott</given-names>
            <surname>Beamer</surname>
          </string-name>
          , Krste Asanovic, and David Patterson.
          <article-title>Direction-optimizing breadth-first search</article-title>
          .
          <source>In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 12</source>
          , pages
          <fpage>12:1</fpage>
          –
          <lpage>12:10</lpage>
          , Los Alamitos, CA, USA,
          <year>2012</year>
          . IEEE Computer Society Press.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Richard C.</given-names>
            <surname>Murphy</surname>
          </string-name>
          , Kyle B. Wheeler, Brian W. Barrett, and
          <string-name>
            <given-names>James A.</given-names>
            <surname>Ang</surname>
          </string-name>
          .
          <source>Introducing the graph 500</source>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>