
          Methods for High Performance Graph
                      Processing

                                 Mikhail Chernoskutov

                  Krasovskii Institute of Mathematics and Mechanics,
                                Ural Federal University,
                                 Yekaterinburg, Russia
                                   mach@imm.uran.ru



       Abstract. This paper describes two methods for accelerating the processing
       of a large graph distributed across the memory of multiple nodes. The
       first substantially reduces the overhead of data transfer between nodes.
       The second reduces workload imbalance amongst computational threads.
       Both methods are integrated into a breadth-first search algorithm and
       more than triple its performance.

       Keywords: data intensive applications, graph processing, parallel algo-
       rithms.


1    Introduction
Graph algorithms are used in many fields of science and engineering. In many
cases, big graphs can be processed in parallel by computational systems with
multiple nodes [1]. However, parallelization efficiency is impaired by intensive
memory access and by the fact that, on systems with distributed memory, the exact
location of data is in general not known in advance. These obstacles turn graph
algorithms into typical data-intensive applications [2].
    The intensive memory access pattern is a crucial bottleneck in implementations
of parallel graph algorithms because, in many cases, such algorithms perform only
a small number of arithmetic operations. On the other hand, they perform many
accesses to small pieces of data scattered across memory. Thus, graph algorithms
are less demanding of CPU capabilities but place high demands on the efficiency
and bandwidth of the memory subsystem.
    The fact that the distribution of data is not known in advance dramatically
complicates efficient multi-node parallel implementations of graph algorithms. In
the general case, the program might have to search for the data of a particular
vertex across all computational nodes.
    The object of this research is a parallel breadth-first search algorithm on
graphs with small diameter (generally, no more than 10) and skewed degree
distribution. Such graphs arise, for instance, in social network analysis and in
various mathematical and physical simulation applications [3]. The main feature of
these graphs is a relatively small number of vertices with very high degrees and a
large number of vertices with small degrees (i.e., with only a few incident edges).


2    Parallel Breadth-First Search

Breadth-first search may be parallelized with the aid of level-synchronous
algorithms. These algorithms process each level (or iteration) separately and
independently of the others: in breadth-first search, processing of level N + 1
begins only after processing of level N has finished, while each level itself
(i.e., all vertices and edges of the same level) may be processed in parallel.
    Presently, the two most common types of level-synchronous breadth-first
search algorithms are:

 – direct traversal (top-down approach);
 – inverse traversal (bottom-up approach [4]).

Direct graph traversal is the standard version of a level-synchronous breadth-first
search. In it, the vertices that are active on the current iteration mark all of
their unvisited neighbors. The pseudocode of a parallel breadth-first search with
top-down traversal is presented in fig. 1.
    In lines 1–4, the array of distances from the source vertex is initialized.
The initialization is followed by the main loop, which repeats while there exist
unmarked vertices in the graph. First, in lines 6–14, the vertices stored at the
current node are marked. Then, in lines 16 and 17, messages to other nodes are
sent and received (point-to-point); these messages contain the data on vertices
that must be marked on those nodes. Finally, in lines 19–22, the vertices whose
data was received from other nodes are marked. At the end of each iteration, the
level counter is incremented in line 23 and then, in line 24, the algorithm
checks whether there are still unmarked vertices.
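    For concreteness, the following C++/OpenMP fragment is a minimal sketch of the
local marking step of one top-down level (lines 6–14 of fig. 1). The CSR-style graph
structure, the owner array, and the use of global vertex identifiers directly as
array indices are simplifying assumptions made for illustration only; they are not
taken from the reference implementation.

    #include <cstdint>
    #include <vector>

    // Hypothetical CSR adjacency for the vertices stored on this node.
    struct LocalGraph {
        std::vector<int64_t> offsets;  // size = n_local + 1
        std::vector<int64_t> targets;  // neighbor ids (used directly as indices here)
        std::vector<int>     owner;    // owning node of each target (illustrative)
        int                  this_node;
    };

    // One top-down level: vertices at the current level mark unvisited local
    // neighbors and batch remote neighbors for point-to-point sending.
    void top_down_level(const LocalGraph& g, int level,
                        std::vector<int>& dist, std::vector<int64_t>& pred,
                        std::vector<int64_t>& batch_to_send) {
        #pragma omp parallel
        {
            std::vector<int64_t> local_batch;            // per-thread send batch
            #pragma omp for schedule(dynamic, 64)
            for (int64_t v = 0; v < (int64_t)g.offsets.size() - 1; ++v) {
                if (dist[v] != level) continue;          // v is not on the frontier
                for (int64_t e = g.offsets[v]; e < g.offsets[v + 1]; ++e) {
                    int64_t u = g.targets[e];
                    if (g.owner[e] == g.this_node) {     // neighbor stored locally
                        if (dist[u] == -1) {             // benign race: any frontier parent is valid
                            dist[u] = level + 1;
                            pred[u] = v;
                        }
                    } else {
                        local_batch.push_back(u);        // to be sent to its owner
                    }
                }
            }
            #pragma omp critical
            batch_to_send.insert(batch_to_send.end(),
                                 local_batch.begin(), local_batch.end());
        }
    }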
    The bottom-up implementation of breadth-first search assumes that inactive
vertices look for active vertices amongst their neighbors. If such a neighbor
exists on the current iteration, the vertex marks itself. The pseudocode of
parallel breadth-first search with inverse traversal is presented in fig. 2.
    As in the previous pseudocode, the breadth-first search is initialized in
lines 1–4. In lines 18 and 19, the level number is updated and the termination
condition is checked. However, there is a big difference in the vertex marking
procedure: in the case of bottom-up traversal, it is crucial to know the current
state of all active vertices in the graph. It is most convenient to do this by
means of a bitmap whose length equals the number of vertices in the graph.
Therefore, at each iteration, the data is synchronized by updating the bitmap
through collective MPI communications (lines 15 and 16).
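    A minimal C++ sketch of this synchronization step with MPI is shown below. It
assumes that each rank owns an equal-sized slice of the bitmap and gathers the full
bitmap of the next level with MPI_Allgather; all buffer sizes and names are
illustrative.

    #include <mpi.h>
    #include <cstdint>
    #include <vector>

    // Exchange the visited bitmap after a bottom-up iteration (fig. 2, lines 15-16).
    // Each rank contributes its slice of bitmap_next; the gathered result becomes
    // the bitmap_current used on the following iteration.
    void sync_bitmap(const std::vector<uint64_t>& next_local,   // this rank's slice
                     std::vector<uint64_t>& current_full) {     // full bitmap on every rank
        MPI_Allgather(next_local.data(), (int)next_local.size(), MPI_UINT64_T,
                      current_full.data(), (int)next_local.size(), MPI_UINT64_T,
                      MPI_COMM_WORLD);
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int nranks;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const size_t words_per_rank = 1024;           // 65536 vertices per rank (illustrative)
        std::vector<uint64_t> bitmap_next(words_per_rank, 0);
        std::vector<uint64_t> bitmap_current(words_per_rank * nranks, 0);

        // ... bottom-up marking would set bits of bitmap_next here ...
        sync_bitmap(bitmap_next, bitmap_current);

        MPI_Finalize();
        return 0;
    }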



 1   for each u in dist
 2       dist[u] := -1
 3   dist[s] := 0
 4   level := 0
 5   do
 6       parallel for each vert in V.this_node
 7           if dist[vert] = level
 8               for each neighb in vert.neighbors
 9                   if neighb in V.this_node
10                       if dist[neighb] = -1
11                           dist[neighb] := level + 1
12                           pred[neighb] := vert
13                   else
14                       vert_batch_to_send.push(neighb)
15
16       send(vert_batch_to_send)
17       receive(vert_batch_to_receive)
18
19       parallel for each vert in vert_batch_to_receive
20           if dist[vert] = -1
21               dist[vert] := level + 1
22               pred[vert] := vert.pred
23       level++
24   while (!check_end())


                 Fig. 1. Parallel top-down breadth-first search algorithm pseudocode

 1   for each u in dist
 2       dist[u] := -1
 3   dist[s] := 0
 4   level := 0
 5   do
 6       parallel for each vert in V.this_node
 7           if dist[vert] = -1
 8               for each neighb in vert.neighbors
 9                   if bitmap_current.neighb = 1
10                       dist[vert] := level + 1
11                       pred[vert] := neighb
12                       bitmap_next.vert := 1
13                       break
14
15       allgather(bitmap_next)
16       swap(bitmap_current, bitmap_next)
17
18       level++
19   while (!check_end())

                Fig. 2. Parallel bottom-up breadth-first search algorithm pseudocode



3     Performance Engineering of Parallel Breadth-First
      Search
To increase the performance of parallel breadth-first search algorithms, we suggest
the following two methods:
 – hybrid graph traversal;
 – workload distribution across computational threads.

3.1   Hybrid Traversal
This method combines different types of parallel breadth-first search for executing
different iterations. In particular, the top-down approach is characterized by a
large computational workload and large data transfer overheads on the middle
iterations, while its first and last iterations execute quickly and involve almost
no data transfer. The situation is different with the bottom-up approach: because
it uses collective MPI data transfer operations, its data transfer overheads are
almost the same on every iteration, but marking the vertices on the first
iterations takes much more time than on the middle and later iterations.
    Taking into account that the data produced on each iteration is the same
regardless of traversal direction, we suggest combining the two traversal types,
choosing on each iteration the one with the smallest data transfer overhead, to
achieve maximal performance. In this study, we propose the following scheme (a
minimal sketch of the switching rule is given after the list):
 – top-down on first two iterations;
 – bottom-up on next three iterations;
 – top-down on all other iterations.
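    The switching rule can be expressed as a small helper function. The C++
fragment below is a sketch of this static schedule; the function name and the use
of a per-iteration counter are illustrative, and a production implementation could
also switch adaptively on the frontier size, as in direction-optimizing BFS [4].

    // Traversal direction for one level-synchronous BFS iteration.
    enum class Direction { TopDown, BottomUp };

    // Static schedule used in this paper: top-down on the first two
    // iterations, bottom-up on the next three, top-down afterwards.
    Direction choose_direction(int iteration) {
        if (iteration < 2) return Direction::TopDown;   // iterations 0-1
        if (iteration < 5) return Direction::BottomUp;  // iterations 2-4
        return Direction::TopDown;                      // all later iterations
    }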

3.2   Workload Distribution
When processing graphs with skewed degree distribution, it is not known in advance
(in the general case) which vertices will be processed by a particular
computational thread; the only thing that may be known in advance is, for instance,
the total number of vertices the current thread has to process. However, the total
workload is determined not by the number of active vertices but by the number of
edges incident to them. This leads to workload imbalance amongst threads and to
large overheads during the runtime of a parallel level-synchronous algorithm.
    To avoid workload imbalance, we suggest switching from iterating over an array
of vertices to iterating over an array of edges. For this purpose, we logically
divide the edge array into equal pieces holding max_edges elements each. Each
thread determines the vertex corresponding to every edge in its part of the edge
array by using the part_column array, which contains, for each part, the vertex
incident to the first edge of that part. The pseudocode for parallel filling of the
part_column array is presented in fig. 3.




    parallel for each i in V.this_node
        first := V.this_node[i]
        last := V.this_node[i+1]
        index := round_up(first / max_edges)
        current := index * max_edges
        while (current < last)
            part_column[index] := i
            current := current + max_edges
            index++

                  Fig. 3. Parallel filling of the part_column array (pseudocode)
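    For concreteness, the same computation can be rendered as a compilable
C++/OpenMP fragment. The version below assumes that the edge offsets of the local
vertices are stored in a CSR-style offsets array (offsets[i] is the index of the
first edge of vertex i); the function and variable names are illustrative.

    #include <cstdint>
    #include <vector>

    // Fill part_column from the CSR offsets of the local vertices (fig. 3):
    // part_column[p] is the vertex that owns the first edge of edge chunk p,
    // where chunks hold max_edges edges each.
    std::vector<int64_t> fill_part_column(const std::vector<int64_t>& offsets,
                                          int64_t max_edges) {
        const int64_t n_vertices = (int64_t)offsets.size() - 1;
        const int64_t n_edges    = offsets.back();
        const int64_t n_parts    = (n_edges + max_edges - 1) / max_edges;
        std::vector<int64_t> part_column(n_parts, 0);

        #pragma omp parallel for
        for (int64_t i = 0; i < n_vertices; ++i) {
            int64_t first = offsets[i];
            int64_t last  = offsets[i + 1];
            // First chunk boundary at or after this vertex's first edge.
            int64_t index   = (first + max_edges - 1) / max_edges;
            int64_t current = index * max_edges;
            while (current < last) {         // vertex i owns these chunk starts
                part_column[index] = i;
                current += max_edges;
                ++index;
            }
        }
        return part_column;
    }

    Note that each chunk index is written by at most one vertex (the vertex whose
edge range contains the chunk's first edge), so the parallel loop needs no
synchronization.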



    Pseudocode of a new version of the breadth-first search algorithm that uses
the part_column array is presented in fig. 4 (the top-down approach) and fig. 5
(the bottom-up approach).
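    As an illustration of the balanced top-down traversal of fig. 4, the following
C++/OpenMP sketch iterates over equal-sized edge chunks and advances curr_vert
whenever a chunk crosses into the next vertex's adjacency range. The CSR arrays
(offsets, targets) and all names are illustrative assumptions, not part of the
reference implementation, and the data synchronization step between nodes is
omitted.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Balanced top-down marking: threads process equal-sized edge chunks;
    // part_column gives the vertex owning each chunk's first edge.
    void top_down_level_balanced(const std::vector<int64_t>& offsets,
                                 const std::vector<int64_t>& targets,
                                 const std::vector<int64_t>& part_column,
                                 int64_t max_edges, int level,
                                 std::vector<int>& dist, std::vector<int64_t>& pred) {
        const int64_t n_edges = offsets.back();
        #pragma omp parallel for schedule(static)
        for (int64_t p = 0; p < (int64_t)part_column.size(); ++p) {
            int64_t first_edge = p * max_edges;
            int64_t last_edge  = std::min((p + 1) * max_edges, n_edges);
            int64_t curr_vert  = part_column[p];
            for (int64_t e = first_edge; e < last_edge; ++e) {
                // Edge e belongs to curr_vert once e falls inside its range.
                while (e >= offsets[curr_vert + 1]) ++curr_vert;
                if (dist[curr_vert] == level) {       // curr_vert is on the frontier
                    int64_t k = targets[e];
                    if (dist[k] == -1) {              // benign race, as in fig. 1
                        dist[k] = level + 1;
                        pred[k] = curr_vert;
                    }
                }
            }
        }
    }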


    4    Benchmarking

    Both of the aforementioned methods were incorporated into a custom
    implementation of the Graph500 benchmark [5]. The kernel of this benchmark is
    a parallel breadth-first search on a big graph distributed in the memory of
    multiple nodes. The size of the graph is determined by the scale parameter,
    which is the base-2 logarithm of the number of vertices in the graph. The
    average degree of all vertices is 16. The main performance metric of this
    benchmark is the number of traversed edges per second (TEPS).
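        TEPS is simply the number of traversed edges divided by the breadth-first
    search time. The small C++ fragment below computes the MTEPS value reported in
    figs. 6 and 7 for a hypothetical run; the scale, the edge count, and the timing
    are illustrative.

        #include <cstdint>
        #include <cstdio>

        int main() {
            int     scale       = 23;
            int64_t n_vertices  = 1LL << scale;       // 2^scale vertices
            int64_t n_edges     = 16 * n_vertices;    // average degree 16
            double  bfs_seconds = 0.5;                // hypothetical BFS time

            double teps  = (double)n_edges / bfs_seconds;
            double mteps = teps / 1e6;                // figures 6-7 report MTEPS
            std::printf("%.1f MTEPS\n", mteps);       // ~268.4 MTEPS
            return 0;
        }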
        We developed a custom implementation that uses MPI (one process per
    computational node) for sending and receiving messages between nodes and
    OpenMP (eight threads per node) for shared-memory parallelism within each
    node.
        Benchmarking was carried out for graphs of different sizes on 1-, 2-, 4-,
    and 8-node configurations of the Uran supercomputer (located at the Krasovskii
    Institute of Mathematics and Mechanics). Each node had an Intel Xeon X5675 CPU
    and 46 GB of DRAM. The performance of the custom implementation was compared
    with the following reference implementations provided by Graph500:

     – simple (represents top-down breadth-first search);
     – replicated (represents bottom-up breadth-first search).

        Benchmarking results are presented in fig. 6. As seen in the figure, our
    custom implementation substantially outperforms the simple and replicated
    implementations. In addition, in the case of 8 nodes, it is clearly seen that
    the custom implementation scales much better than its counterparts (the
    scalability of all implementations is presented in fig. 7).





     // preparation...
     parallel for each i in part_column
         first_edge := i * max_edges
         last_edge := (i+1) * max_edges
         curr_vert := part_column[i]
         for each edge in [first_edge; last_edge)
             if neighbors of curr_vert in [first_edge; last_edge)
                 if dist[curr_vert] = level
                     for each k in neighbors of curr_vert
                         if dist[k] = -1
                             dist[k] := level + 1
                             pred[k] := curr_vert
             curr_vert++
     // data synchronization...

     Fig. 4. Parallel top-down breadth-first search algorithm pseudocode (with workload
     balancing)




     // preparation...
     parallel for each i in part_column
         first_edge := i * max_edges
         last_edge := (i+1) * max_edges
         curr_vert := part_column[i]
         for each edge in [first_edge; last_edge)
             if neighbors of curr_vert in [first_edge; last_edge)
                 if dist[curr_vert] = -1
                     for each k in neighbors of curr_vert
                         if bitmap_current.k = 1
                             dist[curr_vert] := level + 1
                             pred[curr_vert] := k
                             bitmap_next.curr_vert := 1
                             break
             curr_vert++
     // data synchronization...

     Fig. 5. Parallel bottom-up breadth-first search algorithm pseudocode (with workload
     balancing)







[Figure: four panels (1 node, 2 nodes, 4 nodes, 8 nodes) plotting the performance
rate in MTEPS against the graph scale (20-25) for the custom, replicated, and
simple implementations.]

                                  Fig. 6. Performance comparison




[Figure: performance rate in MTEPS plotted against the number of nodes (1-8) for
the custom, replicated, and simple implementations.]

                                  Fig. 7. Scalability comparison




5    Conclusion
Efficient parallelization of breadth-first search on graphs with skewed degree
distribution is hampered by workload imbalance amongst computational threads and
by large amounts of data transfer concentrated on a few iterations of the
algorithm. These bottlenecks make a high-performance implementation of the
algorithm much harder to build.
    In this paper, we suggest two methods, workload balancing and traversal
hybridization, which more than triple the performance of parallel level-synchronous
breadth-first search in comparison with the reference top-down and bottom-up
implementations.
    In our future work, we intend to focus on studying the scalability of the
suggested algorithm and on testing it on graphs obtained from real-world
applications. Another important task is to modify our custom implementation to use
computational accelerators to improve its performance.


Acknowledgments. The reported study was partially supported by RFBR,
research project No. 14-07-00435. The experiment was performed on the Uran
supercomputer.


References
1. Mark E. J. Newman. The structure and function of complex networks. SIAM Review,
   45(2):167–256, 2003.
2. Andrew Lumsdaine, Douglas Gregor, Bruce Hendrickson, and Jonathan Berry. Chal-
   lenges in parallel graph processing. Parallel Processing Letters, 17(1):5–20, 2007.
3. Bruce Hendrickson and Jonathan W. Berry. Graph analysis with high-performance
   computing. Computing in Science and Engineering, 10(2):14–19, March 2008.
4. Scott Beamer, Krste Asanovic, and David Patterson. Direction-optimizing breadth-
   first search. In Proceedings of the International Conference on High Performance
   Computing, Networking, Storage and Analysis (SC '12), pages 12:1–12:10, Los
   Alamitos, CA, USA, 2012. IEEE Computer Society Press.
5. Richard C. Murphy, Kyle B. Wheeler, Brian W. Barrett, and James A. Ang. Intro-
   ducing the Graph 500. 2010.



