    Performance analysis of ∆-stepping algorithm on CPU
                          and GPU

                            Dmitry Lesnikov1                      Mikhail Chernoskutov1,2
                          dslesnikov@gmail.com                      mach@imm.uran.ru
                               1 – Ural Federal University (Yekaterinburg, Russia)
                 2 – Krasovskii Institute of Mathematics and Mechanics (Yekaterinburg, Russia)




                                                        Abstract
                       ∆-stepping is an algorithm for solving the single source shortest path
                       problem. It is very efficient on a large class of graphs, providing nearly
                       linear time complexity in its sequential implementation, and can be
                       parallelised. However, the algorithm requires some tuning and has
                       unpredictable performance behaviour on systems with different parallel
                       architectures. In this paper we describe the details of implementing this
                       algorithm on CPU and GPU. By performing numerous tests we studied the
                       scalability of the algorithm on CPU and GPU and compared its performance
                       for different values of the ∆ parameter. We show that the performance
                       results for CPU and GPU vary with ∆, and we observed that GPU performance
                       is stable across a range of ∆ values. Our results show how to choose ∆
                       to achieve the best performance on the tested graphs.




1    Introduction
The single source shortest path problem (SSSP) is a well-studied graph problem with a lot of applications,
both theoretical and practical. Many algorithms and optimisations have been proposed for solving this problem
efficiently. However, the importance of fast, scalable algorithms for SSSP has only grown over the past decade,
since the quantity of data represented in terms of graphs has reached the point where we have to process that
data in parallel using high-performance computational resources.
   The ∆-stepping algorithm [1], an algorithm for solving the SSSP problem that is quite efficient while providing
a suitable level of parallelism, has been introduced and widely researched. It is convenient and fast, yet its
performance on systems with different amounts of available parallelism (CPUs and GPUs) has not been studied. In this
paper, results of a performance evaluation of the ∆-stepping algorithm on CPU and GPU architectures are presented.
   To formalise SSSP, consider a directed graph G = G(V, E) with vertex set V and edge set E, where
|V | = n and |E| = m. Let ω : E → [0, +∞) be a function assigning nonnegative weights to edges, and let s ∈ V
be a distinguished vertex that serves as the source in the SSSP problem. The algorithm’s purpose is as follows: for
each v ∈ V reachable from s, find the weight of a minimal (i.e. shortest) path connecting s and v; let us denote this
weight dist(v). So, considering P (s, v) as a path from s to v, the goal is to find dist(v) = min_{P(s,v)} ω(P (s, v)).
The weight of a path is the sum of the weights of all edges in that path. It is also convenient to set
dist(v) = +∞ for any vertex v that is unreachable from s.

Copyright © by the paper’s authors. Copying permitted for private and academic purposes.
In: A.A. Makhnev, S.F. Pravdin (eds.): Proceedings of the 47th International Youth School-conference “Modern Problems in
Mathematics and its Applications”, Yekaterinburg, Russia, 02-Feb-2016, published at http://ceur-ws.org




2     Background
SSSP has many efficient solutions. Most of the algorithms are derived either from Dijkstra’s algorithm or from
the Bellman-Ford algorithm. A general implementation of Dijkstra’s algorithm can achieve O(n · log n + m) time
complexity, and the Bellman-Ford algorithm runs in O(nm) time. Some PRAM algorithms process vertices sequentially
in Dijkstra’s order, performing edge relaxation in parallel [3]. This approach may be faster than plain Dijkstra’s
algorithm, yet it runs in Ω(n) time. The Bellman-Ford algorithm is fairly simple to parallelise since it does all edge
relaxations independently, but it is quite work-inefficient. Known parallel BFS-based algorithms can run in
O(√n · log n · log* n) time and O(√n · m · log n) work [4].
   The ∆-stepping algorithm combines ideas from both Dijkstra’s and the Bellman-Ford algorithm. Let d be the maximum
vertex degree in the graph and L = max_{v∈V} dist(v). It is provable [2] that the purely sequential ∆-stepping
algorithm solves SSSP in O(n + m + d · L) time. A CRCW PRAM implementation on arbitrary graphs with random
positive weights and ∆ = Θ(1/d) runs in O(d · L · log n + log² n) time and O(n + m + d · L · log n) work on average [2].
Thus, for a rather large class of graphs, e.g. graphs with constant maximum vertex degree and weights distributed
in [0, 1], this yields O(log² n) time complexity and O(n + m) work complexity. Further optimisations such as
shortcut insertion (assuming that the distance dist(v) for some vertex v has already been calculated, we insert into
the graph an edge (s, v) connecting the source vertex with v and assign this edge weight dist(v)) may provide the
same results on a wider class of graphs.

2.1    Sequential algorithm
The basic algorithm pseudocode is presented in Algorithm 1 [2].

    Algorithm 1: Basic ∆-stepping algorithm
 1 begin
 2    foreach v ∈ V do
 3        dist[v] ← +∞
 4    dist[s] ← 0; B[0] ← {s}
 5    while B ≠ ∅ do
 6        i ← min{j : B[j] ≠ ∅}
 7        R ← ∅
 8        while B[i] ≠ ∅ do
 9            Req ← FindRequests(B[i], light)
10            R ← R ∪ B[i]
11            B[i] ← ∅
12            RelaxRequests(Req)
13        Req ← FindRequests(R, heavy)
14        RelaxRequests(Req)
15 Function FindRequests(V′, kind : {light|heavy})
16    return {(w, dist[v] + ω(v, w)) : v ∈ V′ ∧ (v, w) ∈ E_kind}
17 Function RelaxRequests(Req)
18    foreach (w, x) ∈ Req do
19        if x < dist[w] then
20            B[⌊dist[w]/∆⌋] ← B[⌊dist[w]/∆⌋] \ {w}
21            B[⌊x/∆⌋] ← B[⌊x/∆⌋] ∪ {w}
22            dist[w] ← x


   Initially the distance to the source vertex is 0 and the distance to any other vertex is +∞. The algorithm
requires maintaining a structure over the vertices called buckets. Considering B as an array of buckets, each
bucket B[i] stores the set {v ∈ V : v is queued and dist(v) ∈ [i · ∆, (i + 1) · ∆)} as a one-dimensional array.
With cyclic reuse of buckets, |B| can be set to max_{e∈E} ⌈ω(e)/∆⌉ + 1.
   On each iteration the algorithm finds the first nonempty bucket B[i] and processes it in the inner loop. It looks
for all light edges (e ∈ E : ω(e) ≤ ∆) outgoing from vertices in bucket B[i]. Then all vertices are removed
from that bucket but collected in set R for further processing. After that, relaxation of the previously found light
edges is performed. Importantly, some vertices may be reinserted into this bucket after relaxation due to an
improvement of their current distance. When the current bucket B[i] remains empty after the inner loop, the weights
of all vertices {v : dist(v) ∈ [i · ∆, (i + 1) · ∆)} have been finally computed. Then all heavy edges (e ∈ E : ω(e) > ∆)
outgoing from every vertex that was removed from the processed bucket and gathered in R are relaxed. The whole
process is repeated until all buckets are empty.

3     Implementations
The main interest of this study is a parallel implementation of the algorithm. There are a few efficient ways of
producing a parallel version of the ∆-stepping algorithm. Although the CPU and GPU implementations differ in some
respects, they follow the same ideas.
   It has to be mentioned that this paper considers not an optimal parallel version of the algorithm but a very
simple one that is easy to analyse and implement.
   The parallelisation strategy is straightforward: some set of vertices is assigned for processing to each thread,
and all threads run in parallel. Hence, procedures such as searching for all light or heavy edges outgoing from a
set of vertices (lines 9, 13) are also performed in parallel.
   While maintaining the bucket structure, the algorithm has to be able to concurrently determine whether each
vertex belongs to any bucket and, if it does, return the index of the bucket the vertex belongs to. This can be
implemented using two arrays. The first one stores, for each vertex, the index of the bucket it is in, or -1 if the
vertex does not belong to any bucket. The second array stores the size of each bucket.
   The described strategy also allows the found edges (lines 12, 14) to be relaxed in parallel.

3.1   CPU implementation
The CPU implementation is restricted in the number of threads running in parallel. So in this case all vertices are
evenly distributed into chunks, and each thread processes its own chunk. For instance, if p threads are
available, all vertices are put into p chunks of size n/p. In this study the division is done automatically using
OpenMP pragmas for lines 9–11, 12, 13, 14.
   On every iteration of the inner loop (lines 8–12), each thread sequentially checks, for each vertex in its chunk,
whether the vertex belongs to the current bucket, and if so moves it to set R, adding relaxation requests for the
light edges of this vertex; then it moves on to the next vertex. In the next parallel OpenMP block each thread
relaxes the light-edge requests for each vertex in its chunk that occurs in a request.
   When the bucket remains empty, two further OpenMP blocks process set R, creating relaxation requests for the
heavy edges outgoing from vertices in that set, and performing that relaxation.

3.2   GPU implementation
For the GPU implementation CUDA technology was used. The parallelisation strategy is to assign each CUDA thread
one vertex of the graph.
   This implementation uses four different CUDA kernels, for lines 9–11, 12, 13 and 14. Each of them is launched
with a number of threads equal to the number of vertices in the graph, performing the parallel processing described
above for each vertex. However, edge relaxation is performed sequentially for each vertex.
   Thus, on every iteration of the inner loop each thread in the first CUDA kernel checks whether its vertex belongs
to the current bucket. If it does, the thread creates the corresponding relaxation requests for light edges and
moves the vertex to set R. In the next CUDA kernel each thread sequentially relaxes the previously created requests.
   Finally, the third kernel creates requests for the heavy edges of all vertices in set R, and the last kernel
relaxes those requests.

4     Benchmarking
4.1   Graphs
For the benchmarks, R-MAT graphs [5] were chosen. R-MAT is a convenient graph model providing a good
approximation of real-world graphs such as the Internet or social networks. R-MAT graphs are configurable, follow
a power-law degree distribution, can have a relatively small diameter, and make it possible to create “communities
within communities”; the model also includes the Erdős–Rényi model as a special case. R-MAT graphs are
well-studied [6] and widely used nowadays.
   Let us say that a graph has scale n if it has 2^n vertices. In the benchmarks, R-MAT graphs with scales 17
to 23 and average vertex degree 32 were used. Edge weights are double-precision floating-point numbers
uniformly distributed in the range [0, 1].
   Graphs were packed into the CSR (compressed sparse row) data structure to achieve better memory efficiency
while keeping constant-time access to any vertex’s adjacency list.

4.2   Hardware and software
All tests were performed on a single node of the cluster of the Institute of Mathematics and Computer Science of
Ural Federal University with the following configuration:

  • CPU: Intel Xeon E5-2620 v2 @ 2.10 GHz, 12 cores, 32 GB of RAM,

  • GPU: Nvidia Tesla K20Xm, 2688 CUDA Cores @ 732 MHz, 6 GB of memory.

  All benchmark code was written in C99. CPU versions were parallelised via the OpenMP standard; GPU versions
were written in CUDA 7.0. Compilers used: gcc 4.8.5, nvcc 7.0.27.

4.3   Bellman-Ford and Dijkstra algorithms for performance comparison
It is reasonable to first compare the performance of the most popular algorithms, Dijkstra’s and Bellman-Ford,
with that of the ∆-stepping algorithm.
   These tests were performed on CPU. We separately compared sequential implementations of all three algorithms,
and parallel implementations of the Bellman-Ford and ∆-stepping algorithms along with the sequential Dijkstra’s
algorithm. Figure 1 shows the resulting performance of the algorithms in millions of traversed edges per second
(MTEPS). For this comparison the ∆ parameter was 0.04125.
   ∆-stepping outperforms the other algorithms in both the sequential and the parallel versions. This result
corresponds to the asymptotic complexity of these algorithms. Moreover, ∆-stepping does not degrade on larger
graphs as the Bellman-Ford algorithm does, thanks to better cache efficiency. On each iteration the Bellman-Ford
algorithm has to relax all edges, so it performs a lot of memory operations. The ∆-stepping algorithm allows the
processor to keep some buckets in cache, so it can access the vertices in them and their outgoing edges very
quickly, maintaining high performance.

4.4   ∆-stepping performance on CPU and GPU
Performance results are based on 128 launches of the algorithm for every graph, averaged into a mean MTEPS
value. The source vertex was chosen randomly each time. Also, for the sake of clarity of the dependency on the ∆
parameter, the time of moving data to and from the GPU was not included in the total solution time. This means
that real-life results might be significantly lower; however, the trend would be the same.




             (a) Sequential implementations                             (b) Parallel implementations

                        Figure 1: Performance comparison of different SSSP algorithms




                  (a) CPU performance                                      (b) GPU performance

                                        Figure 2: ∆-stepping performance

   Figure 2 shows the final results. The “delta” axis represents the range of the ∆ parameter; the “scale” axis
represents the scales of the tested R-MAT graphs. Performance results are given in millions of TEPS.
   It is easy to see that in the tested setup the CPU version is much more sensitive to variation of ∆. Note that
the presented surfaces contain the global maximum over the ∆ parameter: making ∆ lower or higher in both the
CPU and the GPU version yields equal or worse results.

5   Discussion
The CPU results seem to fit the theoretical research in [2] quite well. The ∆ that corresponds to the highest
performance level is slightly higher than 1/d̄, where d̄ is the average vertex degree (32 in this study).
   The GPU performance results require more analysis to explain. Firstly, it is clear that on relatively small
graphs (scale 18) the device cannot achieve enough occupancy while processing a small number of vertices and
edges in parallel, which results in poor performance. On larger graphs there are enough vertices to fully utilise
the device, which yields the best performance compared with even bigger graphs, on which the algorithm degrades
due to graph scaling.
   A topical question is the analysis of the dependency on the ∆ parameter. At first sight any ∆ ≥ 0.5 provides
good performance for no apparent reason, since the algorithm behaves differently for ∆ = 1 and ∆ = 0.5.
   The point is that with a relatively low ∆ (0.5 in this case) the algorithm behaves in an optimal way, but the
average vertex degree in the graph is quite small, so the parallel work is done over a set of vertices not large
enough to achieve good occupancy on the device. A small vertex degree leads to poor scaling. With a higher ∆
parameter the algorithm treats more edges as light (with ∆ ≥ 1, all of them). This means the ∆-stepping algorithm
behaves more like the Bellman-Ford algorithm, providing a good level of parallelism but doing a lot of work.
However, with a small mean vertex degree and a relatively high ∆ it is possible to get a higher level of device
occupancy while doing parallel work, providing better performance. On graphs with a higher vertex degree the ∆
parameter could be chosen closer to the theoretical optimum [7].
   It is reasonable to choose ∆ = n/m [8]. For R-MAT, however, 1/d̄ = n/m holds. Taking ∆ = 1/32 in our case
results in worse performance than ∆ = 0.1 and especially ∆ = 0.5.
   The performance of the ∆-stepping algorithm depends on the ∆ parameter because tuning this parameter resolves
the trade-off between a large number of phases and too much work on each phase [2, 7]. However, in our study the
graph size ceiling of the tested hardware does not allow graphs large enough to show performance degradation with
a low number of phases and a lot of work on each phase; it does allow us to see that if ∆ is too small, so that
there are too many phases without a high workload, algorithm performance decreases.

6   Conclusion
This study presents the results of one approach to ∆-stepping parallelisation and a comparison of its performance
for different ∆ parameters. We show that the optimal ∆ parameter for R-MAT graphs with mean vertex degree 32 is
between 0.04 and 0.05 on CPU. The optimal ∆ parameter for the GPU implementation on the same graphs is 0.5 and
may become smaller on graphs with a higher vertex degree.




   This result is quite case-specific due to some restrictions, yet it is interesting to analyse and to extrapolate
to other similar cases.
   Further research may focus on determining the best-performing algorithm setup on different systems.



References
 [1] U. Meyer and P. Sanders. ∆-stepping: A parallel single source shortest path algorithm. Lecture Notes in
     Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
     Bioinformatics), 1461 LNCS:393–404, 1998.
 [2] U. Meyer and P. Sanders. ∆-stepping: a parallelizable shortest path algorithm. Journal of Algorithms,
     49:114–152, 2003.
 [3] G.S. Brodal, J.L. Träff, C.D. Zaroliagis. A parallel priority queue with constant time operations. Journal
     of Parallel and Distributed Computing, 49:4–21, 1998.

 [4] P.N. Klein and S. Subramanian. A Randomized Parallel Algorithm for Single-Source Shortest Paths. Journal
     of Algorithms, 25:205–220, 1997.
 [5] D. Chakrabarti, Y. Zhan, C. Faloutsos. R-MAT: A recursive model for graph mining. SIAM Proceedings
     Series. Proceedings of the Fourth SIAM International Conference on Data Mining, 442–446, 2004.

 [6] C. Groër, B.D. Sullivan, S. Poole. A mathematical analysis of the R-MAT random graph generator. Net-
     works, 58(3):159–170, 2011.
 [7] V.T. Chakaravarthy, F. Checconi, F. Petrini, Y. Sabharwal. Scalable Single Source Shortest Path Algo-
     rithms for Massively Parallel Systems. Parallel and Distributed Processing Symposium, 2014 IEEE 28th
     International, 889–901, 2014.

 [8] K. Madduri, D.A. Bader, J.W. Berry, J.R. Crobak. An experimental study of a parallel shortest path
     algorithm for solving large-scale graph instances. Proceedings of the 9th Workshop on Algorithm Engineering
     and Experiments and the 4th Workshop on Analytic Algorithms and Combinatorics, 23–25, 2007.



