Early Experience with Integrating Charm++ Support to
        Green-Marl Domain-Specific Language

                                                 Alexander Frolov
                                                   JSC NICEVT
                                                 Moscow, Russia
                                                 frolov@nicevt.ru



Abstract

The paper presents the implementation of the code generation mechanism in the Green-Marl domain-specific language compiler targeted at the Charm++ framework. Green-Marl is used for parallel static graph analysis and adopts an imperative shared-memory programming model, while Charm++ implements a message-driven execution model. The representation of the graph in the generated Charm++ code, as well as the translation of common Green-Marl constructs to Charm++, is described. The evaluation on typical graph algorithms (Single-Source Shortest Path, Connected Components, and PageRank) showed that Green-Marl programs translated to Charm++ achieve almost the same performance as their native implementations in Charm++.

Keywords: domain-specific language, data-driven computation models, graph computations

1 Introduction

Parallel static graph analysis on high-performance computing systems (supercomputers) is one of the recent application domains characterized by a domination of irregular data processing rather than the massive bulks of floating-point operations common in other scientific applications run on supercomputers. The key challenges of parallel graph processing are discussed in [1].

The paper presents the results of the efforts to integrate Charm++ support into the Green-Marl compiler. Green-Marl is an open-source domain-specific programming language (DSL) designed for static graph processing. Charm++ is an asynchronous message-driven execution model developed for executing parallel programs on multiprocessor computing systems. Charm++ relies on its runtime system to schedule computation as well as to perform dynamic load balancing.

For the translation of Green-Marl programs to Charm++, a code generation module has been developed in the Green-Marl compiler that uses the existing internal representation of a Green-Marl program for Charm++ code generation. For performance evaluation we used three well-known graph problems: Single-Source Shortest Path (SSSP), Connected Components (CC), and PageRank. We compared the performance of the generated code to hand-coded reference implementations.

In the next section an overview of research efforts on asynchronous computation models and graph-specific DSLs is presented. Section 3 gives a brief description of the Green-Marl DSL, and section 4 contains an overview of the Charm++ programming model. Section 5 shows the main technical aspects of translating Green-Marl programs to Charm++. Section 6 presents the results of the performance evaluation. The final section contains conclusions and future work.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.

In: V. Voevodin, A. Simonov (eds.): Proceedings of the GraphHPC-2017 Conference, Moscow State University, Russia, 02-03-2017, published at http://ceur-ws.org.

2 Related Work

Using common approaches to the development of parallel graph applications for massively parallel, distributed-memory systems, such as MPI or SHMEM in combination with OpenMP, is complicated by their limited support for expressing irregular parallelism and their implicit orientation toward statically balanced, bulk-synchronous problems; the complexity of implementing parallel graph algorithms therefore falls on the application developer. In contrast, asynchronous data-flow computational models enable an almost transparent mapping of graph algorithms. Examples of such models built on top of the asynchronous active-messages concept are Charm++ [2, 3], Active Pebbles [4, 5], and HPX [6]. Next-generation parallel programming languages under research, such as Chapel [7, 8, 9] and X10 [10, 11], also support active messages as one of the base principles of their programming models.

Even better productivity of graph application development can be achieved with high-level domain-specific languages, which make it possible to shift a major part of the work of controlling parallel execution (task spawning, communication, synchronization, etc.) to the compiler, and dynamic load balancing and failure resiliency to the runtime system. There are a few DSLs developed specifically for parallel static graph analysis, such as Green-Marl [12, 13, 14], OptiGraph, Elixir [15], and Falcon [16]. However, none of them supports distributed-memory massively parallel HPC systems; instead, the researchers focused their work on shared-memory systems, GPUs, and Big Data platforms.

3 Green-Marl

Green-Marl [12, 13, 14] is an open-source domain-specific programming language designed for parallel static graph analysis at Stanford University (Pervasive Parallelism Laboratory, PPL). The Green-Marl compiler supports translation to the following parallel programming models: OpenMP and Pregel [17]. The OpenMP backend allows Green-Marl programs to run on shared-memory multiprocessor systems, while the Pregel backends (there are two slightly different backends with Pregel implementations: GPS and Giraph) enable the use of Green-Marl on distributed massively parallel computing systems.

Green-Marl supports special types for the declaration of graphs (Graph), vertices (Node), and edges (Edge), as well as declarations of vertex properties (N_P) and edge properties (E_P).

Besides the While and Do-While constructs used for defining sequential loops, and If and If-Else for branches, Green-Marl provides statements for parallel execution such as the For and Foreach loops. The semantics of the parallel loops assumes that all computations must be finished before the first operation after the loop is executed. However, the order of iteration execution is not defined.

Green-Marl supports the following reduction operators: +=, *=, max=, min=, &&=, ||=. These operators can be used in parallel loops to produce reduction assignments.

 1 Procedure sssp(G:Graph, dist:N_P<Int>, len:E_P<Int>,
 2                root: Node)
 3 {
 4   N_P<Bool> updated;
 5   N_P<Bool> updated_nxt;
 6   N_P<Int> dist_nxt;
 7
 8   Bool fin = False;
 9   G.dist = (G == root) ? 0 : +INF;
10   G.updated = (G == root) ? True : False;
11   G.dist_nxt = G.dist;
12   G.updated_nxt = G.updated;
13
14   While(!fin) {
15     fin = True;
16     Foreach(n: G.Nodes)(n.updated) {
17       Foreach(s: n.Nbrs) {
18         Edge e = s.ToEdge();
19         <s.dist_nxt; s.updated_nxt>
20           min= <n.dist + e.len; True>;
21       }
22     }
23     G.dist = G.dist_nxt;
24     G.updated = G.updated_nxt;
25     G.updated_nxt = False;
26     fin = ! Exist(n: G.Nodes){n.updated};
27   }
28 }

Figure 1: SSSP (Bellman-Ford algorithm) in Green-Marl

In Figure 1 an implementation of the Bellman-Ford algorithm for finding the shortest paths from a specified vertex to the other vertices of the graph is presented. The program consists of a single procedure (sssp) with four parameters: G is the graph to be analyzed, dist is a vertex property storing the distance from the source vertex, len is an edge property storing the edge length (or weight), and root is the source vertex.

In the initialization step (lines 8-12), dist is set to +INF for all vertices except the root vertex, whose dist is set to zero. Also, the root vertex is marked by setting its updated property to true, which signifies that the vertex will be processed in the first iteration of the While loop (lines 14-27). Then the While loop is executed. It completes when fin is set to true, that is, when there are no more updated vertices in the graph. In each iteration of the While loop a parallel Foreach loop is executed (lines 16-22), which scans the vertices and, for each previously updated vertex (i.e., a vertex whose updated property is set to true), checks the neighbour vertices by executing a relaxation operation: if dist[v] + weight(v,u) < dist[u] for the edge (v,u), then dist[u] is assigned dist[v] + weight(v,u) (lines 19-20). The number of iterations of the While loop is bounded above by |V| − 1.

Therefore, Green-Marl is an imperative domain-specific programming language designed for parallel static graph analysis. Green-Marl significantly simplifies the development of parallel graph applications through its built-in specialized high-level abstractions, as well as a diverse set of compiler optimizations. However, it still lacks support for translation to an effective programming platform for HPC systems with distributed memory.
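The double-buffered relaxation scheme of Figure 1 can be sketched in plain sequential C++. This is a hypothetical illustration of the algorithm's semantics only, not generated Charm++ code; the `Edge` and `sssp` names are ours.

```cpp
#include <algorithm>
#include <climits>
#include <vector>

struct Edge { int to; int len; };

// Sequential sketch of Figure 1: dist/dist_nxt and updated/updated_nxt
// mirror the double-buffered vertex properties; the outer loop runs
// until no vertex was updated in the previous iteration.
std::vector<int> sssp(const std::vector<std::vector<Edge>>& g, int root) {
    int n = (int)g.size();
    std::vector<int> dist(n, INT_MAX), dist_nxt;
    std::vector<char> updated(n, 0), updated_nxt(n, 0);
    dist[root] = 0;
    updated[root] = 1;
    dist_nxt = dist;

    bool fin = false;
    while (!fin) {
        // Foreach(n: G.Nodes)(n.updated) { Foreach(s: n.Nbrs) { ... } }
        for (int v = 0; v < n; ++v) {
            if (!updated[v]) continue;
            for (const Edge& e : g[v])
                if (dist[v] != INT_MAX && dist[v] + e.len < dist_nxt[e.to]) {
                    dist_nxt[e.to] = dist[v] + e.len;  // the min= reduction
                    updated_nxt[e.to] = 1;
                }
        }
        dist = dist_nxt;
        updated = updated_nxt;
        std::fill(updated_nxt.begin(), updated_nxt.end(), 0);
        fin = true;                // fin = !Exist(n: G.Nodes){n.updated}
        for (int v = 0; v < n; ++v)
            if (updated[v]) fin = false;
    }
    return dist;
}
```

On a four-vertex graph with edges 0→1 (length 5), 0→2 (1), 2→1 (2), 1→3 (1), the sketch settles on distance 4 for vertex 3 via 0→2→1→3, exactly as the double-buffered Green-Marl loop would.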

4 Charm++

Charm++ [2, 3] is a parallel programming framework based on the object-oriented asynchronous message-driven execution model. Charm++ was designed and developed at the University of Illinois at Urbana-Champaign (Parallel Programming Laboratory, PPL). As a framework, Charm++ includes the following core components: a compiler used to translate the application's object interfaces into C++ code, and a runtime system which drives the application execution process.

In Charm++ a program consists of a set of chares, implemented as C++ objects, that have an interface of predefined methods (entry methods) which can be used to transfer data between chares and to initiate asynchronous computations; that is, the caller thread proceeds with its own execution. The mapping of chares to processing elements (PEs) is done statically by the Charm++ runtime (by default) or by the application developer. Typically, a PE can be regarded as a CPU core (or an operating-system process assigned to a CPU core). In addition to stand-alone chares, single-dimensional and multi-dimensional chare arrays are supported.

For addressing chares, special proxy objects are used that provide an abstraction of a global address space and enable the Charm++ runtime to manage the location and distribution of chares transparently for the application. This enables dynamic load balancing, with the runtime system migrating chares from more loaded nodes to less loaded ones. The migratability of chares is another base concept of the Charm++ programming model.

Two properties of the Charm++ model guarantee the atomicity of modifications of a chare's data: first, an entry method can only access data which belongs to the chare it is called on, and, second, only a single entry method can be executed on a chare at a time. This significantly simplifies application development.

5 Porting Green-Marl Compiler to Charm++

5.1 Compiler Overview

The Green-Marl compiler performs a source-to-source translation from Green-Marl to one of the possible equivalent representations: sequential C++ code, parallel code for shared-memory systems in C++ with OpenMP pragmas, or parallel code for distributed-memory systems in Java on top of Pregel [17], a vertex-centric programming model built on the Bulk Synchronous Parallel (BSP) execution model [20] (namely, one of the two Java-based implementations of Pregel: GPS [18] or Giraph [19]).

Figure 2: Main stages of the Green-Marl compiler

The general scheme of the Green-Marl compiler is shown in Figure 2. The main stages of translation are the following: lexical and syntax analysis (type checking, syntactic-sugar expansion, etc.), platform-independent optimizations (such as loop splitting and merging, loop-invariant code motion, etc.), platform-specific optimizations, and final code generation.

The internal representation of a program in the Green-Marl compiler includes an abstract syntax tree (AST) and a control-flow graph (a finite state machine, FSM) whose nodes (extended basic blocks, EBBs) correspond to linear blocks of code in the translated program. Each EBB contains information about local variables, incoming and outgoing dependencies, code statements, etc.

The FSM states can be of two types: sequential (SEQ) and parallel (PAR). Sequential nodes correspond to code blocks in which the statements must be executed in strict order. Parallel nodes correspond to code blocks which are executed for each vertex of the graph in parallel (Foreach and For loops).

5.2 Charm++ Code Generator

The developed Charm++ code generator is based on the GPS generator developed for translating Green-Marl programs to the Pregel execution model; a full description of the GPS generator can be found in [14]. However, the Charm++ programming model is more general than Pregel, which, first, makes it relatively easy to implement the Pregel model and, second, extends the possibilities of the Green-Marl translator, that is, increases the variety of Green-Marl programs which can be translated to Charm++.

At the same time, to prove the concept of translating Green-Marl programs to Charm++, it was sufficient to implement the last stage of the compiler: code generation from the internal program representation (IR).
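The master-driven SEQ/PAR state-machine execution described in section 5.1 can be sketched as a plain single-threaded C++ stand-in. This is a hypothetical illustration (the `State`/`run_fsm` names are ours); the per-vertex loop stands in for the Charm++ broadcast to the chare array, and the comment marks where quiescence detection would hold back the next state.

```cpp
#include <functional>
#include <vector>

// Each FSM state maps to an "entry method": SEQ states run one block of
// sequential code, PAR states invoke a per-vertex handler for every
// element of the vertex array, and only then does control move on.
enum class StateKind { SEQ, PAR };

struct State {
    StateKind kind;
    std::function<void()> seq_body;     // used when kind == SEQ
    std::function<void(int)> par_body;  // called per vertex when kind == PAR
};

// Runs the states in order; returns how many per-vertex calls were made.
int run_fsm(const std::vector<State>& states, int num_vertices) {
    int calls = 0;
    for (const State& st : states) {
        if (st.kind == StateKind::SEQ) {
            st.seq_body();
        } else {
            for (int v = 0; v < num_vertices; ++v) {  // broadcast to vertices
                st.par_body(v);
                ++calls;
            }
            // quiescence detection would go here: all per-vertex work must
            // finish before the next state's entry method is invoked
        }
    }
    return calls;
}
```

A two-state machine (SEQ initialization followed by a PAR accumulation over four vertices) exercises exactly the ordering guarantee the generator relies on: the PAR fan-out completes before anything after it runs.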
Further, a description of the translation methods is presented for the main Green-Marl constructs. As a running example we use a program that calculates the mean age of all registered users of a social network whose age is less than K. The source code of the Green-Marl program and the generated Charm++ code are shown in Figure 3.

 1 Procedure count(G:Graph, age:N_P<Int>, K: Int)
 2 {
 3   Int S = 0;
 4   Int C = 0;
 5   Foreach (n : G.Nodes) {
 6     If (n.age < K) {
 7       S += n.age;
 8       C += 1;
 9     }
10   }
11   Float val = (C == 0) ? 0.0 : S / (Float) C;
12 }

                               (a)

 1 module count {
 2   message __ep_state1_msg;
 3   readonly CProxy_count_master master_proxy;
 4   readonly CProxy_count_vertex vertex_proxy;
 5   chare count_master {
 6     entry count_master(const CkCallback & cb);
 7     entry void do_count(int K);
 8     entry [reductiontarget] void __reduction_S(int S);
 9     entry [reductiontarget] void __reduction_C(int C);
10     entry void __ep_state0();
11     entry void __ep_state1();
12     entry void __ep_state2();
13   }; // count_master
14   array [1D] count_vertex {
15     entry count_vertex();
16     entry void __ep_state1(__ep_state1_msg *msg);
17     entry void add_edge(const count_edge &e);
18   }; // count_vertex
19 }; // count

                               (b)

 1 class count_master: public CBase_count_master {
 2   public:
 3     void __reduction_S(int S) { this->S = S; }
 4     void __reduction_C(int C) { this->C = C; }
 5     void __ep_state0() {
 6       S = 0;
 7       C = 0;
 8       thisProxy.__ep_state1();
 9     }
10     void __ep_state1() {
11       __ep_state1_msg *_msg = new __ep_state1_msg();
12       _msg->K = K;
13       vertex_proxy.__ep_state1(_msg);
14       CkStartQD(CkIndex_count_master::__ep_state2(),
15                 &thishandle);
16     }
17     void __ep_state2() {
18       val = (C == 0) ? 0.0f : (S / (float)C);
19       done_callback.send();
20     }
21   private:
22     CkCallback done_callback;
23     int S, C, K;
24     float val;
25 }; // count_master
26 class count_vertex: public CBase_count_vertex {
27   public:
28     struct vertex_properties {
29       int age;
30     };
31   public:
32     void __ep_state1(__ep_state1_msg *msg) {
33       int K = msg->K;
34       delete msg;
35       int S, C;
36       if (this->props.age < K) {
37         S = this->props.age;
38         contribute(sizeof(int), &S,
39                    CkReduction::sum_int,
40                    CkCallback(CkReductionTarget(
41                      count_master, __reduction_S), master_proxy));
42         C = 1;
43         contribute(sizeof(int), &C,
44                    CkReduction::sum_int,
45                    CkCallback(CkReductionTarget(
46                      count_master, __reduction_C), master_proxy));
47       }
48     }
49   private:
50     std::list<count_edge> edges;
51     struct vertex_properties props;
52 }; // count_vertex

                               (c)

Figure 3: Example of a Green-Marl program (a) and its generated code in Charm++ (b, c)

Graph representation in the generated code

For the graph representation in the generated Charm++ code the simplest and most natural approach has been chosen: the vertices of the graph are mapped to chares (i.e., the elements of a chare array), which are defined in the code by the name_vertex class, where name is the name of the translated Green-Marl procedure.

The vertex_properties data structure is used for storing the vertex attributes, while edges stores the edge list of the vertex; each edge is a pair of a neighbour identifier and its attributes (edge_properties).

As a one-dimensional chare array (array [1D]) is used for storing the graph vertices, the vertices are distributed in contiguous blocks over the parallel processes (or cores) by the Charm++ runtime system.

In the example (see Figure 3) the vertices are represented by the count_vertex class (Figure 3 (b), lines 14-18, and Figure 3 (c), lines 26-52). The vertices have a single attribute, age, which is stored in the vertex_properties class. As no edge attributes are defined in the Green-Marl program, edge_properties does not have any fields.

FSM construction

As has already been mentioned, after the control-flow analysis the Green-Marl compiler creates a finite state machine (FSM). For managing the execution of the FSM a special chare (name_master) is created, and each state of the FSM is mapped to one of the entry methods __ep_state_i, where i is the number of the FSM state. The program starts with the execution of __ep_state0; then the states are switched according to the FSM. When PAR states are executed, an appropriate call is initiated on all chares of the name_vertex class (i.e., the array elements). The master (name_master) then waits until all computations are done (by using quiescence detection), after which the entry method corresponding to the next FSM state is executed. The process continues until the terminal (final) state is called.

In the example the FSM consists of three states (see Figure 4), which correspond to the following entry methods of count_master: __ep_state0, __ep_state1 and __ep_state2. As state 1 is parallel, it has a corresponding entry method in the count_vertex class (__ep_state1). In the terminal state (state 2) the callback (done_callback) is triggered to move control flow to the boilerplate code (the code of the Green-Marl program loader).
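The contribute-based pattern visible in Figure 3 (c) can be mimicked in plain C++ with no Charm++ runtime. This is a hypothetical sketch (the `Partial`, `combine` and `mean_age` names are ours): every vertex contributes its partial (S, C) values, the pairs are combined with a sum reduction the way `CkReduction::sum_int` would combine them, and a master-side reduction target computes the final value with the C == 0 guard.

```cpp
#include <vector>

// Partial per-vertex contribution: sum of matching ages and their count.
struct Partial { int S; int C; };

// Stand-in for the runtime's sum reduction over all contributions.
Partial combine(const std::vector<Partial>& contributions) {
    Partial total{0, 0};
    for (const Partial& p : contributions) {
        total.S += p.S;
        total.C += p.C;
    }
    return total;
}

// Master-side "reduction target": mean age of users younger than K,
// guarded against an empty selection (C == 0).
float mean_age(const std::vector<int>& ages, int K) {
    std::vector<Partial> contributions;
    for (int age : ages)
        if (age < K) contributions.push_back({age, 1});  // per-vertex contribute
    Partial t = combine(contributions);
    return (t.C == 0) ? 0.0f : (float)t.S / (float)t.C;
}
```

For ages {10, 20, 30, 40} and K = 35, three vertices contribute, the reduction yields S = 60 and C = 3, and the master computes a mean of 20; when no vertex matches, the guard returns 0 instead of dividing by zero.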
Figure 4: Fragment of a Green-Marl program and its control-flow graph

Global variables and reduction

In Green-Marl, all variables are of one of two kinds: global and local (per-vertex). Global variables are declared in the sequential code blocks and can be accessed from the parallel code blocks. All global variables are stored in the name_master class as its member variables. In case of a read access to a global variable from a Foreach loop, its value is passed as a parameter of the corresponding entry method to all vertices (name_vertex). In case of a write access, the corresponding entry method of name_master is called; however, when write accesses to the same global variable are performed from multiple vertices, race conditions occur and the behaviour is undefined.

The Green-Marl compiler uses the contribute operation supported in Charm++ to implement the reductions that can be used in Green-Marl programs. Charm++ supports several reduction operations which can be applied to the chare array elements; the mechanism specifies an entry method (of a specified chare) that is called when the reduction is completed. This method has to be marked with the reductiontarget keyword and has a single parameter which receives the result of the reduction.

In the example (see Figure 3) three global variables are used (S, C, K). The value of K is used inside the Foreach loop, therefore __ep_state1 in count_vertex receives K as one of the parameters of the __ep_state1_msg structure, which is a Charm++ message (message). S and C are used for the reduction results (lines 7, 8). In the generated code there are two methods in the count_master class, __reduction_S (line 3) and __reduction_C (line 4), which are used for passing the reduction results to the S and C variables.

In conclusion of this section, we can say that for implementing Charm++ support in the Green-Marl compiler it was sufficient to add a new code generation module to the existing backend chain for GPS code generation. However, as has been mentioned, Charm++ has significantly more flexibility and generality than Pregel, and therefore it offers wider possibilities both to the compiler and to the domain-specific language itself. Using these possibilities is a topic of further research.

6 Performance Evaluation

For benchmarking, a 32-node Angara-K1 cluster has been used; we used its 24-node segment with dual 6-core Intel Xeon E5-2630 processors and 64 GB of memory in each node. The nodes are connected by a custom 4D-torus Angara interconnect [21]. Eight CPU cores have been used per node in the test runs to keep the total number of processes equal to a power of two; however, this may not be necessary in the general case.

The benchmarks used for the performance evaluation include Single-Source Shortest Paths (SSSP), Connected Components (CC), and PageRank. We compared the performance of the Green-Marl programs to their hand-written Charm++ variants (reference implementations).

We used two types of synthetic graphs: RMAT [22] and Random. RMAT graphs have been designed to simulate many real-world graphs which are characterized by a power-law degree distribution (for example, social networks). For the RMAT graphs we used the generator from the Graph500 benchmark. Random graphs have a uniform random distribution of edges over the graph vertices.

The performance results are shown in Figure 5. In all plots the graph size is 2^22 vertices. For SSSP and PageRank directed graphs have been used, while for CC undirected graphs were used.

For SSSP (as well as for CC) two different reference implementations in Charm++ have been used. The first, sssp-async (cc-async), implements a fully asynchronous algorithm which employs a label-updating approach. For SSSP the updated label is the distance from the root vertex (dist) in each vertex, while for CC it is the identifier of the connected component. The computation continues while any update is possible; when there are no more updates in the graph, the algorithm finishes, which is controlled by the quiescence detection mechanism supported in Charm++. The other reference implementation, sssp-adapt (cc-adapt), is similar to the first except that global synchronization is used to control label propagation (the new labels are sent only to the neighbour vertices of the front, and then a global synchronization is performed). In the Green-Marl implementations of the tests this is achieved by the outer While loop.
the performance of the Green-Marl implementations is very close to that of the reference implementations with a partially synchronized algorithm (sssp-adapt for SSSP and cc-adapt for CC). For CC, the adapted reference implementation shows even better performance than the asynchronous one. This may look unexpected, but two explanations are plausible. First, while the partially synchronized algorithms (sssp-adapt and cc-adapt) restrict the amount of available parallelism, they also significantly reduce the number of active messages sent during execution, which results in lower runtime-system overhead and lower memory consumption. Second, the fully asynchronous implementations (sssp-async and cc-async) generate many speculative computations (speculative wavefronts), for example, propagating a distance that is not minimal in SSSP; this results in a large amount of unnecessary computation and degrades performance. In the partially synchronized implementations the amount of speculative computation is significantly smaller.

The evaluation of PageRank has not revealed any significant difference in performance between the reference and Green-Marl versions. This is primarily explained by the fact that in both cases the algorithms are nearly identical.

7   Conclusion and Future Work

In this paper the Charm++ code generator developed for the compiler of the domain-specific language Green-Marl has been presented. As a result, the Green-Marl DSL is extended to HPC clusters with distributed memory: Charm++ is added to the already supported OpenMP, GPS, and Giraph target platforms.

The performance evaluation shows that Green-Marl programs compiled to Charm++ achieve the same performance as native Charm++ programs, provided that the same algorithms are used in the generated code and in the reference implementation. In the future we plan to add support for the Topological Routing and Aggregation Module (TRAM) [23] to the Charm++ code generator implemented in the Green-Marl compiler.

Acknowledgment

This work is partially supported by the Russian Foundation for Basic Research (RFBR) under Contract 15-07-09368.

References

 [1] A. Lumsdaine, D. P. Gregor, B. Hendrickson, and J. W. Berry, "Challenges in parallel graph processing," Parallel Processing Letters, vol. 17, no. 1, pp. 5–20, 2007. http://dblp.uni-trier.de/rec/bib/journals/ppl/LumsdaineGHB07 (accessed: 10/23/2017).

 [2] L. V. Kale and S. Krishnan, "Charm++: A portable concurrent object oriented system based on C++," SIGPLAN Not., vol. 28, pp. 91–108, Oct. 1993. http://doi.acm.org/10.1145/167962.165874 (accessed: 10/23/2017).

 [3] G. Zheng, E. Meneses, A. Bhatele, and L. V. Kale, "Hierarchical load balancing for Charm++ applications on large supercomputers," in 2010 39th International Conference on Parallel Processing Workshops, pp. 436–444, IEEE, 2010. https://charm.cs.illinois.edu/newPapers/10-08/paper.pdf (accessed: 10/23/2017).

 [4] J. J. Willcock, T. Hoefler, N. G. Edmonds, and A. Lumsdaine, "Active pebbles: parallel programming for data-driven applications," in Proceedings of the International Conference on Supercomputing, pp. 235–244, ACM, 2011. http://doi.acm.org/10.1145/1995896.1995934 (accessed: 10/23/2017).

 [5] J. J. Willcock, T. Hoefler, N. G. Edmonds, and A. Lumsdaine, "Active pebbles: a programming model for highly parallel fine-grained data-driven computations," in ACM SIGPLAN Notices, vol. 46, pp. 305–306, ACM, 2011. http://doi.acm.org/10.1145/1941553.1941601 (accessed: 10/23/2017).

 [6] H. Kaiser, T. Heller, B. Adelstein-Lelbach, A. Serio, and D. Fey, "HPX: A task based programming model in a global address space," in Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, p. 6, ACM, 2014. http://doi.acm.org/10.1145/2676870.2676883 (accessed: 10/23/2017).

 [7] D. Callahan, B. L. Chamberlain, and H. P. Zima, "The Cascade high productivity language," in Ninth International Workshop on High-Level Parallel Programming Models and Supportive Environments, pp. 52–60, IEEE, 2004. http://ieeexplore.ieee.org/document/1299190/ (accessed: 10/23/2017).

 [8] B. L. Chamberlain, D. Callahan, and H. P. Zima, "Parallel programmability and the Chapel language," International Journal of High Performance Computing Applications, vol. 21, no. 3, pp. 291–312, 2007. http://dx.doi.org/10.1177/1094342007078442 (accessed: 10/23/2017).
 [9] R. Haque and D. Richards, "Optimizing PGAS overhead in a multi-locale Chapel implementation of CoMD," in Proceedings of the First Workshop on PGAS Applications, pp. 25–32, IEEE Press, 2016. https://doi.org/10.1109/PAW.2016.9 (accessed: 10/23/2017).

[10] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. Von Praun, and V. Sarkar, "X10: an object-oriented approach to non-uniform cluster computing," in ACM SIGPLAN Notices, vol. 40, pp. 519–538, ACM, 2005. http://doi.acm.org/10.1145/1103845.1094852 (accessed: 10/23/2017).

[11] O. Tardieu, B. Herta, D. Cunningham, D. Grove, P. Kambadur, V. Saraswat, A. Shinnar, M. Takeuchi, M. Vaziri, and W. Zhang, "X10 and APGAS at petascale," ACM Transactions on Parallel Computing, vol. 2, no. 4, p. 25, 2016. http://doi.acm.org/10.1145/2894746 (accessed: 10/23/2017).

[12] S. Hong, H. Chafi, E. Sedlar, and K. Olukotun, "Green-Marl: a DSL for easy and efficient graph analysis," in ACM SIGARCH Computer Architecture News, vol. 40, pp. 349–362, ACM, 2012. http://doi.acm.org/10.1145/2189750.2151013 (accessed: 10/23/2017).

[13] S. Hong, J. Van Der Lugt, A. Welc, R. Raman, and H. Chafi, "Early experiences in using a domain-specific language for large-scale graph analysis," in First International Workshop on Graph Data Management Experiences and Systems, p. 5, ACM, 2013. http://doi.acm.org/10.1145/2484425.2484430 (accessed: 10/23/2017).

[14] S. Hong, S. Salihoglu, J. Widom, and K. Olukotun, "Simplifying scalable graph processing with a domain-specific language," in Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization, p. 208, ACM, 2014. http://doi.acm.org/10.1145/2544137.2544162 (accessed: 10/23/2017).

[15] D. Prountzos, R. Manevich, and K. Pingali, "Elixir: A system for synthesizing concurrent graph programs," ACM SIGPLAN Notices, vol. 47, no. 10, pp. 375–394, 2012. http://doi.acm.org/10.1145/2398857.2384644 (accessed: 10/23/2017).

[16] R. Nasre, Y. Srikant, et al., "Falcon: a graph manipulation language for heterogeneous systems," ACM Transactions on Architecture and Code Optimization (TACO), vol. 12, no. 4, p. 54, 2016. http://doi.acm.org/10.1145/2842618 (accessed: 10/23/2017).

[17] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 135–146, ACM, 2010. http://doi.acm.org/10.1145/1807167.1807184 (accessed: 10/23/2017).

[18] S. Salihoglu and J. Widom, "GPS: a graph processing system," in Proceedings of the 25th International Conference on Scientific and Statistical Database Management, p. 22, ACM, 2013. http://doi.acm.org/10.1145/2484838.2484843 (accessed: 10/23/2017).

[19] S. Schelter, "Large scale graph processing with Apache Giraph," invited talk at GameDuell, Berlin, May 29, 2012.

[20] T. Cheatham, A. Fahmy, D. Stefanescu, and L. Valiant, "Bulk synchronous parallel computing: a paradigm for transportable software," in Tools and Environments for Parallel and Distributed Systems, pp. 61–76, Springer, 1996.

[21] A. Agarkov, T. Ismagilov, D. Makagon, A. Semenov, and A. Simonov, "Performance evaluation of the Angara interconnect," in Proceedings of the International Conference Russian Supercomputing Days, pp. 626–639, 2016. http://www.dislab.org/docs/rsd2016-angara-bench.pdf (accessed: 11.10.2017).

[22] D. Chakrabarti, Y. Zhan, and C. Faloutsos, "R-MAT: A recursive model for graph mining," in SDM, vol. 4, pp. 442–446, SIAM, 2004. https://faculty.mccombs.utexas.edu/deepayan.chakrabarti/mywww/papers/siam04.pdf (accessed: 10/23/2017).

[23] L. Wesolowski, R. Venkataraman, A. Gupta, J.-S. Yeom, K. Bisset, Y. Sun, P. Jetley, T. R. Quinn, and L. V. Kale, "TRAM: Optimizing fine-grained communication with topological routing and aggregation of messages," in Proceedings of the International Conference on Parallel Processing, ICPP '14, Minneapolis, MN, September 2014. http://charm.cs.illinois.edu/newPapers/14-18/paper.pdf (accessed: 10/23/2017).
[Figure: six performance plots, one per panel, each showing execution time in seconds versus the number of nodes (1, 2, 4, 8, 16) for the Green-Marl version and the Charm++ reference implementations (sssp-async/sssp-adapt, cc-async/cc-adapt, pagerank). Panels: (a) SSSP, RMAT-22; (b) CC, RMAT-22; (c) PageRank, RMAT-22; (d) SSSP, Random-22; (e) CC, Random-22; (f) PageRank, Random-22.]

Figure 5: Performance results of SSSP, CC, and PageRank