=Paper=
{{Paper
|id=Vol-1981/paper6
|storemode=property
|title=Early Experience with Integrating Charm++ Support to Green-Marl Domain-Specific Language
|pdfUrl=https://ceur-ws.org/Vol-1981/paper6.pdf
|volume=Vol-1981
|authors=Alexander Frolov
}}
==Early Experience with Integrating Charm++ Support to Green-Marl Domain-Specific Language==
Alexander Frolov, JSC NICEVT, Moscow, Russia, frolov@nicevt.ru

'''Abstract.''' The paper presents the implementation of the code generation mechanism in the Green-Marl domain-specific language compiler targeted at the Charm++ framework. Green-Marl is used for parallel static graph analysis and adopts an imperative shared memory programming model, while Charm++ implements a message-driven execution model. The graph representation used in the generated Charm++ code, as well as the translation of the common Green-Marl constructs to Charm++, is described. The evaluation on typical graph algorithms (Single-Source Shortest Path, Connected Components, and PageRank) showed that Green-Marl programs translated to Charm++ have almost the same performance as their native implementations in Charm++.

'''Keywords:''' domain-specific language, data-driven computation models, graph computations

''Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: V. Voevodin, A. Simonov (eds.): Proceedings of the GraphHPC-2017 Conference, Moscow State University, Russia, 02-03-2017, published at http://ceur-ws.org.''

===1 Introduction===

Parallel static graph analysis on high-performance computing systems (supercomputers) is one of the recent application domains characterized by a domination of irregular data processing rather than the massive bulks of floating-point operations common in other scientific applications run on supercomputers. The key challenges of parallel graph processing are discussed in [1].

This paper presents the results of the effort to integrate Charm++ support into the Green-Marl compiler. Green-Marl is an open-source domain-specific programming language (DSL) designed for static graph processing. Charm++ is an asynchronous message-driven execution model developed to execute parallel programs on multiprocessor computing systems; Charm++ relies on its runtime system to schedule computation as well as to perform dynamic load balancing.

For the translation of Green-Marl programs to Charm++, a code generation module has been developed in the Green-Marl compiler that uses the existing internal representation of a Green-Marl program. For performance evaluation we used three well-known graph problems: Single-Source Shortest Path (SSSP), Connected Components (CC), and PageRank, and compared the performance of the generated code to hand-coded reference implementations.

In the next section an overview of research efforts on asynchronous computation models and graph-specific DSLs is presented. Section 3 gives a brief description of the Green-Marl DSL, and section 4 contains an overview of the Charm++ programming model. Section 5 describes the main technical aspects of translating Green-Marl programs to Charm++. Section 6 presents the results of the performance evaluation. The final section contains conclusions and future work.

===2 Related Work===

Using common approaches to the development of parallel graph applications for massively parallel, distributed memory systems, such as MPI or SHMEM in combination with OpenMP, is complicated by their limited support for expressing irregular parallelism and their implicit orientation toward statically balanced bulk-synchronous problems; the complexity of implementing parallel graph algorithms therefore falls on the application developer. In contrast, asynchronous data-flow computational models enable an almost transparent mapping of graph algorithms. Examples of such models built on top of the asynchronous active messages concept are Charm++ [2, 3], Active Pebbles [4, 5], and HPX [6]. Next-generation parallel programming languages such as Chapel [7, 8, 9] and X10 [10, 11] also support active messages as one of the base principles of their programming models.

Even better productivity of graph application development can be achieved with high-level domain-specific languages, which make it possible to shift a major part of the work of controlling parallel execution (task spawning, communication, synchronization, etc.) to the compiler, and dynamic load balancing and failure resiliency to the runtime system. There are a few DSLs developed specifically for parallel static graph analysis, such as Green-Marl [12, 13, 14], OptiGraph, Elixir [15], and Falcon [16]. However, none of them supports distributed memory massively parallel HPC systems; the researchers have instead focused on shared memory systems, GPUs, and Big Data platforms.

===3 Green-Marl===

<pre>
 1  Procedure sssp(G: Graph, dist: N_P<Int>, len: E_P<Int>,
 2                 root: Node)
 3  {
 4    N_P<Bool> updated;
 5    N_P<Bool> updated_nxt;
 6    N_P<Int> dist_nxt;
 7
 8    Bool fin = False;
 9    G.dist = (G == root) ? 0 : +INF;
10    G.updated = (G == root) ? True : False;
11    G.dist_nxt = G.dist;
12    G.updated_nxt = G.updated;
13
14    While (!fin) {
15      fin = True;
16      Foreach (n: G.Nodes)(n.updated) {
17        Foreach (s: n.Nbrs) {
18          Edge e = s.ToEdge();
19          <s.dist_nxt; s.updated_nxt> min=
20            <n.dist + e.len; True>;
21        }
22      }
23      G.dist = G.dist_nxt;
24      G.updated = G.updated_nxt;
25      G.updated_nxt = False;
26      fin = ! Exist(n: G.Nodes){n.updated};
27    }
28  }
</pre>

''Figure 1: SSSP (Bellman-Ford algorithm) in Green-Marl''

Figure 1 shows an implementation of the Bellman-Ford algorithm for finding the shortest paths from a specified vertex to the other vertices of a graph.
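The semantics of the double-buffered iteration in Figure 1 can be sketched in plain sequential C++ (a sketch only; the adjacency-list representation and names below are illustrative and not part of Green-Marl or the generated code):

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <utility>
#include <vector>

// Sequential sketch of the frontier-based Bellman-Ford iteration from
// Figure 1: dist/updated hold the committed state of the previous round,
// dist_nxt/updated_nxt collect the relaxations of the current round.
std::vector<int> sssp(int n,
                      const std::vector<std::vector<std::pair<int,int>>>& adj,
                      int root) {
  std::vector<int> dist(n, INT_MAX);
  std::vector<bool> updated(n, false);
  dist[root] = 0;
  updated[root] = true;
  std::vector<int> dist_nxt = dist;
  std::vector<bool> updated_nxt = updated;

  bool fin = false;
  while (!fin) {
    for (int v = 0; v < n; ++v) {
      if (!updated[v]) continue;            // only the frontier relaxes
      for (auto [u, len] : adj[v]) {
        if (dist[v] + len < dist_nxt[u]) {  // <dist_nxt; updated_nxt> min=
          dist_nxt[u] = dist[v] + len;
          updated_nxt[u] = true;
        }
      }
    }
    dist = dist_nxt;                        // commit the round
    updated = updated_nxt;
    std::fill(updated_nxt.begin(), updated_nxt.end(), false);
    fin = std::none_of(updated.begin(), updated.end(),
                       [](bool b) { return b; });   // ! Exist(n){n.updated}
  }
  return dist;
}
```

The commit step at the end of each round corresponds to lines 23-26 of the listing; the compiler's parallel Foreach only has to preserve these per-round semantics, not the iteration order.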
Green-Marl [12, 13, 14] is an open-source domain-specific programming language for parallel static graph analysis, designed at Stanford University (Pervasive Parallelism Laboratory, PPL). The Green-Marl compiler supports translation to the following parallel programming models: OpenMP and Pregel [17]. The OpenMP backend allows running Green-Marl programs on shared memory multiprocessor systems, while the Pregel backends (there are two slightly different backends with Pregel implementations: GPS and Giraph) make Green-Marl usable on distributed massively parallel computing systems.

Green-Marl provides special types for the declaration of graphs (Graph), vertices (Node), and edges (Edge), as well as for the declaration of vertex properties (N_P) and edge properties (E_P).

Besides the While and Do-While constructs used for defining sequential loops, and If and If-Else for branches, Green-Marl provides statements for parallel execution such as the For and Foreach loops. The semantics of the parallel loops assumes that all computations are finished before the first operation after the loop is executed; however, the order of iteration execution is not defined.

Green-Marl supports the following reduction operators: +=, *=, max=, min=, &&=, ||=. These operators can be used in parallel loops to produce reduction assignments.

The program in Figure 1 consists of a single procedure (sssp) with four parameters: G is the graph to be analyzed, dist is a vertex property storing the distance from the source vertex, len is an edge property storing the edge length (or weight), and root is the source vertex. In the initialization step (lines 8-12) dist is set to +INF for all vertices except the root vertex, whose dist is set to zero. The root vertex is also marked by setting its updated property to true, which signifies that the vertex will be processed in the first iteration of the While loop (lines 14-27). Then the While loop is executed; it completes when fin is set to true, that is, when there are no more updated vertices in the graph. In each iteration of the While loop a parallel Foreach loop is executed (lines 16-22), which scans the vertices and, for each previously updated vertex (i.e. one whose updated property is set to true), checks the neighbour vertices by executing a relaxation operation: if dist[v] + weight(v,u) < dist[u] for an edge (v,u), then dist[u] is assigned dist[v] + weight(v,u) (lines 19-20). The number of iterations of the While loop is bounded by |V| − 1.

Therefore, Green-Marl is an imperative domain-specific programming language designed for parallel static graph analysis. Green-Marl significantly simplifies the development of parallel graph applications through its built-in specialized high-level abstractions as well as a diverse set of compiler optimizations. However, it still lacks translation to an effective programming platform for HPC systems with distributed memory.

''Figure 2: Main stages of the Green-Marl compiler''

===4 Charm++===

Charm++ [2, 3] is a parallel programming framework based on an object-oriented asynchronous message-driven execution model. Charm++ is designed and developed at the University of Illinois at Urbana-Champaign (Parallel Programming Laboratory, PPL). As a framework, Charm++ includes the following core components: a compiler used to translate the interfaces of the application's objects into C++ code, and a runtime system which drives the application execution process.
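The control structure of message-driven execution can be illustrated with a self-contained toy model (this is emphatically not the Charm++ API; the Scheduler and CounterChare types below are invented for the sketch, which only mimics how an entry-method call returns to the caller immediately and is executed later by a per-PE scheduler loop):

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Toy model of message-driven execution: an "entry method" invocation is
// enqueued as a deferred call, the caller continues immediately, and a
// scheduler loop drains the queue until quiescence (an empty queue).
struct Scheduler {
  std::queue<std::function<void()>> q;
  void send(std::function<void()> entry) { q.push(std::move(entry)); }
  void run() {  // drain to quiescence
    while (!q.empty()) {
      auto f = std::move(q.front());
      q.pop();
      f();
    }
  }
};

struct CounterChare {  // a "chare" with one entry method
  int value;
  Scheduler* sched;
  void inc(int by, int times) {  // entry method: asynchronously self-sends
    value += by;
    if (times > 1)
      sched->send([=] { inc(by, times - 1); });
  }
};
```

Because at most one deferred call runs at a time and each call touches only its own object's data, updates to the chare's state need no locking, which is the atomicity property discussed below.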
In Charm++ a program consists of a set of chares, implemented as C++ objects, that have an interface of predefined methods (entry methods) which can be used to transfer data between chares and to initiate asynchronous computations; that is, the calling thread proceeds with its execution immediately. The mapping of chares to processing elements (PEs) is done statically by the Charm++ runtime (by default) or by the application developer. Typically, a PE can be regarded as a CPU core (or an operating system process assigned to a CPU core). In addition to stand-alone chares, single-dimensional and multi-dimensional chare arrays are supported.

For addressing chares, special proxy objects are used that provide an abstraction of a global address space and enable the Charm++ runtime to manage the location and distribution of chares transparently to the application. This allows the runtime system to perform dynamic load balancing by migrating chares from more loaded nodes to less loaded ones. The migratability of chares is another base concept of the Charm++ programming model.

The following properties of the Charm++ model guarantee the atomicity of modifications of a chare's data: first, an entry method can only access data belonging to the chare it is called on, and, second, only a single entry method can be executed on a chare at a time. This significantly simplifies application development.

===5 Porting the Green-Marl Compiler to Charm++===

====5.1 Compiler Overview====

The Green-Marl compiler performs a source-to-source translation from Green-Marl to one of several equivalent representations: sequential C++ code, parallel code for shared memory systems in C++ with OpenMP pragmas, or parallel code for distributed memory systems in Java on top of Pregel [17] (namely, one of the two Java-based implementations of Pregel: GPS [18] or Giraph [19]), a vertex-centric programming model built on the Bulk Synchronous Parallel (BSP) execution model [20].

The general scheme of the Green-Marl compiler is shown in Figure 2. The main stages of translation are the following: lexical and syntax analysis (type checking, syntactic sugar expansion, etc.), platform-independent optimizations (such as loop splitting and merging, loop-invariant code motion, etc.), platform-specific optimizations, and final code generation.

The internal representation of a program in the Green-Marl compiler includes an abstract syntax tree (AST) and a control-flow graph (a finite state machine, FSM) whose nodes (extended basic blocks, EBBs) correspond to linear blocks of code in the translated program. Each EBB contains information about local variables, incoming and outgoing dependencies, code statements, etc.

The FSM states can be of two types: sequential (SEQ) and parallel (PAR). Sequential nodes correspond to code blocks in which the statements must be executed in strict order. Parallel nodes correspond to code blocks that are executed for each vertex of the graph in parallel (Foreach and For loops).

====5.2 Charm++ Code Generator====

The developed Charm++ code generator is based on the GPS generator used for translating Green-Marl programs to the Pregel execution model; a full description of the GPS generator can be found in [14]. However, the Charm++ programming model is more general than Pregel, which, first, makes it relatively easy to implement the Pregel model and, second, extends the possibilities of the Green-Marl translator, that is, increases the variety of Green-Marl programs that can be translated to Charm++.
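The SEQ/PAR state machine described above, which the generated master chare later walks state by state, can be sketched as follows (an illustrative sketch; StateKind, FsmState, and run_fsm are invented names, not the Green-Marl compiler's actual IR classes):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative sketch of the compiler's FSM view of a program. SEQ states
// run one sequential block; PAR states stand for a per-vertex Foreach
// whose completion must be awaited before the next state may start.
enum class StateKind { SEQ, PAR };

struct FsmState {
  StateKind kind;
  std::string label;  // e.g. the entry-method name it is mapped to
};

// Walks the states in order and records the schedule. In the generated
// Charm++ code a PAR state becomes a broadcast to all vertex chares
// followed by quiescence detection (modeled here as a plain step).
std::vector<std::string> run_fsm(const std::vector<FsmState>& states) {
  std::vector<std::string> trace;
  for (const auto& st : states) {
    trace.push_back((st.kind == StateKind::PAR ? "par:" : "seq:") + st.label);
  }
  return trace;
}
```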
At the same time, to prove the concept of translating Green-Marl programs to Charm++ it was sufficient to implement the last stage of the compiler, i.e. code generation from the internal program representation (IR). Below, the translation methods are described for the main Green-Marl constructs. As a running example we use a program that calculates the mean age of all registered users of a social network whose age is less than K. The source code of the Green-Marl program and the generated Charm++ code are shown in Figure 3.

<pre>
 1  Procedure count(G: Graph, age: N_P<Int>, K: Int)
 2  {
 3    Int S = 0;
 4    Int C = 0;
 5    Foreach (n : G.Nodes) {
 6      If (n.age < K) {
 7        S += n.age;
 8        C += 1;
 9      }
10    }
11    Float val = (C == 0) ? 0 : S / (Float) C;
12  }
</pre>

<pre>
 1  module count {
 2    message __ep_state1_msg;
 3    readonly CProxy_count_master master_proxy;
 4    readonly CProxy_count_vertex vertex_proxy;
 5    chare count_master {
 6      entry count_master(const CkCallback & cb);
 7      entry void do_count(int K);
 8      entry [reductiontarget] void __reduction_S(int S);
 9      entry [reductiontarget] void __reduction_C(int C);
10      entry void __ep_state0();
11      entry void __ep_state1();
12      entry void __ep_state2();
13    };  // count_master
14    array [1D] count_vertex {
15      entry count_vertex();
16      entry void __ep_state1(__ep_state1_msg *msg);
17      entry void add_edge(const count_edge &e);
18    };  // count_vertex
19  };  // count
</pre>

<pre>
 1  class count_master : public CBase_count_master {
 2  public:
 3    void __reduction_S(int S) { this->S = S; }
 4    void __reduction_C(int C) { this->C = C; }
 5    void __ep_state0() {
 6      S = 0;
 7      C = 0;
 8      thisProxy.__ep_state1();
 9    }
10    void __ep_state1() {
11      __ep_state1_msg *_msg = new __ep_state1_msg();
12      _msg->K = K;
13      vertex_proxy.__ep_state1(_msg);
14      CkStartQD(CkIndex_count_master::__ep_state2(),
15                &thishandle);
16    }
17    void __ep_state2() {
18      val = (C == 0) ? 0 : (S / (float) C);
19      done_callback.send();
20    }
21  private:
22    CkCallback done_callback;
23    int S, C, K;
24    float val;
25  };  // count_master
26  class count_vertex : public CBase_count_vertex {
27  public:
28    struct vertex_properties {
29      int age;
30    };
31  public:
32    void __ep_state1(__ep_state1_msg *msg) {
33      int K = msg->K;
34      delete msg;
35      int S, C;
36      if (this->props.age < K) {
37        S = this->props.age;
38        contribute(sizeof(int), &S,
39                   CkReduction::sum_int,
40                   CkCallback(CkReductionTarget(
41                     count_master, __reduction_S), master_proxy));
42        C = 1;
43        contribute(sizeof(int), &C,
44                   CkReduction::sum_int,
45                   CkCallback(CkReductionTarget(
46                     count_master, __reduction_C), master_proxy));
47      }
48    }
49  private:
50    std::list<count_edge> edges;
51    struct vertex_properties props;
52  };  // count_vertex
</pre>

''Figure 3: Example of a Green-Marl program (a) and its generated code in Charm++: the interface file (b) and the C++ classes (c)''

'''Graph representation in the generated code.''' For the graph representation in the generated Charm++ code the simplest and most natural approach has been chosen: the vertices of the graph are mapped to chares (i.e. the elements of a chare array), which are defined in the code by the name_vertex class, where name is the name of the translated Green-Marl procedure. The vertex_properties data structure is used for storing vertex attributes, while edges stores the edge list of the vertex; each edge is a pair of a neighbour identifier and its attributes (edge properties). Since a one-dimensional chare array (array [1D]) is used for storing the graph vertices, the vertices are distributed in contiguous blocks over the parallel processes (or cores) by the Charm++ runtime system.

In the example (see Figure 3) the vertices are represented by the count_vertex class (Figure 3 (b), lines 14–18, and Figure 3 (c), lines 26–52). The vertices have a single attribute age, which is stored in the vertex_properties class. Since no edge attributes are defined in the Green-Marl program, edge_properties does not have any fields.

'''FSM construction.''' As already mentioned, after the control-flow analysis the Green-Marl compiler creates a finite state machine (FSM). For managing the execution of the FSM a special chare (name_master) is created, and each state of the FSM is mapped to one of its entry methods: __ep_state_i, where i is the number of the FSM state. The program starts with the execution of __ep_state0; then the states are switched according to the FSM. When a PAR state is executed, an appropriate call is initiated on all chares of the name_vertex class (i.e. the array elements). The master (name_master) then waits until all computations are done (by using quiescence detection), after which the entry method corresponding to the next FSM state is executed. The process continues until the terminal (final) state is reached.

In the example the FSM consists of three states (see Figure 4), which correspond to the following entry methods of count_master: __ep_state0, __ep_state1 and __ep_state2. Since state 1 is parallel, it has a corresponding entry method in the count_vertex class (__ep_state1). In the terminal state (state 2) the callback (done_callback) is triggered to move control flow to the boilerplate code (the code of the Green-Marl program loader).
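What the generated count module computes can be checked against a sequential C++ rendering of the Green-Marl procedure of Figure 3 (a) (a sketch; the plain loop with S and C stands in for the two CkReduction::sum_int contribute calls of the generated code):

```cpp
#include <cassert>
#include <vector>

// Sequential rendering of the Green-Marl `count` procedure: S and C
// accumulate the values that the generated code obtains via two sum
// reductions over all vertex chares, and the terminal state computes val.
float mean_age_under(const std::vector<int>& ages, int K) {
  int S = 0, C = 0;
  for (int age : ages) {  // Foreach(n: G.Nodes) with += reduction assignments
    if (age < K) {
      S += age;
      C += 1;
    }
  }
  return (C == 0) ? 0.0f : (float)S / (float)C;  // guard against C == 0
}
```

Each vertex chare contributes its (S, C) pair only when its age passes the filter, so the reduction arrives at the master with exactly the sums this loop produces.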
''Figure 4: Fragment of a Green-Marl program and its control-flow graph''

'''Global variables and reduction.''' In Green-Marl, variables can be of two kinds: global and local (per vertex). Global variables are declared in the sequential code blocks and can be accessed from the parallel code blocks. All global variables are stored in the name_master class as its member variables. In case of a read access to a global variable from a Foreach loop, its value is passed as a parameter of the corresponding entry method to all vertices (name_vertex). In case of a write access, the corresponding entry method of name_master is called; however, when write accesses to the same global variable are performed from multiple vertices, race conditions occur and the behaviour is undefined.

The Green-Marl compiler uses the contribute operation supported by Charm++ to implement the reductions that can be used in Green-Marl programs. Charm++ supports several reduction operations that can be applied to chare array elements; the mechanism assumes that a specified entry method (of a specified chare) is called when the reduction completes. This method has to be marked with the reductiontarget keyword and has a single parameter which receives the result of the reduction.

In the example (see Figure 3) three global variables are used (S, C, K). The K value is used inside the Foreach loop, therefore __ep_state1 in count_vertex receives K as one of the parameters of the __ep_state1_msg structure, which is a Charm++ message (message). S and C are used for the reduction results (Figure 3 (a), lines 7, 8). In the generated code there are two methods in the count_master class, __reduction_S (line 3) and __reduction_C (line 4), which are used for passing the reduction results to the S and C variables.

In conclusion of this section, we can say that for implementing Charm++ support in the Green-Marl compiler it was sufficient to add a new code generation module to the existing backend chain for GPS code generation. However, as has been mentioned, Charm++ has significantly more flexibility and generality than Pregel, and therefore provides wider possibilities to the compiler as well as to the domain-specific language itself. Using these possibilities is a topic for further research.

===6 Performance Evaluation===

For benchmarking, a 24-node segment of the 32-node Angara-K1 cluster has been used, with dual 6-core Intel Xeon E5-2630 processors and 64 GB of memory in each node. The nodes are connected by a custom 4D-torus Angara interconnect [21]. Eight CPU cores have been used per node in the test runs to keep the total number of processes equal to a power of two; this may not be necessary in the general case.

The benchmarks used for the performance evaluation are Single-Source Shortest Path (SSSP), Connected Components (CC), and PageRank. We compared the performance of the Green-Marl programs to their hand-written Charm++ counterparts (the reference implementations).

We used two types of synthetic graphs: RMAT [22] and Random. RMAT graphs are designed to simulate many real-world graphs, which are characterized by a power-law degree distribution (for example, social networks). For RMAT graphs we used the generator from the Graph500 benchmark. Random graphs have a uniform random distribution of edges over the graph vertices.

The performance results are shown in Figure 5. In all plots the graph size is 2^22 vertices. For SSSP and PageRank directed graphs have been used, while for CC undirected graphs have been used.

For SSSP (as well as for CC) two different reference implementations in Charm++ have been used. The first, sssp-async (cc-async), implements a fully asynchronous algorithm which employs a label-updating approach. For SSSP the updated label is the distance from the root vertex (dist) in each vertex, while for CC it is the identifier of the connected component. The computation continues while any update is possible; when there are no more updates in the graph the algorithm finishes, which is controlled by the quiescence detection mechanism supported in Charm++. The other reference implementation, sssp-adapt (cc-adapt), is similar to the first except that a global synchronization is used to control label propagation (the new labels are sent only to the neighbour vertices of the front, and then a global synchronization is performed). In the Green-Marl implementations of the tests this is achieved by the outer While loop.

As can be seen from Figure 5, for SSSP and CC the performance of the Green-Marl implementations is very close to the performance of the reference implementations with the partially synchronized algorithm (sssp-adapt for SSSP and cc-adapt for CC). For CC the adapted reference implementation shows even better performance than the asynchronous one. This may look unexpected, but the following reasons for such behaviour are possible: first, while the partially synchronized algorithms (sssp-adapt and cc-adapt) restrict the amount of available parallelism, the number of active messages sent during the execution of the programs is also significantly restricted, resulting in less overhead from the runtime system and lower memory consumption; second, the fully asynchronous implementations (sssp-async and cc-async) generate many speculative computations (speculative wavefronts), for example propagating a distance which is not minimal for SSSP, which results in a large amount of unnecessary computation and degraded performance. In the partially synchronized implementations the amount of speculative computation is significantly smaller.

The evaluation of PageRank has not revealed any significant difference in performance between the reference and Green-Marl versions. This is primarily explained by the fact that in both cases the algorithms are very close (almost the same).
===7 Conclusion and Future Work===

In this paper the Charm++ code generator developed for the compiler of the domain-specific language Green-Marl has been presented. Thereby, the Green-Marl DSL is extended to HPC clusters with distributed memory: Charm++ is added to the already supported OpenMP, GPS, and Giraph target platforms.

The performance evaluation shows that Green-Marl programs compiled to Charm++ have the same performance as native Charm++ programs, assuming that the same algorithms are used in the generated code and the reference implementation. In the future we plan to add support for the Topological Routing and Aggregation Module (TRAM) [23] to the Charm++ code generator implemented in the Green-Marl compiler.

===Acknowledgment===

This work is partially supported by the Russian Foundation for Basic Research (RFBR) under Contract 15-07-09368.

===References===

[1] A. Lumsdaine, D. P. Gregor, B. Hendrickson, and J. W. Berry, "Challenges in parallel graph processing," Parallel Processing Letters, vol. 17, no. 1, pp. 5–20, 2007. http://dblp.uni-trier.de/rec/bib/journals/ppl/LumsdaineGHB07 (accessed: 10/23/2017).

[2] L. V. Kale and S. Krishnan, "Charm++: A portable concurrent object oriented system based on C++," SIGPLAN Notices, vol. 28, pp. 91–108, Oct. 1993. http://doi.acm.org/10.1145/167962.165874 (accessed: 10/23/2017).

[3] G. Zheng, E. Meneses, A. Bhatele, and L. V. Kale, "Hierarchical load balancing for Charm++ applications on large supercomputers," in 2010 39th International Conference on Parallel Processing Workshops, pp. 436–444, IEEE, 2010. https://charm.cs.illinois.edu/newPapers/10-08/paper.pdf (accessed: 10/23/2017).

[4] J. J. Willcock, T. Hoefler, N. G. Edmonds, and A. Lumsdaine, "Active Pebbles: parallel programming for data-driven applications," in Proceedings of the International Conference on Supercomputing, pp. 235–244, ACM, 2011. http://doi.acm.org/10.1145/1995896.1995934 (accessed: 10/23/2017).

[5] J. J. Willcock, T. Hoefler, N. G. Edmonds, and A. Lumsdaine, "Active Pebbles: a programming model for highly parallel fine-grained data-driven computations," in ACM SIGPLAN Notices, vol. 46, pp. 305–306, ACM, 2011. http://doi.acm.org/10.1145/1941553.1941601 (accessed: 10/23/2017).

[6] H. Kaiser, T. Heller, B. Adelstein-Lelbach, A. Serio, and D. Fey, "HPX: A task based programming model in a global address space," in Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, p. 6, ACM, 2014. http://doi.acm.org/10.1145/2676870.2676883 (accessed: 10/23/2017).

[7] D. Callahan, B. L. Chamberlain, and H. P. Zima, "The Cascade high productivity language," in Ninth International Workshop on High-Level Parallel Programming Models and Supportive Environments, pp. 52–60, IEEE, 2004. http://ieeexplore.ieee.org/document/1299190/ (accessed: 10/23/2017).

[8] B. L. Chamberlain, D. Callahan, and H. P. Zima, "Parallel programmability and the Chapel language," International Journal of High Performance Computing Applications, vol. 21, no. 3, pp. 291–312, 2007. http://dx.doi.org/10.1177/1094342007078442 (accessed: 10/23/2017).

[9] R. Haque and D. Richards, "Optimizing PGAS overhead in a multi-locale Chapel implementation of CoMD," in Proceedings of the First Workshop on PGAS Applications, pp. 25–32, IEEE Press, 2016. https://doi.org/10.1109/PAW.2016.9 (accessed: 10/23/2017).

[10] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. Von Praun, and V. Sarkar, "X10: an object-oriented approach to non-uniform cluster computing," in ACM SIGPLAN Notices, vol. 40, pp. 519–538, ACM, 2005. http://doi.acm.org/10.1145/1103845.1094852 (accessed: 10/23/2017).

[11] O. Tardieu, B. Herta, D. Cunningham, D. Grove, P. Kambadur, V. Saraswat, A. Shinnar, M. Takeuchi, M. Vaziri, and W. Zhang, "X10 and APGAS at petascale," ACM Transactions on Parallel Computing, vol. 2, no. 4, p. 25, 2016. http://doi.acm.org/10.1145/2894746 (accessed: 10/23/2017).

[12] S. Hong, H. Chafi, E. Sedlar, and K. Olukotun, "Green-Marl: a DSL for easy and efficient graph analysis," in ACM SIGARCH Computer Architecture News, vol. 40, pp. 349–362, ACM, 2012. http://doi.acm.org/10.1145/2189750.2151013 (accessed: 10/23/2017).

[13] S. Hong, J. Van Der Lugt, A. Welc, R. Raman, and H. Chafi, "Early experiences in using a domain-specific language for large-scale graph analysis," in First International Workshop on Graph Data Management Experiences and Systems, p. 5, ACM, 2013. http://doi.acm.org/10.1145/2484425.2484430 (accessed: 10/23/2017).

[14] S. Hong, S. Salihoglu, J. Widom, and K. Olukotun, "Simplifying scalable graph processing with a domain-specific language," in Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization, p. 208, ACM, 2014. http://doi.acm.org/10.1145/2544137.2544162 (accessed: 10/23/2017).

[15] D. Prountzos, R. Manevich, and K. Pingali, "Elixir: A system for synthesizing concurrent graph programs," ACM SIGPLAN Notices, vol. 47, no. 10, pp. 375–394, 2012. http://doi.acm.org/10.1145/2398857.2384644 (accessed: 10/23/2017).

[16] R. Nasre, Y. Srikant, et al., "Falcon: a graph manipulation language for heterogeneous systems," ACM Transactions on Architecture and Code Optimization (TACO), vol. 12, no. 4, p. 54, 2016. http://doi.acm.org/10.1145/2842618 (accessed: 10/23/2017).

[17] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 135–146, ACM, 2010. http://doi.acm.org/10.1145/1807167.1807184 (accessed: 10/23/2017).

[18] S. Salihoglu and J. Widom, "GPS: a graph processing system," in Proceedings of the 25th International Conference on Scientific and Statistical Database Management, p. 22, ACM, 2013. http://doi.acm.org/10.1145/2484838.2484843 (accessed: 10/23/2017).

[19] S. Schelter, "Large scale graph processing with Apache Giraph," invited talk at GameDuell Berlin, 29 May 2012.

[20] T. Cheatham, A. Fahmy, D. Stefanescu, and L. Valiant, "Bulk synchronous parallel computing: a paradigm for transportable software," in Tools and Environments for Parallel and Distributed Systems, pp. 61–76, Springer, 1996.

[21] A. Agarkov, T. Ismagilov, D. Makagon, A. Semenov, and A. Simonov, "Performance evaluation of the Angara interconnect," in Proceedings of the International Conference Russian Supercomputing Days, pp. 626–639, 2016. http://www.dislab.org/docs/rsd2016-angara-bench.pdf (accessed: 11/10/2017).

[22] D. Chakrabarti, Y. Zhan, and C. Faloutsos, "R-MAT: A recursive model for graph mining," in SDM, vol. 4, pp. 442–446, SIAM, 2004. https://faculty.mccombs.utexas.edu/deepayan.chakrabarti/mywww/papers/siam04.pdf (accessed: 10/23/2017).

[23] L. Wesolowski, R. Venkataraman, A. Gupta, J.-S. Yeom, K. Bisset, Y. Sun, P. Jetley, T. R. Quinn, and L. V. Kale, "TRAM: Optimizing fine-grained communication with topological routing and aggregation of messages," in Proceedings of the International Conference on Parallel Processing, ICPP '14, Minneapolis, MN, September 2014. http://charm.cs.illinois.edu/newPapers/14-18/paper.pdf (accessed: 10/23/2017).

''Figure 5: Performance results of SSSP, CC, and PageRank: execution time in seconds versus the number of nodes (1 to 16) on RMAT-22 graphs (panels a, b, c) and Random-22 graphs (panels d, e, f) for the Green-Marl implementations and the Charm++ reference implementations (sssp-async/sssp-adapt, cc-async/cc-adapt, pagerank)''