Counting Triangles in Parallel An Application of Pipelining Julián Aráoz1,2 and Cristina Zoltan1,2 1 UNICyT, Panamá 2 Universitat Politècnica de Catalunya, Espanya Abstract. The usual approach to producing a parallel solution to a computational problem is to find a way to use the Divide & Conquer paradigm in order to have processors acting on their own data so they can all be scheduled in parallel. MapReduce is an example of this approach. We present an alternative program schema that can exploit dynamic pipeline parallelism without having to deal with replication factors. We present the schema through an example: counting triangles in graphs. Keywords: Pipelining, triangle-counting, parallel algorithms, Data Streams 1 Introduction With the emergence of commodity multicore architectures, exploiting tightly- coupled parallelism has become increasingly important. Most parallelization ef- forts are addressed to applications that compute with large amounts of data in memory and in general have a regular behavior. Frameworks that implement MapReduce are a great success. This may be in part because having a framework helps in testing the goodness of the solution, but also because implementations deal with errors and because the programmer needn’t be concerned with deployment in parallel architectures as the implemen- tation takes care of this. We present an alternative program schema that can exploit dynamic pipeline parallelism without having to deal with replication factors. Applications using a sequence of processes working in parallel without requiring large amounts of local memory are suitable for both multicore and distributed architectures. We present the schema through an example: counting triangles in graphs. Counting triangles, having as input an unordered sequence of edges, is only one example of the type of problem solvable with this algorithmic schema. Other graph problems include: Elimination of duplicate edges, finding connected com- ponents, and finding small circuits, among others. This is an extended abstract of our paper [1] in which can be found a small example of our algorithm using a NiMo implementation together with a formal description and proof of this type of algorithm. 2 Approaches to solving the problem Various approaches to parallelizing algorithms for exact triangle counting are found in the literature. [7] uses the MapReduce framework and [3] the hypercube architecture. Both approaches use hashing functions on the vertex names to partition the graph. Both have relatively high replication rate. A divide and conquer approach is given in [2], using the BSP model to syn- chronize the parallel workers and MPI to implement the algorithm. From the input graph we construct a collection of subgraphs, of size equal to the number of processors, such that triangles can be counted by counting on each subgraph and accumulating the results. The improvement over [7] is due to the fact that the latter generates a huge volume of intermediate data, consisting of all possible 2-paths centered at each node. Our improvement over [2] is that we do not collect in a single machine all nodes adjacent to a core (also responsible or dominating) node, but only those that have not been collected by some other responsible node. Furthermore, rather than doing preconditioning in order to balance the workload, our dynamic scheduler is able to achieve this based on the size of a set of adjacent nodes to each responsible node. In [4] a sequential algorithm is presented where a dominating set of nodes is constructed for dividing the graph into subgraphs. The constructed set is different from ours, because it is based on the adjacency presentation of the input graph. In [4] the use of the subgraphs results in a sequential iterative algorithm. 3 Streaming Model of Computation In [5] we find a good survey of algorithms based on the streaming model of computation. We use the semi-streaming model of computation, where the input graph, G = (V, E), is presented as a stream of edges (in any order), and the storage space of an algorithm is bounded by Θ(n polylog n). We are particularly interested in algorithms that use only one pass over the input, but for problems where this is probably insufficient, we also look at algorithms using constant, or in some cases, logarithmically many passes. One measure of amorphous data-parallelism is the number of active nodes (fireable processes) that can be processed in parallel at each step of the algorithm for a given input, assuming that: – There is an unbounded number of processors, – An activity takes one time step to execute, – The system has perfect knowledge of neighborhood and ordering constraints so that it only executes activities that can complete successfully, and – A maximal set of activities, subject to neighborhood and ordering con- straints, is executed at each step. This is called the available parallelism at each step. A function plot showing the available parallelism at each step of execution of an irregular algorithm for a given input is called a parallelism profile. This gives an upper bound on obtainable parallelism for the given solution. 4 The Algorithm 4.1 Notation The input graph is denoted by G = (V, E), where V and E are the sets of vertices (nodes) and edges, respectively, with m = |E| edges and n = |V | vertices labeled using labels that can be compared for equality. We use the words node and vertex interchangeably. We assume that the input graph is undirected, without loops or multiple edges 3 . If (u, v) ∈ E, we say u and v are adjacent to each other. The set of all nodes adjacent to v ∈ V is denoted by Nv , i.e., Nv = {u ∈ V |(u, v) ∈ E}. A triangle is a set of three nodes u, v, w ∈ V such that there is an edge between each pair of these three nodes, i.e., (u, v), (v, w), (w, u) ∈ E. The number of triangles containing node v (in other words, triangles incident on v) is denoted by Tv . Notice that the number of triangles containing node v is the same as the number of edges among the set of nodes adjacent to v, i.e., Tv = |{(u, w) ∈ E|u, w ∈ Nv }|. 4.2 Informal Algorithm presentation We assume the input is a large graph given by an unordered enumeration of its edges. We assume a non-oriented graph with no duplicated edges. We use a two-round schema, first of all partitioning the graph building length two paths, but only one of the three corresponding to a triangle, then identifying and counting the triangles. We do not need to use a special data structure. In [7] uses also the. The main difference is the way the possible two path are identified and the program structure, also it compute all two paths. But use the same principle of a single node being responsible for making sure the triangle gets counted. In [6] this is obtained from knowledge of the degree of each node. Dealing with graphs large enough to not fit in memory, this additional knowledge requires an further traversal of the edges of the graph which is not needed in our approach. The algorithm we present is implemented in NiMo as a sequence of actors (processes), communicating using three unbounded channels. Processes change their role (mutate their behavior) when enough knowledge has been collected. Initially there is a single actor. An actor’s first role is to acquire an edge and become a process that is “responsible” for this first node on the edge. The responsible actor receiving an edge will collect in its memory the node if the edge is adjacent to its responsible node. If the received edge is not adjacent to the responsible node, the actor passes the edge to the next actor in the sequence. If there is no next actor, a newly created actor is added to the sequence that will process the edge. 3 Oriented edges are a special case of the one treated here. When there are no more edges in the first input, each actor changes role, once again receiving the sequence of edges using the second input. The new role is to count a triangle whenever the incoming edge forms a triangle with the “responsible ” node and two adjacent ones i.e. both ends of the edge are adjacent to the “responsible”. Whenever a process has received all the edges, it reads the third input, that carries the total number of triangles identified so far by its provider. It adds its own count, and passes its triangle count to its neighbor and dies. The algorithm ends with no live processes and a single result: the total number of triangles. The I/O complexity is O(m), The total memory needed is m and the internal memory of each processor is bounded for the maximal degree of the nodes which is bound by n. The number of processes created is the number of responsible nodes and is bounded by n − 1. 5 Concluding Remarks We have used the triangle counting problem on a graph described as a sequence of edges as an example of a dynamic pipeline solution of problems. In the example we see that a large amount of processors can be exploited, but the algorithm is correct even in single processor semantics using a dynamic scheduler. The solution can be regarded as a bucket-sorting method that separates the input set into disjoint classes, but in our case the classes are created on the fly. Processes in the pipeline change behavior as soon as enough data has been consumed. In this approach data input is shared with computation, without the barriers present in the BSP model of computation. References 1. Julián Aráoz and Cristina Zoltan. Parallel triangles counting using pipelining. http: //arxiv.org/pdf/1510.03354.pdf., 2015. 2. Shaikh Arifuzzaman, Maleq Khan, and Madhav Marathe. Patric: A parallel al- gorithm for counting triangles in massive networks. In Proceedings of the 22Nd ACM International Conference on Conference on Information & Knowledge Management, CIKM ’13, pages 529–538, New York, NY, USA, 2013. ACM. 3. Shumo Chu, Magdalena Balazinska, and Dan Suciu. From theory to practice: Effi- cient join query evaluation in a parallel database system. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Vic- toria, Australia, May 31 - June 4, 2015, pages 63–78, 2015. 4. Shumo Chu and James Cheng. Triangle listing in massive networks. TKDD, 6(4):17, 2012. 5. Andrew McGregor. Graph stream algorithms: A survey. SIGMOD Rec., 43(1):9–20, May 2014. 6. Thomas Schank. Algorithmic Aspects of Triangle-Based Network Analysis. PhD thesis, Universität Karlsruhe, 2007. 7. Siddharth Suri and Sergei Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pages 607–614, New York, NY, USA, 2011. ACM.