Introduction

Triangle Finding: How Graph Theory can Help the Semantic Web

Edward Jimenez

Eric L. Goodman

0 0 Sandia National Laboratories , Albuquerque, NM , USA

45 58

RDF data can be thought of as a graph where the subject and objects are vertices and the predicates joining them are edge attributes. Despite decades of research in graph theory, very little of this work has been applied to RDF data sets and it has been largely ignored by the Semantic Web research community. We present a case study of triangle finding, where existing algorithms from graph theory provide excellent complexity bounds, growing at a significantly slower rate than algorithms used within existing RDF triple stores. In order to scale to large volumes of data, the Semantic Web community should look to the many existing graph algorithms.

Introduction Formalisms

RDF data can be modeled as a graph, G = (V, E) where V is a set of vertices and E is a set of edges connecting the vertices v ∈ V . We use n to refer to the number of vertices, |V |, and m as the number of edges, |E|. For RDF, the subjects and objects are vertices in V . The predicates can be thought of as attributes associated with each edge. We enumerate the vertices so that each has a unique id in [1, n]. Also, we enumerate the edges of G to be in [1, m]. The notation ei refers to the ith edge under the enumeration. Since RDF is described directionally, i.e. there is a subject uni-directionally related to an object, edges in the graph are also directed. We will use source(ei) to refer to the source or subject of the edge. target(ei) refers to the target or object of the edge. We call two vertices v1, v2 ∈ V adjacent if there is an edge e ∈ E where (source(e) = v1 ∧ target(e) = v2) ∨ (source(e) = v2 ∧ target(e) = v1). We use δ(vi) to denote the total degree, both incoming and outgoing, of a vertex. Below we formally define the notion of a triangle.

Definition 1. A triangle in a graph G is a set of three edges, ei, ej , ek ∈ E such that the set of vertices vˆ = { |

v v = source(ei) ∨ target(ei) ∨ v = source(ej ) ∨ v = target(ej ) ∨ v = source(ek) ∨ v = target(ek)} have the following properties: 1. |vˆ| = 3 2. All v ∈ vˆ are adjacent to one another.

We also refer to triads, which we define as pair of edges with a common vertex.

It is also useful to define the notion of triangle equality.

Definition 2. Two triangles are considered equal if their edge sets are the same.

While triangle equality may seem obvious, it is important point to consider when generating result sets via SPARQL. Without care, duplicate triangles can be generated that are not prunable with the DISTINCT keyword. Duplicate triangles can be produced that differ in the order in which the nodes and edge labels are presented (i.e. to which variables they are bound), and thus DISTINCT will have no effect on these duplicates. More of this will be discussed in Section 4. 3

Applicability to Current Benchmarks

An obvious question when optimizing for triangles in SPARQL queries is how often triangles occur in practice. We examine two popular benchmarks, LUBM [ 11 ] and SP2Bench [ 15 ]. Two out of the 14 queries in LUBM have triangles in them, namely queries 2 and 9. For brevity and to emphasize the triangle portion of the queries, we omit the prefixes and the type constraints on each of the variables.

Query 2 SELECT ?X, ?Y, ?Z WHERE {?X ub:memberOf ?Z . ?Z ub:subOrganizationOf ?Y . ?X ub:undergraduateDegreeFrom ?Y} Query 9 SELECT ?X, ?Y, ?Z WHERE {?X ub:advisor ?Y . ?Y ub:teacherOf ?Z . ?X ub:takesCourse ?Z}

These two queries are also some of the more complicated and time intensive ones as reported by various vendors and researchers (e.g. AllegroGraph3 OWLIM4), pointing to the need for efficient processing.

SP2Bench does not have any queries with triangles in them, but there are several queries that with a natual extension of one more constaint form a triangle. For example, Query 4 requests all distinct pairs of article author names for authors that have published in the same journal. A natural extension is to define a constraint between the two authors, either directly, forming a triangle, or to another common vertex. The latter case forms what is called a quadrangle, and the algorithmic approach is similar and has the same complexity as finding triangles [ 5 ]. In conclusion, there appears to be a small but significant portion of queries that contain triangle structures within them. 4

Finding Triangles with SPARQL

Expressing a SPARQL query to find all triangles in a graph is surprisingly convoluted. Figure 1 is a query that finds all unique triangles in an RDF graph, but perhaps more importantly it finds the triangles with no duplication, meaning that no two solution triangles in the result set are equal. We will refer to this query as the Triangle-finding SPARQL query. Note that this query only works on data involving only IRIs as the subjects and objects. According to the standard the STR function is not defined for blank nodes and the function also removes typing and language modifiers on literals which may cause the comparison filter to be incorrect.

Before we discuss how this query finds the triangles without duplication, we must first discuss two concepts, that of graph isomorphism and graph automorphism. Two graphs G and H are isomorphic if there is a bijection, f , mapping the vertex sets of G and H such that two vertices vi and vj are adjacent in G if and only if f (vi) and f (vj) are also adjacent in H. A graph automorphism is 3 http://www.franz.com/agraph/allegrograph/agraph_bench_lubm.lhtml 4 http://www.ontotext.com/owlim/benchmark-results/lubm

SELECT ?X ?Y ?Z WHERE { { ?X ?a ?Y .

?Y ?b ?Z . ?Z ?c ?X FILTER (STR(?X) < STR(?Y))

FILTER (STR(?Y) < STR(?Z)) } UNION { ?X ?a ?Y . ?Y ?b ?Z . ?Z ?c ?X FILTER (STR(?Y) > STR(?Z))

FILTER (STR(?Z) > STR(?X)) } UNION { ?X ?a ?Y . ?Y ?b ?Z .

?X ?c ?Z }

} when there is a isomorphism of a graph G onto itself, and generally we will be concerned with non-identity mappings.

The reason this is important can be understood from the query below:

SELECT ?X ?Y ?Z WHERE { { ?X ?a ?Y . ?Y ?b ?Z . ?Z ?c ?X }

For this query, every triangle found will appear three times. For instance, a triangle with solution s1, s2, s3 will also appear as s2, s3, s1 and s3, s1, s2 . The reason for this is that each of the vertices s1, s2, and s3 each bind to each of the three variables ?X, ?Y, and ?Z under different circumstances. All three solutions satisfy the query, but each solution is the same triangle. DISTINCT does not help because the bindings are different for each duplicate solution. The problem arises because the query graph represented above is automorphic. Each non-trivial automorphism is a mapping to translate from one solution to another solution using the same set of edges.

Thus we arrive at the convoluted nature of the query in Figure 1. When constructing the SPARQL query, we need to account for all possible triangles but not generate duplicates. There are eight types of triangles, shown in Figure 2. However, all eight of these types can be collapsed down to the three unioned clauses in Figure 1. We present this formally as a proof.

Theorem 1. The Triangle-finding SPARQL query finds all triangles in the queried graph and each solution is unique.

Proof. First consider that triangle type iii is isomorphic to types iv through viii. The below table outlines the functions needed to show the isomorphisms. As they are isomorphic, type iii is sufficient to represent all the other types iv viii. Also, type iii is not automorphic, and thus will not produce any duplicate triangles. Finally, the solutions resulting from iii are disjoint from i and ii as there is not mapping from iii to either i or ii. Thus we can deal with the solution sets separately.

Mapping ?X ?Y ?Z iii to ...

iv ?X ?Z ?Y v ?Y ?X ?Z vi ?Z ?Y ?X vii ?Z ?X ?Y vii ?Y ?Z ?X

Concerning types i and ii, they are isomorphic under the mapping f (?X) = ?Z, f (?Y ) =?Y , and f (?Z) =?X. Thus, we need only include one of the patterns in the query. We arbitrarily select type i. However, type i is automorphic (as is ii ), and we must concoct a way of avoiding duplicate triples with the available SPARQL language features. We solve the issue by enforcing an ordering on the bindings. The first clause of the Triangle-finding SPARQL query enforces that the string representation of the bindings must obey ?X <?Y <?Z under an alphanumeric ordering. The second clause enforces ?Y >?Z >?X. It remains to show that these two orderings will find all triangles of type i without duplication. The table below outlines all possible relations between the variables assuming no self loops. ?X ? ?Y ?Y ? ?Z ?Z ? ?X < < < < < < > > > > < > > < < > > > < > < > < >

Empty since ?X <?Y ∧?Y <?Z =⇒ ?X <?Z, contradicting the third constraint.

Fulfilled directly by first clause Use automorphism f (?X) =?Y, f (?Y ) =?Z, f (?Z) =?X.

Fulfilled by first clause.

Fulfilled directly by second clause.

Use automorphism f (?X) =?Z, f (?Y ) =?X, f (?Z) =?Y .

Fulfilled by first clause.

Use automorphism f (?X?) =?Z, f (?Y ) =?X, f (?Z) =?Y ).

Fulfilled by second clause.

Use automorphism f (?X) =?Y, f (?Y ) =?Z, f (?Z) =?X.

Fulfilled by second clause.

Empty since ?X >?Y ∧?Y >?Z =⇒ ?X >?Z, contradicting the third constraint.

Two of the possibilities are invalid because the constraints are contradictory. Another two possibilities directly match the first two clauses of the trianglefinding query. The remaining four cases match the two clauses through an automorphism. Thus, we may conclude that the Triangle-finding SPARQL query does in fact find all triangles in the graph with no duplicates. 5

An O(m3/2) Triangle Finding Algorithm There are many triangle finding algorithms that are O(m3/2). For the experiments we employ an algorithm presented by Cohen [ 6 ]. This algorithm has the benefit of already being described in a parallel fashion in terms of mappers and reduces of the MapReduce paradigm [ 8 ]. Also, there is a version implemented in the MultiThreaded Graph Library5 (MTGL) [ 2 ]. However, a formal complexity analysis was not outlined in [ 6 ], so we perform that here. 5 https://software.sandia.gov/trac/mtgl

Cohen’s algorithm operates on what he calls a simplified graph. Namely, a graph in which self-loops are eliminated, directionality is ignored, and there are no duplicate edges. In our later experiments we do not allow self-loops, but we account for directionality and do allow duplicate edges. This is due to the fact that RDF is directional and multiple edges can be defined between vertices with different edge types (predicate types). We do not want to collapse all of these edge types down into one edge. We do not formally account for these different assumptions in our analysis.

Theorem 2. Given a simplified graph G, Cohen’s triangle finding algorithm is O(m3/2).

Proof. We assume the worst case, that G is completely connected. Relating m to n, we have m =

i = n−1 i=1 (n − 1)(n) 2 (1)

Cohen’s algorithm is composed of two MapReduce phases. The input to the first map is a list of edges. Each edge has been previously been augmented with the degree of each vertex. If one were to include this preprocessing step in the overall complexity, it is O(m) which is also in O(m3/2).

Map 1 Map each edge to its low-degree vertex. According to our assumptions, δ(vx) = δ(vy) ∀x, y ≤ n. Cohen suggests a tie-breaker based on vertex ordering; we’ll use v1 < v2 < · · · < vn. Below is the composition of the bins. Since directionality is ignored, we’ll use a canonical representation of each edge, vi, vj , such that i < j.

Bin 1

Reduce 1 Emit a record for each pair of edges in a bin (one for every open triad). For the graph G, the first Map phase created n−1 bins, and bin i contains n − i edges. Therefore the number of triads created is n−2 i=1 n − i = 2 n−2

(n − i)! i=1 2(n − (i + 2))! 1 n−2 = < = = = < 2 Hence, the first MapReduce task has complexity O(m3/2).

For the second MapReduce phase the input is the emitted records of the first MapReduce phase (O(m3/2)) as well as the augmented edge list that was used as the input for the first MapReduce phase (O(m)). For the second MapReduce phase, let p be the size of the combined input, note:

O(p) = O(m) + O(m3/2) = O(m3/2).

Map 2 Combine degree-augmented file and output from Reduce 1. For the augmented edge list we have the following mapping to remove the vertex valences: key1 = [e1, δ(vx), δ(vy)] key2 = [e2, δ(vw), δ(vz)] . .

. keyn = [en, δ(vu), δ(vt)] ⇒ keye1 = [e1] keye2 = [e2] . .

. keyen = [en] and for the records emitted by Reduce 2, we have the identity operation. Therefore, this task is O(p). Reduce 2 Each bin corresponds with a vertex pair. A bin will contain at most one edge record and any number of triad records. With our assumptions of a completely connected graph, bin i contains n−2i triad records and one edge record, and the reducer will emit n−i triangles. From our previous analysis, 2 this is O(m3/2) and therefore the overall complexity is O(m3/2). 6

Experiments

We experimentally compare open source RDF/SPARQL engines, Jena, version 2.7.2, and Sesame, version 2.6.8, with MTGL’s implementation of Cohen’s algorithm. For both Jena and Sesame we use the in-memory backend version of each. We did need to make some modifications to the MTGL version in order to allow for directionality and duplicate edges. Namely, we created a multimap that gives a the list of edges connecting any two vertices. This allowed us to create multiple triangles from single instances of a triad in the last reduce phase. We used a workstation with 8 GB of memory and a 2.2 GHz Intel Core i7 processor. Detailed times for our experiments can be found in the Appendix.

For our experiments we create R-MAT [ 4 ] graphs to simulate real-world graph properties such as power-law distributions on degree, small-world graphs, and small diameter. We varied the size of the graph from between n = 25 to n = 219. Also, we tried three different edge factors (average degree per vertex), namely 16, 32, and 64. R-MAT has four other parameters, a, b, c, and d. These four parameters are probabilities used recursively to determine where edges exist within the adjacency matrix. We set these to the values of the Graph5006 search benchmark: a = 0.57, b = 0.19, c = 0.19, and d = 0.05. To enable the SPARQL engines the ability to process the data, we created IRI’s of the form <http://i > where i is the vertex id given by the R-MAT generator. Also we made all edges of the same type.

Figures 3(a), 3(b), and 3(c) show the individual performance of each of the three platforms as a function of the number of edges. All of the plots have a loglog scale. Figure 3(d) shows the three platforms side by side. For this Figure, we exclude graphs below 8192 edges to give a better idea of the scaling behavior for larger graphs. Both Jena and Sesame exhibit a fair amount of constant overhead per query that dominates the times in the smaller graphs. When excluding this data, the fit of trendlines using power regression is quite good, with R2 all exceeding 0.98. As can be seen from Figure 3(d), Jena has a complexity of around O(m1.83), Sesame has O(m1.58), and MTGL’s version of Cohen’s algorithm is around O(m1.39). The best possible for this data would be around O(m1.12), which is rate of growth of triangles for the data as determined experimentally and shown in Figure 3(e).

It is clear that both Jena and Sesame are not employing a triangle finding algorithm as their rate of growth is significantly larger than O(m1.5). While the differences in powers may seem small between the three, consider that extrapolating out to a billion edges, the difference in times between an O(m1.39) 6 www.graph500.org (a) Jena

(b) Sesame (c) MTGL (d) All (e) Triangle Growth algorithm and one with O(m1.58) with the same constant is about a 50x difference. And the difference between O(m1.39) and O(m1.83) is around 9000x. While not the focus of this paper, it is interesting to note that the MTGL version computed the triangles in about two-orders of magnitude less time than either Jena or Sesame.

Related Work

There is much work within the graph theory community where there is a straightforward application to the Semantic Web. For the most part, interfacing with the Semantic Web takes place through SPARQL. This largely confines interactions to subgraph matching tasks. Under this limitation, we can find work on clique-finding [ 5,10 ], or generalizations of cliques such as trusses [ 6 ]. An excellent overview of research in subgraph pattern matching over a thirty year timespan can be found in Conte et al. [ 7 ].

There are also algorithms outside of subgraph matching that can be of help in analyzing the Semantic Web. Probably one of the most fundamental algorithms is breadth-first search. In recent years, the Graph500 list has significantly increased the competitiveness and visibility of the task, and scalability and performance have increased concomitantly with the greater attention. Other algorithms include single source shortest path [ 13 ], betweeness centrality [ 1 ], connected components [ 12 ], and many others.

Related to all of these efforts is the task of graph partitioning. Graphs have historically been difficult to partition in a distributed memory setting, where the interconnectedness of the graph make it difficult to divide the data in such a way to minimize communication overhead. Notable efforts include Buluc¸ and Madduri [ 3 ] and Devine et al. [ 9 ].

Programming models have also been created to aid development of graphcentric algorithms. Notable among them are Google’s Pregel [ 14 ] and Signal/Collect by Stutz et al. [ 16 ]. These may prove to be valuable paradigms for implementing efficient graph-oriented code that can scale on large distributed systems. 8

Conclusions

In order for Semantic Web applications to scale, the community needs to adopt efficient algorithms to use as the computational kernels underlying analytics. In this paper we’ve demonstrated how long existing triangle finding algorithms can be employed to speed up SPARQL queries. We believe there are other algorithms and lessons from graph theory that can utilized to speed up Semantic Web applications and also open up other avenues for analysis.

Experimental Data

Below is the data we collected running the triangle query on Jena, Sesame, and MTGL. We divide the data into three tables, one for each edge factor. Times are in seconds. A.1

Edge Factor = 16 |V | |E| Num Triangles Jena Sesame MTGL 25 512 7,645 0.66 0.39 0.0011 26 1,024 16,159 0.96 0.44 0.0024 27 2,048 33,263 1.23 0.57 0.0081 28 4,096 72,357 1.96 0.88 0.0132 29 8,192 156,716 2.96 1.78 0.0365 210 16,384 333,174 7.79 3.86 0.0721 211 32,768 739,951 26.38 10.67 0.1798 212 65,536 1,648,301 89.72 31.12 0.4696 213 131,072 3,450,520 291.65 92.78 1.1767 214 262,144 7,573,624 1263.91 317.21 3.3691 215 524,288 16,864,063 5550.93 1022.38 8.7734 216 1,048,576 35,286,039 3062.77 21.6454 217 2,097,152 74,837,468 57.1703 218 4,194,304 168,767,188 153.9390 219 8,388,608 357,383,850 409.6070 A.2

Edge Factor = 32

D. A.

Bader ,

Kintali ,

Madduri , and

Mihail . Approximating betweenness centrality . In Proceedings of the 5th international conference on Algorithms and models for the web-graph, WAW'07 , pages 124 - 137 , Berlin, Heidelberg, 2007 . Springer-Verlag.

J. W.

Berry ,

Hendrickson ,

Kahan , and

Konecny . Software and algorithms for graph queries on multithreaded architectures . Parallel and Distributed Processing Symposium , International, 0 : 495 , 2007 .

Bulu ¸c and K. Madduri . Graph partitioning for scalable distributed graph computations . In Proc. 10th DIMACS Implementation Challenge Workshop - Graph Partitioning and Graph Clustering , Feb. 2012 .

Chakrabarti ,

Zhan , and

Faloutsos. R-mat: A recursive model for graph mining . In M. W. Berry,

Dayal ,

Kamath , and D. B. Skillicorn, editors, SDM. SIAM , 2004 .

Chiba and

Nishizeki . Arboricity and subgraph listing algorithms . SIAM Journal on Computing , 14 ( 1 ): 210 - 223 , 1985 .

Cohen . Graph twiddling in a mapreduce world . Computing in Science Engineering , 11 ( 4 ): 29 - 41 , july -aug. 2009 .

Conte ,

Foggia ,

Sansone , and

Vento . Thirty years of graph matching in pattern recognition . IJPRAI , pages 265 - 298 , 2004 .

Dean and

Ghemawat . Mapreduce: simplified data processing on large clusters . Commun. ACM , 51 : 107 - 113 , January 2008 .

9. K. D. Devine , E. G.

Boman , R. T. Heaphy, R. H.

Bisseling , and U. V.

Catalyurek . Parallel hypergraph partitioning for scientific computing . In Proceedings of the 20th international conference on Parallel and distributed processing , IPDPS'06 , pages 124 - 124 , Washington, DC, USA, 2006 . IEEE Computer Society.

10.

Eisenbrand and

Grandoni . On the complexity of fixed parameter clique and dominating set . Theor. Comput. Sci. , 326 ( 1-3 ): 57 - 67 , Oct. 2004 .

11.

Guo ,

Pan , and

Heflin . Lubm: A benchmark for owl knowledge base systems . J. Web Sem ., 3 ( 2 -3): 158 - 182 , 2005 .

12.

Krishnamurthy ,

S. S.

Lumetta ,

D. E.

Culler , and

Yelick . Connected components on distributed memory machines . In Parallel Algorithms: 3rd DIMACS Implementation Challenge October 17-19 , 1994 , volume 30 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science , pages 1 - 21 . American Mathematical Society, 1994 .

13.

Madduri ,

Bader ,

Berry , and

Crobak . An experimental study of a parallel shortest path algorithm for solving large-scale graph instances . In ALENEX , 2007 .

14. G. Malewicz,

M. H.

Austern ,

A. J.

Bik ,

J. C.

Dehnert , I. Horn,

Leiser , and

Czajkowski. Pregel: a system for large-scale graph processing . In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, SIGMOD '10 , pages 135 - 146 , New York, NY, USA, 2010 . ACM.

15. M. Schmidt , T.

Hornung , G. Lausen, and C.

Pinkel . Sp2bench: A sparql performance benchmark . CoRR, abs/0806.4627 , 2008 .

16.

Stutz ,

Bernstein , and

Cohen . Signal/collect: graph algorithms for the (semantic) web . In Proceedings of the 9th international semantic web conference on The semantic web - Volume Part I, ISWC'10 , pages 764 - 780 , Berlin, Heidelberg, 2010 . Springer-Verlag.

Edge

Factor = 64