1. INTRODUCTION

A Join Operator for Property Graphs

Giacomo Bergami

giacomo.bergami2@unibo.it 0

Matteo Magnani

matteo.magnani@it.uu.se 1

Danilo Montesi

danilo.montesi@unibo.it 0

0 University of Bologna, CSE Department, Bologna, Italy
1 Uppsala University, Department of Information Technology, Uppsala, Sweden

In the graph database literature the term "join" does not refer to an operator combining two graphs, but involves path traversal queries over a single graph. Current languages express binary joins through the combination of path traversal queries with graph creation operations. Such a solution proves to be inefficient. In this paper we introduce a binary graph join operator and a corresponding algorithm outperforming the solutions proposed by query languages for both graphs (Cypher, SPARQL) and relational databases (SQL). This is achieved by using a specific graph data structure in secondary memory showing better performance than state of the art graph libraries (Boost Graph Library, SNAP) and database systems (Sparksee).

1. INTRODUCTION

Although the term "join" appears in the graph database literature, such operator cannot be used to combine two distinct graphs, as table joins do in the relational model. Such joins are path joins running over a single graph [ 1 ]: they are used for graph traversal queries [ 13 ] where vertices and edges are considered as relational tables [ 25, 16 ]. The result of such path joins cannot be directly used to combine values from different sources (e.g. to join two distinct vertices appearing in different graphs alongside their values), and hence supplementary graph operations are required. SPARQL allows access to multiple graph resources through named graphs and performs graph traversals one graph at a time through path joins [ 12, 3, 27 ]. At this point the CONSTRUCT clause is required if we want to finally combine the traversed paths from both graphs into a resulting graph. Similarly, Cypher's CREATE clause has to be used to generate new vertices and edges from graph patterns extracted through the MATCH...WHERE clause, and intermediate results are merged with UNION ALL. While current graph query languages allow our proposed graph join operator to be expressed as a combination of the aforementioned operators, our study shows that our specialized graph join algorithm outperforms the evaluation of the graph join with existing graph and relational query languages.

As for relational databases, they solve common graph queries efficiently, so graph database management systems rely either on relational database engines [ 1, 21, 11 ] or on column store databases [ 25, 7 ]. Moreover, relational databases already have efficient implementations for (equi) join algorithms [ 23 ]. We nevertheless want to show that graph joins evaluated over the relational data model are inefficient. First of all, let us see an example of a graph join query:

Example 1. Consider an on-line service such as ResearchGate (Figure 1a, or Academia.edu) where researchers can follow each other's work, and a citation graph (Figure 1b). Now we want to "return the paper graph where a paper cites another one iff the first author (1Author) of the first paper follows the 1Author of the second (Figure 1c)". The ResearchGate graph does not contain any edge regarding the references, while the Reference graph does not contain any information pertaining to the follow relations. This demands a join between the two graphs: as a first step we join the vertices together as in the relational model (vertices are considered as tuples, using Name = 1Author as the vertex equi-join predicate $\theta$) and then combine the edges from both graphs. According to the query formulation, we establish an edge between two joined vertices only if the source has a paper citing the destination, and the user in the source follows the user in the destination.

Let us now examine the graph join implementation within the relational model: vertices and edges are represented as two relational tables ([ 25 ], Figure 2a). In addition to the attributes within the vertices' and the edges' tables, we assume that each row (on both vertices and edges) has an attribute id enumerating vertices and edges. Concerning the SQL interpretation of such graph join, we first join the vertices (see the records linked by lines in Figure 2a). Then the edges are computed through the join query provided in Figure 2b: the root and the leaves are the result of the join between the vertices, while the edges appear as the intermediate nodes. An adjacency list representation of a graph, as the one proposed in the current paper, reduces the joins within the relational solution to one (each vertex and edge is traversed only once), thus reducing the number of operations required to create the resulting graph. Other inefficiency considerations for graph query languages are provided in the Related Work section (Section 6.1).

[Figure 1: (a) the ResearchGate graph, with {User} vertices Alice, Bob, Carl and Dan linked by {Follows} edges; (b) the Reference (citation) graph, with {Paper} vertices Graphs, Join, OWL, Projection and μ-calc linked by {Cites} edges; (c) and (d) the join results discussed in the text, whose {User,Paper} vertices carry Title, 1Author and Name.]

Figure 2: (a) Representing the operands' vertices and edges with tables. The join for the vertices only involves tables VResearchGate and VReference.

VResearchGate
id  Name   ℓv
6   Alice  {User}
7   Bob    {User}
8   Carl   {User}
9   Dan    {User}

EResearchGate
id  src  dst  ℓe
5   6    7    {Follows}
6   6    8    {Follows}
7   7    9    {Follows}
8   9    8    {Follows}

VReference
id  Title    Name   ℓv
1   Graphs   Alice  {Paper}
2   Join     Alice  {Paper}
3   OWL      Bob    {Paper}
4   Project  Carl   {Paper}
5   μ-calc   Dan    {Paper}

Figure 2: (b) SQL join query plan required to create edges for ResearchGate $\bowtie^{\wedge}_{Name=1Author}$ Reference. The plan joins VResearchGate $\bowtie_\theta$ VReference with EResearchGate and EReference. The leaves act as the edges' sources while the root acts as their destinations.

Example 1 showed only one possible way to combine the operands' edges, but we can even return edges pertaining to both operands as in the following query: "For each paper reveal both the direct and the indirect dependencies (either there is a direct paper citation, or one of the authors follows the other one in ResearchGate)". The resulting graph (Figure 1d) has the same vertex set as the previous one, but the two graphs differ on the final edges. This implies that our graph join definition must be general enough to allow different edge combinations: we refer to those as edge semantics, "es" for short. This paper provides two contributions:

Graph join operator $\bowtie^{es}_{\theta}$ (Section 3), combining both vertices ($\theta$) and edges (es). A property graph model (Section 2) is used as the data model of choice.

Graph Conjunctive Equijoin Algorithm (Section 4): vertex buckets ordered by hash value are created, and the resulting graph's edges and vertices are produced at the same time. Our solution outperforms the query evaluations in SPARQL, Cypher and SQL (Section 5.2). Since the aforementioned algorithm relies on an ad hoc secondary memory data structure, we tested it over different graph libraries (Boost, SNAP) and low level graph databases (Sparksee). Even in this case our solution provides better results with large graphs (Section 5.1).

2. PRELIMINARIES

We model the vertices' and edges' sets as multisets (of tuples) $S$ of elements $s_i$, where $s_i$ unequivocally identifies the $i$-th occurrence of a tuple $s$ in $S$. Each tuple associates a value to each attribute: it is a function $A \mapsto V \cup \{\mathrm{NULL}\}$ mapping each attribute in $A$ to either a value in $V$ or NULL ($\varepsilon$ is the empty tuple). We slightly change the property graph definition in [ 16 ] in order to ease the join definition between vertices and edges, as later required by the graph join:

Definition 1 (Property Graph). A property graph is a tuple $G = (V, E, \Sigma_v, \Sigma_e, A_v, A_e, \lambda, \ell_v, \ell_e)$ where (a) $V$ is a multiset of nodes, (b) $E$ is a multiset of edges, (c) $\Sigma_v$ is a set of node labels, (d) $\Sigma_e$ is a set of edge labels, (e) $A_v$ is a set of node attributes, (f) $A_e$ is a set of edge attributes, (g) $\lambda : E \to V \times V$ is a function assigning node pairs to edges, (h) $\ell_v : V \to \mathcal{P}(\Sigma_v)$ is a function assigning a set of labels to nodes, and (i) $\ell_e : E \to \mathcal{P}(\Sigma_e)$ is a function assigning a set of labels to edges.
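As an illustration only, the components of Definition 1 could be held in a small Python container like the one below. The class name `PropertyGraph`, its field names and the helper `research_gate` are hypothetical choices made for this sketch and are not taken from the paper; the sample data are the VResearchGate and EResearchGate tables of Figure 2a.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    # Vertices and edges are stored as id -> attribute dict, so two equal
    # tuples with different ids can coexist (multiset behaviour).
    vertices: dict = field(default_factory=dict)       # V, attributes in A_v
    edges: dict = field(default_factory=dict)          # E, attributes in A_e
    endpoints: dict = field(default_factory=dict)      # lambda: edge id -> (src, dst)
    vertex_labels: dict = field(default_factory=dict)  # l_v: vertex id -> label set
    edge_labels: dict = field(default_factory=dict)    # l_e: edge id -> label set

    def add_vertex(self, vid, labels, **attrs):
        self.vertices[vid] = dict(attrs)
        self.vertex_labels[vid] = set(labels)

    def add_edge(self, eid, src, dst, labels, **attrs):
        self.edges[eid] = dict(attrs)
        self.endpoints[eid] = (src, dst)
        self.edge_labels[eid] = set(labels)


def research_gate():
    # Rows of the VResearchGate and EResearchGate tables (Figure 2a).
    g = PropertyGraph()
    for vid, name in [(6, "Alice"), (7, "Bob"), (8, "Carl"), (9, "Dan")]:
        g.add_vertex(vid, {"User"}, Name=name)
    for eid, src, dst in [(5, 6, 7), (6, 6, 8), (7, 7, 9), (8, 9, 8)]:
        g.add_edge(eid, src, dst, {"Follows"})
    return g
```

Keeping the label sets and the endpoint function apart from the attribute tuples mirrors the definition, where $\lambda$, $\ell_v$ and $\ell_e$ are separate functions rather than attributes.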

This is the baseline for our graph database:

Definition 2 (Graph Database). A graph database is a collection of $n$ distinct property graphs $\{G_1, \dots, G_n\}$ represented as a single property graph $D$ with $n$ distinct connected components. From now on we refer to each component simply as a graph. Each graph is identified by two functions: $\mathcal{V} : \{1, \dots, n\} \mapsto \mathcal{P}(V)$ determining the vertices $\mathcal{V}(i)$ of the $i$-th graph and $\mathcal{E} : \{1, \dots, n\} \mapsto \mathcal{P}(E)$ determining the edges $\mathcal{E}(i)$ of the $i$-th graph.

Example 2. Two edges $e_i$ and $f_j$ come from two distinct graphs, respectively $G_a$ and $G_b$, within the same graph database $D$. Edge $e_i$ connects vertex $u_h$ to $v_k$ ($\lambda(e_i) = (u_h, v_k)$), while $f_j$ connects $u'_h$ to $v'_k$ ($\lambda(f_j) = (u'_h, v'_k)$). Such edges store only the following values:

$e_i(\mathrm{Time}) = 12{:}04, \quad f_j(\mathrm{Day}) = \mathrm{Mon}$

and have the following labels:

$\ell_e(e_i) = \{\mathrm{Follow}\}, \quad \ell_e(f_j) = \{\mathrm{FriendOf}\}$

For the multiset $\theta$-join, we need a function $\oplus$ combining two tuples for the relational join operator over multisets, where $r_i \oplus s_j$ is a valid multiset element $(r \oplus s)_{i \oplus j}$ and $i \oplus j$ maps each integer pair $(i, j)$ to a single number.

Definition 3 ($\theta$-Join). Given two (multiset) tables $R$ and $S$ over the sets of attributes $A_1$ and $A_2$, the $\theta$-join $R \bowtie_\theta S$ [ 4, 5 ] is defined as follows:

$R \bowtie_\theta S = \{\, r_i \oplus s_j \mid r_i \in R,\ s_j \in S,\ \theta(r_i, s_j),\ (r_i \oplus s_j)(A_1) = r_i,\ (r_i \oplus s_j)(A_2) = s_j \,\}$

where $(t \oplus t')(A_i)$ denotes the projection of the tuple $t \oplus t'$ over $A_i$. If $\theta$ is the always true predicate, $\theta$ can be omitted and, when also $A_1 \cap A_2 = \emptyset$, we have a cartesian product.
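The following Python sketch illustrates Definition 3 on tuples represented as dictionaries. It is a minimal nested-loop version written under stated assumptions: the function names `dovetail`, `concat` and `theta_join` are hypothetical, multisets are modelled as maps from an occurrence index to a tuple, and the index combination $i \oplus j$ is assumed to be the dovetail (Cantor pairing) number given in the Appendix.

```python
def dovetail(i, j):
    # i (+) j = sum_{k=0}^{i+j} k + i, i.e. the Cantor pairing of (i, j).
    return (i + j) * (i + j + 1) // 2 + i


def concat(r, s):
    # r (+) s for tuples seen as attribute -> value maps. Returns None when
    # the tuples disagree on a shared attribute, because then the projections
    # over A_1 and A_2 could not both give back r and s.
    for attribute in r.keys() & s.keys():
        if r[attribute] != s[attribute]:
            return None
    return {**r, **s}


def theta_join(R, S, theta=lambda r, s: True):
    # R and S map an occurrence index to a tuple (a dict).
    result = {}
    for i, r in R.items():
        for j, s in S.items():
            if theta(r, s):
                rs = concat(r, s)
                if rs is not None:
                    result[dovetail(i, j)] = rs
    return result


# Usage: join users and papers on Name = 1Author (ids from Figure 2a).
users = {6: {"Name": "Alice"}, 7: {"Name": "Bob"}}
papers = {1: {"Title": "Graphs", "1Author": "Alice"},
          3: {"Title": "OWL", "1Author": "Bob"}}
joined = theta_join(users, papers, lambda u, p: u["Name"] == p["1Author"])
```

With the always true default predicate and disjoint attribute sets, `theta_join` degenerates to the cartesian product, as the definition states.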

If we define $\oplus$ as a linear function (that is, for each function $H$, $H(e_i \oplus f_j) = H(e_i) \oplus H(f_j)$), the $\theta$-join also induces the definition of $\ell_v$, $\ell_e$ and $\lambda$ for the joined tuples. As a consequence, $\oplus$ must be overloaded for each possible expected output from $H$ (see Definition 7 in Appendix).

Example 3. Continuing the previous example, suppose that the edge $e_i \oplus f_j$ comes from a graph join where edges from $G_a$ are joined to the ones in $G_b$ in a resulting graph, where also vertices $u_h \oplus u'_h$ and $v_k \oplus v'_k$ appear. So:

$(e_i \oplus f_j)(\mathrm{Time}) = 12{:}04, \quad (e_i \oplus f_j)(\mathrm{Day}) = \mathrm{Mon}$

By $\oplus$'s linearity, we have that the labels are merged:

$\ell_e(e_i \oplus f_j) = \ell_e(e_i) \oplus \ell_e(f_j) = \{\mathrm{Follow}\} \oplus \{\mathrm{FriendOf}\} = \{\mathrm{Follow}, \mathrm{FriendOf}\}$

And the result's vertices are updated accordingly:

$\lambda(e_i \oplus f_j) = \lambda(e_i) \oplus \lambda(f_j) = (u_h, v_k) \oplus (u'_h, v'_k) = (u_h \oplus u'_h, v_k \oplus v'_k)$

Since all the relevant information is stored in the graph database, we represent the graph as the set of the minimum information required for the join operation.

Definition 4 (Graph). The $i$-th graph of a graph database $D$ is a tuple $G_i = (\mathcal{V}(i), \mathcal{E}(i), A^i_v, A^i_e)$, where $\mathcal{V}(i)$ is a multiset of vertices and $\mathcal{E}(i)$ is a multiset of edges. Furthermore, $A^i_v$ is a set of attributes $a \in A^i_v$ s.t. there is at least one vertex $v_j \in \mathcal{V}(i)$ having $v_j(a) \neq \mathrm{NULL}$; $A^i_e$ is a set of attributes $a' \in A^i_e$ s.t. there is at least one edge $e_k \in \mathcal{E}(i)$ having $e_k(a') \neq \mathrm{NULL}$.

3. GRAPH JOINS

As we discussed in the introduction, our graph join is based on the combination of vertices and edges: $G_a \bowtie^{es}_\theta G_b$ expresses the join of graph $G_a$ with $G_b$ where (i) we first use a relational $\theta$-join among the vertices, and then (ii) we combine the edges using an appropriate user-determined edge semantics, es. This modularity is similar to the graph products defined in the graph theory literature [ 14, 17 ], where instead of a join between vertices they have a cross product, and different semantics are expressed as different graph products. We now provide the graph join definition:

Definition 5 (Graph $\theta$-Join). Given two graphs $G_a = (V, E, A_v, A_e)$ and $G_b = (V', E', A'_v, A'_e)$, a graph $\theta$-join is defined as follows:

$G_a \bowtie^{es}_\theta G_b = (V \bowtie_\theta V',\ E_{es},\ A_v \cup A'_v,\ A_e \cup A'_e)$

where $\theta$ is a binary predicate over the vertices, $\bowtie_\theta$ is the $\theta$-join (Definition 3) among the vertices, and $E_{es}$ is a subset of all the possible edges linking the vertices in $V \bowtie_\theta V'$ expressed with the es semantics.

Given that the graph join returns a property graph like the graphs in input, property graphs are closed under the graph join operator via the definition of $\oplus$ for the multiset $\theta$-join.

3.1 Two possible “es” edge semantics

The result of the join between two graphs, ResearchGate (Figure 1a) and References (Figure 1b), produces the same set of vertices regardless of the edge semantics of choice. On the other hand, edges among the resulting vertices change according to the edge semantics. In the first one (Figure 1c) we combine edges appearing in both graphs and linking vertices that appear combined in the resulting graph. We have a Conjunctive Join, which in graph theory is known as the Kronecker graph product [ 26, 14 ]. In this case $E_{es}$ is defined with the "$\wedge$" es semantics as an edge join $E_\wedge = E \bowtie_{\theta_\wedge} E'$, where the $\theta_\wedge$ predicate is the following one:

$\theta_\wedge(e_h, e'_k) = (e_h \in E \wedge e'_k \in E') \wedge \lambda(e_h \oplus e'_k) \in (V \bowtie_\theta V')^2 \qquad (1)$

We can also define a disjunctive semantics (Figure 1d), having "$\vee$" as es. In this case we want edges appearing either in the first or in the second operand. This means that two vertices, $u_h \oplus u'_h$ and $v_k \oplus v'_k$, could have a resulting edge $e_i \oplus \varepsilon'_j$ even if only $\lambda(e_i) = (u_h, v_k)$ appears in the first operand and $\varepsilon'_j$ is a "fresh" empty edge ($\lambda(\varepsilon'_j) = (u'_h, v'_k)$) not appearing in $G_b$, such that $\lambda(e_i \oplus \varepsilon'_j) = (u_h \oplus u'_h, v_k \oplus v'_k)$. Consequently the disjunctive join can be represented as a full outer join, where the edges either match in the conjunctive semantics, or appear in the two distinct graph operands:

$E_\vee = E \mathbin{⟗}_{\theta_\wedge} E' \qquad (2)$
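To make the two edge semantics concrete, the sketch below computes the joined vertices and the resulting edge endpoints with a naive nested loop. It is not the GCEA algorithm of Section 4. The function name `graph_join`, the representation of edges as plain (source, destination) pairs, and the citation edges in `E2` are assumptions made for illustration; the vertex ids and the Follows edges come from Figure 2a. The disjunctive branch only reports which endpoint pairs get an edge and does not build the fresh empty edges.

```python
from itertools import product


def graph_join(V1, E1, V2, E2, theta, semantics="and"):
    # Joined vertices: pairs (u, u2) of vertex ids whose tuples satisfy theta.
    joined = [(u, u2) for u, u2 in product(V1, V2) if theta(V1[u], V2[u2])]
    edges = set()
    for (u, u2), (v, v2) in product(joined, repeat=2):
        in_first = (u, v) in E1      # edge between the left components
        in_second = (u2, v2) in E2   # edge between the right components
        if semantics == "and":
            keep = in_first and in_second   # conjunctive semantics
        else:
            keep = in_first or in_second    # disjunctive semantics
        if keep:
            edges.add(((u, u2), (v, v2)))
    return joined, edges


V1 = {6: {"Name": "Alice"}, 7: {"Name": "Bob"},
      8: {"Name": "Carl"}, 9: {"Name": "Dan"}}
E1 = {(6, 7), (6, 8), (7, 9), (9, 8)}            # Follows (Figure 2a)
V2 = {1: {"Title": "Graphs", "1Author": "Alice"},
      2: {"Title": "Join", "1Author": "Alice"},
      3: {"Title": "OWL", "1Author": "Bob"},
      4: {"Title": "Projection", "1Author": "Carl"},
      5: {"Title": "μ-calc", "1Author": "Dan"}}
E2 = {(1, 3), (2, 4), (3, 5), (4, 5)}            # Cites: hypothetical edges


def same_author(user, paper):
    return user["Name"] == paper["1Author"]


vertices, conjunctive = graph_join(V1, E1, V2, E2, same_author, "and")
_, disjunctive = graph_join(V1, E1, V2, E2, same_author, "or")
```

Both calls return the same vertex list, and every conjunctive edge is also a disjunctive edge, which matches the observation that only the edge set depends on the chosen semantics.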

4. ALGORITHM AND DATA STRUCTURE

We now outline our algorithm, GCEA, for equijoin $\theta$ predicates, involving an equivalence between attributes or a conjunction of such equivalences. This specific predicate choice was driven by the fact that the most performant and most widely implemented relational database join is the equi-join [ 23 ]. Moreover we provide an implementation for the conjunctive semantics, since this task is easier to optimize than the disjunctive one. Algorithm 1 for GCEA consists of three parts: (i) vertex partitioning (bucketing) through a hashing function (OperandPartitioning), (ii) graph serialization on secondary memory (SerializeOperand), and (iii) the actual join algorithm over the graphs' buckets (PartitionHashJoin). The relational partition hash-join undergoes the same phases, even if relational algorithms do not deal with outgoing edges (lines 31-35 and Section 6). We allow vertices with replicated values as in current graph database implementations (such as Titan and Neo4J). Consequently ids enumerate the vertices within a single graph.

As a first step, the hashing function $h$ is inferred from $\theta$ (line 2): if $\theta(u, v)$ is a binary predicate between distinct attributes from $u$ and $v$, then $h$ is defined as a linear combination of hash functions over the attributes of either $u$ or $v$. When no $h$ can be inferred from $\theta$, then $h$ is a constant function.
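One possible reading of this step is sketched below in Python. The representation of the equijoin predicate as a list of (left attribute, right attribute) pairs, the names `attribute_hash` and `generate_hash`, the use of `zlib.crc32` as per-attribute hash and the integer weights of the linear combination are all assumptions of this sketch; the paper does not prescribe them.

```python
import zlib


def attribute_hash(value):
    # Deterministic hash of one attribute value (crc32 is an arbitrary choice).
    return zlib.crc32(repr(value).encode("utf-8"))


def generate_hash(equalities):
    # equalities: list of (left_attr, right_attr) pairs describing the
    # conjunction u.left_attr = v.right_attr. Returns one hash function per
    # operand. An empty list means that no hash can be inferred, so both
    # functions are constant and every vertex lands in the same bucket.
    if not equalities:
        def constant(vertex):
            return 0
        return constant, constant

    def make(side):
        def h(vertex):
            total = 0
            # Linear combination of the per-attribute hashes.
            for weight, pair in enumerate(equalities, start=1):
                total += weight * attribute_hash(vertex[pair[side]])
            return total
        return h

    return make(0), make(1)


h_left, h_right = generate_hash([("Year1", "Year2"),
                                 ("Organization1", "Organization2")])
u = {"Year1": 2010, "Organization1": "UniBo"}
v = {"Year2": 2010, "Organization2": "UniBo"}
```

Two vertices that satisfy the equijoin predicate receive the same hash value, so they can only meet inside the same bucket; the converse does not hold, which is why the predicate is still checked during the join.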

Algorithm 1 Graph Conjunctive EquiJoin Algorithm (GCEA)
 1: procedure ConjunctiveJoin(G, G', θ)
 2:   hashFunction = generateHash(θ);
 3:   omap1 = OperandPartitioning(G, hashFunction)
 4:   omap2 = OperandPartitioning(G', hashFunction)
 5:   G1 = SerializeOperand(G, omap1)
 6:   G2 = SerializeOperand(G', omap2)
 7:   return PartitionHashJoin(G1, G2, θ)
 8: procedure SerializeOperand(G, omap):
 9:   File VertexIndex = Open();
10:   VertexVals = Open(), HashOffset = Open();
11:   ulong offset = HashOffset = 0;
12:   for each h ∈ Keys(omap) do        ▷ Ordered maps have ordered keys.
13:     HashOffset.Write({h, HashOffset});
14:     for each id ∈ omap[h] do
15:       v = G.V[id];
16:       v.hash = h; v.offset = VertexVals;
17:       VertexIndex.Write({v.id, h, offset});
18:       ulong offsetNext = VA.Write(serialize(v));
19:       offset += offsetNext; HashOffset += offsetNext;
20:   return (VertexIndex, VertexVals, HashOffset, G.Av, G.Ae)
21: procedure PartitionHashJoin(G1, G2, θ):
22:   θ'(u, u') := θ(u, u') ∧ (u ⊕ u')(Av) = u ∧ (u ⊕ u')(A'v) = u';
23:   θ'(e, e') := (e ⊕ e')(Ae) = e ∧ (e ⊕ e')(A'e) = e'
24:   HI = IntersectHashes(HashOffset1, HashOffset2).iterator();
25:   File AdjFile = Open();
26:   while HI.hasNext() do
27:     h = HI.next();
28:     for each u ∈ VertexVals1[h.offset1], u' ∈ VertexVals2[h.offset2] do
29:       if θ'(u, u') then
30:         AdjFile.Write(V = {u ⊕ u'})
31:         HIout = IntersectHashes(outV(u), outV'(u')).iterator();
32:         while HIout.hasNext() do
33:           hout = HIout.next();
34:           for each edge e ∈ outV(u)[hout.offset1], e' ∈ outV'(u')[hout.offset2] do
35:             if θ'(e.outvertex, e'.outvertex) and θ'(e, e') then
36:               AdjFile.Write(E = {e ⊕ e'})

OperandPartitioning performs a vertex bucketing in main memory: its outcome is an ordered map, where each vertex $v$ is stored in a collection omap[$h(v)$], where $h$ is the aforementioned hashing function. For each operand $G_i$, the omap$_i$ construction takes at most $\sum_{j=0}^{|\mathcal{V}(i)|} \log(j)$ time, where $|\mathcal{V}(i)|$ is the multiset vertex size. Such time complexity is bounded by $|\mathcal{V}(i)| \leq \sum_{j=0}^{|\mathcal{V}(i)|} \log(j) < |\mathcal{V}(i)|^2$ where $|\mathcal{V}(i)| \geq 1$.
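The bucketing phase can be sketched as follows. This is an illustrative approximation, not the authors' implementation: Python has no built-in ordered map, so the hypothetical `operand_partitioning` below groups vertex ids in a plain dictionary and sorts the keys once at the end, instead of paying a logarithmic cost at every insertion as an ordered map would.

```python
from collections import defaultdict


def operand_partitioning(vertices, h):
    # vertices: vertex id -> attribute dict; h: hash function over a vertex.
    buckets = defaultdict(list)
    for vid, attrs in vertices.items():
        buckets[h(attrs)].append(vid)
    # Emulate the ordered map: rebuild the dict with keys in ascending order
    # (Python dicts preserve insertion order).
    return {key: buckets[key] for key in sorted(buckets)}


sample = {1: {"Year": 2010}, 2: {"Year": 2005}, 3: {"Year": 2010}}
omap = operand_partitioning(sample, lambda attrs: attrs["Year"])
```

Iterating over `omap` then visits the buckets by increasing hash value, which is the property the serialization phase relies on.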

SerializeOperand stores the operand in secondary memory: both buckets (line 12) and vertices (line 14) are already sorted by hash value, and hence such data structures are accessed linearly. Figure 3c depicts a serialized representation of the graph in Figure 3b: the labels and the edge values are not serialized but are still accessible through the original graph $G$ via id. Buckets are represented by HashOffset, providing both the bucket value and the pointer to the first vertex of the bucket stored in VertexVals. VertexVals stores vertices alongside their adjacency lists, where vertices are sorted by hash value and are represented by id and hash value. VertexIndex allows the vertices stored in VertexVals to be found in constant time: each record is ordered by vertex id, has a constant size and contains the pointer to where the vertex data is stored in VertexVals. The outgoing edges are also sorted by the destination vertex's hash value. Given $k_i$ the size of Keys(omap$_i$), this phase takes $3k_i + |G_i|$ time, where $2k_i$ is the omap visit cost, $k_i$ is the omap serialization as HashOffset and $|G_i|$ is the time to serialize the graph as $VA_i$.
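A simplified, in-memory version of the three structures can be written with Python's `struct` module. The record layouts below (three unsigned 64-bit integers per vertex header, two per adjacency entry and per index record) and the names `serialize_operand` and `read_vertex` are assumptions of this sketch; the paper does not specify the byte layout, and here byte buffers stand in for the three files.

```python
import struct


def serialize_operand(omap, adjacency, h_of):
    # omap: hash -> list of vertex ids, keys already in ascending order;
    # adjacency: vertex id -> list of destination ids;
    # h_of: callable mapping a vertex id to its (non-negative) hash value.
    vertex_vals = bytearray()
    hash_offset = []   # (hash, offset of the first vertex of the bucket)
    position = {}      # vertex id -> offset of its record in vertex_vals
    for h, ids in omap.items():
        hash_offset.append((h, len(vertex_vals)))
        for vid in ids:
            position[vid] = len(vertex_vals)
            # Outgoing edges are sorted by the destination's hash value.
            out = sorted(adjacency.get(vid, []), key=h_of)
            vertex_vals += struct.pack("<QQQ", vid, h, len(out))
            for dst in out:
                vertex_vals += struct.pack("<QQ", dst, h_of(dst))
    # Fixed-size index records ordered by vertex id.
    vertex_index = b"".join(struct.pack("<QQ", vid, position[vid])
                            for vid in sorted(position))
    return vertex_index, bytes(vertex_vals), hash_offset


def read_vertex(vertex_vals, offset):
    # Decode one vertex record: header (24 bytes) plus 16 bytes per edge.
    vid, h, n = struct.unpack_from("<QQQ", vertex_vals, offset)
    out = [struct.unpack_from("<QQ", vertex_vals, offset + 24 + 16 * k)
           for k in range(n)]
    return vid, h, out


hashes = {1: 20, 2: 10, 3: 20}
index, vals, offsets = serialize_operand({10: [2], 20: [1, 3]},
                                         {1: [2, 3], 2: [1]},
                                         hashes.get)
```

Because the index records have a constant size and the ids in this toy example are dense (1 to 3), the record of vertex `vid` sits at byte `16 * (vid - 1)` of `index`, which gives the constant-time lookup described above.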

The last step performs the actual conjunctive join over the serialized graph (PartitionHashJoin): the data structure is accessed from secondary memory through memory mapping. Line 24 prepares the intersection: while performing a linear scan over the buckets, the HI iterator checks if both operands have a bucket with the same hash value (line 26), then the common hash value is extracted (line 27) and the two buckets are accessed (line 28), then the composition $u \oplus u'$ between the vertices is performed (line 30).

Next, differently from the relational join, the adjacent vertices for both operands are visited. Similarly to line 24, the hash-sorted edges induce a bucketing (line 31), and then we check if the destination vertices meet the join conditions alongside the to-be-joined edges (line 35). Please note that, as stated in Definition 5, edges are not filtered by the $\theta$ predicate. Furthermore, the resulting graph is stored in a bulk graph where only the vertex ids from the two graph operands appear as pairs. This last operation takes time $k_1 + k_2 + \sum_{h \in HI} b^1_h b^2_h + out^1_h out^2_h$ where $b^i_h$ is the size of the $h$ bucket for the $i$-th operand, while $out^i_h$ is the outgoing vertices' size for all the vertices within the $h$ bucket for the $i$-th operand.
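The join phase can be approximated in main memory as below. This sketch keeps the bucket intersection logic of PartitionHashJoin but drops serialization and memory mapping, compares only vertices (edge tuples are taken to be empty), and returns Python lists instead of writing an adjacency file. The function name, its parameters and the citation edges in `adj2` are assumptions; vertex ids and Follows edges come from Figure 2a.

```python
def partition_hash_join(V1, adj1, V2, adj2, h1, h2, theta):
    # V*: vertex id -> attributes; adj*: vertex id -> list of destination ids;
    # h*: hash function of each operand; theta: vertex predicate.
    def bucketize(ids, V, h):
        buckets = {}
        for vid in ids:
            buckets.setdefault(h(V[vid]), []).append(vid)
        return buckets

    b1, b2 = bucketize(V1, V1, h1), bucketize(V2, V2, h2)
    vertices, edges = [], []
    for key in sorted(b1.keys() & b2.keys()):       # buckets in both operands
        for u in b1[key]:
            for u2 in b2[key]:
                if not theta(V1[u], V2[u2]):
                    continue
                vertices.append((u, u2))
                # Outgoing edges are bucketed by destination hash as well.
                o1 = bucketize(adj1.get(u, []), V1, h1)
                o2 = bucketize(adj2.get(u2, []), V2, h2)
                for out_key in sorted(o1.keys() & o2.keys()):
                    for v in o1[out_key]:
                        for v2 in o2[out_key]:
                            if theta(V1[v], V2[v2]):
                                edges.append(((u, u2), (v, v2)))
    return vertices, edges


users = {6: {"Name": "Alice"}, 7: {"Name": "Bob"},
         8: {"Name": "Carl"}, 9: {"Name": "Dan"}}
follows = {6: [7, 8], 7: [9], 9: [8]}
papers = {1: {"1Author": "Alice"}, 2: {"1Author": "Alice"},
          3: {"1Author": "Bob"}, 4: {"1Author": "Carl"}, 5: {"1Author": "Dan"}}
cites = {1: [3], 2: [4], 3: [5], 4: [5]}            # hypothetical citations
result_vertices, result_edges = partition_hash_join(
    users, follows, papers, cites,
    lambda a: a["Name"], lambda a: a["1Author"],
    lambda user, paper: user["Name"] == paper["1Author"])
```

Only vertex pairs that share a bucket are ever compared, so pairs with different hash values are skipped without evaluating the predicate; the same pruning is applied to the destinations of the outgoing edges.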

Such algorithm could also be extended to the disjunctive semantics as follows: all the edges discarded from the intersection in line 30 for $u \oplus u'$ should be considered, whether they come from the left operand or from the right one. Among all such edges, we consider only those $e'$ whose destination vertex $\nu$ has a hash value appearing in HI. Moreover $\nu$ has to satisfy the binary predicate $\theta$ jointly with another vertex $\nu'$, coming from the opposite operand. Hence we establish (e.g.) an edge $(u \oplus u', \nu \oplus \nu')$ having the same values and attributes of $e'$ and the same set of labels.

[Figure 3: (a) Data structures used to implement the graph in secondary memory (VertexIndex, VertexVals, HashOffset). Each data structure represents a different file. Panels (b) and (c), referenced in the text, show a sample graph and its serialized representation.]

5. EXPERIMENTAL EVALUATION

Through the following experiments we want to show that (i) both hash buckets and memory mapping for the graph join operands provide better results for GCEA, and that (ii) GCEA outperforms the query plans of other query languages (both graph and relational). For the first case we have to use graph libraries or graph databases where transactions and logging can be disabled, while for the second we choose state of the art graph databases implementing specific query languages.

In order to do so we choose the simplest graph representation that provides the best performance for all the addressed languages: we choose a graph where only vertices contain values and where labels are stored in both vertices and edges. We created our data using the LiveJournal Graph [ 19 ] containing 4,847,571 unlabelled vertices and 68,993,773 unlabelled edges. Each vertex represents a user who is connected to each of their friends by an edge. Since no data values are given within the datasets, we enriched the graph using the guidelines of the LDBC Social Network Benchmark protocol [ 10 ], and hence associated to each user an IP address, an Organization and the year of employment (footnote 1). For each experiment, the input data were obtained by starting a random walk from the same vertex but using a different seed for the graph traversal. New data sets were obtained incrementally by visiting each time a number of vertices that is a power of 10, from 10 to $10^6$.

We performed our tests on a Mac OS X machine with a 2.2 GHz Intel Core i7 processor and 16 GB of RAM at 1600 MHz, and an SSD secondary storage with an HFS file system. We evaluate the graph join using as operands two distinct sampled subgraphs with the same vertex size ($|V|$), where the $\theta$ predicate is the following one: $\theta(u, v) \stackrel{def}{=} u.Year_1 = v.Year_2 \wedge u.Organization_1 = v.Organization_2$. Such predicate does not perform a perfect 1-to-1 match between the graph vertices, thus allowing us to test the algorithm with different multiplicity values. We tested the algorithm with the conjunctive semantics, having a subset of the operations of the disjunctive one.

Footnote 1: More information regarding our proposed solutions is available at http://smartdata.cs.unibo.it/?page_id=798.

5.1 Evaluating Data Structures

We benchmark our solution with graph data models where database transactions either do not exist or can be disabled. We first consider two graph libraries accessing graphs in main memory; we tested the Boost Graph Library 1.60.0 with the most efficient configuration for graph traversal tasks, vec [ 24 ], and SNAP 3.0 [ 20 ] considering the attributes only over the vertices (TNodeNet<TAttr>). Then we consider the Sparksee graph database [ 9 ]: transactions were disabled in the configuration file, as well as logging, rollback and recovery facilities. Concerning the graph database management implementation, no assumptions can be made as it is closed source.

We implemented our graph join algorithm for all the aforementioned libraries. We used the standard graph library methods to store the graph in secondary memory (serialization or graph database storage) and extended PartitionHashJoin with a preliminary vertex bucketing phase, since in these systems buckets are not supported and vertices cannot be sorted by hash value.

Join Evaluation Time. In this case we evaluate two aspects: (i) the join algorithm running time and (ii) the time required to create the solution and store it in secondary memory.

Table 4a provides the cost of performing the sole join algorithm excluding the result storing time. All the competitors' graphs were joined through GCEA and vertices with the same hash were put in the same bucket in main memory. It must be emphasised that both Boost and SNAP operands were loaded in primary memory, while our operands were accessed in secondary memory through memory mapping. The table shows how all the other data structures had a worse performance due to the initial cost of the bucket creation and sorting. We must also remark that this result justifies the need for our data structure for the proposed algorithm. The same table provides the time required to store the results as an adjacency list in secondary memory using the default graph library representation (non-labelled vertices and edges, default serialization). In this case our solution always outperforms the other graph libraries and databases.

Operand creation time. We consider the graph creation time in main memory and the cost of storing it in secondary memory per operand (Table 4b). For both Boost and SNAP the default serialization methods are performed, while for Sparksee we simply closed the database. In this case our solution outperforms all the competitors.

5.2 Join Execution Time

This last experiment compares the interpretation of query plans for both relational and graph databases with GCEA. The opponents' query plans are discussed in Section 6.1.

We used default configurations for both Neo4J and PostgreSQL, while we changed the cache buffer configuration for Virtuoso (as suggested in the configuration file) for 16 GB of RAM. We kept the default multithreaded query execution plan. Cypher queries were sent using the Java API but the graph join operation was performed only in Cypher through the execute method of a GraphDatabaseService object. PostgreSQL queries were evaluated directly through the psql client and benchmarked using both the explain analyze and \timing commands. Virtuoso was benchmarked through an iODBC connection invoked in C using the Redland RDF library: no HTTP connections were used and only the librdf_model_query_execute function was involved in the graph join operation. Neo4J graphs were fine tuned by indexing the attributes Organization and Year involved in the query and, since the Cypher language does not allow access to different graphs, both graph join operands were stored within the same graph. Virtuoso triples were not indexed, as a default set of indices is defined during the graph creation, and data is automatically indexed. None of the aforementioned conditions degrades the query evaluations.

Table 5 reports the results of such benchmarks. The competitors' join time is made up only of the query evaluation time, while our proposed implementation considered the whole GCEA algorithm, and hence the partitioning phase, the operands' serialization and the actual join execution were all considered. As a result our solution always outperforms the competitors' query plans within their own language implementation.

6. RELATED WORK

At the time of writing, the only field where a binary graph operation is discussed is Discrete Mathematics. Such operations are defined over either finite graphs or finite graphs with cycles, and are named graph products [ 14 ]. Every graph product produces a graph whose vertex set is defined as a cartesian product between the vertices' sets producing pairs of vertices, while the edge set changes according to the different graph product definition. Consequently the Kronecker Graph Product [ 26 ] is defined as follows:

$G \otimes G' = \left(V \times V',\ \{\, ((g, h), (g', h')) \in (V \times V')^2 \mid (g, g') \in E,\ (h, h') \in E' \,\}\right)$

while the cartesian graph product [ 18 ] is defined as follows:

$G \,\square\, G' = \left(V \times V',\ \{\, ((g, h), (g', h')) \in (V \times V')^2 \mid (g = g',\ (h, h') \in E') \vee (h = h',\ (g, g') \in E) \,\}\right)$

Other graph products are the lexicographic product and the strong product [ 14, 17 ]. These definitions have two issues: (i) the resulting vertex set is not made of single vertices but of pairs of vertices, and hence (ii) those graph product definitions are not tailored for graphs with embedded data (i.e. property graphs, triple stores).
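The two graph products recalled above can be computed directly from their set definitions. The Python sketch below is only illustrative: the function names are hypothetical, and graphs are given as a vertex set plus a set of (source, destination) pairs.

```python
from itertools import product


def kronecker_product(V1, E1, V2, E2):
    # Edge ((g, h), (g2, h2)) exists iff (g, g2) is in E1 and (h, h2) is in E2.
    vertices = set(product(V1, V2))
    edges = {((g, h), (g2, h2)) for (g, g2) in E1 for (h, h2) in E2}
    return vertices, edges


def cartesian_product(V1, E1, V2, E2):
    # Edge exists iff one component is unchanged and the other follows an edge.
    vertices = set(product(V1, V2))
    edges = {((g, h), (g, h2)) for g in V1 for (h, h2) in E2}
    edges |= {((g, h), (g2, h)) for (g, g2) in E1 for h in V2}
    return vertices, edges


VA, EA = {"a", "b"}, {("a", "b")}
VB, EB = {1, 2}, {(1, 2)}
kron_vertices, kron_edges = kronecker_product(VA, EA, VB, EB)
cart_vertices, cart_edges = cartesian_product(VA, EA, VB, EB)
```

In both cases the vertex set is the full cross product, whereas the graph join keeps only the pairs accepted by the $\theta$ predicate and merges each pair into a single vertex.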

6.1 (Graph) Query Languages

In this section we describe how a graph join is implemented for property graph databases and RDF triplestores. The reason is twofold: we want both to show that graph joins can be represented in different data representations, and to detail how our experiments in Section 5 were performed.

Property Graph and Cypher. Property graphs are, to the best of our knowledge, the most general graph data model representation because they consider both vertices and edges as multi-labelled tuples. It is necessary to compare the performance of our algorithm with query languages running on top of property graph databases because our physical model generalizes property graphs. Among the Property Graph databases we do not consider SQLGraph [ 25 ] because there is no existing implementation and, most importantly, the Gremlin query language only allows graph traversal queries returning bags of values. We choose to perform our tests over Neo4J using Cypher as a query language, because Neo4J allows the built-in query plans to be extended with ad hoc solutions [ 16 ], eventually allowing an implementation of our algorithm in the future.

Cypher uses a pipe query evaluation model that allows queries to be refined in further steps. Regarding the implementation of the graph conjunctive join operator in Cypher, ValueHashJoins are performed between vertices coming from different graph operands, and hash values are either evaluated at run time, or depend on the indexing of the attributes' values. This choice supports the experimental evidence of Cypher having a better scalability than SPARQL, where RDF graphs cannot be indexed by values (see the next paragraph). Once the Cypher query is transformed into a pipe-based query plan, most of the pipes' sources appear to be NodeByLabelScan and AllNodeScan: this means that all the graph's vertices (with a given label) are considered in the first steps of the computation. As a result the query plan scans more data than needed to provide the final result. In our algorithm this drawback does not occur because we directly access the data per bucket on both graph operands, avoiding any vertex combinations that will not appear in the final result.

RDF triplestores and SPARQL. Triplestore systems store the graph information as triplets, (source, property, destination), where source and destination are two vertices, and property is the edge linking them. Such property could even appear as a source vertex whenever additional information is provided [ 8 ]. [ 8 ] shows that property graphs can be entirely mapped into RDF triplestore systems as follows:

Definition 6 (Property Graph over Triplestore). Given a property graph $G = (V, E, A_v, A_e)$, each vertex $v_i \in V$ induces a set of triples $(v_i, \alpha, \beta)$ for each $\alpha \in A_v$ such that $v_i(\alpha) = \beta$ having $\beta \neq \mathrm{NULL}$. Each edge $e_j \in E$ induces a set of triples $(s, e_j, d)$ such that $\lambda(e_j) = (s, d)$ and another set of triples $(e_j, \alpha', \beta')$ for each $\alpha' \in A_e$ such that $e_j(\alpha') = \beta'$ having $\beta' \neq \mathrm{NULL}$. Each property graph $G$ is stored as a distinct named graph.
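The mapping of Definition 6 can be sketched in a few lines of Python. The function name `to_triples`, the string identifiers and the `Age` attribute are hypothetical; NULL is modelled with `None`, and labels are left out because the definition above does not mention them.

```python
def to_triples(vertices, edges, endpoints):
    # vertices/edges: id -> attribute dict; endpoints: edge id -> (src, dst).
    triples = []
    for vid, attrs in vertices.items():
        # One triple per non-NULL vertex attribute.
        triples += [(vid, a, val) for a, val in attrs.items() if val is not None]
    for eid, attrs in edges.items():
        src, dst = endpoints[eid]
        triples.append((src, eid, dst))   # the edge itself
        # The edge id becomes a subject for its own non-NULL attributes.
        triples += [(eid, a, val) for a, val in attrs.items() if val is not None]
    return triples


sample_vertices = {"v6": {"Name": "Alice", "Age": None}, "v7": {"Name": "Bob"}}
sample_edges = {"e5": {"Time": "12:04"}}
sample_endpoints = {"e5": ("v6", "v7")}
sample_triples = to_triples(sample_vertices, sample_edges, sample_endpoints)
```

Each edge attribute costs one extra triple, which gives a feeling for why the text attributes part of the SPARQL slowdown to the sparsity of the representation.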

This allows us to query each property graph with the SPARQL query language, specifically targeted at triplestores, through their RDF representation. We took this query language into account for our benchmarks because it is well consolidated: a lot of research has been carried out [ 22 ] and efficient query plans have been implemented [ 15 ], even when multiple graphs are taken as input. These results involve the interpretation and the execution of "optional joins" paths [ 2 ], thus allowing us to check whether the graph conjunctive join conditions are not met for the outgoing edges. Such performance quickly degrades, both because the sparsity of the data representation requires more path joins than the ones required for the property graph model, and because CONSTRUCT clauses are not included in the SPARQL algebra optimizations. However, CONSTRUCT is required to produce a graph as the final outcome of our graph join query. Moreover RDF triplestores such as Virtuoso prefer to index triplets per pattern and do not allow triplet indexing by values. Within our tests, we also ensured that both input and output met the requirements of Definition 6.

7. CONCLUSIONS

This paper defines a graph join operator for the first time. An algorithm is proposed for the conjunctive semantics, which outperforms the implementations in different languages. By comparing the execution of our algorithm over our graph data structure with its execution over the data structures provided by graph libraries, we also show that our choice of sorting the vertices by hash value (required for the partition hash join) yields better performance during the overall execution of the join algorithm.

Some results have been omitted for lack of space. First, we could prove that our operator is both commutative and associative. This result could be proved both for the vertices' and edges' tuples and for their labels. These properties are relevant when such operators are used for data integration tasks. The bucketing approach also allows the graph join to be implemented through a map-reduce approach. Last, we could extend our algorithm to support predicates over one attribute having partially ordered values [ 6 ]: since our data structure is already ordered by hash value, we could just use a hashing function h that is monotone w.r.t. such attribute.

8. APPENDIX

Definition 7 (Concatenation). $\oplus : A \times A \mapsto A$ is a lazily evaluated concatenation function between two operands of type $A$ returning an element of the same type, $A$. The concatenation function is a linear function such that, given any function $H$ with $\mathrm{dom}(H) = A$, $H(u \oplus v) = H(u) \oplus H(v)$. $\oplus$ is defined for the following $A$-s:

sets: it performs the union of the two sets: $S \oplus S' \overset{def}{=} S \cup S'$.

integers: it returns the dovetail number associating to each pair of integers a unique integer: $i \oplus j \overset{def}{=} \sum_{k=0}^{i+j} k + i$.

functions: given a function $f : A \mapsto B$ and $g : C \mapsto D$, $f \oplus g$ is the function returning $f(x)$ if $x \in A$, and $g(x)$ if $x \in C$. NULL is returned otherwise. Such function concatenation is used when $\forall x \in A \cap C.\ f(x) = g(x)$.

pairs: given two pairs $(u, v)$ and $(u', v')$, the pair concatenation is defined as the pairwise concatenation of each element, that is $(u, v) \oplus (u', v') \overset{def}{=} (u \oplus u', v \oplus v')$. Elements belonging to multisets are represented as pairs
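The four cases of the concatenation function can be sketched in Python as follows. This is only an illustration under stated assumptions: evaluation is eager rather than lazy, functions are modelled as dictionaries (a missing key stands for NULL), and the names `dovetail` and `concat` are hypothetical.

```python
def dovetail(i, j):
    """Dovetail number of (i, j): the sum of k for k = 0..i+j, plus i."""
    n = i + j
    return n * (n + 1) // 2 + i

def concat(u, v):
    """Eager sketch of the concatenation of two operands of the same kind."""
    if isinstance(u, (set, frozenset)) and isinstance(v, (set, frozenset)):
        return u | v                                   # sets: union
    if isinstance(u, int) and isinstance(v, int):
        return dovetail(u, v)                          # integers: dovetail number
    if isinstance(u, dict) and isinstance(v, dict):
        # functions as dicts: they must agree wherever both are defined
        if any(u[x] != v[x] for x in u.keys() & v.keys()):
            raise ValueError("functions disagree on a shared argument")
        return {**u, **v}
    if isinstance(u, tuple) and isinstance(v, tuple) and len(u) == len(v) == 2:
        return (concat(u[0], v[0]), concat(u[1], v[1]))  # pairs: element-wise
    raise TypeError("unsupported operand types")

pair = concat(({1}, 2), ({4}, 3))   # ({1, 4}, dovetail(2, 3))
```

For the integer case, `dovetail(2, 3)` sums 0 through 5 (which gives 15) and adds 2, yielding 17. Distinct pairs of non-negative integers always receive distinct numbers, which is what makes the result usable as an identifier for a combined element.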

[1] C. R. Aberger, Tu, Olukotun, and Ré. Emptyheaded: A relational engine for graph processing. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 431–446, New York, NY, USA, 2016. ACM.

[2] Atre. Left Bit Right: For SPARQL Join Queries with OPTIONAL Patterns (Left-outer-joins). In SIGMOD Conference, pages 1793–1808. ACM, 2015.

[3] Atre, Chaoji, M. J. Zaki, and J. A. Hendler. Matrix "bit" loaded: A scalable lightweight join query processor for rdf data. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 41–50, New York, NY, USA, 2010. ACM.

[4] Atzeni, Ceri, Paraboschi, and Torlone. Database Systems - Concepts, Languages and Architectures. McGraw-Hill, 1 edition, 1999.

[5] Atzeni, Ceri, Paraboschi, and Torlone. Basi di dati. Modelli e linguaggi di interrogazione. McGraw-Hill, Milan, 3 edition, 2009.

[6] Bergami, Magnani, and Montesi. On joining graphs. CoRR, abs/1608.05594, 2016.

[7] M. A. Bornea, Dolby, Kementsietsidis, Srinivas, Dantressangle, Udrea, and Bhattacharjee. Building an efficient rdf store over a relational database. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 121–132, New York, NY, USA, 2013. ACM.

[8] Das, Srinivasan, Perry, E. I. Chong, and Banerjee. A tale of two graphs: Property graphs as RDF in oracle. In Proceedings of the 17th International Conference on Extending Database Technology, EDBT 2014, Athens, Greece, March 24-28, 2014, pages 762–773, 2014.

[9] Dominguez-Sal, Urbon-Bayes, Gimenez-Vañó, Gomez-Villamor, N. Martínez-Bazan, and J. L. Larriba-Pey. Survey of graph database performance on the hpc scalable graph analysis benchmark. In Proceedings of the 2010 International Conference on Web-age Information Management, WAIM'10, pages 37–48, Berlin, Heidelberg, 2010. Springer-Verlag.

[10] Erling, Averbuch, Larriba-Pey, Chafi, Gubichev, Prat, M.-D. Pham, and P. Boncz. The ldbc social network benchmark: Interactive workload. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 619–630, New York, NY, USA, 2015. ACM.

[11] Erling and Mikhailov. Virtuoso: Rdf support in a native rdbms. In R. D. Virgilio, F. Giunchiglia, and L. Tanca, editors, Semantic Web Information Management, pages 501–519. Springer, 2009.

[12] G. H. Fletcher and P. W. Beck. Scalable indexing of rdf graphs for efficient join processing. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 1513–1516, New York, NY, USA, 2009. ACM.

[13] Gao, Yu, Qiu, Jiang, Wang, and Yang. Holistic top-k simple shortest path join in graphs. IEEE Trans. on Knowl. and Data Eng., 24(4):665–677, Apr. 2012.

[14] Hammack, Imrich, and Klavzar. Handbook of Product Graphs, Second Edition. CRC Press, Inc., Boca Raton, FL, USA, 2nd edition, 2011.

[15] Huang, D. J. Abadi, and Ren. Scalable sparql querying of large rdf graphs. PVLDB, 4(11):1123–1134, 2011.

[16] Hölsch and M. Grossniklaus. An algebra and equivalences to transform graph patterns in neo4j. Fifth International Workshop on Querying Graph Structured Data, 2016.

[17] Imrich and S. Klavzar. Product Graphs. Structure and Recognition. John Wiley & Sons, Inc., New York, NY, USA, 2nd edition, 2000.

[18] Imrich and I. Peterin. Recognizing cartesian products in linear time. Discrete Mathematics, 307(3-5):472–483, 2007.

[19] Leskovec, Lang, Dasgupta, and Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.

[20] Leskovec and Sosic. Snap: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology (TIST), 8(1):1, 2016.

[21] Paradies, Lehner, and C. Bornhovd. Graphite: An extensible graph traversal framework for relational database management systems. In Proceedings of the 27th International Conference on Scientific and Statistical Database Management, SSDBM '15, pages 29:1–29:12, New York, NY, USA, 2015. ACM.

[22] Perez, Arenas, and Gutierrez. Semantics and complexity of sparql. ACM Trans. Database Syst., 34(3):16:1–16:45, Sept. 2009.

[23] Schuh, Chen, and Dittrich. An experimental comparison of thirteen relational equi-joins in main memory. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 1961–1976, 2016.

[24] Siek, L.-Q. Lee, and Lumsdaine. The Boost Graph Library: User Guide and Reference Manual. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2002.

[25] Sun, Fokoue, Srinivas, Kementsietsidis, G. Hu, and Xie. Sqlgraph: An efficient relational-based property graph store. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1887–1901, New York, NY, USA, 2015. ACM.

[26] P. M. Weichsel. The kronecker product of graphs. Proceedings of the American Mathematical Society, 13(1):47–52, 1962.

[27] Yuan, P. Liu, Wu, Jin, Zhang, and Liu. Triplebit: A fast and compact system for large scale rdf data. Proc. VLDB Endow., 6(7):517–528, May 2013.