<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Community-Based RDF Graph Partitioning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fredah Banda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boris Motik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Oxford</institution>
          ,
          <addr-line>Oxford</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <fpage>33</fpage>
      <lpage>48</lpage>
      <abstract>
<p>A common approach to processing large RDF datasets is to partition the data in a cluster of shared-nothing servers and then use a distributed query evaluation algorithm. It is commonly assumed in the literature that the performance of query processing in such systems is limited mainly by network communication. In this paper, we show that this assumption does not always hold: we present a new RDF partitioning method based on Louvain community detection, which drastically reduces communication, but without a corresponding decrease in query running times. We show that strongly connected partitions can incur workload imbalance among the servers during query processing. We thus further refined our technique to strike a balance between reducing communication and spreading processing more evenly, and we show that this technique can reduce both communication and query times.</p>
      </abstract>
      <kwd-group>
        <kwd>RDF</kwd>
        <kwd>Louvain modularity</kwd>
        <kwd>Community partitioning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Resource Description Framework (RDF) is a widely-used data model for
publishing data on the Web. Applications typically rely on RDF systems to store
and answer SPARQL queries over RDF data; Wylot et al. [28] have recently
surveyed the state of the art in RDF systems. Centralised systems such as Jena [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
Sesame [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], RDFSuite [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and RDF-3X [21] store and process the entire dataset
on a single server. However, the size of RDF datasets has been steadily
increasing. Thus, datasets containing several billion triples are routinely encountered
in practice, and they often exceed the capacity of a single server.
      </p>
      <p>
        A common solution to that problem is to partition a dataset across a cluster
of shared-nothing, interconnected servers and use a distributed query processing
strategy. Numerous distributed RDF systems have been developed along these
principles, such as YARS2 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], 4store [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], H-RDF-3X [16], Trinity.RDF [29],
SHARD [25], SHAPE [18], Partout [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], AdPart [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], TriAD [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], SemStore [27],
DREAM [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], WARP [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and a prototype based on RDFox [24]. Abdelaziz et
al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] surveyed 22 and evaluated 11 distributed RDF stores on a variety of data
and query loads, and they showed AdPart [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and TriAD [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to be among the
best-performing ones. Computing joins presents the main technical difficulty in
distributed query processing since all triples participating in a join need to be
transmitted over the network to a common server. Since network
communication is much slower than accessing local disk or RAM, it is commonly assumed
that reducing or even eliminating communication is key to the scalability of
distributed RDF stores. Thus, many existing RDF partitioning techniques aim to
place groups of related triples on the same server and thus decrease
communication cost. Moreover, to further reduce communication, many systems duplicate
data; that is, they place the same triple on more than one server. While data
duplication can even completely eliminate all communication, it can also lead to
a significant increase in storage requirements [16].
      </p>
      <p>
        Motivated by these common assumptions, we developed a new RDF
partitioning technique based on Louvain community detection [
        <xref ref-type="bibr" rid="ref5">5</xref>
], a well-known
algorithm for detecting groups of highly connected vertices in a graph. By
assigning communities to servers, we hoped to minimise communication and thus
also query answering times. We target the distributed query answering technique
by Potter et al. [24]: this technique aims to avoid communication when triples to
be joined are co-located, which is achieved by using a special index for locating
join counterparts and by employing a specific left-deep query evaluation plan.
      </p>
      <p>
        We evaluated our technique on the LUBM [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and WatDiv [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] benchmarks
against two well-known data partitioning techniques: subject hash partitioning
(which assigns each triple to the server obtained by hashing the subject value),
and the weighted vertex partitioning with pruning by Potter et al. [24] (see
Section 3). To our surprise, our results suggested that the relationship between
network communication and query answering performance is not as
straightforward as commonly assumed. Our technique was very effective at reducing
network communication, which was consistently significantly below that of the other two
techniques. However, this reduction in communication did not translate into a
reduction in query answering times; in fact, our technique was sometimes even
the worst-performing. A careful analysis revealed this to be due to the workload
imbalance between the servers. In particular, when servers contain strongly
interconnected communities, query answers are only produced by data on one or
just a handful of servers; thus, only some servers are involved in query
processing, while others are largely idle. In contrast, with subject hash partitioning, the
data contributing to query answers is likely to be evenly distributed across the
cluster, and so query processing tends to be evenly distributed as well. Since
the overall amount of work is the same in both cases and server processing is
parallelised, subject hash partitioning can be more efficient than community
partitioning, even though the latter involves significantly less communication.
      </p>
<p>Motivated by this observation, we further refined our community
partitioning algorithm with the aim of striking a balance between producing strongly
connected communities and distributing data evenly in the cluster. Our further
experiments show that, on certain queries, such an approach can reduce network
communication as well as improve query answering times.</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>The Resource Description Framework (RDF) data model is commonly used for
publishing structured data on the Web. RDF data is constructed using resources,
which can be IRIs, blank nodes, or literals (e.g., strings or integers). An RDF
triple is an expression of the form ⟨s, p, o⟩, where s, p, and o are resources called
subject, predicate, and object, respectively. An RDF graph G is a finite set of
triples. SPARQL is the standard language for querying RDF graphs. In this
paper, we consider only the fragment of SPARQL covering conjunctive queries
(CQs), that is, queries of the form</p>
<p>SELECT ?X1 … ?Xn WHERE { s1 p1 o1 . … . sn pn on },
where each si, pi, and oi is a resource or a variable. Each si pi oi is called a triple
pattern. Conjunctive queries correspond to the select-project-join fragment of
the relational algebra, and they form the basis of SPARQL.</p>
<p>In this paper, we study the problem of answering CQs over an RDF graph
G stored in a cluster consisting of ℓ shared-nothing servers connected by a
network. For convenience, we identify each server with a unique integer k between
1 and ℓ. We assume that the RDF graph is partitioned into ℓ partition
elements P_1, …, P_ℓ; that is, each P_k is the subset of G stored at server k, and
⋃_{1≤k≤ℓ} P_k = G. In our work we assume that data is stored without replication,
so P_k ∩ P_{k′} = ∅ for all 1 ≤ k &lt; k′ ≤ ℓ.</p>
<p>Distributed query evaluation algorithms need to know how to locate triples
that can match different triple patterns, and so they often depend on the details
of data partitioning. In this paper, we evaluate conjunctive SPARQL queries
using the distributed query answering technique by Potter et al. [24], which
aims to avoid communication when triples participating in a join reside on the
same server. We next briefly summarise the main aspects of this algorithm on
an example of evaluating a conjunctive query</p>
      <p>SELECT ?X1 ?X2 ?X3 WHERE { ?X1 r ?X2 . ?X2 s ?X3 }
over an RDF graph split into partition elements P_1 = {⟨a, r, b⟩, ⟨b, s, c⟩} and
P_2 = {⟨b, s, d⟩}. The query is evaluated using index nested loop joins over a
left-deep query evaluation plan. The query is sent to all servers, and each server starts
asynchronously matching the first triple pattern of the query. In our example,
server 2 cannot match the first triple pattern in P_2, so it stops evaluation. In
contrast, server 1 matches the first triple pattern to triple ⟨a, r, b⟩ and thus
produces a partial binding that maps variables ?X1 and ?X2 to resources a
and b, respectively. Server 1 then proceeds to the second triple pattern, which
can be matched both in P_1 and P_2. To take this into account, server 1 consults
a special index called occurrence mappings, which informs the algorithm of the
location of possible join counterparts. In our example, this index tells server 1
that triples containing b (i.e., the current value of ?X2) occur in the subject
position in P_1 and P_2. Therefore, server 1 sends a message to server 2 that
contains the partial binding and instructs server 2 to continue matching the second
triple pattern. Upon the receipt of that message, server 2 tries to match the
second triple pattern in P_2; that is, it extends the binding by additionally mapping ?X3 to
d and thus produces one answer to the query. Moreover, server 1 also matches
the second triple pattern in P_1; that is, it extends the binding by additionally mapping
?X3 to c and thus produces another answer to the query. No further matches are
possible, so the servers use a specially designed protocol to inform each other
that query evaluation is complete.</p>
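      <p>The hop-based evaluation just described can be re-enacted in a few lines of Python. The sketch below is illustrative only (it is not the implementation of Potter et al.; the partition P, the occurrence mapping occ, and the match helper are our own toy constructions): each server matches the first triple pattern locally, and the occurrence mapping then names the servers that may hold join counterparts for the current value of ?X2.

```python
P = {1: {("a", "r", "b"), ("b", "s", "c")},
     2: {("b", "s", "d")}}

# occurrence mapping: resource -> servers where it occurs in subject position
occ = {}
for server, triples in P.items():
    for s, _, _ in triples:
        occ.setdefault(s, set()).add(server)

def match(triples, pattern, binding):
    """Yield extensions of a partial binding against one triple pattern."""
    for triple in triples:
        extended = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):                 # variable position
                if extended.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:                      # constant mismatch
                ok = False
                break
        if ok:
            yield extended

query = [("?X1", "r", "?X2"), ("?X2", "s", "?X3")]
answers = []
for server in P:                                     # every server starts
    for b in match(P[server], query[0], {}):
        # the occurrence mapping names the servers that may hold join
        # counterparts for the current value of ?X2
        for target in occ.get(b["?X2"], set()):
            answers.extend(match(P[target], query[1], b))
# answers binds ?X3 to c (found on server 1) and to d (found on server 2)
```
</p>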
<p>While the above summary omits several details, it highlights certain
important properties of the approach. First, servers use a special index to locate join
counterparts, which allows the algorithm to be applied to any partition of an
RDF graph without replication (i.e., the algorithm makes no assumptions about
how triples are allocated to servers as long as the index is constructed
appropriately). Second, partial bindings correspond to the answers of the prefixes of the
triple patterns; in our example, the partial binding above is an answer to the prefix containing just the
first triple pattern. Third, distributed joins are processed not by moving data,
but by moving computation. In our example, join processing 'hops' from server
1 to server 2. Fourth, the processing at all servers is asynchronous: a server can
process messages with partial bindings as soon as they are received, and it does
so independently from the other servers. As a result of these optimisations, this
approach outperformed TriAD and AdPart on certain data and query loads [24].</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>
        Wylot et al. [28] have recently surveyed the state of the art in both centralised
and distributed RDF stores, and Abdelaziz et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] have presented a detailed
survey of the distributed RDF stores. Prominent distributed RDF stores
include YARS2 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], 4store [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], H-RDF-3X [16], Trinity.RDF [29], SHARD [25],
SHAPE [18], AdPart [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], TriAD [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], SemStore [27], DREAM [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], WARP [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
Partout [
        <xref ref-type="bibr" rid="ref8">8</xref>
], and RDFox [24], to name just a few. In this section, we briefly
summarise the data partitioning methods commonly used in these systems.
      </p>
      <p>
        To partition RDF data, most systems aim to place triples with the same
subject onto the same server. This is motivated by an empirical study of over
three million SPARQL queries, which showed that approximately 60% of joins in
queries are subject–subject joins, 35% are subject–object joins, and less than 5%
are object–object joins [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Thus, by assigning triples with the same subject to the
same partition element, no communication is needed for the most common type
of join occurring in practice, which is widely seen as critical for good performance
of query answering in distributed RDF systems [16].
      </p>
      <p>
        Many RDF systems partition data using subject hashing [
        <xref ref-type="bibr" rid="ref14">14, 29, 26, 23</xref>
        ]. In
particular, to partition an RDF graph G into ℓ partition elements, each triple
⟨s, p, o⟩ ∈ G is stored at server (h(s) mod ℓ) + 1, where h is a suitable hash
function that maps the triple's subject to an integer. As mentioned in the previous
paragraph, subject hashing requires no communication for subject–subject joins.
Moreover, the technique is very simple and can be implemented in a streaming
fashion, which makes it very popular in practice. Finally, the technique typically
produces very balanced partitions. The main drawback is that communication
is usually needed for subject–object or object–object joins. To improve this,
certain systems (e.g., TriAD [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]) also hash triples by object; this increases the
likelihood that triples participating in a join are co-located, but it effectively
duplicates the amount of data in the system, which can be prohibitive.
      </p>
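      <p>As an illustration, subject hash partitioning can be sketched in a few lines of Python (a toy sketch, not the implementation of any of the cited systems; the function name and the use of Python's built-in hash are our own choices):

```python
def subject_hash_partition(graph, num_servers):
    """Assign each triple (s, p, o) to server (h(s) mod num_servers) + 1."""
    partitions = {k: set() for k in range(1, num_servers + 1)}
    for s, p, o in graph:
        # Python's built-in hash stands in for the hash function h; a real
        # system would use a fixed hash so all servers agree across runs
        k = (hash(s) % num_servers) + 1
        partitions[k].add((s, p, o))
    return partitions

G = {("a", "r", "b"), ("a", "s", "c"), ("b", "s", "d")}
P = subject_hash_partition(G, 2)
# triples sharing a subject land on the same server, so subject-subject
# joins never cross the network
```
</p>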
<p>The RDF partitioning methods we discuss in the rest of this section use the
METIS graph partitioning algorithm [17], so we next briefly discuss the latter.
This algorithm is not directly applicable to RDF: it accepts as input a directed,
unlabelled graph G = (V, E, w) consisting of a finite vertex set V, an edge set
E ⊆ V × V, and a weighting function w that assigns to each vertex v ∈ V a
non-negative weight w(v). The algorithm aims to split the set of vertices V into
mutually disjoint sets V_1, …, V_n such that
– Σ_{v ∈ V_i} w(v) is roughly the same for each V_i, and
– the number of edges connecting vertices in distinct V_i and V_j is minimised.
The decision version of this problem is NP-hard. However, the METIS algorithm
has been highly tuned to produce good solutions on graphs that are typically
encountered in practice.</p>
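      <p>While METIS itself is a highly engineered external library, its two objectives are easy to state in code. The following sketch (our own illustration, not part of METIS) evaluates a candidate split by computing its edge cut and its per-part vertex weights:

```python
def edge_cut(edges, part_of):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if part_of[u] != part_of[v])

def part_weights(weights, part_of, num_parts):
    """Total vertex weight assigned to each part."""
    totals = [0] * num_parts
    for v, w in weights.items():
        totals[part_of[v]] += w
    return totals

# two triangles joined by a bridge edge (2, 3)
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)]
weights = {v: 1 for v in range(6)}
part_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}   # split at the bridge
# this split cuts a single edge and balances the vertex weights 3 vs 3
```
</p>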
<p>METIS is used in n-hop vertex partitioning [16]. In particular, this technique
first removes all triples whose predicate is rdf:type: the objects of such triples
typically have many incoming edges, which can confuse METIS. The resulting
graph is converted into a directed, unlabelled graph in the obvious way, with the
weight of each vertex set to one. The vertices of this graph are next partitioned
using METIS into ℓ partitions V_1, …, V_ℓ. Finally, each triple ⟨s, p, o⟩ ∈ G is
assigned to the partition element P_k such that s ∈ V_k. Intuitively, partitioning vertices
using METIS minimises the number of triples spanning partition elements,
which should in turn minimise communication during query answering.
However, since not all communication is eliminated in this way, the approach further
replicates data. In particular, for a chosen integer n, the approach adds to each
partition element P_k all triples of G that are necessary to ensure that any query
containing n 'hops' (i.e., joins whose computation requires traversing at most
n connected vertices) can be evaluated purely within each server. While this is
very effective at reducing communication, it can be impractical as it incurs a
significant overhead in terms of data storage [16]. A variant of this approach has
been used in the D-SPARQ system by Mutharaju et al. [20].</p>
<p>Weighted vertex partitioning with pruning [24] is also based on METIS.
It also first prunes the given RDF graph by removing all triples of the form
⟨x, rdf:type, y⟩, as well as all triples of the form ⟨x, p, lit⟩ where lit is a literal: such
triples are also prone to introducing hub vertices, but they rarely participate in
object–object joins, so it is beneficial to remove them before partitioning. The
resulting graph is partitioned using METIS, but the weight of each vertex v is set
to the number of triples in G that have v as subject. Since the METIS algorithm
aims to make the sum of the weights of the vertices in each partition roughly the same,
this effectively ensures that the final partition elements are of roughly the same size.
Finally, triples of the input RDF graph are assigned to partition elements as in
the previous paragraph, but without any replication.</p>
    </sec>
    <sec id="sec-4">
      <title>Community-Based RDF Graph Partitioning</title>
<p>We now present two variants of our approach for partitioning RDF data. To
make this paper self-contained, in Section 4.1 we first summarise the well-known
community detection technique based on Louvain modularity, which we use as
the basis for our work. Then, in Section 4.2 we discuss how we adapt this
technique to the problem of partitioning RDF data.</p>
      <sec id="sec-4-1">
        <title>Louvain Algorithm for Community Detection</title>
        <p>Graph analysis tasks often involve the concept of a community, which is
intuitively a set of vertices that are more interconnected among themselves than to
the remaining vertices of a graph. This notion can be captured using the concept
of graph modularity, which has been formalised by Newman [22] as follows.</p>
<p>Consider a graph with n vertices where the weight of the connection between
vertex i and vertex j is given by A_{i,j}. We assume that the graph is undirected,
so A_{i,j} = A_{j,i} for all 1 ≤ i, j ≤ n. Note that the graph is allowed to contain
self-loops; that is, A_{i,i} ≠ 0 is allowed. Moreover, we assume that no weight is
negative. The degree of a vertex i is the sum of the weights of the connections of
i; that is, k_i = Σ_{1≤j≤n} A_{i,j}. Finally, m = 0.5 · Σ_{1≤i≤n} k_i is the sum of all the
edge weights, where the factor 0.5 takes into account that the graph is undirected,
so the sum includes the weight of each edge twice.</p>
<p>Now let C be a set of community labels, and let us assume that each vertex
i has been assigned to some community c_i ∈ C. The modularity of such a
community assignment is the fraction of the graph edges that connect vertices in the
same community minus the expected fraction had the edges been distributed at
random. This value can be computed by</p>
        <p>M = (1/2m) · Σ_{1≤i,j≤n} [A_{i,j} − (k_i k_j)/(2m)] · δ(c_i, c_j),   (1)
where δ(c_i, c_j) is the Kronecker delta symbol; that is, δ(c_i, c_j) = 1 if c_i = c_j, and
δ(c_i, c_j) = 0 otherwise. For unweighted graphs (i.e., if each A_{i,j} is either 0 or 1),
modularity is a number between −0.5 and 1. Given two community assignments,
the one with the higher modularity better captures the community structure
inside the graph. Thus, the problem of community detection can be reduced to
the problem of finding a community assignment with maximum modularity.</p>
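        <p>Equation (1) can be computed directly from an adjacency matrix. The following Python sketch (our own illustration) evaluates the modularity of a community assignment; on a graph made of two triangles joined by a single edge, the natural two-community split scores higher than the trivial single-community assignment, whose modularity is always 0:

```python
def modularity(A, communities):
    """Newman's modularity for an undirected weighted adjacency matrix A."""
    n = len(A)
    k = [sum(row) for row in A]          # vertex degrees k_i
    m = 0.5 * sum(k)                     # total edge weight
    M = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:   # Kronecker delta
                M += A[i][j] - k[i] * k[j] / (2 * m)
    return M / (2 * m)

# two triangles joined by the single edge (2, 3)
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
good = modularity(A, [0, 0, 0, 1, 1, 1])      # one community per triangle
trivial = modularity(A, [0, 0, 0, 0, 0, 0])   # everything in one community
```
</p>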
        <p>
          The Louvain algorithm [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] solves this problem approximately using a
hierarchical greedy method. The algorithm first assigns each vertex to its own
community, after which it iteratively performs the following two phases.
        </p>
<p>In the first phase, the algorithm tries to increase modularity by moving a
vertex to a community of one of its neighbours. The increase in modularity of moving
vertex i into community c can be efficiently calculated by</p>
<p>ΔM(i, c) = [(Σ_in + k_{i,in})/(2m) − ((Σ_tot + k_i)/(2m))²] − [Σ_in/(2m) − (Σ_tot/(2m))² − (k_i/(2m))²],
where Σ_in is the sum of the weights of all the edges within community c, Σ_tot is
the sum of all the weights of the edges incident to some vertex in community
c, and k_{i,in} is the sum of the weights of the edges from i to all vertices in
community c. Based on this idea, the Louvain algorithm considers in the first
phase each vertex i and each neighbour j of i, and it calculates the increase in
modularity ΔM(i, c_j) (where, as above, c_j is the current community of vertex j).
Once all neighbours of i have been considered, if ΔM(i, c_j) for some neighbour
j is maximal and positive, vertex i is moved into community c_j. This process is
repeated as long as moving a vertex is possible.</p>
        <p>In the second phase, the Louvain algorithm groups all vertices inside the same
community and creates a new graph whose vertices are communities detected in
the first phase. The weight of an edge connecting two communities c and c′ is
defined as the sum of the weights of all edges between vertices i and j that were
assigned to c and c′, respectively, in the first phase. The algorithm then applies
the first phase to this new graph, and the entire process is repeated as long as
applying the first phase leads to a further increase in modularity.
        <p>The notion of modularity has been extended to directed graphs [19]. One
might intuitively expect directed modularity to be more suitable to RDF since
RDF graphs are directed. However, note that triple patterns in queries can
traverse an edge of an RDF graph in either direction. Because of that, we intuitively
expect undirected modularity to be more applicable to our setting.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Our Partitioning Algorithms for RDF</title>
        <p>In this section, we describe two variants of a new partitioning algorithm for
RDF datasets that are both based on the Louvain community detection
algorithm. The pseudo-code of both approaches is shown in Algorithm 1. Function
Community-RDF-Partitioning accepts as input an RDF graph G and the
desired number of partitions `, and it produces partition elements P1; : : : ; P`
that can be used for distributed query answering. From a high-level point of
view, partition elements are produced by first partitioning the resources of the
input graph into communities using the Louvain algorithm, allocating each
community to a server, and finally allocating each triple to the server containing the
community of the subject. We next discuss these steps in more detail.</p>
<p>First, in line 2, the algorithm creates a pruned RDF graph G′ by removing all
triples of the form ⟨x, rdf:type, y⟩ or ⟨x, p, lit⟩ where lit is a literal. Such triples
can make y and lit hubs, which can confuse the community detection algorithm.
The same step is used in weighted vertex partitioning with pruning [24], and in
n-hop vertex partitioning [16].</p>
<p>Second, in line 4, the algorithm applies the Louvain community detection
algorithm to G′. Our initial experiments revealed that the Louvain algorithm,
as described in Section 4.1, can create very large communities, and moreover that
the number of vertices in each community can vary significantly. Assigning triples
to servers based on such communities is likely to produce partition elements of
very different sizes, which is undesirable. Therefore, we use a slightly modified
variant of the Louvain algorithm: vertex i is moved to a community c_j of a
neighbour j of i only if community c_j contains at most maxSize vertices after
the move. As a result, our modified algorithm creates communities of at most
maxSize elements. The value of maxSize should be selected in line 3 depending
on which community allocation variant is used in line 6, and we discuss the
details in Sections 4.2.1 and 4.2.2. The community detection returns a set S
of communities; that is, each set T ∈ S contains all resources of exactly one
community. Note that, for all T, T′ ∈ S, we have T ∩ T′ = ∅ whenever T ≠ T′;
that is, each community contains distinct resources.</p>
        <p>Algorithm 1 Community-Based RDF Partitioning
1: procedure Community-RDF-Partitioning(G, ℓ)
2:   G′ := {⟨s, p, o⟩ ∈ G | p ≠ rdf:type and o is not a literal}
3:   Determine maxSize depending on the allocation variant
4:   S := LouvainCommunityDetection(G′, maxSize)
5:   for each 1 ≤ k ≤ ℓ do A_k := ∅
6:   AllocateCommunities
7:   for each 1 ≤ k ≤ ℓ do
8:     P_k := {⟨s, p, o⟩ ∈ G | s ∈ A_k}
9: procedure AllocateCommunities-Tight
10:   for each T ∈ S do
11:     O_T := {o | ∃s, p : ⟨s, p, o⟩ ∈ G′ and s ∈ T}
12:   for each 1 ≤ k ≤ ℓ do R_k := ∅
13:   while S ≠ ∅ do
14:     for each T ∈ S and each 1 ≤ k ≤ ℓ do
15:       if |R_k ∪ T ∪ O_T| ≤ maxSize then
16:         rank_k^T := |R_k ∩ (T ∪ O_T)|
17:       else
18:         rank_k^T := 0
19:     Choose T ∈ S and k such that rank_k^T is largest
20:     A_k := A_k ∪ T
21:     R_k := R_k ∪ T ∪ O_T
22:     Remove T from S
23: procedure AllocateCommunities-Loose
24:   for each T ∈ S do
25:     Choose k such that |A_k| ≤ |A_{k′}| for each 1 ≤ k′ ≤ ℓ
26:     A_k := A_k ∪ T</p>
<p>Third, the algorithm assigns the resources in each community to a server.
To this end, for each server 1 ≤ k ≤ ℓ, the algorithm introduces a set A_k that
will contain the resources assigned to server k. All sets A_k are initially empty in
line 5, and they are populated inside the allocation step in line 6. We developed
two variants of community allocation, tight and loose, the details of which we
discuss shortly. Either way, after the community allocation step, each resource
r occurring in G in subject position can be found in exactly one set A_k.</p>
<p>Fourth, the algorithm allocates in line 8 each triple of G to the partition
element P_k such that A_k contains the subject of the triple.</p>
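        <p>The four steps just described correspond to lines 1-8 of Algorithm 1 and can be sketched as follows (a toy illustration, not our actual implementation: the community detection and allocation procedures are stubbed out, and the literal test is a simplistic convention invented for this example):

```python
def is_literal(o):
    # simplistic convention for this sketch: quoted terms are literals
    return o.startswith('"')

def community_rdf_partitioning(G, num_servers, detect, allocate):
    """Lines 1-8 of Algorithm 1 with pluggable detection and allocation."""
    # line 2: prune rdf:type triples and triples with literal objects
    G_pruned = {(s, p, o) for (s, p, o) in G
                if p != "rdf:type" and not is_literal(o)}
    S = detect(G_pruned)                  # line 4: community detection
    A = allocate(S, num_servers)          # line 6: A[k] = resources of server k
    # lines 7-8: every triple follows its subject's community; note that the
    # pruned triples are still allocated, pruning only guides detection
    return {k: {(s, p, o) for (s, p, o) in G if s in A[k]}
            for k in range(1, num_servers + 1)}

def stub_detect(g):
    # stand-in for the modified Louvain algorithm
    return [{"a", "b"}, {"c", "d"}]

def round_robin_allocate(S, num_servers):
    A = {k: set() for k in range(1, num_servers + 1)}
    for i, T in enumerate(S):
        A[(i % num_servers) + 1] |= T
    return A

G = {("a", "r", "b"), ("b", "rdf:type", "C"),
     ("c", "r", "d"), ("c", "q", '"5"')}
P = community_rdf_partitioning(G, 2, stub_detect, round_robin_allocate)
```
</p>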
<p>We next discuss our two variants of community allocation. Both take as input
the set S of communities produced by the Louvain algorithm.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Tight Community Allocation</title>
<p>Tight community allocation aims to distribute communities to servers such that,
after triples are allocated to servers in line 8, the resources assigned to each of
the servers are highly connected. Note that, during the triple allocation process,
even though each T ∈ S contains distinct resources, allocating T to some server
k will bring into P_k resources occurring as objects of triples whose subjects are
in T. To ensure high connectivity of the resources assigned to any particular server,
in the tight variant we set maxSize to N/ℓ in line 3, where N is the number of
resources of G′; that is, each community is limited to at most N/ℓ resources.</p>
<p>The tight community allocation variant aims to ensure that the resources
introduced as part of the triple allocation step occur on as few servers as possible.
To achieve this, for each community T ∈ S, the algorithm first constructs in
line 11 the set O_T of resources occurring as objects in triples whose subjects
are in T. Moreover, in line 12, for each server k, the algorithm introduces a
set R_k that is analogous to A_k, but that will accumulate resources occurring
in both subject and object position on server k. Then, the algorithm greedily
assigns communities to servers using the loop in lines 13–22. In each pass through
the loop, the algorithm considers each unassigned community T ∈ S and each
possible target server k. For each T and k, the algorithm calculates rank_k^T in
line 16 as the overlap between the resources R_k currently assigned to
server k and the resources occurring in triples whose subjects are in T. In line 19,
the algorithm chooses the community T and the server k with the best rank (with
ties broken arbitrarily), and then it allocates T to k by adding T to A_k (line 20)
and by recording all relevant resources in R_k (line 21).</p>
<p>There is one final detail: if adding T to server k would assign more than
maxSize resources to server k, then rank_k^T is set to zero (line 18). This is key to
ensuring that the final partition elements are of roughly the same size (i.e., that no
partition element is much larger than any other).</p>
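        <p>Tight allocation can be sketched as follows (our own simplified rendering of lines 9-22 of Algorithm 1; for clarity, server-community pairs that would exceed maxSize are skipped rather than ranked zero, and a fallback handles the case where nothing fits anywhere):

```python
def allocate_tight(communities, num_servers, max_size):
    """communities: list of (T, O_T) pairs, where O_T contains the objects
    of triples whose subject is in T (line 11 of Algorithm 1)."""
    A = {k: set() for k in range(1, num_servers + 1)}  # subjects per server
    R = {k: set() for k in range(1, num_servers + 1)}  # subjects and objects
    pending = list(communities)
    while pending:
        best = None                      # (rank, community index, server)
        for idx, (T, O_T) in enumerate(pending):
            for k in A:
                if len(R[k] | T | O_T) > max_size:
                    continue             # would overflow server k
                rank = len(R[k] & (T | O_T))
                if best is None or rank > best[0]:
                    best = (rank, idx, k)
        if best is None:                 # nothing fits: least-loaded server
            T, O_T = pending.pop(0)
            k = min(A, key=lambda s: len(R[s]))
        else:
            _, idx, k = best
            T, O_T = pending.pop(idx)
        A[k] |= T
        R[k] |= T | O_T
    return A

communities = [({"a"}, {"b"}), ({"b"}, set()), ({"c"}, {"d"})]
A = allocate_tight(communities, num_servers=2, max_size=3)
# "b"'s community joins "a" on server 1 (they share resource b), while "c"
# goes to server 2 because adding it to server 1 would exceed max_size
```
</p>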
      </sec>
      <sec id="sec-4-4">
        <title>Loose Community Allocation</title>
<p>As we have already mentioned in Section 1 and as we discuss in detail in
Section 5, when an RDF graph is partitioned into highly connected communities,
the computational load during query answering can be concentrated on only some servers,
which can have a negative effect on query answering times. As a possible
solution to this problem, we developed the loose community allocation approach,
which aims to strike a balance between reducing communication and spreading
the computational workload more evenly across servers.</p>
<p>The loose approach is quite simple: each community T ∈ S is assigned to
the server k that currently has the fewest resources assigned to it. Ties between servers
are broken arbitrarily. To make this approach effective, we must ensure that
the Louvain method detects sufficiently many communities. Thus, when the loose
approach is used, maxSize is set in line 3 to 30. We found empirically that this
value seems to produce the best results across a range of datasets.</p>
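        <p>Loose allocation is short enough to render almost verbatim (a sketch of lines 23-26 of Algorithm 1, with ties broken by dictionary order rather than arbitrarily):

```python
def allocate_loose(communities, num_servers):
    """Lines 23-26 of Algorithm 1: greedy balanced allocation."""
    A = {k: set() for k in range(1, num_servers + 1)}
    for T in communities:
        # pick the server with the fewest resources assigned so far
        k = min(A, key=lambda s: len(A[s]))
        A[k] |= T
    return A

A = allocate_loose([{"a", "b"}, {"c"}, {"d"}, {"e", "f"}], 2)
```
</p>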
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Evaluation</title>
<p>We investigated empirically how different data partitioning approaches affect
query answering performance. To this end, we compared our tight (C-T) and
loose (C-L) community-based partitioning variants, the weighted vertex
partitioning with pruning (PRN), and subject hash partitioning (HSH); the last two
techniques were described in Section 3. Our main objective was to see how
different methods affect query evaluation time and the amount of communication. In
this section, we first describe our experimental setting, then present the results
of our experiments, and finally discuss our findings.</p>
      <sec id="sec-5-1">
        <title>Experimental Setup</title>
        <p>
          Test Datasets and Queries We evaluate our approaches on two well-known
Semantic Web datasets. First, we used WatDiv [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], a well-known synthetic
benchmark that aims to simulate real-world data and query workloads. WatDiv
provides a data generator, which we used to produce an RDF graph containing
1,099,208,068 triples. Moreover, WatDiv provides 20 query templates, which have
been divided into four classes: star (S), linear (L), snowflake (F), and complex
(C) queries. Since star queries contain only subject–subject joins, they do not
incur any communication under any of the data partitioning approaches we consider
and are thus not interesting for this paper. Consequently, we considered only the
13 query templates consisting of linear, snowflake, and complex queries. These
templates contain a parameter that is usually replaced with a resource from the
graph; however, queries obtained in such a way tend to be quite localised and
thus do not incur a lot of distributed processing. To obtain more complex and
thus more interesting queries, we replaced the parameter in each of the query
templates with a fresh variable.
        </p>
        <p>
          Second, we used the well-known Lehigh University Benchmark (LUBM) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
We generated a dataset with 10,000 universities, which resulted in 1,382,072,494
triples. The LUBM benchmark comes with 14 queries; however, most of them do
not return any results when reasoning is not enabled. Thus, we used the seven
queries (T1–T7) by Zeng et al. [29] that compensate for the lack of reasoning,
and three new, complex cyclic queries (N1–N3) introduced by Potter et al. [24].
Test Systems We produced the Prune (PRN) and Hash (HSH) partitions of
the datasets, and ran all queries, using the distributed RDFox prototype by
Potter et al. [24]. The system is written in C++. We implemented C-L and C-T
in Python, making sure to generate partitions compatible with RDFox. We used
the well-known library by Thomas Aynaud for Louvain community detection
(https://github.com/taynaud/python-louvain), which we modified as discussed
in Section 4 to limit the maximum community size.
        </p>
        <p>Hardware We performed our experiments on ten memory-optimised r5.8xlarge
instances of the Amazon Elastic Compute Cloud (EC2). Each of the ten instances
was equipped with 32 vCPUs and 256 GB of RAM, and was running Ubuntu
Server 16.04 LTS (HVM). The servers were connected using 10 Gigabit Ethernet.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Evaluation Protocol and Results</title>
        <p>In each experiment, we partitioned a dataset into ten partition elements using
each of the four data partitioning techniques. We then loaded each partition element
into one of the ten servers and ran all queries while measuring wall-clock query
answering times. To compensate for the variability in query answering performance,
we reran each query ten times, and we report the average time.</p>
        <p>In addition, for each query, we also measured the number of partial binding
messages that each server sends to any other server. These numbers are stable
across repeated runs, and we report the total number of messages sent by
all servers. These numbers do not include the messages needed to communicate
query answers. The number of the latter messages is determined primarily by the
dataset, and we expect it to be largely the same for all data partitioning strategies,
assuming that query answers are distributed evenly in the cluster. In this way,
the number of partial binding messages reflects the amount of communication
incurred by distributed join processing, which is the main objective of our study.</p>
        <p>The average times (in seconds) for answering all queries on all datasets are
shown in Table 1, and the corresponding total numbers of partial binding
messages are shown in Table 2.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Discussion</title>
        <p>As shown in Table 1, query evaluation times vary significantly across the datasets
and partitioning techniques. On WatDiv, C-L and HSH are fastest on five queries
each, C-T is fastest on two queries, and PRN is fastest on one query. Moreover,
the times are generally close across queries and partitioning styles: the biggest
difference is on query F2, where C-T is 6.95 times slower than HSH. On LUBM,
if we disregard the trivial queries T4 to T6, PRN is fastest on four queries, and C-T
is fastest on the remaining three queries. Moreover, the query times seem to
differ more than on WatDiv: on query N3, PRN is 16.9 times faster than HSH;
and on queries N1 and N2, the fastest partitioning approach is more than eight
times faster than the slowest.</p>
        <p>The picture is completely different for the numbers of partial binding
messages shown in Table 2. On the WatDiv dataset, C-T requires the smallest number
of messages for all but two queries, and C-L is slightly more efficient than
C-T on one more query. The difference is particularly pronounced on query F5,
where C-T outperforms HSH in terms of the number of messages by a factor of 36.
Across all queries, HSH sends 4.2 times more messages than C-T. On the LUBM
dataset, the difference is even more pronounced. On queries N1–N3, C-T and
C-L are about six orders of magnitude more efficient than HSH, and around
four orders of magnitude more efficient than PRN, in terms of the number of messages.</p>
        <p>These results show that community partitioning is exceptionally effective
at reducing communication during join processing, but unfortunately this does
not seem to translate into an analogous reduction in query answering times. To
understand why this is the case, we first analysed the number of partial binding
messages sent by each server: while Table 2 shows only the numbers aggregated
over all servers, the per-server numbers tend to be fairly uniform for HSH, but
not for C-L and C-T. This suggested an unexpected side-effect of community
partitioning: query processing tends to be confined to only some communities,
which can lead to significant differences in server workload.</p>
        <p>To verify this hypothesis, we measured the workload of all servers. The
distributed query answering approach by Potter et al. [24] evaluates queries
left-to-right, matching atoms using index nested loop joins, so the number of times that an
atom is matched at each server provides a good proxy for the server workload.
We modified the RDFox prototype to capture the number of atom matches, and
we reran all queries on all datasets. The number of matches on each server does
not depend on the exact run, so we ran each query just once. Moreover, the total
amount of work (i.e., the total number of atom matches) depends only on the
input RDF graph, and not on the data partitioning technique.</p>
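<p>The atom-match proxy can be illustrated with a small sketch of left-to-right evaluation. The code below is ours, not RDFox's: for simplicity it scans the local triple set instead of using indexes, but the quantity it counts, the number of times an atom is matched, is the same.</p>

```python
def evaluate(triples, atoms, bindings=None, counter=None):
    """Evaluate a conjunctive query left-to-right over a local triple set.

    atoms: list of (s, p, o) patterns; strings starting with '?' are
    variables. Returns (answers, total number of atom matches)."""
    if bindings is None:
        bindings, counter = {}, [0]
    if not atoms:
        return [dict(bindings)], counter[0]
    head, rest = atoms[0], atoms[1:]
    answers = []
    for triple in triples:  # a real store would use an index here
        new = dict(bindings)
        if all(_unify(term, value, new)
               for term, value in zip(head, triple)):
            counter[0] += 1  # one more atom match on this server
            answers += evaluate(triples, rest, new, counter)[0]
    return answers, counter[0]

def _unify(term, value, bindings):
    if term.startswith('?'):
        if term in bindings:
            return bindings[term] == value
        bindings[term] = value
        return True
    return term == value
```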
        <p>Tables 3 and 4 summarise the results of this experiment for WatDiv and
LUBM, respectively. Showing the matches for each of the ten servers would be
inconvenient, so, for each query and partitioning technique, we show only the
maximum, minimum, and median number of matches across all servers.</p>
        <p>These results confirmed our initial hypothesis. On WatDiv, C-L and C-T
incur considerable differences in server workload on most queries. This difference
also exists with PRN, but it is somewhat less pronounced. Finally, there is almost
no difference in the workload for HSH, which is unsurprising given that triples
are spread uniformly across the cluster. In contrast, the workload was roughly
the same for all data partitioning techniques on LUBM.</p>
        <p>We interpret these results as follows. The overall query answering time seems
to be determined primarily by (i) the distribution of the workload in the cluster, and
(ii) the communication cost. Thus, if the communication cost of two
strategies is roughly the same, then the strategy with a more even workload
distribution is more efficient: the total amount of work is the same, so the query
answering time depends on the time taken by the slowest server. This seems to be
happening on WatDiv: C-T requires only about 4.2 times fewer partial binding
messages than HSH, and this improvement is insufficient to offset the stark
difference in server workload. As a result, HSH often outperforms C-L and C-T
on WatDiv in terms of query times. This is particularly pronounced on queries
F1–F5 and L4, where the gap in both performance and workload
distribution seems significant. In contrast, if the workload distribution is roughly the same,
then the strategy with less communication is more efficient. This seems to be the
case for LUBM: all four data partitioning strategies distribute the workload in
roughly the same way; however, C-L and C-T incur significantly less
communication than HSH on queries N1–N3, which seems to correlate with a significant
boost in query performance. Although C-L and C-T also incur significantly less
communication than PRN, the absolute numbers of partial answer messages
for PRN are quite small, and so performance in these cases seems to depend
mainly on other factors.</p>
        <p>These results are quite surprising. It is commonly assumed in the
literature that communication between servers is the main factor determining the
performance of distributed join processing, but we show that performance can
also depend on the workload distribution in the cluster. In particular, the C-T
variant of our technique was extremely effective at reducing communication, but
this was often coupled with worse query answering times. We used this as the
main motivation for the C-L variant of our technique, which distributes
communities across servers evenly and thus aims to strike a balance between reducing
communication and distributing the workload. Indeed, as one can see in Table 3,
the workload is generally more evenly distributed for C-L than for C-T.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we presented two new techniques for partitioning RDF data based on
graph community detection. Both techniques aim to reduce communication
during distributed join processing, and we compared them empirically with several
well-known data partitioning techniques. While communication is commonly
assumed to be the main source of overhead, we show that the distribution of the
workload across the servers in a cluster can also have a significant impact on the
performance of query answering in distributed RDF systems.</p>
      <p>In future, we will refine our data partitioning technique to further improve the
balance between network communication and server workload and thus ensure
consistent performance on a wide range of datasets and query loads. A particular
challenge is to develop techniques that can partition an RDF graph in a streaming
fashion, that is, by reading the input graph sequentially and keeping only small
subsets of the graph in memory at once.</p>
      <p>16. Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL Querying of Large RDF Graphs. PVLDB 4(11), 1123–1134 (2011)
17. Karypis, G., Kumar, V.: A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)
18. Lee, K., Liu, L.: Scaling Queries over Big RDF Graphs with Semantic Hash Partitioning. PVLDB 6(14), 1894–1905 (2013)
19. Leicht, E.A., Newman, M.E.J.: Community Structure in Directed Networks. Phys. Rev. Lett. 100, 118703 (2008)
20. Mutharaju, R., Sakr, S., Sala, A., Hitzler, P.: D-SPARQ: Distributed, Scalable and Efficient RDF Query Engine. In: ISWC (Posters &amp; Demos). CEUR Workshop Proceedings, vol. 1035, pp. 261–264 (2013)
21. Neumann, T., Weikum, G.: The RDF-3X Engine for Scalable Management of RDF Data. VLDB Journal 19(1), 91–113 (2010)
22. Newman, M.E.J.: Analysis of Weighted Networks. Phys. Rev. E 70, 056131 (2004)
23. Papailiou, N., Konstantinou, I., Tsoumakos, D., Karras, P., Koziris, N.: H2RDF+: High-performance Distributed Joins over Large-scale RDF Graphs. In: Big Data. pp. 255–263 (2013)
24. Potter, A., Motik, B., Nenov, Y., Horrocks, I.: Dynamic Data Exchange in Distributed RDF Stores. IEEE TKDE 30(12), 2312–2325 (2018)
25. Rohloff, K., Schantz, R.E.: Clause-Iteration with MapReduce to Scalably Query Data Graphs in the SHARD Graph-Store. In: DIDC. pp. 35–44 (2011)
26. Sun, J., Jin, Q.: Scalable RDF Store Based on HBase and MapReduce. In: ICACTE. vol. 1, pp. V1-633 (2010)
27. Wu, B., Zhou, Y., Yuan, P., Jin, H., Liu, L.: SemStore: A Semantic-Preserving Distributed RDF Triple Store. In: CIKM. pp. 509–518 (2014)
28. Wylot, M., Hauswirth, M., Cudré-Mauroux, P., Sakr, S.: RDF Data Storage and Query Processing Schemes: A Survey. ACM Computing Surveys 51(4), 1–36 (2018)
29. Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z.: A Distributed Graph Engine for Web Scale RDF Data. PVLDB 6(4), 265–276 (2013)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abdelaziz</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harbi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khayyat</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalnis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data</article-title>
          .
          <source>PVLDB</source>
          <volume>10</volume>
          (
          <issue>13</issue>
          ),
          <year>2049</year>
          –
          <year>2060</year>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Al-Harbi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdelaziz</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalnis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mamoulis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ebrahim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sahli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning</article-title>
          .
          <source>VLDB Journal</source>
          <volume>25</volume>
          (
          <issue>3</issue>
          ),
          <volume>355</volume>
          –
          <fpage>380</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Alexaki</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christophides</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karvounarakis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plexousakis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tolle</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>The ICS-FORTH RDFSuite: Managing Voluminous RDF Description Bases</article-title>
          . In: SemWeb (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Aluc</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartig</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Özsu</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daudjee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Diversified Stress Testing of RDF Data Management Systems</article-title>
          . In: ISWC. pp.
          <volume>197</volume>
          –
          <issue>212</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>V.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guillaume</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lambiotte</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lefebvre</surname>
          </string-name>
          , E.:
          <article-title>Fast unfolding of communities in large networks</article-title>
          .
          <source>J. Stat. Mech</source>
          .
          <year>2008</year>
          (
          <volume>10</volume>
          ),
          <source>P10008</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Broekstra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kampman</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema</article-title>
          . In: ISWC. pp.
          <volume>54</volume>
          –
          <issue>68</issue>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Carroll</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dickinson</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Jena: Implementing the Semantic Web Recommendations</article-title>
          . In: WWW. pp.
          <volume>74</volume>
          –
          <issue>83</issue>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Galarraga</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schenkel</surname>
          </string-name>
          , R.:
          <article-title>Partout: a distributed engine for efficient RDF processing</article-title>
          .
          <source>In: WWW</source>
          . pp.
          <volume>267</volume>
          –
          <issue>268</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gallego</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez-Prieto</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          , de la Fuente,
          <string-name>
            <surname>P.:</surname>
          </string-name>
          <article-title>An Empirical Study of Real-World SPARQL Queries</article-title>
          .
          <source>CoRR abs/1103</source>
          .5043 (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , Heflin, J.:
          <article-title>LUBM: A benchmark for OWL knowledge base systems</article-title>
          .
          <source>J. Web Semantics</source>
          <volume>3</volume>
          (
          <issue>2</issue>
          ),
          <volume>158</volume>
          –
          <fpage>182</fpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Gurajada</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seufert</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miliaraki</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theobald</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>TriAD: A Distributed SharedNothing RDF Engine based on Asynchronous Message Passing</article-title>
          . In: SIGMOD. pp.
          <volume>289</volume>
          –
          <issue>300</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hammoud</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabbou</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nouri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beheshti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakr</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication</article-title>
          .
          <source>PVLDB</source>
          <volume>8</volume>
          (
          <issue>6</issue>
          ),
          <volume>654</volume>
          –
          <fpage>665</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Harris</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lamb</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shadbolt</surname>
          </string-name>
          , N.:
          <article-title>4store: The Design and Implementation of a Clustered RDF Store</article-title>
          .
          <source>In: SSWS. CEUR Workshop Proceedings</source>
          , vol.
          <volume>517</volume>
          , pp.
          <volume>94</volume>
          –
          <issue>109</issue>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>YARS2: A Federated Repository for Querying Graph Structured Data from the Web</article-title>
          . In: ISWC. pp.
          <volume>211</volume>
          –
          <issue>224</issue>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schenkel</surname>
          </string-name>
          , R.: WARP:
          <article-title>Workload-aware replication and partitioning for RDF</article-title>
          . In: Workshop at ICDE. pp.
          <volume>1</volume>
          –
          <issue>6</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>