<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thanh Tran</string-name>
          <email>ducthanh.tran@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gu¨ nter Ladwig</string-name>
          <email>guenter.ladwig@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute AIFB, Karlsruhe Institute of Technology (KIT)</institution>
          ,
          <addr-line>76128 Karlsruhe</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>8</lpage>
      <abstract>
        <p>In recent years, the amount of structured RDF data available on the Web has been increasing rapidly. Efficient query processing that can scale to large amounts of RDF data has become an important topic. Significant efforts have been dedicated to the development of solutions for RDF data management. Along this line of research, we elaborate on a novel data partitioning strategy, which leverages the structure of the underlying data. This structure is represented in form of a parameterized structure index we propose for (RDF) data graphs called PIG. It is not only used for data partitioning but also has been designed to accelerate the matching of graph-structured queries against RDF data. In our benchmark against state-of-theart techniques, our structure-based approach for partitioning and query processing exhibits 7-8 times faster performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>In recent years, the amount of structured data available on the
Web has been increasing rapidly, especially RDF data consisting of
triples of the form hs; p; oi, where s is the subject, p is a property,
and o is the object. Such triples form a data graph G(V; L; E)
where the vertices V denote resources and their values, which are
connected by directed edges E, each endowed with a label from a
label set L. One example is shown in Fig. 1.
Supervises</p>
      <p>Supervises</p>
      <p>Supervises</p>
      <p>Supervises
4</p>
      <p>5
Name</p>
      <p>Name</p>
      <p>
        This development of a data web opens new ways for addressing
complex information needs. Search is no longer limited to
matching keywords against documents, but instead, structured queries
can be processed against web resources. In this regard,
conjunctive queries represent an important fragment of widely used
languages (SQL, SPARQL), which has been a focus of recent work on
RDF data management [
        <xref ref-type="bibr" rid="ref1 ref10 ref6">1, 10, 6</xref>
        ]. Essentially, a query of this type
consists of a set of triple patterns of the form p(s; o), where p is
a predicate and s and o are variables (V arq) or constants (Conq).
Intuitively speaking, variables appearing in the SELECT clause are
called distinguished variables (V arqd), otherwise undistinguished
variables (V arqu). Triple patterns constitute a query graph, as
illustrated in Fig. 2b.
      </p>
      <p>A match of a conjunctive query q on a graph G is a mapping h
from the variables of q to vertices of G such that the according
substitution of variables in the graph-representation of q would yield
a subgraph of G. Therefore a query match h can be interpreted as
a certain type of homomorphism (i.e. a structure preserving
mapping) from the “query graph” to the data graph. Because the amount
of data is enormous and largely increasing, scalability of this graph
pattern matching on the data web has become a key issue.</p>
      <sec id="sec-1-1">
        <title>State-of-the-art in RDF Management Triple stores developed</title>
        <p>
          for storing and querying RDF data such as Hexastore, RDF-3X and
SW-Store, can be distinguished along three dimensions: data
organization in terms of (1) partitioning and (2) indexing, and (3) query
processing. Vertical partitioning has been proposed to decompose
the data graph into n two-columns tables, where n is number of
properties. This has shown to be superior than the “one single giant
table” organization, and the use of property tables [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The
strategy used for indexing is based on the coverage of multiple lookup
patterns [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], sextuple indexing has been suggested, which
generalizes the work in [
          <xref ref-type="bibr" rid="ref1 ref6">6, 1</xref>
          ] such that for different access patterns,
retrieved data comes in a sorted fashion. This greatly accelerates
query processing, as fast (near linear) merge joins originally
proposed in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], can now be performed for many more combinations
of triple patterns. Further efficiency gains can be achieved by
finding an optimal query plan [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>Overview and Main Contributions We elaborate on concepts</title>
        <p>that improve the state-of-the-art in data partitioning and query
processing:</p>
        <sec id="sec-1-2-1">
          <title>Parameterized Structure Index for RDF Data: Generalizing</title>
          <p>
            work on XML data such as dataguide [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], we propose an
index called PIG that summarizes the structure of general
graph structured data like RDF. The size of this index can be
controlled by means of parameters (e.g. derived from
workload).
          </p>
        </sec>
        <sec id="sec-1-2-2">
          <title>Structure-based Partitioning: Based on PIG, we propose a</title>
          <p>structure-based partitioning scheme, where triples about
elements with the same structure are physically grouped. This
is to obtain a contiguous storage of data that likely co-occurs
in query answers.</p>
        </sec>
        <sec id="sec-1-2-3">
          <title>Structure-aware Query Processing: We propose to match</title>
          <p>the query against the structure index first, which is typically
much smaller than the data graph (c.f. examples in Fig. 2).
This helps to focus on data that satisfy the overall structure
of the query and on this basis, to proceed with standard
processing at the level of the data for only certain parts of the
query.</p>
          <p>B1: 3,7
t
A
rs
k
o
W
e
m
a</p>
          <p>N
B3: 8,9
B5:KIT,
MIT</p>
          <p>
            Our solution is complementary to the concepts for indexing and
query optimization [
            <xref ref-type="bibr" rid="ref10 ref8">10, 8</xref>
            ], and offers the following additional
benefits:
          </p>
          <p>Reduction of I/O Costs: We do not simply retrieve all data
that matches some given triple patterns but focus on the one
that satisfies the entire query structure.</p>
        </sec>
        <sec id="sec-1-2-4">
          <title>Reduction of Union and Joins: These operations are only</title>
          <p>needed only for some parts of the query. In the extreme cases
where no structure index matches can be found, we can skip
data access and joins at the data level completely.</p>
          <p>
            In a benchmark against the state-of-the-art techniques for data
partitioning and query processing used in SW-Store [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ], our
approach is 7-8 times faster for a PIG that is parameterized according
to the query workload.
          </p>
          <p>
            Outline We introduce PIG in Section 2. Partitioning, query
processing and parameterization are discussed in Section 3, 4 and 5.
Experiments along with results are discussed in Section 6 before
we review related work in Section 7 and conclude in Section 8. For
more details, we refer the interest readers to our technical report
[
            <xref ref-type="bibr" rid="ref2">2</xref>
            ].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>STRUCTURE INDEX FOR RDF</title>
      <p>We introduce the notion of a parameterizable structure index for
general graph-structured data (like RDF) called PIG.</p>
      <p>PIG - Parameterizable Index Graph PIG is a special graph
forming a compact representation of the data graph, whose vertices
stand for groups of data graph elements that have a similar or equal
structural “neighborhood”. We capture the concept of equal
structural neighborhood by the well-known notion of bisimulation
originating from the theoretical analysis of state-based dynamic
systems. We adopt this notion to capture both directions of edges.
We consider graph nodes v1; v2 as bisimilar (written: v1 v2),
if they cannot be distinguished by looking only at their outgoing
(forward bisimilarity) and incoming (backward bisimilarity) “edge
trees”. For better control, we parameterize our notion of
bisimulation by two sets of edge labels L1 and L2 that specify the labels of
edges taken into account for determining the forward and backward
bisimulation respectively:</p>
      <p>DEFINITION 1. Given a data graph G = (V; L; E) and two
edge label sets L1; L2 L, a L1-forward-L2-backward
bisimulation on G is a binary relation R V V on the vertices of G such
that for v; w 2 V , l1 2 L1 and l2 2 L2:
vRw and l1(v; v0) 2 E implies that there is a w0 2 V with
l1(w; w0) 2 E and v0Rw0,
vRw and l1(w; w0) 2 E implies that there is a v0 2 V with
l1(v; v0) 2 E and v0Rw0,
vRw and l2(v0; v) 2 E implies that there is a w0 2 V with
l2(w0; w) 2 E and v0Rw0,
vRw and l2(w0; w) 2 E implies that there is a v0 2 V with
l2(v0; v) 2 E and v0Rw0.</p>
      <sec id="sec-2-1">
        <title>Two vertices v; w will be called bisimilar (written v exists a bisimulation R with vRw. w), if there</title>
        <p>The relation is itself a bisimulation and can be represented
by the set of its equivalence classes called extensions, where all
vertices contained in one extension are pairwise bisimilar. These
extensions form a partition of the vertices V of the data graph, i.e.
a family P of pairwise disjoint sets whose union is V . The
index graph G of G is defined in terms of extensions and relations
between them:</p>
        <p>DEFINITION 2. Let G = (V; L; E) be a data graph and
its L1-L2-bisimulation. Vertices of the associated index graph
G = (V ; L1 [ L2; E ) are exactly G’s -equivalence classes
V = f[v] j v 2 V g, with [v] = fw 2 V j v wg. Labels
of G are exactly the labels of G. An edge with a label e is
established between two equivalence classes [v] and [w] exactly if
there are two vertices v 2 [v] and w 2 [w] such that there is
an edge e(v ; w ) in the data graph, i.e. E = fe([v ] ; [w ] ) j
e(v ; w ) 2 Eg.</p>
      </sec>
      <sec id="sec-2-2">
        <title>EXAMPLE 1. The index graph associated with the data graph in Fig. 1 is depicted in Fig. 2a, for</title>
        <p>L1 = fWorksAt ; Supervises ; AuthorOf g and L2 =
fWorksAt ; AuthorOf g. Elements that are indistinguishable
in terms of their edge-labelled, incoming and outgoing paths are
for instance 2, 4, and 6. Thus, they are elements of the extension
B4.</p>
        <p>
          Construction of PIG First, a bismulation for L1 and L2 is
calculated, using an adapted version of the algorithm for determining the
coarsest stable refinement of a partitioning [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The algorithm starts
with a partition consisting of a single block that contains all data,
and splits into smaller blocks until the partition is a forward
bisimulation. In order to perform both backward and forward bisimulation
for only the parameters L1 and L2, we essentially exploit the
observation that L1-forward-L2-backward bisimulation on a data graph
G = (V; L; E) coincide with forward bisimulation on an altered
data graph GL1L2 = (V; L1 [ fl j l 2 L2g; EL1L2 g) where
EL1L2 = fl(x; y) j l(x; y) 2 E; l 2 L1g [ fl (y; x) j l(x; y) 2
E; l 2 L2g. After having determined the bisimulation, the
resulting blocks from the partition P are used to form vertices in the
index graph according to Definition 1.
u
y
KIT
        </p>
        <p>KIT
y
B6</p>
        <sec id="sec-2-2-1">
          <title>Structured-based Partitioning Clearly, whether a graph vertex</title>
          <p>instantiates a variable of a query obviously depends on its
structural properties, i.e. the incoming and outgoing edges resp. paths.
Therefore, if nodes are physically grouped together based on
structural similarity, a group would contain more candidates for variable
instantiations. Thus, we apply structure-based partitioning to the
data graph by creating a physical group (e.g. a table) for every
vertex of the index graph, i.e. one group for every extension. Every
group contains the triples, which “describe” elements contained in
the corresponding extension. That is, triples are in the same group
when they contain the same properties and their subjects belong to
the same extension.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>RDF QUERY PROCESSING</title>
      <p>Query processing in our scenario is essentially finding a
homomorphism from the query graph q = (Vvar ] Vcon; L; P ) to
elements of the data graph G = (V; L; E). According to the
following proposition, the structure index can be exploited to perform this
task:</p>
      <sec id="sec-3-1">
        <title>PROPOSITION 1. Let G be a data graph with associated index graph G and let q be another graph such that there is a homomorphism h from q into G. Then h with h (v) = [h(v)] is a homomorphism from q into G .</title>
        <p>Proof Given a q-edge l(v1; v2), first observe that
l(h(v1); h(v2)) is a G-edge since h is a homomorphism.
Then we find that l(h (v1); h (v2)) = l([h(v1)] ; [h(v2)] ) is
a G -edge due to the definition of the index graph. 2</p>
        <p>Intuitively speaking, a mapping into G exists only if it does also
exist into the associated index graph G . Further, the resulting
extensions Bi = [h(v)] from V in G that match the nodes in
q will obviously contain the data graph matches h(v). Thus, the
procedure for query processing can be decomposed into two steps:
(1) finding matching extensions Bi on the index graph G first, (2)
then combining data elements retrieved for Bi to obtain the final
data graph matches. In the following, we denote G Idx as the
index used for the retrieval of elements from the index graph and
GIdx is used for accessing elements of the data graph.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Index Graph Matches</title>
      <p>Just like an answer, an index graph match is the result of a
homomorphic mapping h from the query graph q(Vq; Lq; P ) to the
index graph G (V ; L; E ). Elements of an index graph match
are vertices of G that are assigned to variables and constants of
q. For this computation, we propose a join procedure that returns a
result table R containing all matches h : Vq ! V . First, a set
of index graph candidate edges El is retrieved from G for every
query edge label l occurring in the query (using G Idx). Then,
these candidate sets are joined along the vertices of q to obtain R.
Figure 3 illustrates two matches.</p>
    </sec>
    <sec id="sec-5">
      <title>Data Matches for an Index Match</title>
      <p>The previous computation results in a set R of index graph
matches h : Vq ! V . Every element of these matches is an
extension which essentially is a set of vertices of the queried data
graph G. According to Proposition 1, every match of the query
against the data graph is “contained” in one of the index graph
matches calculated so far (e.g. h(v1) is in h (v1) = [h(v1)] ).
It suffices to focus on the index graph matches for the computation
of the data graph matches because only data contained by them
satisfy the overall query structure. We will now show that tree-shaped
parts containing only undistinguished variables can even be pruned</p>
      <p>DEFINITION 3. Every edgeless single-vertex rooted graph
(G; r) with G = (frg; L; ;) is L1-forward-L2-backward
treeshaped. Moreover let (G1; r1) and (G2; r2) with G1 =
(V1; L; E1) and G2 = (V2; L; E2) be two
L1-forward-L2backward tree-shaped graphs with disjoint vertex sets. Let v 2 V1
and let e 2 fl(v; r2) j l 2 L1g [ fl(r2; v) j l 2 L2g. Then the
rooted graph (G3; r1) with G3 = (V1 [ V2; L; E1 [ E2 [ feg) is</p>
      <sec id="sec-5-1">
        <title>L1-forward-L2-backward tree shaped.</title>
        <p>Given such a tree-shaped query part, a stronger property can be
asserted for the index graph matches, i.e. they contain all and only
data graph matches such that no further processing at the level of
the data graph is needed:</p>
      </sec>
      <sec id="sec-5-2">
        <title>PROPOSITION 2. Let G be a data graph with associated index</title>
        <p>graph G (where is a L1-forward-L2-backward bisimulation)
and let q be a L1-forward-L2-backward tree-shaped graph with
root r. Let h0 be a homomorphism from q to G . Then for every
data graph node v 2 h0(r), there is a homomorphism h from q to
G with h(r) = v</p>
        <p>Proof. We do an induction on the maximal tree-depth of G0. As
base case, note that for tree depth 0, the claim is trivially satisfied.
For any greater tree depth, subsequently consider every child node
v0 of r in G0. Assume l 2 L1 and l(r; v0) 2 E0 (the backward case
follows by symmetry). Then, by definition of bisimularity, there
must exist a w 2 h(v0) with l(v; w) 2 E. We chose h(v0) = w.
Now we invoke the induction hypothesis for the subtree of G0 with
root v0 which yields us the values h(u0) for all successors of v0. So
we have constructed h with the desired properties. 2</p>
        <p>In words: if the query is of the aforementioned tree shape, then
every data node from any extension associated to the query root r
by an index graph match is a data graph match for r. Hence, before
computing data graph matches, the respective query parts can be
removed.</p>
        <p>
          After pruning the query, we use another join procedure to
compute a result table where rows capture bindings to distinguished
query variables. These bindings are data elements contained in the
index graph matches h , which satisfy the structure as well as the
concrete elements (i.e. constants and distinguished variables)
mentioned in the query. Query edges are processed successively. At
every iteration, triples are retrieved from GIdx and are joined with
the (intermediate) results set. More precisely, given the query edge
p(x; y), the triples hx 7! s; p; y 7! oi matching p(x; y) are
considered. They are fetched from the corresponding block [s] of the
structure-based partitioned data graph index GIdx, where s 2 [s]
and h : x ! [s] . Intuitively speaking, only triples with subjects
that are contained in the index match [s] are retrieved from disk.
Thus, only subjects that are known to satisfy the query structure are
considered. This is different from the standard approaches [
          <xref ref-type="bibr" rid="ref1 ref8">1, 8</xref>
          ],
where all triples matching the query edge are taken into account,
which might contain subjects not in [s] .
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>EXAMPLE 2. Fig. 2b depicts a query, which asks for authors x</title>
        <p>working at the same place as their supervisors y, namely a place
called KIT . One match on the index graph is h1 = fu 7!
B3; x 7! B4; y 7! B1; z 7! B2; KIT 7! B5g. Based on this,
we know that data elements belonging to extensions obtained from
the index graph match satisfy the query structure, e.g. elements in
B4 are authors of z, supervised by y, work at some place u that has
a name. A tree-like part that can be pruned is AuthorOf (x; z). It
produces the index graph match hB4 AuthorOf B2i. Since 2; 4;
and 6 in B4 are already known to be authors of some z, no further
data processing is needed for this query part. However, we have
to look at the data to verify that elements in B4 work at KIT ,
and are supervised by some y also working at KIT . For this,
we need to retrieve and join the triple matches for hx W orksAt
ui, hy W orksAt ui, hu N ame KIT i, hy Supervises xi. Note
that the query example here contains cycles. In practice, there are
many queries exhibiting simpler structure, which offer greater
potential for query pruning. In the extreme cases where no index
graph matches can be found, we can skip the entire second step
to avoid data access and joins completely.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Data Matches for Index Match Classes</title>
      <p>The procedure presented in the previous section computes
answers data matches for an index graph match h . In order to
compute all data matches, this has to be repeated for all index graph
matches h in R. However, the diverse matches might partially
overlap. To formalize and computationally exploit this, we
introduce the following notion:</p>
      <p>DEFINITION 4. Let G be a data graph and G be an index
graph. For a query q(Vq; Lq; P ), let P 0 P be a subset of
the query atoms. We say that two matches h1 and h2 from q
into G are in the same P 0-match class if for all l(v; w) 2 P 0
h1 (v) = h2 (v) and h1 (w) = h2 (w). The set of partial
data matches of a P 0-match class M consists of all the functions
: vertices(P 0) ! V satisfying
1. l( (v); (w)) 2 E for all l(v; w) 2 P 0 and</p>
      <sec id="sec-6-1">
        <title>2. there is a h</title>
        <p>have [ (v)]
2 M such that for all v 2 vertices(P 0) we
= h (v).</p>
        <p>Fig. 4 displays the case for our example, where P 0 = P n
fAuthorOf (x; z)g, i.e. the two matches h1 ; h2 overlap on all
query edges (resp. atoms) except AuthorOf (x; z). Given such an
equivalence between matches, the aforementioned na¨ıve strategy
for the independent evaluation of matches might result in many
redundant disk accesses and join operations. The following
proposition enables a more efficient computation of data graph matches:
PROPOSITION 3. Let G be a data graph, let G
ated index graph, and let q be a conjunctive query.
be the
associ1. Every match of q(Vq; Lq; P ) into G is contained in the set
of partial data matches associated to one of the P -match
classes.
2. Let P 0 P and l(v; w) 2 P n P 0. For every P 0 [ fl(v;
w)gmatch class M with a corresponding partial data match ,
there is a P 0-match class M 0 with corresponding data match
0 such that there is a h0 2 M 0 with</p>
        <p>M = fh
h0 (w)g,
j h (v) =</p>
        <p>h0 (v) and h (w) =
2 f 0g ./ ffv 7! v1; w 7!
h0 (v); v2 2 h0 (w); l(v1; v2) 2 Eg.
v2g j v1 2</p>
        <p>Proof. The first part of the proposition is an immediate
consequence of Proposition 1 and Definition 4. For the
proposition’s second part, the claim follows from choosing M 0 = fh j
h (v) = h0 (v) for all v 2 vertices(P 0) and some h0 2 M g
and 0 : v 7! (v). 2</p>
        <p>In words, the preceding proposition ensures that all data graph
matches of a query can be obtained by a successive refinement of
match classes and their associated data matches. Consequently, the
optimized procedure for computing query data matches consists of
two main parts: (1) update of match classes and (2) evaluation of
match classes.</p>
        <p>Match classes are defined w.r.t. query vertices. For the first part,
match classes are thus created (updated) according to the query
vertices that are added during the process of join processing. At first,
there is only one initial match class R consisting of all index graph
matches (line 1). During the processing of query atoms p(x; y),
the set of classes M C becomes more and more “fine-grained”, as
any matches not coinciding on how x and y are mapped to V will
be distributed to different match classes (line 11). A hash map H,
which associates pairs of index matches (x-y-instantiations) with
match classes, is employed to check for overlaps. Note that during
the processing of the atoms in P , the number of classes grows as
high as the number of matches, i.e. every match constitutes its own
class.</p>
        <p>For answer computation, results are retrieved and stored in Ai
for every match class Mi encountered during processing of atoms
in P . Since every class groups together matches that overlap
w.r.t. the already considered part of the query, all matches are
evaluated only once during query processing. After all atoms have been
evaluated for all match classes, intermediate data matches for each
match class are combined to obtain the final data matches (line 20).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>OPTIMALLY PARAMETERIZED PIG</title>
      <p>We briefly discuss the complexity of query processing and upon
the results, elaborate on the parametrization of PIG.</p>
      <p>
        Complexity For RDF query processing, we have
O(edgemaxjP j), where jP j denotes the number of query
atoms and edgemax is jl(v1; v2)j with l being the property
instantiated by the largest number of edges, e.g. the size of the
largest table used in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This is due to multiplicative cost of join
processing in the general case – but the use of special indexes [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
gives us near-linear behavior in practice. Compared to this, the
complexity of our approach is as follows:
      </p>
      <p>PROPOSITION 4. For a query with jP j atoms, the time and
space for calculating the answers through the computation of
index matches and the subsequent processing of data graph matches
is bounded by O(edgemaxidxjP j +jRj edgemaxdatajPprunedj).
Algorithm 1: Evaluating Match Classes
Input: G(V; L; E), q(Vq; Lq; P ), Index matches</p>
      <p>R = fh1 ; h2 ; : : : ; hn : Vq ! V g.</p>
      <p>Data: Two-columns table Ep containing matches
h0 : fx; yg ! V retrieved for a predicate p, a set of
match classes M C = fM1; M2; : : :g with Mi R,
intermediate partial matches Ai for every match class
Mi, partial hash map H : V V ! M C mapping
pairs of index matches to match classes.</p>
      <p>Result: Table A, where each row represents an answer.
1 set M C = fRg.</p>
    </sec>
    <sec id="sec-8">
      <title>2 foreach p(x; y) 2 P do</title>
    </sec>
    <sec id="sec-9">
      <title>3 foreach Mi 2 M C do</title>
      <p>4 H ;</p>
    </sec>
    <sec id="sec-10">
      <title>5 foreach h 2 Mi do</title>
      <sec id="sec-10-1">
        <title>6 if H contains (h (x); h (y)) as key then</title>
        <p>7 move h from Mi to the match class</p>
        <p>H(h (x); h (y)).</p>
        <p>This is because our approach involves an additional
component, i.e. the cost O(edgemaxidxjP j) for computing index
matches, where edgemaxidx = maxl2L jEl j is bounded by
the size of the index graph. However, we need only to join
along a pruned version Ppruned P of the query. So we
obtain O(jRj edgemaxdatajPprunedj) where edgemaxdata =
maxB1;B22V ;l2L jfl(v; v0) j v 2 B1; v0 2 B2gj. Thus,
edgemaxdata is bounded by the size of the largest extension.</p>
        <p>Parametrization of PIG In comparison, less data have to be
retrieve from G, i.e. edgemaxdata edgemax, and also, fewer
joins are required, i.e. jPprunedj instead of P . The overhead
introduced to achieve this O(edgemaxidxjP j). The parametrization
of L1 and L2 has the following effect on the overhead and gain
introduced by our structure-based query processing:</p>
        <p>When more labels are included in L1 and L2, the index graph
becomes larger. Thus, edgemaxidx becomes larger. On the
other hand, more labels can be considered for query pruning,
thus potentially increasing the fraction of prunable query parts
(jP n Pprunedj). Also, the physical groups obtained from
structurebased partitioning become more fine-grained, thereby reducing the
size of edgemaxdata.</p>
        <p>Thus L1 and L2 shall be parameterized w.r.t the query
workload to keep the index as small as possible. As a general strategy,
the parametrization shall be computed based on frequent prunable
query parts, i.e. search for prunable query parts in the workload
and then, use labels occurring in these parts as parameters.</p>
        <p>
          More optimized systems have been built that implement the
concepts of indexing and query optimization [
          <xref ref-type="bibr" rid="ref10 ref8">8, 10</xref>
          ]. Since these
aspects are orthogonal, we use the work in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] as baseline, which is
the state-of-the-art in, and is purely focused on partitioning and
query processing. We compare our work called structure-based
query processing
        </p>
        <p>
          We now summarize the experiment reported in details in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. It
is based on DBLP and several synthetic datasets containing
several millions of triples created using the Lehigh University
Benchmark. A set of 30 queries categorized into five classes ranging from
single-atom query to complex structured graph-shaped queries has
been used. We use two parameterizations for the experiments: (1)
SPB is based on G0B calculated using backward bisimulation only
and (2) SPFB uses G0F B , a restricted back- and forward
bisimulation adapted to the workload by setting L1; L2 to include only
labels occurring in prunable query parts. G0F B is much smaller
(4%-30%) and the indexes for G0B makes up only a small
percentage (0.08%-2%) of the data graph.
        </p>
        <p>Performance In Fig. 5a+b, total query processing time is
plotted for both approaches. Clearly, SQP is faster than VPQP. For
DBLP in particular, SQP performs a factor of 7-8 faster. Further,
we note that SQP exhibits much better performance than VPQP
w.r.t. queries that have more complex structures, i.e. the queries
Q4-Q15. SQP is slightly worse w.r.t. simple queries, i.e. the
single triple patterns Q1-Q3. This suggests that with more complex
queries, the overhead incurred by the additional structure-level
processing can be outweighed by the accumulated gain.</p>
        <p>Breaking down the results, we found that the gain in loading and
join is small for single-atom queries, but substantial for more
complex structured queries. In these cases, it outweighs the additional
cost of index matching.</p>
        <p>Scalability We measured the average query performance for
LUBM with varying size (i.e. generated for 1, 5, 10, 20 and 50
universities). We found that the performance of our improves with
the size of the data. In particular, the gain for load and join
increases in larger proportion than the overhead incurred for index
match. This is because match performance is determined by the
size of the index graph. This depends on the structure but not on
the size of the data graph. Thus, the match time does not
necessarily increase when the data graph becomes larger. The positive
effect of data filtering (IO reduction) and query pruning (load and
join) however, correlates with the data size.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>RELATED WORK</title>
      <p>We already compare our approach against triple stores along
different dimensions and now, discuss related work on XML and
semi-structured data processing.</p>
      <p>
        Structure Index Structural information has been exploited for
indexing semi-structured and XML data. Unlike Dataguide [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and extensions of this concept for dealing with XML and
semistructured data [
        <xref ref-type="bibr" rid="ref4 ref7">4, 7</xref>
        ], (1) PIG can be constructed from general
graph-structured data, (2) its concept of structural similarity is more
fine grained, as it rests on characterizing the neighborhood via trees
instead of paths, and (3) it can be parameterized such that only
certain edge labels are considered.
      </p>
      <p>
        Query Processing Our structure-aware processing is similar to
the work on using structure indexes for (a) evaluating path queries,
and for (b) finding occurrences of twig patterns (tree-structure) in
a XML database based on multi-predicate merge join [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] such as
TwigJoin [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These techniques rely on tree structures, and in
particular, assume that relationships among query elements are of the
      </p>
      <p>SQP
load(VPQP‐SQP)
join(VPQP‐SQP)
# removed query 
nodes
VPQP
c) Time for Queries on DBLP
types parent-child or ancestor-descendant whereas in our setting,
both query and data are generally graph structured. We employ
rather a combination of (a) and (b): similar to (a), we match the
query against the index graph. Similar to (b), intermediate results
are computed and then joined to obtain the final answers. While
intermediate results in (b) are simple root-to-leaf paths of the twig
pattern, we retrieve and operate only on data elements that satisfy
the structural constraints of the entire graph-structured query.</p>
    </sec>
    <sec id="sec-12">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>We have proposed techniques for RDF data partitioning and
query processing that can exploit the underlying structure to
improve the management of RDF data, based on a novel structure
index call PIG. In an principled manner, we showed that this
approach is faster than the state-of-the-art, especially for complex
structured queries.</p>
      <p>As future work, we will elaborate on how existing work on RDF
query optimization can be used for the proposed structure-based
query processing technique. Further, strategies proposed for
optimizing updates of XML structure indexes will be studied and
adopted.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Hollenbach</surname>
          </string-name>
          .
          <article-title>Scalable Semantic Web Data Management Using Vertical Partitioning</article-title>
          .
          <source>In VLDB</source>
          , pages
          <fpage>411</fpage>
          -
          <lpage>422</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Anonymous</surname>
          </string-name>
          .
          <article-title>Efficient RDF Data Management Using Structure Indexes for General Graph Structured Data</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2010</year>
          . http://rs385.rapidshare. com/files/337240978/paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bruno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Holistic twig joins: optimal XML pattern matching</article-title>
          .
          <source>In SIGMOD Conference</source>
          , pages
          <fpage>310</fpage>
          -
          <lpage>321</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Buneman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          .
          <article-title>Adding Structure to Unstructured Data</article-title>
          .
          <source>In ICDT</source>
          , pages
          <fpage>336</fpage>
          -
          <lpage>350</lpage>
          . Springer Verlag,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Goldman</surname>
          </string-name>
          and
          <string-name>
            <surname>J. Widom.</surname>
          </string-name>
          <article-title>DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases</article-title>
          .
          <source>In VLDB</source>
          , pages
          <fpage>436</fpage>
          -
          <lpage>445</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Harth</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          .
          <article-title>Optimized Index Structures for Querying RDF from the Web</article-title>
          .
          <source>In LA-WEB</source>
          , pages
          <fpage>71</fpage>
          -
          <lpage>80</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaushik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shenoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bohannon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Gudes</surname>
          </string-name>
          .
          <article-title>Exploiting Local Similarity for Indexing Paths in Graph-Structured Data</article-title>
          .
          <source>In ICDE</source>
          , pages
          <fpage>129</fpage>
          -
          <lpage>140</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum. RDF-</surname>
          </string-name>
          <article-title>3X: a RISC-style engine for RDF</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>647</fpage>
          -
          <lpage>659</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Paige</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Tarjan</surname>
          </string-name>
          .
          <article-title>Three partition refinement algorithms</article-title>
          .
          <source>SIAM J. Comput.</source>
          ,
          <volume>16</volume>
          (
          <issue>6</issue>
          ):
          <fpage>973</fpage>
          -
          <lpage>989</lpage>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Weiss</surname>
          </string-name>
          , P. Karras,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          .
          <article-title>Hexastore: sextuple indexing for semantic web data management</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1008</fpage>
          -
          <lpage>1019</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Naughton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>DeWitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Lohman</surname>
          </string-name>
          .
          <article-title>On Supporting Containment Queries in Relational Database Management Systems</article-title>
          .
          <source>In SIGMOD Conference</source>
          , pages
          <fpage>425</fpage>
          -
          <lpage>436</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>