<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Fully Parallel Framework for Analyzing RDF Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Long Cheng</string-name>
          <email>long.cheng@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Spyros Kotoulas</string-name>
          <email>spyros.kotoulas@ie.ibm.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomas E Ward</string-name>
          <email>tomas.ward@nuim.ie</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgios Theodoropoulos</string-name>
          <email>theogeorgios@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Durham University</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IBM Research</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National University of Ireland Maynooth</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Technische Universita ̈t Dresden</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We introduce the design of a fully parallel framework for quickly analyzing large-scale RDF data over distributed architectures. We present three core operations of this framework: dictionary encoding, parallel joins and indexing processing. Preliminary experimental results on a commodity cluster show that we can load large RDF data very fast while remaining within an interactive range for query processing.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>R
D
F
D
a
t
a
e
n
c
o
d
e
tI
n
e
g
e
r
s
l
o
a
d
I
n
d
e
x
e
s
itrr
e
e
v
e
C
re an
s d
ltu id
s ta
e</p>
      <sec id="sec-1-1">
        <title>Data loading</title>
      </sec>
      <sec id="sec-1-2">
        <title>Local Data File</title>
        <p>Data Flow with
Intermachine Communication
liftr
e
i
n
g
re eR
s f
lu i</p>
        <p>n
ts ed
d
iitr
s
b
u
t
e</p>
      </sec>
      <sec id="sec-1-3">
        <title>Data querying</title>
      </sec>
      <sec id="sec-1-4">
        <title>Local Data Flow</title>
        <p>D
i
s
tad itrb
a tu
e
d
ij
o
n
s</p>
      </sec>
      <sec id="sec-1-5">
        <title>Optional Local</title>
      </sec>
      <sec id="sec-1-6">
        <title>Data Flow</title>
        <p>approaches, we combine elements of both to achieve fast loading while still keeping
query time in an interactive range.</p>
        <p>Our parallel framework is shown in Figure 1. The entire data process is divided into
two parts: data loading and data querying. (1) The equal-size partitioned raw RDF data
at each computation node (core) is encoded in parallel in the form of integers and then
loaded in memory in local indexes (without redistributing data). (2) Based on the query
execution plan, the candidate results are retrieved from the built indexes, and parallel
joins are applied to formulate the final outputs. In the latter process, local filters1 at each
node can be used to reduce/remove the retrieved results that have no contribution for
the final outputs, and the redistributed data during parallel joins can be used to create
additional sharded indexes.
2</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Core Operations</title>
      <p>
        Triple Encoding. We utilise a distributed dictionary encoding method, as described
in [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ], to transform RDF terms into 64-bit integers and to represent statements (aligned
in memory) using this encoding. Using a straightforward technique and an efficient
skew-handling strategy, our implementation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is shown to be notable faster than [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
and additionally supports small incremental updates.
      </p>
      <p>
        Parallel Joins. Based on existing indexes, we can lookup the candidate results for each
graph pattern and then use joins to compute SPARQL queries. For the most critical join
operation, parallel hash joins [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are commonly used in current RDF systems. However,
they always bring in load-imbalance problems. The reason is that the terms in
realworld Linked Data are highly skewed [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In comparison to that, our implementation
adopts the query-based distributed joins we proposed in [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">10–13</xref>
        ] so as to achieve more
efficient and robust performance on each join operation in the presence of different
query workloads.
      </p>
      <p>
        Two-tier Indexing. We adopt an efficient two-tier index architecture we presented
in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. We build the primary index l1 for the encoded triples at each node using a
modified vertical partitioning approach [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Different from [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], to speedup the load
process, we do not do any sort operation, but just insert each tuple in a
corresponding vertical table. For join operations, we could have to redistribute a large number of
1 Though our system supports filtering operations, we do not give the details in this paper.
(intermediate) results around all computation nodes, which is normally very costly. To
remedy this, we employ a bottom-up dynamic programming-like parallel algorithm to
build a multi-level secondary index (l2 ... ln), based on each query execution plan. With
that, we will simply copy the redistributed data of each join to the local secondary
indexes, and these parts of data will be re-used by other queries that contain patterns in
common, so as to reduce (or remove) the corresponding network communication during
the execution. In fact, according to the terminology regarding graph partitioning used
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the k-level index lk on each node in our approach will dynamically construct a
k-hop subgraph. This means that our method essentially does dynamic graph-based
partitioning based on the query load, starting from an initial equal-size partitioning.
Therefore, our approach can combine the loading speed of similar-size partitioning with the
execution speed of graph-based partitioning in an analytical environment.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Preliminary Results</title>
      <p>Experiments were conducted on 16 IBM iDataPlexr nodes with two 6-core Intel Xeonr
X5679 processors, 128GB of RAM and a single 1TB SATA hard-drive, connected using
Gigabit Ethernet. We use Linux kernel version 2.6.32-220 and implement our method
using X10 version 2.3, compiled to C++ with gcc version 4.4.6.</p>
      <p>Data Loading. We test the performance of triple encoding and primary index building
through loading 1.1 billion triples (LUBM8000 with indexes on P, PO and PS) in
memory. The entire process takes 340 secs, for an average throughput of 540MB or 3.24M
triples per second (254 secs to encode triples and 86 secs to build the primary index).
12
10
) 8
c
e
s
( 6
e
m
i
tn 4
u
R
2
0</p>
      <p>Data Querying2. We implement queries over the indexes l1, l2 and l3 to examine the
efficiency of our secondary indexes. We run the two most complex queries Q2 and Q9
of LUBM. As we do not support RDF inference, the query Q9 is modified as below so
as to guarantee that we can get results for each basic graph pattern.</p>
      <p>
        Q9: select ?x ?y ?z where f ?x rdf:type ub:GraduateStudent. ?y rdf:type ub:FullProfessor. ?z
rdf:type ub:Course. ?x ub:advisor ?y. ?y ub:teacherOf ?z. ?x ub:takesCourse ?z.g
2 The results presented here is a mirror of our previous work [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>To focus on analyzing the core performance only, we report times for the operations of
results retrieval and the joins (namely we are excluding the time to decode the output)
in the execution phase.</p>
      <p>The results in Figure 2 show that the secondary indexes can obviously improve the
query performance. Moreover, the higher the level of index is, the lower the execution
time. Additionally, it can be seen that building a high-level index is very fast, taking
only hundreds of ms, which is extremely small compared to the query execution time.</p>
      <p>
        We did not employ the query-based joins as mentioned in our query execution
presented here, as we found the data skew in our tests was not obvious (due to the nature
structure of the LUBM benchmark). We plan to integrate the joins with the
development of our system, and then present more detailed results using much more complex
workloads (e.g., similar to the one used in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]).
      </p>
      <p>Acknowledgments. This work was supported by the DFG in grant KR 4381/1-1. We
thank Markus Kr o¨tzsch for comments that greatly improved the manuscript.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shao</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient subgraph matching on billion node graphs</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>5</volume>
          (
          <issue>9</issue>
          ) (May
          <year>2012</year>
          )
          <fpage>788</fpage>
          -
          <lpage>799</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gurajada</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seufert</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miliaraki</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theobald</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>TriAD: A distributed shared-nothing RDF engine based on asynchronous message passing</article-title>
          . In: SIGMOD. (
          <year>2014</year>
          )
          <fpage>289</fpage>
          -
          <lpage>300</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Weaver</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>G.T.</given-names>
          </string-name>
          :
          <article-title>Scalable RDF query processing on clusters and supercomputers</article-title>
          . In: SSWS. (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kotoulas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urbani</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boncz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mika</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Robust runtime optimization and skew-resistant execution of analytical SPARQL queries on PIG</article-title>
          . In: ISWC. (
          <year>2012</year>
          )
          <fpage>247</fpage>
          -
          <lpage>262</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Scalable SPARQL querying of large RDF graphs</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>4</volume>
          (
          <issue>11</issue>
          ) (
          <year>2011</year>
          )
          <fpage>1123</fpage>
          -
          <lpage>1134</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Cheng, L.,
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotoulas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theodoropoulos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Efficient parallel dictionary encoding for RDF data</article-title>
          . In: WebDB. (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Cheng, L.,
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotoulas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theodoropoulos</surname>
          </string-name>
          , G.:
          <article-title>Scalable RDF data compression using X10</article-title>
          .
          <source>arXiv preprint arXiv:1403.2404</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Urbani</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maassen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drost</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seinstra</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bal</surname>
          </string-name>
          , H.:
          <article-title>Scalable RDF data compression with MapReduce</article-title>
          .
          <source>Concurrency and Computation: Practice and Experience</source>
          <volume>25</volume>
          (
          <issue>1</issue>
          ) (
          <year>2013</year>
          )
          <fpage>24</fpage>
          -
          <lpage>39</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kotoulas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oren</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>Van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Mind the data skew: Distributed inferencing by speeddating in elastic regions</article-title>
          .
          <source>In: WWW</source>
          . (
          <year>2010</year>
          )
          <fpage>531</fpage>
          -
          <lpage>540</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Cheng, L.,
          <string-name>
            <surname>Kotoulas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theodoropoulos</surname>
          </string-name>
          , G.:
          <article-title>QbDJ: A novel framework for handling skew in parallel join processing on distributed memory</article-title>
          .
          <source>In: HPCC</source>
          . (
          <year>2013</year>
          )
          <fpage>1519</fpage>
          -
          <lpage>1527</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Cheng, L.,
          <string-name>
            <surname>Kotoulas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theodoropoulos</surname>
          </string-name>
          , G.:
          <article-title>Efficiently handling skew in outer joins on distributed systems</article-title>
          . In: CCGrid. (
          <year>2014</year>
          )
          <fpage>295</fpage>
          -
          <lpage>304</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Cheng, L.,
          <string-name>
            <surname>Kotoulas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theodoropoulos</surname>
          </string-name>
          , G.:
          <article-title>Robust and efficient large-large table outer joins on distributed infrastructures</article-title>
          . In: Euro-Par.
          <article-title>(</article-title>
          <year>2014</year>
          )
          <fpage>258</fpage>
          -
          <lpage>269</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Cheng, L.,
          <string-name>
            <surname>Kotoulas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theodoropoulos</surname>
          </string-name>
          , G.:
          <article-title>Robust and skew-resistant parallel joins in shared-nothing systems</article-title>
          . In: CIKM. (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Cheng, L.,
          <string-name>
            <surname>Kotoulas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theodoropoulos</surname>
          </string-name>
          , G.:
          <article-title>A two-tier index architecture for fast processing large RDF data over distributed memory</article-title>
          .
          <source>In: HT</source>
          . (
          <year>2014</year>
          )
          <fpage>300</fpage>
          -
          <lpage>302</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madden</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hollenbach</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Scalable semantic web data management using vertical partitioning</article-title>
          .
          <source>In: VLDB</source>
          . (
          <year>2007</year>
          )
          <fpage>411</fpage>
          -
          <lpage>422</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>