<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SparkRDF: Elastic Discreted RDF Graph Processing Engine With Distributed Memory</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xi Chen</string-name>
          <email>xichen@zju.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huajun Chen</string-name>
          <email>huajunsir@zju.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ningyu Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Songyang Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Computer Science, Zhejiang University</institution>
          ,
          <addr-line>Hangzhou 310027</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the explosive growth of semantic data on the Web over the past years, many large-scale RDF knowledge bases with billions of facts are generating. This poses signi cant challenges for the storage and retrieval of big RDF graphs. In this paper, we introduce the SparkRDF, an elastic discreted semantic graph processing engine with distributed memory. To reduce the high I/O and communication costs for distributed platforms, SparkRDF implements SPARQL query based on Spark, a novel in-memory distributed computing framework. All the intermediate results are cached in the distributed memory to accelerate the process of iterative join. To reduce the search space and memory overhead, SparkRDF splits the RDF graph into the multi-layer subgraphs based on the relations and classes. For SPARQL query optimization, SparkRDF generates an optimal execution plan for join queries, leading to e ective reduction on the size of intermediate results, the number of joins and the cost of communication. Our extensive evaluation demonstrates the e ciency of our system.</p>
      </abstract>
      <kwd-group>
        <kwd>Big RDF Graph</kwd>
        <kwd>SPARQL</kwd>
        <kwd>SPARK</kwd>
        <kwd>Distributed memory</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        With the development of Semantic technologies and Web 3.0, the amount of
Semantic Web data represented by the Resource Description Framework (RDF)
is increasing rapidly. Traditional RDF systems are mainly facing two challenges.
i)scalability: the ability to process the big RDF data. Most existing RDF systems
are based on single node[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which are easily vulnerable to the growth of the
data size because they usually need to load large indexes into the limited memory.
ii) real-time: the capacity to implement SPARQL query over big RDF graph in
near real time. For highly iterative SPARQL query, existing MapReduce-based
RDF systems su er from high I/O cost because of iteratively reading and writing
large intermediate results in disk[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In this paper, we introduce SparkRDF, an elastic discreted RDF graph
processing system with distributed memory. It is based on Spark, a in-memory
cluster computing system which is quite suitable for large-scale real-time
iterative computing jobs[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. SparkRDF splits the big RDF graph into
MESGs(Multilayer Elastic SubGraph) based on relations and classes by creating 5 kinds of
indexes(C,R,CR,RC,CRC) with di erent grains to cater for diverse triple
patterns(TP). These index les on demand are modeled as RDSG(Resilient
Discreted SubGraph), a collection of in-memory semantic subgraph objects partitioned
across machines, which can implement SPARQL query by a series of basic
operators. All intermediate results(IR), which are also regarded as the RDSG, remain
in the distributed memory to support further fast joins. Based on the query
model, several corresponding optimization tactics are then presented.
      </p>
      <p>The remaining of this paper is organized as follows. Section 2 introduces
the index data model and iterative query model of SparkRDF. In Section 3,
we present the results of our experiments. Finally, we conclude and discuss the
future work in Section 4.
2
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>SparkRDF</title>
      <sec id="sec-2-1">
        <title>Index Data Model: MESG</title>
        <p>We create the index model called MESG based on relations and classes, which
extends traditional vertical partitioning solution by connecting class indexes with
predicate indexes, whose goal is to construct a smaller index le for every TP in
the SPARQL query. At the same time, as it is uncertain that the class
information about the entities can are given in the SPARQL query, the SparkRDF needs
a multi-layer elastic index scheme to meet the query need for di erent kinds of
TP. Speci cally, we rst construct the class indexes(C) and relation indexes(R).
Then a set of ner-grained index les(CR,RC,CRC) are created by joining the
two kinds of index les. All the index les are stored in the HDFS.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>RDSG-based Iterative Query Model</title>
        <p>For SparkRDF, all the index les and IRs can be modeled as an uni ed
concept called RDSG(Resilient Discreted SubGraph). It is a distributed memory
abstraction that lets us perform in-memory query computations on large
clusters by providing the following basic operators: RDSG Gen, RDSG Filter,
RDSG Prepartition, RDSG Join. Figure 1 illustrates the RDSG-based query process.
Every job corresponds to one query variable.
2.3</p>
      </sec>
      <sec id="sec-2-3">
        <title>Optimization techniques</title>
        <p>Based on the data model and query model, several optimization strategies are
made to improve query e ciency. First, TR-SPARQL refers to a Type-Restrictive
SPARQL by passing variable's implicit class message to corresponding TPs that
contains the variable. It cuts down the number of task (remove the TPs whose
predicate is rdf:type )and the cost of parsing every TP(form a more restrictive
index le). Then we use a selectivity-based greedy algorithm to design a optimal
execution order of TPs, greatly reducing the size of IR. At last, the
locationfree prepartitioning is implemented to avoid the shu ing cost in the distributed
join. It ignores the partitioning information of index les, while repartitioning
the data with the same join key to the same node.</p>
        <p>SparkRDF: Elastic Discreted RDF Engine With Distributed Memory</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>We implement the experiment on a cluster with three machines. Each node has
16 GB DDR3 RAM, 8-core Intel Xeon(R) E5606 CPUs at 2.13GHz. We
compare SparkRDF with the state-of-the-art centralized RDF-3X and distributed
HadoopRDF. We run the RDF-3X on one of the nodes. HadoopRDF and
SparkRDF were executed in the cluster. We use the widely-used LUBM dataset
with the scale of 10000, 20000 and 30000 universities, consisting of 1.3 , 2.7 and
4.1 billion triples. For the LUBM queries, we chose 7 representative queries which
are roughly classi ed into 2 categories: highly selective queries (Q4,Q5,Q6) and
unselective queries(Q1,Q2,Q3,Q7). A short description on the chosen queries is
provided in the Appendix.</p>
      <p>Table 1 summarizes our comparison with HadoopRDF and RDF-3X(best
times are boldfaced). The rst observation is that SparkRDF performs much
better than HadoopRDF for all queries. This can be mainly attributed to the
following three characteristics of SparkRDF: ner granularity of index scheme,
optimal query order and e ective memory-based joining. Another observation
is that SparkRDF outperformed RDF-3X in Q1,Q2,Q3,Q7, while RDF-3X did
better in Q4,Q5,Q6. The result conforms to our initial conjecture: RDF-3X can
achieve high performance for queries with high selectivity and bound objects or
subjects, while SparkRDF did well for queries with unbound objects or subjects,
low selectivity or large intermediate results joins. Another result is that RDF-3X
fails to answer Q1 and Q3 when the data set size is 4.1 billion triples. On the
contrary, SparkRDF scales linearly and smoothly when the scale of the datasets
increases from 1.3 to 4.1 billion triples. It proves that SparkRDF has a good
scalability.</p>
      <p>Selectivity of Variables</p>
      <p>Job1
Index RDSG TPx1 RDSG RDSG
RDSG_Gen Prepartition RDSG_Filter
Index RDSG TPx2 RDSG RDSG
....RDSG_Gen.... PreTpPaxrktitio....n RDSG_Filt....er
....</p>
      <p>Job2
Index ..TPy1 RDSG</p>
      <p>..</p>
      <p>RDSG_OP
Index .TPy2 RDSG</p>
      <p>...</p>
      <p>.... RDSG....T_POyPn ....</p>
      <p>IR1</p>
      <p>IR2
...</p>
      <p>IRnotshuffled</p>
      <p>IRk IRk IRk+1 IRk+2
RDSG_Join</p>
      <p>RDSG_Join</p>
      <p>RDSG_Join RDSG_Join</p>
      <p>RDSG_Join</p>
      <p>Fig. 1. The Iterative Query Model of SparkRDF
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>In the paper, we introduce the SparkRDF, a real-time scalable big RDF graph
processing engine. Also We give some the experimental results to show e
ectiveness of the SparkRDF. In the future, we would like to extend the work in few
directions. First, we will handle more complex SPARQL patterns(such as
OPTIONAL). Finally, we will make a more complete and comprehensive experiment
to validate the e ciency of SparkRDF.</p>
    </sec>
    <sec id="sec-5">
      <title>APPENDIX</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Atre</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaoji</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Matrix Bit loaded: a scalable lightweight join query processor for rdf data</article-title>
          .
          <source>In: Proceedings of the 19th international conference on World wide web</source>
          . pp.
          <volume>41</volume>
          {
          <fpage>50</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>LUBM: A benchmark for owl knowledge base systems</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>3</volume>
          (
          <issue>2</issue>
          ),
          <volume>158</volume>
          {
          <fpage>182</fpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Husain</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGlothlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masud</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thuraisingham</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Heuristicsbased query processing for large rdf graphs using cloud computing. Knowledge and Data Engineering</article-title>
          , IEEE Transactions on
          <volume>23</volume>
          (
          <issue>9</issue>
          ),
          <volume>1312</volume>
          {
          <fpage>1327</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>The rdf-3x engine for scalable management of rdf data</article-title>
          .
          <source>The VLDB Journal</source>
          <volume>19</volume>
          (
          <issue>1</issue>
          ),
          <volume>91</volume>
          {
          <fpage>113</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dave</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>McCauley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing</article-title>
          .
          <source>In: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation</source>
          . pp.
          <volume>2</volume>
          {
          <issue>2</issue>
          (
          <issue>2012</issue>
          )
          <article-title>We provide the SPARQL queries used in the experimental section: Q1-Q6 are the same as [1]. Q7 corresponds to the Q14 of [2].</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>