<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>D-SPARQ: Distributed, Scalable and E cient RDF Query Engine</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Raghava Mutharaju</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sherif Sakr</string-name>
          <email>ssakr@cse.unsw.edu.au</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandra Sala</string-name>
          <email>alessandra.sala@alcatel-lucent.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pascal Hitzler</string-name>
          <email>pascal.hitzlerg@wright.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alcatel-Lucent Bell Labs, Blanchardstown Industrial Park</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Computer Science and Information Technology, University of Dammam, Saudi Arabia. University of New South Wales</institution>
          ,
          <addr-line>High Street, Kensington, NSW</addr-line>
          ,
          <country country="AU">Australia 2052</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Kno.e.sis Center, Wright State University</institution>
          ,
          <addr-line>Dayton, OH</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present D-SPARQ, a distributed RDF query engine that combines the MapReduce processing framework with a NoSQL distributed data store, MongoDB. The performance of processing SPARQL queries mainly depends on the e ciency of handling the join operations between the RDF triple patterns. Our system features two unique characteristics that enable e ciently tackling this challenge: 1) Identifying speci c patterns of the input queries that enable improving the performance by running di erent parts of the query in a parallel mode. 2) Using the triple selectivity information for reordering the individual triples of the input query within the identi ed query patterns. The preliminary results demonstrate the scalability and e ciency of our distributed RDF query engine.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>With the recent surge in the amount of RDF data, there is an increasing need
for scalable RDF query engines. In SPARQL, even a simple query may translate
to multiple triple patterns which have to be joined. In practice, centralized RDF
engines lack scalability and query performance is abridged as they are highly
dependent on main memory constraints in order to e ciently process these join
operations. MapReduce-based processing platforms are becoming the de facto
standard for distributed processing of large scale datasets.</p>
      <p>We present D-SPARQ, a distributed and scalable RDF query engine that
combines the MapReduce processing framework with a NoSQL distributed data
store, MongoDB. In particular, we make the following contributions:
{ We describe how RDF data can be partitioned, stored and indexed in a
horizontally scalable NoSQL store, MongoDB.
{ We describe a number of distributed query optimization techniques which
consider the patterns of the input query together with selectivity
information to minimize the processing time by e ciently parallelizing the query
execution.
{ A comparative performance evaluation of our approach with the
state-ofthe-art in distributed RDF query processing.</p>
      <p>Data Partitioner</p>
      <p>Graph
Partitioner</p>
      <p>Triple Placer</p>
      <p>Query</p>
      <p>Coordinator
MongoDB</p>
      <p>Query
Analyzer and
Processor</p>
      <p>MongoDB</p>
      <p>MongoDB</p>
      <p>Query
Analyzer and
Processor</p>
      <p>Query
Analyzer and</p>
      <p>Processor
4 http://www.mongodb.org
triples which satisfy subject-based star patterns in one read call. In addition,
MongoDB supports indexing on any attribute of a document. It also supports
single and compound indexes. We create compound indexes involving both of
subject-predicate and predicate-object pairs. In MongoDB, a compound index
handles queries on any pre xes of the index. For example, queries on predicate
alone can be handled by the compound index predicate-object.</p>
      <p>We tackle query processing by identifying the following patterns in the input
query:
1. Triple patterns which are independent of each other and can be run in
parallel.
2. Star patterns, i.e., triple patterns which need to be joined on the same subject
or object.
3. Pipeline patterns, i.e., dependency among the triple patterns such as object
of one triple pattern is same as the subject of another triple pattern
(objectsubject, subject-object, object-object joins).</p>
      <p>Identifying these patterns enable us to run di erent parts of the query in
parallel. In order to identify these patterns, an undirected labelled graph is
constructed. In this graph, we nd articulation points and biconnected components.
In particular, articulation points provide the triples involved in pipeline pattern.
A star pattern is treated as a block or a component here i.e., a star pattern
cannot be split further. In general, a star pattern would have at least one
articulation point and would be split up into smaller pieces if a regular biconnected
component algorithm is run on it. With this tweak (keeping the star pattern as
an indivisible block), all the independent star patterns can be obtained from the
query graph.</p>
      <p>In a star pattern, selectivity of each triple pattern plays an important role
in reducing the query runtime. Therefore, for each predicate, we keep a count
of the number of triples involving that particular predicate. For a star
pattern, this information is used to reorder the individual triple patterns within a
star pattern. After identifying the patterns from the query graph, processing of
queries becomes a straightforward task of using querying capabilities provided
by MongoDB, which automatically makes use of the appropriate indices while
retrieving the records from the database. If pipeline patterns are involved in the
query, care is taken to share the output of the dependent variable among all the
triple patterns involved in the pipeline.
3</p>
    </sec>
    <sec id="sec-2">
      <title>Evaluation</title>
      <p>
        For our experimental evaluation, we used RDF datasets which are generated
using SP2Bench benchmark [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The benchmark generates DBLP data in the
form of RDF triples. Our cluster consists of 3 nodes where each node has a
quad-core AMD Opteron Processor with 16GB RAM and 2300MHz processor
speed. MongoDB version 2.2.0 is used as a backend for our query engine. We
compared our approach with the approach of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a distributed RDF query engine
that uses RDF-3X [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] query processor as its backend. RDF-3X5 Version 0.3.7
5 http://www.mpi-inf.mpg.de/~neumann/rdf3x
has been used in our experiments. In particular, RDF-3X have been running on
each node of the cluster and the same number of triples which are handled by
our implementation are also loaded into RDF-3X on each node.
      </p>
      <p>We picked three queries of the benchmark for our experiments (Query2,
Query3 and Query4 ). In particular, we have not considered the queries which
uses the OPTIONAL, FILTER, ORDER features of the SPARQL query
language as they are out of the scope of this paper where we are mainly focusing on
the e cient execution of the join operations between the RDF triple patterns.
The numbers of triples of our experimental datasets which are illustrated in
Table 1 are the average number of triples loaded into RDF-3X, MongoDB of each
node. So the total number of triples across all three nodes in the rst case (with
average of 77 million) is around 230 million and for the second case (163 million)
is around 490 million triples. Each query has been executed ve times and
average of the runtime across all these runs has been collected. The results of Table 1
show that the query runtimes of our implementation are signi cantly better than
that of RDF-3X, especially for larger number of triples. We observed that the
performance of RDF-3X decreases with increase in the number of triples. This
is a clear advantage for our query optimization techniques and the scalability of
our data storage backend that relies on a NoSQL store, MongoDB.
4</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>We presented a distributed RDF query engine that combines a scalable data
processing framework, MapReduce, with a NoSQL distributed data store,
MongoDB. A comparative performance evaluation show that our approach can
outperform the state-of-the-art in distributed RDF query processing. We are
planning to continue evaluating our approach using di erent and bigger datasets and
extend our approach to support other features of the SPARQL query language.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <source>Scalable SPARQL Querying of Large RDF Graphs. PVLDB</source>
          <volume>4</volume>
          (
          <issue>11</issue>
          ),
          <volume>1123</volume>
          {
          <fpage>1134</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Karypis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , V.:
          <article-title>A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs</article-title>
          .
          <source>SIAM Journal on Scienti c Computing</source>
          <volume>20</volume>
          (
          <issue>1</issue>
          ),
          <volume>359</volume>
          {392 (Dec
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ravindra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anyanwu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra</article-title>
          . PVLDB
          <volume>4</volume>
          (
          <issue>12</issue>
          ),
          <volume>1426</volume>
          {
          <fpage>1429</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>The RDF-3X engine for scalable management of RDF data</article-title>
          .
          <source>VLDB J</source>
          .
          <volume>19</volume>
          (
          <issue>1</issue>
          ),
          <volume>91</volume>
          {
          <fpage>113</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hornung</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lausen</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinkel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>SP2Bench: A SPARQL Performance Benchmark</article-title>
          . In: ICDE. pp.
          <volume>222</volume>
          {
          <issue>233</issue>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>