    D-SPARQ: Distributed, Scalable and Efficient
              RDF Query Engine

    Raghava Mutharaju1 , Sherif Sakr2 , Alessandra Sala3 , and Pascal Hitzler1
             1
                 Kno.e.sis Center, Wright State University, Dayton, OH, USA.
                          {mutharaju.2, pascal.hitzler}@wright.edu
                2 College of Computer Science and Information Technology,
                           University of Dammam, Saudi Arabia.
      University of New South Wales, High Street, Kensington, NSW, Australia 2052.
                                  ssakr@cse.unsw.edu.au
       3 Alcatel-Lucent Bell Labs, Blanchardstown Industrial Park, Dublin, Ireland.
                           alessandra.sala@alcatel-lucent.com


       Abstract. We present D-SPARQ, a distributed RDF query engine that
       combines the MapReduce processing framework with a NoSQL distributed
       data store, MongoDB. The performance of processing SPARQL queries
       mainly depends on the efficiency of handling the join operations between
       the RDF triple patterns. Our system features two unique characteristics
       that enable it to tackle this challenge efficiently: 1) identifying specific
       patterns in the input queries that allow different parts of a query to be
       run in parallel, and 2) using triple selectivity information to reorder the
       individual triple patterns of the input query within the identified query
       patterns. The preliminary results
       demonstrate the scalability and efficiency of our distributed RDF query
       engine.


1    Introduction
With the recent surge in the amount of RDF data, there is an increasing need
for scalable RDF query engines. In SPARQL, even a simple query may translate
to multiple triple patterns which have to be joined. In practice, centralized RDF
engines lack scalability, and their query performance degrades because they depend
heavily on main memory to process these join operations efficiently. MapReduce-based
processing platforms are becoming the de facto standard for distributed processing
of large-scale datasets.
    We present D-SPARQ, a distributed and scalable RDF query engine that
combines the MapReduce processing framework with a NoSQL distributed data
store, MongoDB. In particular, we make the following contributions:
  – We describe how RDF data can be partitioned, stored and indexed in a
     horizontally scalable NoSQL store, MongoDB.
  – We describe a number of distributed query optimization techniques which
     consider the patterns of the input query together with selectivity information
     to minimize the processing time by efficiently parallelizing the query
     execution.
  – We present a comparative performance evaluation of our approach against the
     state of the art in distributed RDF query processing.
[Figure: the Data Partitioner (Graph Partitioner, Triple Placer, Query Coordinator)
distributes data to the MongoDB instances on the cluster nodes, each of which is
paired with a Query Analyzer and Processor.]

                              Fig. 1. System architecture



2     Approach

Figure 1 illustrates an overview of D-SPARQ's architecture. RDF triple datasets
are imported into MongoDB using a single MapReduce job, which also collects the
statistical information required by our join reordering module during query
optimization. A graph is constructed from the given RDF triples and partitioned
using a graph partitioner [2]; the triples are then spread across the machines in
the cluster, with the number of partitions equal to the number of machines.
'rdf:type' triples are removed from the data before partitioning, since they make
the graph overly connected and would reduce the quality of the partitions. After
partitioning, all triples whose subject matches a vertex are placed in the same
partition as that vertex. Partial data replication is then applied: some triples
are replicated across partitions to enable parallel query execution. In
particular, for each partition, all vertices within a path of length n (in either
direction) of the vertices already assigned to that partition are added to it [1].
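To make the placement and replication step concrete, the following is a minimal
sketch, assuming the vertex-to-partition assignment comes from a graph partitioner
such as METIS [2]; it uses networkx for the neighbourhood expansion, and all
function and variable names are our own illustration, not the paper's
implementation.

    # Minimal sketch (not the authors' code) of subject-based triple placement
    # with n-hop replication, given a vertex -> partition assignment.
    from collections import defaultdict
    import networkx as nx

    def place_triples(triples, vertex_partition, n_hops=1):
        """triples: iterable of (s, p, o); vertex_partition: dict vertex -> partition id."""
        triples = list(triples)
        g = nx.Graph()
        g.add_nodes_from(vertex_partition)
        g.add_edges_from((s, o) for s, p, o in triples)

        # Base assignment: the vertices originally given to each partition.
        base = defaultdict(set)
        for v, part in vertex_partition.items():
            base[part].add(v)

        # Expand each partition by all vertices reachable within n_hops
        # (in either direction, since the graph is undirected).
        expanded = {}
        for part, verts in base.items():
            closure, frontier = set(verts), set(verts)
            for _ in range(n_hops):
                frontier = {nbr for v in frontier for nbr in g.neighbors(v)} - closure
                closure |= frontier
            expanded[part] = closure

        # A triple is stored in every partition whose expanded vertex set contains
        # its subject, so triples near partition borders get replicated.
        placement = defaultdict(list)
        for s, p, o in triples:
            for part, closure in expanded.items():
                if s in closure:
                    placement[part].append((s, p, o))
        return placement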
    Triples assigned to each partition (machine) are stored in MongoDB4, a NoSQL
document database. In general, each RDF triple has three parts: subject,
predicate, and object. In a plain key-value store, even if two parts of a triple
were combined into one key, indexing them would be less efficient. We therefore
use a document store, MongoDB, which offers good read and write speeds along with
good indexing, querying, and sharding mechanisms. For example, in MongoDB, all
triples with the same subject can be grouped into one document. Many SPARQL
queries contain “star” patterns in their basic graph patterns [3]; such patterns
are joined on either a common subject or a common object. By grouping the triples
with the same subject, we can retrieve all triples that satisfy a subject-based
star pattern in a single read call. In addition, MongoDB supports indexing on any
attribute of a document, with both single and compound indexes. We create
compound indexes on subject-predicate and predicate-object pairs. In MongoDB, a
compound index also answers queries on any prefix of the index; for example,
queries on the predicate alone can be answered by the predicate-object index.

4 http://www.mongodb.org
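As an illustration of this layout, one document per subject and the two compound
indexes could be set up with pymongo roughly as follows; the collection and field
names (triples, subject, po) are hypothetical choices of ours, not taken from the
paper.

    # Illustrative sketch of the storage layout described above, using pymongo.
    from pymongo import ASCENDING, MongoClient

    client = MongoClient("localhost", 27017)
    db = client["dsparq"]

    # One document per subject, grouping its (predicate, object) pairs so that a
    # subject-based star pattern can be answered with a single read.
    db.triples.insert_one({
        "subject": "ex:article1",
        "po": [
            {"predicate": "dc:creator", "object": "ex:author1"},
            {"predicate": "dcterms:issued", "object": "2005"},
        ],
    })

    # Compound indexes on subject-predicate and predicate-object; because MongoDB
    # can use any prefix of a compound index, the second index also answers
    # predicate-only lookups.
    db.triples.create_index([("subject", ASCENDING), ("po.predicate", ASCENDING)])
    db.triples.create_index([("po.predicate", ASCENDING), ("po.object", ASCENDING)])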
    We tackle query processing by identifying the following patterns in the input
query:
 1. Triple patterns which are independent of each other and can be run in
     parallel.
 2. Star patterns, i.e., triple patterns which need to be joined on the same
     subject or object.
 3. Pipeline patterns, i.e., dependencies among triple patterns, such as the
     object of one triple pattern being the same as the subject of another
     (object-subject, subject-object, and object-object joins).
    Identifying these patterns enables us to run different parts of the query in
parallel. In order to identify them, an undirected labelled graph is constructed
from the query, in which we find articulation points and biconnected components.
In particular, articulation points identify the triple patterns involved in a
pipeline pattern. A star pattern is treated as a single block (component) here,
i.e., a star pattern cannot be split further. In general, a star pattern has at
least one articulation point and would be split into smaller pieces if a regular
biconnected component algorithm were run on it. With this tweak (keeping the star
pattern as an indivisible block), all the independent star patterns can be
obtained from the query graph.
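A minimal sketch of this graph analysis using networkx is given below; it performs
only the standard articulation-point and biconnected-component computation,
without the star-preserving tweak, and the function name and example patterns are
our own.

    # Sketch: build the undirected labelled query graph and decompose it.
    import networkx as nx

    def decompose_query(triple_patterns):
        """triple_patterns: list of (s, p, o) terms; variables start with '?'."""
        g = nx.Graph()
        for s, p, o in triple_patterns:
            g.add_edge(s, o, predicate=p)  # edge labelled with the predicate

        # Articulation points mark the join vertices of pipeline patterns.
        pipeline_vertices = set(nx.articulation_points(g))

        # Biconnected components give candidate blocks; D-SPARQ additionally keeps
        # each star pattern as one indivisible block instead of splitting it.
        blocks = [set(c) for c in nx.biconnected_components(g)]
        return pipeline_vertices, blocks

    # Example: ?a dc:creator ?b joined with ?b foaf:name ?n forms a pipeline on ?b.
    joins, blocks = decompose_query([("?a", "dc:creator", "?b"),
                                     ("?b", "foaf:name", "?n")])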
    Within a star pattern, the selectivity of each triple pattern plays an
important role in reducing the query runtime. Therefore, for each predicate, we
keep a count of the number of triples involving that predicate, and this
information is used to reorder the individual triple patterns within a star
pattern. Once the patterns have been identified from the query graph, query
processing becomes a straightforward task of using the querying capabilities
provided by MongoDB, which automatically uses the appropriate indexes when
retrieving records from the database. If the query involves pipeline patterns,
care is taken to share the bindings of the dependent variable among all the
triple patterns in the pipeline.
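The reordering step can be sketched as follows, assuming a dictionary of
per-predicate triple counts collected at load time (all names are illustrative):

    # Sketch of selectivity-based reordering within a star pattern.
    def reorder_star(star_patterns, predicate_counts):
        """Place the most selective (least frequent predicate) patterns first."""
        return sorted(star_patterns,
                      key=lambda tp: predicate_counts.get(tp[1], float("inf")))

    # Example: with counts {"dc:creator": 2_000_000, "swrc:journal": 50_000},
    # the swrc:journal pattern would be evaluated before the dc:creator one.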
3     Evaluation
For our experimental evaluation, we used RDF datasets generated with the
SP²Bench benchmark [5], which produces DBLP data in the form of RDF triples.
Our cluster consists of 3 nodes, each with a quad-core AMD Opteron processor
(2.3 GHz) and 16 GB of RAM. MongoDB version 2.2.0 is used as the backend of our
query engine. We compared our approach with that of [1], a distributed RDF query
engine that uses the RDF-3X [4] query processor as its backend. RDF-3X5 version
0.3.7 has been used in our experiments. In particular, RDF-3X has been run on
each node of the cluster, and the same triples that are handled by our
implementation are also loaded into RDF-3X on each node.

5 http://www.mpi-inf.mpg.de/~neumann/rdf3x

                    Query2            Query3             Query4
 #Triples       RDF-3X  D-SPARQ   RDF-3X  D-SPARQ   RDF-3X       D-SPARQ
 77 million       217s   192.5s      80s   69.43s   OutOfMemory  319.87s
 163 million     1537s     398s     434s     166s   OutOfMemory     671s

        Table 1. Query runtimes (in seconds) for RDF-3X and D-SPARQ
    We picked three queries of the benchmark for our experiments (Query2, Query3,
and Query4). In particular, we did not consider queries that use the OPTIONAL,
FILTER, or ORDER BY features of the SPARQL query language, as they are out of the
scope of this paper; our focus is on the efficient execution of the join
operations between the RDF triple patterns. The triple counts shown in Table 1
are the average number of triples loaded into RDF-3X and MongoDB on each node.
The total number of triples across all three nodes is therefore around 230
million in the first case (an average of 77 million per node) and around 490
million in the second case (163 million per node). Each query was executed five
times, and the average runtime across these runs is reported. The results in
Table 1 show that the query runtimes of our implementation are significantly
better than those of RDF-3X, especially for larger numbers of triples. We
observed that the performance of RDF-3X degrades as the number of triples
increases. This highlights the advantage of our query optimization techniques and
the scalability of our MongoDB-based storage backend.
4   Conclusion
We presented a distributed RDF query engine that combines a scalable data
processing framework, MapReduce, with a NoSQL distributed data store, Mon-
goDB. A comparative performance evaluation shows that our approach can outperform
the state of the art in distributed RDF query processing. We plan to continue
evaluating our approach on additional and larger datasets and to extend it to
support other features of the SPARQL query language.
References
1. Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL Querying of Large RDF Graphs.
   PVLDB 4(11), 1123–1134 (2011)
2. Karypis, G., Kumar, V.: A Fast and High Quality Multilevel Scheme for Partitioning
   Irregular Graphs. SIAM Journal on Scientific Computing 20(1), 359–392 (Dec 1998)
3. Kim, H., Ravindra, P., Anyanwu, K.: From SPARQL to MapReduce: The Journey
   Using a Nested TripleGroup Algebra. PVLDB 4(12), 1426–1429 (2011)
4. Neumann, T., Weikum, G.: The RDF-3X engine for scalable management of RDF
   data. VLDB J. 19(1), 91–113 (2010)
5. Schmidt, M., Hornung, T., Lausen, G., Pinkel, C.: SP²Bench: A SPARQL Perfor-
   mance Benchmark. In: ICDE. pp. 222–233 (2009)