SPARQLGX in Action: Efficient Distributed
       Evaluation of SPARQL with Apache Spark

        Damien Graux, Louis Jachiet, Pierre Genevès, and Nabil Layaïda

               Inria, Cnrs, lig and Univ. Grenoble Alpes, France
              {damien.graux,louis.jachiet,nabil.layaida}@inria.fr
                             pierre.geneves@cnrs.fr


        Abstract. We demonstrate sparqlgx: our implementation of a dis-
        tributed sparql evaluator. We show that sparqlgx makes it possible to
        evaluate sparql queries on billions of triples distributed across multiple
        nodes, while providing attractive performance figures.


1     Introduction
We demonstrate the sparqlgx system introduced in [5] which is designed to
evaluate sparql queries efficiently in a distributed manner on top of the Apache
Spark framework1 . sparql [1] is the standard query language for retrieving and
manipulating data represented in rdf [7]. The core of the sparql query language
is the Basic Graph Pattern fragment (bgp) composed of conjunctions of triple
patterns (tps) which express conditions on rdf triples. sparqlgx supports the
bgp fragment of sparql extended with union and optional operators at top
level and with solution modifiers. sparqlgx implements specific optimizations
aimed at optimizing the evaluation of bgps. For example, the following query
taken from the WatDiv benchmark [3] (c3) involving one bgp composed of six
triple patterns returns all the matching subjects from the dataset.
SELECT ?v0 WHERE {
  ?v0 wsdbm:likes ?v1 .                 ?v0 wsdbm:friendOf ?v2 .
  ?v0 terms:Location ?v3 .              ?v0 foaf:age ?v4 .
  ?v0 wsdbm:gender ?v5 .                ?v0 foaf:givenName ?v6 . }


2     SPARQLGX Architecture and Principles
Data Storage Model. In order to process rdf datasets, we adopt the vertical
partitioning approach introduced by Abadi et al. in [2] which stores a triple
(s p o) in a file named p whose contents keeps only s and o entries. Converting
rdf data into a vertically partitioned dataset is straightforward while (1) tending
to minimize the memory footprint and the datasets size on disks and (2) reducing
response time when queries have bounded predicates since searches are limited
to the relevant files.
1
    http://spark.apache.org/
          Dataset Number of Triples HDFS File Size (with 2 replications)
         WatDiv1k   109 million                  46.8 Go
         Lubm1k     134 million                  72.0 Go
         Lubm10k    1.38 billion                 747 Go
              Table 1: General Information about Used Datasets.


Compilation of sparql bgps. Conjunctions of triple patterns are translated
in terms of primitives of the Apache Spark framework expressed in Scala code.
Each bgp is first translated in terms of a list of filters, which are then joined
based on common variables.

Optimized Joins With Statistics. To speed up the evaluation of queries, par-
ticular attention is paid to the ordering of joins in the translation process. We
compute statistics on data (i.e. we count all the distinct subjects, predicates
and objects to obtain a notion of triple pattern selectivity based on occurrence
numbers). Then we rewrite queries in order to minimize the sizes of intermediate
results. Triple patterns are joined in decreasing order of their selectivities (i.e.
triple patterns that return the smallest number of results are joined first).

Direct Evaluation. In certain situations, queried data might be subject to up-
dates; in others users might only need to evaluate a single query once (for data
cleaning purposes for instance). In such cases, it is interesting to limit as much
as possible both the preprocessing time and the query evaluation time. For this
purpose, sparqlgx provides a specific tool, called sde, capable of directly eval-
uating sparql queries without preprocessing.


3   Demonstration Details
We report on our experimental comparisons of sparqlgx against other open
source hdfs-based distributed rdf systems such as PigSPARQL [9], RYA [8],
CliqueSquare [4], S2RDF [10] and RDFHive [5]. We deploy them on the same
10-node cluster having the hdfs installed with default settings which imply
a resiliency to the loss of two nodes. Furthermore, we consider several datasets
(presented in Table 1) coming from two popular bgp benchmarks: LUBM [6] and
WatDiv [3]. These benchmarks are respectively composed of 14 and 20 sparql
queries.
    We present in Figure 1 the response times obtained with WatDiv1k. This
illustrates that, for this dataset: (1) SDE always outperforms other tested “direct
evaluators” (e.g. PigSPARQL and RDFHive); (2) sparqlgx is able to answer all
the queries unlike RYA and CliqueSquare; (3) it shares with CliqueSquare the
same order of magnitude for queries L[1–5]; (4) it outperforms its competitors
on queries C1, C2 and C3 (shown in the introduction).
    To further illustrate the performance of sparqlgx, we provide an interactive
GUI. Attendees can interact directly with our engines (i.e. sparqlgx and sde)
           SPARQLGX            RYA        S2RDF          CliqueSquare          PigSPARQL            RDFHive         SDE

          104


          103
Time(s)


          102


          101

                C1   C2   C3   F1    F2   F3   F4   F5     L1   L2   L3   L4    L5   S1   S2   S3    S4   S5   S6    S7


                           Fig. 1: Query Response Times with WatDiv1k.


            (a) Statistic Module Screenshot.                          (b) Query Evaluator Screenshot.

                                                Fig. 2: Screenshots.


by evaluating several sparql queries on various rdf datasets: such as those
presented in Table 1 and other smaller ones. Additionnally, we also deploy the
other hdfs-based distributed systems (to have a common basis of comparison);
thereby, participants can compare several systems running exactly on the same
cluster. We demonstrate sparqlgx and sde with various interaction scenarios:

 1. Loading datasets: participants can select a dataset among several predefined
    ones and a system to run its preprocessing phase. They can experience the
    preprocessing time of sparqlgx, RYA, CliqueSquare and S2RDF. Since the
    loading and indexing cost can be high, we limit this feature to the smaller
    rdf datasets.
 2. Query Evaluation (e.g. Figure 2b): attendees can evaluate predefined sparql
    queries which are extracted from LUBM and Watdiv. After choosing a
    dataset and a query, participants can select the evaluation system among
    sparqlgx, sde, RYA, CliqueSquare, S2RDF, RDFHive and PigSPARQL.
 3. Query Rewriting (e.g. Figure 2a): participants can also experience the op-
    timizer module which orders by selectivity the triple patterns of a sparql
    query according to a chosen rdf dataset. Attendees can see the generated
    Spark code.
 4. System Comparison: finally, participants can observe how the different sys-
    tems compare head-to-head.


4    Conclusion
sparqlgx outperforms several related implementations in many cases, while
implementing a simple architecture exclusively built on top of open source and
publicly available technologies. The sparqlgx implementation is available from:

                  http://github.com/tyrex-team/sparqlgx


References
 1. SPARQL 1.1 overview (March 2013), http://www.w3.org/TR/sparql11-overview/
 2. Abadi, Marcus, Madden, Hollenbach: Scalable semantic web data management
    using vertical partitioning. VLDB (2007)
 3. Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified stress testing of RDF
    data management systems. In: ISWC. pp. 197–212. Springer (2014)
 4. Goasdoué, F., Kaoudi, Z., Manolescu, I., Quiané-Ruiz, J.A., Zampetakis, S.:
    Cliquesquare: Flat plans for massively parallel RDF queries. In: ICDE. pp. 771–
    782. IEEE (2015)
 5. Graux, D., Jachiet, L., Genevès, P., Layaïda, N.: SPARQLGX: A distributed RDF
    store mapping SPARQL to Spark. ISWC (2016)
 6. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base sys-
    tems. Web Semantics (2005)
 7. Hayes, P., McBride, B.: RDF semantics. W3C Rec. (2004)
 8. Punnoose, R., Crainiceanu, A., Rapp, D.: Rya: a scalable RDF triple store for the
    clouds. In: Workshop on Cloud Intelligence. p. 4. ACM (2012)
 9. Schätzle, A., Przyjaciel-Zablocki, M., Lausen, G.: PigSPARQL: Mapping SPARQL
    to pig latin. In: SemWeb Information Management. p. 4. ACM (2011)
10. Schätzle, A., Przyjaciel-Zablocki, M., Skilevic, S., Lausen, G.: S2RDF: RDF query-
    ing with SPARQL on spark. VLDB pp. 804–815 (2016)