<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PigSPARQL: A SPARQL Query Processing Baseline for Big Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Schatzle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Przyjaciel-Zablocki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Hornung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georg Lausen</string-name>
          <email>lausen@informatik.uni-freiburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 051</institution>
          ,
          <addr-line>79110 Freiburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This abstraction makes our approach independent of the actual Hadoop version and thus ensures compatibility with future changes of the Hadoop framework, as they will be covered by the underlying Pig layer. We revisit PigSPARQL and demonstrate the performance improvement achieved by simply switching the underlying version of Pig from 0.5.0 to 0.11.0, without any changes to PigSPARQL itself. Because of this sustainability, PigSPARQL is an attractive long-term baseline for comparing various MapReduce based SPARQL implementations, which is also underpinned by its competitiveness with existing systems, e.g. HadoopRDF.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Today, MapReduce has been widely adopted in many application fields,
especially in the broad area of Big Data, with Hadoop being the most prominent
open source implementation. Though per-node efficiency is known to be rather poor,
its success is mainly attributed to its inherent high degree of parallelism,
robustness, reliability and excellent scalability while running on cheap
and heterogeneous commodity hardware. Furthermore, new nodes can be added
to the system seamlessly on demand at runtime.</p>
      <p>
        Driven by the Semantic Web and Linked Open Data, new challenges with
regard to SPARQL query evaluation arise, and scalability becomes an issue as
RDF datasets continuously grow in size, exceeding the capabilities of
state-of-the-art non-distributed RDF triple stores [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The widespread adoption of
MapReduce makes it an interesting candidate for distributed SPARQL processing on
large RDF graphs, especially for rather costly queries involving several joins that
cannot be executed in real-time at web scale and hence need to be processed
offline. However, existing approaches in this direction are often accompanied by
proof-of-concept implementations that are hard to deploy, are not compatible
with newer versions of Hadoop, do not run out of the box, or are not even
available for download at all. Moreover, they often support only a small subset
of SPARQL, usually basic graph patterns. All this hampers the comparison of
different approaches, as evaluation results are hard to reproduce and a
comprehensive evaluation becomes very cumbersome and time consuming.
      </p>
      <p>
        In this paper we first revisit PigSPARQL<sup>1</sup>, a mapping from SPARQL to the
query language of Pig [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], that was originally presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. PigSPARQL is easy
to use without complicated deployment, installation or configuration. By using
Pig Latin as an intermediate layer of abstraction between SPARQL and
MapReduce, the mapping is automatically compatible with future versions of Hadoop
(including major changes like the new YARN framework), while it benefits from
further developments and optimizations of Pig without having to change a single
line of code, since the query language of Pig is kept backward compatible. This is
confirmed by the experiments presented briefly in this paper (cf. Section 3).
Switching the version of Pig from 0.5.0 to 0.11.0 improved the query execution
times by up to one order of magnitude, while no adaptations of PigSPARQL were
required. Because of this sustainability, PigSPARQL is an attractive
long-term baseline for comparing various MapReduce based SPARQL
implementations. This is also underpinned by PigSPARQL's competitiveness with existing
systems like HadoopRDF and others (cf. Section 3).
      </p>
    </sec>
    <sec id="sec-2">
      <title>PigSPARQL Architecture</title>
      <p>
Pig is a data analysis platform on top of Hadoop with a fully nested data model,
complemented by a comprehensive imperative query language (Pig Latin) that
provides a simple level of abstraction from the procedural model of MapReduce
with relational-style operators like filters and joins, which are not available
in MapReduce out of the box. We represent an RDF triple in the data model of
Pig as a tuple of three atomic fields with schema (s, p, o). As we do not require
any preprocessing and RDF triples are converted into the data model of Pig on
the fly, PigSPARQL is particularly suited for ad-hoc query processing, e.g. for
ETL-like scenarios where we do not want to build up a costly index structure in
advance. In a typical SPARQL query the predicate of a triple pattern is usually
bound. Hence, PigSPARQL also supports optional vertical partitioning of the
dataset as an additional preprocessing step.</p>
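      <p>The on-the-fly conversion and the optional vertical partitioning can be sketched as follows. This is a minimal Python illustration of the idea, not PigSPARQL's actual loader; the whitespace-separated triple format and all function names are our own simplifications:</p>

```python
from collections import defaultdict

def parse_triple(line):
    """Parse a simplified whitespace-separated triple line into an
    (s, p, o) tuple, mirroring the flat (s, p, o) tuple schema in Pig."""
    s, p, o = line.rstrip(" .\n").split(" ", 2)
    return (s, p, o)

def vertical_partition(lines):
    """Group triples by predicate so that a triple pattern with a bound
    predicate only needs to load one (usually small) partition."""
    partitions = defaultdict(list)
    for line in lines:
        s, p, o = parse_triple(line)
        partitions[p].append((s, o))  # the predicate is implicit per partition
    return dict(partitions)

data = [
    "alice knows peter .",
    'alice age "20" .',
    "bob knows peter .",
]
parts = vertical_partition(data)
```

      <p>A triple pattern like "?person knows Peter" then only touches the "knows" partition instead of the full dataset, which is the effect the LOAD statements per predicate achieve in the generated Pig Latin.</p>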
      <p>
        Our mapping of SPARQL to Pig Latin follows a common design principle
based on an algebraic representation of SPARQL expressions (cf. Figure 1). First,
a SPARQL query is parsed to generate an abstract syntax tree, which is then
translated into a SPARQL algebra tree. Next, we apply several optimizations on
the algebra level, like the early execution of filters and a rearrangement of triple
patterns by selectivity. Finally, we traverse the optimized algebra tree bottom-up
and generate for every SPARQL algebra operator an equivalent sequence of Pig
Latin expressions. At runtime, Pig automatically maps the resulting Pig Latin
script into a sequence of MapReduce iterations. More details are given in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p><sup>1</sup> See http://dbis.informatik.uni-freiburg.de/PigSPARQL for download.</p>
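      <p>The bottom-up translation of the algebra tree can be sketched as follows. This Python sketch is our own simplification: the dict-based tree, the operator subset and the emitted statements are illustrative; the real translator covers the full SPARQL 1.0 algebra as described in [6]:</p>

```python
# Sketch of a bottom-up SPARQL-algebra-to-Pig-Latin translation.
# Children are translated first, then the operator itself is mapped
# to Pig Latin; each call returns the alias holding its result.

def translate(node, out):
    op = node["op"]
    if op == "BGP":
        alias = node["name"]
        # In the full system a BGP expands to LOADs, FILTERs and JOINs
        # for its triple patterns; a single LOAD stands in for that here.
        out.append(f"{alias} = LOAD 'rdf' USING rdfLoader() AS (s,p,o);")
        return alias
    if op == "Filter":
        child = translate(node["child"], out)
        alias = node["name"]
        out.append(f"{alias} = FILTER {child} BY {node['cond']};")
        return alias
    if op == "LeftJoin":  # SPARQL OPTIONAL maps to a left outer join
        left = translate(node["left"], out)
        right = translate(node["right"], out)
        alias = node["name"]
        out.append(f"{alias} = JOIN {left} BY {node['var']} LEFT OUTER, "
                   f"{right} BY {node['var']};")
        return alias
    raise ValueError(f"unsupported operator: {op}")

# Algebra tree of the example query from Figure 1.
tree = {"op": "LeftJoin", "name": "LJ", "var": "person",
        "left": {"op": "Filter", "name": "F", "cond": "age >= 18",
                 "child": {"op": "BGP", "name": "BGP1"}},
        "right": {"op": "BGP", "name": "BGP2"}}
script = []
result = translate(tree, script)
```

      <p>Because every operator emits a self-contained sequence of statements over the aliases of its children, the generated script is valid regardless of which optimizations reordered the tree beforehand.</p>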
      <p>Fig. 1 illustrates the translation pipeline (SPARQL Query &#8594; Parser &#8594; Syntax Tree &#8594; Algebra Compiler &#8594; Algebra Tree &#8594; Algebra Optimizer &#8594; Algebra Tree &#8594; Pig Latin Translator &#8594; Pig Latin Program &#8594; Pig &#8594; MapReduce) for the following example query:</p>
      <preformat>SELECT * WHERE { ?person knows Peter . ?person age ?age
  OPTIONAL { ?person mbox ?mb } FILTER (?age &gt;= 18) }</preformat>
      <p>The optimized algebra tree consists of BGP 1 (?person knows Peter . ?person age ?age), Filter 2 (?age &gt;= 18), BGP 3 (?person mbox ?mb) and LeftJoin 4, which translate into the following Pig Latin program:</p>
      <preformat>knows = LOAD 'rdf/knows' USING rdfLoader() AS (s,o);
age = LOAD 'rdf/age' USING rdfLoader() AS (s,o);
f1 = FILTER knows BY o == 'Peter';
t1 = FOREACH f1 GENERATE s AS person;
t2 = FOREACH age GENERATE s AS person, o AS age;
j1 = JOIN t1 BY person, t2 BY person;
BGP1 = FOREACH j1 GENERATE t1::person AS person, t2::age AS age;
F = FILTER BGP1 BY age &gt;= 18;
mbox = LOAD 'rdf/mbox' USING rdfLoader() AS (s,o);
BGP2 = FOREACH mbox GENERATE s AS person, o AS mb;
lj = JOIN F BY person LEFT OUTER, BGP2 BY person;
LJ = FOREACH lj GENERATE F::person AS person, F::age AS age, BGP2::mb AS mb;
STORE LJ INTO 'output' USING resultWriter();</preformat>
    </sec>
    <sec id="sec-3">
      <title>PigSPARQL as a Baseline</title>
      <p>
Most of the published MapReduce based approaches are proof-of-concept
implementations which are neither well documented, nor run out of the box,
nor are publicly available. Evaluating such systems is time consuming and
easily leads to inexplicable results. This hampers the comparability of different
proposed solutions, which is a key driver for further development. To introduce a
stable basis for comparison, we suggest PigSPARQL as an easy to use baseline
for SPARQL query processing with MapReduce for the following reasons:
1. PigSPARQL is a reliable and stable system as it uses Pig as an intermediate
layer, which is widely used and maintained by Yahoo! Research. Pig's
processing framework is fairly competitive and continuously optimized and
enhanced with new features. This is confirmed by Figure 2.a, which shows,
as an example, the runtime improvement of PigSPARQL for SP2Bench Query 2
between Pig 0.5.0 and Pig 0.11.0, where we can observe a speed-up by an order
of magnitude without changing a single line of code. Other queries exhibit a
similar behavior, as can be expected because of PigSPARQL's architecture.
2. For a comprehensive evaluation of different systems, they should be
installable and usable with reasonable effort, without the need for tricky
configurations. In the context of such evaluations, PigSPARQL is very attractive.
The LUBM evaluation of HadoopRDF [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], for example, took us several weeks
including exhaustive troubleshooting, whereas the same evaluation with
PigSPARQL was done in only one day.
3. We evaluated the competitiveness of PigSPARQL with respect to three other
MapReduce based SPARQL engines using LUBM, as some of these
systems only support basic graph patterns: (1) HadoopRDF [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is an advanced
SPARQL engine that utilizes a cost-based execution plan for reduce-side
joins. (2) MAPSIN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is a map-side index nested loop join based on HBase.
(3) Merge Join [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a MapReduce adaptation of merge joins for SPARQL
basic graph patterns. Figure 2.b illustrates the execution times for LUBM
Query 4, distinguishing between n-way and 2-way join execution where
supported. PigSPARQL shows competitive runtime performance while scaling
smoothly with increasing dataset size. MAPSIN performs a bit
faster; however, it uses a sophisticated storage schema based on HBase that
works well for selective queries but degrades significantly in performance for
less selective ones [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. All these approaches need a rather time consuming initial
preprocessing of up to several hours, compared to the vertical partitioning of
PigSPARQL, which took less than 14 minutes for 1.6 billion triples.
      </p>
      <p>Fig. 2. (a) Runtimes (in seconds) of PigSPARQL for SP2Bench Query 2 with Pig 0.5.0 vs. Pig 0.11.0, for datasets of 100 to 1600 million triples. (b) LUBM Query 4 execution times of Merge Join (n-way, 2-way), MAPSIN (n-way, 2-way), PigSPARQL (n-way, 2-way) and HadoopRDF, for 500 to 3000 universities.</p>
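      <p>The distinction between n-way and 2-way join execution can be made concrete with a rough iteration-count model. This Python sketch is purely illustrative and is not the cost model of any of the evaluated systems:</p>

```python
def join_iterations(num_patterns, strategy):
    """Rough model of the MapReduce iterations needed to join triple
    patterns that all share one join variable (a star pattern): an n-way
    join processes all of them in a single iteration, whereas cascaded
    2-way joins need one iteration per additional pattern."""
    if num_patterns > 1:
        return 1 if strategy == "n-way" else num_patterns - 1
    return 0

# A LUBM-Q4-style star query with 4 patterns on the same join variable:
n_way = join_iterations(4, "n-way")
two_way = join_iterations(4, "2-way")
```

      <p>Since each MapReduce iteration carries a fixed startup and shuffle overhead, fewer iterations generally translate into lower query runtimes, which is why n-way execution tends to win for star-shaped patterns.</p>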
      <p><bold>Conclusion.</bold> PigSPARQL is an easy to use and competitive baseline for the
comparison of MapReduce based SPARQL processing. With its support for SPARQL
1.0, it already exceeds the functionality of most existing research prototypes.
For future work, we plan to add support for additional SPARQL 1.1 features.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. (
          <year>2013</year>
          ), http://dbis.informatik.uni-freiburg.de/forschung/projekte/DiPoS/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <source>Scalable SPARQL Querying of Large RDF Graphs</source>
          . PVLDB
          <volume>4</volume>
          (
          <issue>11</issue>
          ),
          <fpage>1123</fpage>
          -
          <lpage>1134</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Husain</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          , et al.:
          <article-title>Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing</article-title>
          .
          <source>IEEE TKDE 23(9)</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Olston</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tomkins</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Pig Latin: A Not-So-Foreign Language for Data Processing</article-title>
          . In: SIGMOD. pp.
          <fpage>1099</fpage>
          -
          <lpage>1110</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Schatzle,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Przyjaciel-Zablocki</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          , et al.:
          <article-title>Cascading Map-Side Joins over HBase for Scalable Join Processing</article-title>
          . In: SSWS+HPCSW. p.
          <volume>59</volume>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Schatzle,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Przyjaciel-Zablocki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Lausen</surname>
          </string-name>
          , G.:
          <article-title>PigSPARQL: Mapping SPARQL to Pig Latin</article-title>
          .
          <source>In: Proc. SWIM</source>
          . pp.
          <fpage>4:1</fpage>
          -
          <lpage>4:8</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>