<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparing Schema Advancements for Distributed RDF Querying Using SparkSQL</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohamed Ragab</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Tommasini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sherif Sakr</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, Tartu University</institution>
          ,
          <addr-line>Tartu</addr-line>
          ,
          <country country="EE">Estonia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>Linked Data reveals the need for big semantic data processing. The literature already discusses numerous attempts at leveraging the relational engines of Big Data frameworks like Apache Spark to run SPARQL queries at scale. However, the choice of a relational schema to store RDF data may significantly impact query performance, and hence various alternatives exist. In this paper, we investigate the improvement of two recent proposals, i.e., Extended Vertically Partitioned Tables and Wide Property Tables, w.r.t. the baseline approaches Vertically Partitioned Tables and Property Tables. To generalize our results, we observe how the two schemas behave together with different RDF partitioning techniques and HDFS storage data formats. We run our experiments using SparkSQL over a 100-million-triple dataset generated using SP2Bench.</p>
      </abstract>
      <kwd-group>
        <kwd>RDF Relational Schemata</kwd>
        <kwd>Spark-SQL</kwd>
        <kwd>SPARQL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Semantic Web community is investigating how to leverage frameworks like
Apache Spark to run SPARQL queries at scale. Since Big Data frameworks excel
in relational data analysis, several solutions have been proposed to utilize their
query engines for processing large RDF graphs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Intuitively, the choice of a
relational schema significantly impacts the performance of query processing. For
instance, the Single Statement Table Schema (ST), which stores triples in a
single ternary relation (subject, predicate, object), often requires
many self-joins. Several alternatives to ST exist: (i) the Vertically
Partitioned Tables Schema (VP) uses one binary relation (subject,
object) for each unique predicate in the dataset; (ii) the Property Table Schema
(PT) uses n-ary relations to represent RDF triples, grouping those with the
same subject.
      </p>
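      <p>The three layouts described above can be illustrated with a minimal sketch in plain Python. The toy triples, the function names, and the simplification that every predicate is single-valued are assumptions made for illustration only; they are not taken from the benchmark or from the systems under study.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Toy triples (subject, predicate, object); all names are hypothetical.
TRIPLES = [
    ("article1", "title", "RDF on Spark"),
    ("article1", "creator", "alice"),
    ("article2", "title", "SPARQL at scale"),
    ("article2", "creator", "bob"),
    ("alice", "name", "Alice"),
]


def single_statement_table(triples):
    """ST: one ternary relation that holds every triple."""
    return list(triples)


def vertically_partitioned(triples):
    """VP: one binary (subject, object) relation per distinct predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def property_table(triples):
    """PT-style: one row per subject, one column per predicate.

    Missing properties are stored as None. Assumes single-valued
    predicates to keep the sketch short.
    """
    predicates = sorted({p for _, p, _ in triples})
    rows = defaultdict(lambda: dict.fromkeys(predicates))
    for s, p, o in triples:
        rows[s][p] = o
    return dict(rows)
```
]]></preformat>
      <p>In this sketch, looking up the title and the creator of an article needs two lookups that must be combined on the subject in the ST and VP layouts, whereas the property-table row already holds both values side by side.</p>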
      <p>
        In this paper, we aim at validating two further improvements proposed for
Apache Spark, i.e., the Extended Vertically-Partitioned Table (ExtVP), which
extends VP with precomputed semi-join tables [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to reduce data shuffling, and
the Wide Property Table (WPT) schema, which extends PT by storing the
whole dataset in a single table [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and thereby minimizes the number of joins. We
performed an extensive comparative evaluation of ExtVP and WPT against VP and PT,
respectively, using SparkSQL. To generalize the study, we test the approaches
under varying experimental conditions<sup>2</sup>: we examine how ExtVP and WPT
perform when combined with RDF-based partitioning techniques and storage formats.
Partitioning techniques impact query execution because they change data locality.
Data formats, on the other hand, are considered by the Spark optimizer and thus
impact the query plan. In particular, we used the following partitioning
techniques: (i) Horizontal (HO) partitioning, which divides the data evenly into n
chunks, where n is the number of machines in the cluster; (ii) Subject-based (SB)
and (iii) Predicate-based (PB) partitioning, which distribute triples across
partitions according to the hash value computed for the subject or the
predicate, respectively. Moreover, we use two row-oriented data formats, i.e., CSV
and Avro, and two column-oriented ones, i.e., ORC and Parquet. As a baseline
configuration, we chose the one used in the original papers: we store data in HDFS using
Parquet without any specific partitioning technique, i.e., No Partitioning (NP).
      </p>
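      <p>The three partitioning techniques can be sketched as follows. This is an illustrative approximation in plain Python rather than the Spark code used in the experiments: the round-robin assignment for horizontal partitioning, the CRC32 hash, and all function names are assumptions.</p>
      <preformat><![CDATA[
```python
import zlib


def stable_hash(value):
    """Deterministic hash; Python's built-in hash() is salted per process."""
    return zlib.crc32(value.encode("utf-8"))


def horizontal(triples, n):
    """HO: n near-equal chunks, assigned round-robin regardless of content."""
    parts = [[] for _ in range(n)]
    for i, triple in enumerate(triples):
        parts[i % n].append(triple)
    return parts


def hash_partition(triples, n, key_index):
    """Send each triple to the partition chosen by hashing one of its fields."""
    parts = [[] for _ in range(n)]
    for triple in triples:
        parts[stable_hash(triple[key_index]) % n].append(triple)
    return parts


def subject_based(triples, n):
    """SB: triples with the same subject land in the same partition."""
    return hash_partition(triples, n, 0)


def predicate_based(triples, n):
    """PB: triples with the same predicate land in the same partition."""
    return hash_partition(triples, n, 1)
```
]]></preformat>
      <p>Horizontal partitioning balances partition sizes but scatters related triples, while the two hash-based variants co-locate triples that share a subject or a predicate at the cost of possibly skewed partition sizes.</p>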
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <p>In our evaluation, we used data and queries from SP2Bench (SPARQL
Performance Benchmark). We prepared the SQL versions of the SP2Bench queries<sup>3</sup> for
ExtVP and WPT<sup>4</sup>. We generated a synthetic RDF dataset of 100M triples
in Notation3 format, which was sufficient for unveiling differences in
query execution under different experimental conditions. We ran the experiments
five times and computed an average<sup>5</sup> on a four-node bare-metal cluster (one master
node and three worker machines). Each node has 32 cores, 128 GB of RAM,
and a 2-TB SSD drive.</p>
      <p><bold>WPT vs. PT</bold></p>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref2 ref3">3, 2</xref>
        ], we expect that WPT outperforms PT, especially for the "star-shaped"
queries, which can be answered without any join operations when using the WPT
schema. Table 1 compares the number of joins required by the alternative SQL
translations of the SP2Bench queries. Except for Q8, which has many self-joins
of the WPT table, WPT always requires fewer joins than PT. Moreover, we expect
Spark to handle the sparsity caused by the WPT schema efficiently when using the
Parquet data format, since it ignores null values.
      </p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Number of SQL joins in PT vs. WPT.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Schema</th><th>Q1</th><th>Q2</th><th>Q3</th><th>Q4</th><th>Q5</th><th>Q6</th><th>Q8</th><th>Q10</th><th>Q11</th></tr>
          </thead>
          <tbody>
            <tr><td>PT</td><td>2</td><td>9</td><td>2</td><td>8</td><td>7</td><td>6</td><td>9</td><td>5</td><td>2</td></tr>
            <tr><td>WPT</td><td>0</td><td>0</td><td>0</td><td>3</td><td>3</td><td>3</td><td>10</td><td>3</td><td>0</td></tr>
          </tbody>
        </table>
      </table-wrap>
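      <p>To illustrate why a star-shaped query needs no join on a wide table, the sketch below answers the same toy query twice: once over a join-based layout and once over a single wide table. It uses the SQLite module from the Python standard library as a stand-in for SparkSQL, and per-predicate tables as the simplest join-based layout. The table names, column names, and data are hypothetical and do not correspond to the SP2Bench queries.</p>
      <preformat><![CDATA[
```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Join-based layout: one (subject, object) table per predicate.
cur.execute("CREATE TABLE vp_title (s TEXT, o TEXT)")
cur.execute("CREATE TABLE vp_creator (s TEXT, o TEXT)")
cur.executemany("INSERT INTO vp_title VALUES (?, ?)",
                [("a1", "RDF on Spark"), ("a2", "SPARQL at scale")])
cur.executemany("INSERT INTO vp_creator VALUES (?, ?)",
                [("a1", "alice"), ("a2", "bob")])

# Wide layout: one row per subject, one column per predicate, NULL if absent.
cur.execute("CREATE TABLE wpt (s TEXT PRIMARY KEY, title TEXT, creator TEXT, name TEXT)")
cur.executemany("INSERT INTO wpt VALUES (?, ?, ?, ?)", [
    ("a1", "RDF on Spark", "alice", None),
    ("a2", "SPARQL at scale", "bob", None),
    ("alice", None, None, "Alice"),
])

# Star query: title and creator of every subject that has both.
JOIN_QUERY = """
    SELECT t.s, t.o, c.o
    FROM vp_title AS t JOIN vp_creator AS c ON t.s = c.s
    ORDER BY t.s
"""
WIDE_QUERY = """
    SELECT s, title, creator
    FROM wpt
    WHERE title IS NOT NULL AND creator IS NOT NULL
    ORDER BY s
"""

join_rows = cur.execute(JOIN_QUERY).fetchall()
wide_rows = cur.execute(WIDE_QUERY).fetchall()
```
]]></preformat>
      <p>Both queries return the same rows, but the wide-table version replaces the join with null checks on a sparse table, which is where the handling of null values by the storage format becomes relevant.</p>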
      <p>
        Table 2 shows the overall benchmark results for the performance of WPT over
PT across all file formats (horizontally) and across the different partitioning
techniques (vertically). Values in this table specify the number of queries in
which the WPT schema performs better than the baseline PT schema<sup>6</sup>.
The experiments confirm that WPT outperforms the baseline PT schema in all the
queries (i.e., 9 queries out of 9 in the benchmark) using the baseline
configuration, i.e., the Parquet file format and no partitioning technique (NP).
These results confirm the state-of-the-art findings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
        <fn><label>2</label><p><uri>https://www.nist.gov/pml/nist-technical-note-1297/nist-tn-1297-appendix-d1terminology</uri></p></fn>
        <fn><label>3</label><p><uri>http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B/queries.php</uri></p></fn>
        <fn><label>4</label><p><uri>https://github.com/DataSystemsGroupUT/SPARKSQLRDFBenchmarking</uri></p></fn>
        <fn><label>5</label><p>We excluded the first run to avoid warm-up bias.</p></fn>
      </p>
      <table-wrap>
        <label>Table 2</label>
        <caption>
          <p>Number of queries for which WPT outperforms PT.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Partitioning</th><th>Avro</th><th>CSV</th><th>ORC</th><th>Parquet</th></tr>
          </thead>
          <tbody>
            <tr><td>None (NP)</td><td>2/9</td><td>2/9</td><td>8/9</td><td>9/9</td></tr>
            <tr><td>Horizontal (Ho)</td><td>2/9</td><td>3/9</td><td>6/9</td><td>6/9</td></tr>
            <tr><td>Subject-based (SP)</td><td>2/9</td><td>2/9</td><td>6/9</td><td>6/9</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>To investigate the relations between WPT
and PT, we introduce different file formats and partitioning techniques. Table 3
shows the effect of data partitioning (left) and storage formats (right) considering
the other new factors across all the experiments. To this end, we calculate the
percentages by grouping the experiments by partitioning technique and
counting how many times WPT is better than PT across all the experiments. We
calculated the storage effect in a similar way but grouping by file format.</p>
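      <p>The grouping described above amounts to summing the per-cell counts of Table 2 along one dimension and dividing by the number of query runs in that group. The following sketch assumes that each cell counts wins out of 9 queries, as reported in Table 2; the dictionary layout and function names are illustrative.</p>
      <preformat><![CDATA[
```python
# Wins of WPT over PT per (partitioning, format) cell, copied from Table 2.
TABLE2 = {
    "NP": {"Avro": 2, "CSV": 2, "ORC": 8, "Parquet": 9},
    "Ho": {"Avro": 2, "CSV": 3, "ORC": 6, "Parquet": 6},
    "SP": {"Avro": 2, "CSV": 2, "ORC": 6, "Parquet": 6},
}
QUERIES_PER_CELL = 9


def effect_by_partitioning(table, n_queries):
    """Percentage of wins per partitioning technique, over all formats."""
    return {
        part: round(100 * sum(row.values()) / (n_queries * len(row)), 2)
        for part, row in table.items()
    }


def effect_by_format(table, n_queries):
    """Percentage of wins per file format, over all partitioning techniques."""
    formats = list(next(iter(table.values())))
    return {
        fmt: round(100 * sum(row[fmt] for row in table.values())
                   / (n_queries * len(table)), 2)
        for fmt in formats
    }


partitioning_effect = effect_by_partitioning(TABLE2, QUERIES_PER_CELL)
format_effect = effect_by_format(TABLE2, QUERIES_PER_CELL)
```
]]></preformat>
      <p>For example, the NP row sums to 21 wins out of 36 query runs (58.33%), and the Parquet column sums to 21 wins out of 27 (77.78%), which matches the corresponding entries of Table 3.</p>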
      <p>Table 3 shows that WPT outperforms the PT schema in 58% of the experiments
when no partitioning technique is used. Moreover, WPT is an improvement in only
78% of the experiments that use Parquet as the file format. This unveils a
trade-off between file formats, partitioning techniques, and the relational
schemata. ORC, which is an alternative column-based file format,
gives results that are similar to Parquet (74%). Although Parquet is even better
because it efficiently handles the sparsity of the WPT schema, we can generalize
the benefits of column-based file formats. Indeed, only one SP2Bench query
projects more than two columns, which explains why columnar
formats give better results for WPT than the row-based ones. Row-oriented
formats have a negative impact on the WPT results: WPT outperforms PT in only
about 22% and 26% of the experiments when Avro and CSV are used, respectively. In
conclusion, the state-of-the-art results for WPT cannot be reproduced in the
presence of different formats and partitioning techniques.</p>
      <table-wrap>
        <label>Table 3</label>
        <caption>
          <p>Effects of partitioning techniques and storage formats on the WPT/PT comparison.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Partitioning technique</th><th>Partitioning effect</th><th>Storage format</th><th>Storage format effect</th></tr>
          </thead>
          <tbody>
            <tr><td>None (NP)</td><td>58.33%</td><td>Parquet</td><td>77.78%</td></tr>
            <tr><td>Horizontal (Ho)</td><td>47.22%</td><td>ORC</td><td>74.07%</td></tr>
            <tr><td>Subject-based (SP)</td><td>44.44%</td><td>CSV</td><td>25.93%</td></tr>
            <tr><td>Predicate-based (PB)</td><td>NA</td><td>Avro</td><td>22.22%</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p><bold>ExtVP vs. VP</bold></p>
      <p>
        In this section, we compare the ExtVP schema to the VP schema, using the same
approach followed in the previous section. According to [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we expect that
ExtVP provides execution times better than or at least similar to those of VP. The
enhancement depends on the percentage of semi-join reductions of table input sizes that
ExtVP introduces, which reduce the required data shuffling. Table 4 shows
the percentage by which ExtVP reduces the processed rows for each query relative to
the rows processed from the original input tables. We expect queries Q9, Q10, and Q11
to give results similar to the VP schema, as they do not present any input table
reductions.
        <fn><label>6</label><p>Green: the improvement always outperforms the baseline; Yellow: the improvement outperforms the baseline in at least 50% of the cases; Red: in less than 50% of the cases.</p></fn>
      </p>
      <table-wrap>
        <label>Table 4</label>
        <caption>
          <p>Percentage of reductions (Red) using ExtVP.</p>
        </caption>
        <table>
          <thead>
            <tr><th/><th>Q1</th><th>Q2</th><th>Q3</th><th>Q4</th><th>Q5</th><th>Q6</th><th>Q8</th><th>Q9</th><th>Q10</th><th>Q11</th></tr>
          </thead>
          <tbody>
            <tr><td>Red</td><td>58%</td><td>77%</td><td>59%</td><td>96%</td><td>60%</td><td>31%</td><td>5%</td><td>0%</td><td>0%</td><td>0%</td></tr>
          </tbody>
        </table>
      </table-wrap>
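      <p>The idea of a semi-join reduction can be sketched in a few lines of plain Python. The two toy VP tables, the column constants, and the helper names below are hypothetical, and the sketch only shows how a reduction percentage of the kind reported above could be derived for one pair of tables; it does not reproduce how the per-query figures were measured.</p>
      <preformat><![CDATA[
```python
# Toy VP tables as lists of (subject, object) pairs; all names are hypothetical.
FOLLOWS = [("a", "b"), ("b", "c"), ("c", "d")]
LIKES = [("a", "x"), ("c", "y")]

SUBJECT, OBJECT = 0, 1


def semi_join(left, right, left_col, right_col):
    """Keep the rows of `left` whose join value also occurs in `right`."""
    keys = {row[right_col] for row in right}
    return [row for row in left if row[left_col] in keys]


def reduction_percentage(original, reduced):
    """Share of rows removed by the semi-join, in percent."""
    if not original:
        return 0.0
    return 100.0 * (len(original) - len(reduced)) / len(original)


# Subject-subject correlation: FOLLOWS rows whose subject also has a LIKES triple.
follows_ss_likes = semi_join(FOLLOWS, LIKES, SUBJECT, SUBJECT)
# Object-subject correlation: FOLLOWS rows whose object is a subject in LIKES.
follows_os_likes = semi_join(FOLLOWS, LIKES, OBJECT, SUBJECT)
```
]]></preformat>
      <p>A query that joins the two tables on the subject can read the reduced table instead of the full one. When the reduction is 0%, the precomputed table is identical to the VP table, so no benefit over VP can be expected.</p>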
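      <p>Table 5 compares the performance of ExtVP with the baseline VP schema
on all the SP2Bench queries, for different formats (horizontally) and partitioning
techniques (vertically)<sup>5</sup>. We observe that for queries Q9, Q10, and Q11, which did not
benefit from any join reductions, ExtVP does not
beat VP even for the baseline configuration
(NP and Parquet).</p>
      <table-wrap>
        <label>Table 5</label>
        <caption>
          <p>Number of queries for which ExtVP outperforms VP.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Partitioning</th><th>Avro</th><th>CSV</th><th>ORC</th><th>Parquet</th></tr>
          </thead>
          <tbody>
            <tr><td>None (NP)</td><td>6/10</td><td>6/10</td><td>5/10</td><td>7/10</td></tr>
            <tr><td>Horizontal (Ho)</td><td>3/10</td><td>3/10</td><td>3/10</td><td>3/10</td></tr>
            <tr><td>Predicate-based (PB)</td><td>2/10</td><td>3/10</td><td>6/10</td><td>6/10</td></tr>
          </tbody>
        </table>
      </table-wrap>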
      <p>Table 6 also shows how far the data partitioning (left) and data formats
(right) impact the results of ExtVP in comparison to the performance of the VP schema.
Percentages are calculated in the same way as in Table 3, pivoting on the dimension
of choice, i.e., file format X or partitioning technique Y.</p>
      <p>The results confirm that partitioning significantly degrades the
performance of ExtVP, which outperforms the VP schema in 67.5% of the
experiments for the NP configuration. Predicate-based partitioning has the
least negative impact (55%), followed by horizontal partitioning (35%) and
subject-based partitioning, which is the worst with only 30%.
The small number of projections in SP2Bench queries suggests that columnar
file formats can fit such query workloads better than the row-oriented ones. ExtVP
outperforms VP in 55% of the cases where Parquet is used, but in only 45% of the
cases with ORC. ExtVP beats VP in only 42.5% of the experiments that
adopt either Avro or CSV. In conclusion, we cannot completely reproduce the
state-of-the-art results for ExtVP when different partitioning techniques or file
formats are used.</p>
      <table-wrap>
        <label>Table 6</label>
        <caption>
          <p>Effects of partitioning techniques and storage formats on the ExtVP vs. VP comparison.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Partitioning technique</th><th>Partitioning effect</th><th>Storage format</th><th>Storage format effect</th></tr>
          </thead>
          <tbody>
            <tr><td>None (NP)</td><td>67.5%</td><td>Parquet</td><td>55%</td></tr>
            <tr><td>Horizontal (Ho)</td><td>35%</td><td>ORC</td><td>45%</td></tr>
            <tr><td>Predicate-based (PB)</td><td>55%</td><td>Avro</td><td>42.5%</td></tr>
            <tr><td>Subject-based (SP)</td><td>30%</td><td>CSV</td><td>42.5%</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this paper, we investigated the reproducibility of RDF relational optimizations
within distributed Spark-SQL while introducing a more complex experimental solution
space. The optimized relational schemata can be affected by new experimental
factors such as data partitioning or storage data formats.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Abdelaziz</surname>, <given-names>I.</given-names></string-name>,
          <string-name><surname>Harbi</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Khayyat</surname>, <given-names>Z.</given-names></string-name>,
          <string-name><surname>Kalnis</surname>, <given-names>P.</given-names></string-name>:
          <article-title>A survey and experimental comparison of distributed SPARQL engines for very large RDF data</article-title>.
          <source>Proceedings of the VLDB Endowment</source>
          (<year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Arrascue Ayala</surname>, <given-names>V.A.</given-names></string-name>,
          <string-name><surname>Koleva</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Alzogbi</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Cossu</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Färber</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Philipp</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Schievelbein</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Taxidou</surname>, <given-names>I.</given-names></string-name>,
          <string-name><surname>Lausen</surname>, <given-names>G.</given-names></string-name>:
          <article-title>Relational schemata for distributed SPARQL query processing</article-title>.
          <source>In: International Workshop on Semantic Big Data</source>
          (<year>2019</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Schatzle,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Przyjaciel-Zablocki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Neu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Lausen</surname>
          </string-name>
          , G.:
          <article-title>Sempala: Interactive sparql query processing on hadoop</article-title>
          .
          <source>In: ISWC</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Schatzle,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Przyjaciel-Zablocki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Skilevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Lausen</surname>
          </string-name>
          , G.:
          <article-title>S2rdf: Rdf querying with sparql on spark</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>