<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Benchmarking Spark-SQL under Alliterative RDF Relational Storage Backends</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohamed Ragab</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Tommasini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sherif Sakr</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Milano, DEIB</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tartu University, Data Systems Group</institution>
          ,
          <addr-line>Tartu</addr-line>
          ,
          <country country="EE">Estonia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recently, a wide range of Web applications (e.g., DBPedia, Uniprot, and Probase) have been built on top of vast RDF knowledge bases and the SPARQL query language. The continuous growth of these knowledge bases has led to the investigation of new paradigms and technologies for storing, accessing, and querying RDF data. In practice, modern big data systems (e.g., Hadoop, Spark) can handle vast relational repositories; however, their application in the Semantic Web context is still limited. One possible reason is that such frameworks rely on distributed systems, which are good for relational data, whereas their performance on graph data models like RDF has not been well studied yet. In this paper, we present a systematic comparison of three relevant RDF relational schemas, i.e., Single Statement Table, Property Tables, and Vertically-Partitioned Tables, queried using Apache Spark. We evaluate the performance of the Spark-SQL query engine for processing SPARQL queries using three different storage backends, namely, PostgreSQL, Hive, and HDFS. For the latter, we compare four different data formats (CSV, ORC, Avro, and Parquet). We drove our experiments using a representative query workload from the SP2Bench benchmark scenario. The results of our experiments show many interesting insights about the impact of the relational encoding scheme, storage backends, and storage formats on the performance of the query execution process.</p>
      </abstract>
      <kwd-group>
        <kwd>Large RDF Graphs</kwd>
        <kwd>Apache Spark</kwd>
        <kwd>SPARQL</kwd>
        <kwd>Spark-SQL</kwd>
        <kwd>RDF Relational Schema</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Linked Data initiative is fostering increasing adoption of semantic
technologies like never before [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Vast RDF datasets (e.g., DBPedia, Uniprot, and
Probase) are now publicly available, and the challenges of storing, managing,
and querying such data remain open. Recently, the Semantic Web
community started investigating how to leverage big data processing frameworks
(e.g., Hadoop, Spark) to process large amounts of RDF data [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. A number
of systems were designed to handle a huge amount of RDF data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In
practice, big data processing frameworks demand data partitioning to exploit the
full power of a distributed solution. However, data partitioning is itself a main
challenge for scalable and distributed RDF query processing. In particular,
efficient partitioning of RDF data remains an open challenge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. An alternative
approach relies on storing RDF data using a relational schema. To this end,
several relational RDF storage schemas have been proposed [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], e.g., (i) Single Statement
Table Schema (ST): a schema in which all triples are stored in one single large
table [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]; (ii) Vertically Partitioned Tables Schema (VT): a schema
in which a table with only two columns (subject, object) is stored for each
property [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]; (iii) Property (n-ary) Table Schema (PT): a schema in which
multiple RDF properties are grouped and stored as columns in one table for
the same RDF subject [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In general, in relational processing of RDF
queries, the design of the underlying relational schema can significantly impact
the performance of query processing [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, a systematic analysis of
the performance of Big Data frameworks on answering queries over relational
RDF storage is still missing. In this paper, we take the first step towards filling
this gap by presenting a systematic analysis of the performance of the Spark-SQL
query engine for answering SPARQL queries over RDF repositories. In
particular, we experimentally evaluate the performance of various storage backends,
namely, PostgreSQL, Hive, and HDFS with textual and binary formats (e.g.,
CSV, Avro, Parquet, ORC). Moreover, we evaluate Spark-SQL under the three
aforementioned relational RDF schemas (Single Statement Table, Property
Tables, Vertically-Partitioned Tables) using various sizes of RDF databases and
different query workloads generated by the SP2Bench benchmark [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)</p>
      <p>The remainder of the paper is organized as follows: Section 2 presents an
overview of the required background information for our study. Section 3
describes the benchmarking scenario of our study. Section 4 describes the
experimental setup of our benchmark. Section 5 presents the results and discusses
various insights. We discuss the related work in Section 6 before we conclude the
paper in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>In this section, we present the background necessary to understand the content
of the paper. We assume that the reader is familiar with the RDF data model and
the SPARQL query language.</p>
      <sec id="sec-2-1">
        <title>Spark &amp; Spark-SQL</title>
        <p>
          Apache Spark [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] is an in-memory distributed computing framework for large
scale data processing. Its core abstractions are Resilient Distributed Datasets
(RDDs) and DataFrames. Both are immutable distributed collections of data
elements, but DataFrames are also organized according to a specific schema
into named and data-typed columns, like a table in a relational database.
SparkSQL [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] is a high-level library for processing structured data on top of DataFrames.
In particular, SparkSQL makes it possible to query the data stored in DataFrames.
        </p>
        <p>
The study of efficient storage of RDF data that also supports efficient data access
is still an important research problem. Although there have been some research
proposals for storing and querying RDF data in non-relational stores,
relational schemas remain an efficient and scalable solution [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. In the following,
we present three prominent relational RDF schemas. Moreover, we provide
examples for each schema, using the RDF data in Listing 1.1, and we translate the
SPARQL query in Listing 1.2 into the corresponding SQL query for that given
schema.
        </p>
        <preformat>:Journal1 rdf:type :Journal ;
    dc:title "Journal 1 (1940)" ;
    dcterms:issued "1940" .
:Article1 rdf:type :Article ;
    dc:title "richer dwelling scrapped" ;
    dcterms:issued "2019" ;
    :journal :Journal1 .</preformat>
        <p>Listing 1.1: RDF example in N-Triples. Prefixes are omitted.</p>
        <preformat>SELECT ?yr
WHERE { ?journal rdf:type bench:Journal .
        ?journal dc:title "Journal 1 (1940)"^^xsd:string .
        ?journal dcterms:issued ?yr . }</preformat>
        <p>Listing 1.2: SPARQL example against the RDF graph in Listing 1.1. Prefixes are
omitted.</p>
        <sec id="sec-2-1-1">
          <p>3 https://databricks.com/glossary/catalyst-optimizer</p>
          <p>4 https://parquet.apache.org/</p>
          <p>5 https://avro.apache.org/</p>
          <p>6 https://orc.apache.org/</p>
          <p>
            Single Statement Table Schema is the approach that has been adopted by
the majority of existing open-source RDF stores, e.g., Apache Jena, RDF4J and
Virtuoso, as well as by several centralized RDF processing systems [
            <xref ref-type="bibr" rid="ref11 ref18">11, 18</xref>
            ]. This
approach requires storing RDF datasets in a single triples table of three columns
that represent the three components of the RDF triple, i.e., Subject, Predicate,
and Object. Figure 1 shows the Single Statement Table representation of
the sample RDF graph shown in Listing 1.1, and the associated SQL translation
for the query in Listing 1.2.
          </p>
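          <p>As a minimal sketch of the Single Statement Table layout (an illustration, not the SQL translation used in the benchmark), the Python snippet below loads the triples of Listing 1.1 into one three-column table and answers the query of Listing 1.2 with two self-joins on the subject. The table name triples, the column names, the use of an in-memory SQLite database in place of Spark-SQL, and the reading of the default prefix in Listing 1.1 as the bench: namespace are all assumptions made for this example.</p>
          <preformat>
```python
import sqlite3

# Triples of Listing 1.1, kept as plain strings (assumption: ':' stands for 'bench:').
TRIPLES = [
    (":Journal1", "rdf:type", ":Journal"),
    (":Journal1", "dc:title", "Journal 1 (1940)"),
    (":Journal1", "dcterms:issued", "1940"),
    (":Article1", "rdf:type", ":Article"),
    (":Article1", "dc:title", "richer dwelling scrapped"),
    (":Article1", "dcterms:issued", "2019"),
    (":Article1", ":journal", ":Journal1"),
]

# One alias of the same table per triple pattern; the shared ?journal
# variable turns into self-joins on the subject column.
ST_QUERY = """
SELECT t3.object AS yr
FROM triples AS t1
JOIN triples AS t2 ON t2.subject = t1.subject
JOIN triples AS t3 ON t3.subject = t1.subject
WHERE t1.predicate = 'rdf:type' AND t1.object = ':Journal'
  AND t2.predicate = 'dc:title' AND t2.object = 'Journal 1 (1940)'
  AND t3.predicate = 'dcterms:issued'
"""


def query_single_statement_table():
    """Build the single triples table in memory and run the translated query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", TRIPLES)
    rows = conn.execute(ST_QUERY).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(query_single_statement_table())
```
          </preformat>
          <p>Each triple pattern of the SPARQL query maps to one alias of the same table, which is why the number of self-joins grows with the number of patterns under this schema.</p>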
          <p>
            Vertically Partitioned Tables Schema is an alternative storage schema
proposed by Abadi et al. [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] to speed up queries over RDF triple stores. In
this schema, the RDF triples table is decomposed into a table of two columns
(Subject, Object) for each unique property in the RDF dataset such that the
first (subject) column contains all subject URIs of that unique property, and
the second (object) column contains all the object values (URIs and literals) for those
subjects. Figure 2 shows the vertically partitioned tables schema of the sample
RDF graph shown in Listing 1.1, and the associated SQL translation for the
query in Listing 1.2.
          </p>
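          <p>The vertical partitioning step can be sketched in the same hedged way: the snippet below splits the triples of Listing 1.1 into one two-column table per property and answers the query of Listing 1.2 by joining three of those tables. The table-naming rule in table_name, the column names, and the use of in-memory SQLite instead of Spark-SQL are assumptions for illustration only.</p>
          <preformat>
```python
import sqlite3
from collections import defaultdict

# Triples of Listing 1.1, kept as plain strings (assumption: ':' stands for 'bench:').
TRIPLES = [
    (":Journal1", "rdf:type", ":Journal"),
    (":Journal1", "dc:title", "Journal 1 (1940)"),
    (":Journal1", "dcterms:issued", "1940"),
    (":Article1", "rdf:type", ":Article"),
    (":Article1", "dc:title", "richer dwelling scrapped"),
    (":Article1", "dcterms:issued", "2019"),
    (":Article1", ":journal", ":Journal1"),
]

# Each triple pattern reads its own property table; only the type and
# title tables need a filter on the object column.
VT_QUERY = """
SELECT issued.object AS yr
FROM rdf_type AS ty
JOIN dc_title AS title ON title.subject = ty.subject
JOIN dcterms_issued AS issued ON issued.subject = ty.subject
WHERE ty.object = ':Journal' AND title.object = 'Journal 1 (1940)'
"""


def table_name(predicate):
    """Hypothetical naming rule: drop a leading ':' and turn the other ':' into '_'."""
    return predicate.lstrip(":").replace(":", "_")


def build_vertical_tables(conn, triples):
    """Create one (subject, object) table per distinct property; return the table names."""
    by_property = defaultdict(list)
    for subject, predicate, obj in triples:
        by_property[table_name(predicate)].append((subject, obj))
    for name, pairs in by_property.items():
        conn.execute(f"CREATE TABLE {name} (subject TEXT, object TEXT)")
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", pairs)
    return sorted(by_property)


def query_vertical_tables():
    conn = sqlite3.connect(":memory:")
    build_vertical_tables(conn, TRIPLES)
    rows = conn.execute(VT_QUERY).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(query_vertical_tables())
```
          </preformat>
          <p>Compared with the single triples table, the predicate filters disappear because each property has its own table, but one join per additional triple pattern is still needed.</p>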
          <p>
            Property (n-ary) Tables Schema was proposed to cluster multiple RDF
properties as columns of an n-ary table for the same subject, grouping entities that
are similar in structure. An advantage of the Property Tables schema is that
it works well with highly structured RDF data, but not with poorly
structured data [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]. Figure 3 shows the relational flattened property tables of
the RDF graph in Listing 1.1 and the associated SQL translation for the query
in Listing 1.2.
          </p>
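          <p>For comparison, a hedged sketch of the property tables layout for the data of Listing 1.1 is given below. The two tables journal and article, their column names, and the use of in-memory SQLite are assumptions chosen for illustration; they are not the exact tables used in the benchmark.</p>
          <preformat>
```python
import sqlite3

# The rdf:type pattern is implied by the table that is read, and the title and
# issued values sit in the same row, so the query of Listing 1.2 needs no join.
PT_QUERY = "SELECT issued AS yr FROM journal WHERE title = 'Journal 1 (1940)'"


def query_property_tables():
    """Store each entity of Listing 1.1 as one row of its type's table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE journal (subject TEXT PRIMARY KEY, title TEXT, issued TEXT)"
    )
    conn.execute(
        "CREATE TABLE article (subject TEXT PRIMARY KEY, title TEXT, issued TEXT, journal TEXT)"
    )
    conn.execute(
        "INSERT INTO journal VALUES (?, ?, ?)",
        (":Journal1", "Journal 1 (1940)", "1940"),
    )
    conn.execute(
        "INSERT INTO article VALUES (?, ?, ?, ?)",
        (":Article1", "richer dwelling scrapped", "2019", ":Journal1"),
    )
    rows = conn.execute(PT_QUERY).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(query_property_tables())
```
          </preformat>
          <p>The same three triple patterns that required two joins in the previous schemas are answered here by a single-table scan, which illustrates why the number of joins is used later as a complexity measure.</p>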
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Benchmark Datasets &amp; Queries</title>
      <p>
        In our experimental evaluation, we have used one of the most popular RDF
Benchmarks, SP2Bench [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        SPARQL Performance Benchmark (SP2-Benchmark) Dataset:
SP2Bench is a publicly available, language-specific SPARQL performance
benchmark. We have chosen SP2Bench for our experimental evaluation, since it is one
of the most well-structured synthetic benchmarks [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], which fits
the goals of our study well. In particular, SP2Bench is centered around the
Computer Science DBLP scenario, and it comprises both a data generator that
enables the creation of arbitrarily large DBLP-like documents (in N-Triples
format) and a set of carefully designed benchmark SPARQL queries with
a high diversity score of benchmark query features, as stated by Saleem et al.
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Moreover, these queries cover most of the key SPARQL operators
as well as various data access patterns. The generated RDF dataset simulates
the key real-world characteristics of the academic social network distributions
encountered in the original DBLP datasets<sup>7</sup>.
      </p>
      <p>
        We have reused a relational schema similar to the one proposed by
Schmidt et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In their experiment, the SP2Bench RDF dataset contains nine
different relational entities, namely, Journal, Article, Book, Person, InProceeding,
Proceeding, InCollection, PhDThesis, MasterThesis, and WWW documents. This
schema is inspired by the original DBLP schema<sup>8</sup> that is generated by the SP2Bench
generator.
      </p>
      <sec id="sec-3-1">
        <p>7 https://dblp.uni-trier.de/db/</p>
        <p>8 DBLP-like RDF data produced by SP2Bench: http://dbis.informatik.uni-freiburg.de/forschung/projekte/SP2B/</p>
        <p>[Table 1 (values not recoverable): rows Q1, Q2, Q3(a), Q4, Q5(a), Q6, Q7, Q8, Q9, Q10, Q11.]</p>
        <p>The set of queries selected for our experiments is associated with the SP2Bench
scenario. These queries implement meaningful requests on top of the RDF data
generated by the SP2Bench generator, covering a variety of SPARQL operators
as well as various RDF access patterns. The list of queries, including a short
textual description of each query, can be found on the benchmark project website<sup>9</sup>.
Notably, Q9 is not applicable for the PT relational schema.</p>
        <p>In our experiments, we focus on two query features that give an indication
of the query complexity, namely, the number of joins and the number of projected
variables. Table 1 summarizes these complexity measures for the SP2Bench queries in
SPARQL and for the SQL translations related to each RDF relational
schema. We use the number of projected variables in the SQL statements as
an indicator for the performance comparison between the data formats of the
storage backends in terms of being row-oriented (e.g., Avro) or column-oriented
(e.g., Parquet or ORC).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Setup</title>
      <p>In this section, we describe our experimental environment. In addition, we
discuss how we configured our experimental hardware and software components.
Furthermore, we describe how we prepared and stored the datasets. Finally, we
provide the design details of our experiments.</p>
      <sec id="sec-4-1">
        <title>Hardware and Software Configurations</title>
        <p>Our experiments have been executed on a desktop PC running a Cloudera Virtual Machine (VM) v.5.13 with
the CentOS v7.3 Linux system, on an Intel(R) Core(TM) i5-8250U 1.60 GHz
x64-based CPU with 24 GB of DDR3 physical memory. We also used a 64 GB
virtual hard drive for our VM. We used the Spark v2.3 parcel on the Cloudera VM to
fully support Spark-SQL capabilities. We used the Hive service already installed
on the Cloudera VM (version: hive-1.1.0+cdh5.16.1+1431). We also installed the
PostgreSQL relational database (v11.4).</p>
        <p>9 http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B/queries.php</p>
        <p>
          Benchmark datasets: Using the SP2Bench generator, we generated three
synthetic RDF datasets of increasing size (100K, 1M, and 10M triples) in N-Triples
format. For SP2Bench, 11 SPARQL queries are provided together with their translations
into the relational schemas<sup>10</sup>. We have evaluated all 11 of these queries, which are of type
SELECT. In some experiments, the results of Q7 and Q9 are missing. In particular,
the results of Q7 are missing for the 10M triples dataset using the Property Tables
schema, as its execution time is very long (more than 30 minutes to complete)
even after caching the DataFrames of its join tables, while the results of Q9 are missing
for all dataset sizes, as it is not implemented in the third schema (property tables
relational schema, PT) according to [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], and as shown on the previously mentioned
SPARQL-to-SQL translations webpage. We have used these translations of the queries from
SPARQL into SQL so that they are compliant with the Spark-SQL framework in our
experiment.
        </p>
        <p>
          Data Storage: We have conducted our experiments using various data storage
backends and data storage file formats. We used the Spark framework to
convert the data from the CSV format (generated by processing the N-Triples files)
into the other HDFS file formats (Avro, Parquet, and ORC), because Spark handles
the conversion of large files quickly and supports reading and writing different
file formats from and to HDFS. The same approach has also been used to load the data into the
tables of the Apache Hive data warehouse (DWH), using three databases,
one for each dataset size. Loading the CSV data into the Hive
data warehouse requires slightly different settings. In particular, to store
data into Hive tables, Hive support must be enabled in the Spark
session configuration using the enableHiveSupport function. Moreover, the
Hive metastore URI (a Thrift URI) and the warehouse location must also be
specified in the Spark session configuration.
Last but not least, we have also created three PostgreSQL databases, one for
each dataset size, and created tables within them with the expected schema and
data types for each table according to the different RDF relational schemas (ST,
VT, and PT). Then, we loaded the data into the PostgreSQL tables from the
CSV files using the PostgreSQL COPY command.
        </p>
        <p>
Experiments: The main goal of our benchmarking experiment is to evaluate
and compare the execution times of the SQL translations of the SPARQL queries
over the Spark-SQL framework using the three introduced relational schemas
on top of the different storage backends. We have used the standard
SP2Bench SPARQL benchmark as one of the most popular and well-structured
synthetic RDF benchmarks [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. SP2Bench comes with several SPARQL queries
for evaluating the performance of different triple stores. In this experiment, we
focused on the SELECT queries of the benchmark. In particular, we selected
11 queries (Table 1) and used their SQL translations to conduct our experiment.
        </p>
        <p>10 http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B/translations.html</p>
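        <p>The data-storage step described above could look roughly like the following PySpark sketch. It is an assumption-laden illustration rather than the code used in the study: the function names build_session and convert_table, the application name, the output directory layout, and all paths, table names, and URIs passed in by the caller are hypothetical. Note also that on Spark 2.3 writing Avro requires the external spark-avro package and its format name, whereas the short name "avro" is built in only from Spark 2.4 onwards.</p>
        <preformat>
```python
def build_session(metastore_uri, warehouse_dir):
    """Create a Hive-enabled SparkSession (requires pyspark to be installed)."""
    # Imported lazily so that convert_table can be used with any session-like object.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.appName("rdf-storage-conversion")
        # Thrift URI of the Hive metastore, supplied by the caller (placeholder value).
        .config("hive.metastore.uris", metastore_uri)
        # Location of the Hive warehouse directory, supplied by the caller.
        .config("spark.sql.warehouse.dir", warehouse_dir)
        .enableHiveSupport()
        .getOrCreate()
    )


def convert_table(spark, csv_path, out_dir, hive_table, avro_format="avro"):
    """Read one CSV table and write it as Parquet, ORC, Avro, and a Hive table."""
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.write.mode("overwrite").parquet(out_dir + "/parquet")
    df.write.mode("overwrite").orc(out_dir + "/orc")
    # avro_format is an assumption: use the spark-avro package name on Spark 2.3.
    df.write.mode("overwrite").format(avro_format).save(out_dir + "/avro")
    # Requires a session created with enableHiveSupport().
    df.write.mode("overwrite").saveAsTable(hive_table)
    return df
```
        </preformat>
        <p>A caller would first obtain a session with build_session, passing its own metastore URI and warehouse directory, and then invoke convert_table once per relational table and dataset size. Loading PostgreSQL with the COPY command happens outside Spark and is not covered by this sketch.</p>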
        <p>The descriptions of the translations from SPARQL to SQL for all schemas (ST,
VT, and PT) are available on the translation page of SP2Bench. We have also made
the SQL translations of the SPARQL queries that we used for the different relational
schemas available in our project repository<sup>11</sup>.</p>
        <p>We used the Spark.time function, passing the Spark.sql(query) query
execution function as a parameter. The output of this function is the running time
of evaluating the SQL query in the Spark environment using the Spark session
interface. All queries are evaluated for all schemas and on top of all the different
storage backends: Hive, PostgreSQL, and the HDFS file formats, namely, CSV,
Parquet, ORC, and Avro.</p>
        <p>For each storage backend and relational schema, we run the experiments
for all queries five times. We exclude the first (cold-start) run to avoid the
warm-up bias and compute the average of the other four run times.</p>
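        <p>The measurement protocol above (five runs, the first one discarded, the remaining four averaged) can be sketched as follows. This is an illustrative stand-in, not the authors' harness: it times an arbitrary Python callable with time.perf_counter instead of calling Spark, and the helper names timed_runs and warm_average are hypothetical.</p>
        <preformat>
```python
import time
from statistics import mean


def timed_runs(run_query, runs=5):
    """Execute run_query `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations


def warm_average(durations):
    """Average every run except the first (cold-start) one."""
    warm = durations[1:]
    if not warm:
        raise ValueError("need at least two runs to discard the cold start")
    return mean(warm)


if __name__ == "__main__":
    # Stand-in workload; with PySpark one might pass
    # lambda: spark.sql(query).collect() for an existing session and query.
    durations = timed_runs(lambda: sum(range(100000)))
    print(warm_average(durations))
```
        </preformat>
        <p>Keeping the raw per-run durations separate from the averaging step makes it easy to inspect how much the cold-start run differs from the warm runs.</p>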
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Results</title>
      <p>In this section, we present the results of our experiments and discuss several
interesting insights on the performance of the Spark-SQL query engine using
the various relational RDF storage schemas and the various storage backends.</p>
      <sec id="sec-5-1">
        <title>Query Performance Analysis</title>
        <p>11 https://github.com/DataSystemsGroupUT/SPARKSQLRDFBenchmarking</p>
        <p>Fig. 4: Query Execution Times for 100K Triples dataset. Panels: (a) short running queries, ST schema; (b) long running queries, ST schema; (c) short running queries, VT schema; (d) long running queries, VT schema; (e) short running queries, PT schema; (f) long running queries, PT schema.</p>
        <p>Fig. 5: Query Execution Times for 1M Triples dataset. Panels: (a) short running queries, ST schema; (b) long running queries, ST schema; (c) short running queries, VT schema; (d) long running queries, VT schema; (e) short running queries, PT schema; (f) long running queries, PT schema.</p>
        <p>Fig. 6: Query Execution Times for 10M Triples dataset. Panels: (a) short running queries, ST schema; (b) long running queries, ST schema; (c) short running queries, VT schema; (d) long running queries, VT schema; (e) short running queries, PT schema; (f) long running queries, PT schema.</p>
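        <table-wrap>
          <table>
            <thead>
              <tr><th>Schema and dataset</th><th>Avro</th><th>CSV</th><th>Hive</th><th>ORC</th><th>Parquet</th><th>PostgreSQL</th></tr>
            </thead>
            <tbody>
              <tr><td>ST100K</td><td>0.0%</td><td>0.0%</td><td>9.1%</td><td>54.5%</td><td>27.3%</td><td>9.1%</td></tr>
              <tr><td>VT100K</td><td>9.1%</td><td>9.1%</td><td>0.0%</td><td>0.0%</td><td>63.6%</td><td>18.2%</td></tr>
              <tr><td>PT100K</td><td>0.0%</td><td>40.0%</td><td>0.0%</td><td>40.0%</td><td>10.0%</td><td>10.0%</td></tr>
              <tr><td>ST1M</td><td/><td/><td/><td/><td/><td/></tr>
              <tr><td>VT1M</td><td/><td/><td/><td/><td/><td/></tr>
              <tr><td>PT1M</td><td/><td/><td/><td/><td/><td/></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>[...] increase in the average run times when using the PT schema, followed by the
VT schema, which is better than the ST schema for all queries as well as for the
majority of the storage backends, except for PostgreSQL. The same
observation can be made for the long-running queries (Figures 6(b), (d), and (f)). In
particular, the PT schema greatly outperforms the VT and ST schemas.
This is because PT requires the fewest joins, followed by VT and
then by the ST schema.</p>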
      </sec>
      <sec id="sec-5-2">
        <title>Storage Backends Performance Analysis</title>
        <p>Let us now investigate how different storage backends impact the performance
in our experiments. Tables 2 and 3 report how many times a particular backend
achieves the best or the lowest performance, respectively, considering the results
of all experiments.</p>
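        <p>How such best and lowest shares can be derived from per-query run times is sketched below. The function share_by_backend and the toy timings are hypothetical and serve only to illustrate the counting; they are not measurements from this study.</p>
        <preformat>
```python
from collections import Counter


def share_by_backend(times, pick=min):
    """Return {backend: percent of queries} for the backend chosen per query.

    times maps each query to a {backend: seconds} dictionary.
    pick=min selects the fastest backend per query, pick=max the slowest.
    Ties go to the backend listed first for that query.
    """
    chosen = Counter(
        pick(per_backend, key=per_backend.get) for per_backend in times.values()
    )
    return {
        backend: round(100.0 * count / len(times), 1)
        for backend, count in chosen.items()
    }


if __name__ == "__main__":
    # Made-up timings for three hypothetical queries, for illustration only.
    toy = {
        "QA": {"ORC": 1.0, "Parquet": 1.5, "CSV": 4.0},
        "QB": {"ORC": 2.0, "Parquet": 1.2, "CSV": 5.0},
        "QC": {"ORC": 0.8, "Parquet": 0.9, "CSV": 3.0},
    }
    print(share_by_backend(toy, min))
    print(share_by_backend(toy, max))
```
        </preformat>
        <p>With 11 queries, for example, a backend that is the fastest for 6 of them obtains a share of 54.5%, and one that is the fastest for a single query obtains 9.1%.</p>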
        <p>Considering the dataset with a size of 100K triples, we observe that for the ST
schema (Figures 4(a) and (b)), Hive is the lowest performing backend in 81.8%
of the queries, for both short and long-running ones. Hadoop CSV immediately
follows. In contrast, HDFS with the ORC file format is the best performing storage
backend in 54.5% of the queries, followed by HDFS Parquet. Notably, Hive and
PostgreSQL each achieve the best performance for one query out of eleven, Q10 and
Q11, respectively.</p>
        <p>For the VT schema (Figures 4(c) and (d)), we notice that PostgreSQL is
the lowest performing storage backend in 81.8% of the queries; HDFS with the ORC
file format has the lowest performance for 2 out of 11 queries. HDFS
with the Parquet file format is the best performing storage backend for 63.6% of the
queries. PostgreSQL immediately follows by outperforming the other backends
in 18.2% of the cases (query evaluations).</p>
        <p>Last but not least, for the PT schema (Figures 4(e) and (f)), PostgreSQL
is the lowest performing storage backend in 90% of the queries, the exception being Q3,
where Avro is the lowest performing one. HDFS with the CSV and ORC
file formats are each the best performing backend in 40% of the queries, which,
we recall, do not include Q9.</p>
        <p>Considering the dataset of 1M triples, for the ST schema (Figures 5(a)
and (b)), PostgreSQL has the lowest performance in 54.5% of the queries, followed
by CSV (45.5%). ORC is the best performing storage format in 81.8% of the
queries, followed by Parquet and Hive (9.1%).</p>
        <p>For the VT schema (Figures 5 (c) and (d)), Postgres is still the lowest
performing backend in 81.8% of the queries, followed by CSV (18.2%). ORC is the
best performing backend in almost all the queries (90.9%), except for Q1 where
HDFS CSV has the highest performance.</p>
        <p>For the PT schema (Figures 5(e) and (f)), the performance dropped
dramatically, with almost the same outcomes as for the VT schema. PostgreSQL is always
the lowest performing backend, while ORC is the best performing backend for 7
out of 10 queries (Q9 is not applicable here). For the remaining queries, Q1, Q3, and
Q7, HDFS Parquet has the highest performance.</p>
        <p>Regarding the dataset with 10M triples, for the ST schema (Figures 6(a)
and (b)), CSV is the lowest performing storage backend in 90% of the queries,
with the exception of Q10, where Avro has the lowest performance. The
best performing storage backend is ORC in 63.6% of the cases, followed by
Parquet (18.2%).</p>
        <p>For the VT schema (Figures 6(c) and (d)), CSV is still the lowest performing
storage backend in 90.9% of the cases, followed by Avro, which has the lowest
performance in Q4 this time. ORC has the best performance in 45.5% of the
queries, followed by Parquet in 36.4%, and then PostgreSQL in 18.2%.</p>
        <p>For the PT schema (Figures 6(e) and (f)), which, we recall, for the 10M triples
dataset include neither Q7 nor Q9, we observe that the CSV file
format is the lowest performing storage backend in 66.7% of the cases. The best
storage backends are HDFS with the ORC and Parquet file formats, both in 44.4%
of the cases. Only for Q4 does Hive show the highest performance this time.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Related Work</title>
      <p>
        Several related experimental evaluations and comparisons of the relational-based
evaluation of SPARQL queries over RDF databases have been presented in the
literature [
        <xref ref-type="bibr" rid="ref17 ref8">8, 17</xref>
        ]. For example, Schmidt et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] performed an experimental
comparison between existing RDF storage approaches using the SP2Bench
performance suite and pure relational implementations of RDF data models,
namely, the Single Triples relation, the Flattened Tables of clustered properties
relation, and the Vertical Partitioning relations. In particular, they compared the
native RDF scenario, using the Sesame SPARQL engine (currently known as RDF4J<sup>12</sup>),
which relies on a native RDF store, on the SP2Bench dataset, with a
translation of the same SP2Bench scenario into pure relational database technologies.
Another experimental comparison of the single triples table and vertically
partitioned relational schemas was conducted by Alexaki et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] in which the
additional costs of predicate table unions in the vertically partitioned tables
scenario are clearly shown. This experiment was also similar to the ones performed
by Abadi et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], followed by Sidirourgos et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] who used the Barton
library catalog data scenario<sup>13</sup> to evaluate a similar comparison between the
Single Triples schema and the Vertical schema. On the other hand, Owens et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]
performed benchmarking experiments for comparing different RDF stores (e.g.,
Allegrograph<sup>14</sup>, BigOWLIM<sup>15</sup>) using different RDF benchmarks (e.g., LUBM<sup>16</sup>)
and RDBMS benchmarks (e.g., the Transaction Processing Performance Council
(TPC-C) benchmark<sup>17</sup>). This work focuses on a detailed comparison of pure RDF
stores using SPARQL, without any relational schema
implementations or comparisons.
      </p>
      <p>To the best of our knowledge, our benchmarking study is the first that
considers evaluating and comparing various relational-based schemas for processing
RDF queries on top of the big data processing framework Spark, using
different backend storage techniques.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>Apache Spark is a prominent Big Data framework that offers a high-level SQL
interface, Spark-SQL, optimized by means of the Catalyst query optimizer. In
this paper, we conducted a systematic evaluation of the performance of the
Spark-SQL query engine for answering SPARQL queries over different relational
encodings of RDF datasets. In particular, we studied the performance of
Spark-SQL using three different storage backends, namely, HDFS, Hive, and PostgreSQL.
For HDFS, we compared four different data formats, namely, CSV, ORC, Avro,
and Parquet. We used SP2Bench to generate our experimental RDF datasets.
We translated the benchmark queries into SQL, storing the RDF data using
Spark's DataFrame abstraction. To this end, we evaluated three different
approaches for RDF relational storage, i.e., Single Triples Table Schema, Vertically
Partitioned Tables Schema, and Property Tables Schema.</p>
      <p>12 https://rdf4j.eclipse.org/</p>
      <p>13 http://simile.mit.edu/rdf-test-data/barton</p>
      <p>14 https://franz.com/agraph/allegrograph3.3/</p>
      <p>15 http://www.proxml.be/products/bigowlim.html</p>
      <p>16 http://swat.cse.lehigh.edu/projects/lubm/</p>
      <p>17 http://www.tpc.org/tpcc/</p>
      <p>The results of our experiments show that the Property (n-ary) Tables schema is
able to achieve better performance in terms of query execution times. This is
due to the extensive number of joins and self-joins required by the Vertically
Partitioned and Single Statement Table schemas. For the same reason, the
Vertically Partitioned schema works, most of the time, better than the Single Statement Table schema.
Regarding the supported Spark storage backend alternatives, the results have
shown that using columnar HDFS file formats provides better performance for
short running queries. The main reason is that most of the SP2Bench queries
have a small number of projections, so columnar file storage
backends are able to perform better. On the other hand, PostgreSQL, CSV, and Hive
are shown to be the lowest performing storage options, in that order. Last but
not least, scaling up the dataset sizes from 100K to 10M triples showed
a dramatic performance improvement for the Property Tables and Vertically
Partitioned Tables schemas over the Single Statement Table schema. Moreover, with the
10M triples dataset, the HDFS CSV file format has been shown to be the lowest
performing storage backend, followed by Avro.</p>
      <p>As a natural extension of our benchmarking study, we aim to conduct our
evaluations on cluster deployments with varying numbers of nodes, with more RDF
benchmarks that have different types of queries, and with larger RDF
datasets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Ibrahim</given-names>
            <surname>Abdelaziz</surname>
          </string-name>
          , Razen Harbi, Semih Salihoglu, Panos Kalnis, and
          <string-name>
            <given-names>Nikos</given-names>
            <surname>Mamoulis</surname>
          </string-name>
          .
          <article-title>Spartex: A vertex-centric framework for RDF data analytics</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>8</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1880</fpage>
          –
          <lpage>1883</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>María</given-names>
            <surname>Hallo</surname>
          </string-name>
          , Sergio Lujan-Mora, Alejandro Mate, and Juan Trujillo.
          <article-title>Current state of linked data in digital libraries</article-title>
          .
          <source>Journal of Information Science</source>
          ,
          <volume>42</volume>
          (
          <issue>2</issue>
          ),
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Giannis</given-names>
            <surname>Agathangelos</surname>
          </string-name>
          , Georgia Troullinou, Haridimos Kondylakis, Kostas Stefanidis, and
          <string-name>
            <given-names>Dimitris</given-names>
            <surname>Plexousakis</surname>
          </string-name>
          .
          <article-title>RDF query answering using Apache Spark: Review and assessment</article-title>
          .
          <source>In 34th IEEE International Conference on Data Engineering Workshops, ICDE Workshops 2018</source>
          , Paris, France, April 16-20, 2018, pages
          <fpage>54</fpage>
          -
          <lpage>59</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Marcin</given-names>
            <surname>Wylot</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sherif</given-names>
            <surname>Sakr</surname>
          </string-name>
          .
          <article-title>Framework-based scale-out RDF systems</article-title>
          .
          <source>In Encyclopedia of Big Data Technologies</source>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Marcin</given-names>
            <surname>Wylot</surname>
          </string-name>
          , Manfred Hauswirth, Philippe Cudre-Mauroux, and
          <string-name>
            <given-names>Sherif</given-names>
            <surname>Sakr</surname>
          </string-name>
          .
          <article-title>RDF data storage and query processing schemes: A survey</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          ,
          <volume>51</volume>
          (
          <issue>4</issue>
          ):
          <fpage>84</fpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Adnan</given-names>
            <surname>Akhter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Axel-Cyrille</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Muhammad</given-names>
            <surname>Saleem</surname>
          </string-name>
          .
          <article-title>An empirical evaluation of RDF graph partitioning techniques</article-title>
          .
          <source>In EKAW</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Sherif</given-names>
            <surname>Sakr</surname>
          </string-name>
          .
          <article-title>GraphREL: A Decomposition-Based and Selectivity-Aware Relational Framework for Processing Sub-graph Queries</article-title>
          .
          <source>In DASFAA</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Sherif</given-names>
            <surname>Sakr</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ghazi</given-names>
            <surname>Al-Naymat</surname>
          </string-name>
          .
          <article-title>Relational processing of RDF queries: a survey</article-title>
          .
          <source>ACM SIGMOD Record</source>
          ,
          <volume>38</volume>
          (
          <issue>4</issue>
          ):
          <fpage>23</fpage>
          -
          <lpage>28</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Neumann</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>RDF-3X: a RISC-style engine for RDF</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>647</fpage>
          -
          <lpage>659</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Cathrin</given-names>
            <surname>Weiss</surname>
          </string-name>
          , Panagiotis Karras, and
          <string-name>
            <given-names>Abraham</given-names>
            <surname>Bernstein</surname>
          </string-name>
          .
          <article-title>Hexastore: sextuple indexing for semantic web data management</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Daniel J</given-names>
            <surname>Abadi</surname>
          </string-name>
          , Adam Marcus, Samuel R Madden, and
          <string-name>
            <given-names>Kate</given-names>
            <surname>Hollenbach</surname>
          </string-name>
          .
          <article-title>Scalable semantic web data management using vertical partitioning</article-title>
          .
          <source>In VLDB</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Justin J</given-names>
            <surname>Levandoski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mohamed F</given-names>
            <surname>Mokbel</surname>
          </string-name>
          .
          <article-title>RDF data-centric storage</article-title>
          .
          <source>In 2009 IEEE International Conference on Web Services</source>
          , pages
          <fpage>911</fpage>
          -
          <lpage>918</lpage>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Michael Schmidt, Thomas Hornung, Georg Lausen, and
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Pinkel</surname>
          </string-name>
          .
          <article-title>SP^2Bench: A SPARQL performance benchmark</article-title>
          .
          <source>In Proceedings of the 25th International Conference on Data Engineering, ICDE 2009</source>
          , March 29 2009 - April 2 2009, Shanghai, China, pages
          <fpage>222</fpage>
          -
          <lpage>233</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Matei</given-names>
            <surname>Zaharia</surname>
          </string-name>
          , Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and
          <string-name>
            <given-names>Ion</given-names>
            <surname>Stoica</surname>
          </string-name>
          .
          <article-title>Apache Spark: a unified engine for big data processing</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>59</volume>
          (
          <issue>11</issue>
          ):
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Michael</given-names>
            <surname>Armbrust</surname>
          </string-name>
          , Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan,
          <string-name>
            <given-names>Michael J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          , Ali Ghodsi, and
          <string-name>
            <given-names>Matei</given-names>
            <surname>Zaharia</surname>
          </string-name>
          .
          <article-title>Spark SQL: relational data processing in Spark</article-title>
          .
          <source>In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data</source>
          , Melbourne, Victoria, Australia, May 31 - June 4,
          <year>2015</year>
          , pages
          <fpage>1383</fpage>
          -
          <lpage>1394</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Jesus</given-names>
            <surname>Camacho-Rodríguez</surname>
          </string-name>
          , Ashutosh Chauhan, Alan Gates, Eugene Koifman, Owen O'Malley, Vineet Garg, Zoltan Haindrich, Sergey Shelukhin, Prasanth Jayachandran, Siddharth Seth, Deepak Jaiswal, Slim Bouguerra, Nishant Bangarwa, Sankar Hariappan, Anishek Agarwal, Jason Dere, Daniel Dai, Thejas Nair, Nita Dembla, Gopal Vijayaraghavan, and
          <string-name>
            <given-names>Gunther</given-names>
            <surname>Hagleitner</surname>
          </string-name>
          .
          <article-title>Apache Hive: From MapReduce to enterprise-grade big data warehousing</article-title>
          .
          <source>In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019</source>
          , Amsterdam, The Netherlands, June 30 - July 5, 2019, pages
          <fpage>1773</fpage>
          -
          <lpage>1786</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. Michael Schmidt, Thomas Hornung, Norbert Kuchlin, Georg Lausen, and
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Pinkel</surname>
          </string-name>
          .
          <article-title>An experimental comparison of RDF data management approaches in a SPARQL benchmark scenario</article-title>
          .
          <source>In The Semantic Web - ISWC 2008, 7th International Semantic Web Conference, ISWC 2008</source>
          , Karlsruhe, Germany, October 26-30, 2008. Proceedings, pages
          <fpage>82</fpage>
          -
          <lpage>97</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. Eugene Inseok Chong, Souripriya Das, George Eadon, and
          <string-name>
            <given-names>Jagannathan</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          .
          <article-title>An efficient SQL-based RDF querying scheme</article-title>
          .
          <source>In VLDB</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Muhammad</given-names>
            <surname>Saleem</surname>
          </string-name>
          , Gabor Szarnyas, Felix Conrads, Syed Ahmad Chan Bukhari, Qaiser Mehmood, and
          <string-name>
            <given-names>Axel-Cyrille</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          .
          <article-title>How representative is a SPARQL benchmark? An analysis of RDF triplestore benchmarks</article-title>
          .
          <source>In The World Wide Web Conference</source>
          , pages
          <fpage>1623</fpage>
          -
          <lpage>1633</lpage>
          . ACM,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>Sofia</given-names>
            <surname>Alexaki</surname>
          </string-name>
          , Vassilis Christophides, Gregory Karvounarakis, and
          <string-name>
            <given-names>Dimitris</given-names>
            <surname>Plexousakis</surname>
          </string-name>
          .
          <article-title>On storing voluminous RDF descriptions: The case of web portal catalogs</article-title>
          .
          <source>In Proceedings of the Fourth International Workshop on the Web and Databases, WebDB 2001</source>
          , Santa Barbara, California, USA, May 24-25, 2001, in conjunction with ACM PODS/SIGMOD 2001. Informal proceedings, pages
          <fpage>43</fpage>
          -
          <lpage>48</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>Lefteris</given-names>
            <surname>Sidirourgos</surname>
          </string-name>
          , Romulo Goncalves, Martin L. Kersten, Niels Nes, and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Manegold</surname>
          </string-name>
          .
          <article-title>Column-store support for RDF data management: not all swans are white</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>1</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1553</fpage>
          -
          <lpage>1563</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>Alisdair</given-names>
            <surname>Owens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nick</given-names>
            <surname>Gibbins</surname>
          </string-name>
          , et al.
          <article-title>Effective benchmarking for RDF stores using synthetic data</article-title>
          .
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>