=Paper=
{{Paper
|id=Vol-2840/paper11
|storemode=property
|title=An In-depth Investigation of Large-scale RDF Relational Schema Optimizations Using Spark-SQL
|pdfUrl=https://ceur-ws.org/Vol-2840/paper11.pdf
|volume=Vol-2840
|authors=Mohamed Ragab,Riccardo Tommasini,Feras M. Awaysheh,Juan Carlos Ramos
|dblpUrl=https://dblp.org/rec/conf/dolap/00010AR21
}}
==An In-depth Investigation of Large-scale RDF Relational Schema Optimizations Using Spark-SQL==
Mohamed Ragab, Data Systems Group, University of Tartu, mohamed.ragab@ut.ee
Riccardo Tommasini, Data Systems Group, University of Tartu, riccardo.tommasini@ut.ee
Feras M. Awaysheh, Data Systems Group, University of Tartu, feras.awaysheh@ut.ee
Juan Carlos Ramos, Data Systems Group, University of Tartu, jramos@ut.ee
ABSTRACT

This paper discusses one of the most significant challenges of large-scale RDF data processing over Apache Spark: relational schema optimization. The choice of RDF partitioning techniques and storage formats in SparkSQL significantly impacts query performance. The impact of the relational schemas and the underlying data storage formats is indisputable; they significantly affect query performance. Nevertheless, the trade-offs between different configurations have not been a subject of intensive study in the literature. This paper presents an in-depth investigation that helps practitioners understand such trade-offs and their best practices. It also reports on the pitfalls behind the implementation of SPARQL optimizations over SparkSQL. Our experiments provide insights into these schemas' relative strengths by comparing three different partitioning techniques and four different storage formats. Our results draw a better understanding of the current State-Of-The-Art (S.O.T.A) and pave the way for a wide range of best practices for systematically tuning the performance of distributed systems that handle vast RDF data.

© Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

Currently, we are witnessing an enormous amount of widely available RDF datasets [19]. Centralized RDF engines, e.g., RDF-3X [13] and gStore [26], provide native ways for processing and querying RDF datasets with the full expressive capabilities of SPARQL. Yet, they cannot handle large-scale RDF datasets effectively [2, 9]. The need for processing large RDF datasets calls for innovative solutions to store, analyze, and query these massive RDF datasets [2]. This call leads the community to leverage Big Data (BD) processing frameworks like Apache Spark [25] to process large RDF datasets [3].

BD platforms excel in the analytical processing of relational data. The literature includes several attempts that leverage such capabilities to analyze RDF data [2, 17]. In practice, utilizing BD engines for RDF relational processing requires storing RDF data using a relational schema and translating SPARQL queries into equivalent SQL ones. On the same note, BD platforms are designed to scale horizontally [7]. However, the choice of the right schema can significantly impact the performance of query processing [18]. Moreover, the choice of partitioning technique also yields varying query runtime performance [4]. In this regard, and from a BD perspective, we cannot ignore the variety of data formats [11]. Given the complexity of the solution space, i.e., relational schema, partitioning technique, and storage format, current works focus on one dimension at a time. However, a comprehensive analysis of the trade-offs among these dimensions is of paramount importance [3] yet is still missing.

In this paper, we try to fill this research gap by experimentally evaluating SPARQL on top of SparkSQL. In particular, our analysis focuses on existing RDF relational schemas and their state-of-the-art improvements. To this end, we present a systematic and comparative evaluation of the query performance considering (i) three RDF partitioning techniques (the most suitable for the relational nature of data in Spark-SQL), i.e., Horizontal, Subject-based, and Predicate-based partitioning, and (ii) four different well-established storage formats, i.e., ORC, CSV, Parquet, and Avro [15, 16]. In this way, our work differs from previous ones [21, 22] that only focus on the complexity of the workloads and the size of the data.

The contribution of this paper is threefold. (i) First, it uses SparkSQL to validate the performance of RDF schema advancements (i.e., ExtVP and WPT) compared to their baseline counterparts (i.e., PT and VP). (ii) Second, it empirically analyzes the effect of partitioning techniques on the ExtVP and WPT schema runtime performance. (iii) Third, it tests the effects of multiple distributed storage row- and columnar-oriented file formats on HDFS. Finally, it outlines the best practices and recommendations that help in achieving the best RDF query performance. Overall, the paper's findings guide the realization of next-generation large-scale RDF solutions over Apache Spark by optimizing the relational schemas.

The remainder of the paper is organized as follows: Section 2 presents an overview of the required background information and key concepts necessary to understand our study. Section 3 discusses the experimental methodology. Section 4 presents the benchmarking scenario and the experimental setup. Section 5 presents the paper's results, while we provide a comprehensive discussion in Section 6. Section 7 presents the related work, positioning this paper in the context of other surveys on RDF processing using BD frameworks. Finally, Section 8 concludes the paper and presents future works.

2 BACKGROUND

In this section, we present the information that is necessary to understand the content of this paper. We assume that the reader is familiar with the RDF data model and the SPARQL query language.

2.1 Apache Spark & SparkSQL

Apache Spark is currently the de-facto BD engine [25]. It is one of the most active and widely-used large-scale data processing systems in both industry and academia [5]. It mainly adopts in-memory distributed computing for large-scale data analytics.
SparkSQL is a relational package built on top of Apache Spark [5] with support for the SQL interface, while providing capabilities for structured and semi-structured data.

2.2 RDF Relational Schema

The most intuitive approach for representing RDF in a relational structure is the Single Statement Table Schema (ST), which stores the RDF dataset in a single triples table of three columns that represent the components of an RDF triple, i.e., Subject, Predicate, and Object. This solution is the simplest, and it is commonly adopted by several existing open-source RDF triplestores, e.g., Apache Jena, RDF4J, and Virtuoso. However, it inevitably increases the number of self-joins required to evaluate long-chain SPARQL queries when they run on top of relational SQL systems.

Vertically Partitioned Tables Schema (VP) is an RDF storage schema proposed to mitigate the performance issues of the ST schema. It aims to speed up the queries over RDF triple stores [1]. This schema is simple to design; the RDF triples table is decomposed into a two-column table (Subject, Object) for each unique property in the RDF dataset.

Extended Vertical Partitioning schema (ExtVP) is a query-driven optimization that aims at minimizing the input size of the data during query evaluation [22], inspired by semi-join reductions. In particular, ExtVP minimizes data skewness and eliminates dangling triples (i.e., triples that do not have a joining partner or do not contribute to any join in the SPARQL query) from the input tables. ExtVP speeds up query answering by pre-computing the possible join relations between the VP tables and materializing the results of these semi-joins as tables in the storage backend, e.g., HDFS. In particular, for every two VP relations, ExtVP relies on pre-computing the semi-join reductions of the Subject-Subject (SS), Subject-Object (SO), and Object-Subject (OS) join patterns. The output tables are reduced in size and are used in joins instead of the original VP tables. However, one of the limitations of the ExtVP schema is the additional storage overhead of the materialized ExtVP tables in comparison to the VP schema tables (cf. Table 1).

Property (n-ary) Tables Schema (PT) is a storage schema proposed to cluster multiple RDF properties as n-ary table columns for the same subject, grouping entities that are similar in structure. The biggest advantage of property tables compared to a single triples table schema (ST) is that they can reduce the number of subject-subject self-joins that result from star-shaped patterns in a SPARQL query. One of the limitations of the PT schema, however, is that while it works quite well with highly structured RDF data, its performance degrades for poorly structured data [23]. Furthermore, typical RDF comes with diverse structures, which makes it virtually impossible to define an optimal layout for this schema [22]. Moreover, a poorly-selected property table layout can significantly slow down query performance [2]. Due to its sparse-table representation, the PT schema also suffers from high storage overheads when a large number of predicates is present in the RDF data model [1].

Wide Property Table Schema (WPT) represents the whole RDF dataset in a single unified table [21]. Such a table uses all RDF properties in the dataset as columns. It aims at extending the PT schema to optimize star-shaped SPARQL queries, which are highly common in SPARQL query workloads. Therefore, star-shaped SPARQL queries require no joins to be answered. Moreover, this schema does not require any kind of clustering algorithm that is likely to produce sub-optimal schemas for an arbitrary RDF dataset. Unfortunately, WPT does not overcome all the limitations of the PT schema. Indeed, this representation can also be very sparse for poorly structured data, and it may face a large storage overhead, especially with many multi-valued properties existing in the RDF dataset.

2.3 RDF Data Partitioning

For RDF data processing, many partitioning techniques exist [2, 4]. In the following, we present the partitioning techniques that are suitable for our experiments on SparkSQL.

Horizontal-Based Partitioning (HP) divides the RDF dataset evenly (as much as possible) over the number of machines in the cluster. In particular, we use this technique to partition the relational RDF tables of the different schemas horizontally into 𝑛 even chunks (i.e., partitions) over the cluster machines.

Subject-Based Partitioning (SBP) distributes triples into partitions according to the hash value computed for the RDF subjects. As a result, all the triples that have the same subject reside on the same partition. In our scenario, we applied Spark partitioning using the subject as the partitioning key for our different relational schema tables (i.e., DataFrames).

Predicate-Based Partitioning (PBP) is similar to SBP; it distributes triples to the various partitions based on the hash value computed for the predicate. Similarly, all the triples that have the same predicate reside on the same partition. We also applied Spark partitioning using the predicate as the partitioning key for our different relational schema DataFrames.

Baseline partitioning (BP): In our experiments, we also used the baseline partitioning technique, which simply relies on the native default partitioning of the table files over the cluster nodes by HDFS. This is the technique used in the state-of-the-art works on the schema advancements [21, 22].

3 EVALUATION METHODOLOGY

In this section, we discuss the experimental methodology that we used for assessing the reproducibility of the state-of-the-art findings [6, 21, 22]. As this implies some changes in the experimental artifacts, we organize our experiments as follows.

First, we assess whether we can reproduce the state-of-the-art results of those schema optimizations over the baseline relational schemas' performance. Thus, we performed our experiments in a setup as similar as possible to what the original authors have done [21, 22]. In this regard, we use the baseline HDFS partitioning technique. We also use Parquet as our baseline storage file format (grey shaded boxes, cf. Figure 1).

Second, we introduce disturbing factors to our experiments, such as the different partitioning techniques and different file formats, alongside different SPARQL query shapes.

Regarding the data partitioning, we introduce the Horizontal Partitioning technique and Subject-based partitioning for the WPT and PT schema experiments. On the other hand, Horizontal, Subject-, and Predicate-based partitioning techniques were used for the VP and ExtVP schema experiments. We expect that these partitioning techniques will negatively impact the performance of SparkSQL when evaluating SPARQL queries, due to the distribution of the relational tables across nodes. This will force more shuffling in the presence of joins.
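The hash-based placement behind SBP and PBP (Section 2.3) can be sketched in plain Python. This is an illustrative toy only; in our experiments this placement is delegated to Spark's DataFrame partitioners, and the triples and partition count below are invented:

```python
# Toy sketch of Subject-Based (SBP) and Predicate-Based (PBP) partitioning:
# each triple (subject, predicate, object) is assigned to a partition by
# hashing one of its components, so equal keys land on the same partition.

def hash_partition(triples, key_index, n_partitions):
    """Place triples into n_partitions buckets by hashing component
    key_index (0 = subject for SBP, 1 = predicate for PBP)."""
    partitions = [[] for _ in range(n_partitions)]
    for triple in triples:
        partitions[hash(triple[key_index]) % n_partitions].append(triple)
    return partitions

triples = [
    ("s1", "p1", "o1"),
    ("s1", "p2", "o2"),  # same subject as above: co-located under SBP
    ("s2", "p1", "o3"),  # same predicate as the first: co-located under PBP
]

sbp = hash_partition(triples, 0, 4)
pbp = hash_partition(triples, 1, 4)
```

Under SBP, subject-subject joins need no shuffle because the joining rows are already co-located; HP, by contrast, ignores the triple content entirely, which is why we expect it to shuffle the most.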
Format  | RDF (n3) | PT                          | WPT                      | VP    | ExtVP
CSV     | 11GB     | ∼9.2MB-1.9GB (Total: 6.8GB) | 8KB-1.9GB (Total: 8.3GB) | 9.4GB | SS (39GB), OS (4.9GB), SO (806MB), Total: ∼45GB
Avro    | 11GB     | 980KB-416MB (Total: 1.6GB)  | 8KB-272MB (Total: 1.7GB) | 1.8GB | SS (8.8GB), OS (359MB), SO (331MB), Total: ∼9.5GB
ORC     | 11GB     | 620KB-362MB (Total: 1.4GB)  | 8KB-249MB (Total: 1.5GB) | 1.4GB | SS (7.8GB), OS (243MB), SO (301MB), Total: ∼8.4GB
Parquet | 11GB     | 620KB-382MB (Total: 1.5GB)  | 8KB-264MB (Total: 1.6GB) | 1.7GB | SS (8.4GB), OS (319MB), SO (318MB), Total: ∼9GB

Table 1: SP2Bench-100M RDF relational schemata table data sizes with different file formats.
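The ExtVP materialization behind the SS/OS/SO table sizes in Table 1 can be sketched in plain Python. This is a toy illustration of the semi-join reductions described in Section 2.2, not the SparkSQL implementation; the VP tables below are invented:

```python
# Toy VP schema: one (Subject, Object) table per predicate.
vp = {
    "follows": [("a", "b"), ("b", "c"), ("d", "e")],
    "likes":   [("b", "x"), ("c", "y")],
}

def semi_join(left, right, left_col, right_col):
    """Keep rows of `left` whose component left_col (0=Subject, 1=Object)
    appears as component right_col of some row in `right`."""
    keys = {row[right_col] for row in right}
    return [row for row in left if row[left_col] in keys]

# ExtVP pre-computes, for each pair of VP tables, the reduced tables for
# Subject-Subject (SS), Object-Subject (OS), and Subject-Object (SO) joins.
ss = semi_join(vp["follows"], vp["likes"], 0, 0)   # SS reduction of follows
os_ = semi_join(vp["follows"], vp["likes"], 1, 0)  # OS reduction of follows
```

A query joining the two predicates then scans the reduced tables instead of the full VP tables, which is the dangling-triple elimination of Section 2.2; the price is the extra materialized storage visible in the ExtVP column of Table 1.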
In particular, Horizontal partitioning should have a worse impact than Subject-based partitioning on the PT and WPT schemas, and than Predicate-based partitioning on the (Ext)VP ones. It is worth mentioning that the HP technique does not take the query shape into account and may therefore place related rows on different nodes.

Regarding the storage file formats, besides the baseline Parquet we consider an additional columnar one, i.e., ORC, and two row-oriented ones, i.e., CSV and Avro. We expect columnar formats to perform better for queries with a subset of column projections, since they allow an efficient scan of tables by reading only a portion of the columns [10]. In practice, SP2Bench has a small number of column projections across all its benchmark queries.

Finally, aiming to draft our observations and primary findings and to propose best practices, we discuss and analyze our results. Additionally, we highlight the trade-offs of combining all these dimensions in the discussion section.

Moreover, we aim to observe these optimizations' impact on large-scale SPARQL query performance on the SparkSQL engine. Mostly, we want to verify and answer the following questions:

(1) How far do RDF partitioning techniques and storage formats impact the query performance?
(2) How can we systematically analyze different relational schemas? How can these schemas be effectively improved to achieve the highest performance?
(3) What are the best practices that guide the large RDF community's efforts in adopting performance-oriented solutions?

[Figure 1: Experiments architecture and evaluation environment: relational schemata (Wide Property Tables, Property Tables, Ext. Vertical Tables, Vertical Tables), partitioning techniques (Baseline HDFS, Horizontal-, Subject-, and Predicate-based), and storage formats (Parquet, ORC, Avro, CSV).]

4 BENCHMARK & EXPERIMENTAL SETUP

This section outlines the paper's experimental setup and the used benchmark with its queries. The experimental setup (presented in Figure 1) summarizes the configuration combinations (relational schema, partitioning, storage). In particular, we have performed our experiments for 4 different relational schemas, partitioning each schema across 4 different partitioning techniques, i.e., one baseline HDFS technique and 3 other RDF-specific techniques. Last but not least, those schemas are stored across 4 different storage formats. In detail:

Benchmark & Dataset: In our evaluation, we used SP2Bench (SPARQL Performance Benchmark) [24]. SP2Bench has a reasonably low score of data structuredness, making it closer to the structure of real-world RDF datasets [20]. So, it is valid to state that, to the best of our understanding, SP2Bench covers a wide spectrum of queries and answers well the main claims we are investigating.

Data Storage: We generated a synthetic RDF dataset of 100𝑀 triples in Notation3 format. This scale is enough for checking the validity of the literature findings regarding the RDF relational schema optimizations, and for maintaining their reproducibility in a more complex solution space. The generated n3 RDF dataset is converted into CSV relational schemas using Jena TDB¹, a disk-based access repository for storing RDF datasets. We further used Jena ARQ² for querying these TDB datasets and generating the output schema tables in the CSV file format. Finally, these raw textual CSV documents are loaded into HDFS. Moreover, we have used the Spark framework to write the relational schema data tables from the CSV format into the other HDFS file formats (Avro, Parquet, and ORC). Table 1 shows the size of the generated native RDF dataset (i.e., 11GB), as well as the storage sizes of each relational schema in the mentioned file formats on top of HDFS. It clearly shows how the different relational schemas affect the input data sizes.

Query | #Joins | #Filters | #Projections | Shape
Q1    |   3    |    0     |      1       | S
Q2    |   8    |    0     |     10       | S
Q3    |   1    |    1     |      1       | S
Q4    |   7    |    1     |      2       | SF
Q5    |   5    |    1     |      2       | SF
Q6    |   8    |    3     |      2       | SF
Q7    |  12    |    2     |      1       | SF
Q8    |  10    |    2     |      1       | SF
Q9    |   3    |    0     |      1       | S (U)
Q10   |   0    |    0     |      2       | TP (U)
Q11   |   0    |    0     |      1       | TP

Table 2: Benchmark queries characteristics: shape, i.e., [S]tar, [S]now[F]lake, or a single [T]riple[P]attern; (U) for an unbounded predicate variable; number of joins, filters, and projections.

¹ https://github.com/apache/jena/tree/master/jena-tdb
² https://github.com/apache/jena/tree/master/jena-arq
In practice, the PT schema has the smallest total table sizes, followed by the VP schema and then the WPT schema, whereas the largest storage overheads come with the ExtVP schema. We can also notice how significantly the storage formats affect the sizes of the schemas. In particular, columnar-oriented formats have the minimum table sizes across all the schemas. Indeed, ORC is shown to have the minimum table sizes, followed by Parquet. Meanwhile, the row-oriented Avro format has considerably larger schema sizes, and CSV has the largest table sizes.

Queries: SP2Bench queries have different complexities and a high diversity of features [20]. These queries implement meaningful requests on top of RDF data. In our experiments, we reused the SQL version of the queries associated with the SP2Bench benchmark³ for the mentioned RDF relational schemas. However, for the newer relational schema advancements (e.g., ExtVP, WPT) that are missing from the benchmark website, we have manually translated these queries into SQL, and we provide all these translated queries in our project repository⁴. We have evaluated all of these 11 queries of type SELECT, except 𝑄7 and 𝑄9, which are not applicable ('NA') for the PT and the WPT relational schemas (cf. Table 3). 𝑄7 is also not applicable in the VP and ExtVP schemas. Notably, for generating the ExtVP tables, the default selectivity threshold of 1 has been configured [22]. Table 2 shows our benchmark queries' complexities in terms of the number of joins, filters, and projections, alongside the SPARQL query shape.

Environment Setup: Our experiments were executed on a bare-metal cluster of 4 machines with CentOS-Linux V7 OS, with 32 processor cores and 128 GB of memory per node, alongside a high-speed 2 TB SSD drive for each node. We used Spark V2.4 to fully support the SparkSQL capabilities. In particular, our Spark cluster consists of one master node and 3 worker machines, while Yarn is used as the resource manager; in total it uses 330 GB of memory and 84 virtual processing cores.

RDF Data Partitioning: We used Spark partitioners for partitioning the registered relational schema tables/Spark DataFrames. This is required to persist those DataFrames on top of the HDFS default file-block partitioning level. We use the resulting DataFrames as the input for the query engine. In our experiments, we have the baseline HDFS partitioning (grey partitioning box, cf. Figure 1), while the other RDF partitioning techniques have also been tested, namely the HP, SBP, and PBP approaches. These techniques depend on partitioning the tables' data horizontally across machines (i.e., HP), or on Spark key partitioning on the RDF subject or predicate (i.e., SBP and PBP, respectively).

Performance Evaluation Measure (Latency): We used the Spark.time function, passing the spark.sql(...) query execution function as a parameter, to measure the query latency. We ran the experiments for all queries 5 times (excluding the first cold-start run time, to avoid warm-up bias, and computed the average of the other 4 run times).

³ http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B/queries.php
⁴ https://datasystemsgrouput.github.io/SPARKSQLRDFBenchmarking/

5 EXPERIMENT RESULTS

In this section, we discuss our experiment results. We compare the optimized relational schemas (i.e., WPT and ExtVP) against their baseline schemas, i.e., PT and VP, respectively, according to our methodology (cf. Section 3).

5.1 WPT VS. PT Schema Results

Table 3 shows the SP2Bench queries' number of joins when translated into SQL for the PT and WPT schemas. Except for 𝑄8 (which requires many self-joins of the WPT table), the number of joins always decreases when adopting the WPT schema. Hence, we expect that the WPT schema's query performance (i.e., in terms of latency) will outperform the other relational schemas [6]. In this regard, the Parquet data format efficiently handles the sparsity caused by the WPT table schema, as Null values are efficiently ignored in this file format [21].

     Q1  Q2  Q3  Q4  Q5  Q6  Q8  Q10  Q11
PT    2   9   2   8   7   6   9    5    2
WPT   0   0   0   3   3   3  10    3    0

Table 3: SP2Bench queries: number of joins for PT vs. WPT.

Meanwhile, Table 4 shows the overall benchmark results of the WPT performance over the PT schema across all file formats (horizontally in the table) and across the different partitioning techniques (vertically). Values in this table specify the number of queries in which the WPT schema performs better than the baseline PT schema. In the colored version of this table, green indicates that WPT performs the best, yellow indicates that its performance is above 50% over PT, and red means that its performance is below 50%.

WPT vs. PT | Avro | CSV | ORC | Parquet
Baseline   | 2/9  | 2/9 | 8/9 | 9/9
Horizontal | 2/9  | 3/9 | 6/9 | 6/9
Subject    | 2/9  | 2/9 | 6/9 | 6/9

Table 4: Number of queries for which WPT beats PT, across data formats and partitioning techniques.

Our experiment results confirm that the WPT schema performs better than the baseline PT schema in all the queries (i.e., 9 out of 9 queries in the benchmark) with the Parquet file format alongside the baseline HDFS partitioning technique. Indeed, these results confirm the findings in [6, 21], assessing the reproducibility of the WPT schema optimization.

To investigate how the performance difference between the WPT and PT schemas changes, we introduce two new dimensions, i.e., various file formats and different partitioning techniques. In this regard, Table 5 shows the effect of data partitioning (left of the table) and storage formats (right of the table) considering the other new factors across all the experiments. To this extent, we calculated the percentages as follows: for the partitioning factor's impact, we pivoted on each partitioning technique and counted the percentage of queries for which the WPT schema's performance in SparkSQL is better than the PT schema's, while considering all the changes of the storage file formats (moving across them). We calculated the storage effect similarly, but pivoting on the storage file format and moving across the partitioning techniques in all the queries.

Table 5 also demonstrates that in such a complex space of different relational schemas, data partitioning, and storage file formats, schema-based query optimization is not straightforward. As we can see, WPT outperforms the PT schema in only 58% of the queries when pivoting on the baseline default HDFS partitioning technique across the storage formats, and in only 78% of the queries when pivoting on the Parquet file format.
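The pivoting just described can be made concrete with the Baseline row of Table 4 (a sketch; the helper name is ours):

```python
# Number of queries (out of 9) for which WPT beat PT per storage format,
# under the baseline HDFS partitioning (values taken from Table 4).
baseline_wins = {"Avro": 2, "CSV": 2, "ORC": 8, "Parquet": 9}
N_QUERIES = 9

def pivot_effect(wins_per_cell, n_queries):
    """Share of (query, cell) combinations won while moving across cells,
    e.g., across storage formats for one fixed partitioning technique."""
    total_cells = len(wins_per_cell) * n_queries
    return 100.0 * sum(wins_per_cell.values()) / total_cells

baseline_effect = pivot_effect(baseline_wins, N_QUERIES)  # ~58.33, cf. Table 5
```

The same helper, applied column-wise instead of row-wise, yields the storage-effect percentages on the right of Table 5.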
The determination of this result shows the trade-off of considering alternative storage file formats and partitioning techniques alongside the experiments' query evaluation.

WPT/PT           Partitioning effect   |   WPT/PT    Storage effect
Baseline_Part    58.33%                |   Parquet   77.78%
Horizontal       47.22%                |   ORC       74.07%
Subject-based    44.44%                |   CSV       25.93%
Predicate-based  NA                    |   AVRO      22.22%

Table 5: The effect of other partitioning techniques and other storage formats on the reproducibility of the WPT S.O.T.A findings.

Regarding the storage, we can see that ORC, another columnar file format, gives performance close to that of our baseline columnar Parquet file format, with 74%. However, the baseline Parquet is still better, since Parquet, unlike ORC, can efficiently handle the WPT table's sparsity. In contrast, we can see that row-oriented formats have a significant negative effect on the performance of WPT: the WPT schema performs better than PT in only 22% and 25% of the Avro and CSV cases, respectively. In practice, the SP2Bench queries include only one query (i.e., 𝑄2) with more than 2 column projections. This justifies why column-oriented formats give better results for WPT than the row-based ones. In general, we can state that the file formats affect the generalization of the state-of-the-art results for the WPT schema.

At last, we examine three specific queries, namely 𝑄2, 𝑄4, and 𝑄8, which well exemplify our findings. There is a tremendous performance enhancement of WPT over PT in 𝑄2 and 𝑄4. The reason is that the number of SparkSQL joins for WPT is significantly smaller than the number of joins in the PT schema (cf. Table 3). In particular, in 𝑄2 the number of joins in PT (SQL version) is 9, compared to no joins in the WPT schema, while in 𝑄4 the PT schema has 8 SQL joins in comparison to 3 self-joins of the WPT table. Interestingly, in 𝑄8 we have more joins in WPT than in the baseline PT schema, i.e., 10 self-joins and 8 joins, respectively. Figures 2(a), 2(b), and Figure 3 depict the performance of SparkSQL for 𝑄2, 𝑄4, and 𝑄8, respectively, under various combinations of file formats and partitioning techniques. In particular, these figures show the ratios of WPT runtimes over PT runtimes for those queries. Ratios less than 1 indicate better performance of WPT over PT in that query and across the different configuration settings.

[Figure 2: The performance of WPT over PT schema in 𝑄2 and 𝑄4 (values below 1 mean WPT is better than PT). Panels (a) Q2 and (b) Q4; per file format (Avro, CSV, ORC, Parquet), bars for Baseline, Horizontal, and Subject partitioning, plus their average.]

[Figure 3: The performance of WPT over PT schema in 𝑄8 (values below 1 mean WPT is better than PT).]

Not surprisingly, we can notice that 𝑄8 is the only query that witnesses worse performance for WPT compared to the PT schema. Figure 3 shows that most of the ratios of WPT over PT are greater than 1 in the baseline-partitioned data experiments (i.e., only partitioned with HDFS) and with file formats other than Parquet. Notably, all the results (i.e., total query runtimes) and query histograms can be found in our aforementioned GitHub repository.

5.2 ExtVP VS. VP Schema Results

According to [22], ExtVP outperforms or at least has similar performance to the VP schema. The reason is that the queries are similar, and the number of SQL joins in the VP and ExtVP schemas is the same. This is reflected in Table 6. Indeed, the performance improvement depends mainly on the percentage of reduction in the input table sizes that the ExtVP optimization introduces from the join correlations of each query [22]. Table 6 also presents the percentage of ExtVP reductions of the processed tables' rows for each query over the original input rows processed with the baseline VP tables. The semi-join reductions provided by ExtVP help speed up the performance of SparkSQL by reducing the size of the shuffled data.

      VP           | ExtVP        | Input tables data Size Red.
Q1    2            | 2            | 58%
Q2    9            | 9            | 77%
Q3    1            | 1            | 59%
Q4    7            | 7            | 96%
Q5    5            | 5            | 60%
Q6    9            | 9            | 31%
Q8    9 & 1 Union  | 9 & 1 Union  | 5%
Q9    2 & 1 Union  | 2 & 1 Union  | 0%
Q10   1 Union      | 1 Union      | 0%
Q11   0            | 0            | 0%

Table 6: Number of joins and percentage of input table size [Red]uctions after optimization, ExtVP vs. VP.
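The reduction percentages in Table 6 follow a simple definition, sketched here with hypothetical row counts (not the measured SP2Bench numbers):

```python
def input_size_reduction(vp_rows, extvp_rows):
    """Percentage of input rows saved by reading the ExtVP tables
    for a query instead of the original VP tables."""
    return 100.0 * (vp_rows - extvp_rows) / vp_rows

# Hypothetical example: a query that would scan 1,000 VP rows but only
# 40 rows after the ExtVP semi-join reductions.
saved = input_size_reduction(1000, 40)
```

A 0% entry (e.g., 𝑄9-𝑄11 in Table 6) means the semi-joins removed nothing, so ExtVP can at best match VP for that query.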
[Figure 4: The performance of ExtVP over VP schema in 𝑄4 and 𝑄9 (values below 1 indicate that ExtVP is better than VP). Panels (a) Q4 and (b) Q9; per file format (Avro, CSV, ORC, Parquet), bars for Baseline, Horizontal, Subject, and Predicate partitioning, plus their average.]

In more detail, ExtVP optimizes specific queries according to the correlations between the triple patterns in those queries [22], namely Subject-to-Subject (SS), Object-to-Subject (OS), and Subject-to-Object (SO) correlations [22]. Thus, we expect some queries to give results similar to the VP schema queries (i.e., when no reductions of the VP tables are produced by the ExtVP schema optimization). Notably, in our experiments, 𝑄9, 𝑄10, and 𝑄11 do not present any input data reductions. Thus, we expect their performance to be very close to the baseline VP performance.

The same approach that was adopted for the WPT-to-PT schema performance comparison is also used for evaluating the performance of ExtVP against VP.

First, we check whether our experiments' results confirm the state of the art regarding the ExtVP schema optimization over the baseline VP schema performance. Table 7 shows the total number of queries in which the ExtVP performance is better than the VP schema performance across all the benchmark queries. For our baseline HDFS partitioning technique with the Parquet file format, we can see that some queries do not benefit from the optimizations of ExtVP. Indeed, 3 out of 10 queries fail to utilize the optimized ExtVP technique. The reason behind such behavior is that those queries have unbounded predicates that cannot be optimized by the ExtVP schema [22] (see 𝑄9 and 𝑄10 in Table 2), or they have no effective join reductions (see 𝑄9, 𝑄10, and 𝑄11 in Table 6). The performance of these queries is discussed in detail in the next sections.

ExtVP vs. VP     | Avro | CSV  | ORC  | Parquet
Baseline_Part    | 6/10 | 6/10 | 5/10 | 7/10
Horizontal_Part  | 3/10 | 3/10 | 3/10 | 3/10
Predicate_Part   | 2/10 | 3/10 | 6/10 | 6/10
Subject_Part     | 2/10 | 3/10 | 3/10 | 3/10

Table 7: Comparison of the ExtVP schema with the VP schema in different storage formats and different partitioning techniques.

Second, similarly to what we have done for the WPT schema optimization, we now investigate how generalizable the state-of-the-art results are when we introduce different file formats and partitioning techniques over the data for both the ExtVP and VP schemas. Similarly, Table 8 shows how far the data partitioning (left of the table) and data formats (right of the table) impact the results of ExtVP in comparison to the VP schema performance. Notably, this table's percentage values are calculated similarly to how we calculated the WPT against the PT: we pivoted on the analysis dimension of choice, i.e., file format 𝑋 or partitioning technique 𝑌, and calculated how many times SparkSQL performs better using ExtVP than using the baseline VP approach.

ExtVP/VP         Partitioning effect   |   ExtVP/VP   Storage effect
Baseline_Part    67.5%                 |   Parquet    55%
Horizontal       35%                   |   ORC        45%
Predicate-based  55%                   |   AVRO       42.5%
Subject-based    30%                   |   CSV        42.5%

Table 8: The effect of other partitioning techniques and other storage formats on the reproducibility of the ExtVP S.O.T.A findings.

Regarding the partitioning techniques' effect on ExtVP, our expectations are confirmed. In particular, we can observe that Predicate-based partitioning slightly reduces this negative effect (i.e., 55% of the queries show a performance improvement). From Table 8, we can also see that the ExtVP schema outperforms the VP schema in only 67% of the queries using the baseline HDFS partitioning scenario. Thus, we can see the trade-off of considering various storage file formats. We can also see that the baseline Parquet file format is the one with the least impact on the overall performance of ExtVP. Indeed, in 55% of the cases where Parquet is used, ExtVP outperforms the VP performance. Additionally, the ORC columnar file format provides high performance of ExtVP over the VP schema, with an overall 45%. However, there is a clear 10% difference from the Parquet file format.

On the other hand, the row-oriented formats degrade the performance of ExtVP. In only 42.5% of the experiments that adopt either Avro or CSV does the ExtVP performance beat the performance of the VP schema. Such behavior is related to the number of column projections in the SP2Bench queries, which is minimal in this benchmark scenario. Thus, columnar file formats fit such query workloads better than the row-oriented ones.

Last but not least, we introduce here the most notable query examples, confirming our previous findings in more detail. First, 𝑄4 is revealed to be the query that benefits most from the ExtVP optimization. The reason be-
the partitioning techniques degraded the performance of ExtVP hind this is that 𝑄4 includes a high number of joins (i.e., 7 joins),
significantly. Only, 35%, and 30% of the experiments adopting and has the maximum number of input tables’ rows reductions
Horizontal, and Subject-based partitioning respectively show a while using the ExtVP schema optimization with 96% of reduced
performance improvement in using ExtVP over VP. Adopting processed rows (cf. Table 6). This query is directly followed by
𝑄2 with 77%. Although 𝑄2 has a higher number of table joins
WPTH PTH WPTS PTS WPT PT
200 200 200
150 150 150
Time (seconds)
Time (seconds)
Time (seconds)
100 100 100
50 50 50
0 0 0
Q1 Q2 Q3 Q4 Q5 Q6 Q8 Q10 Q11 Q1 Q2 Q3 Q4 Q5 Q6 Q8 Q10 Q11 Q1 Q2 Q3 Q4 Q5 Q6 Q8 Q10 Q11
(a) CSV - Horizontal Partitioning (b) CSV - Subject-based Partitioning (c) Avro - Horizontal Partitioning
WPT PT WPT PT WPT PT
200 200 200
150 150 150
Time (seconds)
Time (seconds)
Time (seconds)
100 100 100
50 50 50
0 0 0
Q1 Q2 Q3 Q4 Q5 Q6 Q8 Q10 Q11 Q1 Q2 Q3 Q4 Q5 Q6 Q8 Q10 Q11 Q1 Q2 Q3 Q4 Q5 Q6 Q8 Q10 Q11
(d) Avro - Subject-based Partitioning (e) ORC - Horizontal Partitioning (f) ORC - Subject-based Partitioning
WPT PT WPT PT
200 200
150 150
Time (seconds)
Time (seconds)
100 100
50 50
0 0
Q1 Q2 Q3 Q4 Q5 Q6 Q8 Q10 Q11 Q1 Q2 Q3 Q4 Q5 Q6 Q8 Q10 Q11
(g) Parquet - Horizontal Partitioning (h) Parquet - Subject-based Partitioning
Figure 5: WPT Vs. PT schemata performance using different partitioning techniques and file formats
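The query-win counting behind Tables 7 and 8 can be sketched as follows. The data, table, and column names here are illustrative, not the benchmark's actual measurements:

```python
import pandas as pd

# Hypothetical per-query runtimes; illustrative values only, not the
# paper's measurements. One row per (query, partitioning, format, schema).
results = pd.DataFrame(
    [
        ("Q1", "horizontal", "parquet", "ExtVP", 42.0),
        ("Q1", "horizontal", "parquet", "VP", 55.0),
        ("Q2", "horizontal", "parquet", "ExtVP", 61.0),
        ("Q2", "horizontal", "parquet", "VP", 58.0),
    ],
    columns=["query", "partitioning", "fmt", "schema", "runtime_s"],
)

# Put the ExtVP and VP runtimes of each configuration side by side.
wide = results.pivot_table(
    index=["query", "partitioning", "fmt"], columns="schema", values="runtime_s"
).reset_index()
wide["extvp_wins"] = wide["ExtVP"] < wide["VP"]

# Win rate of ExtVP over VP, pivoted on one analysis dimension at a time
# (the counting idea behind the percentage columns of Table 8).
by_partitioning = wide.groupby("partitioning")["extvp_wins"].mean() * 100
by_format = wide.groupby("fmt")["extvp_wins"].mean() * 100
```

Running the same aggregation over the full result set, once per partitioning technique and once per file format, yields percentage columns of the kind reported in Table 8.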
Although 𝑄2 has a higher number of table joins than 𝑄4, the reductions in input table sizes are more significant in 𝑄4. On the other side, 𝑄9, 𝑄10, and 𝑄11 do not benefit from the ExtVP optimization, i.e., ExtVP does not provide any input table size reductions. In particular, 𝑄9 and 𝑄10 have unbounded predicate variables in the original SPARQL queries, which ExtVP cannot directly handle [22], while 𝑄11 has only a single triple pattern and thus no joins for the ExtVP optimization approach to optimize. Figures 4 (a) and (b) show the performance of SparkSQL for 𝑄4 and 𝑄9, respectively, under various combinations of formats and partitioning techniques in the ExtVP experiments. Figure 4 (a) shows that 𝑄4 is always below the line of all the other queries' average runtimes, whereas ExtVP does not show a remarkable difference over the VP schema for 𝑄9, i.e., the two schemas show very close performance.

In the next section, we discuss the experimental findings in further detail against the current state of the art regarding the superiority of ExtVP and PT.

Figure 6: ExtVP Vs. VP schemata performance using different partitioning techniques and file formats. Panels: (a) Parquet-Horizontal, (b) Parquet-Subject-based, (c) Parquet-Predicate-based, (d) ORC-Horizontal, (e) ORC-Subject-based, (f) ORC-Predicate-based, (g) Avro-Horizontal, (h) Avro-Subject-based, (i) Avro-Predicate-based, (j) CSV-Horizontal, (k) CSV-Subject-based, (l) CSV-Predicate-based partitioning; each panel plots query runtimes (in seconds) for Q1-Q11.

6 DISCUSSION
This paper helps to characterize and classify the RDF schemas and their optimizations within the SparkSQL realm. It helps data architects and practitioners interested in large-scale RDF to better understand the relational RDF schemas' potential using different partitioning techniques and storage formats. This understanding leads to a better selection of the most suitable and performance-optimized solution for their use case. It will also accommodate better design and development of new SPARQL systems, leading to reliable RDF services with high Spark performance. Taking our experimental findings into consideration, herein we discuss our results and give some insights on best practices for processing RDF at a large scale. We place the literature assumptions on the relational schema optimizations' superiority against our experimental findings, and follow this with recommendations for practitioners working with large RDF datasets.

6.1 Assumption: WPT always outperforms PT
According to [6, 21], we expect the performance of the WPT schema to outperform the PT schema, especially with "star-shaped" queries. Star-shaped queries can be answered by querying the WPT table with no joins involved, because all the properties relevant to the same subject are present in the same row of the WPT table.

The state-of-the-art findings for the WPT schema are fully reproduced with the default HDFS partitioning and the baseline Parquet file format. That is, the performance of Spark using the WPT schema for representing the RDF dataset always outperforms the baseline PT schema.

Nevertheless, our results show that when we deviate from the original setup [21] by introducing new experimental factors, the solution space increases in complexity. Consequently, the trade-offs between relational schemas, partitioning techniques, and storage formats make the WPT optimization's reproducibility not straightforward. Using other partitioning techniques alongside the baseline Parquet format affected the reproducibility of the WPT schema optimizations.
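The no-join property of WPT behind the assumption in Section 6.1 can be illustrated with a minimal, self-contained sketch. The tables, columns, and data here are invented for illustration, and SQLite stands in for SparkSQL:

```python
import sqlite3

# Illustrative layout: a tiny WPT table (one row per subject, one column
# per property) vs. two VP tables (one two-column table per property).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE wpt (s TEXT PRIMARY KEY, title TEXT, year INTEGER);
INSERT INTO wpt VALUES ('art1', 'Paper A', 2008), ('art2', 'Paper B', 2009);

CREATE TABLE vp_title (s TEXT, o TEXT);
CREATE TABLE vp_year  (s TEXT, o INTEGER);
INSERT INTO vp_title VALUES ('art1', 'Paper A'), ('art2', 'Paper B');
INSERT INTO vp_year  VALUES ('art1', 2008), ('art2', 2009);
""")

# Star-shaped pattern {?s :title ?t . ?s :year ?y} over WPT: a single
# scan, no joins -- both properties live in the same row.
wpt_rows = con.execute(
    "SELECT s, title, year FROM wpt WHERE year = 2008").fetchall()

# The same pattern over VP tables needs one join per additional property.
vp_rows = con.execute("""
    SELECT t.s, t.o, y.o FROM vp_title t
    JOIN vp_year y ON t.s = y.s WHERE y.o = 2008""").fetchall()

assert wpt_rows == vp_rows == [('art1', 'Paper A', 2008)]
```

The two queries return the same answer; the difference is only in how many tables must be scanned and joined to produce it.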
Only 78% of the query results conform with the fact that WPT is better than the PT schema (Table 5).

Figure 5 analyzes the schemas' performance when the solution adopts different partitioning techniques and file formats. Figures 5 (a-h) clearly show the effect of the partitioning techniques on the reproducibility of the WPT optimizations across all the different file formats. For instance, horizontal partitioning (Figures 5 (a,c,e,g)) notably affected the performance of WPT, making its performance in SparkSQL worse than the baseline PT schema in most of the queries (i.e., 𝑄1, 𝑄3, 𝑄5, 𝑄6, 𝑄8, 𝑄11). Similarly, we can observe the negative effect of the subject-based technique on the WPT schema (Figures 5 (b,d,f,h)) for the same queries.

The impact of file formats other than Parquet is even worse: it affects the reproducibility of the WPT schema optimizations even under the baseline (HDFS) partitioning technique. Overall, only 58% of the query results conform with the fact that WPT outperforms the PT schema (Table 5). The experiments show that columnar file formats, e.g., ORC and Parquet, are the best for representing such wide tables (WPT and PT). Columnar file formats are the best for sparse queries (i.e., queries with few column projections, or few columns to access) over the wide tables. They perform better than the row-oriented file formats, e.g., CSV and Avro, which would only be better for queries that require reading full rows. Figure 5 also shows the performance degradation when considering different file formats: moving from Parquet and ORC in Figures 5 (e-h) to row-oriented file formats such as Avro and CSV in Figures 5 (a-d), we can notice the performance degradation of the queries with the WPT schema optimizations.

6.2 Assumption: ExtVP always outperforms VP
According to [22], we expect that ExtVP provides better, or at least similar, performance gains, as the queries are similar and the number of SQL joins in the VP schema is equal to the number of ExtVP joins. Nevertheless, one should keep in mind that ExtVP improvements are mainly due to the nature of the original SPARQL query.
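Concretely, ExtVP precomputes semi-join reductions of the VP tables so that dangling triples never reach the query-time join. A minimal sketch of this idea with illustrative data, using pandas as a stand-in for Spark DataFrames:

```python
import pandas as pd

# Illustrative VP tables (one two-column table per predicate); not the
# paper's actual data.
vp_author = pd.DataFrame({"s": ["p1", "p2", "p3"], "o": ["a1", "a2", "a3"]})
vp_year   = pd.DataFrame({"s": ["p1", "p2"],       "o": [2008, 2009]})

# ExtVP table for (author |><| year): a semi-join keeping only vp_author
# rows whose subject also appears in vp_year; p3 is dangling and dropped.
extvp_author_year = vp_author[vp_author["s"].isin(vp_year["s"])]

# A join over the reduced table scans fewer input rows than over the full
# VP table, while producing exactly the same result.
joined = extvp_author_year.merge(vp_year, on="s", suffixes=("_author", "_year"))
assert len(joined) == len(vp_author.merge(vp_year, on="s"))
```

When a query's joins drop no rows (as for 𝑄9, 𝑄10, and 𝑄11 here), the reduced tables are identical to the VP ones and no gain can be expected.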
They also depend on the possible reductions in the input table sizes by excluding the dangling triples (rows that do not contribute to any joins) [22]. Typically, ExtVP queries are similar to the VP ones; the only difference lies in the queried tables/DataFrames (i.e., their sizes are either reduced by ExtVP or the same as in VP). Thus, the relational engine's performance, e.g., Spark with ExtVP, should be equivalent or better than its performance with the VP schema.

Based on our experiments, the findings of the ExtVP schema are not fully reproduced, even considering the default HDFS partitioning and the baseline Parquet file format. Some queries (𝑄9, 𝑄10, 𝑄11) do not benefit from the ExtVP optimizations, as no input table size reductions occurred for those queries (cf. Table 6). Beyond those queries, we can confirm the state-of-the-art results (ExtVP performs better than VP in most cases). However, our results show that schema-based query optimization is not straightforward in such a complex solution space.

Regarding the partitioning techniques, using an alternative to the baseline technique (HDFS) affects the reproducibility of the ExtVP optimizations even if the storage format is Parquet: only 55% of the query results show that ExtVP is superior to the VP schema (cf. Table 8). Moreover, Figure 6 shows the effect of other RDF partitioning techniques on the reproducibility of the ExtVP optimization findings. For instance, deviating from the baseline partitioning technique to other RDF-based techniques while keeping the baseline Parquet format, i.e., Figures 6 (a-c), degrades the results of ExtVP and makes it perform worse than the baseline VP schema in several queries (𝑄1, 𝑄4, 𝑄5, 𝑄6, 𝑄8) with the Horizontal and Subject-based partitioning. The predicate-based partitioning in Figure 6 (c) performs better with this schema, with performance close to VP's in the previously mentioned queries.

Similarly, using storage formats different from Parquet affects the ExtVP optimizations' reproducibility, even with the baseline (HDFS) partitioning technique. Indeed, only 67.5% of the query results show ExtVP outperforming VP (cf. Table 8). Likewise, Figure 6 shows the effect of file formats other than the baseline Parquet, i.e., Figures 6 (d-l) for ORC, Avro, and CSV, respectively. We can notice the queries' performance degradation with the ExtVP schema optimizations when moving vertically to these other formats.

Finally, from our experiments, we observe that columnar file formats are better than the row-oriented ones. However, the performance difference is not significant with such similar schemas: the table structure is the same two-column (Subject-Object) table per predicate in both vertical schemas. Moreover, neither schema has wide tables in comparison to the WPT and PT schemas, so these schemas do not benefit much from the columnar file formats. The performance gain of columnar over row-oriented file formats arises because the SP2Bench queries have few column projections; thus, they work better with columnar rather than row-based file formats.

Table 9: Mapping the partitioning technique to the storage format best practices in WPT

                Avro   CSV   ORC   Parquet
Baseline-HDFS   X      X     ✓*    ✓**
Horizontal      X      X     ✓     ✓
Subject-based   X      X     ✓     ✓

Where ✓ is good practice, X is bad practice, and ˜ has the same performance compared to PT.
* WPT had very competitive performance.
** WPT had the best performance.

Table 10: Mapping the partitioning technique to the storage format best practices in ExtVP

                  Avro   CSV   ORC   Parquet
Baseline-HDFS     ✓      ✓     ˜     ✓*
Horizontal        X      X     X     X
Subject-based     X      X     X     X
Predicate-based   X      X     ✓     ✓

Where ✓ is good practice, X is bad practice, and ˜ has the same performance compared to VP.
* ExtVP had a very competitive performance.

6.3 Recommendations
Overall, Tables 9 and 10 provide an abstracted map of good and bad storage format and partitioning technique combinations.

The results in Figure 5 and Table 9 show that partitioning the WPT table has, in the majority of cases, a negative effect on the WPT optimization, making it perform even worse than its baseline approach, i.e., the PT schema. The effect of the storage formats is more significant in the WPT optimization (cf. Tables 5, 9). Therefore, the storage format selection for the WPT schema should be treated as a first-class citizen in such experiments.

The horizontal and subject-based partitioning techniques are not recommended with the ExtVP optimization. However, predicate-based partitioning still gives better results than those two other RDF partitioning techniques (cf. Tables 8 and 10). Also, columnar file formats are still recommended with the ExtVP schema optimization. However, we noticed that the effect of the partitioning is more significant for this optimization (cf. Figure 6, Tables 8 and 10). Thus, the partitioning selection for the ExtVP schema should be carefully considered in these experiments.

Our analysis also yields the following recommendations:
(1) With WPT, it is recommended to use columnar storage formats rather than row-oriented ones (cf. Table 9).
(2) With the WPT schema, Parquet is the best columnar file format to select; it efficiently handles the schema's sparsity.
(3) With WPT, it is recommended to use the native HDFS partitioning, rather than selecting an RDF-oriented partitioning technique.
(4) With ExtVP, the baseline HDFS partitioning is more recommended than the specific RDF ones. However, larger datasets would require partitioning anyway.
(5) With ExtVP, the columnar file formats are a recommended optimization.

7 RELATED WORK
In this section, we present the related work. In particular, we focus on comparative studies that investigate the use of Big Data frameworks for distributed RDF processing. To the best of our knowledge, the literature includes several studies that compare partitioning techniques, relational schemas, and storage formats [2, 6, 8, 14]. However, none of these approaches focuses on replicating and comparing existing optimization techniques.

Abdelaziz et al. [2] discussed several relational schemas for materializing RDF datasets. Their main goal was to assess different native and non-native RDF processing systems. However, their work does not discuss the impact of different relational schemas on a specific system's performance, such as SparkSQL; nor does it discuss partitioning techniques and data formats.
Arrascue et al. [6] led an investigation of the performance of the WPT schema against alternative relational schemas, i.e., triple tables, VP, and domain-dependent tables. Additionally, they consider subject-based partitioning, but limit the data formats to Parquet. The work's main finding is the flexibility of WPT for generic query shapes, in contrast with the other approaches, even when considering partitioning. However, their exploration of the solution space is limited in terms of partitioning techniques and data formats.

Cossu et al. [8] focused on a hybrid storage approach that combines the benefits of the PT and VP schemas to boost query performance without the need for extensive loading time. Their solution, PRoST, was able to outperform state-of-the-art systems like S2RDF for several query shapes. Nevertheless, their exploration of partitioning techniques and data formats is limited. Additionally, they focused their work on the PT and VP schemas, not considering WPT as an alternative schema that may further improve the performance.

On another side, the results of Pham et al. [14] indicate that more than 95% of RDF dataset triples have a tabular structure. They combine structural non-quotient and statistical methods to automatically discover and detect an emergent relational schema (in the form of property tables) in RDF datasets. A similar approach has been proposed in [12] to mitigate the limitations of the WPT and PT RDF schemata by merging the related hierarchical characteristic sets, providing a novel RDF relational schema that aims at better SPARQL query evaluation.

Finally, Akhter et al. [4] investigated the performance of different partitioning techniques for RDF data, proposing a ranking function that helps practitioners choose the most appropriate technique.

8 CONCLUSIONS & FUTURE WORK
The reproducibility of well-known relational RDF processing optimizations is critical to foster best practices that guide practitioners' efforts. In this paper, we presented a comprehensive empirical evaluation using three RDF partitioning techniques and four storage formats over the distributed SparkSQL engine to cope with this limitation. Our analysis clearly demonstrates the varying trade-offs of different relational schemas, data partitioning techniques, and storage file formats against these state-of-the-art optimizations. Our experiments show a significant degradation in Spark performance when partitioning the WPT table by subject or horizontally, due to the vast, sparse, and large partitions of its schema table. On the same note, the storage format also affects the WPT performance, where ORC and Parquet are the most suitable representations for such a configuration. Our results on ExtVP illustrate that schema-based query optimization is not straightforward under different configurations.

Future work includes extending this study by analyzing the impact of data scalability on SPARQL performance. We intend to utilize other RDF benchmarks such as WatDiv, with different types of query shapes and complexities. Our plans include investigating this area further to design a benchmark that combines query workloads with precise partitioning and storage instructions.

REFERENCES
[1] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and Kate Hollenbach. 2007. Scalable semantic web data management using vertical partitioning. In VLDB.
[2] Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, and Panos Kalnis. 2017. A survey and experimental comparison of distributed SPARQL engines for very large RDF data. Proceedings of the VLDB Endowment 10, 13 (2017), 2049–2060.
[3] Giannis Agathangelos, Georgia Troullinou, Haridimos Kondylakis, Kostas Stefanidis, and Dimitris Plexousakis. 2018. RDF Query Answering Using Apache Spark: Review and Assessment. In 34th IEEE International Conference on Data Engineering Workshops, ICDE Workshops 2018, Paris, France, April 16-20, 2018. IEEE Computer Society, 54–59.
[4] Adnan Akhter, Axel-Cyrille Ngonga Ngomo, and Muhammad Saleem. 2018. An empirical evaluation of RDF graph partitioning techniques. In European Knowledge Acquisition Workshop. Springer, 3–18.
[5] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In SIGMOD Conference. ACM, 1383–1394.
[6] Victor Anthony Arrascue Ayala, Polina Koleva, Anas Alzogbi, Matteo Cossu, Michael Färber, Patrick Philipp, Guilherme Schievelbein, Io Taxidou, and Georg Lausen. 2019. Relational schemata for distributed SPARQL query processing. In Proceedings of the International Workshop on Semantic Big Data. 1–6.
[7] Feras M. Awaysheh, Mamoun Alazab, Maanak Gupta, Tomás F. Pena, and José C. Cabaleiro. 2020. Next-generation big data federation access control: A reference model. Future Generation Computer Systems (2020).
[8] Matteo Cossu, Michael Färber, and Georg Lausen. 2018. PRoST: Distributed Execution of SPARQL Queries Using Mixed Partitioning Strategies. In Proceedings of the 21st International Conference on Extending Database Technology, EDBT 2018, Vienna, Austria, March 26-29, 2018. OpenProceedings.org, 469–472. https://doi.org/10.5441/002/edbt.2018.49
[9] J. Huang, D. Abadi, and K. Ren. 2011. Scalable SPARQL querying of large RDF graphs. Proceedings of the VLDB Endowment 4 (2011), 1123–1134.
[10] Todor Ivanov and Matteo Pergolesi. 2019. The impact of columnar file formats on SQL-on-hadoop engine performance: A study on ORC and Parquet. Concurrency and Computation: Practice and Experience (2019), e5523.
[11] Todor Ivanov and Matteo Pergolesi. 2020. The impact of columnar file formats on SQL-on-hadoop engine performance: A study on ORC and Parquet. Concurrency and Computation: Practice and Experience 32, 5 (2020). https://doi.org/10.1002/cpe.5523
[12] Marios Meimaris, George Papastefanatos, and Panos Vassiliadis. 2020. Hierarchical Property Set Merging for SPARQL Query Optimization. In DOLAP. 36–45.
[13] Thomas Neumann and Gerhard Weikum. 2010. The RDF-3X engine for scalable management of RDF data. The VLDB Journal 19, 1 (2010), 91–113.
[14] Minh-Duc Pham, Linnea Passing, Orri Erling, and Peter A. Boncz. 2015. Deriving an Emergent Relational Schema from RDF Data. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015. ACM, 864–874. https://doi.org/10.1145/2736277.2741121
[15] Mohamed Ragab, Riccardo Tommasini, Sadiq Eyvazov, and Sherif Sakr. 2020. Towards making sense of Spark-SQL performance for processing vast distributed RDF datasets. In Proceedings of The International Workshop on Semantic Big Data. 1–6.
[16] Mohamed Ragab, Riccardo Tommasini, and Sherif Sakr. 2019. Benchmarking Spark-SQL under Alliterative RDF Relational Storage Backends. In QuWeDa@ISWC.
[17] Sherif Sakr. 2009. GraphREL: A Decomposition-Based and Selectivity-Aware Relational Framework for Processing Sub-graph Queries. In DASFAA.
[18] Sherif Sakr and Ghazi Al-Naymat. 2010. Relational processing of RDF queries: a survey. ACM SIGMOD Record 38, 4 (2010), 23–28.
[19] Sherif Sakr, Angela Bonifati, Hannes Voigt, Alexandru Iosup, Khaled Ammar, Renzo Angles, Walid Aref, Marcelo Arenas, Maciej Besta, Peter A. Boncz, et al. 2020. The Future is Big Graphs! A Community View on Graph Processing Systems. arXiv preprint arXiv:2012.06171 (2020).
[20] Muhammad Saleem, Gábor Szárnyas, Felix Conrads, Syed Ahmad Chan Bukhari, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo. 2019. How Representative Is a SPARQL Benchmark? An Analysis of RDF Triplestore Benchmarks. In The World Wide Web Conference. ACM, 1623–1633.
[21] Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, and Georg Lausen. 2014. Sempala: Interactive SPARQL query processing on Hadoop. In International Semantic Web Conference. Springer, 164–179.
[22] Alexander Schätzle, Martin Przyjaciel-Zablocki, Simon Skilevic, and Georg Lausen. 2016. S2RDF: RDF querying with SPARQL on Spark. Proceedings of the VLDB Endowment 9, 10 (2016), 804–815.
[23] Michael Schmidt, Thomas Hornung, Norbert Küchlin, Georg Lausen, and Christoph Pinkel. 2008. An Experimental Comparison of RDF Data Management Approaches in a SPARQL Benchmark Scenario. In International Semantic Web Conference (Lecture Notes in Computer Science), Vol. 5318. Springer, 82–97.
[24] Michael Schmidt, Thomas Hornung, Georg Lausen, and Christoph Pinkel. 2009. SP^2Bench: A SPARQL Performance Benchmark. In Proceedings of the 25th International Conference on Data Engineering, ICDE 2009, March 29 - April 2, 2009, Shanghai, China. 222–233. https://doi.org/10.1109/ICDE.2009.28
[25] Matei Zaharia, Reynold S. Xin, Patrick Wendell, et al. 2016. Apache Spark: a unified engine for big data processing. Commun. ACM 59, 11 (2016), 56–65.
[26] Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, and Dongyan Zhao. 2011. gStore: answering SPARQL queries via subgraph matching. Proceedings of the VLDB Endowment 4, 8 (2011), 482–493.