=Paper= {{Paper |id=Vol-2840/paper11 |storemode=property |title=An In-depth Investigation of Large-scale RDF Relational Schema Optimizations Using Spark-SQL |pdfUrl=https://ceur-ws.org/Vol-2840/paper11.pdf |volume=Vol-2840 |authors=Mohamed Ragab,Riccardo Tommasini,Feras M. Awaysheh,Juan Carlos Ramos |dblpUrl=https://dblp.org/rec/conf/dolap/00010AR21 }} ==An In-depth Investigation of Large-scale RDF Relational Schema Optimizations Using Spark-SQL== https://ceur-ws.org/Vol-2840/paper11.pdf
        An In-depth Investigation of Large-scale RDF Relational
               Schema Optimizations Using Spark-SQL
                            Mohamed Ragab                                                         Riccardo Tommasini
              Data Systems Group, University of Tartu                                   Data Systems Group, University of Tartu
                      mohamed.ragab@ut.ee                                                      riccardo.tommasini@ut.ee

                         Feras M. Awaysheh                                                         Juan Carlos Ramos
              Data Systems Group, University of Tartu                                   Data Systems Group, University of Tartu
                       feras.awaysheh@ut.ee                                                         jramos@ut.ee

ABSTRACT
This paper discusses one of the most significant challenges of large-scale RDF data processing over Apache Spark, the relational schema optimization. The choice of RDF partitioning techniques and storage formats using SparkSQL significantly impacts query performance, as do the relational schemas themselves. Nevertheless, the trade-offs in different configurations have not been a subject of intensive study in the literature. This paper presents an in-depth investigation for practitioners to understand such trade-offs and their best practices. It also reports on the pitfalls behind the implementation of SPARQL optimizations over SparkSQL. Our experiments provide insights into these schemas' relative strengths by comparing three different partitioning techniques and four storage formats. Our results draw a better understanding of the current State-Of-The-Art (S.O.T.A) and pave the way for a wide range of best practices and for systematically tuning the performance of distributed systems to handle vast RDF data.

1    INTRODUCTION
Currently, we are witnessing an enormous amount of widely available RDF datasets [19]. Centralized RDF engines, e.g., RDF-3X [13] and gStore [26], provide native ways for processing/querying RDF datasets with the full expressive capabilities of SPARQL. Yet, they cannot handle large-scale RDF datasets effectively [2, 9]. The need for processing large RDF datasets calls for innovative solutions to store, analyze, and query these massive RDF datasets [2]. This call leads the community to leverage Big Data (BD) processing frameworks like Apache Spark [25] to process large RDF datasets [3].

BD platforms excel in the analytical processing of relational data. The literature includes several attempts that leverage such capabilities to analyze RDF data [2, 17]. In practice, utilizing BD engines for RDF relational processing requires storing RDF data using a relational schema and translating SPARQL queries into equivalent SQL ones. On the same note, BD platforms are designed to scale horizontally [7]. However, the choice of the right schema can significantly impact the performance of query processing [18]. Moreover, the choice of the partitioning technique also leads to varying query runtime performance [4]. In this regard and from a BD perspective, we cannot ignore the variety of data formats [11]. Given the complexity of the solution space, i.e., relational schema, partitioning technique, and storage format, current works focus on one dimension at a time. However, a comprehensive analysis of the trade-offs among these dimensions is of paramount importance [3] yet is still missing.

In this paper, we try to fill this research gap by experimentally evaluating SPARQL on top of SparkSQL. In particular, our analysis focuses on existing RDF relational schemas and their state-of-the-art improvements. To this end, we present a systematic and comparative evaluation of the query performance considering (i) three RDF partitioning techniques (most suitable for the relational nature of data in Spark-SQL), i.e., Horizontal, Subject-based, and Predicate-based partitioning, and (ii) four different well-established storage formats, i.e., ORC, CSV, Parquet, and Avro [15, 16]. In this way, our work differs from previous ones [21, 22] that only focus on the complexity of the workloads and the size of the data.

The contribution of this paper is threefold. (i) First, it uses SparkSQL to validate the performance of RDF schema advancements (i.e., ExtVP and WPT) compared to their baseline counterparts (i.e., VP and PT, respectively). (ii) Second, it empirically analyzes the effect of partitioning techniques on the ExtVP and WPT schema runtime performance. (iii) Third, it tests the effects of multiple distributed row-oriented and columnar storage file formats on HDFS. Finally, it outlines the best practices and recommendations that help in achieving the best RDF query performance. Overall, the paper findings guide the realization of next-generation large-scale RDF solutions over Apache Spark by optimizing the relational schemas.

The remainder of the paper is organized as follows: Section 2 presents an overview of the required background information and key concepts necessary to understand our study. Section 3 discusses the experimental methodology. Section 4 presents the benchmarking scenario and the experimental setup. Section 5 presents the paper results, while we provide a comprehensive discussion in Section 6. Section 7 presents the related work, positioning this paper in the context of other surveys on RDF processing using BD frameworks. Finally, Section 8 concludes the paper and presents future work.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2    BACKGROUND
In this section, we present the information that is necessary to understand the content of this paper. We assume that the reader is familiar with the RDF data model and the SPARQL query language.

2.1    Apache Spark & SparkSQL
Apache Spark is currently the de-facto BD engine [25]. It is one of the most active and widely-used large-scale data processing systems in both industry and academia [5]. It mainly adopts in-memory distributed computing for large-scale data analytics.
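Before turning to SparkSQL, the relational processing of RDF outlined in the introduction (store the triples in a relational schema, then translate SPARQL into SQL) can be made concrete with a small sketch. It uses Python's built-in sqlite3 module rather than Spark, and the table name, the sample triples, and the query are illustrative assumptions, not the benchmark data or the authors' query translations.

```python
import sqlite3

# Toy triples (subject, predicate, object); purely illustrative data.
TRIPLES = [
    ("article1", "title", "RDF on Spark"),
    ("article1", "year", "2021"),
    ("article1", "creator", "alice"),
    ("article2", "title", "Graph Stores"),
    ("article2", "year", "2019"),
]


def build_connection():
    """Load the toy triples into a single three-column triples table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", TRIPLES)
    return conn


def star_query(conn):
    """SQL counterpart of the star-shaped SPARQL pattern
        SELECT ?t ?y WHERE { ?a title ?t . ?a year ?y . ?a creator ?c }
    Each additional triple pattern becomes a self-join on the subject."""
    sql = """
        SELECT t.o AS title, y.o AS year
        FROM triples AS t
        JOIN triples AS y ON y.s = t.s AND y.p = 'year'
        JOIN triples AS c ON c.s = t.s AND c.p = 'creator'
        WHERE t.p = 'title'
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(star_query(build_connection()))
```

Three triple patterns on one triples table already need two self-joins, which is the cost that the relational schemas discussed in this paper try to reduce.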
   SparkSQL is a relational package built on top of Apache Spark [5] with support for the SQL interface while providing capabilities for structured and semi-structured data.

2.2    RDF Relational Schema
The most intuitive approach to follow for representing RDF in a relational structure is the Single Statement Table Schema (ST), which requires storing RDF datasets in a single triples table of three columns that represent the components of the RDF triple, i.e., Subject, Predicate, and Object. This solution is the simplest, and it is commonly adopted by several existing open-source RDF triplestores, e.g., Apache Jena, RDF4J, and Virtuoso. However, it inevitably increases the number of required self-joins for the evaluation of long-chain SPARQL queries when they run on top of relational SQL systems.

Vertically Partitioned Tables Schema (VP) is an RDF storage schema proposed to mitigate the performance issues of the ST schema. It aims to speed up the queries over RDF triple stores [1]. This schema is simple to design; the RDF triples table is decomposed into a table of two columns (Subject, Object) for each unique property in the RDF dataset.

Extended Vertical Partitioning schema (ExtVP) is a query-driven optimization that aims at minimizing the input size of the data during query evaluation [22], inspired by semi-join reductions. In particular, ExtVP minimizes data skewness and eliminates dangling triples (i.e., triples that do not have a joining partner or do not contribute to any join in the SPARQL query) from the input tables. ExtVP speeds up query answering by pre-computing the possible join relations between the VP tables and materializing the results of these semi-joins as tables in the storage backend, e.g., HDFS. Particularly, for every two VP relations, ExtVP relies on pre-computing semi-join reductions of Subject-Subject (SS), Subject-Object (SO), and Object-Subject (OS) join patterns. The output tables are reduced in size and will be used in joins instead of the original VP tables. However, one of the limitations of the ExtVP schema is the additional storage overhead of the materialized ExtVP tables in comparison to the VP schema tables (cf. Table 1).

Property (n-ary) Tables Schema (PT) is a storage schema proposed to cluster multiple RDF properties as n-ary table columns for the same subject to group entities that are similar in structure. The biggest advantage of property tables compared to a single triples table schema (ST) is that they can reduce the number of subject-subject self-joins that result from star-shaped patterns in a SPARQL query. One limitation of the PT schema is that, while it works quite well with highly structured RDF data, its performance degrades for poorly structured data [23]. Furthermore, typical RDF data comes with diverse structures, which makes it hard to define an optimal layout for this schema [22]. Moreover, a poorly-selected property table layout can significantly slow down the query performance [2]. Due to its sparse-table representation, the PT schema also suffers from high storage overheads when a large number of predicates is present in the RDF data model [1].

Wide Property Table Schema (WPT) represents the whole RDF dataset in a single unified table [21]. Such a table uses all RDF properties in the dataset as columns. It aims at extending the PT schema for optimizing star-shaped SPARQL queries, which are highly common in SPARQL query workloads. Therefore, star-shaped SPARQL queries will require no joins to be answered. Moreover, this schema does not require any kind of clustering algorithm that is likely to produce sub-optimal schemas for an arbitrary RDF dataset. Unfortunately, WPT does not overcome all the limitations of the PT schema. Indeed, this representation can also be very sparse for poorly structured data, and it may face a large storage overhead, especially with many multi-valued properties existing in the RDF dataset.

Table 1: SP2Bench-100M RDF relational schemata table data sizes with different file formats

SP2Bench  RDF (n3)  PT                           WPT    VP                        ExtVP
CSV       11GB      ~9.2MB-1.9GB, Total: 6.8GB   9.4GB  8KB-1.9GB, Total: 8.3GB   OS: 4.9GB, SS: 39GB, SO: 806MB, Total: ~45GB
Avro      11GB      980KB-416MB, Total: 1.6GB    1.8GB  8KB-272MB, Total: 1.7GB   OS: 359MB, SS: 8.8GB, SO: 331MB, Total: ~9.5GB
ORC       11GB      620KB-362MB, Total: 1.4GB    1.4GB  8KB-249MB, Total: 1.5GB   OS: 243MB, SS: 7.8GB, SO: 301MB, Total: ~8.4GB
Parquet   11GB      620KB-382MB, Total: 1.5GB    1.7GB  8KB-264MB, Total: 1.6GB   OS: 319MB, SS: 8.4GB, SO: 318MB, Total: ~9GB

2.3    RDF Data Partitioning
For RDF data processing, many partitioning techniques exist [2, 4]. In the following, we present the partitioning techniques that are suitable for our experiments on SparkSQL.

Horizontal-Based Partitioning (HP) requires dividing the RDF dataset evenly (as much as possible) over the number of machines in the cluster. In particular, we use this technique to partition the relational RDF tables of the different schemas horizontally into n even chunks (i.e., partitions) over the cluster machines.

Subject-Based Partitioning (SBP) requires the distribution of triples into partitions according to the hash value computed for the RDF subjects. As a result, all the triples that have the same subject are assumed to reside on the same partition. In our scenario, we applied Spark partitioning using the subject as the partitioning key with our different relational schema tables (i.e., DataFrames).

Predicate-Based Partitioning (PBP) is similar to SBP; it distributes triples to the various partitions based on the hash value computed for the predicate. Similarly, all the triples that have the same predicate are assumed to reside on the same partition. We also applied Spark partitioning using the predicate as the partitioning key with our different relational schema DataFrames.

Baseline partitioning (BP): In our experiments, we also used the baseline partitioning technique, which depends on the native default HDFS partitioning of the table files over the cluster nodes. This is the technique used in the state-of-the-art works on the schema advancements [21, 22].

3    EVALUATION METHODOLOGY
In this section, we discuss the experimental methodology that we used to reproduce the state-of-the-art findings [6, 21, 22], which implies some changes in the experimental artifacts. We organize our experiments as follows.

First, we assess if we can reproduce the state-of-the-art results of those schema optimizations over the baseline relational schemas' performance. Thus, we performed our experiments in a setup as similar as possible to what the original authors have done [21, 22]. In this regard, we use the baseline HDFS partitioning technique. We also use Parquet as our baseline storage file format (grey shaded boxes, cf. Figure 1).

Second, we introduce disturbing factors to our experiments, such as the different partitioning techniques and different file formats, alongside different SPARQL query shapes.

Regarding the data partitioning, we introduce the Horizontal Partitioning technique and Subject-based partitioning for the WPT and PT schema experiments. On the other hand, Horizontal, Subject and Predicate-based partitioning techniques were used for the VP and ExtVP schema experiments. We expect that these partitioning techniques will negatively impact the performance of SparkSQL when evaluating SPARQL queries due to the distribution of the relational table across nodes. This will force more shuffling in the presence of joins. In particular, Horizontal partitioning should have a worse impact than Subject-based
partitioning on PT and WPT schemas, and Predicate-based on (Ext)VP ones. It is worth mentioning that the HP technique does not take the query shape into account and may place related rows on different nodes.

Regarding the storage file formats, besides the baseline Parquet, we consider an additional columnar one, i.e., ORC, and two row-oriented ones, i.e., CSV and Avro. We expect columnar formats to perform better for the queries with a subset of column projections, since they allow an efficient scan of tables by reading only a portion of the columns [10]. In fact, SP2Bench has a small number of column projections across all its benchmark queries.

Finally, aiming to draft our observations and primary findings and to propose best practices, we discuss and analyze our results. Additionally, we highlight the trade-offs of combining all these dimensions in the discussion section.

Moreover, we aim to observe these optimizations' impact on large-scale SPARQL query performance on the SparkSQL engine. Mostly, we want to verify and answer the following questions:
    (1) How far do RDF partitioning techniques and storage formats impact the query performance?
    (2) How can we systematically analyze different relational schemas? How can these schemas be effectively improved to achieve the highest performance?
    (3) What are the best practices that guide the large RDF community efforts in adopting performance-oriented solutions?

[Figure 1 diagram, three dimensions. Relational Schemata: Wide Property Tables, Property Tables, Ext. Vertical Tables, Vertical Tables. Partitioning Technique: Baseline HDFS, Horizontal Based, Subject Based, Predicate Based. Storage Formats: Parquet, ORC, AVRO, CSV.]
Figure 1: Experiments architecture and evaluation environment

       #Joins   #Filters   #Projections   Query Shape
Q1     3        0          1              S
Q2     8        0          10             S
Q3     1        1          1              S
Q4     7        1          2              SF
Q5     5        1          2              SF
Q6     8        3          2              SF
Q7     12       2          1              SF
Q8     10       2          1              SF
Q9     3        0          1              S (U)
Q10    0        0          2              TP (U)
Q11    0        0          1              TP
Table 2: Benchmark Queries Characteristics: Shape, i.e., [S]tar, [S]now[F]lake, or a single [T]riple[P]attern; (U) for unbounded Predicate Variable, Number of Joins, filters, and projections.

4    BENCHMARK & EXPERIMENTAL SETUP
This section outlines the experimental setup and the benchmark used, together with its queries. The experimental setup (presented in Figure 1) summarizes the configuration combinations (Relational schema, Partitioning, Storage). The triangle with X represents that we have performed our experiments for 4 different relational schemas, partitioning each schema with 4 different partitioning techniques, i.e., one baseline HDFS technique and 3 RDF-specific techniques. Last but not least, those schemas are stored across 4 different storage formats. In detail:

Benchmark & Dataset: In our evaluation, we used the SP2Bench (SPARQL Performance Benchmark) [24]. SP2Bench has a reasonably low score of data structuredness, making it closer to the structure of real-world RDF datasets [20]. So, it is valid to state that, to the best of our understanding, SP2Bench covers a wide spectrum of queries and addresses well the main claims we are investigating.

Data Storage: We generated a synthetic RDF dataset with a size of 100M triples in Notation3 format. This scale is enough for checking the validity of the literature findings regarding the RDF relational schema optimizations, and for maintaining their reproducibility in a more complex solution space.

The generated n3 RDF dataset is converted into CSV relational schemas using Jena TDB¹, a disk-based access repository for storing RDF datasets. We further used Jena ARQ² for querying these TDB datasets and generating the output schema tables in the CSV file format. Finally, these raw textual CSV documents are loaded into HDFS. Moreover, we have used the Spark framework to write the relational schema data tables from the CSV format into the other HDFS file formats (Avro, Parquet, and ORC). Table 1 shows the size of the generated native RDF dataset (i.e., 11GB), as well as the storage sizes of each relational schema in the mentioned file formats on top of HDFS. It clearly shows how the different relational schemas affect the input data sizes.

¹ https://github.com/apache/jena/tree/master/jena-tdb
² https://github.com/apache/jena/tree/master/jena-arq
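The conversion of triples into relational schema tables can be illustrated with a toy sketch. It is not the authors' Jena TDB/ARQ pipeline: it uses only the Python standard library, the sample triples and function names are hypothetical, and multi-valued properties are joined with a "|" separator purely as a simplifying assumption.

```python
import csv
import io
from collections import defaultdict

# Toy triples (subject, predicate, object); purely illustrative data.
TRIPLES = [
    ("article1", "title", "RDF on Spark"),
    ("article1", "creator", "alice"),
    ("article1", "creator", "bob"),
    ("article2", "title", "Graph Stores"),
]


def to_vp(triples):
    """VP schema: one (subject, object) table per distinct predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def to_wpt(triples):
    """WPT schema: one row per subject, one column per predicate.
    A property missing for a subject becomes None (the WPT sparsity)."""
    predicates = sorted({p for _, p, _ in triples})
    values = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        values[s][p].append(o)
    rows = []
    for s in sorted(values):
        row = [s]
        for p in predicates:
            objs = values[s].get(p)
            # Assumption: multi-valued properties are packed into one cell.
            row.append("|".join(objs) if objs else None)
        rows.append(tuple(row))
    return ["subject"] + predicates, rows


def to_csv(header, rows):
    """Serialize a table as CSV text; None is written as an empty field."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    header, rows = to_wpt(TRIPLES)
    print(to_csv(header, rows))
    for predicate, table in to_vp(TRIPLES).items():
        print(predicate, table)
```

The empty cell for article2's creator shows in miniature why the wide table becomes sparse for poorly structured data, while the VP tables store only the pairs that exist.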
In action, the PT schema has the smallest table sizes in total, fol-               Q1     Q2      Q3     Q4     Q5      Q6     Q8     Q10      Q11
lowed by the VP schema, then the WPT table schema. Whereas,
the largest storage overheads come with the ExtVP schema. We             PT         2       9      2      8       7      6       9      5        2
can also notice how the storage formats affect the sizes of the          WPT        0       0      0      3       3      3      10      3        0
schemas significantly. In particular, columnar-oriented formats        Table 3: SP2Bench queries: Number of Joins of PT vs WPT.
have the minimum table sizes across all the schemas. Indeed, ORC
is shown to have the minimum table sizes, followed by Parquet.                  WPT vs. PT          Avro      CSV       ORC       Parquet
While, the Avro row-oriented formats have quite larger schema                   Baseline             2/9      2/9        8/9        9/9
sizes, and CSV has the largest table sizes.
                                                                                Horizontal           2/9      3/9        6/9        6/9
Queries: SP2 Bench queries have different complexities and a                    Subject              2/9      2/9        6/9        6/9
high diversity of features [20]. These queries implement meaning-
                                                                       Table 4: Number of queries for which WPT beats PT for
ful requests on top of RDF data. In our experiments, we reused the
                                                                       data formats and partitioning techniques.
SQL version of the queries associated with the SP2 Bench bench-
mark 3 for the mentioned RDF relational schemas. However, for
the new relational schema advancements (e.g. ExtVP, WPT) that          5.1     WPT VS. PT Schema Results
are missing on the benchmark website, we have manually trans-          Table 3 shows the SP2 Bench queries’ number of joins when trans-
lated these queries into SQL, and we provide all these translated      lated into SQL concerning the PT and WPT schemas. Except for
queries in our project repository 4 . We have evaluated all of these   𝑄8 (that requires many self-joins of the WPT table), the number
11 queries of type SELECT, except 𝑄9, and 𝑄11 which are not            of joins always decreases, adopting the WPT schema. Moreover,
applicable (’NA’) for the PT and the WPT relational schemas.           we expect that the WPT schema query performance (i.e., in terms
𝑄7 is also not applicable in the VP and ExtVP schemas. Notably,        of latency) will outperform other relational schemas [6]. In this
for generating the ExtVP tables, the default selectivity threshold     regard, the Parquet data format efficiently handles the sparsity
of 1 has been configured [22]. Table 2 shows our benchmark             caused by the WPT table schema —as Null values are efficiently
queries complexities, in terms of the number of joins, filters, and    ignored in this file format [21].
projections, alongside the SPARQL query shape.                             Meanwhile, Table 4 shows the overall benchmark results of
Environment Setup: Our experiments were executed on a bare-            the WPT performance over PT schema across all file formats
metal cluster of 4 machines with CentOS-Linux V7 OS, running           (horizontally in the table), and across the different partitioning
on 32 cores per node processor, and 128 GB of memory per node,         techniques (vertically). Values in this table specify the number of
alongside with a high speed 2 TB SSD drive for each node. We           queries in which the WPT schema performs better than the base-
used Spark V2.4 to fully support SparkSQL capabilities. In partic-     line PT schema. The green color indicates that WPT performing
ular, our Spark cluster consists of one master node and 3 worker       the best, while the yellow color indicates that its performance is
machines, while Yarn is used as the resource manager, which in         above 50% over PT, and the red means that performance is less
total uses 330 GB and 84 virtual processing cores.                     than 50%.
RDF Data Partitioning: We used Spark partitioners for parti-               Our experiment results confirm that the WPT schema per-
tioning the registered relational schemas tables/Spark DataFrames.     forms better than the baseline PT schema in all the queries (i.e., 9
This is required to persist those DataFrames on top of the HDFS        queries out of 9 queries in the benchmark) with Parquet file for-
default file blocks partitioning level. We use the resulting Data      mat, alongside using the baseline HDFS partitioning technique.
Frames as the input for the query engine. In our experiments, we       Indeed, these results confirm the findings in [6, 21] assessing the
have the baseline HDFS partitioning (grey partitioning box cf. 1).     reproducibility regarding the WPT schema optimization.
While other RDF partitioning techniques also have been tested,             To investigate how the performance difference between the
namely HP, SBP, and PBP approaches. These techniques depend            WPT and PT schemas changes, we introduce two new dimensions,
on partitioning the tables’ data horizontally across machines          i.e., various file formats and different partitioning techniques. In
(i.e HP), or on the Spark key partitioning of the RDF subject or       this regard, Table 5 shows the effect of data partitioning (left of
predicate (i.e SBP, PBP respectively).                                 the table) and storage formats (right of the table) considering the
                                                                       other new factors across all the experiments. To this extent, we
Performance Evaluation measure (Latency): We used the
                                                                       have calculated the percentages as follows, for the partitioning
Spark.time function by passing the spark.sql(...) query execution function as a parameter
to measure the query latency. We ran each query 5 times, excluded the first (cold-start)
run to avoid warm-up bias, and averaged the other 4 run times.

5   EXPERIMENT RESULTS
In this section, we discuss our experiment results. We also compare the optimized
relational schemas (i.e., WPT and ExtVP) against their baseline schemas (i.e., PT and VP,
respectively), according to our methodology (cf. Section 3).

factor’s impact, we pivoted on each partitioning technique and counted the percentage of
cases in which the WPT schema performs better than the PT schema in SparkSQL across all the
queries, while considering all the changes of the storage file formats (moving across
them). We calculated the partitioning effect similarly, but pivoting on the storage file
format and moving across the partitioning techniques in all queries.
    Table 5 also demonstrates that in such a complex space of different relational schemas,
data partitioning techniques, and storage file formats, schema-based query optimization is
not straightforward. As we can see, WPT outperforms the PT schema in only 58% of the
queries that use the baseline default HDFS partitioning technique across the storage
formats, and in only 78% for the
                                                                       3 http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B/queries.php
                                                                       4 https://datasystemsgrouput.github.io/SPARKSQLRDFBenchmarking/
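The measurement protocol described above (five runs per query, the first cold-start run discarded, the remaining four averaged) can be sketched as follows. This is a minimal illustration, not the authors' benchmarking code: `measure_latency`, `average_warm_runs`, and the dummy workload are hypothetical names, and `run_query` stands in for whatever callable executes the `spark.sql(...)` query.

```python
import statistics
import time
from typing import Callable, List


def average_warm_runs(timings: List[float]) -> float:
    """Average every timing except the first (cold-start) one."""
    if len(timings) < 2:
        raise ValueError("need one cold run plus at least one warm run")
    return statistics.mean(timings[1:])


def measure_latency(run_query: Callable[[], object], runs: int = 5) -> float:
    """Time `run_query` `runs` times and return the mean of the warm runs."""
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # hypothetical stand-in for executing spark.sql(query)
        timings.append(time.perf_counter() - start)
    return average_warm_runs(timings)


if __name__ == "__main__":
    # Dummy workload used here instead of a real SparkSQL query.
    latency = measure_latency(lambda: sum(range(100_000)))
    print(f"mean warm latency: {latency:.6f} s")
```

Keeping the averaging step in its own function makes the cold-start exclusion easy to check on recorded timings, independently of how the query is executed.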
[Figure 2 plots: y-axis "Ratio of WPT over PT" (0 to 1.8); x-axis storage formats (avro, csv, orc, Parquet); series Baseline, Horizontal, Subject, and Average; panels (a) Q2 and (b) Q4]

Figure 2: The performance of WPT over PT schema in 𝑄2 and 𝑄4 (values below 1 mean WPT is better than PT)
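As a reading aid for the ratio plotted in Figure 2, the sketch below computes a "WPT over PT" ratio per partitioning technique for one query and one file format. It assumes the ratio is taken between query runtimes and that the "Average" series is the plain mean of the per-technique ratios; neither is stated in this section. All runtime numbers are hypothetical.

```python
from statistics import mean
from typing import Dict, Tuple


def runtime_ratio(optimized_seconds: float, baseline_seconds: float) -> float:
    """Ratio of optimized over baseline runtime; below 1 means the optimized schema wins."""
    if baseline_seconds <= 0:
        raise ValueError("baseline runtime must be positive")
    return optimized_seconds / baseline_seconds


# Hypothetical (WPT, PT) runtimes in seconds for one query and one file format.
runtimes: Dict[str, Tuple[float, float]] = {
    "Baseline": (10.0, 20.0),
    "Horizontal": (30.0, 20.0),
    "Subject": (20.0, 20.0),
}

ratios = {name: runtime_ratio(wpt, pt) for name, (wpt, pt) in runtimes.items()}
# Assumption: the "Average" series is the mean of the per-technique ratios.
ratios["Average"] = mean(ratios.values())

for name, ratio in ratios.items():
    verdict = "WPT better" if ratio < 1 else "WPT not better"
    print(f"{name}: {ratio:.2f} ({verdict})")
```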

[Figure 3 plot: y-axis "Ratio of WPT over PT" (0 to 3); x-axis storage formats (avro, csv, orc, Parquet); series Baseline, Horizontal, Subject, and Average]

Figure 3: The performance of WPT over PT schema in 𝑄8 (values below 1 mean WPT is better than PT)

WPT/PT      Partitioning effect             Storage effect
            Baseline_Part      58.33%       Parquet    77.78%
            Horizontal         47.22%       ORC        74.07%
            Subject-based      44.44%       CSV        25.93%
            Predicate-based    NA           AVRO       22.22%
Table 5: The effect of other partitioning techniques, and other storage formats on the
reproducibility of the WPT S.O.T.A findings

Parquet file format. This result shows the trade-off of considering alternative storage
file formats and partitioning techniques alongside the experiments’ query evaluation.
    Regarding the storage, we can see that ORC, another columnar file format, performs
close to our baseline columnar Parquet file format, with 74%. However, the baseline Parquet
is still better, because Parquet, unlike ORC, can efficiently handle the WPT table’s
sparsity. In contrast, row-oriented formats have a significant negative effect on the
performance of WPT: the WPT schema outperforms PT in only 22% and 25% of all Avro and CSV
queries, respectively. In practice, the SP2Bench workload has only one query (i.e., 𝑄2)
with more than 2 column projections. This explains why column-oriented formats give better
results for WPT than the row-based ones. In general, we can state that file formats
affected the generalization of the state-of-the-art results for the WPT schema.
    Finally, we zoom in on three specific queries, namely 𝑄2, 𝑄4, and 𝑄8, which exemplify
our findings well. There is a tremendous performance enhancement of WPT over PT in 𝑄2 and
𝑄4. The reason is that the number of SparkSQL joins in WPT is significantly lower than in
the PT schema (cf. Table 3). In particular, the PT (SQL version) of 𝑄2 has 9 joins compared
to no joins in the WPT schema, while 𝑄4 has 8 SQL joins with the PT schema compared to 3
self-joins of the WPT table. Interestingly, 𝑄8 has more joins in WPT than in the baseline
PT schema, i.e., 10 self-joins and 8 joins, respectively. Figures 2 (a), (b) and Figure 3
depict the performance of SparkSQL for 𝑄2, 𝑄4, and 𝑄8, respectively, under various
combinations of file formats and partitioning techniques. In particular, these figures show
the ratios of WPT over PT in those queries. Ratios less than 1 indicate better performance
of WPT over PT in that query and across the different configuration settings.
    Not surprisingly, 𝑄8 is the only query that witnesses worse performance for WPT
compared to the PT schema. Figure 3 shows that most of the ’WPT over PT’ ratios are greater
than 1 in the baseline-partitioned data experiments (i.e., only partitioned with HDFS) and
with file formats other than Parquet. Notably, all the results (i.e., total query runtimes)
and query histograms can be found in our GitHub repository mentioned above.

5.2   ExtVP vs. VP Schema Results
According to [22], ExtVP outperforms or at least performs similarly to the VP schema. The
reason is that the queries are similar, and the number of SQL joins in the VP and ExtVP
schemas is the same. This is reflected in Table 6. Indeed, the performance improvement
depends mainly on the percentage of reductions in the input table sizes that the ExtVP
optimization introduces from the join correlations of each query [22]. Table 6 also
presents, for each query, the percentage by which ExtVP reduces the processed table rows
relative to the rows processed with the baseline VP tables. The semi-join reductions
provided by ExtVP help speed up SparkSQL by reducing the size of the shuffled data.

        VP             ExtVP          Input tables data size red.
Q1      2              2              58%
Q2      9              9              77%
Q3      1              1              59%
Q4      7              7              96%
Q5      5              5              60%
Q6      9              9              31%
Q8      9 & 1 Union    9 & 1 Union    5%
Q9      2 & 1 Union    2 & 1 Union    0%
Q10     1 Union        1 Union        0%
Q11     0              0              0%
Table 6: Number of joins and percentage of input tables sizes [Red]uctions after
optimization ExtVP vs. VP.
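To make the semi-join reductions and the reduction percentages of Table 6 concrete, the sketch below builds two reduced tables from tiny in-memory VP tables. It assumes the usual reading of ExtVP in [22], where a VP table is semi-joined with another VP table on a subject or object correlation. The helper names and the toy triples are hypothetical and are not taken from the benchmark data.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # one (subject, object) row of a VP table
S, O = 0, 1            # column positions within a row


def semi_join(left: List[Row], right: List[Row], left_col: int, right_col: int) -> List[Row]:
    """Keep the rows of `left` whose `left_col` value occurs in `right_col` of `right`."""
    keys = {row[right_col] for row in right}
    return [row for row in left if row[left_col] in keys]


def reduction_percentage(original_rows: int, reduced_rows: int) -> float:
    """Percentage of rows removed from the original table."""
    if original_rows <= 0:
        raise ValueError("original table must not be empty")
    return 100.0 * (1 - reduced_rows / original_rows)


# Toy VP tables (hypothetical data), one table per predicate.
vp_author = [("paper1", "alice"), ("paper2", "bob"), ("paper3", "carol"), ("paper4", "dave")]
vp_cites = [("paper1", "paper2"), ("paper9", "paper8")]

# Subject-to-Subject: author rows whose subject also appears as a subject in cites.
ext_ss_author = semi_join(vp_author, vp_cites, S, S)
# Object-to-Subject: cites rows whose object appears as a subject in author.
ext_os_cites = semi_join(vp_cites, vp_author, O, S)

print(ext_ss_author, reduction_percentage(len(vp_author), len(ext_ss_author)))
print(ext_os_cites, reduction_percentage(len(vp_cites), len(ext_os_cites)))
```

A query that joins the two predicates on the matching correlation can read the smaller reduced table instead of the full VP table. When no rows are filtered out, the reduction is 0% and the reduced table is as large as the VP table.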
[Figure 4 plots: y-axis "Ratio of ExtVP over VP" (0 to 1.4); x-axis storage formats (avro, csv, orc, Parquet); series Baseline, Horizontal, Subject, Predicate, and Average; panels (a) Q4 and (b) Q9]

Figure 4: The performance of ExtVP over VP schema in Q4 and Q9 (values below 1 indicate that ExtVP is better than VP)

   In more detail, ExtVP optimizes specific queries according to the correlations between
triple patterns in those queries, namely Subject-to-Subject (SS), Object-to-Subject (OS),
and Subject-to-Object (SO) [22]. Thus, we expect some queries to give results similar to
the VP schema queries (i.e., no reductions occurred in the VP tables by the ExtVP schema
optimization). Notably, in our experiments, 𝑄9, 𝑄10, and 𝑄11 do not present any input data
reductions. Thus, we expect their performance to be very close to the baseline VP
performance.
   The same approach adopted for the WPT to PT performance comparison is also used for
evaluating the performance of ExtVP against VP.
   First, we check whether our experiments’ results confirm the state-of-the-art regarding
the ExtVP schema optimization over the baseline VP schema performance.
   Table 7 shows the total number of queries in which ExtVP performs better than the VP
schema across all the benchmark queries. For our baseline HDFS partitioning technique with
the Parquet file format, we can see that some queries do not benefit from the ExtVP
optimizations. Indeed, 3 out of 10 queries fail to utilize the optimized ExtVP technique.
The reason is that those queries have unbounded predicates that cannot be optimized by the
ExtVP schema [22] (see 𝑄9 and 𝑄10 in Table 2), or they have no effective join reductions
(see 𝑄9, 𝑄10, 𝑄11 in Table 6). The performance of these queries is discussed in detail in
the next sections.

ExtVP vs. VP       Avro    CSV     ORC     Parquet
Baseline_Part      6/10    6/10    5/10    7/10
Horizontal_Part    3/10    3/10    3/10    3/10
Predicate_Part     2/10    3/10    6/10    6/10
Subject_Part       2/10    3/10    3/10    3/10
Table 7: Comparison of ExtVP schema with the VP schema in different storage formats, and in
different partitioning techniques.

   Second, similarly to what we have done for the WPT schema optimization, we now
investigate how generalizable the state-of-the-art results are when we introduce different
file formats and partitioning techniques over the data for both the ExtVP and VP schemas.
   Similarly, Table 8 shows how far the data partitioning (left of the table) and data
formats (right of the table) impact the results of ExtVP in comparison to VP schema
performance. Notably, this table’s percentage values are calculated in the same way as for
WPT against PT. We pivoted on the analysis dimension of choice, i.e., file format 𝑋 or
partitioning technique 𝑌, and we calculated how many times SparkSQL performs better using
ExtVP than using the baseline VP approach.

ExtVP/VP    Partitioning effect             Storage effect
            Baseline_Part      67.5%        Parquet    55%
            Horizontal         35%          ORC        45%
            Predicate-based    55%          AVRO       42.5%
            Subject-based      30%          CSV        42.5%
Table 8: The effect of other partitioning techniques, and other storage formats on the
reproducibility of the ExtVP S.O.T.A findings

   Regarding the partitioning techniques’ effect on ExtVP, our expectations are confirmed.
In particular, we can observe that the partitioning techniques degraded the performance of
ExtVP significantly. Only 35% and 30% of the experiments adopting Horizontal and
Subject-based partitioning, respectively, show a performance improvement in using ExtVP
over VP. Adopting Predicate-based partitioning slightly reduces this negative effect (i.e.,
55% of the queries show that performance improvement).
   From Table 8, we can also see that the ExtVP schema outperforms the VP schema in only
67.5% of the queries using the baseline HDFS partitioning scenario. Thus, we can see the
trade-off of considering various storage file formats. We can also see that the baseline
Parquet file format is the one that has the least impact on the overall performance of
ExtVP. Indeed, in 55% of the cases where Parquet is used, ExtVP outperforms VP.
Additionally, the ORC columnar file format yields the next-best performance of ExtVP over
the VP schema, with an overall 45%. However, there is a clear difference of 10 percentage
points from the Parquet file format.
   On the other hand, the row-oriented formats degrade the performance of ExtVP. In only
42.5% of the experiments that adopt either Avro or CSV does ExtVP beat the VP schema. Such
behavior is related to the number of column projections in the SP2Bench queries, which is
minimal in this benchmark scenario. Thus, columnar file formats fit such query workloads
better than the row-oriented ones.
   Last but not least, we introduce the most notable query examples, which confirm our
previous findings in more detail. First, 𝑄4 is the query that benefits most from the ExtVP
optimization. The reason is that 𝑄4 includes a high number of joins (i.e., 7 joins) and has
the maximum reduction in input table rows under the ExtVP schema optimization, with 96% of
the processed rows removed (cf. Table 6). This query is directly followed by 𝑄2 with 77%.
Although 𝑄2 has a higher number of table joins
[Figure 5 plots: y-axis "Time (seconds)" (0 to 200); x-axis queries Q1, Q2, Q3, Q4, Q5, Q6, Q8, Q10, Q11; series WPT and PT; panels (a) CSV - Horizontal Partitioning, (b) CSV - Subject-based Partitioning, (c) Avro - Horizontal Partitioning, (d) Avro - Subject-based Partitioning, (e) ORC - Horizontal Partitioning, (f) ORC - Subject-based Partitioning, (g) Parquet - Horizontal Partitioning, (h) Parquet - Subject-based Partitioning]

Figure 5: WPT Vs. PT schemata performance using different partitioning techniques and file formats
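The percentages in Tables 5 and 8 come from pivoting on one dimension (a partitioning technique or a file format) and counting how often the optimized schema beats its baseline across everything else. The sketch below shows one way to compute such a win percentage. The result layout, the function name, and all runtimes are hypothetical, and a "win" is assumed to mean a strictly lower runtime.

```python
from typing import Dict, Tuple

# (partitioning, file_format, query) -> (optimized_seconds, baseline_seconds)
Results = Dict[Tuple[str, str, str], Tuple[float, float]]


def win_percentage(results: Results, dimension: str, value: str) -> float:
    """Percentage of experiments with `dimension == value` where the optimized schema is faster."""
    index = {"partitioning": 0, "format": 1}[dimension]
    selected = [times for key, times in results.items() if key[index] == value]
    if not selected:
        raise ValueError(f"no experiments for {dimension}={value}")
    wins = sum(1 for optimized, baseline in selected if optimized < baseline)
    return 100.0 * wins / len(selected)


# Hypothetical runtimes: 2 partitioning techniques x 2 formats x 2 queries.
results: Results = {
    ("baseline", "parquet", "Q2"): (5.0, 20.0),
    ("baseline", "parquet", "Q8"): (30.0, 20.0),
    ("baseline", "csv", "Q2"): (25.0, 20.0),
    ("baseline", "csv", "Q8"): (40.0, 20.0),
    ("horizontal", "parquet", "Q2"): (10.0, 20.0),
    ("horizontal", "parquet", "Q8"): (15.0, 20.0),
    ("horizontal", "csv", "Q2"): (18.0, 20.0),
    ("horizontal", "csv", "Q8"): (22.0, 20.0),
}

print(win_percentage(results, "partitioning", "baseline"))  # pivot on a partitioning technique
print(win_percentage(results, "format", "parquet"))         # pivot on a file format
```

Fixing a partitioning technique aggregates over all file formats and queries, while fixing a file format aggregates over all partitioning techniques and queries.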


than 𝑄4, the reductions in input table sizes in 𝑄4 are more significant. On the other side,
𝑄9, 𝑄10, and 𝑄11 do not benefit from the ExtVP optimization, i.e., ExtVP does not provide
any input table size reductions. In particular, 𝑄9 and 𝑄10 have unbounded predicate
variables in the original SPARQL queries, and ExtVP cannot directly handle this type of
query [22]. 𝑄11 has only a single triple pattern, and thus has no joins for the ExtVP
approach to optimize. Figures 4 (a) and (b) show the performance of SparkSQL for 𝑄4 and 𝑄9,
respectively, under various combinations of formats and partitioning techniques in the
ExtVP experiments. Figure 4 (a) shows that 𝑄4 is always below the line of all the other
queries’ average runtimes. In contrast, ExtVP does not show a remarkable difference over
the VP schema in 𝑄9, i.e., they show pretty close performance to each other.
   In the next section, we discuss in further detail the experiment findings against the
current S.O.T.A regarding the superiority of ExtVP and WPT.

6   DISCUSSION
The paper helps to characterize and classify the RDF schemas and their optimizations within
the SparkSQL realm. It helps data architects and practitioners interested in large-scale
RDF better understand the relational RDF schemas’ potential under different partitioning
techniques and storage formats. This understanding will lead to a better selection of the
most suitable and performance-optimized solution for their case. It will also support
better design and development of new SPARQL systems, leading to reliable RDF services with
high Spark performance. Taking our experiment findings into consideration, herein, we
discuss our results and give some insights on best practices for processing RDF at a large
scale.
   Next, we place the literature assumptions on the relational schema optimizations’
superiority against our experimental findings. We follow this with recommendations for
large RDF practitioners.

6.1   Assumption: WPT always outperforms PT
According to [6, 21], we expect the WPT schema to outperform the PT schema, especially with
"star-shaped" queries. Star-shaped queries can be answered from the WPT table with no
joins, because all the properties relevant to the same subject are present in the same row
of the WPT table.
   The state-of-the-art findings of the WPT schema are fully reproduced with the default
HDFS partitioning and the baseline Parquet file format. That is, Spark using the WPT schema
for representing the RDF dataset always outperforms the baseline PT schema.
   Nevertheless, our results show that when we deviate from the original setup [21] by
introducing new experimental factors, the solution space increases in complexity.
Consequently, the trade-offs between relational schema, partitioning techniques, and
storage formats make the WPT optimization reproducibility not straightforward. Using other
partitioning techniques alongside the baseline Parquet format affected the reproducibility
of the WPT schema
[Figure 6 panels. Each panel plots Time (seconds), on a 0-500 axis, for queries Q1-Q6 and Q8-Q11,
 comparing the ExtVP and VP schemas. Columns: HO (horizontal), Subj (subject-based), Pred (predicate-based).
   (a) Parquet-Horizontal Partitioning   (b) Parquet-Subject-based Partitioning   (c) Parquet-Predicate-based Partitioning
   (d) ORC-Horizontal Partitioning       (e) ORC-Subject-based Partitioning       (f) ORC-Predicate-based Partitioning
   (g) Avro-Horizontal Partitioning      (h) Avro-Subject-based Partitioning      (i) Avro-Predicate-based Partitioning
   (j) CSV-Horizontal Partitioning       (k) CSV-Subject-based Partitioning       (l) CSV-Predicate-based Partitioning]

                           Figure 6: ExtVP Vs. VP schemata performance using different partitioning techniques and file formats
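
To make the ExtVP versus VP comparison in Figure 6 more concrete, the following is a minimal sketch of how an ExtVP table can be derived from VP tables by dropping dangling triples, i.e., rows that cannot contribute to a join. It is not the implementation used in the experiments or in [22]: the toy predicates, the in-memory list-of-pairs representation, and the helper names `ext_vp_os`, `reduction_ratio`, and `join_os` are all assumptions made for illustration. A real system would express the same idea as semi-joins over Spark DataFrames.

```python
from typing import Dict, List, Tuple

# A VP table holds the (subject, object) pairs of a single predicate.
VPTable = List[Tuple[str, str]]

# Toy VP tables (hypothetical data, for illustration only).
vp: Dict[str, VPTable] = {
    "creator": [("paper1", "alice"), ("paper2", "bob"), ("paper3", "carol")],
    "name": [("alice", "Alice A."), ("bob", "Bob B."), ("dave", "Dave D.")],
}


def ext_vp_os(left: VPTable, right: VPTable) -> VPTable:
    """Keep rows of `left` whose object appears as a subject in `right`.

    This is a semi-join reduction for an object-subject join: the rows
    that are dropped are dangling, since they can never match in the join.
    """
    right_subjects = {s for s, _ in right}
    return [(s, o) for s, o in left if o in right_subjects]


def reduction_ratio(reduced: VPTable, original: VPTable) -> float:
    """Size of the reduced table relative to the original VP table."""
    return len(reduced) / len(original) if original else 1.0


def join_os(left: VPTable, right: VPTable) -> List[Tuple[str, str, str]]:
    """Object-subject join: left.object == right.subject."""
    by_subject: Dict[str, List[str]] = {}
    for s, o in right:
        by_subject.setdefault(s, []).append(o)
    return [(s, o, n) for s, o in left for n in by_subject.get(o, [])]


# Reduced table for the pattern: ?p creator ?a . ?a name ?n
ext_creator_name = ext_vp_os(vp["creator"], vp["name"])
ratio = reduction_ratio(ext_creator_name, vp["creator"])

# The join over the reduced table matches the join over the full VP table.
full_join = join_os(vp["creator"], vp["name"])
reduced_join = join_os(ext_creator_name, vp["name"])
```

In this sketch the join result is unchanged while the input table shrinks. When the ratio is 1.0, no rows are removed, so the reduced table offers no input-size advantage over the plain VP table.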


optimizations. Only 78% of the query results conform with the finding that WPT is better than the
PT schema (Table 5).
    Figure 5 analyzes the schemas' performance when the solution adopts different partitioning
techniques and file formats. Figures 5 (a-h) clearly show the effect of partitioning techniques
on the reproducibility of the WPT optimizations across all the different file formats. For
instance, horizontal partitioning (Figures 5 (a,c,e,g)) notably affected the performance of WPT,
making it perform worse in SparkSQL than the baseline PT schema in most of the queries
(i.e., 𝑄1, 𝑄3, 𝑄5, 𝑄6, 𝑄8, 𝑄11). Similarly, we can observe the negative effect of the
subject-based technique on the WPT schema (Figures 5 (b,d,f,h)) in the same queries.
    The impact of file formats other than Parquet is even worse. Even with the baseline (HDFS)
partitioning technique, these formats affect the reproducibility of the WPT schema optimizations.
Overall, only 58% of the query results conform with the finding that WPT outperforms the PT
schema (Table 5). The experiments show that columnar file formats, e.g., ORC and Parquet, are
the best for representing such wide tables (WPT and PT). Columnar file formats are the best for
sparse queries (i.e., queries with few column projections or columns to access) over the wide
tables. They perform better than the row-oriented file formats, e.g., CSV and Avro, which would
only be better for queries that require reading full rows.
    Figure 5 shows the performance degradation across different file formats. For instance,
moving from Parquet and ORC in Figures 5 (e-h) to row-oriented file formats such as Avro and
CSV in Figures 5 (a-d), we can notice the performance degradation of the queries with the WPT
schema optimizations.

6.2    Assumption: ExtVP always outperforms VP
According to [22], we expect that ExtVP provides better or at least similar performance, as the
queries are similar, and the number of SQL joins in the VP schema is equal to the number of ExtVP
joins. Nevertheless, one should keep in mind that ExtVP improvements are mainly due to the
nature of the original SPARQL query. It also
Table 9: Mapping the partitioning technique to the storage format best practices in WPT

                          Avro        CSV        ORC         Parquet
  Baseline-HDFS           X           X          ✓*          ✓**
  Horizontal              X           X          ✓           ✓
  Subject-based           X           X          ✓           ✓
  Where ✓ is good practice, X is bad practice, and ˜ has the same performance compared to PT.
  * WPT had very competitive performance
  ** WPT had the best performance

Table 10: Mapping the partitioning technique to the storage format best practices in ExtVP

                          Avro        CSV        ORC         Parquet
  Baseline-HDFS           ✓           ✓          ˜           ✓*
  Horizontal              X           X          X           X
  Subject-based           X           X          X           X
  Predicate-based         X           X          ✓           ✓
  Where ✓ is good practice, X is bad practice, and ˜ has the same performance compared to VP.
  * ExtVP had a very competitive performance
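
As an illustration of how Tables 9 and 10 might be consulted programmatically, the sketch below transcribes their cells into a small lookup structure. The names `BEST_PRACTICES`, `rate`, and `recommended` are hypothetical and not part of the paper, and the footnote markers (* and **) are deliberately omitted, so this encoding is coarser than the tables themselves.

```python
from typing import Dict, List, Tuple

GOOD, BAD, SAME = "good", "bad", "same"
FORMATS = ("Avro", "CSV", "ORC", "Parquet")


def _row(*ratings: str) -> Dict[str, str]:
    """Map one table row onto the storage formats, in column order."""
    return dict(zip(FORMATS, ratings))


# Cells transcribed from Table 9 (WPT vs. PT) and Table 10 (ExtVP vs. VP).
BEST_PRACTICES: Dict[str, Dict[str, Dict[str, str]]] = {
    "WPT": {
        "Baseline-HDFS": _row(BAD, BAD, GOOD, GOOD),
        "Horizontal": _row(BAD, BAD, GOOD, GOOD),
        "Subject-based": _row(BAD, BAD, GOOD, GOOD),
    },
    "ExtVP": {
        "Baseline-HDFS": _row(GOOD, GOOD, SAME, GOOD),
        "Horizontal": _row(BAD, BAD, BAD, BAD),
        "Subject-based": _row(BAD, BAD, BAD, BAD),
        "Predicate-based": _row(BAD, BAD, GOOD, GOOD),
    },
}


def rate(schema: str, partitioning: str, file_format: str) -> str:
    """Return the rating of one (partitioning, format) cell for a schema."""
    return BEST_PRACTICES[schema][partitioning][file_format]


def recommended(schema: str) -> List[Tuple[str, str]]:
    """List every (partitioning, format) pair rated as good practice."""
    return [
        (partitioning, file_format)
        for partitioning, row in BEST_PRACTICES[schema].items()
        for file_format, rating in row.items()
        if rating == GOOD
    ]


wpt_choices = recommended("WPT")
extvp_choices = recommended("ExtVP")
```

For example, `rate("ExtVP", "Baseline-HDFS", "ORC")` returns `"same"`, matching the ˜ cell in Table 10.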


depends on the possible reductions in the table input data size and on excluding the dangling
triples (rows that do not contribute to any joins) [22]. Typically, ExtVP queries are similar to
the VP ones; the only difference lies in the queried tables/DataFrames (i.e., their size is
either reduced by ExtVP or the same as in VP). Thus, the relational engine's performance, e.g.,
Spark with ExtVP, should be equivalent to or better than its performance with the VP schema.
    Based on our experiments, the findings of the ExtVP schema are not fully reproduced, even
considering the default HDFS partitioning and the baseline Parquet file format. Some queries do
not benefit from the ExtVP optimizations (𝑄9, 𝑄10, 𝑄11; no-table input size reductions occurred
in those queries), cf. Table 6. Beyond those queries, we can confirm the state-of-the-art
results (ExtVP performs better than VP in most cases). However, our results show that
schema-based query optimization is not straightforward in such a complex solution space.
    Regarding the partitioning techniques, using an alternative to the baseline technique (HDFS)
affects the reproducibility of the ExtVP optimizations even if the storage format is Parquet.
Only 55% of the query results show that ExtVP is superior to the VP schema (cf. Table 8).
Moreover, Figure 6 shows the effect of other RDF partitioning techniques on the reproducibility
of the ExtVP optimization. For instance, deviating from the baseline partitioning technique to
other RDF-based techniques with the same baseline Parquet format (Figures 6(a-c)) degrades the
results of ExtVP and makes it perform worse than the baseline VP schema in several queries
(𝑄1, 𝑄4, 𝑄5, 𝑄6, 𝑄8) with the horizontal and subject-based partitioning. Predicate-based
partitioning (Figure 6(c)) fares better with this schema: ExtVP's performance is close to VP's
in the previously mentioned queries.
    Similarly, using storage formats different from Parquet affects the reproducibility of the
ExtVP optimizations, even with the baseline (HDFS) partitioning technique. Indeed, ExtVP
outperforms VP in only 67.5% of the query results (cf. Table 8). Figure 6 also shows the effect
of file formats other than the baseline Parquet, i.e., Figures 6 (d-l) for ORC, Avro, and CSV,
respectively. We can notice the queries' performance degradation with the ExtVP schema
optimizations when moving vertically to these other formats.
    Finally, from our experiments, we observe that columnar file formats are better than the
row-oriented ones. However, the performance difference is not significant with such similar
schemas. The table structure is the same two-column table, Predicate (Subject-Object), in both
vertical schemas. Moreover, neither schema has wide tables in comparison to the WPT and PT
schemas. That is, these schemas will not benefit much from the columnar file formats. The
performance gain of columnar over row-oriented file formats arises because SP2Bench queries have
few column projections. Thus, they work better with columnar rather than row-based file formats.

6.3      Recommendations
Overall, Tables 9 and 10 provide an abstracted map of good and bad storage formats and
partitioning techniques.
    The results in Figure 5 and Table 9 show that partitioning the WPT table has, in the
majority of cases, a negative effect on the WPT optimization, making it perform even worse than
its baseline approach, i.e., the PT schema. The effect of the storage formats is more
significant in the WPT optimization (cf. Tables 5, 9). Therefore, the storage format selection
for the WPT schema should be dealt with as a first-class citizen in such experiments.
    The horizontal and subject-based partitioning techniques are not recommended with the ExtVP
optimization. However, predicate-based partitioning still gives better results than those two
other RDF partitioning techniques (cf. Tables 8 and 10). Also, columnar file formats are still
recommended with the ExtVP schema optimization. However, the effect of the partitioning is more
significant for this optimization (cf. Figure 6, Tables 8 and 10). Thus, the partitioning
selection for the ExtVP schema should be carefully considered in these experiments.
    Also, our analysis yields the following recommendations:
    (1) With WPT, it is recommended to use the columnar storage formats rather than
        row-oriented ones (cf. Table 9).
    (2) With the WPT schema, Parquet is the best columnar file format to select, as it
        efficiently handles the table's sparsity.
    (3) With WPT, it is recommended to use the native HDFS partitioning, rather than selecting
        an RDF-oriented partitioning technique.
    (4) With ExtVP, the baseline HDFS partitioning is recommended over RDF-specific ones.
        However, larger datasets would require partitioning anyway.
    (5) With ExtVP, columnar file formats are a recommended optimization.

7     RELATED WORK
In this section, we present the related work. In particular, we focus on comparative studies
that investigate the use of BD frameworks for distributed RDF processing. To the best of our
knowledge, the literature includes several studies that compare partitioning techniques,
relational schemas, and storage formats [2, 6, 8, 14]. However, none of these approaches focuses
on replicating and comparing existing optimization techniques.
    Abdelaziz et al. [2] discussed several relational schemas for materializing RDF datasets.
Their main goal was to assess different native and non-native RDF processing systems. However,
their work does not discuss the impact of different relational schemas on a
specific system's performance, such as SparkSQL, nor does it discuss partitioning techniques and
data formats.
    Arrascue et al. [6] led an investigation on the performance of the WPT schema against
alternative relational schemas, i.e., triple tables, VP, and domain-dependent tables.
Additionally, they consider subject-based partitioning but limit the data formats to Parquet.
The work's main finding is the flexibility of WPT for generic query shapes in contrast with
other approaches, even when considering partitioning. However, their exploration of the solution
space is limited in terms of partitioning techniques and data formats.
    Cossu et al. [8] focused on a hybrid storage approach that combines the benefits of PT and VP
schemas to boost the query performance without the need for extensive loading time. Their
solution, PROST, was able to outperform state-of-the-art systems like S2RDF for several query
shapes. Nevertheless, their exploration of partitioning techniques and data formats is limited.
Additionally, they focused their work on PT and VP schemas, not considering WPT as an
alternative schema that may further improve the performance.
    On the other hand, the results of Pham et al. [14] indicate that more than 95% of RDF dataset
triples have a tabular structure. They combine structural non-quotient and statistical methods
to automatically discover and detect an emergent relational schema (in the form of property
tables) in RDF datasets. A similar approach has been proposed in [12] to mitigate the
limitations of the WPT and PT RDF schemata by merging the related hierarchical characteristic
sets and providing a novel RDF relational schema. The aim is to provide a better SPARQL query
evaluation.
    Finally, Akhter et al. [4] investigated the performance of different partitioning techniques
for RDF data, proposing a ranking function that helps practitioners choose the most appropriate
technique.

8    CONCLUSIONS & FUTURE WORK
The reproducibility of well-known relational RDF processing optimizations is critical to foster
best practices that guide the practitioners' efforts. In this paper, we presented a
comprehensive empirical evaluation using three RDF partitioning techniques and four storage
formats over the distributed SparkSQL engine to cope with this limitation. Our analysis
demonstrates markedly varying trade-offs when using different relational schemas, data
partitioning, and storage file formats against these state-of-the-art optimizations. Our
experiments show significant degradation in Spark performance when partitioning the WPT schema
by subject or horizontally, due to the vast, sparse, and large partitions of its table. On the
same note, the storage format also affects the WPT performance, where ORC and Parquet are the
most suitable representations for such a configuration. Our results on ExtVP illustrate that
schema-based query optimization is not straightforward under different configurations.
    Future work includes extending this study by analyzing the impact of data scalability on
SPARQL performance. We intend to utilize other RDF benchmarks such as WatDiv with different
types of query shapes and complexities. Our plans include investigating this area further to
design a benchmark that combines query workloads with precise partitioning and storage
instructions.

REFERENCES
 [1] Daniel J Abadi, Adam Marcus, Samuel R Madden, and Kate Hollenbach. 2007. Scalable semantic
     web data management using vertical partitioning. In VLDB.
 [2] Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, and Panos Kalnis. 2017. A survey and
     experimental comparison of distributed SPARQL engines for very large RDF data. Proceedings
     of the VLDB Endowment 10, 13 (2017), 2049–2060.
 [3] Giannis Agathangelos, Georgia Troullinou, Haridimos Kondylakis, Kostas Stefanidis, and
     Dimitris Plexousakis. 2018. RDF Query Answering Using Apache Spark: Review and Assessment.
     In 34th IEEE International Conference on Data Engineering Workshops, ICDE Workshops 2018,
     Paris, France, April 16-20, 2018. IEEE Computer Society, 54–59.
 [4] Adnan Akhter, Axel-Cyrille Ngomo Ngonga, and Muhammad Saleem. 2018. An empirical evaluation
     of RDF graph partitioning techniques. In European Knowledge Acquisition Workshop. Springer,
     3–18.
 [5] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley,
     Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015.
     Spark SQL: Relational Data Processing in Spark. In SIGMOD Conference. ACM, 1383–1394.
 [6] Victor Anthony Arrascue Ayala, Polina Koleva, Anas Alzogbi, Matteo Cossu, Michael Färber,
     Patrick Philipp, Guilherme Schievelbein, Io Taxidou, and Georg Lausen. 2019. Relational
     schemata for distributed SPARQL query processing. In Proceedings of the International
     Workshop on Semantic Big Data. 1–6.
 [7] Feras M Awaysheh, Mamoun Alazab, Maanak Gupta, Tomás F Pena, and José C Cabaleiro. 2020.
     Next-generation big data federation access control: A reference model. Future Generation
     Computer Systems (2020).
 [8] Matteo Cossu, Michael Färber, and Georg Lausen. 2018. PRoST: Distributed Execution of
     SPARQL Queries Using Mixed Partitioning Strategies. In Proceedings of the 21st
     International Conference on Extending Database Technology, EDBT 2018, Vienna, Austria,
     March 26-29, 2018, Michael H. Böhlen, Reinhard Pichler, Norman May, Erhard Rahm, Shan-Hung
     Wu, and Katja Hose (Eds.). OpenProceedings.org, 469–472.
     https://doi.org/10.5441/002/edbt.2018.49
 [9] J. Huang, D. Abadi, and K. Ren. 2011. Scalable SPARQL querying of large RDF graphs.
     Proceedings of the VLDB Endowment 4 (2011), 1123–1134.
[10] Todor Ivanov and Matteo Pergolesi. 2019. The impact of columnar file formats on
     SQL-on-hadoop engine performance: A study on ORC and Parquet. Concurrency and Computation:
     Practice and Experience (2019), e5523.
[11] Todor Ivanov and Matteo Pergolesi. 2020. The impact of columnar file formats on
     SQL-on-hadoop engine performance: A study on ORC and Parquet. Concurr. Comput. Pract. Exp.
     32, 5 (2020). https://doi.org/10.1002/cpe.5523
[12] Marios Meimaris, George Papastefanatos, and Panos Vassiliadis. 2020. Hierarchical Property
     Set Merging for SPARQL Query Optimization. In DOLAP. 36–45.
[13] Thomas Neumann and Gerhard Weikum. 2010. The RDF-3X engine for scalable management of RDF
     data. The VLDB Journal 19, 1 (2010), 91–113.
[14] Minh-Duc Pham, Linnea Passing, Orri Erling, and Peter A. Boncz. 2015. Deriving an Emergent
     Relational Schema from RDF Data. In Proceedings of the 24th International Conference on
     World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, Aldo Gangemi, Stefano Leonardi,
     and Alessandro Panconesi (Eds.). ACM, 864–874. https://doi.org/10.1145/2736277.2741121
[15] Mohamed Ragab, Riccardo Tommasini, Sadiq Eyvazov, and Sherif Sakr. 2020. Towards making
     sense of Spark-SQL performance for processing vast distributed RDF datasets. In Proceedings
     of The International Workshop on Semantic Big Data. 1–6.
[16] Mohamed Ragab, Riccardo Tommasini, and Sherif Sakr. 2019. Benchmarking Spark-SQL under
     Alliterative RDF Relational Storage Backends. In QuWeDa@ISWC.
[17] Sherif Sakr. 2009. GraphREL: A Decomposition-Based and Selectivity-Aware Relational
     Framework for Processing Sub-graph Queries. In DASFAA.
[18] Sherif Sakr and Ghazi Al-Naymat. 2010. Relational processing of RDF queries: a survey. ACM
     SIGMOD Record 38, 4 (2010), 23–28.
[19] Sherif Sakr, Angela Bonifati, Hannes Voigt, Alexandru Iosup, Khaled Ammar, Renzo Angles,
     Walid Aref, Marcelo Arenas, Maciej Besta, Peter A Boncz, et al. 2020. The Future is Big
     Graphs! A Community View on Graph Processing Systems. arXiv preprint arXiv:2012.06171
     (2020).
[20] Muhammad Saleem, Gábor Szárnyas, Felix Conrads, Syed Ahmad Chan Bukhari, Qaiser Mehmood,
     and Axel-Cyrille Ngonga Ngomo. 2019. How Representative Is a SPARQL Benchmark? An Analysis
     of RDF Triplestore Benchmarks?. In The World Wide Web Conference. ACM, 1623–1633.
[21] Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, and Georg Lausen. 2014.
     Sempala: Interactive SPARQL query processing on hadoop. In International Semantic Web
     Conference. Springer, 164–179.
[22] Alexander Schätzle, Martin Przyjaciel-Zablocki, Simon Skilevic, and Georg Lausen. 2016.
     S2RDF: RDF querying with SPARQL on spark. Proceedings of the VLDB Endowment 9, 10 (2016),
     804–815.
[23] Michael Schmidt, Thomas Hornung, Norbert Küchlin, Georg Lausen, and Christoph Pinkel. 2008.
     An Experimental Comparison of RDF Data Management Approaches in a SPARQL Benchmark
     Scenario. In International Semantic Web Conference (Lecture Notes in Computer Science),
     Vol. 5318. Springer, 82–97.
[24] Michael Schmidt, Thomas Hornung, Georg Lausen, and Christoph Pinkel. 2009. SP^2Bench: A
     SPARQL Performance Benchmark. In Proceedings of the 25th International Conference on Data
     Engineering, ICDE 2009, March 29 2009 - April 2 2009, Shanghai, China. 222–233.
     https://doi.org/10.1109/ICDE.2009.28
[25] Matei Zaharia, Reynold S. Xin, and Patrick Wendell et al. 2016. Apache Spark: a unified
     engine for big data processing. Commun. ACM 59, 11 (2016), 56–65.
[26] Lei Zou, Jinghui Mo, Lei Chen, M Tamer Özsu, and Dongyan Zhao. 2011. gStore: answering
     SPARQL queries via subgraph matching. Proceedings of the VLDB Endowment 4, 8 (2011),
     482–493.