<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of RDF Archiving strategies with Spark</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Meriem Laajimi</string-name>
          <email>laajimimeriem@yahoo.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Afef Bahri</string-name>
          <email>afef.bahri@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nadia Yacoubi Ayadi</string-name>
          <email>nadia.yacoubi.ayadi@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>High Institute of Management Tunis</institution>
          ,
          <country country="TN">Tunisia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MIRACL Laboratory, University of Sfax Tunisia</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>RIADI Research Laboratory, ENSI, University of Manouba</institution>
          ,
          <addr-line>2010</addr-line>
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Over the last decade, the RDF data published on the Web has been continuously evolving, leading to a large number of RDF datasets in the Linked Open Data (LOD) cloud. There is an emergent need for efficient RDF data archiving systems. Indeed, applications need to access not only the current version of a dataset but also previous ones in order to query and track data over time. Querying RDF dataset archives raises performance and scalability issues. Existing RDF archiving systems and benchmarks are built on top of existing RDF query processing engines. Nevertheless, efficiently processing a time-traversing query over Big RDF data archives is more challenging than processing the same query over a single RDF datastore. In this paper, we propose to use a distributed system, namely Apache Spark, in order to evaluate RDF archiving strategies. We propose and compare different query processing approaches with a detailed experimentation.</p>
      </abstract>
      <kwd-group>
        <kwd>RDF archives</kwd>
        <kwd>SPARK SQL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
The Linked Data paradigm promotes the use of the RDF model to publish
structured data on the Web. As a result, several datasets have emerged, incorporating
a huge number of RDF triples. The Linked Open Data cloud [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], as published
on 22 August 2017, illustrates the large number of published datasets and
their possible interconnections (1,184 datasets with 15,993 links). LODStats, a
project constantly monitoring statistics, reports 2,973 RDF datasets that
incorporate approximately 149 billion triples. As a consequence, an emerging interest
in what we call archiving of RDF datasets [
        <xref ref-type="bibr" rid="ref13 ref5 ref8">13, 5, 8</xref>
        ] has arisen, raising several
challenges that need to be addressed. Moreover, the emergent need for efficient
web data archiving has recently led to benchmarking RDF archiving
systems such as BEAR (BEnchmark of RDF ARchives) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and EvoGen [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The
authors of the BEAR system propose a theoretical formalization of an RDF
archive and conceive a benchmark focusing on a set of general and abstract
queries with respect to the different categories of queries defined below. More
recently, the EU H2020 HOBBIT project (https://project-hobbit.eu/) is addressing the problem of
benchmarking Big Linked Data. A new benchmark, SPBv, was developed with some
preliminary experimental results [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Similar to EvoGen, SPBv proposes a
configurable and adaptive data and query load generator.
      </p>
      <p>
        Obviously, the fast increasing size of RDF datasets raises the need to treat the
problem of RDF archiving as a Big Data problem. Many efforts have been made
to process RDF linked data with existing Big Data processing infrastructures like
Hadoop or Spark [
        <xref ref-type="bibr" rid="ref12 ref9">9, 12</xref>
        ]. Nevertheless, no work has addressed managing
RDF archives on top of a cluster computing engine. The problem is more
challenging here, as Big Data processing frameworks are designed neither for RDF processing
nor for evolution management. Many versioning strategies have been proposed
in the literature: (a) Independent Copies (IC), (b) Change-Based copies (CB) or
Deltas, and (c) Timestamp-Based approaches (TB) [
        <xref ref-type="bibr" rid="ref10 ref13 ref5">13, 5, 10</xref>
        ]. The first one is a
naive approach, since it manages each version of a dataset as an isolated one.
Obviously, a scalability problem is expected due to the large amount of duplicated data
across dataset versions. The delta-based approach aims to resolve (partially) the
scalability problem by computing and storing the differences between versions.
While the use of deltas reduces storage space, the computation of a full version
on-the-fly may cause overhead at query time. Using a Big Data processing
framework gives an advantage to the Independent Copies/Timestamp-based approaches, as
the CB approach may induce the computation of one or more versions on the fly.
Given the fact that we use the IC approach and that all the versions are stored,
querying evolving RDF datasets represents the most important challenge
in running an RDF archiving system on top of a Big Data processing
framework. Several types of RDF archive queries have been proposed: version
materialization, delta materialization, single-version and cross-version query types.
Which partitioning strategy should be adopted for treating these queries? We
note that in the case of version materialization, as the entire version needs to be loaded,
the performance of the query processing does not depend on the chosen
partitioning strategy. This is not the case for single/cross-version structured queries, where the
use of partitioning may improve query performance [
        <xref ref-type="bibr" rid="ref1 ref2 ref9">2, 9, 1</xref>
        ].
      </p>
      <p>In this paper, we use the in-memory cluster computing framework SPARK for
managing and querying RDF data archives. The paper is organized as follows.
Section 2 presents existing approaches for the design and evaluation of RDF
archiving and versioning systems. Section 3 presents our approach for managing
and querying RDF dataset archives with SPARK. A mapping of SPARQL into
SPARK SQL and a discussion of the cost of versioning RDF queries are
presented in Section 4. Finally, an evaluation of RDF versioning queries is presented
in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <sec id="sec-2-1">
        <title>Related Works</title>
        <p>Over the last decade, published data has been continuously growing, leading to the
explosion of data on the Web and the associated Linked Open Data (LOD)
in various domains. This evolution naturally happens without a pre-defined policy,
hence the need to track data changes and thus the requirement for applications to build their
own infrastructures in order to preserve and query data over time. We note
that these RDF datasets are automatically populated by extracting information
from different resources (Web pages, databases, text documents), leading to an
unprecedented volume of RDF triples. Indeed, published data is continuously
evolving and it is interesting to manage not only the current version of a
dataset but also previous ones.</p>
        <p>
          Three versioning approaches are proposed in RDF archiving systems and cited
in the literature as follows: (a) Independent Copies (IC), (b) Change-Based copies
(CB) or Deltas, and (c) Timestamp-Based approaches (TB) [
          <xref ref-type="bibr" rid="ref10 ref5">10, 5</xref>
          ]. We talk about
hybrid approaches when the above techniques are combined [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The IC
approach manages each version of a dataset as an isolated one, while the CB
approach stores only the changes between versions, also
known as deltas. The advantage of using the IC or CB approach depends on
the ratio of changes occurring between consecutive versions. If only few changes
occur, the CB approach reduces space overhead compared to the IC one.
Nevertheless, if frequent changes are made between consecutive versions, the IC approach
becomes more storage-efficient than CB. Equally, the computation of a full version
on-the-fly with the CB approach may cause overhead at query time. To resolve this
issue, the authors in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] propose hybrid archiving policies to take advantage of both
the IC and CB approaches. In fact, a cost model is conceived to determine what
to materialize at a given time: a version or a delta.
        </p>
        <p>
          Archiving systems not only need to store and provide access to different
versions, but should also provide query processing functionalities [
          <xref ref-type="bibr" rid="ref10 ref13 ref5">13,
5, 10</xref>
          ]. Four query types are mainly discussed in the literature. Version
materialization is a basic query where a full version is retrieved. Delta
materialization is a type of query performed on two versions to detect changes
occurring at a given moment. Single-version and cross-version queries correspond
to SPARQL queries performed respectively on a single version or on different
versions. The authors in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] propose a taxonomy containing eight queries classified
according to their type (materialization, single version, cross version) and focus
(version or delta).
        </p>
        <p>
          Moreover, the emergent need for efficient web data archiving has recently led to
benchmarking RDF archiving systems such as BEAR (BEnchmark
of RDF ARchives) [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ], EvoGen [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and SPBv [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The authors of the BEAR
system propose a theoretical formalization of an RDF archive and conceive a
benchmark focusing on a set of general and abstract queries with respect to the
different categories of queries defined above. More recently, the EU H2020
HOBBIT project is focusing on the problem of benchmarking Big Linked Data.
In this context, EvoGen is proposed as a configurable and adaptive data and
query load generator. EvoGen extends the LUBM ontology and is configurable
in terms of archiving strategies and the number of versions or changes. Recently, the
new benchmark SPBv was developed with some preliminary experimental
results [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Similar to EvoGen, SPBv proposes a configurable and adaptive data
and query load generator.
        </p>
        <p>
          Concerning Big RDF dataset archives, the use of a partitioning strategy depends
on the shape of the issued SPARQL queries. Many works handle Big RDF
data by a simple hash partitioning on the RDF subject [
          <xref ref-type="bibr" rid="ref2 ref9">2, 9</xref>
          ], which improves
performance for star queries. For example, a subject-based partitioning strategy
seems more adapted for treating star-shaped queries (s,p,o), as all the
triples over which the query ranges are stored in the same node even though
they do not belong to the same version, which may improve the performance
of cross-version queries. For example, to follow the evolution of a given person's
career over time, we need to ask a star-shaped query of the form (?x, hasJob, ?y)
on different versions.
        </p>
        <p>
          Even for simple queries, the performance often drops significantly for
queries with a large diameter. The authors in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] propose a novel approach to
partition RDF data, named ExtVP (Extended Vertical Partitioning). In fact,
based on a pre-evaluation of the data, many RDF triple patterns are used to
partition the data into partition tables (a partition for each triple pattern). That
is, a triple query pattern can be retrieved by only accessing the partition table
that bounds the query, leading to a reduction of the execution time. The
problem becomes more complex when we ask cross-version join queries. For
example, we may need to know if the diploma of a person ?x has any equivalence
in the RDF dataset archive: (?x hasDip ?y) on version V1 and (?y hasEqui ?z)
on versions V2, ..., Vn. Realizing a partition on the subject for this kind of query
may engender many transfers between nodes.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>RDF dataset archiving on Apache Spark</title>
        <p>In this section, we present the main features of the Apache SPARK cluster computing
framework and show how we can use it for change detection and RDF dataset
versioning.</p>
        <sec id="sec-2-2-1">
          <title>Apache Spark</title>
          <p>
            Apache Spark [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] is a main-memory extension of the MapReduce model for
parallel computing that brings improvements through the data-sharing
abstraction called Resilient Distributed Dataset (RDD) [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] and data frames offering a
subset of relational operators (project, join and filter) not supported in Hadoop.
Spark also offers two higher-level data access models: an API for graphs and
graph-parallel computation called GraphX [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] and Spark SQL, a Spark module
for processing semi-structured data.
          </p>
          <p>
            SPARK SQL [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] is a Spark module that performs relational operations via a
DataFrame API, offering users the advantages of relational processing, namely
declarative queries and optimized storage. SPARK SQL supports relational
processing both on native RDDs and on external data sources using any of the
programming languages supported by Spark, e.g., Java, Scala or Python [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. SPARK
SQL can automatically infer the schema and data types from the language type
system.
          </p>
        </sec>
        <sec id="sec-2-2-2">
          <title>RDF Dataset storage and change detection</title>
          <p>SPARK SQL offers users the possibility to extract data from heterogeneous
data sources and can automatically infer the schema and data types from the
language type system (e.g., Scala, Java or Python). In our approach, we use
SPARK SQL for querying and managing the evolution of Big RDF datasets. An
RDF dataset stored in HDFS, or as a table in Hive or any external database
system, is mapped into a SPARK dataframe (equivalent to a table in a relational
database) with columns corresponding respectively to the subject, property,
object, named graph and eventually a tag of the corresponding version.
In order to obtain a view of a dataframe named "table", for example, we execute
the following SPARK SQL query:</p>
          <p>SELECT * FROM table</p>
          <p>When we want to materialize a given version, V1 for example, the following
SPARK SQL query is used:</p>
          <p>SELECT Subject,Predicate,Object FROM table WHERE Version='V1'</p>
          <p>In order to define the delta between versions, we define the following SPARK SQL
query:</p>
          <p>SELECT Subject,Predicate,Object FROM table WHERE Version='Vi'
MINUS</p>
          <p>SELECT Subject,Predicate,Object FROM table WHERE Version='Vj'</p>
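          <p>The two MINUS queries above amount to set differences over the triples of the two versions. As a minimal sketch (plain Python with toy triples standing in for the Spark SQL engine; the data and version tags are hypothetical), the delta can be computed as:</p>
```python
# Toy RDF archive: each triple carries a tag of its version.
# Plain-Python stand-in for the Spark SQL MINUS queries above.
table = [
    ("s1", "p1", "o1", "V1"), ("s2", "p1", "o2", "V1"),
    ("s1", "p1", "o1", "V2"), ("s3", "p2", "o3", "V2"),
]

def version(table, v):
    """SELECT Subject,Predicate,Object FROM table WHERE Version = v."""
    return {(s, p, o) for (s, p, o, tag) in table if tag == v}

def delta(table, vi, vj):
    """Triples present in Vi but not in Vj (Vi MINUS Vj)."""
    return version(table, vi) - version(table, vj)

deleted = delta(table, "V1", "V2")  # in V1, gone in V2
added = delta(table, "V2", "V1")    # new in V2
```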
        </sec>
        <sec id="sec-2-2-3">
          <title>RDF Dataset partitioning</title>
          <p>In this section, we present the principle that we adopt for partitioning
RDF dataset archives in order to efficiently execute single-version and cross-version
queries (figure 2). Concerning version and delta materialization queries, all the
data (version or delta) is loaded and no partitioning is needed.
- First of all, we load RDF datasets in N-Triples format from HDFS as input.
- Then, a mapping is realized from RDF files into dataframes with the
corresponding columns: subject, predicate, object and a tag of the version.
- We adopt a partitioning by RDF subject for each version.</p>
          <p>- The SPARK SQL engine processes the query and the result is returned.</p>
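          <p>The subject-based partitioning step can be sketched as follows (plain Python; the partition count and the simple character-sum hash are illustrative assumptions, Spark uses its own hash partitioner on the subject column):</p>
```python
# Assign each triple to a partition by hashing its subject, so that
# triples sharing a subject land in the same partition across versions.
NUM_PARTITIONS = 4  # hypothetical setting

def partition_of(subject, num_partitions=NUM_PARTITIONS):
    # A stable toy hash; a real system uses a proper hash function.
    return sum(map(ord, subject)) % num_partitions

triples = [("s1", "hasJob", "engineer", "V1"),
           ("s1", "hasJob", "manager", "V2"),
           ("s2", "hasJob", "teacher", "V1")]

partitions = {}
for t in triples:
    partitions.setdefault(partition_of(t[0]), []).append(t)
```
          <p>Both versions of the triples about s1 are co-located, which is what makes subject-subject joins local.</p>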
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Querying RDF dataset archives with SPARK SQL</title>
        <p>In this section, we define basic RDF archiving queries (version/delta
materialization, single/cross-version queries) with SPARK SQL.</p>
        <sec id="sec-2-3-1">
          <title>RDF archiving queries</title>
          <p>Using SPARK SQL, we can define RDF dataset archiving queries as follows:
- Version materialization: Mat(Vi).</p>
          <p>SELECT Subject,Predicate,Object FROM table WHERE Version='Vi'</p>
          <p>- Delta materialization: Delta(Vi, Vj), defined as the symmetric difference
between the two versions.</p>
          <p>SELECT Subject,Predicate,Object FROM table WHERE Version='Vi'
MINUS
SELECT Subject,Predicate,Object FROM table WHERE Version='Vj'
UNION
SELECT Subject,Predicate,Object FROM table WHERE Version='Vj'
MINUS
SELECT Subject,Predicate,Object FROM table WHERE Version='Vi'</p>
          <p>- Single-version query: [[Q]]Vi. We suppose here a simple query Q which
asks for all the subjects in the RDF dataset.</p>
          <p>SELECT Subject FROM table WHERE Version='Vi'</p>
          <p>- Cross-version structured query: Join(Q1, Vi, Q2, Vj). What we need
here is a join between the two query results. We define two dataframes dfi
and dfj containing respectively the versions Vi and Vj. The cross-version
query is defined as follows:</p>
          <p>SELECT * FROM dfi
INNER JOIN dfj
ON dfi.Subject = dfj.Subject</p>
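          <p>The cross-version join above can be illustrated with a small sketch (plain Python; toy triples stand in for the dataframes dfi and dfj):</p>
```python
# Join the triples of version Vi with those of version Vj on Subject,
# mirroring the INNER JOIN query above.
table = [
    ("alice", "hasDip", "MSc", "V1"),
    ("alice", "hasJob", "engineer", "V2"),
    ("bob", "hasDip", "BSc", "V1"),
]

def rows(v):
    """dfi: the triples tagged with version v."""
    return [(s, p, o) for (s, p, o, tag) in table if tag == v]

def join_on_subject(left, right):
    """Nested-loop inner join on the Subject column."""
    return [(l, r) for l in left for r in right if l[0] == r[0]]

result = join_on_subject(rows("V1"), rows("V2"))
```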
        </sec>
        <sec id="sec-2-3-2">
          <title>From SPARQL to SPARK SQL</title>
          <p>
            SPARK SQL is used in [
            <xref ref-type="bibr" rid="ref12 ref9">9, 12</xref>
            ] for querying RDF big data, where a query compiler
from SPARQL to SPARK SQL is provided. That is, a FILTER expression can be
mapped into a condition in Spark SQL, while UNION, OFFSET, LIMIT, ORDER
BY and DISTINCT are mapped into their equivalent clauses in the SPARK
SQL syntax. These mapping rules are used without considering SPARQL query
shapes. SPARQL graph patterns can have different shapes, which can influence
query performance. Depending on the position of variables in the triple patterns,
SPARQL graph patterns may be classified into three shapes:
1. Star pattern: this query pattern is commonly used in SPARQL. A star
pattern has diameter (longest path in the pattern) one and is characterized by
subject-subject joins between triple patterns.
2. Chain pattern: this query pattern is characterized by object-subject (or
subject-object) joins. The diameter of this query corresponds to the number
of triple patterns.
3. Snowflake pattern: this query pattern results from the combination of several
star patterns connected by short paths.
          </p>
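          <p>The three shapes can be recognized from the positions of the shared variables. The following sketch (plain Python, deliberately simplified: snowflake detection is omitted) classifies a basic graph pattern by its subject-subject or object-subject joins:</p>
```python
# Classify a basic graph pattern (list of (subject, predicate, object)
# triple patterns, variables prefixed with '?') as star or chain.
def shape(patterns):
    subjects = {s for (s, p, o) in patterns}
    objects = {o for (s, p, o) in patterns}
    if len(subjects) == 1:
        return "star"   # all triple patterns share one subject
    if subjects & objects:
        return "chain"  # the object of one pattern is the subject of another
    return "other"

star = shape([("?x", "hasDip", "?y"), ("?x", "hasJob", "?z")])
chain = shape([("?x", "hasJob", "?y"), ("?y", "hasSpec", "?z")])
```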
          <p>When we query RDF dataset archives, we have to deal with SPARQL query
shapes only in single-version and cross-version queries. We propose in the
following a mapping from SPARQL to SPARK SQL based on query shapes:
- Star pattern: a star SPARQL query with n triple patterns Pi is mapped into
a SPARK SQL query with n-1 joins on the subject attribute. If we consider a
SPARQL query with two triple patterns P1 and P2 of the form (?x1, p1, ?y1)
and (?x1, p2, ?z2), the dataframes df1 and df2 corresponding respectively to
the query patterns P1 and P2 are defined with SPARK SQL as follows:
df1 = "SELECT Subject, Object FROM table WHERE Predicate = 'p1'"
df2 = "SELECT Subject, Object FROM table WHERE Predicate = 'p2'"</p>
          <p>For example, given a SPARQL query pattern (?X hasDip ?Y . ?X hasJob
?Z), we need to create two dataframes df1 and df2 as follows:
df1 = "SELECT Subject, Object FROM table WHERE Predicate = 'hasDip'"
df2 = "SELECT Subject, Object FROM table WHERE Predicate = 'hasJob'"
We give in the following the obtained SPARK SQL query:</p>
          <p>SELECT * FROM df1
INNER JOIN df2
ON df1.Subject = df2.Subject</p>
          <p>- Chain pattern: a chain SPARQL query with n triple patterns is mapped
into a SPARK SQL query with n-1 object-subject (or subject-object) joins.
If we consider a SPARQL query with two triple patterns P1 and P2 of the
form (?x1, p1, ?z1) and (?z1, p2, ?t2), the dataframes df1 and df2 corresponding
respectively to the query patterns P1 and P2 are defined with SPARK SQL
as follows:
df1 = "SELECT Subject, Object FROM table WHERE Predicate = 'p1'"
df2 = "SELECT Subject, Object FROM table WHERE Predicate = 'p2'"</p>
          <p>For example, given a SPARQL query with two triple patterns (?X hasJob ?Z . ?Z
hasSpec ?W), we need to create a dataframe for each triple pattern:
df1 = "SELECT Subject, Object FROM table WHERE Predicate = 'hasJob'"
df2 = "SELECT Subject, Object FROM table WHERE Predicate = 'hasSpec'"
The query result is obtained as a join between the dataframes df1 and df2:</p>
          <p>SELECT * FROM df1
INNER JOIN df2
ON df1.Object = df2.Subject</p>
          <p>- Snowflake pattern: the rewriting of snowflake queries follows the same
principle and may need more join operations, depending equally on the number
of triple patterns used in the query.</p>
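          <p>The star-pattern mapping above (one selection per triple pattern, then n-1 subject-subject joins) can be sketched in plain Python, with lists standing in for dataframes and a toy table:</p>
```python
# Toy table of (Subject, Predicate, Object) triples.
table = [
    ("alice", "hasDip", "MSc"), ("alice", "hasJob", "engineer"),
    ("bob", "hasDip", "BSc"),
]

def select(predicate):
    """df = SELECT Subject, Object FROM table WHERE Predicate = predicate."""
    return [(s, o) for (s, p, o) in table if p == predicate]

def star_join(predicates):
    """Join the n per-predicate dataframes on Subject: n-1 joins in total."""
    result = [(s, (o,)) for (s, o) in select(predicates[0])]
    for pred in predicates[1:]:
        df = select(pred)
        result = [(s, objs + (o,)) for (s, objs) in result
                  for (s2, o) in df if s == s2]
    return result

rows = star_join(["hasDip", "hasJob"])  # the (?X hasDip ?Y . ?X hasJob ?Z) example
```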
          <p>For a single-version query [[Q]]Vi, we need to add a condition on the version on
which we want to execute the query Q. Nevertheless, the problem becomes more
complex for cross-version join queries Join(Q1, Vi, Q2, Vj), as additional join operations
are needed between different versions of the dataset. Two cases may occur:
1. Cross-version query type 1: this type of cross-version query concerns the
case where we have one query Q on two or more different versions. For
example, to follow the evolution of a given person's career, we need to execute
(?x, hasJob, ?z) on different versions. Given a query Q and n versions, we
denote by T1,...,Tn the results obtained by executing Q on versions V1,...,Vn
respectively. The final result is obtained by taking the union of the Ti.
What we can conclude here is that the number of versions does not increase
the number of joins, which only depends on the shape of the query. Given a
SPARQL query with a triple pattern P of the form (?x1, p, ?y1) defined on
two versions V1 and V2, the SPARK SQL query is defined as follows:
SELECT Subject, Object FROM table
WHERE Predicate = 'p' and Version = 'V1'
UNION
SELECT Subject, Object FROM table</p>
          <p>WHERE Predicate = 'p' and Version = 'V2'
2. Cross-version query type 2: the second case occurs when we have two or more
different queries Q1, Q2,...,Qm on several different versions. For example, we
may need to know if the diploma of a person ?x has any equivalence in the RDF
dataset archive:</p>
          <p>Q1: ?x hasDip ?y on version V1</p>
          <p>Q2: ?y hasEqui ?z on versions V2, ..., Vn</p>
          <p>Given SPARQL patterns P1 and P2 of the form (?x1, p1, ?z1) and (?z1, p2, ?t2)
defined on versions V1 and V2 respectively, the dataframes df1 and df2
corresponding to the query patterns P1 and P2 are defined with
SPARK SQL as follows:
df1 = "SELECT Subject, Object FROM table WHERE Predicate = 'p1' and Version = 'V1'"
df2 = "SELECT Subject, Object FROM table WHERE Predicate = 'p2' and Version = 'V2'"
The query result is obtained as a join between the dataframes df1 and df2:
SELECT * FROM df1
INNER JOIN df2
ON df1.Object = df2.Subject
Given df1,...,dfn the different dataframes obtained by executing Q1, Q2,...,Qn,
respectively, on versions V1,...,Vn, the final result is obtained with a
combination of join and/or union operations between the dfi. In the worst case we
may need to compute n-1 joins:</p>
          <p>join(join(join(join(df1, df2), df3), df4), ..., dfn)
That is, for cross-version query type 2, the number of joins depends on the
shape of the query as well as on the number of versions.</p>
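          <p>The worst case above, a left-deep chain of n-1 joins over the per-version dataframes, can be sketched as follows (plain Python; df1 to df3 are hypothetical per-version results holding Subject/Object pairs):</p>
```python
# Left-deep chain join(join(join(df1, df2), df3), ...): the object of the
# accumulated result is joined with the subject of the next dataframe.
def join_obj_subj(left, right):
    """Inner join: last column of the left rows equals Subject of the right rows."""
    return [l + r for l in left for r in right if l[-1] == r[0]]

df1 = [("x", "dipA")]     # ?x hasDip ?y   on V1
df2 = [("dipA", "dipB")]  # ?y hasEqui ?z  on V2
df3 = [("dipB", "dipC")]  # ?z hasEqui ?w  on V3

result = df1
for df in (df2, df3):     # n-1 = 2 joins for n = 3 versions
    result = join_obj_subj(result, df)
```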
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>Experimental evaluation</title>
        <p>The evaluation was performed in the 'Amazon Web Services' cloud environment using
EMR (Elastic MapReduce) as a platform. The data input files were saved on Amazon S3.
The experiments were done on a cluster with three nodes (one master
and two core nodes) using m3.xlarge as the instance type.</p>
        <table-wrap id="tbl1">
          <label>Table 1.</label>
          <caption>
            <p>RDF dataset description</p>
          </caption>
          <table>
            <thead>
              <tr><th>Version</th><th>Triples</th><th>Added triples</th><th>Deleted triples</th></tr>
            </thead>
            <tbody>
              <tr><td>Version 1</td><td>30,035,245</td><td>-</td><td>-</td></tr>
              <tr><td>Version 5</td><td>27,377,065</td><td>6,922,375</td><td>9,598,805</td></tr>
              <tr><td>Version 10</td><td>28,910,781</td><td>9,752,568</td><td>11,092,386</td></tr>
              <tr><td>Version 15</td><td>33,253,221</td><td>14,110,358</td><td>11,150,069</td></tr>
              <tr><td>Version 20</td><td>35,161,469</td><td>18,233,113</td><td>13,164,710</td></tr>
              <tr><td>Version 25</td><td>31,510,558</td><td>16,901,310</td><td>15,493,857</td></tr>
              <tr><td>Version 30</td><td>44,025,238</td><td>30,697,869</td><td>16,797,313</td></tr>
              <tr><td>Version 35</td><td>32,606,132</td><td>19,210,291</td><td>16,645,753</td></tr>
              <tr><td>Version 40</td><td>32,923,367</td><td>18,125,524</td><td>15,312,146</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We use the dataset of the BEAR benchmark, which monitors more than 650 different domains across time and
is composed of 58 snapshots. The description of the dataset is given in table
1. In the following we present the evaluation of versioning queries on top of the
SPARK framework (code available at https://github.com/meriemlaajimi/Archiving). The evaluation concerns four query types: version and delta
materialization, single-version and cross-version queries respectively.</p>
        <sec id="sec-2-4-1">
          <title>Version and Delta Materialization</title>
          <p>The content of the entire version (resp. delta) is materialized. For each version,
the average execution time of the queries was computed. Based on the plots
shown in figure 4, we observe that the execution times obtained with the IC
strategy are approximately constant and show better results compared to the ones
obtained with the CB approach. In fact, versions in the CB approach are not
stored and need to be computed each time we want to query a given version
(resp. delta).</p>
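          <p>The query-time overhead of the CB strategy can be illustrated with a small sketch (plain Python; the versions and deltas are toy data): rebuilding a version under CB means applying every stored delta in sequence, whereas under IC the version is read directly:</p>
```python
# Rebuild the latest version from V1 by applying (added, deleted) deltas
# in order, as the CB approach must do at query time.
def materialize_cb(v1, deltas):
    current = set(v1)
    for added, deleted in deltas:
        current = (current - deleted) | added
    return current

v1 = {("s1", "p", "o1"), ("s2", "p", "o2")}
deltas = [
    ({("s3", "p", "o3")}, {("s2", "p", "o2")}),  # V1 -> V2
    ({("s4", "p", "o4")}, set()),                # V2 -> V3
]
v3 = materialize_cb(v1, deltas)
```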
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Single-version and cross-version queries</title>
      <sec id="sec-3-1">
        <title>Single-version queries</title>
        <p>
          We realize different experiments with subject-, predicate- and object-based
queries or a combination of them. Figure 5 concerns single-version queries where
the object and/or predicate is given whereas the subject corresponds to what
we ask for. The analysis of the obtained plots shows that the use of partitioning
improves the query execution times. Nevertheless, a query with an individual
triple pattern does not need an important number of I/O operations. That is,
the real advantage of using partitioning is not highlighted for this kind
of queries, which is not the case for cross-version queries.
In the following, we focus on cross-version queries. The first series of tests is
realized with star-shaped queries of the form (?X, p, ?Y) and (?X, q, ?Z). The obtained
execution times are shown in table 2. We note that the advantage of the
use of partitioning is highlighted for this kind of queries compared to the results
obtained with single-triple queries (subsection 5.2). As we can see in figure 6, the
use of partitioning improves execution times. In fact, a star query invokes triple
patterns having the same subject value. When we use partitioning on the subject
attribute, these triple patterns are loaded in the same partition and no transfer
is needed between nodes [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. In our approach, RDF triple patterns belonging to
different versions and having the same subject are equally loaded in the same
partition.
        </p>
        <table-wrap id="tbl2">
          <label>Table 2.</label>
          <caption>
            <p>Query time evaluation of Star query (SQ)</p>
          </caption>
          <table>
            <thead>
              <tr><th>Versions</th><th>Triples</th><th>SQ without partitions (ms)</th><th>SQ with partitions (ms)</th></tr>
            </thead>
            <tbody>
              <tr><td>V1 and V5</td><td>57,412,310</td><td>15005.226</td><td>12431.357</td></tr>
              <tr><td>V5 and V10</td><td>56,287,846</td><td>15808.009</td><td>13531.05</td></tr>
              <tr><td>V10 and V15</td><td>62,164,002</td><td>16482.251</td><td>13223.434</td></tr>
              <tr><td>V15 and V20</td><td>68,414,690</td><td>16563.959</td><td>14165.733</td></tr>
              <tr><td>V20 and V25</td><td>66,672,027</td><td>15839.788</td><td>14532.462</td></tr>
              <tr><td>V25 and V30</td><td>75,535,796</td><td>16158.124</td><td>15053.127</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We equally realize a second series of tests using chain queries with two triple
patterns of the form (?X, p, ?Y) and (?Y, q, ?Z). Table 3 shows the execution
times obtained with the chain query. As we can equally see in figure 6, the use
of partitioning improves execution times. We note that for executing chain
queries, object-subject (or subject-object) joins are needed and data transfer
between nodes is necessary for executing the queries. In fact, as the partition is realized
on the subject attribute, to realize the join between subject and object values we
need to transfer the triples having a given object value from other partitions.
As a result, the execution times obtained with the chain query are higher than those
obtained with star query shapes.</p>
        <table-wrap id="tbl3">
          <label>Table 3.</label>
          <caption>
            <p>Runtime evaluation of Chain query (CQ)</p>
          </caption>
          <table>
            <thead>
              <tr><th>Versions</th><th>Triples</th><th>CQ without partitions (ms)</th><th>CQ with partitions (ms)</th></tr>
            </thead>
            <tbody>
              <tr><td>V1 and V5</td><td>57,412,310</td><td>15002.811</td><td>13630.838</td></tr>
              <tr><td>V5 and V10</td><td>56,287,846</td><td>16072.282</td><td>14029.593</td></tr>
              <tr><td>V10 and V15</td><td>62,164,002</td><td>16939.459</td><td>14395.548</td></tr>
              <tr><td>V15 and V20</td><td>68,414,690</td><td>17670.103</td><td>14247.463</td></tr>
              <tr><td>V20 and V25</td><td>66,672,027</td><td>16999.656</td><td>14681.513</td></tr>
              <tr><td>V25 and V30</td><td>75,535,796</td><td>19044.695</td><td>16257.424</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>What we can conclude is that using partitioning with SPARK is favourable for
executing cross-version star queries. Nevertheless, chain (or snowflake)
queries need to be addressed more deeply, as the number of joins between non-partitioned
data may affect query execution times.
In this paper, we proposed an evaluation of the main versioning queries on top of the
SPARK framework using Scala. Different performance tests have been realized
based on: versioning approaches (Change-Based or Independent Copies
approaches), the types of RDF archive queries, the size of versions, the shape
of SPARQL queries and finally the data partitioning strategy. What we can
conclude is that using partitioning on the subject attribute with SPARK is
favourable for executing cross-version star queries, as the execution of this query
type does not need transfers between nodes, which is not the case for cross-version
chain queries.</p>
        <p>
          We note that the number of patterns used in a chain query, as well as the
number of versions, has an implication on the number of join operations used to
execute the query, and thereby on the number of data transfers. Different issues
need to be considered, namely which partitioning strategy should be adopted for
efficiently executing cross-version chain queries. In future work, we plan
to use different partitioning strategies [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and to define execution plans for join
operations [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] by taking into consideration the size of a version, the number of
versions and the shape of the used queries.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abdelaziz</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harbi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khayyat</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalnis</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A survey and experimental comparison of distributed SPARQL engines for very large RDF data</article-title>
          .
          <source>PVLDB</source>
          <volume>10</volume>
          (
          <issue>13</issue>
          ),
          <fpage>2049</fpage>
          -
          <lpage>2060</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ahn</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Im</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eom</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zong</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>G-Diff: A grouping algorithm for RDF change detection on MapReduce</article-title>
          .
          <source>In: Semantic Technology - 4th Joint International Conference</source>
          , JIST, Chiang Mai, Thailand, November 9-11, Revised Selected Papers. pp.
          <fpage>230</fpage>
          -
          <lpage>235</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Abele</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCrae</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buitelaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jentzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Linking Open Data cloud diagram 2018</article-title>
          . http://lod-cloud.net// (
          <year>2018</year>
          ), [Online; accessed April-2018]
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Armbrust</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lian</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bradley</surname>
            ,
            <given-names>J.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaftan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghodsi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Spark SQL: relational data processing in Spark</article-title>
          .
          <source>In: Proceedings of the SIGMOD International Conference on Management of Data</source>
          , Melbourne, Victoria, Australia, May 31 - June 4. pp.
          <fpage>1383</fpage>
          -
          <lpage>1394</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knuth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Evaluating query and storage strategies for RDF archives</article-title>
          .
          <source>In: Proceedings of the 12th International Conference on Semantic Systems</source>
          . pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          . ACM
          , New York, NY, USA (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knuth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Evaluating query and storage strategies for RDF archives</article-title>
          .
          <source>Semantic Web Journal</source>
          , IOS Press (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Graube</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hensel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urbas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>R43ples: Revisions for triples - an approach for version control in the semantic web</article-title>
          .
          <source>In: Proceedings of the 1st Workshop on Linked Data Quality co-located with 10th International Conference on Semantic Systems</source>
          , Leipzig, Germany, September 2nd (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Meimaris</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papastefanatos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>The EvoGen benchmark suite for evolving RDF data</article-title>
          .
          <source>In: MEPDaW Workshop</source>
          , Extended Semantic Web Conference (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Naacke</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cure</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>SPARQL query processing with Apache Spark</article-title>
          .
          <source>CoRR</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Papakonstantinou</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flouris</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fundulaki</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stefanidis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roussakis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Versioning for linked data: Archiving systems and benchmarks</article-title>
          .
          <source>In: Proceedings of the Workshop on Benchmarking Linked Data, Kobe, Japan, October</source>
          <volume>18</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Papakonstantinou</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flouris</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fundulaki</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stefanidis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roussakis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>SPBv: Benchmarking linked data archiving systems</article-title>
          .
          <source>In: 2nd International Workshop on Benchmarking Linked Data</source>
          , ISWC
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Schatzle,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Przyjaciel-Zablocki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Skilevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Lausen</surname>
          </string-name>
          , G.:
          <article-title>S2RDF: RDF querying with SPARQL on spark</article-title>
          .
          <source>PVLDB</source>
          <volume>9</volume>
          (
          <issue>10</issue>
          ),
          <fpage>804</fpage>
          -
          <lpage>815</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Stefanidis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chrysakis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flouris</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>On Designing Archiving Policies for Evolving RDF Datasets on the Web</article-title>
          , pp.
          <fpage>43</fpage>
          -
          <lpage>56</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dave</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCauly</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing</article-title>
          .
          <source>In: Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation</source>
          , San Jose, CA, USA, April 25-27. pp.
          <fpage>15</fpage>
          -
          <lpage>28</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wendell</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Armbrust</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dave</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Venkataraman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghodsi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Apache Spark: a unified engine for big data processing</article-title>
          .
          <source>Commun. ACM</source>
          <volume>59</volume>
          (
          <issue>11</issue>
          ),
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>