<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extending LargeRDFBench for Multi-Source Data at Scale for SPARQL Endpoint Federation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ali Hasnain</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Saleem</string-name>
          <email>saleem@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axel-Cyrille Ngonga Ngomo</string-name>
          <email>axel.ngonga@upb.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dietrich Rebholz-Schuhmann</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DICE, University of Paderborn</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Insight Centre for Data Analytics, National University of Ireland</institution>
          ,
          <addr-line>Galway</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universitat Leipzig, IFI/AKSW</institution>
          ,
          <addr-line>PO 100920, D-04009 Leipzig</addr-line>
        </aff>
      </contrib-group>
      <fpage>28</fpage>
      <lpage>44</lpage>
      <abstract>
        <p>Querying the Web of Data relies heavily on federation approaches, mainly SPARQL query federation, when the data is available through endpoints. Different benchmarks have been proposed to exploit the full potential of SPARQL query federation approaches in real-world scenarios, but they are limited in size and complexity. Previously, we introduced LargeRDFBench, a billion-triple benchmark for SPARQL query federation. In this work, we pinpoint some of the limitations of LargeRDFBench and propose an extension with 8 additional queries. Our evaluation of state-of-the-art federation engines on these additional queries revealed interesting insights.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Due to the linked, autonomous, and decentralised architecture of Linked Open Data
(LOD), several queries require collecting information from more than one dataset
(also called data source) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Processing such queries, called federated queries, is of
central importance for the scalable deployment of Semantic Web technologies.
The importance of federated SPARQL queries for Linked Data management has
led to the development of several SPARQL query federation engines
[
        <xref ref-type="bibr" rid="ref1 ref10 ref12 ref14 ref5 ref7">12,1,14,5,7,10</xref>
        ]. Consequently, this has motivated the design of several
federated SPARQL querying benchmarks [
        <xref ref-type="bibr" rid="ref13 ref6 ref9">9,13,6</xref>
        ]. LargeRDFBench [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] addressed
several limitations of FedBench [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and SPLODGE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In this work, we highlight some of the limitations of LargeRDFBench. In
particular, the number of distinct datasets (sources for short) required to get
the complete result set of a query is small (ranging from 1 to 4).
As such, federation engines (e.g., [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]) that optimise the ordering of the
required distinct sources, explicitly mentioned as SPARQL SERVICE clauses, cannot be
fully tested with the existing LargeRDFBench queries. To fill this gap, we extended
LargeRDFBench with 8 additional queries of varying complexity and
number of distinct sources required. We discuss the key characteristics of each of
these additional queries and evaluate state-of-the-art engines on them.
The evaluation results revealed interesting insights about the performance and
stability of these engines. LargeRDFBench, along with the proposed
extension, is available at https://github.com/AKSW/largerdfbench.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Over the last decade, various benchmarks have been proposed for comparing
triple stores and SPARQL query processing systems. In this work, we focus
only on benchmarks for federated SPARQL queries. The SPLODGE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] benchmark uses
heuristics for the automatic generation of federated queries with conjunctive BGPs.
Non-conjunctive queries that make use of the SPARQL UNION or OPTIONAL clauses
are not considered. FedBench [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] comprises 9 real-world datasets and a total of
25 queries from different domains. Some of the limitations of FedBench were
addressed in LargeRDFBench [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] with more real-world datasets and more complex
and large data queries. In this work, we address some of the key limitations
of LargeRDFBench and propose an extension to this benchmark.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Design Features</title>
      <p>
        In this section, we present the key SPARQL query features that should be
considered while designing a federated SPARQL benchmark. Note that all of these
key SPARQL features are formally presented in LargeRDFBench [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Here, we
re-introduce all of them to keep this paper self-contained and to support
the subsequent analysis.
      </p>
      <p>
        The previous research contributions [
        <xref ref-type="bibr" rid="ref6 ref9">6,9</xref>
        ] on SPARQL query
benchmarking pointed out that SPARQL queries used in a benchmark should vary with
respect to the following key query characteristics: total number of triple
patterns, number of join vertices, mean join vertex degree, number of sources
spanned, query result set sizes, mean triple pattern selectivities, BGP-restricted
triple pattern selectivity, join-restricted triple pattern selectivity, join vertex
types ('star', 'path', 'hybrid', 'sink'), and important SPARQL clauses used (e.g.,
LIMIT, OPTIONAL, UNION, FILTER).
      </p>
      <p>
        We represent any basic graph pattern (BGP) of a given SPARQL query as
a directed hypergraph (DH) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], a generalisation of a directed graph in which a
hyperedge can join any number of vertices. In our specific case, every hyperedge
captures a triple pattern. The subject of the triple pattern becomes the source vertex of
a hyperedge, and the predicate and object of the triple pattern become the target
vertices. For instance, Figure 1 shows the hypergraph representation
of a SPARQL query. Unlike a common SPARQL representation, where the subject
and object of the triple pattern are connected by an edge, our hypergraph-based
representation contains nodes for all three components of the triple patterns. As
a result, we can capture joins that involve predicates of triple patterns. Formally,
our hypergraph representation is defined as follows:
      </p>
      <sec id="sec-3-1">
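        <title>Definition 1 (Directed hypergraph of a BGP)</title>
        <p>The hypergraph representation of a BGP <inline-formula><tex-math>B</tex-math></inline-formula> is a directed hypergraph <inline-formula><tex-math>HG = (V, E)</tex-math></inline-formula> whose vertices are all
the components of all triple patterns in <inline-formula><tex-math>B</tex-math></inline-formula>, i.e., <inline-formula><tex-math>V = \bigcup_{(s,p,o) \in B} \{s, p, o\}</tex-math></inline-formula>, and that
contains a hyperedge <inline-formula><tex-math>(S, T) \in E</tex-math></inline-formula> for every triple pattern <inline-formula><tex-math>(s, p, o) \in B</tex-math></inline-formula> such that
<inline-formula><tex-math>S = \{s\}</tex-math></inline-formula> and <inline-formula><tex-math>T = (p, o)</tex-math></inline-formula>.</p>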
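        <p>The representation of a complete SPARQL query as a DH is the union of
the representations of the query's BGPs. Based on the DH representation of
SPARQL queries, we can define the following features of SPARQL queries:</p>
        <p>Definition 2 (Join Vertex). For every vertex <inline-formula><tex-math>v \in V</tex-math></inline-formula> in such a hypergraph,
we write <inline-formula><tex-math>E_{in}(v)</tex-math></inline-formula> and <inline-formula><tex-math>E_{out}(v)</tex-math></inline-formula> to denote the set of incoming and outgoing edges,
respectively; i.e., <inline-formula><tex-math>E_{in}(v) = \{(S, T) \in E \mid v \in T\}</tex-math></inline-formula> and <inline-formula><tex-math>E_{out}(v) = \{(S, T) \in E \mid v \in S\}</tex-math></inline-formula>.
If <inline-formula><tex-math>|E_{in}(v)| + |E_{out}(v)| &gt; 1</tex-math></inline-formula>, we call <inline-formula><tex-math>v</tex-math></inline-formula> a join vertex.</p>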
        <p>Definition 3 (Join Vertex Types). A vertex <inline-formula><tex-math>v \in V</tex-math></inline-formula> can be of type "star",
"path", "hybrid", or "sink" if this vertex participates in at least one join. A
"star" vertex has more than one outgoing edge and no incoming edges. A "path"
vertex has exactly one incoming and one outgoing edge. A "hybrid" vertex has
either more than one incoming and at least one outgoing edge or more than one
outgoing and at least one incoming edge. A "sink" vertex has more than one
incoming edge and no outgoing edge. A vertex that does not participate in joins
is "simple".</p>
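        <p>Definition 4 (Number of Join Vertices). Let <inline-formula><tex-math>ST = \{st_1, \ldots, st_j\}</tex-math></inline-formula> be the set
of vertices of type 'star', <inline-formula><tex-math>PT = \{pt_1, \ldots, pt_k\}</tex-math></inline-formula> be the set of vertices of type 'path',
<inline-formula><tex-math>HB = \{hb_1, \ldots, hb_l\}</tex-math></inline-formula> be the set of vertices of type 'hybrid', and <inline-formula><tex-math>SN = \{sn_1, \ldots, sn_m\}</tex-math></inline-formula>
be the set of vertices of type 'sink' in a DH representation of a query. Then
the number of join vertices in the query is <inline-formula><tex-math>\#JV = |ST| + |PT| + |HB| + |SN|</tex-math></inline-formula>.
The total number of join vertices in a query is the sum of the total number of
join vertices across all of the BGPs contained in this query.</p>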
      </sec>
      <sec id="sec-3-2">
        <title>Definition 5 (Join Vertex Degree)</title>
        <p>The DH representation of SPARQL
queries makes use of the notion of <inline-formula><tex-math>E_{in}(v) \subseteq E</tex-math></inline-formula> and <inline-formula><tex-math>E_{out}(v) \subseteq E</tex-math></inline-formula> to denote
the set of incoming and outgoing hyperedges of a vertex <inline-formula><tex-math>v</tex-math></inline-formula>. The join vertex degree
of a vertex <inline-formula><tex-math>v</tex-math></inline-formula> is denoted <inline-formula><tex-math>JVD_v = |E_{in}(v)| + |E_{out}(v)|</tex-math></inline-formula>.
The join vertex degree of the complete query is the average of the join vertex
degrees of all the join vertices contained in this query. In our example (see Figure 1),
the number of triple patterns is seven and the number of join vertices is four
(two star, one sink, and one path). The join vertex degree of each of the 'star'
join vertices (shown in green in Figure 1) is three (i.e., three outgoing
hyperedges from each of the two vertices).</p>
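        <p>The notions of join vertex, join vertex type, and join vertex degree introduced above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used for the benchmark: the function name <monospace>join_vertices</monospace>, the example BGP, and its prefixes are hypothetical, and every term (variable or constant) is treated as a vertex, as in the hypergraph definition.</p>
        <preformat>
```python
from collections import Counter

def join_vertices(bgp):
    """Map each join vertex of a BGP to (type, degree).

    bgp is a list of (s, p, o) triple patterns. Each pattern is one
    hyperedge with source {s} and targets (p, o).
    """
    e_in, e_out = Counter(), Counter()
    for s, p, o in bgp:
        e_out[s] += 1          # outgoing edge of the subject
        for v in {p, o}:       # incoming edge of predicate and object
            e_in[v] += 1
    result = {}
    for v in set(e_in) | set(e_out):
        n_in, n_out = e_in[v], e_out[v]
        degree = n_in + n_out
        if degree == 1:
            continue           # "simple" vertex, not part of any join
        if n_in == 0:
            kind = "star"      # several outgoing, no incoming edges
        elif n_out == 0:
            kind = "sink"      # several incoming, no outgoing edges
        elif n_in == 1 and n_out == 1:
            kind = "path"      # exactly one incoming and one outgoing
        else:
            kind = "hybrid"
        result[v] = (kind, degree)
    return result

# Hypothetical BGP with four triple patterns (not a benchmark query).
bgp = [
    ("?drug", "db:name", "?name"),
    ("?drug", "db:keggId", "?cpd"),
    ("?cpd", "kegg:mass", "?mass"),
    ("?enzyme", "kegg:substrate", "?cpd"),
]
jv = join_vertices(bgp)
num_jv = len(jv)                                     # number of join vertices
mean_jvd = sum(d for _, d in jv.values()) / num_jv   # mean join vertex degree
print(jv, num_jv, mean_jvd)
```
        </preformat>
        <p>In this toy BGP, <monospace>?drug</monospace> is a star vertex of degree 2 and <monospace>?cpd</monospace> is a hybrid vertex of degree 3, so the query has two join vertices and a mean join vertex degree of 2.5.</p>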
      </sec>
      <sec id="sec-3-3">
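        <title>Definition 6 (Relevant Source Set)</title>
        <p>Let <inline-formula><tex-math>D</tex-math></inline-formula> be the set of all data sources
(e.g., SPARQL endpoints) and <inline-formula><tex-math>TP</tex-math></inline-formula> be the set of all triple patterns in query <inline-formula><tex-math>Q</tex-math></inline-formula>. Then,
a source <inline-formula><tex-math>d \in D</tex-math></inline-formula> is relevant (also called capable) for a triple pattern <inline-formula><tex-math>tp_i \in TP</tex-math></inline-formula> if
at least one triple contained in <inline-formula><tex-math>d</tex-math></inline-formula> matches <inline-formula><tex-math>tp_i</tex-math></inline-formula>.<sup>5</sup> The relevant source set <inline-formula><tex-math>R_i \subseteq D</tex-math></inline-formula>
for <inline-formula><tex-math>tp_i</tex-math></inline-formula> is the set that contains all sources that are relevant for that particular
triple pattern.</p>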
      </sec>
      <sec id="sec-3-4">
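        <title>Definition 7 (Total Triple Pattern-wise Sources)</title>
        <p>By using Definition 6,
we can define the total number of triple pattern-wise sources selected for query <inline-formula><tex-math>Q</tex-math></inline-formula>
as the sum of the cardinalities of the relevant source sets <inline-formula><tex-math>R_i</tex-math></inline-formula> over all individual triple
patterns <inline-formula><tex-math>tp_i \in Q</tex-math></inline-formula>.</p>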
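        <p>As a rough illustration of relevant source sets and the total number of triple pattern-wise sources, the following Python sketch computes both over a toy federation. The source names, triples, and function names are hypothetical, and the matching test is deliberately simplified: any term starting with <monospace>?</monospace> is a variable that matches anything, and repeated variables within one pattern are not checked.</p>
        <preformat>
```python
def matches(pattern, triple):
    """True if a triple matches a triple pattern; '?x' terms are variables."""
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))

def relevant_sources(pattern, sources):
    """Names of all sources holding at least one triple that matches pattern."""
    return {name for name, triples in sources.items()
            if any(matches(pattern, t) for t in triples)}

def total_tpw_sources(query, sources):
    """Sum of the relevant source set sizes over all triple patterns."""
    return sum(len(relevant_sources(tp, sources)) for tp in query)

# Hypothetical toy federation of three sources (not benchmark data).
sources = {
    "drugbank": [("d1", "name", "Aspirin"), ("d1", "keggId", "c1")],
    "kegg": [("c1", "mass", "180.16")],
    "dbpedia": [("Aspirin", "name", "Aspirin")],
}
query = [("?d", "name", "?n"), ("?d", "keggId", "?c"), ("?c", "mass", "?m")]
print(total_tpw_sources(query, sources))
```
        </preformat>
        <p>Here the first triple pattern has two relevant sources and the other two have one each, giving four triple pattern-wise sources in total, although only three distinct sources are involved.</p>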
      </sec>
      <sec id="sec-3-5">
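        <title>Definition 8 (BGP-Restricted Triple Pattern Selectivity)</title>
        <p>
          Consider a Basic Graph Pattern <inline-formula><tex-math>BGP</tex-math></inline-formula> and a triple pattern <inline-formula><tex-math>tp_i</tex-math></inline-formula> belonging to <inline-formula><tex-math>BGP</tex-math></inline-formula>. Let <inline-formula><tex-math>R(tp_i, D)</tex-math></inline-formula>
be the set of distinct solution mappings (i.e., result set) of executing <inline-formula><tex-math>tp_i</tex-math></inline-formula> over
dataset <inline-formula><tex-math>D</tex-math></inline-formula> and <inline-formula><tex-math>R(BGP, D)</tex-math></inline-formula> be the set of distinct solution mappings of
executing <inline-formula><tex-math>BGP</tex-math></inline-formula> over dataset <inline-formula><tex-math>D</tex-math></inline-formula>. Then the BGP-restricted triple pattern selectivity,
denoted by <inline-formula><tex-math>Sel_{BGPRestricted}(tp_i, D)</tex-math></inline-formula>, is the fraction of distinct solution mappings
in <inline-formula><tex-math>R(tp_i, D)</tex-math></inline-formula> that are compatible (as per standard SPARQL semantics [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]) with
a solution mapping in <inline-formula><tex-math>R(BGP, D)</tex-math></inline-formula> [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Formally, if <inline-formula><tex-math>\Omega</tex-math></inline-formula> and <inline-formula><tex-math>\Omega'</tex-math></inline-formula> denote the sets
underlying the (bag) query results <inline-formula><tex-math>R(tp_i, D)</tex-math></inline-formula> and <inline-formula><tex-math>R(BGP, D)</tex-math></inline-formula>, respectively, then
          <disp-formula><tex-math>Sel_{BGPRestricted}(tp_i, D) = \frac{|\{\mu \in \Omega \mid \exists \mu' \in \Omega' : \mu \text{ and } \mu' \text{ are compatible}\}|}{|\Omega|}</tex-math></disp-formula>
        </p>
        <p>
          Definition 9 (Join-Restricted Triple Pattern Selectivity). Consider a
join vertex <inline-formula><tex-math>x</tex-math></inline-formula> in the DH representation of a <inline-formula><tex-math>BGP</tex-math></inline-formula>. Let <inline-formula><tex-math>BGP' \subseteq BGP</tex-math></inline-formula>
be the set of triple patterns that are incident to <inline-formula><tex-math>x</tex-math></inline-formula>. Furthermore, let <inline-formula><tex-math>tp_i</tex-math></inline-formula>
belonging to <inline-formula><tex-math>BGP'</tex-math></inline-formula> be a triple pattern, <inline-formula><tex-math>R(tp_i, D)</tex-math></inline-formula> be the set of distinct solution
mappings of executing <inline-formula><tex-math>tp_i</tex-math></inline-formula> over dataset <inline-formula><tex-math>D</tex-math></inline-formula>, and <inline-formula><tex-math>R(BGP', D)</tex-math></inline-formula> be the set of distinct
solution mappings of executing <inline-formula><tex-math>BGP'</tex-math></inline-formula> over dataset <inline-formula><tex-math>D</tex-math></inline-formula>. Then the <inline-formula><tex-math>x</tex-math></inline-formula>-restricted
triple pattern selectivity, denoted by <inline-formula><tex-math>Sel_{JVxRestricted}(tp_i, D)</tex-math></inline-formula>, is the fraction of
distinct solution mappings in <inline-formula><tex-math>R(tp_i, D)</tex-math></inline-formula> that are compatible with a solution
mapping in <inline-formula><tex-math>R(BGP', D)</tex-math></inline-formula> [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Formally, if <inline-formula><tex-math>\Omega</tex-math></inline-formula> and <inline-formula><tex-math>\Omega'</tex-math></inline-formula> denote the sets underlying the
(bag) query results <inline-formula><tex-math>R(tp_i, D)</tex-math></inline-formula> and <inline-formula><tex-math>R(BGP', D)</tex-math></inline-formula>, respectively, then
          <disp-formula><tex-math>Sel_{JVxRestricted}(tp_i, D) = \frac{|\{\mu \in \Omega \mid \exists \mu' \in \Omega' : \mu \text{ and } \mu' \text{ are compatible}\}|}{|\Omega|}</tex-math></disp-formula>
          <fn><label>5</label><p>The concept of matching a triple pattern is defined formally in the SPARQL specification found at http://www.w3.org/TR/rdf-sparql-query/</p></fn>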
        </p>
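        <p>Both restricted selectivities are described above as the fraction of distinct solution mappings of a triple pattern that are compatible with some solution mapping of a larger pattern (the whole BGP, or the triple patterns incident to a join vertex). A minimal Python sketch of that fraction follows. The function names and example mappings are hypothetical, solution mappings are modelled as plain dictionaries that are assumed to be already distinct, and returning 0.0 for an empty result set is an assumption made only to avoid a division by zero.</p>
        <preformat>
```python
def compatible(mu1, mu2):
    """Two solution mappings are compatible if they agree on shared variables."""
    return all(mu2[var] == val for var, val in mu1.items() if var in mu2)

def restricted_selectivity(omega, omega_prime):
    """Fraction of mappings in omega compatible with some mapping in omega_prime."""
    if not omega:
        return 0.0  # assumption: treat the undefined case as 0
    hits = sum(1 for mu in omega
               if any(compatible(mu, other) for other in omega_prime))
    return hits / len(omega)

# Hypothetical results: four mappings for one triple pattern, of which
# two survive in the result of the enclosing pattern.
omega = [{"?d": "d1"}, {"?d": "d2"}, {"?d": "d3"}, {"?d": "d4"}]
omega_prime = [{"?d": "d1", "?n": "a"}, {"?d": "d3", "?n": "b"}]
print(restricted_selectivity(omega, omega_prime))
```
        </preformat>
        <p>In this example two of the four mappings are compatible, so the restricted selectivity is 0.5. The same function covers both definitions; only the pattern that produces the second argument differs.</p>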
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Analysis</title>
      <p>We now present an analysis of the queries included in the original and the extended
LargeRDFBench, based on the important query features introduced in the
previous section.</p>
      <sec id="sec-4-1">
        <title>LargeRDFBench</title>
        <p>The original LargeRDFBench comprises a total of 32 queries, which are divided
into three different types, namely Simple, Complex, and Large Data queries. The
Simple queries category includes a total of 14 queries (namely S1-S14). The
Complex queries category includes a total of 10 queries (namely C1-C10). The
Large Data queries category includes a total of 8 queries (namely L1-L8). Table
1 shows the characteristics of each query across the important query features
discussed in the previous section. These queries were defined by considering an
increasing number of sources selected per query. A brief summary of each query
category is given below.</p>
        <p>
          Simple Queries. Simple queries were taken directly from the FedBench queries
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. These queries are relatively fast to execute (around 2 seconds [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]) and
include the smallest number of triple patterns of all categories.
The number of triple patterns in this category ranges from 2 to 7. The number of
join vertices and the mean join vertex degree for these queries are low (average
#JV = 2.6, MJVD = 2.1, ref. Table 1). Moreover, they only use a subset of the
SPARQL clauses, as shown in Table 1. Amongst others, they do not use the LIMIT,
REGEX, DISTINCT, and ORDER BY clauses.
        </p>
        <p>Complex Queries. The complex queries were designed to address
the aforementioned limitations of the simple queries. In particular, this
category tackles the limitations with respect to the number of triple patterns,
the number of join vertices, the mean join vertex degree, the SPARQL clauses used,
and the short query execution times of the simple queries. Consequently, queries in
this category rely on at least 8 triple patterns, i.e., one more than the maximum
number (i.e., 7) of triple patterns in a simple query. The number of join vertices
ranges from 3 to 6 (average #JV = 4.3, ref. Table 1). The mean join vertex
degree ranges from 2 to 6 (average MJVD = 2.93, ref. Table 1). In addition, they
were designed to use more SPARQL clauses, especially DISTINCT, LIMIT, FILTER,
and ORDER BY. The evaluation results presented in LargeRDFBench show that
the query execution time for complex queries exceeds 10 minutes.</p>
        <p>Large Data Queries. The goal of the queries included in this category was to
test the federation engines on real large data use cases. These queries span
large datasets and involve processing large intermediate result sets (usually in
the hundreds of thousands, see mean triple pattern selectivities in Table 1) or lead to
large result sets (minimum 80459, see Table 1). The evaluation results presented
in LargeRDFBench show that the query processing time for large data queries
exceeds one hour.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Extended LargeRDFBench</title>
        <p>An important question is why LargeRDFBench needs an extension
with additional queries. The values inside the
brackets of the #RS column of Table 1 reveal one of the key problems with the original
LargeRDFBench queries. These values show the minimum number
of distinct sources required to get the complete result set. Furthermore, these
values also show the total number of distinct SERVICE clauses used in the SPARQL
1.1 version of each of the benchmark queries. According to Table 1, the majority
of the simple queries (all except S6 and S12) require only 2 distinct sources to get the
complete result set. Even in the complex queries category (i.e.,
C1-C10), the maximum number of distinct sources needed to get the complete result of
a query is only 3. Finally, in the large data queries category (i.e., L1-L8), there
is only one query which requires 3 distinct sources. All others in this category
require only two data sources.</p>
        <p>
          Overall, 24 of the 32 queries require only 2
data sources to get the complete result set. This clearly shows that
federation engines (e.g., [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]) that optimise the execution order of
SPARQL SERVICE clauses in federated SPARQL 1.1 queries cannot be fully tested with
the existing LargeRDFBench queries. This is because a query with only two SERVICE
clauses admits only two possible execution orders.
As such, even a random SERVICE ordering picks the better order with probability 0.5,
without the need for any heuristics or cost model. The goal of this extension is
to fill this gap by adding more federated queries which require more data sources
to get the complete result set. We now describe the queries which
are added to LargeRDFBench.
        </p>
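        <p>The ordering argument above can be made concrete with a short Python sketch. It assumes, for illustration only, that every permutation of the SERVICE clauses is an admissible execution order and that exactly one of them is the best; the function names are hypothetical.</p>
        <preformat>
```python
from math import factorial

def service_orderings(num_services):
    """Number of possible execution orders of the SERVICE clauses."""
    return factorial(num_services)

def random_hit_probability(num_services):
    """Chance that a uniformly random order equals one fixed best order."""
    return 1 / service_orderings(num_services)

# 2 sources as in most original queries, 4 and 10 as in the extension.
for n in (2, 4, 10):
    print(n, service_orderings(n), random_hit_probability(n))
```
        </preformat>
        <p>Under these assumptions, two SERVICE clauses give 2 orderings and a random hit probability of 0.5, whereas 4 clauses give 24 orderings and 10 clauses give 3,628,800, which is why queries with more distinct sources exercise the ordering heuristics far more.</p>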
      </sec>
      <sec id="sec-4-3">
        <title>Complex and High Data Sources Queries</title>
        <p>In this work, we added 8 additional Complex and High data sources queries
(namely CH1-CH8) to LargeRDFBench, bringing the total number of queries in the benchmark to 40.
These queries require an increasing number (from 4 to 10) of distinct data sources
to get the complete result set. In addition, the number of join vertices
and the number of triple patterns in these queries are much higher than in the existing
LargeRDFBench queries (see Table 1). Consequently, we will see in our
evaluation that the runtimes of these queries range from less than one second
to more than 1 hour. All the extended queries are given in the Appendix at the end of the paper,
and their key characteristics are discussed below.</p>
        <p>CH1: This query requires 4 LargeRDFBench data sources<fn><label>6</label><p>Please see the LargeRDFBench home page for the datasets included in the benchmark.</p></fn>, i.e., DBpedia, New
York Times, GeoNames, and Semantic Web Dog Food, to get the complete result
set. The query has 16 triple patterns and a result size of 384. The key
characteristic of this query is its high mean join vertex degree (i.e., 4.2, ref. Table
1). This means that the number of incoming and outgoing edges of a join node
is high compared to other queries in the benchmark. In other words, the number
of triple patterns in a single SERVICE would be relatively high. A smart query
planner can combine a set of triple patterns and send them all to the relevant
source as a single group. The federation engine then does not need to perform the join between
these triple patterns. Rather, the join can be migrated to the relevant source (i.e., the
SPARQL endpoint), which can greatly improve the runtime performance by
dividing the load between the endpoints and the federation engine. On the other hand,
a good estimation of the cardinality of joins over multiple triple patterns can be
particularly challenging. A wrong estimation can lead to a wrong query execution
plan. As we will see in our evaluation results, the runtime for this query ranges
from 1 second to over 1 hour for the different federation engines.</p>
        <p>CH2: This query requires 4 LargeRDFBench data sources, i.e., DBpedia,
DrugBank, KEGG, and ChEBI, to get the complete result set. The query has
10 triple patterns and a result size of 840. The key characteristic of this query is
the low mean triple pattern selectivity (i.e., 0.00005) combined with high BGP-restricted
(i.e., 0.2595) and join-restricted (i.e., 0.58115) triple pattern selectivities. This
means that the triple patterns of this query are selective, i.e., they can have
small result sizes. However, they are less selective when involved in joins with
other triple patterns. Since the triple patterns are selective, choosing the right
join order, one which quickly converges to a smaller result size, is particularly crucial.
The FILTER combined with REGEX makes this query particularly selective.</p>
        <p>CH3: This query requires 5 data sources, i.e., DBpedia, DrugBank, KEGG,
ChEBI, and Linked TCGA-A, to get the complete result set. The query has
11 triple patterns and a result size of 48. It has a relatively high number
of join vertices (i.e., 7 for 11 triple patterns), a moderate join vertex degree (i.e.,
2.71), and a low mean BGP-restricted triple pattern selectivity (i.e., 0.0196). This
query is an example of mixed values for the important query
features and can challenge the federation engines with this mix.</p>
        <p>CH4: This query requires 6 LargeRDFBench data sources, i.e., DBpedia,
Semantic Web Dog Food, GeoNames, New York Times, Jamendo, and Linked MDB,
to get the complete result set. The query has
12 triple patterns and a result size of 1248. It has a very high number of
join vertices (i.e., 10 join vertices for 12 triple patterns). This means that the
join order optimisation of this query can be particularly challenging, because the query contains
many joins, each involving few triple patterns.</p>
        <p>CH5: This query requires 7 data sources, i.e., DBpedia, Linked TCGA-A,
DrugBank, GeoNames, New York Times, Jamendo, and Linked MDB, to retrieve the
complete result set. The query has 18 triple patterns and, because of its LIMIT clause, a result
size of 5. The triple patterns in the aforementioned queries
mostly have unbound subjects and objects and a bound predicate. Unlike the previous
queries, the key characteristic of this query is the number of bound subjects and
objects in its triple patterns. There are 4 triple patterns for which the subject
is bound and 2 triple patterns for which the object is bound. There is also one triple
pattern with an unbound predicate. Thus, for this query it can be challenging
to accurately estimate the cardinalities of the triple patterns and of the joins, due
to the bound subjects and objects as well as the unbound predicate.</p>
        <p>CH6: This query requires 8 LargeRDFBench data sources, i.e., Linked
TCGA-A, DBpedia, DrugBank, KEGG, GeoNames, New York Times, Jamendo, and
Linked MDB, to get the complete result set. The query has
24 triple patterns and a result size of 16. Similar to CH2, the key characteristic of this query is
the low mean triple pattern selectivity (i.e., 0.00002) combined with high BGP-restricted
(i.e., 0.2522) and join-restricted (i.e., 0.3186) triple pattern selectivities. This
query also contains triple patterns with bound subjects and objects.</p>
        <p>CH7: This query requires 9 LargeRDFBench data sources, i.e., Linked TCGA-A,
DBpedia, DrugBank, KEGG, GeoNames, Semantic Web Dog Food, New York
Times, Jamendo, and Linked MDB, to get the complete result set. The query has
21 triple patterns and, because of its LIMIT
clause, a result size of 775. There are a total of 14 join nodes in this query: 5 star, 3 path, 4
sink, and 2 hybrid join nodes. This query can be particularly challenging because,
with this many join nodes, join ordering is not a trivial task.</p>
        <p>CH8: This query requires 10 data sources, i.e., Linked TCGA-A, DBpedia,
DrugBank, KEGG, GeoNames, Semantic Web Dog Food, New York Times,
Jamendo, ChEBI, and Linked MDB, to get the complete result set. This query
contains the highest number of triple patterns among the benchmark queries
(i.e., 33). The result size of this query is only 1. There are a total of 19 join
nodes in this query: 7 star, 4 path, 6 sink, and 2 hybrid join nodes. This
query also contains OPTIONAL, FILTER, and LIMIT. It is one of the most complex
queries of the benchmark, and our evaluation (Section 5) shows that none of the
federation engines is able to execute it within the timeout limit of 1 hour.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>In this section, we evaluate state-of-the-art SPARQL query federation systems
using the extended queries added to LargeRDFBench. We first describe
our experimental setup in detail. Then, we present our evaluation results. All
data used in this evaluation can be found on the benchmark homepage.</p>
      <sec id="sec-5-1">
        <title>Experimental Setup</title>
        <p>
          We reused the experimental setup of the original LargeRDFBench evaluation.
In summary, LargeRDFBench contains a total of 13 real-world datasets. Each of
the datasets was loaded into a Virtuoso 7.1 instance. Each of the 13 Virtuoso SPARQL
endpoints used in our experiments was installed on a separate machine. The
specification of each of the machines is exactly the same as in the original
LargeRDFBench. We ran the extended query experiments on a clustered server
with 32 physical CPU cores of 2.10GHz each and a total RAM of 512GB. Each
of the 13 Virtuoso SPARQL endpoints used in our experiments was started as a
separate instance on the clustered server. The federation engines were also run on
the same machine. We set the maximal amount of memory for each of the
federation engines to 128GB. Experiments were conducted on local copies of the Virtuoso
SPARQL endpoints with number of buffers 1360000, maximum dirty buffers
1000000, number of server threads 20, result set maximum rows 100,000,000,000,
and maximum SPARQL endpoint query execution time of 6000,000,000 seconds.
The query timeout was set to 1 hour.
        </p>
        <p>
          Seven SPARQL endpoint federation engines
(versions available as of May 2018) were compared on all of the 8 extended benchmark queries: FedX [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], SPLENDID [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
ANAPSID [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], FedX+HiBISCuS [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], SPLENDID+HiBISCuS [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], SemaGrow
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], and CostFed [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We used all of
the performance metrics used in the original LargeRDFBench except for the
number of endpoint requests, which has been shown not to necessarily correlate with
the overall query runtimes [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. We used: (1) the total number of triple
pattern-wise (TPW) sources selected during source selection, (2) the total number
of SPARQL ASK requests submitted to perform (1), (3) the completeness
(recall) and correctness (precision) of the query result set retrieved, (4) the average
source selection time, and (5) the average query execution time.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Experimental Results</title>
        <p>Efficiency of Source Selection. Similar to LargeRDFBench, we define efficient
source selection in terms of: (1) the total number of triple pattern-wise sources
selected (#T), (2) the total number of SPARQL ASK requests (#AR) used to
obtain (1), and (3) the source selection time (SST). Tables 2 and 3 show the results
of these three metrics for the selected approaches.</p>
        <p>
          Overall, CostFed (total 151 #TP, ref. Table 2) is the most efficient approach
in terms of the smallest number of total TPW sources selected, followed by HiBISCuS
(total 153 #TP) and ANAPSID (total 161 #TP). These are followed by
FedX, SPLENDID, and SemaGrow, with 356 #TP each. Interestingly,
ANAPSID, which was the best approach in terms of #TP for the original LargeRDFBench,
ranked third in our proposed extension. The reason is that, as the number
of triple patterns in the queries increases, efficient source selection becomes
more difficult. The SSGM heuristics [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] used in ANAPSID may not work
as efficiently with an increasing number of triple patterns. CostFed, HiBISCuS,
and ANAPSID are equally the best approaches in terms of the number of ASK
requests used (i.e., 0 for all queries). Notably, FedX (cold), i.e., without cache,
used a total of 1872 ASK requests. This is because FedX (cold) needs to send a
request for each triple pattern to each of the 13 SPARQL endpoints (hosted on
separate machines). In terms of the source selection time, FedX (warm), i.e., with
cache, is the fastest approach (avg. 6 msec, ref. Table 3), followed by CostFed
(avg. 39 msec), SemaGrow (avg. 54 msec), HiBISCuS (avg. 195 msec), FedX
(cold) (avg. 239 msec), and ANAPSID (avg. 322 msec). The results clearly
suggest that the FedX source selection is greatly improved by caching.
        </p>
        <p>
          Query Runtime. Table 3 (column RT) shows the query runtime performance
of the selected federation engines. As an overall performance evaluation, it is
rather hard to rank the selected engines, as there are many timeouts and runtime or
parse errors. This suggests that the selected federation engines are not that stable when
tested with queries that contain more triple patterns and require collecting results
from more sources (i.e., more than 3). CostFed has the smallest query runtime
for CH1 (i.e., 800 msec), while the same query times out for ANAPSID. For CH2,
ANAPSID has the smallest query runtime (i.e., 101 min), while the same query
almost times out for FedX (i.e., 55 min). For CH3, SemaGrow is the fastest, while
both CostFed and ANAPSID give zero results. Only ANAPSID and SemaGrow
are able to get complete results for CH4. For CH5, SPLENDID+HiBISCuS is the
fastest (i.e., 15 sec), while SemaGrow and FedX time out. For CH6, CostFed is the
fastest (i.e., only 173 msec), while SemaGrow times out. Note that for this query
FedX gives 1 result and CostFed gives 4 results, while the actual result size is 16. For
CH7, ANAPSID is the fastest (i.e., 2.4 sec), while FedX and SemaGrow time out.
All of the selected engines exceed the 1-hour timeout for CH8.
        </p>
        <p>In conclusion, the above results clearly suggest that the extended
LargeRDFBench queries can be extremely costly, or can be executed extremely fast when a properly
optimised query plan is selected. However, the number of timeouts and runtime
errors suggests that choosing optimised query plans for these queries is not
a trivial task. The results also revealed that FedX, CostFed, and ANAPSID can return
incomplete or zero results.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>This work was supported by the project HOBBIT, which has received funding
from the European Union's H2020 research and innovation action program (GA
number 688227). Also, this publication has emanated from research supported
in part by a research grant from Science Foundation Ireland (SFI) under Grant
Number SFI/12/RC/2289, co-funded by the European Regional Development
Fund.</p>
    </sec>
    <sec id="sec-7">
      <title>Appendix A: Queries</title>
      <sec id="sec-7-1">
        <title>LargeRDFBench (new) queries</title>
        <p>PREFIX owl: &lt;http://www.w3.org/2002/07/owl#&gt;
PREFIX tcga: &lt;http://tcga.deri.ie/schema/&gt;
PREFIX kegg: &lt;http://bio2rdf.org/ns/kegg#&gt;
PREFIX dbpedia: &lt;http://dbpedia.org/ontology/&gt;
PREFIX dbr: &lt;http://dbpedia.org/resource/&gt;
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt;
PREFIX geonames: &lt;http://www.geonames.org/ontology#&gt;
PREFIX nytimes: &lt;http://data.nytimes.com/elements/&gt;
PREFIX linkedmdb: &lt;http://data.linkedmdb.org/resource/movie/&gt;
PREFIX linkedmdbr: &lt;http://data.linkedmdb.org/resource/&gt;
PREFIX purl: &lt;http://purl.org/dc/elements/1.1/&gt;
PREFIX bio2rdf: &lt;http://bio2rdf.org/ns/bio2rdf#&gt;
PREFIX chebi: &lt;http://bio2rdf.org/ns/chebi#&gt;
PREFIX swc: &lt;http://data.semanticweb.org/ns/swc/ontology#&gt;
PREFIX eswc: &lt;http://data.semanticweb.org/conference/eswc/&gt;
PREFIX swcp: &lt;http://data.semanticweb.org/person/&gt;
PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX dbp: &lt;http://dbpedia.org/property/&gt;
PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX drugbank: &lt;http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/&gt;
PREFIX drug: &lt;http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/&gt;</p>
        <p>SELECT *
WHERE
{
?place geonames:name ?countryName ;
geonames:countryCode ?countryCode ;
geonames:population ?population ;
geo:long ?longitude ;
geo:lat ?latitude ;
owl:sameAs ?geonameplace .
?geonameplace dbpedia:capital ?capital ;
dbpedia:anthem ?nationalAnthem ;
dbpedia:foundingDate ?foundingDate ;
dbpedia:largestCity ?largestCity ;
dbpedia:ethnicGroup ?ethnicGroup ;
dbpedia:motto ?motto .
?role swc:heldBy ?writer .
?writer foaf:based_near ?geonameplace .
?dbpediaCountry owl:sameAs ?geonameplace ;
nytimes:latest_use ?dateused
}
ORDER BY DESC (?population)</p>
        <p>SELECT DISTINCT ?drug ?drugBankName ?keggmass ?chebiIupacName
WHERE
{
?dbPediaDrug rdf:type dbpedia:Drug .
?dbPediaDrug dbpedia:casNumber ?casNumber .
?drugbankDrug owl:sameAs ?dbPediadrug .
?drugbankDrug drugbank:keggCompoundId ?keggDrug .
?keggDrug bio2RDF:mass ?keggmass .
?drug drugbank:genericName ?drugBankName .
?chebiDrug purl:title ?drugBankName .
?chebiDrug chebi:iupacName ?chebiIupacName .
?drug drugbank:inchiIdentifier ?drugbankInchi .
?chebiDrug bio2RDF:inchi ?chebiInchi .
FILTER REGEX (?chebiIupacName, "adenosine")
}</p>
        <p>CH4</p>
        <p>SELECT DISTINCT *
WHERE
{
?uri tcga:bcr_patient_barcode ?patient .
?patient ?p ?country .
?country dbpedia:populationDensity 32 .
?nytimesCountry owl:sameAs ?country ;
nytimes:latest_use ?dateused ;
owl:sameAs ?geonames .
?artist foaf:based_near ?geoname ;
foaf:homepage ?homepage .
?director dbpedia:nationality ?dbpediaCountry .
?film dbpedia:director &lt;dbr:Michael_Haussman&gt; .
?x owl:sameAs ?film .
?x linkedmdb:genre ?genre .
?patient tcga:bcr_drug_barcode ?drugbcr .
?drugbcr tcga:drug_name ?drugName .
drug:DB00441 drug:genericName ?drugName .
drug:DB00441 drugbank:indication ?indication .
drug:DB00441 drugbank:chemicalFormula ?formula .
drug:DB00441 drugbank:keggCompoundId ?compound .
}
LIMIT 5</p>
      </sec>
      <sec id="sec-7-2">
        <title>LargeRDFBench (new) queries (continued)</title>
        <p>SELECT ?patient ?country ?articleCount ?chemicalStructure ?id
WHERE
{
&lt;http://tcga.deri.ie/TCGA-43-2576&gt; tcga:bcr_patient_barcode ?patient .
?patient tcga:gender "FEMALE" .
?patient dbpedia:country ?country .
?country dbpedia:populationDensity ?popDen .
?nytimesCountry owl:sameAs ?country ;
nytimes:latest_use ?latestused ;
nytimes:number_of_variants ?totalVariants ;
nytimes:associated_article_count ?articleCount ;
owl:sameAs ?geonames .
swcp:christian-bizer foaf:based_near ?geoname ;
foaf:homepage ?homepage .
?director dbpedia:nationality ?dbpediaCountry .
dbr:The_Last_Valley dbpedia:director ?director .
?x owl:sameAs dbr:The_Last_Valley .
?x linkedmdb:genre linkedmdbr:film_genre/4 .
?patient tcga:bcr_drug_barcode ?drugbcr .
?drugbcr tcga:drug_name "Cisplatin" .
?drgBnkDrg drugbank:inchiKey ?inchiKey .
?drgBnkDrg drugbank:meltingPoint ?meltingPoint .
?drgBnkDrg drugbank:chemicalStructure ?chemicalStructure .
?drgBnkDrg drugbank:casRegistryNumber ?id .
?keggDrug rdf:type kegg:Drug .
?keggDrug bio2rdf:xRef ?id .
?keggDrug purl:title "Follitropin alfa/beta" .
}</p>
        <p>SELECT DISTINCT * WHERE {
?uri tcga:bcr_patient_barcode ?patient .
?patient tcga:consent_or_death_status ?deathStatus .
?patient dbpedia:country ?country .
?country dbpedia:areaMetro ?areaMetro .
?nytimesCountry owl:sameAs ?country ;
nytimes:search_api_query ?apiQuery ; owl:sameAs ?location .
?artist foaf:based_near ?location ; foaf:firstName ?firstName .
?director dbpedia:spouse ?spouse .
?film dbpedia:director ?director .
?x owl:sameAs ?film .
?x linkedmdb:runtime ?runTime .
?patient tcga:bcr_drug_barcode ?drugbcr .
?drugbcr tcga:drug_name ?drugName .
?drgBnkDrg drugbank:casRegistryNumber ?id .
?drgBnkDrg drugbank:brandName ?brandName .
?keggDrug bio2rdf:xRef ?id ; bio2rdf:mass ?mass .
?keggDrug bio2rdf:synonym ?synonym .
?chebiDrug purl:title ?drugName . } LIMIT 775</p>
        <p>CH8</p>
        <p>SELECT * WHERE {
?uri tcga:bcr_patient_barcode ?patient .
?patient tcga:gender ?gender .
?patient dbpedia:country ?country .
?country dbpedia:populationDensity ?popDensity .
?nytimesCountry owl:sameAs ?country ; nytimes:latest_use ?latestused ;
nytimes:number_of_variants ?totalVariants ;
nytimes:associated_article_count ?articleCount ;
owl:sameAs ?geonames .
?role swc:isRoleAt eswc:2010 .
?role swc:heldBy ?author .
?author foaf:based_near ?geoname .
?artist foaf:based_near ?geoname ; foaf:homepage ?homepage .
?director dbpedia:nationality ?dbpediaCountry .
?film dbpedia:director ?director .
?x owl:sameAs ?film .
?x linkedmdb:genre ?genre .
?patient tcga:bcr_drug_barcode ?drugbcr .
?drugbcr tcga:drug_name ?drugName .
?drgBnkDrg drugbank:inchiKey ?inchiKey .
?drgBnkDrg drugbank:meltingPoint ?meltingPoint .
?drgBnkDrg drugbank:chemicalStructure ?chemicalStructure .
?drgBnkDrg drugbank:casRegistryNumber ?id .
?keggDrug rdf:type kegg:Drug ; bio2rdf:xRef ?id .
?keggDrug purl:title ?title .
?chebiDrug purl:title ?drugName .
?chebiDrug chebi:iupacName ?chebiIupacName .
OPTIONAL {
?drgBnkDrg drugbank:inchiIdentifier ?drugbankInchi .
?chebiDrug bio2rdf:inchi ?chebiInchi .
FILTER (?drugbankInchi = ?chebiInchi) } } LIMIT 1</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Acosta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vidal</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lampo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castillo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruckhaus</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints</article-title>
          . In: ISWC (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Aluç</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartig</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Özsu</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daudjee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Diversified stress testing of rdf data management systems</article-title>
          .
          <source>In: ISWC</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Arenas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutierrez</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <source>On the Semantics of SPARQL</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Charalambidis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>SemaGrow: Optimizing federated sparql queries</article-title>
          .
          <source>In: SEMANTICS</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Gorlitz,
          <string-name>
            <given-names>O.</given-names>
            ,
            <surname>Staab</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.:</surname>
          </string-name>
          <article-title>SPLENDID: SPARQL Endpoint Federation Exploiting VoID Descriptions</article-title>
          . In: COLD ISWC (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Gorlitz,
          <string-name>
            <given-names>O.</given-names>
            ,
            <surname>Thimm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Staab</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          :
          <article-title>Splodge: systematic generation of sparql benchmark queries for linked open data</article-title>
          .
          <source>In: ISWC</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hasnain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehmood</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sana e Zainab</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saleem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warren</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zehra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rebholz-Schuhmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Biofed: federated query processing over life sciences linked open data</article-title>
          .
          <source>JBMS</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hasnain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehmood</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>e Zainab</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Sportal: Profiling the content of public sparql endpoints</article-title>
          .
          <source>IJSWIS</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Saleem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasnain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.C.N.</given-names>
          </string-name>
          :
          <article-title>Largerdfbench: a billion triples benchmark for sparql endpoint federation</article-title>
          .
          <source>Journal of Web Semantics</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Saleem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasnain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ermilov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.C.N.</given-names>
          </string-name>
          :
          <article-title>A fine-grained evaluation of sparql endpoint federation systems</article-title>
          .
          <source>Semantic Web Journal</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Saleem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Ngonga</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.C.</surname>
          </string-name>
          :
          <article-title>HiBISCuS: Hypergraph-based source selection for SPARQL endpoint federation</article-title>
          .
          <source>In: ESWC</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Saleem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potocki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soru</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartig</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Voigt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.C.N.</given-names>
          </string-name>
          :
          <article-title>Costfed: Cost-based query optimization for sparql endpoint federation</article-title>
          .
          <source>In: SEMANTICS</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Schmidt</surname>
          </string-name>
          , et al.:
          <article-title>FedBench: A Benchmark Suite for Federated Semantic Data Query Processing</article-title>
          . In: ISWC (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Schwarte</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haase</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schenkel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>FedX: Optimization Techniques for Federated Query Processing on Linked Data</article-title>
          . In: ISWC (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Yannakis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fafalios</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tzitzikas</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Heuristics-based query reordering for federated queries in sparql 1.1 and sparql-ld</article-title>
          . In: QuWeDa at ESWC (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>