<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>S. Cheng); http://olafhartig.de/ (O. Hartig)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Cost Model to Optimize Queries over Heterogeneous Federations of RDF Data Sources</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sijin Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olaf Hartig</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Linköping University</institution>
          ,
          <addr-line>SE-58183 Linköping</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Federated processing of queries over RDF data sources ofers significant potential when a SPARQL query cannot be answered by a single data source alone. However, finding eficient plans to execute a query over a federation is challenging, especially if diferent federation members provide diferent types of data access interfaces. Diferent interfaces imply diferent request types, diferent forms of responses, and diferent physical algorithms that can be used, each of which consumes varying amounts of resources during query execution. This heterogeneity poses additional obstacles to the task of planning query executions, in addition to the inherent complexity arising from numerous possible join orderings and various physical algorithms. As a first step to address these challenges, we propose a cost model that captures the resource requirements of diferent operators depending on the type of federation member, allowing us to estimate cost of a given query execution plan without actually executing it. To evaluate our approach, we conduct experiments on FedBench with our cost model and compare it to the current state-of-the-art approach to query planning for heterogeneous federations of RDF data sources.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Heterogeneous Federations</kwd>
        <kwd>Cost Model</kwd>
        <kwd>Query Optimization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        For queries that cannot be answered by accessing a single data source, a federation of RDF data
sources becomes essential to collectively answer a query. Processing such a query is challenging
as not all parts of the query need to be issued to each federation member. Efectively grouping
the subqueries and determining which parts are to be issued to which of the federation members
may impact the overall query performance significantly, and the order in which the results of
the subqueries are joined is another important performance-related factor. Over the past decade,
a number of approaches have been proposed to improve the performance of processing such
queries (e.g., [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6">1, 2, 3, 4, 5, 6</xref>
        ]). One line of work is focused on the source selection phase, which
aims to split a given SPARQL query into subqueries that can be assigned to federation members.
The objective is to ensure that each subquery is assigned to only those federation members that
can provide a nonempty result (e.g., [
        <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
        ]). Another research direction involves query planning
heuristics, which are typically used to determine the join order in logical plans. Furthermore,
some studies propose cost models to determine optimal physical query plans (e.g., [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]).
      </p>
      <p>
        All these approaches focus only to homogeneous federations, assuming that all federation
members provide a SPARQL endpoint interface. In recent years, however, other types of
interfaces were proposed, including the Triple Pattern Fragment (TPF) interface [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the
BindingsRestricted TPF (brTPF) interface [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the SaGe interface [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and the smart-KG interface [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In
the federated setting, each federation member may support a diferent type of interface to access
its RDF dataset. The resulting heterogeneity of federations poses extra challenges that render
many of the existing source selection approaches, heuristics, and cost models inadequate [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
These challenges are primarily due to the features of diferent interfaces. For instance, diferent
interfaces may require (or enable!) federation engines to leverage specific physical operators,
not all forms of subqueries can be answered directly by every interface, and diferent interfaces
may provide diferent forms of metadata relevant for query processing.
      </p>
      <p>
        Recently, Heling and Acosta proposed a query planner that is designed to address these
challenges of query processing over heterogeneous data sources [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. This query planner builds
left-deep query plans, for which it first determines a join order based on cardinality estimation.
Thereafter, the planner selects join algorithms, choosing between a symmetric hash join and
bind join, where the choice is made based on the estimated number of requests needed to execute
the join. We argue that deciding the join order first and determining the physical algorithm
afterwards might overlook the optimal plan, as the order becomes fixed without considering
any features of federation members. Moreover, we believe that, when determining the physical
algorithm, it is crucial to consider not only the number of requests, but also factors such as the
size of data to be transferred and the amount of work that is processed by the engine.
      </p>
      <p>
        In this paper, we propose a cost model that enables to estimate resource usage in physical
plans, considering factors such as the number of requests, the size of data transferred, and the
amount of work need to be done by the federation engine. Our cost model considers diferent
features that federation members have when using diferent types of interfaces. To evaluate the
efectiveness of the proposed cost model, we employ it in a greedy plan-enumeration algorithm
similar to the one used by Heling and Acosta’s approach [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and, then, compare the approaches
experimentally based on the FedBench benchmark. The experiments show that applying our
cost model can result in a plan that requires less data to be transferred, without a significant
increase in query execution time. In particular, for queries with UNION operators in subqueries
after source selection, our cost model is able to find a plan that requires significantly less data
to be transferred, resulting in less query execution time with higher completeness of results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminary</title>
      <p>
        This section introduces concepts, notations and a running example used in the rest of the paper.
To represent query plans in this paper we use the FedQPL language as introduced in our earlier
work [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The syntax of FedQPL expressions is defined as follows.
      </p>
      <p>
        Definition 1. A FedQPL expression  is constructed from the following grammar, where req,
tpAdd, bgpAdd, join, union, mj, mu, (, ), are terminal symbols.  is an expression in the request
language req of some interface [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] (e.g., triple patterns, whole SPARQL patterns), fm is a
federation member,  is a triple pattern,  is a BGP, and Φ is a nonempty set of FedQPL expressions.
      </p>
      <p>
        ::= reqfm | tpAddfm ( ) | bgpAddfm( ) | join(,  ) | union(,  ) | mj Φ | mu Φ |
The first operator, req, captures the intention to retrieve the result of a certain (sub)query 
from a given federation member. The unary operator tpAdd captures the intention to access a
federation member to obtain solution mappings for a single triple pattern that must be compatible
with solution mappings obtained from the plan represented by the given subexpression. The
operator bgpAdd is a BGP-based variation of tpAdd. In contrast to these operators, join is a
binary operator that joins two inputs, capturing the intention to get the input sets of solution
mappings independently, and then join them in the query federation engine. As for the remaining
operators, union lifts the standard SPARQL algebra operator union into the FedQPL language,
whereas mj and mu are multiway variations of join and union to capture the intention to apply
a multiway algorithm that can combine an arbitrary number of inputs. For a formal definition
of the semantics of these operators, refer to our earlier work [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>Example 1. As our running example, consider a basic graph pattern (BGP) ex = {1, 2}
with 1 = (?, foaf:name, "Alice") and 2 = (?, dbo:country, dbr:Canada), and a
heterogeneous federation ex with three members, denoted by fm1, fm2, and fm3. While federation
members fm1 and fm2 provide a SPARQL endpoint interface, fm3 provides a TPF interface.</p>
      <p>
        The following fragment of FedQPL captures the output of source selection processes [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
Definition 2. A source assignment is a FedQPL expression that is constructed using only
the operators req, mj, and mu such that, for each subexpression of the form reqfm, it holds that
 is a triple pattern or a BGP.
      </p>
      <p>Example 2. For the running example (in Example 1), we assume that members fm1 of our
example federation ex can contribute matching triples for 1 of the example BGP ex (cf.
Example 1), whereas 2 is matched in the data of fm2 and fm3. Hence, we may use the following
source assignment ex for ex over ex.</p>
      <p>mj{︀ reqfm11 , mu{reqfm22 , reqfm23 } ︀}
(1)</p>
      <p>
        A left-deep logical plan (see Figure 1a) can be converted from source assignment (Equation 1)
by converting multiway join (mj) to binary join directly. Further from a logical plan to a
physical plan, physical algorithms for each operator can be established directly based on a
logical plan. Alternatively, the physical plan can be reconstructed by considering the possible
physical algorithms for each operator in a combined manner. The physical algorithms that can
be used for each logical operator of FedQPL depend on the type of interface. Table 1 (left-hand
side) lists the physical algorithms that we consider in this work. For the unary operators
tpAddfm ( ) and bgpAddfm( ), the operator accesses federation members with requests where
input solution mappings are shipped as part of the requests. A straightforward algorithm to
implement a unary operator is to use an index-nested loop join (index-NLJ). The input solution
bindings are iterated for creating requests with updated  or , where the variables are replaced
(a) A logical plan
(b) Physical plan 1
(c) Physical plan 2
with obtained variable bindings. If the federation member provides a TPF interface, only the
index-NLJ can be applied as TPF interface restricts the type of queries that can be answered by
the server to single triple patterns. If the federation member uses a brTPF interface, a set of
input solution mappings (variable bindings for the shared variable) can be attached to a brTPF
request. While federation members support the SPARQL endpoint, as further alternatives, the
FILTER, UNION and FILTER operators can be used for shipping the input bindings along with
the query pattern as requests [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], denoted as bind join (BJ). The binary operator join( 1,  2)
joins input solution mappings in the query federation engine, so the physical algorithm is not
afected by the interface type that the federation member uses. In this paper, the symmetric
hash join (SHJ) (see [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]) is used for implementing join by default, while there are other options,
such as hash join and merge join.
      </p>
      <p>Example 3. For the running example, the logical operator join can be implemented via SHJ
(Figure 1b) or rewriting the logical plan to push join into the union (Figure 1c). In the latter case,
tpAddfm22 can use an index-NLJ or any variations of the bind join, while tpAddfm23 can only
apply index-NLJ as fm3 provides a TPF interface. Regarding query execution, subplans of binary
operators (e.g., SHJ or union) may be executed in parallel, for example, tpAddfm22 and tpAddfm23
of Figure 1c can consume the intermediate results of reqfm11 at the same time.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Cost Model</title>
      <p>Each physical query plan consumes a specific amount of diferent kinds of resources when it is
executed, and these resource requirements can difer significantly for diferent possible plans
for the same query. In this section, we present a cost model that is designed to capture these
resource requirements and can be used to estimate the cost of a given plan without actually
executing the plan. In this cost model, the (estimated) cost of a query plan is calculated as
a weighted sum of multiple metrics, and each of these metrics is calculated by summing up
corresponding measures from all operators in the plan. In the following, we first define the
metrics based on which the cost of a physical plan is captured in our cost model, and introduce
aaaaa</p>
      <p>Physical Algorithm
index-NLJ
BJ( UNION )
BJ( FILTER )
BJ( VALUES )
index-NLJ
index-NLJ
brTPF-based BJ
index-NLJ
BJ( UNION )
BJ( FILTER )
BJ( VALUES )
request
TPF request
request
SHJ
Union
⌈︁ |sols( )| ⌉︁
blockSize
|sols( )| · ⌈︁  ⌉︁</p>
      <p>|sols( )|
⌈︁ |sols( )| ⌉︁ ⌈︂  ︂⌉</p>
      <p>blockSize · ⌈ b|slooclsk(Siz)e| ⌉
|sols( )|
⌈︁ |sols( )| ⌉︁</p>
      <p>blockSize
1
⌈︁ (,fm) ⌉︁</p>
      <p>pageSize
functions for measuring each of these metrics for each physical operator. Thereafter, we present
how these cost metrics are used to calculate the overall cost for an entire physical plan.</p>
      <sec id="sec-3-1">
        <title>3.1. Cost Metrics</title>
        <p>The metrics of our cost model are defined in terms of the costs of communication between the
federation engine and the federation members and the processing costs induced at the federation
engine. The communication cost considers both the request process and the response process,
including the number of requests (#requestsest), the size of shipped request data for all requests
together (#reqDataest), and the size of relevant responses all together (#respRDFterms est).
The metric for capturing the processing cost by the federation engine, denoted as fedProcessest,
is estimated based on the cardinality of intermediate results of the query plan. Since binary
operators, join and union, do not interact with the federation members, there is no associated
communication cost involved. Therefore, the primary cost incurred by binary operations is the
processing cost by the federation engine.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Number of requests (#requestsest)</title>
          <p>To estimate the number of requests involved during query execution, it is important to note
that this number depends not only on the physical algorithm but also on the interface type of
the accessed federation member. In other words, the same physical algorithm may necessitate
a diferent number of requests if the federation member uses diferent interfaces. Therefore,
we provide a set of functions for measuring #requestsest for each combination of physical
algorithm and interface type of federation members (illustrated in the right part of Table 1)
Logical Operator Interface of fm #respRDFtermsest
tpAddfm( ) STPPAF,RbQrTLPeFndpoint |3v·ajrosi(nCa)|r d·(join, Cfmar,ds(ols(, f m)), sols( ))
bgpAddfm( ) SPARQL endpoint |vars()| · joinCard(, fm, sols( ))
reqfm STPPAF,RbQrTLPeFndpoint |3v·ars(()| ·,fm)(, fm)
reqfm SPARQL endpoint |vars()| · (, fm)</p>
          <p>For the reqfm operator with  a BGP , the number of requests is fixed to one, since this
operator is considered valid only when the federation member employs an interface that supports
BGP requests (e.g., SPARQL endpoints). For the reqfm operator with  a triple pattern , the
federation member can use any type of interface that supports triple pattern requests (e.g.,
SPARQL endpoints, TPF servers, and brTPF servers). If the federation member provides a
SPARQL endpoint, the number of requests is one whereas, for TPF servers and brTPF servers,
the number of requests depends on both the cardinality of the results of reqfm and the page size
since TPF and brTPF interfaces employ a paging mechanism. The cardinality of the results of
reqfm , denoted as card(, fm), represents the number of triples in the RDF dataset of fm that
match the given triple pattern . The page size (pageSize) refers to the maximum number of
triples that can be retrieved per request from a TPF server or a brTPF server and is typically
set to 100. Consequently, the number of requests can be calculated as the ceiling value of the
division between the cardinality and the page size: ⌈card(, fm)/pageSize⌉.</p>
          <p>The bgpAddfm( ) operator can be implemented using some physical operators: the
indexnested loop join (index-NLJ) operator or the bind join (BJ) operator. If the bgpAdd operator is
implemented via an index-NLJ algorithm, the number of requests (#requestsest) is primarily
influenced by the number of solution mappings that the operator receives as input from its
subplan  , represented as |sols( )|. During execution, for each input solution mapping  , the
join variables (i.e. joinVars(,  )) of the BGP  are substituted with the bindings from  ,
and then a new BGP ′ is sent to fm. Hence, each solution mapping causes one request and,
thus, the number of requests #requestsest is equivalent to |sols( )|. Alternatively, the BJ
algorithm can be applied to reduce the number of requests by consolidating multiple bindings
together with the BGP  into a single request. The number of bindings that can be attached
for one request is determined based on the server capabilities, denoted as blockSize. Hence,
the number of requests depends on the total number of input solution mappings and page size:
⌈|sols( )|/blockSize⌉.</p>
          <p>For the tpAddfm ( ) operator, the number of requests is similar to the bgpAdd operator in
case the federation member provides a SPARQL endpoint. However, if the federation member
is a TPF server, only the index-NLJ algorithm can be used. Given that TPF servers adopt a
paging mechanism, the total number of pages is at least equal to the number of input solution
mappings, as each such solution mapping results in at least one request. In cases in which all
matching triples cannot be retrieved with a single page, the total number of pages with all
responses together can be estimated based on the join cardinality (joinCard) divided by the page
size (pageSize).</p>
          <p>= max
︂{
|sols( )| ,
︂⌈ joinCard(, fm, sols( )) ⌉︂}
pageSize</p>
          <p>Furthermore, the number of input solution mappings is |sols( )| and each such solution
mapping results in at least one request. In our analysis, we make an assumption that the
matching triples are evenly distributed for each of such input solution mapping, implying that
each request is expected to yield an equivalent number of pages. Consequently, the estimated
number of pages for each input solution mapping can be calculated as Equation 3. Hence,
the total number of requests for tpAddfm ( ) with a TPF server fm is estimated as Equation 4.
|sols( )| ·   
   =
︂⌈    ⌉︂
|sols( )</p>
          <p>|</p>
          <p>If federation members provide a brTPF interface, an index-NLJ algorithm or brTPF algorithm
can be applied. In case index-NLJ is applied, the total number of requests is the same as for
the TPF interface. If the brTPF algorithm is applied, a set of input solution mappings can be
attached to a brTPF request. The number of bindings that can be attached to each request is
defined as block size ( blockSize). Under the even distribution assumption, the total number of
requests for the brTPF algorithm is estimated as follows.</p>
          <p>(3)
⎡
· ⎢
⎢
⎢
(2)
(4)
(5)
︂⌈ |sols( )| ⌉︂
blockSize
totalPageNum
⌈︁ |sols( )| ⌉︁
blockSize
⎤
⎥
⎥
⎥</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Total size of shipped request data (#reqDataest)</title>
          <p>Regarding the triple pattern request operator, the total size of shipped request data is estimated
as the number of RDF terms and variables in all responses together (listed in the third column of
#reqDataest is determined
as 3, taking into account the three components of the triple pattern. In the case of a basic
graph pattern (), the value of #reqDataest becomes 3 multiplied by the number of triple
patterns (countTP()) within the basic graph pattern .</p>
          <p>Concerning the tpAddfm ( ) operator, the size of shipped request data depends on both the
number of input solution mappings and variables that are common across the given subquery
with input solution mappings, denoted by joinVars(,  ). Diferent physical algorithms exhibit
diferent total sizes of shipped request data. Specifically, for the index-NLJ algorithm, each
request contains a single triple pattern where the join variables are substituted by bindings of
the input solution mappings. Hence, the size of shipped request data of all requests is three times
the number of input solution mappings, 3 · | sols( )|. For the BJ, the size of shipped request data
is diferent for diferent implementations of BJ (i.e., UNION, FILTER or VALUES). When using
UNION to implement bind join, triple patterns with bound variables are consolidated into one
request with the UNION operator, resulting in a total size of shipped request data equivalent
to that of index-NLJ, despite the size for individual requests being diferent. In case FILTER is
applied, a new request is constructed based on the original triple pattern along with attached
join variables as well as corresponding values. Consequently, the size of shipped request data
for all queries is three plus 2 · | sols( )| · joinVars(,  ). Using VALUES is similar to using
FILTER, but VALUES allows for specifying multiple values for given variables, to which only
the bound values in the input solution mappings would be attached. As a result, the overall size
of shipped request data decreases to three plus |sols( )| · joinVars(,  ).</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Total size of shipped response data (#respRDFtermsest)</title>
          <p>The overall size of shipped response data mainly depends on two factors: the estimated number of
solution mappings and the type of interface. Diferent types of interfaces have diferent response
types. For instance, TPF and brTPF servers return matching triples in their responses, whereas
SPARQL endpoints directly respond with solution mappings. For this reason, the size of shipped
data in TPF/ brTPF responses depends only on the total number of matching triples, while the size
of shipped response data for SPARQL endpoints also depends on the number of variables in the
corresponding subquery (as illustrated in Table 3), where |vars()| and |vars()| represents
the number of distinct variables in the triple pattern  or BGP , respectively).</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>3.1.4. Processing cost by federation engines (fedProcessest)</title>
          <p>The amount of work that needs to be done by the federation engine is hard to measure accurately.
In an efort to obtain a rough estimation, we adopt a simplified approach by considering the
size of intermediate results that need to be processed by the federation engine. The size of
intermediate results is listed in the fourth column of Table 2.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Cost Calculation</title>
        <p>In our proposed model, the cost of a query plan is calculated as a weighted sum of multiple
metrics. Each metric is computed by aggregating the corresponding measures from all operators
encompassed within the plan. For example, the total number of requests for the physical plan is
obtained by summing the number of requests for each operator in the plan (see Equation 6).
total #requestsest =∑︁#requestsest(op)
op∈subPlan
(6)
In addition, our cost model accommodates parallel evaluation of operators, thereby capturing
the efects of parallelism. If the root operator of a physical plan is an SHJ or a union, the work
performed by the corresponding two subplans (referred to as subPlan1 and subPlan2) can occur in
parallel. The overall work processed by the federation engine (total fedProcessest) is calculated
using the Equation 7, which applies specifically for SHJ and union operators. In addition, this
formula for parallelism is only used for fedProcessest but not for any communication costs.</p>
        <p>⎧
fedProcessest(rootOp) + max ⎨</p>
        <p>∑︁fedProcessest(op),
⎩op∈subPlan1
⎫
∑︁fedProcessest(op)⎬</p>
        <p>(7)
op∈subPlan2
⎭
Example 4. To illustrate these concepts, Table 4 presents the metric values for two physical
plans (1b and 1c) for our running example. The total number of requests is computed by
summing up values for all operators in each plan. Furthermore, the total work processed by
the federation engine is evaluated with considering parallelism. For instance, the total amount
of work by the federation engine for the physical plan (1b) is 2105, which is calculated by:
5 + (80, 1100 + (100, 1000)).
To determine the total cost of a physical plan while considering the combination of diferent
cost metrics, a weighted sum can be employed (see Equation 8). The weights for each metric
may be adjusted based on the specific scenario and user requirements. For instance, in a
scenario where the network delay is substantial and the number of requests dominates the total
processing time, using a greater value for  may be beneficial in identifying a suitable plan for
the scenario. In the evaluation of our cost model presented in this paper, a uniform weight of 1
is assigned to all metrics, maintaining equal importance across the board.</p>
        <p>totalCost =  · total #requestsest +  · total #reqDataest +
 · total #respRDFterms est +  · total fedProcessest
(8)</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Cardinality Estimation</title>
        <p>As indicated in the preceding section, the metrics of our cost model rely on the expected
cardinality of intermediate results. It is noteworthy that our cost model is not tied to any
specific approach for estimating such cardinalities. Instead, our cost model is designed to
accommodate any approach that can be utilized to obtain such cardinality estimates.</p>
        <p>
          For the evaluation of our cost model in this paper, we are adopting a comparatively simple
method for cardinality estimation [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. This method is based on issuing requests to the federation
members. For each subquery with only one req operator, the number of triples that match
the given graph pattern can be obtained by issuing a request to the corresponding federation
member. The concrete type of this request depends on the type of interface employed by the
federation member. If the federation member provides a SPARQL endpoint, the cardinality of
a graph pattern ( or ) can be obtained by constructing a SELECT COUNT query for each
subquery. If the federation member uses a TPF interface or a brTPF interface, a request can be
issued to retrieve the first page of matching triples, and the count estimates can be obtained
from the metadata of the retrieved pages.
        </p>
        <p>
          In the case of subqueries involving operators other than req operator, the expected cardinality
of the subquery can be determined by recursively applying a cardinality estimation function to
each subquery. Specifically, the cardinality of join operations involving two subqueries in the
approach that we adopt for the evaluation in this paper is estimated as the minimum of their
respective cardinalities, while the cardinality of union operations is computed by summing the
individual cardinalities [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. However, as mentioned above, our cost model can also be used in
combination with other cardinality estimation approaches.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>
        This section presents an empirical evaluation in which we study the efectiveness of our
cost model by using Heling and Acosta’s approach [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] as a baseline. We first describe the
implementation that we use and the experiment setup, including the datasets, queries, and
evaluation metrics. Thereafter, we present the measurements and discuss our observations.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Implementation</title>
        <p>
          We implemented the proposed cost model in our query federation engine HeFQUIN1 in
combination with a simple greedy plan-enumeration algorithm. This algorithm produces left-deep
execution plans for source assignments consisting of joins over subplans that may be individual
request operators or unions of requests. In particular, after picking a first such subplan based
on the estimated cost, and using that subplan as the starting point for building up a left-deep
plan, the algorithm iterates over the remaining subplans that can be joined with the partial
left-deep plan that has been built so far (i.e., ignoring subplans that would introduce cross
products, unless there are no other subplans left). In each step of this iteration, the algorithm
considers all available subplans in combination with all possible algorithms for joining each
such subplan with the current left-deep plan, and then greedily picks the least costly of these
options. Additionally, we extended HeFQUIN with an implementation of the approach that
Heling and Acosta proposed [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] to produce left-deep execution plans for the aforementioned
types of source assignments. For the experiments, such source assignments are given to the
engine in the form of SPARQL queries with SERVICE clauses.
1https://github.com/LiUSemWeb/HeFQUIN
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiment Setup</title>
        <p>
          All experiments described in this paper have been performed on a server machine with two 8-core
Intel Xeon E5-2667 v3@3.20GHz CPUs and 256 GB of RAM. The machine runs a 64-bit Debian
GNU/Linux 10 server operation system. On this machine, we use KOBE [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] (a benchmarking
system based on Kubernetes infrastructures) to containerize and configure federations of RDF
datasets, queries, federation engines and experiments. For setting up heterogeneous federations,
we use Virtuoso v7.20 for configuring a SPARQL endpoint, the Server.js (v3.3.0) 2 and Java brTPF
server3 to deploy TPF and brTPF servers with HDT backends. The source code and experimental
results are provided online.4
        </p>
        <p>
          Datasets and Queries: The datasets and queries we use for the evaluation are from
FedBench [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which involves 9 datasets containing a total number of 10M triples. Each dataset can
be exposed using diferent types of interfaces. We use the same types of federations as used
by Heling and Acosta. These two federations difer in terms of the types of interfaces used
by the federation members (illustrated in Table 5). We use a total of 25 queries from Cross
Domain (cd1–7), Life Science (ls1–7) and Linked Data (ld1–11). For these queries, we manually
applied the source selection approach of FedX [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] in order to produce source assignments as
assumed as input for the tested approaches (cf. Section 4.1).
        </p>
        <p>Evaluation Metrics We report performance metrics based on the following definitions:
i) Query execution time (QET) is the amount of time elapsed since the evaluation of the query
plan starts until the complete answer has been received. ii) Number of requests (#requests)
is the number of requests that are issued to federation members during the execution of the
query. iii) Size of data transferred (#respRDFterms ) is represented as the total number of
RDF terms that are transferred from federation members to the federation engine. iv) Local
work (fedProcess) is the amount of work that needs to be done by the federation engine.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Results</title>
        <p>Figures 2a and 2b illustrate the average query execution time per query for federations Fed I and
Fed II, respectively; Figures 2c and 2d illustrate the respective number of requests sent per query,
and Figures 2e and 2f illustrate the total size of data transferred from federation members to the
federation engine. We observe that there is not much diference between the two approaches for
about half of queries (11 out of 24 queries). Compared to the baseline approach, it also shows up
that for some queries, the plans selected using our cost model require less data to be retrieved
from federation members without a significant increase in query execution time (some queries
2https://github.com/LinkedDataFragments/Server.js
3https://github.com/LiUSemWeb/Server.Java/tree/feature-brtpf
4https://github.com/LiUSemWeb/HeFQUIN-DMKG2023-Experiments
(a) Query execution time (Fed I)</p>
        <p>(b) Query execution time (Fed II)
(c) Overall number of requests (Fed I)</p>
        <p>(d) Overall number of requests (Fed II)
(e) Size of data transferred (Fed I)
(f) Size of data transferred (Fed II)
have a significant reduction in query execution time). Table 6 illustrates these queries for which
there is a significant diference in the measurements (i.e., where the respective greater value is
at least double the smaller value). All these are queries for which the plans selected by the two
approaches were diferent.</p>
        <p>Based on Table 6, we observe that the number of queries that shows improvement varies
across the federations since the interface’s capabilities afect diferent queries. We find that,
for these queries, even though the plans selected using our cost model issue more requests to
the federation members, the query execution times are reduced or without significant increase,
but the total size of data transferred is significantly decreased! The best reduction in query
execution time occurs on the query ld8 over Fed I, which is nearly 300 times. Further, the plans
selected using our cost model produces more complete result than those using the baseline
approach, which is caused by the SPARQL endpoint as the Virtuoso server does not produce
complete results (at most 220), but this issue does not happen for the plans selected using our
cost model as this cost model not only considers the number of requests but also the size of
data transferred. This query is a case like the one illustrated in the running example 1c. Due to
Measurements for the queries for which there is a significant diference between the two approaches
(i.e., where the respective greater value is at least double the smaller value). The blue highlighting
indicates cases in which our cost model is better than the baseline, while orange highlighting indicates
cases in which it is the other way around. Red-colored numbers (additionally marked with an asterisk
*) are cases in which the produced query result was incomplete.</p>
        <p>Federation Measurement</p>
        <p>QET
#requests
#resultSize
QET
#requests
#dataTransferred
#dataTransferred
#resultSize
this rewriting, the number of requests increases a bit on the number of requests (from 20 to 80),
but the size of data transferred is reduced by more than 2600 times.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Summary and Future Work</title>
      <p>In this paper we propose a cost model designed for query optimization over heterogeneous
federations of RDF data sources. This cost model can be seamlessly integrated with a greedy
algorithm or a dynamic programming algorithm. With experimental evaluation, we have
demonstrated that the query plan selected using the cost model requires less data to be transferred from
federation members to the federation engine compared to the baseline approach. Furthermore,
for certain queries, leveraging our cost model achieves a better plan that not only delivers
complete results but also achieves shorter execution times.</p>
      <p>To attain a more comprehensive understanding of the model’s capabilities and limitations,
we intend to conduct a more extensive evaluation of the cost model in our future work. This
evaluation shall involve testing with a diverse range of federation setups, exploring additional
types of interfaces, and accounting for network latency. Additionally, we aim to enhance our
cost model by also incorporating the processing cost incurred by the federation members.
Furthermore, we will explore opportunities to accommodate more expressive forms of query
patterns to further enhance the applicability of our cost model.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was funded by the National Graduate School in Computer Science, Sweden (CUGS),
and by the Swedish Research Council (Vetenskapsrådet, project reg. no. 2019-05655).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Acosta</surname>
          </string-name>
          , M.-E. Vidal,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lampo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Castillo</surname>
          </string-name>
          , E. Ruckhaus,
          <string-name>
            <surname>ANAPSID:</surname>
          </string-name>
          <article-title>An Adaptive Query Processing Engine for SPARQL Endpoints</article-title>
          ,
          <source>in: Proc. of the 11th International Semantic Web Conference (ISWC)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Charalambidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Troumpoukis</surname>
          </string-name>
          , S. Konstantopoulos, SemaGrow: Optimizing Federated SPARQL Queries,
          <source>in: Proc. of the 11th Int. Conf. on Semantic Systems</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Potocki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Soru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <article-title>CostFed: Cost-Based Query Optimization for SPARQL Endpoint Federation</article-title>
          ,
          <source>in: Proc. of the 14th Int. Conf. on Semantic Systems</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Görlitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ladwig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwarte</surname>
          </string-name>
          , T. Tran,
          <article-title>FedBench: A Benchmark Suite for Federated Semantic Data Query Processing</article-title>
          ,
          <source>in: Proceedings of the 10th International Semantic Web Conference (ISWC)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schenkel</surname>
          </string-name>
          , M. Schmidt,
          <source>FedX: Optimization Techniques for Federated Query Processing on Linked Data, in: Proceedings of the 10th International Semantic Web Conference (ISWC)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Castillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Acosta</surname>
          </string-name>
          , G. Montoya, G. Palma,
          <article-title>On the Selection of SPARQL Endpoints to Eficiently Execute Federated SPARQL Queries, Trans. Large-Scale Data-</article-title>
          and
          <string-name>
            <surname>Knowledge-Centered</surname>
            <given-names>Systems</given-names>
          </string-name>
          25 (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          , A.-C.
          <article-title>Ngonga Ngomo, HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation</article-title>
          ,
          <source>in: Proc. of the European Sem. Web Conf. (ESWC)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Vander</given-names>
            <surname>Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Van Herwegen</given-names>
            ,
            <surname>L. De Vocht</surname>
          </string-name>
          , B. De Meester, G. Haesendonck,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colpaert</surname>
          </string-name>
          ,
          <article-title>Triple Pattern Fragments: A Low-Cost Knowledge Graph Interface for the Web</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>37</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Buil-Aranda</surname>
          </string-name>
          ,
          <article-title>Bindings-Restricted Triple Pattern Fragments</article-title>
          ,
          <source>in: 15th Ontologies, Databases, and Applications of Semantics (ODBASE)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Minier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Skaf-Molli</surname>
          </string-name>
          , P. Molli,
          <article-title>SaGe: Web Preemption for Public SPARQL Query Services</article-title>
          ,
          <source>in: Proceedings of the Web Conference (WWW)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Azzam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Acosta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          , SMART-KG:
          <article-title>Hybrid Shipping for SPARQL Querying on the Web</article-title>
          ,
          <source>in: Proc. of The Web Conference (WWW)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , O. Hartig,
          <article-title>Source Selection for SPARQL Endpoints: Fit for Heterogeneous Federations of RDF Data Sources?</article-title>
          ,
          <source>in: Proceedings of the 6th Workshop on Storing, Querying and Benchmarking Knowledge Graphs (QuWeDa)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Heling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Acosta</surname>
          </string-name>
          ,
          <article-title>Federated sparql query processing over heterogeneous linked data fragments</article-title>
          ,
          <source>in: Proceedings of the ACM Web Conference</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , O. Hartig,
          <article-title>FedQPL: A Language for Logical Query Plans over Heterogeneous Federations of RDF Data Sources</article-title>
          ,
          <source>in: Proc. of the 22nd Int. Conf. on Information Integration and Web-based Applications &amp; Services (iiWAS)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>C. B. Aranda</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <article-title>Strategies for executing federated queries in SPARQL1.1</article-title>
          , in: 13th
          <source>International Semantic Web Conference</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kostopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mouchakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Troumpoukis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Prokopaki-Kostopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Charalambidis</surname>
          </string-name>
          , S. Konstantopoulos, KOBE:
          <article-title>Cloud-native Open Benchmarking Engine for Federated Query Processors</article-title>
          , in: European Semantic Web Conference,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>