1. Introduction

S. Cheng); http://olafhartig.de/ (O. Hartig)

A Cost Model to Optimize Queries over Heterogeneous Federations of RDF Data Sources

Sijin Cheng

Olaf Hartig

0 0 Linköping University , SE-58183 Linköping , Sweden

2023

000 0 0003

Federated processing of queries over RDF data sources ofers significant potential when a SPARQL query cannot be answered by a single data source alone. However, finding eficient plans to execute a query over a federation is challenging, especially if diferent federation members provide diferent types of data access interfaces. Diferent interfaces imply diferent request types, diferent forms of responses, and diferent physical algorithms that can be used, each of which consumes varying amounts of resources during query execution. This heterogeneity poses additional obstacles to the task of planning query executions, in addition to the inherent complexity arising from numerous possible join orderings and various physical algorithms. As a first step to address these challenges, we propose a cost model that captures the resource requirements of diferent operators depending on the type of federation member, allowing us to estimate cost of a given query execution plan without actually executing it. To evaluate our approach, we conduct experiments on FedBench with our cost model and compare it to the current state-of-the-art approach to query planning for heterogeneous federations of RDF data sources.

eol>Heterogeneous Federations Cost Model Query Optimization

1. Introduction

For queries that cannot be answered by accessing a single data source, a federation of RDF data sources becomes essential to collectively answer a query. Processing such a query is challenging as not all parts of the query need to be issued to each federation member. Efectively grouping the subqueries and determining which parts are to be issued to which of the federation members may impact the overall query performance significantly, and the order in which the results of the subqueries are joined is another important performance-related factor. Over the past decade, a number of approaches have been proposed to improve the performance of processing such queries (e.g., [ 1, 2, 3, 4, 5, 6 ]). One line of work is focused on the source selection phase, which aims to split a given SPARQL query into subqueries that can be assigned to federation members. The objective is to ensure that each subquery is assigned to only those federation members that can provide a nonempty result (e.g., [ 5, 7 ]). Another research direction involves query planning heuristics, which are typically used to determine the join order in logical plans. Furthermore, some studies propose cost models to determine optimal physical query plans (e.g., [ 3 ]).

All these approaches focus only to homogeneous federations, assuming that all federation members provide a SPARQL endpoint interface. In recent years, however, other types of interfaces were proposed, including the Triple Pattern Fragment (TPF) interface [ 8 ], the BindingsRestricted TPF (brTPF) interface [ 9 ], the SaGe interface [ 10 ], and the smart-KG interface [ 11 ]. In the federated setting, each federation member may support a diferent type of interface to access its RDF dataset. The resulting heterogeneity of federations poses extra challenges that render many of the existing source selection approaches, heuristics, and cost models inadequate [ 12 ]. These challenges are primarily due to the features of diferent interfaces. For instance, diferent interfaces may require (or enable!) federation engines to leverage specific physical operators, not all forms of subqueries can be answered directly by every interface, and diferent interfaces may provide diferent forms of metadata relevant for query processing.

Recently, Heling and Acosta proposed a query planner that is designed to address these challenges of query processing over heterogeneous data sources [ 13 ]. This query planner builds left-deep query plans, for which it first determines a join order based on cardinality estimation. Thereafter, the planner selects join algorithms, choosing between a symmetric hash join and bind join, where the choice is made based on the estimated number of requests needed to execute the join. We argue that deciding the join order first and determining the physical algorithm afterwards might overlook the optimal plan, as the order becomes fixed without considering any features of federation members. Moreover, we believe that, when determining the physical algorithm, it is crucial to consider not only the number of requests, but also factors such as the size of data to be transferred and the amount of work that is processed by the engine.

In this paper, we propose a cost model that enables to estimate resource usage in physical plans, considering factors such as the number of requests, the size of data transferred, and the amount of work need to be done by the federation engine. Our cost model considers diferent features that federation members have when using diferent types of interfaces. To evaluate the efectiveness of the proposed cost model, we employ it in a greedy plan-enumeration algorithm similar to the one used by Heling and Acosta’s approach [ 13 ] and, then, compare the approaches experimentally based on the FedBench benchmark. The experiments show that applying our cost model can result in a plan that requires less data to be transferred, without a significant increase in query execution time. In particular, for queries with UNION operators in subqueries after source selection, our cost model is able to find a plan that requires significantly less data to be transferred, resulting in less query execution time with higher completeness of results.

2. Preliminary

This section introduces concepts, notations and a running example used in the rest of the paper. To represent query plans in this paper we use the FedQPL language as introduced in our earlier work [ 14 ]. The syntax of FedQPL expressions is defined as follows.

Definition 1. A FedQPL expression is constructed from the following grammar, where req, tpAdd, bgpAdd, join, union, mj, mu, (, ), are terminal symbols. is an expression in the request language req of some interface [ 14 ] (e.g., triple patterns, whole SPARQL patterns), fm is a federation member, is a triple pattern, is a BGP, and Φ is a nonempty set of FedQPL expressions.

::= reqfm | tpAddfm ( ) | bgpAddfm( ) | join(, ) | union(, ) | mj Φ | mu Φ | The first operator, req, captures the intention to retrieve the result of a certain (sub)query from a given federation member. The unary operator tpAdd captures the intention to access a federation member to obtain solution mappings for a single triple pattern that must be compatible with solution mappings obtained from the plan represented by the given subexpression. The operator bgpAdd is a BGP-based variation of tpAdd. In contrast to these operators, join is a binary operator that joins two inputs, capturing the intention to get the input sets of solution mappings independently, and then join them in the query federation engine. As for the remaining operators, union lifts the standard SPARQL algebra operator union into the FedQPL language, whereas mj and mu are multiway variations of join and union to capture the intention to apply a multiway algorithm that can combine an arbitrary number of inputs. For a formal definition of the semantics of these operators, refer to our earlier work [ 14 ].

Example 1. As our running example, consider a basic graph pattern (BGP) ex = {1, 2} with 1 = (?, foaf:name, "Alice") and 2 = (?, dbo:country, dbr:Canada), and a heterogeneous federation ex with three members, denoted by fm1, fm2, and fm3. While federation members fm1 and fm2 provide a SPARQL endpoint interface, fm3 provides a TPF interface.

The following fragment of FedQPL captures the output of source selection processes [ 14 ]. Definition 2. A source assignment is a FedQPL expression that is constructed using only the operators req, mj, and mu such that, for each subexpression of the form reqfm, it holds that is a triple pattern or a BGP.

Example 2. For the running example (in Example 1), we assume that members fm1 of our example federation ex can contribute matching triples for 1 of the example BGP ex (cf. Example 1), whereas 2 is matched in the data of fm2 and fm3. Hence, we may use the following source assignment ex for ex over ex.

mj{︀ reqfm11 , mu{reqfm22 , reqfm23 } ︀} (1)

A left-deep logical plan (see Figure 1a) can be converted from source assignment (Equation 1) by converting multiway join (mj) to binary join directly. Further from a logical plan to a physical plan, physical algorithms for each operator can be established directly based on a logical plan. Alternatively, the physical plan can be reconstructed by considering the possible physical algorithms for each operator in a combined manner. The physical algorithms that can be used for each logical operator of FedQPL depend on the type of interface. Table 1 (left-hand side) lists the physical algorithms that we consider in this work. For the unary operators tpAddfm ( ) and bgpAddfm( ), the operator accesses federation members with requests where input solution mappings are shipped as part of the requests. A straightforward algorithm to implement a unary operator is to use an index-nested loop join (index-NLJ). The input solution bindings are iterated for creating requests with updated or , where the variables are replaced (a) A logical plan (b) Physical plan 1 (c) Physical plan 2 with obtained variable bindings. If the federation member provides a TPF interface, only the index-NLJ can be applied as TPF interface restricts the type of queries that can be answered by the server to single triple patterns. If the federation member uses a brTPF interface, a set of input solution mappings (variable bindings for the shared variable) can be attached to a brTPF request. While federation members support the SPARQL endpoint, as further alternatives, the FILTER, UNION and FILTER operators can be used for shipping the input bindings along with the query pattern as requests [ 15 ], denoted as bind join (BJ). The binary operator join( 1, 2) joins input solution mappings in the query federation engine, so the physical algorithm is not afected by the interface type that the federation member uses. In this paper, the symmetric hash join (SHJ) (see [ 15 ]) is used for implementing join by default, while there are other options, such as hash join and merge join.

Example 3. For the running example, the logical operator join can be implemented via SHJ (Figure 1b) or rewriting the logical plan to push join into the union (Figure 1c). In the latter case, tpAddfm22 can use an index-NLJ or any variations of the bind join, while tpAddfm23 can only apply index-NLJ as fm3 provides a TPF interface. Regarding query execution, subplans of binary operators (e.g., SHJ or union) may be executed in parallel, for example, tpAddfm22 and tpAddfm23 of Figure 1c can consume the intermediate results of reqfm11 at the same time.

3. Cost Model

Each physical query plan consumes a specific amount of diferent kinds of resources when it is executed, and these resource requirements can difer significantly for diferent possible plans for the same query. In this section, we present a cost model that is designed to capture these resource requirements and can be used to estimate the cost of a given plan without actually executing the plan. In this cost model, the (estimated) cost of a query plan is calculated as a weighted sum of multiple metrics, and each of these metrics is calculated by summing up corresponding measures from all operators in the plan. In the following, we first define the metrics based on which the cost of a physical plan is captured in our cost model, and introduce aaaaa

Physical Algorithm index-NLJ BJ( UNION ) BJ( FILTER ) BJ( VALUES ) index-NLJ index-NLJ brTPF-based BJ index-NLJ BJ( UNION ) BJ( FILTER ) BJ( VALUES ) request TPF request request SHJ Union ⌈︁ |sols( )| ⌉︁ blockSize |sols( )| · ⌈︁ ⌉︁

|sols( )| ⌈︁ |sols( )| ⌉︁ ⌈︂ ︂⌉

blockSize · ⌈ b|slooclsk(Siz)e| ⌉ |sols( )| ⌈︁ |sols( )| ⌉︁

blockSize 1 ⌈︁ (,fm) ⌉︁

pageSize functions for measuring each of these metrics for each physical operator. Thereafter, we present how these cost metrics are used to calculate the overall cost for an entire physical plan.

3.1. Cost Metrics

The metrics of our cost model are defined in terms of the costs of communication between the federation engine and the federation members and the processing costs induced at the federation engine. The communication cost considers both the request process and the response process, including the number of requests (#requestsest), the size of shipped request data for all requests together (#reqDataest), and the size of relevant responses all together (#respRDFterms est). The metric for capturing the processing cost by the federation engine, denoted as fedProcessest, is estimated based on the cardinality of intermediate results of the query plan. Since binary operators, join and union, do not interact with the federation members, there is no associated communication cost involved. Therefore, the primary cost incurred by binary operations is the processing cost by the federation engine.

3.1.1. Number of requests (#requestsest)

To estimate the number of requests involved during query execution, it is important to note that this number depends not only on the physical algorithm but also on the interface type of the accessed federation member. In other words, the same physical algorithm may necessitate a diferent number of requests if the federation member uses diferent interfaces. Therefore, we provide a set of functions for measuring #requestsest for each combination of physical algorithm and interface type of federation members (illustrated in the right part of Table 1) Logical Operator Interface of fm #respRDFtermsest tpAddfm( ) STPPAF,RbQrTLPeFndpoint |3v·ajrosi(nCa)|r d·(join, Cfmar,ds(ols(, f m)), sols( )) bgpAddfm( ) SPARQL endpoint |vars()| · joinCard(, fm, sols( )) reqfm STPPAF,RbQrTLPeFndpoint |3v·ars(()| ·,fm)(, fm) reqfm SPARQL endpoint |vars()| · (, fm)

For the reqfm operator with a BGP , the number of requests is fixed to one, since this operator is considered valid only when the federation member employs an interface that supports BGP requests (e.g., SPARQL endpoints). For the reqfm operator with a triple pattern , the federation member can use any type of interface that supports triple pattern requests (e.g., SPARQL endpoints, TPF servers, and brTPF servers). If the federation member provides a SPARQL endpoint, the number of requests is one whereas, for TPF servers and brTPF servers, the number of requests depends on both the cardinality of the results of reqfm and the page size since TPF and brTPF interfaces employ a paging mechanism. The cardinality of the results of reqfm , denoted as card(, fm), represents the number of triples in the RDF dataset of fm that match the given triple pattern . The page size (pageSize) refers to the maximum number of triples that can be retrieved per request from a TPF server or a brTPF server and is typically set to 100. Consequently, the number of requests can be calculated as the ceiling value of the division between the cardinality and the page size: ⌈card(, fm)/pageSize⌉.

The bgpAddfm( ) operator can be implemented using some physical operators: the indexnested loop join (index-NLJ) operator or the bind join (BJ) operator. If the bgpAdd operator is implemented via an index-NLJ algorithm, the number of requests (#requestsest) is primarily influenced by the number of solution mappings that the operator receives as input from its subplan , represented as |sols( )|. During execution, for each input solution mapping , the join variables (i.e. joinVars(, )) of the BGP are substituted with the bindings from , and then a new BGP ′ is sent to fm. Hence, each solution mapping causes one request and, thus, the number of requests #requestsest is equivalent to |sols( )|. Alternatively, the BJ algorithm can be applied to reduce the number of requests by consolidating multiple bindings together with the BGP into a single request. The number of bindings that can be attached for one request is determined based on the server capabilities, denoted as blockSize. Hence, the number of requests depends on the total number of input solution mappings and page size: ⌈|sols( )|/blockSize⌉.

For the tpAddfm ( ) operator, the number of requests is similar to the bgpAdd operator in case the federation member provides a SPARQL endpoint. However, if the federation member is a TPF server, only the index-NLJ algorithm can be used. Given that TPF servers adopt a paging mechanism, the total number of pages is at least equal to the number of input solution mappings, as each such solution mapping results in at least one request. In cases in which all matching triples cannot be retrieved with a single page, the total number of pages with all responses together can be estimated based on the join cardinality (joinCard) divided by the page size (pageSize).

= max ︂{ |sols( )| , ︂⌈ joinCard(, fm, sols( )) ⌉︂} pageSize

Furthermore, the number of input solution mappings is |sols( )| and each such solution mapping results in at least one request. In our analysis, we make an assumption that the matching triples are evenly distributed for each of such input solution mapping, implying that each request is expected to yield an equivalent number of pages. Consequently, the estimated number of pages for each input solution mapping can be calculated as Equation 3. Hence, the total number of requests for tpAddfm ( ) with a TPF server fm is estimated as Equation 4. |sols( )| · = ︂⌈ ⌉︂ |sols( )

If federation members provide a brTPF interface, an index-NLJ algorithm or brTPF algorithm can be applied. In case index-NLJ is applied, the total number of requests is the same as for the TPF interface. If the brTPF algorithm is applied, a set of input solution mappings can be attached to a brTPF request. The number of bindings that can be attached to each request is defined as block size ( blockSize). Under the even distribution assumption, the total number of requests for the brTPF algorithm is estimated as follows.

(3) ⎡ · ⎢ ⎢ ⎢ (2) (4) (5) ︂⌈ |sols( )| ⌉︂ blockSize totalPageNum ⌈︁ |sols( )| ⌉︁ blockSize ⎤ ⎥ ⎥ ⎥

3.1.2. Total size of shipped request data (#reqDataest)

Regarding the triple pattern request operator, the total size of shipped request data is estimated as the number of RDF terms and variables in all responses together (listed in the third column of #reqDataest is determined as 3, taking into account the three components of the triple pattern. In the case of a basic graph pattern (), the value of #reqDataest becomes 3 multiplied by the number of triple patterns (countTP()) within the basic graph pattern .

Concerning the tpAddfm ( ) operator, the size of shipped request data depends on both the number of input solution mappings and variables that are common across the given subquery with input solution mappings, denoted by joinVars(, ). Diferent physical algorithms exhibit diferent total sizes of shipped request data. Specifically, for the index-NLJ algorithm, each request contains a single triple pattern where the join variables are substituted by bindings of the input solution mappings. Hence, the size of shipped request data of all requests is three times the number of input solution mappings, 3 · | sols( )|. For the BJ, the size of shipped request data is diferent for diferent implementations of BJ (i.e., UNION, FILTER or VALUES). When using UNION to implement bind join, triple patterns with bound variables are consolidated into one request with the UNION operator, resulting in a total size of shipped request data equivalent to that of index-NLJ, despite the size for individual requests being diferent. In case FILTER is applied, a new request is constructed based on the original triple pattern along with attached join variables as well as corresponding values. Consequently, the size of shipped request data for all queries is three plus 2 · | sols( )| · joinVars(, ). Using VALUES is similar to using FILTER, but VALUES allows for specifying multiple values for given variables, to which only the bound values in the input solution mappings would be attached. As a result, the overall size of shipped request data decreases to three plus |sols( )| · joinVars(, ).

3.1.3. Total size of shipped response data (#respRDFtermsest)

The overall size of shipped response data mainly depends on two factors: the estimated number of solution mappings and the type of interface. Diferent types of interfaces have diferent response types. For instance, TPF and brTPF servers return matching triples in their responses, whereas SPARQL endpoints directly respond with solution mappings. For this reason, the size of shipped data in TPF/ brTPF responses depends only on the total number of matching triples, while the size of shipped response data for SPARQL endpoints also depends on the number of variables in the corresponding subquery (as illustrated in Table 3), where |vars()| and |vars()| represents the number of distinct variables in the triple pattern or BGP , respectively).

3.1.4. Processing cost by federation engines (fedProcessest)

The amount of work that needs to be done by the federation engine is hard to measure accurately. In an efort to obtain a rough estimation, we adopt a simplified approach by considering the size of intermediate results that need to be processed by the federation engine. The size of intermediate results is listed in the fourth column of Table 2.

3.2. Cost Calculation

In our proposed model, the cost of a query plan is calculated as a weighted sum of multiple metrics. Each metric is computed by aggregating the corresponding measures from all operators encompassed within the plan. For example, the total number of requests for the physical plan is obtained by summing the number of requests for each operator in the plan (see Equation 6). total #requestsest =∑︁#requestsest(op) op∈subPlan (6) In addition, our cost model accommodates parallel evaluation of operators, thereby capturing the efects of parallelism. If the root operator of a physical plan is an SHJ or a union, the work performed by the corresponding two subplans (referred to as subPlan1 and subPlan2) can occur in parallel. The overall work processed by the federation engine (total fedProcessest) is calculated using the Equation 7, which applies specifically for SHJ and union operators. In addition, this formula for parallelism is only used for fedProcessest but not for any communication costs.

⎧ fedProcessest(rootOp) + max ⎨

∑︁fedProcessest(op), ⎩op∈subPlan1 ⎫ ∑︁fedProcessest(op)⎬

(7) op∈subPlan2 ⎭ Example 4. To illustrate these concepts, Table 4 presents the metric values for two physical plans (1b and 1c) for our running example. The total number of requests is computed by summing up values for all operators in each plan. Furthermore, the total work processed by the federation engine is evaluated with considering parallelism. For instance, the total amount of work by the federation engine for the physical plan (1b) is 2105, which is calculated by: 5 + (80, 1100 + (100, 1000)). To determine the total cost of a physical plan while considering the combination of diferent cost metrics, a weighted sum can be employed (see Equation 8). The weights for each metric may be adjusted based on the specific scenario and user requirements. For instance, in a scenario where the network delay is substantial and the number of requests dominates the total processing time, using a greater value for may be beneficial in identifying a suitable plan for the scenario. In the evaluation of our cost model presented in this paper, a uniform weight of 1 is assigned to all metrics, maintaining equal importance across the board.

totalCost = · total #requestsest + · total #reqDataest + · total #respRDFterms est + · total fedProcessest (8)

3.3. Cardinality Estimation

As indicated in the preceding section, the metrics of our cost model rely on the expected cardinality of intermediate results. It is noteworthy that our cost model is not tied to any specific approach for estimating such cardinalities. Instead, our cost model is designed to accommodate any approach that can be utilized to obtain such cardinality estimates.

For the evaluation of our cost model in this paper, we are adopting a comparatively simple method for cardinality estimation [ 13 ]. This method is based on issuing requests to the federation members. For each subquery with only one req operator, the number of triples that match the given graph pattern can be obtained by issuing a request to the corresponding federation member. The concrete type of this request depends on the type of interface employed by the federation member. If the federation member provides a SPARQL endpoint, the cardinality of a graph pattern ( or ) can be obtained by constructing a SELECT COUNT query for each subquery. If the federation member uses a TPF interface or a brTPF interface, a request can be issued to retrieve the first page of matching triples, and the count estimates can be obtained from the metadata of the retrieved pages.

In the case of subqueries involving operators other than req operator, the expected cardinality of the subquery can be determined by recursively applying a cardinality estimation function to each subquery. Specifically, the cardinality of join operations involving two subqueries in the approach that we adopt for the evaluation in this paper is estimated as the minimum of their respective cardinalities, while the cardinality of union operations is computed by summing the individual cardinalities [ 13 ]. However, as mentioned above, our cost model can also be used in combination with other cardinality estimation approaches.

4. Evaluation

This section presents an empirical evaluation in which we study the efectiveness of our cost model by using Heling and Acosta’s approach [ 13 ] as a baseline. We first describe the implementation that we use and the experiment setup, including the datasets, queries, and evaluation metrics. Thereafter, we present the measurements and discuss our observations.

4.1. Implementation

We implemented the proposed cost model in our query federation engine HeFQUIN1 in combination with a simple greedy plan-enumeration algorithm. This algorithm produces left-deep execution plans for source assignments consisting of joins over subplans that may be individual request operators or unions of requests. In particular, after picking a first such subplan based on the estimated cost, and using that subplan as the starting point for building up a left-deep plan, the algorithm iterates over the remaining subplans that can be joined with the partial left-deep plan that has been built so far (i.e., ignoring subplans that would introduce cross products, unless there are no other subplans left). In each step of this iteration, the algorithm considers all available subplans in combination with all possible algorithms for joining each such subplan with the current left-deep plan, and then greedily picks the least costly of these options. Additionally, we extended HeFQUIN with an implementation of the approach that Heling and Acosta proposed [ 13 ] to produce left-deep execution plans for the aforementioned types of source assignments. For the experiments, such source assignments are given to the engine in the form of SPARQL queries with SERVICE clauses. 1https://github.com/LiUSemWeb/HeFQUIN

4.2. Experiment Setup

All experiments described in this paper have been performed on a server machine with two 8-core Intel Xeon E5-2667 v3@3.20GHz CPUs and 256 GB of RAM. The machine runs a 64-bit Debian GNU/Linux 10 server operation system. On this machine, we use KOBE [ 16 ] (a benchmarking system based on Kubernetes infrastructures) to containerize and configure federations of RDF datasets, queries, federation engines and experiments. For setting up heterogeneous federations, we use Virtuoso v7.20 for configuring a SPARQL endpoint, the Server.js (v3.3.0) 2 and Java brTPF server3 to deploy TPF and brTPF servers with HDT backends. The source code and experimental results are provided online.4

Datasets and Queries: The datasets and queries we use for the evaluation are from FedBench [ 4 ], which involves 9 datasets containing a total number of 10M triples. Each dataset can be exposed using diferent types of interfaces. We use the same types of federations as used by Heling and Acosta. These two federations difer in terms of the types of interfaces used by the federation members (illustrated in Table 5). We use a total of 25 queries from Cross Domain (cd1–7), Life Science (ls1–7) and Linked Data (ld1–11). For these queries, we manually applied the source selection approach of FedX [ 5 ] in order to produce source assignments as assumed as input for the tested approaches (cf. Section 4.1).

Evaluation Metrics We report performance metrics based on the following definitions: i) Query execution time (QET) is the amount of time elapsed since the evaluation of the query plan starts until the complete answer has been received. ii) Number of requests (#requests) is the number of requests that are issued to federation members during the execution of the query. iii) Size of data transferred (#respRDFterms ) is represented as the total number of RDF terms that are transferred from federation members to the federation engine. iv) Local work (fedProcess) is the amount of work that needs to be done by the federation engine.

4.3. Experimental Results

Figures 2a and 2b illustrate the average query execution time per query for federations Fed I and Fed II, respectively; Figures 2c and 2d illustrate the respective number of requests sent per query, and Figures 2e and 2f illustrate the total size of data transferred from federation members to the federation engine. We observe that there is not much diference between the two approaches for about half of queries (11 out of 24 queries). Compared to the baseline approach, it also shows up that for some queries, the plans selected using our cost model require less data to be retrieved from federation members without a significant increase in query execution time (some queries 2https://github.com/LinkedDataFragments/Server.js 3https://github.com/LiUSemWeb/Server.Java/tree/feature-brtpf 4https://github.com/LiUSemWeb/HeFQUIN-DMKG2023-Experiments (a) Query execution time (Fed I)

(b) Query execution time (Fed II) (c) Overall number of requests (Fed I)

(d) Overall number of requests (Fed II) (e) Size of data transferred (Fed I) (f) Size of data transferred (Fed II) have a significant reduction in query execution time). Table 6 illustrates these queries for which there is a significant diference in the measurements (i.e., where the respective greater value is at least double the smaller value). All these are queries for which the plans selected by the two approaches were diferent.

Based on Table 6, we observe that the number of queries that shows improvement varies across the federations since the interface’s capabilities afect diferent queries. We find that, for these queries, even though the plans selected using our cost model issue more requests to the federation members, the query execution times are reduced or without significant increase, but the total size of data transferred is significantly decreased! The best reduction in query execution time occurs on the query ld8 over Fed I, which is nearly 300 times. Further, the plans selected using our cost model produces more complete result than those using the baseline approach, which is caused by the SPARQL endpoint as the Virtuoso server does not produce complete results (at most 220), but this issue does not happen for the plans selected using our cost model as this cost model not only considers the number of requests but also the size of data transferred. This query is a case like the one illustrated in the running example 1c. Due to Measurements for the queries for which there is a significant diference between the two approaches (i.e., where the respective greater value is at least double the smaller value). The blue highlighting indicates cases in which our cost model is better than the baseline, while orange highlighting indicates cases in which it is the other way around. Red-colored numbers (additionally marked with an asterisk *) are cases in which the produced query result was incomplete.

Federation Measurement

QET #requests #resultSize QET #requests #dataTransferred #dataTransferred #resultSize this rewriting, the number of requests increases a bit on the number of requests (from 20 to 80), but the size of data transferred is reduced by more than 2600 times.

5. Summary and Future Work

In this paper we propose a cost model designed for query optimization over heterogeneous federations of RDF data sources. This cost model can be seamlessly integrated with a greedy algorithm or a dynamic programming algorithm. With experimental evaluation, we have demonstrated that the query plan selected using the cost model requires less data to be transferred from federation members to the federation engine compared to the baseline approach. Furthermore, for certain queries, leveraging our cost model achieves a better plan that not only delivers complete results but also achieves shorter execution times.

To attain a more comprehensive understanding of the model’s capabilities and limitations, we intend to conduct a more extensive evaluation of the cost model in our future work. This evaluation shall involve testing with a diverse range of federation setups, exploring additional types of interfaces, and accounting for network latency. Additionally, we aim to enhance our cost model by also incorporating the processing cost incurred by the federation members. Furthermore, we will explore opportunities to accommodate more expressive forms of query patterns to further enhance the applicability of our cost model.

Acknowledgments

This work was funded by the National Graduate School in Computer Science, Sweden (CUGS), and by the Swedish Research Council (Vetenskapsrådet, project reg. no. 2019-05655).

[1]

Acosta , M.-E. Vidal,

Lampo ,

Castillo , E. Ruckhaus, ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints , in: Proc. of the 11th International Semantic Web Conference (ISWC) , 2011 .

[2]

Charalambidis ,

Troumpoukis , S. Konstantopoulos, SemaGrow: Optimizing Federated SPARQL Queries, in: Proc. of the 11th Int. Conf. on Semantic Systems , 2015 .

[3]

Saleem ,

Potocki ,

Soru ,

Hartig ,

A. N.

Ngomo , CostFed: Cost-Based Query Optimization for SPARQL Endpoint Federation , in: Proc. of the 14th Int. Conf. on Semantic Systems , 2018 .

[4]

Schmidt ,

Görlitz ,

Haase ,

Ladwig ,

Schwarte , T. Tran, FedBench: A Benchmark Suite for Federated Semantic Data Query Processing , in: Proceedings of the 10th International Semantic Web Conference (ISWC) , 2011 .

[5]

Schwarte ,

Haase ,

Hose ,

Schenkel , M. Schmidt, FedX: Optimization Techniques for Federated Query Processing on Linked Data, in: Proceedings of the 10th International Semantic Web Conference (ISWC) , 2011 .

[6]

Vidal ,

Castillo ,

Acosta , G. Montoya, G. Palma, On the Selection of SPARQL Endpoints to Eficiently Execute Federated SPARQL Queries, Trans. Large-Scale Data- and Knowledge-Centered

Systems

25 ( 2016 ).

[7]

Saleem , A.-C. Ngonga Ngomo, HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation , in: Proc. of the European Sem. Web Conf. (ESWC) , 2014 .

[8]

Verborgh ,

M. Vander

Sande ,

Hartig , J. Van Herwegen , L. De Vocht , B. De Meester, G. Haesendonck,

Colpaert , Triple Pattern Fragments: A Low-Cost Knowledge Graph Interface for the Web , Journal of Web Semantics 37 ( 2016 ).

[9]

Hartig ,

Buil-Aranda , Bindings-Restricted Triple Pattern Fragments , in: 15th Ontologies, Databases, and Applications of Semantics (ODBASE) , 2016 .

[10]

Minier ,

Skaf-Molli , P. Molli, SaGe: Web Preemption for Public SPARQL Query Services , in: Proceedings of the Web Conference (WWW) , 2019 .

[11]

Azzam ,

J. D.

Fernández ,

Acosta ,

Beno ,

Polleres , SMART-KG: Hybrid Shipping for SPARQL Querying on the Web , in: Proc. of The Web Conference (WWW) , 2020 .

[12]

Cheng , O. Hartig, Source Selection for SPARQL Endpoints: Fit for Heterogeneous Federations of RDF Data Sources? , in: Proceedings of the 6th Workshop on Storing, Querying and Benchmarking Knowledge Graphs (QuWeDa) , 2022 .

[13]

Heling ,

Acosta , Federated sparql query processing over heterogeneous linked data fragments , in: Proceedings of the ACM Web Conference , 2022 .

[14]

Cheng , O. Hartig, FedQPL: A Language for Logical Query Plans over Heterogeneous Federations of RDF Data Sources , in: Proc. of the 22nd Int. Conf. on Information Integration and Web-based Applications & Services (iiWAS) , 2020 .

[15] C. B. Aranda , A.

Polleres , J.

Umbrich , Strategies for executing federated queries in SPARQL1.1 , in: 13th International Semantic Web Conference , 2014 .

[16]

Kostopoulos ,

Mouchakis ,

Troumpoukis ,

Prokopaki-Kostopoulou ,

Charalambidis , S. Konstantopoulos, KOBE: Cloud-native Open Benchmarking Engine for Federated Query Processors , in: European Semantic Web Conference, 2021 .