<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Query Decomposition for Scalable SPARQL Materialization in Dataspaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maarten Vandenbrande</string-name>
          <email>maarten.vandenbrande@ugent.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thibeau Vercruyssen</string-name>
          <email>Thibeau.Vercruyssen@UGent.be</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pieter Bonte</string-name>
          <email>pieter.bonte@kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Femke Ongenae</string-name>
          <email>femke.ongenae@ugent.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, KU Leuven Kulak</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IDLab, Ghent University - imec</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In dataspaces, data is often shared and integrated from multiple heterogeneous and distributed sources, and querying over these large datasets becomes computationally expensive. This challenge is exacerbated by the fact that SPARQL queries frequently overlap, as they often involve shared subqueries, especially in the context of federated and distributed data sources. As a result, materializing answers to such queries leads to large numbers of redundant materialized views, requiring substantial resources for storage and constant maintenance with limited reuse. This redundancy becomes particularly problematic because a view materialized for an individual query can only answer that specific query, leaving little opportunity for reusing common results from overlapping subqueries. To address this, we propose an algorithm that analyzes a SPARQL query in order to rewrite and decompose it into a collection of subqueries whose materialized views can be concatenated to form the answer to the original query. This way, materializing a query yields a number of materialized subquery views that can be reused more often, thereby reducing the space taken up by materialized views of queries. We evaluate the correctness, resource usage, and execution time of our proposed algorithm on well-known benchmarks, such as BSBM, LDBC-SNB, and SP²Bench. Our results demonstrate that the proposed implementation can significantly reduce redundancy and improve performance in most cases, making it an effective solution for optimizing query execution in dataspaces, where efficient data retrieval and minimal resource usage are paramount.</p>
      </abstract>
      <kwd-group>
        <kwd>Query optimization</kwd>
        <kwd>SPARQL</kwd>
        <kwd>Query rewriting</kwd>
        <kwd>Query materialization</kwd>
        <kwd>Query decomposition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In dataspaces, organizations and individuals share and integrate data from multiple sources, enabling
distributed querying over a federated ecosystem. As a result, queries need to be performed
over large amounts of data [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. Because data consumers frequently execute similar SPARQL
queries over the same shared datasets, storing the results of these queries as materialized views provides
an effective solution to ensure efficient and timely query results [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. However, when dealing
with similar but not identical queries, creating separate materialized views for each query can lead to
substantial inefficiencies in computation and storage. Overlapping query results often get recomputed
and stored multiple times, increasing redundancy. For instance, in public transport networks, different
data consumers have varying needs regarding delay information. A regional transit authority might need
to analyze delays across all forms of public transport (buses, trains, and trams) to optimize scheduling
and improve efficiency. Meanwhile, a navigation app provider may only care about train and
tram delays, as these impact route recommendations for commuters. Similarly, a local bus operator
might only track bus-specific delays for internal performance monitoring. Despite their differences, all
these queries share a common subquery for retrieving delay data, leading to redundant computations
and storage when materialized separately. Without an optimized query decomposition approach, each
data consumer would effectively duplicate parts of the materialization process.
      </p>
      <p>This article proposes a new approach to mitigate such redundancy through query rewriting and
decomposition based on only a single SPARQL query. Specifically, a query is rewritten and decomposed
based on union operations. The resulting subqueries can be materialized separately and recombined on
demand through simple concatenations of their results. In the public transport example, our approach
would create separate subqueries for the delays in each mode of public transport. The data consumers
that need the information of all modes of transport would then receive the recombined results of all
these subqueries, whilst others get results that leverage the individual subqueries. By reusing a set
of shared materialized subquery views instead of creating a large separate view for each query, this
approach significantly reduces both storage requirements and maintenance overhead. To achieve our
approach, we introduce a set of rewriting rules and algorithms in this article, ensuring that queries
are transformed into a normal form suitable for decomposition while preserving their results. Once
rewritten, the queries are decomposed into subqueries that can be materialized independently, allowing
new queries that share those subqueries to reuse existing results. This leads to a “virtual view” for each
query, which points to the relevant subquery views instead of storing yet another result, minimizing
duplication in the materialization layer.</p>
      <p>
        Query decomposition and rewriting have been explored in a few areas to increase the performance of
query execution, either by decreasing computational cost or by decreasing memory usage.
Decomposition is also used to improve query scalability by performing queries in parallel on multiple
nodes [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. In federated querying, where queries are executed over a number of endpoints, query
decomposition has also proven to be useful. Here, it is used to send each endpoint only the parts
of the query that the endpoint can answer. This decreases the overall query cost [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ]. All
approaches described above are data-driven, i.e. the decomposition of the query is determined by the
organization of the data. In our approach, the decomposition is query-driven in order to promote
reuse of results in partially overlapping queries. That being said, the other approaches are able to
decompose their queries on all operations and recombine them, whilst our approach only decomposes
on unions. Finally, our approach preemptively splits queries into logical subqueries to increase
materialized view reuse without knowing what other queries will be asked in the future.
      </p>
      <p>The remainder of this article is structured as follows. Section 2 provides an in-depth description of the
methodology, including details on the rewriting rules, the decomposition algorithm, and the mechanics
of materializing rewritten queries. Section 3 outlines the experimental setup, where queries sourced or
derived from established RDF benchmarks (BSBM, LDBC-SNB, SP²Bench) are used. Section 4 presents
empirical results illustrating the benefits of the proposed technique in terms of cache efficiency and
overall storage reduction. Section 5 discusses key observations and trade-offs that emerge from these
experiments. Finally, Section 6 offers concluding remarks and suggests directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>This section introduces the methodology used to develop the rewriting and decomposition algorithms.
Decomposing queries based on unions allows for efficient reconstruction of the materialized views,
as recombination is a simple concatenation. However, for this to work, it is essential that no other operations
appear after the union. Since queries rarely have a union operation as the top-level operand, they
need to be rewritten to elevate the union operations to the top level, ensuring that the query can be
properly decomposed. To enable this efficient rewriting, each query is first translated from its SPARQL
syntax into a hierarchical tree structure or query plan. Next, the query is rewritten to move unions
to the top of this tree. As query execution occurs from bottom (leaf) to top (root), we refer to this
process as the Union Deferring Rewriting System (UDRS). Finally, the query can be decomposed into
subqueries, each corresponding to a part of the original query that is now separated by the union
operation. The following subsections present the rewriting rules, the rewriting algorithm, and the
decomposition algorithm. To clarify the algorithms in these subsections, we continue the example
from the introduction in the field of public transport. Listing 1 provides the SPARQL query for this
example; it queries the buses, trains, and trams and filters out those that were not on time.
PREFIX ex: &lt;http://example.com/&gt;
SELECT ?route ?type ?status WHERE {
  {
    # Bus BGP
    ?route rdf:type ex:BusRoute .
  }
  UNION
  {
    # Train BGP
    ?route rdf:type ex:TrainRoute .
  }
  UNION
  {
    # Tram BGP
    ?route rdf:type ex:TramRoute .
  }
  # Time BGP
  ?route ex:scheduledDeparture ?scheduledTime .
  ?route ex:actualDeparture ?actualTime .
  FILTER (?scheduledTime != ?actualTime)
}</p>
      <sec id="sec-2-1">
        <title>Listing 1: SPARQL query that checks if public transport routes are on time.</title>
        <sec id="sec-2-1-1">
          <title>2.1. Binary Tree Query Representation</title>
          <p>In order to facilitate the process of query rewriting and optimization, it is useful to represent a SPARQL
query in an algebraic form:
Definition 1. The algebraic form of a query is represented as a rooted binary tree, wherein each internal
node represents either a unary or binary algebraic operator, with the number of children corresponding
to the arity of the operator. The order of these children matters only if the operator is non-commutative,
meaning that the Left-Hand Side (LHS) and Right-Hand Side (RHS) operands can't be swapped. Finally, each
leaf represents a Basic Graph Pattern (BGP) operand that is composed of one or more triple patterns.
Note that the root of a query tree does not have to represent a projection, i.e. the tree may represent
only part of a query. This representation encodes algebraic operator precedence within its hierarchy.
According to SPARQL's bottom-up semantics, each operator is evaluated only after its operands, which
means that a node is only evaluated after all its children. This allows any subtree to be conceptually
replaced by a leaf representing its (future) solution, and vice versa. Queries are expressed in their
algebraic form, while tree terminology is used to reference the hierarchical relationships within them.
Figure 1 illustrates the algebraic tree representation of the query in Listing 1.</p>
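          <p>To make Definition 1 concrete, the following is a minimal Python sketch of such a binary query tree. It is our own illustration, not the authors' implementation; all class names and operator labels are assumptions.</p>
          <preformat>
```python
from dataclasses import dataclass

# A leaf holds a Basic Graph Pattern (BGP): a set of triple patterns.
@dataclass(frozen=True)
class BGP:
    triple_patterns: frozenset

# An internal node holds one algebraic operator; the number of children
# equals the operator's arity. Child order matters only for
# non-commutative operators (e.g. LEFT_JOIN, MINUS), where LHS and RHS
# can't be swapped.
@dataclass
class OpNode:
    op: str          # e.g. "UNION", "JOIN", "FILTER", "PROJECT"
    children: list

    def is_commutative(self):
        return self.op in ("UNION", "JOIN")

# The bus/train union from Listing 1 as a subtree:
bus = BGP(frozenset({("?route", "rdf:type", "ex:BusRoute")}))
train = BGP(frozenset({("?route", "rdf:type", "ex:TrainRoute")}))
subtree = OpNode("UNION", [bus, train])
```
          </preformat>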
        </sec>
        <sec id="sec-2-1-2">
          <title>2.2. The Rewriting System</title>
          <p>The process of rewriting consists of systematically applying all applicable rules from a set of rules,
referred to as a rewrite system, to the given input. Applying a rule consists of first identifying an instance
of a rule's pattern and then substituting it with an instance of the rule's corresponding replacement
pattern. We provide the definition of an Abstract Rewrite System (ARS):
Definition 2. An ARS is a 2-tuple (A, →), where A is a set and → is a binary relation on A, i.e. a
subset of the Cartesian product A × A.</p>
          <p>We write a → a′ (a, a′ ∈ A) to denote that a′ is derivable from a by a single rule application, and
a →* a′ to denote derivability by zero or more applications. The normal form of a is a′ if a →* a′ and
no rule is still applicable to a′; in other words, there does not exist an a′′ ∈ A such that (a′, a′′) ∈ →.
Based on the set of query trees representing valid algebraic expressions in the Binary Tree Query
Representation from the previous section, we define an ARS, referred to as the Union Deferring Rewriting System
(UDRS), to defer the evaluation of union operators whose ancestor operators are only unions. The
objective is to maximize the number of union nodes that have only union operator ancestors. A query
tree is in normal form when it is no longer possible to decrease the depth of any union node such that
it could have only union operator ancestors.</p>
          <p>Each rewrite rule that decreases the depth of a union node does so by applying a property based on the
arity of the operator its parent represents. Unary parent operators must preserve the union operator,
while binary parent operators must be left-, right-, or fully distributive over the union operator.
Below, we provide rewrite rules for all algebraic operators that satisfy these properties. Afterwards, we also
specify the cases where these properties are not valid. The subtrees of the children that do not change
when applying a property are represented by their (future) solutions Ω1, Ω2, and Ω3.</p>
          <p>Definition 3 (Union Deferring Rewriting System). Let Ω1, Ω2, and Ω3 be multisets of solution mappings,
F be an expression, and V be a set of query variables.</p>
          <p>R1: The filter operator preserves the union operator
σ_F(Ω1 ∪ Ω2) → σ_F(Ω1) ∪ σ_F(Ω2)
R2: The join operator is distributive over the union operator
(Ω1 ∪ Ω2) ⋈ Ω3 → (Ω1 ⋈ Ω3) ∪ (Ω2 ⋈ Ω3)
Ω1 ⋈ (Ω2 ∪ Ω3) → (Ω1 ⋈ Ω2) ∪ (Ω1 ⋈ Ω3)
R3: The left join operator is right-distributive over the union operator
(Ω1 ∪ Ω2) ⟕ Ω3 → (Ω1 ⟕ Ω3) ∪ (Ω2 ⟕ Ω3)
R4: The minus operator is right-distributive over the union operator
(Ω1 ∪ Ω2) − Ω3 → (Ω1 − Ω3) ∪ (Ω2 − Ω3)
R5: The projection operator preserves the union operator
π_V(Ω1 ∪ Ω2) → π_V(Ω1) ∪ π_V(Ω2)</p>
          <p>Note that the only combinations of operator and union operands for which no corresponding rule is
provided are the minus or left join operator with a union operation as the RHS operand, since neither
of these is left-distributive over the union operator. Also, a union operator with a union operation
on the LHS or RHS is not rewritten, as this does not decrease the depth of the union operators. Below,
we define the patterns that can't be rewritten:
Definition 4 (Invalid Rewrite Patterns). Let Ω1, Ω2, and Ω3 be multisets of solution mappings, and F
be an expression.</p>
          <p>The left join operator is not left-distributive over the union operator</p>
          <p>Ω1 ⟕ (Ω2 ∪ Ω3)
The minus operator is not left-distributive over the union operator</p>
          <p>Ω1 − (Ω2 ∪ Ω3)
The union operator over the union operator is not rewritten
(Ω1 ∪ Ω2) ∪ Ω3
Ω1 ∪ (Ω2 ∪ Ω3)
As stated in Definition 1, each internal node of the binary query tree represents only a single algebraic
operator. An important drawback of this representation is that a sequence of n consecutive union
operators is represented by n union nodes. Consequently, if they are all rewritable (which is logically
equivalent to one being rewritable), then the rewrite process consists of at least n rewrite sequences,
since each rewrite sequence repeatedly rewrites only one union node. Therefore, we generalize
our representation so that a single internal node represents all consecutive identical associative operators
within an algebraic subexpression. So, as an optimization, we can group subsequent operations into
one, creating an n-ary query tree. The proofs of the correctness of this optimization are out of scope of
this article. Figure 2 shows this optimization applied to the binary query tree in Figure 1. In this query,
the two consecutive unions can be grouped together into a ternary union operation.</p>
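          <p>As an illustration of how a single rewrite step works, the following Python sketch applies rule R2 to a join whose LHS is a union, using a toy nested-tuple encoding of query trees (our own encoding, not the paper's implementation):</p>
          <preformat>
```python
# Rule R2 (left case): (Ω1 ∪ Ω2) ⋈ Ω3 → (Ω1 ⋈ Ω3) ∪ (Ω2 ⋈ Ω3).
# Trees are nested tuples: ("JOIN", lhs, rhs), ("UNION", lhs, rhs),
# or a ("BGP", name) leaf standing in for its (future) solution.

def apply_r2_left(node):
    op, lhs, rhs = node
    assert op == "JOIN" and lhs[0] == "UNION", "R2 (left) not applicable"
    _, omega1, omega2 = lhs
    # The non-union operand rhs (Ω3) is duplicated into both branches.
    return ("UNION", ("JOIN", omega1, rhs), ("JOIN", omega2, rhs))

bus, train, time_bgp = ("BGP", "bus"), ("BGP", "train"), ("BGP", "time")
before = ("JOIN", ("UNION", bus, train), time_bgp)
after = apply_r2_left(before)
# The union node moved one level up: it is now the root, and the time
# BGP has been joined into each branch.
```
          </preformat>
          <p>Note how the rewrite decreases the depth of the union node by one, which is exactly the measure used in the termination argument of the next subsection.</p>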
        </sec>
        <sec id="sec-2-1-3">
          <title>2.3. Verification of the Rewriting System</title>
          <p>
            We need to fulfill the following properties involved in the verification of rewriting systems, as detailed
by Dershowitz et al. [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]:
• Termination: No infinite derivations are possible. Hence, each object a ∈ A has at least one
normal form.
• Confluence: For all objects a, b, b′ ∈ A such that a →* b and a →* b′, there exists an object c such
that b →* c and b′ →* c. In other words, for any two diverging sequences of rewriting steps, there
exists an object c to which both sequences can be rewritten. Hence, every object has at most one
normal form.
• Soundness: Applying a rule of the system only rewrites an input a1 ∈ A to an output a2 ∈ A
that is equal to it.
          </p>
          <p>To prove that a rewriting algorithm that uses the UDRS to rewrite an algebraic expression terminates
irrespective of the chosen rewrite strategy, we demonstrate that the UDRS itself is terminating.</p>
          <p>Property 1 (Termination of the Union Deferring Rewriting System). The UDRS is terminating.
Proof. We prove this by contradiction. Assume the existence of an infinite derivation. As shown, every
rewrite step of the derivation decreases the depth of one union node by one. Therefore, the existence of
an infinite derivation requires that the sum of the depths of all the union nodes is infinite, necessitating
an infinite number of nodes within the query tree. However, the number of nodes within the query tree
is finite.</p>
          <p>To prove that the normal form of an algebraic expression is independent of the rewrite strategy used in
the rewriting process, we prove that the UDRS is confluent.</p>
          <p>
            Property 2 (Confluence of the Union Deferring Rewriting System). The UDRS is confluent.
Proof. We use Newman's lemma [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] that states that a terminating ARS is confluent if it is locally
confluent. An ARS is locally confluent if for all objects a, b, c ∈ A such that a → b and a → c, there
exists d ∈ A such that b →* d and c →* d.
          </p>
          <p>Let
• a → b be the application of rule r1, denoted by →r1, to the subtree t1.
• a → c be the application of rule r2, denoted by →r2, to the subtree t2.</p>
          <p>Given the structure of a rewrite rule's pattern, it follows that the maximum overlap between two pattern
instances, t1 and t2, occurs when t2 is either a child subtree (operand) of t1's highest-level union node
or the non-union child subtree (operand) of t1's top-level operator node, or vice versa. Furthermore,
each rule treats these subtrees as indivisible units. Therefore, applying r1 will not modify t2, but might
create a copy of it, and vice versa.</p>
          <p>b can be rewritten to d as follows:
b →r1 →r1 d if r2 applies left- or right-distributivity and t1 is the non-union child subtree of t2's
top-level operator (the application of r2 would duplicate t1, so r1 is applied to both copies);
b →r1 d otherwise.
c can be rewritten to d as follows:
c →r2 →r2 d if r1 applies left- or right-distributivity and t2 is the non-union child subtree of t1's
top-level operator;
c →r2 d otherwise.</p>
          <p>Since the UDRS is both terminating and confluent, it follows that each query has exactly one normal
form. Hence, exhaustively applying all rewrite rules applicable to a query always yields the same
resulting query, regardless of application order.</p>
          <p>
            Finally, to prove that applying a rewrite rule of the UDRS—and hence, exhaustively applying all rewrite
rules applicable to a query—is sound, we build on the proofs from Schmidt et al. [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. Two query
trees (input and output of rewriting) are equal if the algebraic expressions they represent are equal.
These expressions are equal if the multisets of solution mappings they produce when evaluated on the
same RDF graph are equal. Hence, the UDRS is sound if applying any rewrite rule of it to a query tree
results in a query tree that is equal to the original.
          </p>
        </sec>
        <sec id="sec-2-1-4">
          <title>2.4. The Query Decomposition and Rewriting Algorithm</title>
          <p>Having established that the order of application of applicable rewrite rules used during the rewriting
process does not affect the normal form of the query tree, we have the flexibility to freely select the
rewrite strategy. The sole requirement is to apply all applicable rules such that the normal form is
reached. Consequently, each node must be visited at least once, as every node may be part of a rewrite
rule's pattern instance. An effective and simple way to find all rule pattern instances (subtrees) is
to locate all union nodes and then examine the parent operator node and determine which side the
union is on. This is due to the fact that each rule's pattern instance includes a union operation and,
as noted before in Definition 4, only four combinations of parent and child union operator nodes do not
constitute rule patterns. As such, we traverse the query tree using Depth First Search (DFS) rather than Breadth
First Search (BFS) because after applying a rewrite rule, a new rule can be applied to the union node
included in the rule's pattern instance and its new parent, which was previously its grandparent. This
holds except, as stated previously, for the cases in which the new parent does not exist, is a union
node, or is a minus or left join node with the union on its RHS. Therefore, maintaining a path from the
root enables applying a sequence of rewrites that rewrites the union to the top level if possible, without
revisiting nodes. Hence, adopting DFS is preferred as it inherently maintains this exploration path
during traversal. The structure of the rewriting algorithm is shown in Algorithm 1, with the helper
function isRewritable defined in Algorithm 2.</p>
          <p>Input: A query tree (root)
visited ← {}
while u ← DFS on root for a union node in the query tree that is ∉ visited do
    while isRewritable(unionPath(u)) do
        u ← applyRule(u, parent(u))
    end
    visited ← visited ∪ {u}
end</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Algorithm 1: The Union Deferring Rewriting System Rewriting Algorithm</title>
        <p>Where unionPath is a function that returns the path from the root to the given node, isRewritable is
a function that checks if there exists an applicable rule, and applyRule is a function that applies the
applicable rule and returns a reference to the union operator that has been rewritten upwards.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Input: A union path (unionPath) Output: boolean</title>
        <p>if unionPath.last = ∪ then // check parent of union
    return false
end
for i ← 0 to unionPath.length − 1 do
    if unionPath[i] ∈ {−, ⟕} ∧ unionPath[i + 1] is the RHS of unionPath[i] then
        return false
    end
end
return true</p>
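        <p>The check performed by Algorithm 2 can be sketched in Python as follows; the path encoding, a list of (operator, side) pairs from the root down to the union's parent, is our own assumption for illustration:</p>
        <preformat>
```python
# Sketch of isRewritable: a union node can be rewritten to the top level
# only if no ancestor blocks it. Each path entry is (operator, side),
# where side ("L" or "R") says on which side the path continues.

def is_rewritable(union_path):
    if not union_path:            # the union is already the root
        return False
    parent_op, _ = union_path[-1]
    if parent_op == "UNION":      # unions above unions are not rewritten
        return False
    for op, side in union_path:
        # MINUS and LEFT_JOIN are not left-distributive over union, so a
        # union reached through their RHS cannot be lifted past them.
        if op in ("MINUS", "LEFT_JOIN") and side == "R":
            return False
    return True

# The union of Figure 2: below a projection, a filter, and a join (LHS).
ok = is_rewritable([("PROJECT", "L"), ("FILTER", "L"), ("JOIN", "L")])
blocked = is_rewritable([("PROJECT", "L"), ("MINUS", "R")])
```
        </preformat>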
      </sec>
      <sec id="sec-2-4">
        <title>Algorithm 2: isRewritable(unionPath)</title>
        <p>
          The root of the tree in Figure 2 is the projection. From here, a DFS will go down the tree and encounter
the union node on the left of the join. As the path to the union does not include a minus or optional and
the parent is not a union, it is rewritable. The first rule from Definition 3 we can apply is R2; this
rewrites the union above the joins. This results in a union of three joins between the bus, train, or tram
BGPs and the time BGP. After that, the union can be rewritten above the filter by applying R1. Finally, R5
can be applied to rewrite the union above the projection. The resulting tree after all these rewriting
steps is shown in Figure 3.
This query tree now consists of three separate queries that can easily be decomposed into separate
queries and individually executed, and the result can then be recombined by simple concatenation.
Although it is uncommon for a SELECT query not to have a projection on top, the individual projections
would still give a correct query response.
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Query Materialization</title>
        <p>
          The process of materializing a query consists of executing it, storing its result, and continuously
maintaining that result. We associate the query's query tree representation with this view to identify it. We
store the query result and its tree representation together in a state referred to as the materialization
state. The current implementation of the algorithm does not support maintaining views, i.e. updating
the result as the underlying data changes. However, this functionality is orthogonal to our
current implementation and can, hence, be implemented later without altering or affecting any of the
existing functionality, for example with incremental query engines like Incremunica [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
To associate a given query with a materialization state, we first need to determine whether the query
(tree) was already materialized previously by determining whether it is equivalent to any materialized
query tree. Determining whether two query trees are equivalent is performed by statically determining
whether their root nodes are equivalent. Determining the equivalence of two (root) nodes can be done
recursively. The base case is whether the sets of triple patterns of two BGP leaf nodes are equal. The
recursive case consists of determining whether the children of the two nodes are equal, ordered if
the identical operation the nodes represent is non-commutative (associativity is not considered as
it is beyond the scope of this research), and whether any other properties they have are equivalent. These
properties are expressions used as conditions in filters or left joins and sets of projection variables, of
which only the in-scope ones must be equal.
        </p>
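        <p>The recursive equivalence test described above can be sketched as follows (a toy encoding of our own; the released implementation may differ):</p>
        <preformat>
```python
# Nodes: ("BGP", frozenset_of_triple_patterns) leaves, or
# (op, properties, children) internal nodes, where `properties` carries
# filter/left-join expressions or in-scope projection variables.

COMMUTATIVE = {"UNION", "JOIN"}

def equivalent(a, b):
    if a[0] != b[0]:
        return False
    if a[0] == "BGP":               # base case: equal triple-pattern sets
        return a[1] == b[1]
    op, props_a, kids_a = a
    _, props_b, kids_b = b
    if props_a != props_b or len(kids_a) != len(kids_b):
        return False
    if op in COMMUTATIVE:           # order of operands is irrelevant
        remaining = list(kids_b)
        for child in kids_a:
            match = next((k for k in remaining if equivalent(child, k)), None)
            if match is None:
                return False
            remaining.remove(match)
        return True
    return all(equivalent(x, y) for x, y in zip(kids_a, kids_b))

bus = ("BGP", frozenset({("?r", "rdf:type", "ex:BusRoute")}))
tram = ("BGP", frozenset({("?r", "rdf:type", "ex:TramRoute")}))
same = equivalent(("UNION", None, [bus, tram]), ("UNION", None, [tram, bus]))
```
        </preformat>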
        <p>If the query tree is equal to an already materialized one, we directly return the already materialized
answer (cache hit). In case no equal materialized view is found (cache miss), the query needs to be
executed. Without query decomposition this would mean executing the complete query using a query
engine and storing the result together with the tree representation of the query as a materialization
state, such that other queries can reuse the results. This materialization technique is referred to as Full
Query Materialization (FQM).</p>
        <p>However, by utilizing the query rewriting and decomposition techniques outlined previously in this
article, we can achieve a more reusable and resource-efficient algorithm, called Decomposed Query
Materialization (DQM). A given query is rewritten and decomposed into a multiset of subqueries,
which are executed in parallel since each is independent. The results of these subqueries are stored and
materialized through virtual views identified by their query trees. The original query is also materialized
via a virtual view that references the materialized views of its subqueries and concatenates them on
demand. If the query is not split into at least two subqueries, materializing its single subquery suffices to
materialize the entire query, rendering the final step redundant. Remark that a subquery may already be
materialized through multiple (sub)subqueries, leading to the query's virtual view referencing another
virtual view. However, this can be avoided by incorporating all references contained in that virtual
view.</p>
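        <p>Putting the pieces together, the DQM flow can be sketched as follows (our own simplification; the cache keys, the execute callback, and the tuple encoding are assumptions, not the released API):</p>
        <preformat>
```python
# Decomposed Query Materialization: split a normalized tree on its
# top-level union, materialize each subquery once, and answer the full
# query as a virtual view that concatenates the subquery views.

def decompose(tree):
    if tree[0] == "UNION":
        return list(tree[1:])     # one subquery per union branch
    return [tree]                 # no top-level union: a single subquery

def materialize(tree, cache, execute):
    results = []
    for sub in decompose(tree):
        if sub not in cache:      # cache miss: run and store the subquery
            cache[sub] = execute(sub)
        results.extend(cache[sub])  # recombination is concatenation
    return results

# Toy "engine": each BGP leaf evaluates to a one-element result list.
execute = lambda sub: [sub[1]]
cache = {}
q1 = ("UNION", ("BGP", "bus"), ("BGP", "train"), ("BGP", "tram"))
r1 = materialize(q1, cache, execute)   # three cache misses
q2 = ("UNION", ("BGP", "bus"), ("BGP", "metro"))
r2 = materialize(q2, cache, execute)   # reuses the cached bus view
```
        </preformat>
        <p>After the first call, a query covering only buses and metros recomputes just the metro subquery; the bus view is shared, which is precisely the redundancy reduction DQM targets.</p>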
        <p>Without query decomposition, the materialized view from the query in Listing 1 would contain all
buses, trains, and trams that are delayed. Other data consumers that require only the delayed buses cannot use
this materialized view. So for the original data consumer, a virtual view is established that combines
the results of the decomposed subqueries. For other data consumers that require delays for buses and
metros, the query can be decomposed as well, so that the bus results can be reused from the original
data consumer and the metro results can then be calculated in a separate materialized view. This means
the new data consumer only needs to calculate part of their query, increasing performance.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation Setup</title>
      <p>
        In this section, we outline the evaluation strategy used to compare Full Query Materialization (FQM),
i.e. the methodology that is currently used in dataspaces, and Decomposed Query Materialization
(DQM), i.e. the algorithm proposed in this article. The code for the implemented query rewriting
and materialization algorithm can be found at
https://github.com/KNowledgeOnWebScale/querydecomposition. This repository also contains all the code to execute the evaluations described below.
      </p>
      <sec id="sec-3-0">
        <title>3.1. Datasets</title>
        <p>
          We source realistic queries and the datasets to execute them on from other RDF benchmarks and execute
them against a locally hosted SPARQL endpoint. Specifically, we select relevant benchmarks from the
RDF Store Benchmarking W3C Wiki1: The BSBM [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]; The LDBC-SNB [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], where test queries are
sourced from the Business Intelligence [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] and Interactive [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] workloads; and SP²Bench [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The
remaining ones are considered irrelevant for one or more of the following reasons: their source code is
not or no longer public, they provide or use no queries, none of their queries contain at least one union,
or they use query-mined queries that would require too much manual attention.
        </p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Evaluation Scenarios</title>
        <p>We create four scenarios for each query selected from the datasets. In each scenario the query or one
or more derived versions of it are materialized. Each scenario is parameterized by a materialization
technique and consists of materialization processes that each have two input parameters: a query and a
materialization state.</p>
        <p>The first scenario consists of materializing the original query with an empty materialization state, and
is referred to as the Full Query (FQ) scenario. It demonstrates a cache miss for both materialization
techniques, i.e. FQM and DQM, and thus requires the most work to materialize the query. The resulting
materialization state is used as input for the remaining scenarios.</p>
        <p>In the second scenario, called Already Materialized Full Query (AMFQ) scenario, the original query is
executed again to show a cache hit for both techniques.</p>
        <p>Third, the Only One SubQuery (OOSQ) scenario executes each subquery that is derived from rewriting
and decomposing the original query. This constitutes a cache miss for the FQM case (as the subquery is
not a match with the complete query), but will lead to a cache hit in the DQM case (as it will match on
a decomposition).</p>
        <p>Finally, the Changed One SubQuery (COSQ) scenario consists of materializing one query variant for
each subquery identified through rewriting and decomposition of the original query. The i-th variant is
derived from the original query such that the i-th subquery of its decomposition differs from the original
query’s decomposition, showing a partial cache hit for DQM and a cache miss for FQM.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Deriving Test Queries From The Selected Datasets</title>
        <p>Each test query must meet the following criteria: the query should be a SELECT query and contain at
least one rewritable union node. Additionally, the query’s execution must yield a result with more than
a handful of solution mappings, because materializing and caching queries with small results is useless:
computing the result is easier than finding the matching materialized query. The result size should,
however, remain below the realistic limit of 10,000 solution mappings, a limit adopted from Virtuoso2. For
each eligible query in the selected datasets, we first assess whether it meets the criteria stipulated
above. If so, we use it directly. If not, we transform the query such that it
meets the criteria, i.e. by changing the query type to SELECT and removing unsupported operators like
aggregations, ORDER BY, and LIMIT. The scenario run metrics of queries derived from the same source
query are averaged. In the end, 5 queries were retained for LDBC-SNB, 1 for BSBM, and 1
for SP²Bench. The complete list of used queries and data can be found at
https://doi.org/10.5281/zenodo.15015062.</p>
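<p>The selection criteria above can be sketched as a simple predicate; the function name, input shape, and the lower threshold are our own assumptions, only the 10,000 upper limit is taken from Virtuoso:</p>

```javascript
// Eligibility check for candidate test queries (illustrative sketch).
function isEligible({ queryType, unionCount, resultSize }) {
  const MIN_RESULTS = 10;     // "more than a handful" (assumed value)
  const MAX_RESULTS = 10000;  // realistic limit adopted from Virtuoso
  return queryType === 'SELECT'
    && unionCount >= 1        // at least one rewritable union node
    && resultSize > MIN_RESULTS
    && resultSize <= MAX_RESULTS;
}

const eligible = isEligible({ queryType: 'SELECT', unionCount: 2, resultSize: 500 });
const rejected = isEligible({ queryType: 'CONSTRUCT', unionCount: 2, resultSize: 500 });
```

<p>Queries failing only the first two checks are transformed (query type changed to SELECT, unsupported operators removed) rather than discarded.</p>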
        <sec id="sec-3-2-1">
          <title>Footnotes</title>
          <p>1. https://www.w3.org/wiki/RdfStoreBenchmarking</p>
          <p>2. https://virtuoso.openlinksw.com/</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We measure two metrics for each scenario run: the execution time of each step in the materialization
process and the space taken up by (or size of ) the materialization state after completing the run. All
scenarios are run using Node.js v21.1.0 under WSL on a machine with an Intel® Core™ i5-7400 CPU
and 16 GB of DDR4 RAM in dual-channel configuration. Each scenario is run twice for warmup and
then ten times to gather metrics. Each displayed metric value represents the average value across these
ten runs. Each FQM and DQM scenario run’s metrics are reported in Tables 1, 2, and 3, where each
single cell is formatted as FQM⁄DQM. ’-’ denotes that the metric was not measured because of the absence
of its associated step in that scenario run. For the benchmark queries, we report two values for the average
number of children (operands) of a union node, which also represents the average number of subqueries
in a decomposition, because of the LDBC-SNB 3 outlier, which has 14 union operands. The
overall average is 4.00, while the average without this outlier is 2.33.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>This section discusses the general observations that can be made from the results tables.
While network overhead is negligible with our local SPARQL endpoint, it can be significant in practice.
This renders the comparison of query rewriting to total materialization time slightly misleading here,
as rewriting constitutes a larger portion of the total time than it would in practice.
The total materialization state size overhead in the FQ and AMFQ scenarios is 8.32% (without Q3) or
36.79% (with Q3). In the OOSQ and COSQ scenarios, by contrast, the DQM state size is 25.46% (without
Q3) or −0.25% (with Q3) smaller than the FQM state (the latter figure being caused by the large number of
subquery trees). Hence, we observe that if any subquery’s materialized view is used to obtain an answer
to a query other than the original query, the size of the materialization state is smaller than if FQM
were used, provided that the number of subqueries is not excessive. The OOSQ and COSQ scenarios are
a cache hit and a partial cache hit for DQM, while both are cache misses for FQM, forcing it to create new
materialized views that duplicate already materialized data. Thus, the size of the FQM materialized
views grows larger than that of DQM, demonstrating that the views created by DQM can be used to
(partially) answer queries that the views created by FQM cannot.</p>
      <p>Initially materializing a query using DQM is, on average, 193.22% (without Q3) or 8689.93% (with Q3)
slower (the latter figure being because of having to materialize 14 subqueries). The execution time
taken by DQM compared to FQM, when both have a cache hit, is, on average, 15.76% (0.39s)
(without Q3) or −130.42% (21.72s) (with Q3) longer (the latter again being caused by the large number
of subqueries), which is attributed to the larger number of views to check against. In the OOSQ
cache hit scenario, by contrast, DQM is 89.43% (without Q3) or 98.66% (with Q3) faster, and in the COSQ partial cache
hit scenario, it is 4.59% (without Q3) or −178.73% (with Q3) faster. The COSQ gain being a mere 4.59%
is because, for queries with short execution times, the time taken to rewrite and decompose constitutes
46.93% of the total execution time. The −178.73% figure is due to the execution of the non-materialized
subquery taking significantly longer than the full query.</p>
      <p>We observe that the time spent rewriting and decomposing a query is significant, constituting, on
average, 18.63% (without Q3) or 1.28% (with Q3) of the total execution time in the FQ scenario, and
42.73% (without Q3) or 38.34% (with Q3) in the COSQ scenario. Moreover, we observe a clear correlation, especially
for LDBC-SNB 3, between the effort required to rewrite a union node and the quantity and complexity
of its ancestors, which are duplicated by the rewriting process. However, rewriting and decomposing
can reduce the time taken by the rest of the materialization process on (partial) cache hits, depending
on the time needed to answer non-materialized subqueries (done in parallel). The second component of
the time overhead of DQM compared to FQM is that materializing all subqueries in parallel may take
longer than executing the entire query.</p>
      <p>Our analysis demonstrates that while query decomposition and rewriting (DQM) introduce a notable
overhead in initial materialization and execution time, they offer significant benefits in reducing storage
redundancy and improving cache utilization. The trade-off between query execution time and storage
efficiency is particularly evident in scenarios with frequent cache hits, where DQM outperforms full
query materialization (FQM) by leveraging reusable subquery results. However, the effectiveness of
DQM is influenced by the complexity of query decomposition, particularly in cases with a high number
of unions to rewrite and decompose. Despite these challenges, our results indicate that DQM minimizes
redundant materialization, improves reuse of query results, and enhances performance in cache hit
scenarios, making it a viable strategy for optimizing query execution in dataspaces.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we aimed to reduce the total size of materialized views used in dataspaces by increasing
their reusability. We achieved this by mitigating a portion of the overlap between the views of related
SPARQL queries using our proposed methodology. This methodology consisted of materializing the
solution of each subquery, resulting from first rewriting and then decomposing the query by union
operators whose ancestor operators were only other unions and the top-level projection. Subsequently,
it involved creating a virtual view for the query itself that referenced those subquery views.
Consequently, when queries whose decompositions shared subqueries were materialized, the result was virtual
views that referenced the same corresponding subquery views, thereby reducing the total size of the
materialized views. We developed, implemented, and integrated the rewriting and the decomposition
algorithm into the process of query materialization, deriving a new materialization technique termed
Decomposed Query Materialization (DQM). We evaluated this new technique against (direct) Full Query
Materialization (FQM) and showed two of the situations in which the views created by DQM can
be used to (partially) answer queries that the views created by FQM cannot, notably by forcing FQM to
create new materialized views that duplicate already materialized data. Although our implementation
struggled with queries containing many unions, in most situations DQM returned query answers faster
than FQM.</p>
      <p>Future work should focus on improving the performance of the query rewriting process, particularly
for queries with extensive union operations. As demonstrated in our results, the current implementation
does not perform optimally when faced with a high number of unions, leading to increased rewriting
overhead. Optimizing the rewriting algorithm to handle deep or complex union structures more
efficiently could significantly enhance the overall materialization process.</p>
      <p>While our approach focuses on decomposing queries based on union operations, future research could
explore decomposition strategies based on other SPARQL algebraic operations.</p>
      <p>Finally, extending the implementation to support a broader range of SPARQL algebra operators would
improve its applicability. Currently, our method does not fully support operations such as aggregations,
ORDER BY, LIMIT, etc. Incorporating these features would allow a wider variety of queries to
benefit from decomposition-based materialization. Handling aggregations efficiently, for instance, would
require the development of specialized decomposition techniques that allow incremental aggregation
computation over shared subqueries.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was partly funded by the SolidLab Vlaanderen project (Flemish Government, EWI and RRF
project VV023/10) and the FWO Project FRACTION (Nr. G086822N).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the writing of this paper, the author(s) used DeepL and GPT-4o to check grammar, translation,
and spelling. After using these tools/services, the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <article-title>From databases to dataspaces: a new abstraction for information management</article-title>
          ,
          <source>SIGMOD Record</source>
          <volume>34</volume>
          (
          <year>2005</year>
          )
          <fpage>27</fpage>
          -
          <lpage>33</lpage>
          . URL: https://dl.acm.org/doi/10.1145/1107499.1107502. doi:10.1145/1107499.1107502.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <article-title>Principles of dataspace systems, in: Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems</article-title>
          , ACM,
          <year>2006</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . URL: https://dl.acm.org/doi/10.1145/1142351.1142352. doi:10.1145/1142351.1142352.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <article-title>Materialization and decomposition of dataspaces for efficient search</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>23</volume>
          (
          <year>2011</year>
          )
          <fpage>1872</fpage>
          -
          <lpage>1887</lpage>
          . URL: https://ieeexplore.ieee.org/abstract/document/5611525. doi:10.1109/TKDE.2010.213.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          ,
          <article-title>Federated data integration in data spaces</article-title>
          .,
          <source>Designing Data Spaces</source>
          <volume>181</volume>
          (
          <year>2022</year>
          ). URL: https://library.oapen.org/bitstream/handle/20.500.12657/57901/978-3-030-93975-5.pdf#page=190.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vandenbrande</surname>
          </string-name>
          ,
          <article-title>Aggregators to realize scalable querying across decentralized data sources</article-title>
          ,
          <source>in: ISWC2023</source>
          , the International Semantic Web Conference,
          <year>2023</year>
          . URL: https://biblio.ugent.be/publication/01HFXXKQ51BYBEMG4Z01EK4CH1/file/01HFXXN3ZQNVKX0M4J5MFHP2RH.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>Scalable SPARQL querying of large RDF graphs</article-title>
          ,
          <source>VLDB</source>
          <volume>4</volume>
          (
          <year>2011</year>
          )
          <fpage>1123</fpage>
          -
          <lpage>1134</lpage>
          . URL: https://doi.org/10.14778/3402707.3402747. doi:10.14778/3402707.3402747.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yuan</surname>
          </string-name>
          , L. Liu,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>Scalable SPARQL querying using path partitioning</article-title>
          ,
          <source>in: 2015 IEEE 31st International Conference on Data Engineering</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>795</fpage>
          -
          <lpage>806</lpage>
          . URL: https://ieeexplore.ieee.org/abstract/document/7113334. doi:10.1109/ICDE.2015.7113334, ISSN: 2375-026X.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Endris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Rohde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-E.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , Ontario:
          <article-title>Federated query processing against a semantic data lake</article-title>
          , in: S. Hartmann,
          <string-name>
            <given-names>J.</given-names>
            <surname>Küng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakravarthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Anderst-Kotsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Tjoa</surname>
          </string-name>
          , I. Khalil (Eds.),
          <source>Database and Expert Systems Applications</source>
          , Springer International Publishing, Cham,
          <year>2019</year>
          , pp.
          <fpage>379</fpage>
          -
          <lpage>395</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nural</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Koksal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ozcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dogac</surname>
          </string-name>
          ,
          <article-title>Query decomposition and processing in multidatabase systems</article-title>
          ,
          <source>in: OODBMS Symposium of the European Joint Conference on Engineering Systems Design and Analysis</source>
          , Montpellier, Citeseer,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          ,
          <article-title>Query decomposition and view maintenance for query languages for unstructured data</article-title>
          ,
          <source>in: VLDB</source>
          , volume
          <volume>96</volume>
          ,
          Citeseer
          ,
          <year>1996</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dershowitz</surname>
          </string-name>
          ,
          <article-title>Computing with rewrite systems</article-title>
          ,
          <source>Information and Control</source>
          <volume>65</volume>
          (
          <year>1985</year>
          )
          <fpage>122</fpage>
          -
          <lpage>157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M. H. A.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <article-title>On theories with a combinatorial definition of “equivalence”</article-title>
          ,
          <source>Annals of Mathematics</source>
          <volume>43</volume>
          (
          <year>1942</year>
          )
          <fpage>223</fpage>
          -
          <lpage>243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          et al.,
          <article-title>Foundations of SPARQL query optimization</article-title>
          ,
          <source>in: Proceedings of the 13th International Conference on Database Theory</source>
          , ICDT '10, Association for Computing Machinery, New York, NY, USA,
          <year>2010</year>
          , pp.
          <fpage>4</fpage>
          -
          <lpage>33</lpage>
          . doi:10.1145/1804669.1804675.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vandenbrande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Taelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ongenae</surname>
          </string-name>
          , Incremunica:
          <article-title>Web-based incremental view maintenance for SPARQL</article-title>
          ,
          <source>in: The Semantic Web: 22nd International Conference, ESWC</source>
          <year>2025</year>
          , June 1-5,
          <year>2025</year>
          , Proceedings 22, Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          et al.,
          <article-title>The Berlin SPARQL benchmark</article-title>
          ,
          <source>International Journal on Semantic Web and Information Systems (IJSWIS)</source>
          <volume>5</volume>
          (
          <year>2009</year>
          )
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Angles</surname>
          </string-name>
          et al.,
          <source>The LDBC Social Network Benchmark</source>
          , CoRR abs/2001.02299 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Szárnyas</surname>
          </string-name>
          et al.,
          <article-title>An early look at the LDBC Social Network Benchmark's Business Intelligence workload</article-title>
          , in: GRADES-NDA at SIGMOD/PODS, ACM,
          <year>2018</year>
          , pp. 9:1-9:11.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>O.</given-names>
            <surname>Erling</surname>
          </string-name>
          et al.,
          <article-title>The LDBC Social Network Benchmark: Interactive workload</article-title>
          , in: SIGMOD,
          <year>2015</year>
          , pp.
          <fpage>619</fpage>
          -
          <lpage>630</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          et al., SP²Bench:
          <article-title>a SPARQL performance benchmark</article-title>
          ,
          <source>in: 2009 IEEE 25th International Conference on Data Engineering</source>
          , IEEE,
          <year>2009</year>
          , pp.
          <fpage>222</fpage>
          -
          <lpage>233</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>