<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Provenance-Based Caching System to Speed-up SPARQL Query Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gianmaria Silvello</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dennis Dosso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Siav S.p.A.</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padua</institution>
          ,
          <addr-line>Via Gradenigo 6/b, Padova, 35131</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a new provenance-based approach for the dynamic extraction of subgraphs from an RDF database that can be used as a cache for a faster response to SPARQL queries. The proposed system can deal with dynamic workloads, allowing us to use the cache to answer queries similar but not identical to others seen before by the system. Moreover, our approach exploits existing technologies and can be seamlessly integrated on top of existing RDF DBMS without requiring any change to their internals. We conducted an in-depth evaluation on a widely used synthetic database (BSBM) and on Wikidata using real-world query logs. We show that the proposed approach improves the query execution times over the whole database.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;SPARQL</kwd>
        <kwd>Caching</kwd>
        <kwd>Provenance</kwd>
        <kwd>RDF datasets</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, the Web of Data has continuously expanded, accompanied by a new generation of
semantically enhanced applications that leverage this infrastructure to develop novel services [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
In this context, the Resource Description Framework (RDF) and the SPARQL query language
are the cornerstones to represent, share, and query resources [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. RDF can represent, query,
and exchange heterogeneous data in distributed and unconstrained environments [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. These
technologies play a central role in the creation and access of Knowledge Graphs (KG), which
have gained increasing prominence in both academic and industrial settings [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, RDF
introduces challenges for downstream applications, especially when eficiently querying large
graphs in real time is needed.
      </p>
      <p>
        Being able to query large KGs in real-time eficiently is a key to improving their use in
downstream applications such as (1) exploratory search and (2) integration with Large Language
Models (LLMs). Exploratory search on KGs arises when a user needs to understand and extract
insights from an unfamiliar KG. In these exploratory sessions, the users issue a series of SPARQL
queries to identify relevant portions of the KG that can answer their questions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Whereas
integrating LLMs with the symbolic knowledge encoded in KGs is a promising avenue for
enhancing the precision of factual query answering by LLMs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        These applications are highly interactive and/or resource-intensive and, in both cases, require
fast answering times where query completeness can be a tradeof with query eficiency, especially
in the first phases of KG exploration and integration [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. There is vast work in the literature
[
        <xref ref-type="bibr" rid="ref10 ref11 ref2 ref9">2, 9, 10, 11</xref>
        ] to improve the performances of RDF DBMS in answering SPARQL queries. Two main
directions have been pursued, focusing on static workloads – i.e., fixed sets of known queries not
changing over time – and on dynamic workloads – i.e., unknown queries that may have little or
no similarity with queries issued before – that better suit RDF-based real-world Web applications.
Working with dynamic workloads is generally hard since caching already seen queries only
partially covers the workload and the results may be very big. Moreover, understanding if
two queries are equivalent or similar (pairwise similarity) is a dificult problem [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Query
containment for SPARQL fragments is NP-complete for conjunctive queries and undecidable for
general queries [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
        ]. Thus, it is also not practical to face the caching task by resolving to
check for each new query if it is contained in some previously answered query [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        Despite these dificulties, the dynamic workload research direction is well-suited to target the
typical usage scenario of a SPARQL endpoint represented by an agent who “queries the data
and gradually refines the search until the desired result is obtained” [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The users interested
in RDF data typically use programmatic query clients to retrieve information from SPARQL
endpoints [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] producing subsequent queries with similar query patterns difering in specific
attributes of triple patterns. Zhang et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] also noticed that many queries issued to SPARQL
endpoints are robotic queries, i.e., queries produced by automated scripts or programs for index
size inferring, data crawling, or malicious attacking. As an example of this phenomenon, in
the query log datasets explored in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], 90% of queries are provided by only 0.4% automated
software programs and, of these, the percentage of unique query templates was 0.28%. Further
evidence is given by Bonifati et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] who analyzed a large query log from DBpedia highlighting
a high number of unique queries with long streaks of similar queries repeated in narrow
timeframes that call for “optimization techniques for sequences of similar queries”. This aspect
is further emphasized if we consider that the query load and server performance of another
widely-used KG as Wikidata are dominated by robotic queries [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and that such queries report
a higher cluster similarity than organic ones [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        Following these observations, we propose a new workload-adaptive approach based on
data provenance [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] that allows us to detect the triples in an RDF graph responsible for the
computation of a query result set. We propose a system – the Cache-Based Architecture (CBA)
– employing a cache to store the triples in the lineage of past-seen queries and leverage them to
quickly answer similar, but not necessarily identical, queries without adding any computational
complexity to the standard query processing.
      </p>
      <p>We evaluated CBA on BSBM, a synthetic database, instantiated to 250K, 1M, 25M, and 100M
triples to check system scalability and on Wikidata using a real query log. The evaluation
shows that the query provenance can be used to answer previously unseen queries, presents
sizeable speed-ups achievable with diferent cache sizes, shows that the system scales well to
large databases and query logs, and achieves good completeness even for first-time queries.</p>
      <p>
        The advantages of CBA are that (i) it natively exploits the technologies already implemented
by RDF DBMS without modifying them or developing a new system from scratch. As such, it can
be easily deployed on top of already functioning systems; (ii) it can be used as a system to create
smaller databases composed of the most relevant triples used by the queries coming from large
dynamic workloads; and (iii) it helps users in scenarios where they need (even partial) results
quickly to refine their query, for instance, in the case of tasks such as exploratory search [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]
and exploratory analytics [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>The rest of the paper is organized as follows: Section 2 presents some preliminaries RDF,
SPARQL, and data provenance) and introduces the running example; Section 3 illustrates the
CBA system architecture; Section 4 describes how provenance and data impact are calculated,
and the cache is maintained and updated; Section 5 reports the experimental setup and reports
the results of the evaluation; Section 6 presents some related work; and, Section 7 draws some
conclusions and outlines future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <p>
        The RDF is a family of specifications developed and supported by the W3C 1 to represent
interlinked data on the Web. The key component of RDF is the statement – a &lt;subject,
predicate, object&gt; triple-based structure [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. We assume the pairwise disjoint infinite
sets  (IRIs),  (blank nodes), and  (literals). Assume also an infinite set of variables ,
disjointed from the above sets. A triple (, , ) ∈ ( ∪ ) ×  × ( ∪  ∪ ) is called
RDF triple. An RDF graph is a set of RDF triples. A triple pattern is an element of the set
  = ( ∪ ) × ( ∪ ) × ( ∪  ∪ ). A Basic Graph Pattern (BGP) is a finite set of triple
patterns {1, . . . , }.
      </p>
      <p>
        We focus on BGP queries because they are the most frequent type of queries [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] issued
to SPARQL endpoints. More precisely, we consider queries with a basic graph pattern that
can also contain the optional and filter clauses. For brevity, we call them BGP+ queries.
A built-in condition is constructed using elements from the set  ∪  ∪  , constants, logical
connectives, inequality and equality symbols, and other features [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. We do not consider the
queries employing the negation clause (not exists). The negation (not exists) can also be
achieved by combining the optional and filter clauses, but we exclude this option from the
queries considered in this work. Moreover, we exclude the queries employing property paths.
      </p>
      <p>contains the triple patterns to be matched onto the RDF data searching for specific
subgraphs in the database. The evaluation of a SPARQL pattern returns a multi-set of solution
mappings,  , that match the pattern to the data. A solution mapping is a partial function
 :  →  , where  is a set of triples.</p>
      <p>Figure 2 reports a simple RDF graph representing a use case with products connected to
features (e.g., type and price) and producers. The query Q1 shown in Figure 1 is an example of
BGP+ query with the filter operator. The result is &lt;product_3, Chemical Z, 150&gt;.</p>
      <p>
        Data Provenance is a form of metadata describing the life-cycle and origin of data. It has
been widely studied within the database, distributed systems, and Web communities [
        <xref ref-type="bibr" rid="ref21 ref26">26, 21</xref>
        ].
Works as [
        <xref ref-type="bibr" rid="ref27 ref28 ref29">27, 28, 29</xref>
        ] defined types of data provenance on graph databases along with diferent
algorithms to compute them [
        <xref ref-type="bibr" rid="ref30 ref31 ref32">30, 31, 32</xref>
        ].
      </p>
      <p>
        In this work, we employ lineage; for relational databases, the lineage of an output tuple [
        <xref ref-type="bibr" rid="ref21 ref33">21, 33</xref>
        ]
is defined as the set of all and only the tuples in the input database that have a role in the
generation of that output. Similarly, the lineage of a BGP+ query can be defined as the set of all
Q1: SELECT DISTNCT ?p ?l ?price
HWERE {
?p labe ?l .
?p type product_ye1 .
?p price ?price .
?p produce rpoduce_1 .
      </p>
      <p>FILTER (?price &lt; 200) }</p>
      <p>Q1
“sour”
product_type_2</p>
      <p>100
“Chemical Y”</p>
      <p>“Better
Specialists”</p>
      <p>Q2: CONSTRUCT { ?p labe ?l .
?p type product_ye1 .
?p price ?price .
?p produce rpoduce_1 . }
HWERE { ?p labe ?l .
?p type product_ye1 .
?p price ?price .
?p produce rpoduce_1 .</p>
      <p>FILTER (?price &lt; 200) }</p>
      <p>Q2</p>
      <p>Q3: SELECT DISTINCT ?P ?PF1 ?PF2
WHERE { ?product drfs:labe ?labe .
?product a ?P .
?product sbm:productFaetur ?PF1 .
?product sbm:productFaetur ?PF2 .
?product sbm:productPeyNmueric1 ?value1 .</p>
      <p>FILTER (?value1 &gt; 300)
FILTER (?PF1 != ?PF2)
} ORDER by ?labe
LIMIT 1000</p>
      <p>Q3
and only the triples in the input RDF graph that match the query pattern and thus have a role
in computing the output. For example, the lineage of query Q1 is the set of triples highlighted
with the dotted line in Figure 2.</p>
      <p>In this paper we use construct queries as a means to compute lineage since it is a simple
and efective method that fits our use cases. Another way to calculate the lineage, more eficient
but requiring to change the DBMS internals, is to get the query planner to return the pattern
matching along with the result set of the query.</p>
      <p>Certainly! Here’s a revised version:</p>
      <p>
        It’s worth highlighting that the current framework is versatile and compatible with advanced
data provenance calculation strategies, such as how-provenance based on spm-semirings [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]),
which have recently demonstrated eficiency even in RDF through streamlined query rewriting
[
        <xref ref-type="bibr" rid="ref35">35</xref>
        ]. Nonetheless, lineage remains one of the most prevalent, intuitive, and computationally
eficient forms of data provenance. It’s particularly suited to assess the efectiveness of our
proposal in query answering scenarios.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. System Architecture</title>
      <p>The main components of the proposed Cache-Based Architecture (CBA) are reported in Figure
3. We can see that when a query is submitted to the system, a series of operations are carried
out online (red arrows) and ofline (green arrows). Given a user query, the system first issues it
against the cache (step 1), which contains a subgraph of the database. If a cache is hit and an
answer can be built, the result is immediately returned to the user. In case of a cache miss, the
PROVENANCE COMPUTATION</p>
      <p>2
1
QUERY</p>
      <p>5
ANSWER</p>
      <p>IMPACT
MANAGEMENT</p>
      <p>SYSTEM</p>
      <p>PROCVAECNHAENCE
CACHE
EMPTYRESULT
SETSCACHE</p>
      <p>RDF DB</p>
      <p>STRIKES DB
4
UPDATE</p>
      <p>IMPACT
ANNOTATION
3
query is submitted to the main database to get the result.</p>
      <p>Another supporting cache called Empty Result Sets Cache (ERSC) is employed to keep track
of the queries returning the empty set. ERSC is a list of hashed (SHA-256 hashing algorithm)
queries stored as strings in a side relational table. The ERSC is particularly useful for databases
like Wikidata, where robotic agents sequentially submit identical queries. When these queries
present an empty result set, such a support cache reduces the answer times drastically. ERSC is
highly eficient, requiring minimal disk space since it stores only strings.</p>
      <p>In a convenient moment and thus ofline, the system calculates the query provenance on the
RDF graph (step 2). The query’s data provenance provides the complete set of triples involved
in generating the output data. In step 3, the triples in the provenance set are used to update a
relational database – i.e., the “strike database” – which keeps track of how many times a triple
has been used by the workload so far – i.e., the triple number of strikes. The number of strikes
is the basis for calculating the impact of a triple within the RDF graph. The triples’ impact is
periodically used to update the cache (step 4), deciding which triple has to be added or removed.
The more frequently the system updates the cache, the faster it adapts to a dynamic workload.</p>
      <p>Another supporting cache we employ is the “provenance cache” (PC) containing the
provenance of the queries. It also becomes beneficial when identical queries are submitted many
times over. We maintain the PC size under control by employing a standard Least Recently
Used (LRU) algorithm. Note that PC is used ofline and can be maintained on disk; thus, it has
no impact on the system runtime and a low impact on the system’s requirements.</p>
      <p>For the supporting relational database, one main table contains the triples of the cache
together with the total number of strikes of each triple. This information is required to keep the
cache updated and limit its size. A second relational table tracks when the strikes were assigned
to implement the refresh algorithms described below.</p>
      <p>This architecture provides quick access and a fast computation for several queries. It can be
built on top of any database system by employing existing technologies with every DBMS. The
proposed architecture does not require modifying the support DBMS or implementing additional
indexes or view selection mechanisms. This architecture can also be realized with diferent
memory systems and in a distributed manner. It can be easily integrated with other indexing or
caching techniques further to improve the execution time of dynamic query workloads.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methods</title>
      <p>Provenance Computation The lineage of BGP+ queries can be computed by issuing a
construct query on the RDF graph using the same BGP of the corresponding select query.
In the case of queries including the optional operator, triples not originally present in the
graph may be used in the construct query. In this case, ask queries allow us to check if a
triple is present in the original database or not and to avoid including it in the lineage. We can
see that Q2 returns all the subgraphs that match the specified pattern present in the construct
and in the where clauses.</p>
      <p>Data Impact Let  be an RDF graph and  = {1, . . . , } a query workload, where
∀ ∈ [1, ], ℒ() is the lineage of . Then, the strike count of a triple  ∈ , is defined as
() = ∑︀=1 1ℒ()(). 1ℒ()() is an indicator function that returns 1 if the triple  is in the
lineage ℒ() and 0 otherwise. Hence, the total strike count of a triple  is the number of times
 has been used by the queries of the workload.</p>
      <p>Based on the strike count of a triple, we compute its impact as () ∈ R describing the
importance of a triple in a given context – i.e., the workload in our case. The more a workload
uses a triple, the higher its impact. We employ a log-based impact function such that, given
a triple  ∈  with strike count (), its impact is computed by the function  :  → R,
() = (1 + ()).</p>
      <p>This function guarantees that the triples that were never used have zero impact. The
perstrike increment gets less marked as the strike count increases, i.e., the importance of the triple
is already established. The idea is to favor the addition of newly used triples into the cache and
keep the growth of frequently used triples contained.</p>
      <p>Update phase In the update phase, the cache is updated by deciding which triples should
enter or leave it. We define a threshold-based update strategy based on the triple impact.</p>
      <p>Given an RDF graph , a triple  ∈  with impact () and a threshold value  ∈ R≥ 0, the
threshold-based update strategy is  :  → {0, 1}, such that:
 () =
{︃1 if () &gt;</p>
      <p>0 if () ≤</p>
      <p>When  () = 1, a triple is added (or maintained) in the cache, otherwise it is removed. If
 = 0, a triple needs to be used just once by the workload to be added in the cache at the next
update. The higher the  , the higher the triple impact required to enter the cache.</p>
      <p>The system runs the update strategy every time an epoch is elapsed. An epoch is the amount
of time or the number of queries after which the system updates the cache, and it is a system
parameter. As a rule of thumb, if a database has a high workload, an epoch could be based on
time (e.g., one hour or one day); whereas, if the workload is low, an epoch could be based on
the number of issued queries to allow the system to gather enough strikes to make the update
of the cache worth the efort.</p>
      <p>To contain the cache’s size and avoid it growing too much, we set an upper bound to the
cache size. We define the size as the number of triples stored in the cache, specified by the
parameter  . With this strategy, the cache grows if its overall size is less than or equal to  .
When the cache is full, and we want to insert new triples, we employ a LRU-based strategy,
where the least recently used triples are removed from the cache to make space for the new ones.
In particular, when the cache size needs to be reduced, the system finds the least recently used
set of triples belonging to the same lineage and removes them from the cache. This operation is
iterated until the cache size gets below  . The strike count of the released triples is also set to
zero, so to avoid that, they immediately get back in the cache at the next update phase.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Evaluation</title>
      <p>All the experiments were run on a Linux Machine with architecture Intel x86_64. The Slurm2
workload manager was used to run the experiments by assigning to each job one processor
and 8GB of RAM. We tested the proposed architecture on two datasets both largely used and
reference points for the community: the Berlin SPARQL BenchMark (BSBM)3 synthetic database
and Wikidata4. We used a PostgreSQL (v14.1) relational database as a support database. The
source code is written in Java, using the rdf4j5 (for BSBM) and the Virtuoso6 and the Virtuoso
Jena RDF Data Provider7 (for Wikidata) as API to manage the RDF datasets and run the SPARQL
queries. In the case of rdf4j, for each experiment, in each database and its corresponding cache
we used a default spoc index. In the case of Virtuoso, the default settings were kept.</p>
      <p>Given the controlled environment where we conduct the experimental evaluation, we can
define a very eficient system – i.e. the Hash-Based Architecture ( HBA) – to determine a
reasonable lower bound that can be obtained with a caching system. Indeed, the experiments
are carried out with a controlled number of synthetic queries (in the case of BSBM) or robotic
ones (in the case of Wikidata).</p>
      <p>HBA is based on a hash table containing the hashed queries and their result set. When a
query is submitted to HBA, it is hashed and saved in the support database along with its entire
result set. To control the size of the table, we put an upper limit to the total number of entries
that can be stored, i.e., to the sum of the cardinalities of the result sets stored in the table. We
implemented a LRU strategy to refresh the table when it becomes too big.</p>
      <p>BSBM is a synthetic database benchmark simulating commercial products and related features.
We instantiated the database in diferent sizes 8. The queries for BSBM are generated from the
query templates proposed in the “Explore Use Case”9, – i.e., a set of query templates that
simulates a consumer looking for products and their features and that are meant to challenge
the DBMS. There are twelve query templates available, covering diferent query patterns. We
used the eight templates that can be answered with BGP+ queries (the sole modification regards
2https://slurm.schedmd.com/
3http://wbsg.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/BenchmarkRules/index.html#
datagenerator
4https://www.wikidata.org/wiki/Wikidata:Main_Page
5https://rdf4j.org/
6In its Virtuoso Open-Source (VOS) edition http://vos.openlinksw.com/owiki/wiki/VOS.
7http://vos.openlinksw.com/owiki/wiki/VOS/VirtJenaProvider.
8All the parameters used to build these instances are described at http://wbsg.informatik.uni-mannheim.de/bizer/
berlinsparqlbenchmark/spec/Dataset/index.html#datarules
9http://wbsg.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/ExploreUseCase/index.html
query pattern three where we removed the negation clauses), each representing a diferent
information need. The templates difer in the structure of their query pattern, i.e., the number of
triples they contain, the type of filter operators used, and the selection of variables required.
These queries can be considered “hard” since they are thought to stress the query engine.</p>
      <p>We created a set of queries for each template by parameterizing the template and then binding
the query variables to valid values in the database. For example, to create the set of queries for
the query template 1, we used query Q3 reported in Figure 1 to obtain valid values to bind the
query variables ?P, ?PF1, ?PF2, and then create a set of queries produced from query template
1. This process was followed for each query template, creating eight sets with cardinality 1K.</p>
      <p>After creating these sets of queries, we interrogated the database following a mixed scenario,
where we issued 400 queries randomly sampled from all the templates. The queries were
sampled by employing a uniform distribution to decide on the query template, and then we
randomly sampled the query from the corresponding set of queries with a normal distribution.
The parameters of the distribution (mean  and standard deviation  ) were set to  = ||/2
and  = ||/ , where || is the number of diferent queries we created for the considered
class. The parameter  is used to change the standard deviation. A big  gets a small standard
deviation resulting in a distribution concentrated around the mean, thus resulting in some
queries repeated more frequently. Each query was repeated ten times and after excluding the
minimum and the maximum, we calculated the average execution time.</p>
      <p>We set a query timeout of five minutes for the select and the corresponding construct
queries. The same value was also used as time limit for the update of the cache. We never met
the timeout with the sole exception of query templates 5 and 7 for BSBM100M. From this reason,
and only in that case, we ignored those query classes when sampling the queries.</p>
      <p>
        Wikidata is a free and open knowledge base that acts as central storage for the structured
data of its sister projects in the Wikimedia foundation [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]. It is one of the largest and most
prominent collections of linked open data on the web and it is one of the most successful projects
of the Wikimedia Foundation in terms of contributors [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Its content is available under a free
license and it can be interlinked to other open data sets on the linked data web.
      </p>
      <p>For the experiments, we used a 2017 version of Wikidata10 with more than 4.7 billion triples.
We employed the query logs from the University of Dresden International Center for Computer
Logic. We used the Interval 2 query log11. From this query log we extracted the BGP+ queries
generated by bots (aka the “robotic” queries). We chose these queries because they can be easily
converted to their construct version to get the provenance and present several repetitions
allowing us to see the efect of CBA against the HBA lower bound baseline. For our experiments,
we get the first million queries of the query log, and then we extracted the first 50K subsequent
suitable BGP+ queries.</p>
      <p>Table 1 reports the average execution times obtained on BSBM using the two considered
systems: HBA and CBA. As parameters we used  = 20,  = 0 (a triple gets in the cache after
it has been employed once),  = 10 (an epoch composed of 10 queries).</p>
      <p>We can see that CBA speeds up the execution time for every database instance compared
10In particular, the 2017-11-20 version, retrieved at the following URL: https://archive.org/details/
wikibase-wikidatawiki-20171120.
11https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en
to the whole database. The larger the database, the higher the improvement. In particular, we
could reduce the average execution time in half for BSBM250K, or by one-third, with BSBM1M,
BSBM25M, and BSBM100M. Diferently, from HBA, CBA can incur a cache hit for queries never
seen before. This happens when the triples in the cache overlap, even partially, with the graph
pattern of the new queries. In this sense, CBA can answer queries to which HBA cannot, and
this brings an improvement in the average execution times that we can see for large databases
where the execution time of a single query impacts more on the overall result.</p>
      <p>We note that the time required to compute the provenance of the queries through the
construct version and reported in the table (prov. time) may become quite big and a bottleneck
for the system, even though it is carried out ofline. The use of PC (see prov.time w/PC in the
table) reduces significantly, on average, the time to get the lineage of a query processed before.</p>
      <p>The ofline time required to keep the cache and the supporting relational database updated is
stable for all the sizes of the database (under 1 second). This is because the amount of data that
needs to be managed (the lineage of the queries) is the same across the database sizes.</p>
      <p>Table 2 reports the results on BSBM100M using diferent values for the maximum dimension
of the cache ( = {500, 1, 2, ∞}). For CBA,  is the number of triples that are kept in
the cache graph, whereas for HBA, it is the sum of the cardinalities of the result sets being
cached. As it can be seen, the smaller  , the smaller the percentage of cache hits over the set of
400 considered queries. Intuitively, there are fewer available triples and thus fewer cache hits,
resulting in a worse performance with a smaller cache.</p>
      <p>Table 2 also reports the first time hits , i.e., the ratio of cache hits obtained over queries seen
for the first time on the total number of cache hits. The completeness of these queries cannot,
in principle, be guaranteed. For this reason, we also reported the first-time completeness . The
completeness of a query is the percentage of triples in the whole result set that are also in the
cache result set; i.e., a query with 100% completeness means that the cache answers by returning
all the triples that would have been returned by evaluating the query on the whole database.
Hence, the first time completeness is the average percentage of completeness of the result set
obtained from queries that are seen for the first time; of course, for HBA, this completeness is
always zero. Whereas, for CBA, some queries can be answered even though they have never
been seen before; the average first-time completeness is up to 68%, showing that even though
a query is seen for the first time, CBA is still returning a partially complete answer to the
query if there is a hit. On the other hand, the overall completeness of the CBA answers is
relatively high (92%) even with a cache cap of 2 triples. This shows that CBA is a viable
method to return quick answers, providing a reasonable approximation of the result set. The
best results for CBA are obtained with  = ∞ reaching average completeness of 94% and
the first-time completeness above 70% and an execution time smaller than that HBA achieves.
Notice that  = ∞ for BSBM100M leads to a cache with a reasonable maximum size of 3.6
triples (0.0036% of the DB size). Intuitively, including more data in the cache also increases the
completeness in the answers to these peculiar types of queries since more triples are available
to answer these queries potentially.</p>
      <p>The dimension  of the cache can be modified at runtime to obtain a higher number of cache
hits. A bigger  can be set to increase the cache hits, or to increase the completeness of the
answers to queries seen for the first time.</p>
      <p>Table 3 reports the results obtained on Wikidata. We used the first 50K BGP queries extracted
from the considered query log, and we divided them in five “windows” (from  1 to  5) of
10K queries each. The maximum size of the CBA cache was kept to 1M triples, and that of HBA
to 1M tuples. Notice that neither of the caches reached full occupation.</p>
      <p>We notice that in windows  1, 2, 3, using the cache improves the average execution time and
that the number of hits is between 20% and 48%. In the case of  4 and  5, the performances
of CBA are lower due to a lower hit ratio (especially for  4) and a low average execution
time (lower than 100 ms) due to the simplicity of many queries in this time window. When the
queries are fast, the cache misses are particularly detrimental to the average result of CBA.</p>
      <p>HBA is generally faster than CBA (on average) because the percentage of repeated queries is
quite high, resulting in swift look-ups in the hash table. However, CBA can often answer queries
never seen before, with first-time completeness ranging from 30% up to 100% for window
 3. This result is particularly significant, showing that CBA can answer new queries with
high completeness on average. Indeed, the completeness of all the queries is 97% in the worst
case ( 4) and between 98% and 100% for the other windows. This shows how CBA returns
complete answers to the user in most cases.</p>
      <p>This means that (i) the eficiency of CBA increases together with the percentage of cache
hits, producing results that are close to the best-case scenario represented by HBA, and (ii) CBA
can answer queries (even those never seen before) with a high level of completeness.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Related Work</title>
      <p>
        Martin et al. [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] analyzes cached graph patterns to decide when to update the cache. Their
strategy stems from the observation that only relatively small parts of a knowledge base change
within a short period in typical scenarios. This approach, however, does not recognize when
the same query is issued twice with minor deviations such as pattern reordering and variable
renaming (isomorphism problem), and may store duplicate results. The proposed solution does
not store the query results but a selection of previously used triples.
      </p>
      <p>
        Yang and Wu [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] present a more sophisticated approach that caches the intermediate result
of BGP SPARQL queries, i.e. Algebra Expression Trees (AETs), to speed up the subsequent
operations. Our method difers since it does not cache whole results or AETs, but relevant parts
of the database, thus avoiding redundancy in the data.
      </p>
      <p>
        Lorey and Nauman [
        <xref ref-type="bibr" rid="ref38 ref39">38, 39</xref>
        ] note that machine agents typically issue queries that present
similar structures and only difer in small triple pattern parts. They propose approaches to
pre-fetch data by issuing never-seen-before queries. One of the limits of this greedy technique
is that it cannot guarantee to find all candidate subgraph matches for a SPARQL query. Chun et
al. [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] propose a proactive policy for maintaining a local cache to alleviate the expensive job
of copying the data at idle time. While many works based on caching (see [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]) are
contentblind, i.e. unaware of the content of cached results, Shu et al. [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ] proposed instead to reuse
cached queries in a content-aware fashion. The approach requires SPARQL query containment
checking, which, for general SPARQL queries, is undecidable. The authors, therefore, focus
on a subset of SPARQL, namely conjunctive queries with simple filter conditions (CQSF). Our
method does not try to pre-fetch new data, but focuses only on data lineage and data with
enough “impact”. Using provenance, we do not deal with the query containment problem [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        Papailiou et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] cache the SPARQL queries’ results in real-time by indexing the query
graph patterns based on their topology. Among the limitations of this work, there is the lack of
a cache replacement policy, significant storage overhead for caching the SPARQL results, lack
of generalization to broader query workloads, and the complexity to put the method into use in
a real-world environment.On the other hand, our technique does not require any change in the
structure of an already established DBMS.
      </p>
      <p>
        Madkour et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed WORQ, a workload-adaptive system that does not cache the
ifnal results, but sets of intermediate results, called reductions. These are then reused across
multiple queries that share join patterns. Our work diferentiates from WORQ because we
“annotate” data with their impact rather than storing them.
      </p>
      <p>The solution in this paper difers from the others since it can be seen as a layer that can be
put on top of already existing systems. It is technology-independent, it relies on functionalities
of the available DBMSs, without needing its own specific system.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>This paper presents CBA, a cache-based architecture to reduce the answer time of SPARQL
queries on RDF graphs. CBA has the advantage of being easy to implement and deploy on top of
existing DBMS. It can be easily integrated with other caching and indexing systems to improve
the execution time of dynamic query workloads. The triples of the RDF graph are annotated
with a strike count incremented by one each time a triple appears in the lineage of a query. We
calculate the impact of a triple by using a logarithmic function on the triple strike count. It
guarantees a relatively fast warm-up and then a slow-growing if the triple is used over time.</p>
      <p>CBA employs a cache to store the triples with an impact above a given threshold. The cache
size is controlled by employing an LRU strategy to remove the least used triples and replace
them with fresh more used ones. The overall size of the cache is a customizable parameter of
the system. We tested CBA on a largely used synthetic database (BSBM) and on Wikidata by
comparing CBA with a hash-based architecture, representing the best possible system employed
in a controlled environment. CBA reported average answer time improvements up to one order
of magnitude concerning querying the full RDF graph.</p>
      <p>As with all cache-based solutions in the literature, let us note that CBA may return incomplete
answers. Nonetheless, CBA can be used to quickly obtain a first partial solution from a query
while the system works to obtain the rest of the answer working in the background. Nevertheless,
we show that CBA reaches high level of completeness in both the synthetic and the real use
case, with the unique advantage of producing rather complete answers also for the queries seen
for the first time.</p>
      <p>CBA is based on lineage and this approach can be a bottleneck when processing queries
matching big subgraphs. Other types of data provenance can be used, such as the why-provenance
and how-provenance. These can be the basis of new impact functions, allowing the system
to annotate the impact of a triple based on its participation in a query and the role that triple
played in the computation.</p>
      <p>The focus of CBA is on BGP SPARQL queries, the most common query types issued to SPARQL
endpoints. In future works, we will focus on other nonmonotonic query operators, such as set
diference, where the traditional definition of the lineage may not be the best suited for these
applications since it would reward triples that are not useful for the cache. For this reason, we
will investigate other types of provenance ranging from why-provenance to responsibility [42]
and the Shapley value [43] more suited for the task.
for semantic web applications, in: Web Information Systems Engineering - WISE 2013
14th International Conference, Nanjing, China, October 13-15, 2013, Proceedings, Part I,
volume 8180 of Lecture Notes in Computer Science, Springer, 2013, pp. 320–329. URL: https:
//doi.org/10.1007/978-3-642-41230-1_27. doi:10.1007/978-3-642-41230-1\_27.
[42] A. Meliou, W. Gatterbauer, K. F. Moore, D. Suciu, The complexity of causality and
responsibility for query answers and non-answers, Proc. VLDB Endow. 4 (2010) 34–45. URL:
http://www.vldb.org/pvldb/vol4/p34-meliou.pdf. doi:10.14778/1880172.1880176.
[43] D. Deutch, N. Frost, B. Kimelfeld, M. Monet, Computing the Shapley Value of Facts in
Query Answering, in: Proc. of the 2022 International Conference on Management of
Data, SIGMOD ’22, ACM Press, New York, NY, USA, 2022, pp. 1570—-1583. URL: https:
//doi.org/10.1145/3514221.3517912. doi:10.1145/3514221.3517912.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Knuth</surname>
          </string-name>
          ,
          <article-title>Evaluating query and storage strategies for RDF archives</article-title>
          ,
          <source>Semantic Web</source>
          <volume>10</volume>
          (
          <year>2019</year>
          )
          <fpage>247</fpage>
          -
          <lpage>291</lpage>
          . URL: https://doi.org/10.3233/SW-180309. doi:
          <volume>10</volume>
          .3233/SW-180309.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Papailiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsoumakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Karras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Koziris</surname>
          </string-name>
          ,
          <article-title>Graph-aware, workload-adaptive SPARQL query caching</article-title>
          ,
          <source>in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, ACM</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1777</fpage>
          -
          <lpage>1792</lpage>
          . URL: https://doi.org/10.1145/ 2723372.2723714. doi:
          <volume>10</volume>
          .1145/2723372.2723714.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wylot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hauswirth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cudré-Mauroux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sakr</surname>
          </string-name>
          ,
          <article-title>RDF data storage and query processing schemes: A survey</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>51</volume>
          (
          <year>2018</year>
          )
          <volume>84</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>84</lpage>
          :
          <fpage>36</fpage>
          . URL: https://doi.org/10.1145/ 3177850. doi:
          <volume>10</volume>
          .1145/3177850.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , A.
          <string-name>
            <surname>-C. Ngonga Ngomo</surname>
          </string-name>
          ,
          <article-title>A survey of RDF stores &amp; SPARQL engines for querying knowledge graphs</article-title>
          ,
          <source>VLDB J</source>
          .
          <volume>31</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          . URL: https: //doi.org/10.1007/s00778-021-00711-3. doi:
          <volume>10</volume>
          .1007/S00778-021-00711-3.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <article-title>Generations of knowledge graphs: The crazy ideas and the business impact</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>16</volume>
          (
          <year>2023</year>
          )
          <fpage>4130</fpage>
          -
          <lpage>4137</lpage>
          . URL: https://doi.org/10.14778/3611540.3611636. doi:
          <volume>10</volume>
          .14778/3611540.3611636.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lissandrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mottin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          , T. B.
          <string-name>
            <surname>Pedersen</surname>
          </string-name>
          ,
          <article-title>Knowledge graph exploration systems: are we lost?</article-title>
          ,
          <source>in: Proc. of the 12th Conference on Innovative Data Systems Research, CIDR</source>
          <year>2022</year>
          , volume
          <volume>22</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>10</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mruthyunjaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pezeshkpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hruschka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bhutani</surname>
          </string-name>
          ,
          <article-title>Rethinking language models as symbolic knowledge graphs</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2308</volume>
          .
          <fpage>13676</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <article-title>Knowledge graphs querying</article-title>
          ,
          <source>SIGMOD Rec</source>
          .
          <volume>52</volume>
          (
          <year>2023</year>
          )
          <fpage>18</fpage>
          -
          <lpage>29</lpage>
          . URL: https://doi. org/10.1145/3615952.3615956. doi:
          <volume>10</volume>
          .1145/3615952.3615956.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Goasdoué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Karanasos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leblay</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Manolescu</surname>
          </string-name>
          ,
          <article-title>View selection in semantic web databases</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>5</volume>
          (
          <year>2011</year>
          )
          <fpage>97</fpage>
          -
          <lpage>108</lpage>
          . URL: http://www.vldb.org/pvldb/vol5/ p097_francoisgoasdoue_vldb2012.pdf.
          <source>doi:10.14778/2078324</source>
          .2078326.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Madkour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Aly</surname>
          </string-name>
          , W. G.
          <article-title>Aref, WORQ: workload-driven RDF query processing</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2018 - 17th International Semantic Web Conference</source>
          , volume
          <volume>11136</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2018</year>
          , pp.
          <fpage>583</fpage>
          -
          <lpage>599</lpage>
          . URL: https://doi. org/10.1007/978-3-
          <fpage>030</fpage>
          -00671-6_
          <fpage>34</fpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -00671-6\_
          <fpage>34</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          , G. Wu,
          <article-title>Caching intermediate result of SPARQL queries</article-title>
          ,
          <source>in: Proceedings of the 20th International Conference on World Wide Web, WWW</source>
          <year>2011</year>
          , ACM,
          <year>2011</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>160</lpage>
          . URL: https://doi.org/10.1145/1963192.1963273. doi:
          <volume>10</volume>
          .1145/1963192.1963273.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. T. A.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kennedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Upadhyaya</surname>
          </string-name>
          ,
          <article-title>Similarity metrics for SQL query clustering</article-title>
          ,
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>30</volume>
          (
          <year>2018</year>
          )
          <fpage>2408</fpage>
          -
          <lpage>2420</lpage>
          . URL: https://doi.org/10.1109/TKDE.
          <year>2018</year>
          .
          <volume>2831214</volume>
          . doi:
          <volume>10</volume>
          .1109/TKDE.
          <year>2018</year>
          .
          <volume>2831214</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pichler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Skritek</surname>
          </string-name>
          ,
          <article-title>Containment and equivalence of well-designed SPARQL</article-title>
          ,
          <source>in: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'14</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          ,
          <year>2014</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>50</lpage>
          . URL: https://doi.org/10.1145/2594538.2594542. doi:
          <volume>10</volume>
          .1145/2594538.2594542.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Chekol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Euzenat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Genevès</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <article-title>Layaïda, SPARQL query containment under schema</article-title>
          ,
          <source>J. Data Semant</source>
          .
          <volume>7</volume>
          (
          <year>2018</year>
          )
          <fpage>133</fpage>
          -
          <lpage>154</lpage>
          . URL: https://doi.org/10.1007/s13740-018-0087-1. doi:
          <volume>10</volume>
          .1007/s13740-018-0087-1.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mailis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kotidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nikolopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kharlamov</surname>
          </string-name>
          , I. Horrocks,
          <string-name>
            <given-names>Y. E.</given-names>
            <surname>Ioannidis</surname>
          </string-name>
          ,
          <article-title>An eficient index for RDF query containment</article-title>
          ,
          <source>in: Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference</source>
          <year>2019</year>
          , ACM,
          <year>2019</year>
          , pp.
          <fpage>1499</fpage>
          -
          <lpage>1516</lpage>
          . URL: https://doi.org/10.1145/3299869.3319864. doi:
          <volume>10</volume>
          .1145/3299869.3319864.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bonifati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Martens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Timm</surname>
          </string-name>
          ,
          <article-title>An analytical study of large SPARQL query logs</article-title>
          ,
          <source>VLDB J</source>
          .
          <volume>29</volume>
          (
          <year>2020</year>
          )
          <fpage>655</fpage>
          -
          <lpage>679</lpage>
          . URL: https://doi.org/10.1007/s00778-019-00558-9. doi:
          <volume>10</volume>
          .1007/ s00778-019-00558-9.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Z.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shemshadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. L.</given-names>
            <surname>Taylor</surname>
          </string-name>
          , SECF:
          <article-title>improving SPARQL querying performance with proactive fetching and caching</article-title>
          ,
          <source>in: Proceedings of the 31st Annual ACM Symposium on Applied Computing, ACM</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>362</fpage>
          -
          <lpage>367</lpage>
          . URL: https://doi.org/10.1145/2851613.2851846. doi:
          <volume>10</volume>
          .1145/2851613.2851846.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , H. Yang,
          <article-title>Characterizing robotic and organic query in sparql search sessions, in: Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM</article-title>
          ) Joint
          <source>International Conference on Web and Big Data</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>270</fpage>
          -
          <lpage>285</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Malyshev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonsior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bielefeldt</surname>
          </string-name>
          ,
          <article-title>Getting the Most Out of Wikidata: Semantic Technology Usage in Wikipedia's Knowledge Graph</article-title>
          ,
          <source>in: Proc. of ISWC 2018 - 17th International Semantic Web Conference</source>
          , volume
          <volume>11137</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2018</year>
          , pp.
          <fpage>376</fpage>
          -
          <lpage>394</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -00668-6\_
          <fpage>23</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bonifati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Martens</surname>
          </string-name>
          , T. Timm,
          <article-title>Navigating the Maze of Wikidata Query Logs</article-title>
          ,
          <source>in: Proc of the World Wide Web Conference, WWW</source>
          <year>2019</year>
          , ACM Press,
          <year>2019</year>
          , pp.
          <fpage>127</fpage>
          -
          <lpage>138</lpage>
          . doi:
          <volume>10</volume>
          .1145/3308558.3313472.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiticariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tan</surname>
          </string-name>
          , Provenance in databases: Why, how, and where,
          <source>Foundations and Trends in Databases 1</source>
          (
          <year>2009</year>
          )
          <fpage>379</fpage>
          -
          <lpage>474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lissandrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mottin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Velegrakis</surname>
          </string-name>
          ,
          <article-title>Example-based search: a new frontier for exploratory search</article-title>
          ,
          <source>in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <string-name>
            <surname>SIGIR</surname>
          </string-name>
          <year>2019</year>
          , Paris, France,
          <source>July 21-25</source>
          ,
          <year>2019</year>
          , ACM,
          <year>2019</year>
          , pp.
          <fpage>1411</fpage>
          -
          <lpage>1412</lpage>
          . URL: https://doi.org/10.1145/3331184.3331387. doi:
          <volume>10</volume>
          .1145/3331184.3331387.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mottin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lissandrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Velegrakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          ,
          <article-title>New trends on exploratory methods for data analytics</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>10</volume>
          (
          <year>2017</year>
          )
          <fpage>1977</fpage>
          -
          <lpage>1980</lpage>
          . URL: http://www.vldb.org/ pvldb/vol10/p1977-mottin.pdf.
          <source>doi:10.14778/3137765</source>
          .3137824.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arenas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          , Semantics and Complexity of SPARQL,
          <source>ACM Trans. Database Syst</source>
          .
          <volume>34</volume>
          (
          <year>2009</year>
          )
          <fpage>1</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Szárnyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Conrads</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A. C.</given-names>
            <surname>Bukhari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mehmood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <article-title>How representative is a SPARQL benchmark? an analysis of RDF triplestore benchmarks</article-title>
          ,
          <source>in: The World Wide Web Conference, WWW</source>
          <year>2019</year>
          , ACM,
          <year>2019</year>
          , pp.
          <fpage>1623</fpage>
          -
          <lpage>1633</lpage>
          . URL: https://doi.org/10.1145/3308558.3313556. doi:
          <volume>10</volume>
          .1145/3308558.3313556.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreau</surname>
          </string-name>
          ,
          <article-title>The foundations for provenance on the web</article-title>
          ,
          <source>Found. Trends Web Sci. 2</source>
          (
          <year>2010</year>
          )
          <fpage>99</fpage>
          -
          <lpage>241</lpage>
          . URL: https://doi.org/10.1561/1800000010. doi:
          <volume>10</volume>
          .1561/1800000010.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ramusat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maniu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Senellart</surname>
          </string-name>
          ,
          <article-title>Semiring provenance over graph databases</article-title>
          , in: M.
          <string-name>
            <surname>Herschel</surname>
          </string-name>
          (Ed.),
          <source>10th USENIX Workshop on the Theory and Practice of Provenance</source>
          , TaPP, USENIX Association,
          <year>2018</year>
          . URL: https://www.usenix.org/conference/tapp2018/ presentation/ramusat.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>G.</given-names>
            <surname>Karvounarakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fundulaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christophides</surname>
          </string-name>
          ,
          <article-title>Provenance for linked data, in: In Search of Elegance in the Theory and Practice of Computation - Essays Dedicated to Peter Buneman</article-title>
          , volume
          <volume>8000</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2013</year>
          , pp.
          <fpage>366</fpage>
          -
          <lpage>381</lpage>
          . URL: https://doi.org/10.1007/978-3-
          <fpage>642</fpage>
          -41660-6_
          <fpage>19</fpage>
          . doi:
          <volume>10</volume>
          .1007/ 978-3-
          <fpage>642</fpage>
          -41660-6\_
          <fpage>19</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Damásio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Analyti</surname>
          </string-name>
          , G. Antoniou,
          <article-title>Provenance for sparql queries</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2012</year>
          , pp.
          <fpage>625</fpage>
          -
          <lpage>640</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>G.</given-names>
            <surname>Flouris</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Fundulaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pediaditis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Theoharis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christophides</surname>
          </string-name>
          ,
          <article-title>Coloring rdf triples to capture provenance</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2009</year>
          , pp.
          <fpage>196</fpage>
          -
          <lpage>212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wylot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cudre-Mauroux</surname>
          </string-name>
          , P. Groth,
          <article-title>TripleProv: Eficient processing of lineage queries in a native RDF store</article-title>
          ,
          <source>in: Proceedings of the 23rd international conference on World wide web</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>455</fpage>
          -
          <lpage>466</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>F.</given-names>
            <surname>Geerts</surname>
          </string-name>
          , U. T, G. Karvounarakis,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fundulaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christophides</surname>
          </string-name>
          ,
          <article-title>Algebraic structures for capturing the provenance of SPARQL queries</article-title>
          ,
          <source>J. ACM</source>
          <volume>63</volume>
          (
          <year>2016</year>
          ) 7:
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          :
          <fpage>63</fpage>
          . URL: https: //doi.org/10.1145/2810037. doi:
          <volume>10</volume>
          .1145/2810037.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Wiener</surname>
          </string-name>
          ,
          <article-title>Tracing the lineage of view data in a warehousing environment</article-title>
          ,
          <source>ACM Trans. Database Syst</source>
          .
          <volume>25</volume>
          (
          <year>2000</year>
          )
          <fpage>179</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karvounarakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          ,
          <article-title>Provenance semirings, in: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems</article-title>
          , ACM,
          <year>2007</year>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Galárraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          ,
          <article-title>Computing how-provenance for SPARQL queries via query rewriting</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>3389</fpage>
          -
          <lpage>3401</lpage>
          . URL: http://www.vldb.org/ pvldb/vol14/p3389-galarraga.pdf.
          <source>doi:10.14778/3484224</source>
          .3484235.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandecic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          . URL: https://doi.org/10.1145/2629489. doi:
          <volume>10</volume>
          .1145/2629489.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Unbehauen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Improving the performance of semantic web applications with SPARQL query caching</article-title>
          ,
          <source>in: The Semantic Web: Research and Applications, 7th Extended Semantic Web Conference, ESWC</source>
          <year>2010</year>
          , volume
          <volume>6089</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2010</year>
          , pp.
          <fpage>304</fpage>
          -
          <lpage>318</lpage>
          . URL: https://doi.org/10.1007/978-3-
          <fpage>642</fpage>
          -13489-0_
          <fpage>21</fpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>642</fpage>
          -13489-0\_
          <fpage>21</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lorey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <article-title>Caching and prefetching strategies for SPARQL queries</article-title>
          , in: The Semantic Web:
          <article-title>ESWC 2013 Satellite Events</article-title>
          , volume
          <volume>7955</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2013</year>
          , pp.
          <fpage>46</fpage>
          -
          <lpage>65</lpage>
          . URL: https://doi.org/10.1007/978-3-
          <fpage>642</fpage>
          -41242-
          <issue>4</issue>
          _5. doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>642</fpage>
          -41242-4\_5.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lorey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <article-title>Detecting SPARQL query templates for data prefetching</article-title>
          ,
          <source>in: The Semantic Web: Semantics and Big Data</source>
          , 10th International Conference, ESWC 2013, Montpellier, France, May
          <volume>26</volume>
          -30,
          <year>2013</year>
          . Proceedings, volume
          <volume>7882</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2013</year>
          , pp.
          <fpage>124</fpage>
          -
          <lpage>139</lpage>
          . URL: https://doi.org/10.1007/978-3-
          <fpage>642</fpage>
          -38288-
          <issue>8</issue>
          _9. doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>642</fpage>
          -38288-8\_9.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Proactive policy for eficiently updating join views on continuous queries over data streams and linked data</article-title>
          ,
          <source>IEEE Access 7</source>
          (
          <year>2019</year>
          )
          <fpage>86226</fpage>
          -
          <lpage>86241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Compton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>Towards content-aware SPARQL query caching</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>