<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>March</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>How Caching Improves Efficiency and Result Completeness for Querying Linked Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olaf Hartig</string-name>
          <email>hartig@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Humboldt-Universität zu Berlin Unter den Linden 6 10099 Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <volume>29</volume>
      <issue>2011</issue>
      <abstract>
        <p>Link traversal based query execution is a novel query approach which enables applications that exploit the Web of Data to its full potential. This approach makes use of the characteristics of Linked Data: During query execution it traverses data links to discover data that may contribute to query results. Once retrieved from the Web, the data can be cached and reused for subsequent queries. We expect such a reuse to be beneficial for two reasons: First, it may improve query performance because it reduces the need to retrieve data multiple times; second, it may provide for additional query results, calculated based on cached data that would not be discoverable by a link traversal based execution alone. However, no systematic analysis exist that justifies the application of caching strategies based on these assumptions. In this paper we evaluate the potential of caching to improve efficiency and result completeness in link traversal based query execution systems. We conceptually analyze the potential benefit of keeping and reusing retrieved data. Furthermore, we verify the theoretical impact of caching by conducting a comprehensive experiment that is based on a real-world application scenario.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The emerging Web of Data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a large dataspace that connects
data from different providers who publish their data according to
the Linked Data principles [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This dataspace opens possibilities
not conceivable before: Applications are not restricted to data from
a predefined set of sources; instead, new sources can be
discovered and integrated at run-time. To enable applications to exploit
the Web of Data to its full potential we propose link traversal based
query execution, a novel approach to execute SPARQL queries over
the Web of Data [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This approach makes use of the
characteristics of Linked Data. The general idea is to intertwine query pattern
matching with the traversal of data links in order to discover data
that might be relevant to answer the executed query. We illustrate
the application of this idea with an example:
      </p>
      <p>EXAMPLE 1. The Linked Data based application Foaf Letter1
allows users to observe and manage their FOAF2 based social
network. Foaf Letter uses a query engine that implements our link
traversal based approach to issue SPARQL queries similar to Q1
(cf. Listing 1). Q1 asks for projects of acquaintances of user Bob,
who is identified by URI http://bob.name. Link traversal based
query execution typically starts with an empty, query-local dataset.
1http://www.linkeddata-a-thon.com/index.php/FoafLetter
2http://www.foaf-project.org/</p>
      <p>We obtain some seed data for pattern matching by looking up URIs
in the query: for the URI http://bob.name in Q1 we may retrieve
a set Gb of RDF triples (cf. Figure 1), which we add to the local
dataset. Now, we alternate between i) constructing intermediate
solutions from RDF triples that match a pattern of our query in
the query-local dataset and ii) augmenting the dataset by looking
up URIs which are part of these intermediate solutions. For the
triple pattern in line 2 of Q1 the local dataset contains a matching
triple, originating from Gb. Hence, we can construct an
intermediate solution 1 = f?p ! http://alice.nameg that maps query
variable ?p to the URI http://alice.name. By looking up this URI
we may retrieve a set Ga of RDF triples, which we also add to the
local dataset. Based on the augmented dataset we can construct
an intermediate solution 2 = f?p ! http://alice.name; ?pr !
http://.../AlicesPrjg for the pattern in line 3. Notice, constructing
2 is only possible because we discovered Ga based on 1. We
proceed with our execution strategy: We retrieve Gp by looking up
http://.../AlicesPrj and construct 3 = f?pr ! http://.../AlicesPrj;
?l ! "Alice’s Project"g. Finally, we merge (i.e. join) 1,
2 and 3 to create a solution for the whole query.
1 SELECT ? p ? l WHERE {
2 &lt; h t t p : / / bob . name&gt; f o a f : knows ? p .
3 ? p f o a f : c u r r e n t P r o j e c t ? p r .
4 ? p r r d f s : l a b e l ? l . }</p>
      <sec id="sec-1-1">
        <title>Listing 1: Sample query Q1</title>
        <p>Excerpt from RDF data Gb retrieved for http://bob.name :
&lt; h t t p : / / bob . name&gt; f o a f : knows &lt; h t t p : / / a l i c e . name&gt; .
Excerpt from RDF data Ga retrieved for http://alice.name :
&lt; h t t p : / / a l i c e . name&gt; f o a f : name " A l i c e " ;
f o a f : knows &lt; h t t p : / / bob . name&gt; ;
f o a f : c u r r e n t P r o j e c t &lt; h t t p : / / . . . / A l i c e s P r j &gt; ;
f o a f : t o p i c _ i n t e r e s t &lt; h t t p : / / . . . / T e n n i s &gt; .</p>
        <p>Excerpt from RDF data Gp retrieved for http://.../AlicesPrj :
&lt; h t t p : / / . . . / A l i c e s P r j &gt; r d f s : l a b e l " A l i c e ’ s P r o j e c t " .
Excerpt from RDF data Gt retrieved for http://.../Tennis :
&lt; h t t p : / / . . . / T e n n i s &gt; r d f s : l a b e l " T e n n i s " .</p>
        <p>As the example demonstrates, link traversal based query
execution proposes to evaluate a query over a dataset that is continuously
augmented with potentially relevant data from the Web. The
discovery of this data is driven by the URIs in intermediate solutions.
Thus, the idea of the link traversal based approach is not to
follow arbitrary links in the discovered data but only those links that
correspond to triple patterns in the executed query.</p>
        <p>Usually, not all the URIs looked up during query execution are in
the same namespace, controlled by a single data provider. Instead,
since the Linked Data principles require to provide links to data
from other sources it is likely that data from multiple Linked
Dataproviding sources is discovered and used to construct query results;
this may include sources the query engine did not even know they
exist before executing the query. Hence, link traversal based query
execution is fundamentally different from traditional query
execution paradigms which always assume knowledge of the existence
of data sources that may contribute to the query result. This
assumption presents a restriction that inhibits applications to tap the
full potential of the Web; link traversal based query execution, in
contrast, enables serendipitous discovery and utilization of relevant
data from unknown sources.</p>
        <p>However, due to the openness and the widely distributed
nature of the Web we cannot assume to find all data that is relevant
for a query. Hence, we should never expect results that are
complete w.r.t. the whole Web of Data. Nonetheless, users demand
approaches that obtain as many results as possible. One possibility
to improve completeness is to keep discovered and retrieved data
and reuse it as additional seed data for subsequent queries as the
following example illustrates:</p>
        <p>EXAMPLE 2. Query Q2 (cf. Listing 2) asks for the interests of
people who know Bob. The Foaf Letter application uses queries
similar to Q2 to propose possible acquaintances for users. As for
Q1 in Example 1, we start the execution of Q2 with an empty
querylocal dataset and we add Gb, discovered by looking up the URIs
in Q2. Unfortunately, Gb does not contain triples that match the
patterns in Q2 so that we cannot construct an intermediate solution.
Hence, it is impossible to return a result for this query. However,
instead of starting with an empty dataset, we could keep the data
from the previous execution of Q1 and, thus, start executing Q2
with a query-local dataset that already contains Gb, Ga, and Gp.
By using this data we can construct a query result with our link
traversal based approach: t = f?name ! "Alice"; ?p !
http://alice.name; ?i ! http://.../Tennis; ?l ! "Tennis"g.
1 SELECT ? name ? p ? i ? l WHERE {
2 ? p f o a f : knows &lt; h t t p : / / bob . name &gt; .
3 ? p f o a f : name ? name .
4 ? p f o a f : t o p i c _ i n t e r e s t ? i .
5 ? i r d f s : l a b e l ? l . }</p>
      </sec>
      <sec id="sec-1-2">
        <title>Listing 2: Sample query Q2</title>
        <p>As the example demonstrates, understanding the query-local
dataset as a cache that can be reused for subsequent queries may
benefit the result completeness for some of the queries. An additional
argument for such a reuse is the typical purpose of caching: A cache
may improve the efficiency of query executions. Such an
expectation is based on the assumption that reusing formerly retrieved data
may reduce the need for looking up URIs on the Web and, thus,
decrease the amount of delays caused by these look-ups during query
execution. While the advantage of using a cache for the link
traversal based approach seems to be obvious, it requires a systematic
investigation to understand how caching may improve result
completeness and query performance. In particular, the first benefit that
we expect, more complete result sets, must be researched because
the discovery of a specific set of RDF triples is not guaranteed with
our query approach, a characteristic which is untypical for most
caching scenarios where all objects missing from the cache, can be
obtained with some additional effort.</p>
        <p>In this paper we evaluate the potential of caching for link
traversal based query execution. This evaluation includes two parts: a
conceptual analysis and an application based experiment. Hence,
our main contributions are:</p>
        <p>We extend the original formalization of link traversal based
query execution to provide a theoretical foundation to reuse
the query-local dataset for the execution of multiple queries.
Based on the extended formalization we conceptually
analyze the impact of caching for link traversal based query
execution.</p>
        <p>We empirically verify the theoretical impact of caching by
conducting experiments that are based on a real world
application scenario.</p>
        <p>Notice, this paper does not focus on caching strategies such as
replacement or invalidation. Instead, we understand the presented
evaluation as a necessary first step to develop such strategies for
our link traversal based approach.</p>
        <p>The remainder of this paper is structured as follows. First, in
Section 2 we introduce the preliminaries for the work presented in
this paper, in particular, we present a formal definition of the idea of
link traversal based query execution. Section 3 defines the
integration of caching into our query approach and provides the
conceptual analysis of the impact of caching. In Section 4 we present our
experiment. Finally, Section 5 reviews related work and Section 6
concludes this paper.
2.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>DEFINITION OF LINK TRAVERSAL</title>
    </sec>
    <sec id="sec-3">
      <title>BASED QUERY EXECUTION</title>
      <p>As a preliminary to a conceptual analysis of the potential benefit
of caching for link traversal based query execution we formally
introduce our query approach in this section. For the formalization
we assume no changes are made to the data on the Web during the
execution of a query.</p>
      <p>
        Link traversal based query execution is a novel query approach
to exploit the Web of data to its full potential. Since adhering to the
Linked Data principles [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is the minimal requirement for
publication in the Web of Data, our query approach relies solely on these
principles. Hence, we do not require each data publisher to provide
a query interface (i.e. a SPARQL endpoint) for their dataset.
      </p>
      <p>In the Web of Data entities have to be identified via HTTP scheme
based URIs. Let U LD be the set of all these URIs. By looking up
such a URI we retrieve RDF data about the entity identified by the
URI. Conceptually, we understand such a look-up as a function that
refers to the resulting data: lookup is a surjective function which
returns for each URI u 2 U LD a descriptor object, that is, a set
of RDF triples f(s1; p1; o1); ::: (sn; pn; on)g which i) can be
retrieved by looking up u on the Web and which ii) describes the
entity identified by u. Hence, based on the Linked Data principles
we expect:</p>
      <p>8u 2 U LD : 9(s; p; o) 2 lookup(u) : s = u _ o = u
Note, lookup is not injective; the same descriptor object might be
retrieved by looking up distinct URIs. In this case, the descriptor
object describes multiple entities. For each t 2= U LD the look-up
function returns an empty descriptor object: lookup(t) = ?.</p>
      <p>EXAMPLE 3. The set Gb of RDF triples in Example 1 was
retrieved by looking up the URI http://bob.name which identifies Bob.
We understand Gb as a descriptor object that describes Bob. Hence,
it holds: lookup( http://bob.name ) = Gb.</p>
      <p>
        We define link traversal based query execution for basic graph
patterns3 (BGPs), the basic building block of SPARQL queries [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
A BGP is a set of triple patterns which are RDF triples that may
contain query variables at the subject, predicate, and object
position. For the sake of a more straightforward formalization we do
not permit blank nodes in triple patterns as is possible according
to the SPARQL specification. In practice, each blank node in a
SPARQL query can be replaced by a new variable. A matching
triple in a set of RDF triples G for a triple pattern tp = (s~; p~; o~) is
any RDF triple (s; p; o) 2 G with
      </p>
      <p>(s~ 2= V ) s~ = s) ^ (p~ 2= V ) p~ = p) ^ (o~ 2= V ) o~ = o)
where V denotes the set of query variables.</p>
      <p>To formally integrate the traversal of data links into the query
execution process we adjust the semantics of BGP queries. We define
the notion of solutions for the link traversal based query execution
of a BGP using a two-phase approach: First, we define the set of
all descriptor objects that can be discovered during the link
traversal based query execution of a BGP query. Then, we formalize
solutions as sets of variable bindings that correspond to a subset
of all data from all discovered descriptor objects. Notice, while
this two-phase approach provides for a straightforward definition
of solutions it does not correspond to the actual query execution
strategy of intertwining the traversal of data links (i.e. URI
lookup) and graph pattern matching as is characteristic for link traversal
based query execution.</p>
      <p>To formalize what descriptor objects can be discovered during
the link traversal based execution of a BGP we introduce the
concept of reachability by recursion:</p>
      <p>Definition 1. Let b = ftp1; ::: ; tpng be a BGP; let D be a
descriptor object. D is reachable by the execution of b iff either:
there exists a triple pattern tpi = (s~i; p~i; o~i) 2 b such that
lookup(s~i) = D _ lookup(p~i) = D _ lookup(o~i) = D; or
there exists another descriptor object D0 6= D, a triple
pattern tpj 2 b, and an RDF triple t = (s; p; o) such that i) D0
is reachable by the execution of b, ii) t is a matching triple
for tpj in D0, and iii) lookup(s) = D, lookup(p) = D, or
lookup(o) = D.</p>
      <p>EXAMPLE 4. The descriptor object Gb in Example 1 is
reachable by the execution of the BGP in the sample query Q1 (cf.
Listing 1) because Gb satisfies the first case in Definition 1: The first
triple pattern in the BGP contains the URI http://bob.name and
Gb = lookup( http://bob.name ). Ga = lookup( http://alice.name )
is also reachable; although, it satisfies the second case in
Definition 1 because it depends on the reachability of Gb and on the
matching triple therein. Similarly for Gp, which depends on Ga.</p>
      <p>
        To represent the solutions of BGPs we adopt the notion of a
solution mapping as defined in the SPARQL specification [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. These
mappings bind query variables to RDF terms. The application of a
solution mapping to a BGP b, denoted as [b], implies replacing
each variable in the triple patterns of b by the RDF term the variable
is bound to in ; unbound variables must not be replaced. Using
solution mappings we introduce our notion of solutions for a BGP:
      </p>
      <p>Definition 2. Let b be a BGP; let A be the set of all descriptor
objects reachable by the execution of b. A solution mapping is
3While we consider only BGPs in this paper, the solutions for
BGPs that might be determined using link traversal based query
execution, can be processed by the SPARQL algebra as usual.
a solution for b iff
holds4:
maps only these variables that are in b and it</p>
      <p>
        While the presented two-phase definition approach defines the
notion of solutions for BGPs in the context of link traversal based
query execution, it does not reflect the fundamental idea of
intertwining link traversal with the construction of solutions. Instead,
a query execution engine that would directly implement this
twophase approach would have to retrieve all reachable descriptor
objects before it could generate solutions for a BGP. Hence, the first
solutions could only be generated after all data links that
correspond to triple patterns in the BGP have been followed recursively.
Retrieving the complete set of reachable data can take a long time
and may exceed the resources of the execution engine. For this
reason, the link traversal based query execution approach requires to
construct the solutions incrementally, using a query-local dataset
that is continuously augmented with additional descriptor objects.
These descriptor objects are discovered by looking up URIs that
occur in intermediate solution, that are, solution mappings from
which the solutions are constructed as we illustrate in Example 1.
For a more detailed discussion of link traversal based query
execution, including a comparison with other approaches that execute
SPARQL queries over data from multiple sources on the Web, we
refer to [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
3.
      </p>
    </sec>
    <sec id="sec-4">
      <title>LINK TRAVERSAL BASED QUERY</title>
    </sec>
    <sec id="sec-5">
      <title>EXECUTION WITH A CACHE</title>
      <p>The introductory Example 2 illustrates how link traversal based
query execution may benefit from a data cache which keeps
retrieved data (i.e. descriptor objects) and enables to reuse this data
for subsequent queries. In this section we introduce a formalism to
describe and analyze the impact of such a cache; based on this
formalism, we provide an analysis of the potential benefit of caching.
Notice, since we aim to analyze the potential of caching in
general we assume the best case in this paper, that is, we assume an
infinitely large memory to keep the query-local dataset so that no
descriptor objects have to be replaced.
3.1</p>
    </sec>
    <sec id="sec-6">
      <title>Formal Foundation</title>
      <p>To conceptually enable the reuse of the query-local dataset for
multiple queries we have to adjust the formalism introduced in the
previous section. In particular, we have to extend our notion of
reachability by introducing the integration of an already populated
set of descriptor objects:</p>
      <p>Definition 3. (Extension of Definition 1) Let b = ftp1; ::: ; tpng
be a BGP; let D be a descriptor object; let Dseed be a set of
descriptor objects. D is reachable by the Dseed-initialized execution of b
iff either:</p>
      <p>D 2 Dseed;
there exists a triple pattern tpi = (s~i; p~i; o~i) 2 b such that
lookup(s~i) = D, lookup(p~i) = D, or lookup(o~i) = D; or
there exists another descriptor object D0 6= D, a triple
pattern tpj 2 b, and an RDF triple t = (s; p; o) such that i) D0
is reachable by the Dseed-initialized execution of b, ii) t is
4For the union of descriptor objects we assume that no two
descriptor objects share the same blank nodes. This requirement can
be guaranteed by using a unique set of blank nodes identifiers for
each descriptor object retrieved from the Web.
a matching triple for tpj in D0, and iii) lookup(s) = D,
lookup(p) = D, or lookup(o) = D.</p>
      <p>This definition extends Definition 1 such that the reachability of a
descriptor object depends on an initial content of the query-local
dataset with which the query execution starts.</p>
      <p>EXAMPLE 5. Example 2 illustrates the effect of the extended
notion of reachability: Neither Ga nor Gt are reachable by the
execution of the BGP in the sample query Q2, initialized with an
empty set of descriptor objects. However, both, Ga and Gt, are
indeed reachable by the DQ1-initialized execution of the BGP where
DQ1 = fGb; Ga; Gpg is the set of descriptor objects discovered
and retrieved during the execution of Q1 as outlined in Example 1.</p>
      <p>We note that our extended notion of reachability is monotone:
PROPOSITION 1. Let b be an arbitrary BGP; let D1 and D2 be
two sets of descriptor objects where D1 D2. Each descriptor
object that is reachable from the D1-initialized execution of b is
also reachable from the D2-initialized execution of b.</p>
      <p>The proof of this proposition is trivial: It follows from the first
case in Definition 3. From the proposition it follows also that each
descriptor object that is reachable from the execution of a BGP b
(according to Definition 1) is also reachable from any D-initialized
execution of b where D can be an arbitrary set of descriptor objects.</p>
      <p>Based on our extended understanding of reachability we also
extend our notion of solutions:</p>
      <p>Definition 4. (Extension of Definition 2) Let b be a BGP; let
Dseed be a set of descriptor objects; let A0 be the set of all descriptor
objects reachable by the Dseed-initialized execution of b. A solution
mapping is a Dseed-dependent solution for b iff maps only these
variables that are in b and it holds:</p>
      <p>EXAMPLE 6. The solution mapping t discovered during the
second query execution outlined in Example 2 was generated based
on matching triples in Ga and Gt; i.e. t[b2] Ga [ Gt where
b2 is the BGP in Q2. Since Ga and Gt are reachable by the
DQ1initialized execution of b2 (cf. Example 5) we note that t is a
DQ1dependent solution for b2.</p>
      <p>The extended notion of solutions introduced in Definition 4
always includes all solution mappings that are solutions w.r.t.
Definition 2, independent of the set of descriptor objects which a query
execution was initialized with:</p>
      <p>PROPOSITION 2. Let b be an arbitrary BGP and D be an
arbitrary set of descriptor objects. Each solution mapping that is a
solution for b is also a D-dependent solution for b.</p>
      <p>PROOF. Definition 2 assumes that each BGP is executed with an
initially empty query-local dataset. Hence, each solution for a BGP
is a Dempty-dependent solution for that BGP where Dempty = ? is
the empty set of descriptor objects. Due to this fact, Proposition 2
follows from Proposition 1.</p>
      <p>As outlined before, the general idea of applying a data cache for
the link traversal based execution of a sequence of queries is to
reuse the data discovered during the execution of previous queries
as seed data for subsequent queries. More precisely, we propose to
reuse reachable descriptor objects from previous queries.
Conceptually, we substitute the direct dependency of reachability on the
initial content of the query-local dataset by an indirect dependency
on the sequence of queries that have been executed before. Hence,
we define which descriptor objects are reachable by the execution
of a query sequence:</p>
      <p>Definition 5. Let B = [b1; ::: ; bn] be a sequence of non-empty
BGPs. The set of reachable descriptor objects, denoted as R(B),
for the serial execution of B is recursively defined as:
R(B) =
(?</p>
      <p>; if B = [ ] (i.e. if B is empty)</p>
      <p>R0 ; else
where R0 is the set of all descriptor objects that are reachable by
the R(B0)-initialized execution of bn, with B0 = [b1; ::: ; bn 1].</p>
      <p>EXAMPLE 7. Let BQ1 = [bQ1] be a BGP sequence that
contains only the BGP bQ1 from our sample query Q1; let BQ1,Q2 =
[bQ1; bQ2] be a sequence that also contains the BGP bQ2 from query
Q2. The set of reachable descriptor objects for these sequences are:
R(BQ1) = fGa; Gb; Gpg and R(BQ1,Q2) = fGa; Gb; Gp; Gtg.
Furthermore, we can understand the t discovered in Example 2 as
a R(BQ1)-dependent solution for bQ2.</p>
      <p>PROPOSITION 3. Let B be a sequence of non-empty BGPs. For
any arbitrary subsequence B0 of B it holds: R(B0) R(B).</p>
      <p>PROPOSITION 4. Let b be an arbitrary BGP and B be an
arbitrary sequence of non-empty BGPs. Each solution mapping that
is a solution for b is also a R(B)-dependent solution for b.
Proposition 3 can be shown by induction using Definition 5 and
Proposition 1; Proposition 4 follows from Definition 5 and
Proposition 2. Please notice that Proposition 4 is particularly important
because it shows that even if we reuse the query-local dataset for
multiple queries it is always possible to determine at least these
solutions for a BGP that satisfy our original Definition 2, independent
of the query sequence executed beforehand.</p>
      <p>The concepts defined in this section formally extend link
traversal based query execution with the possibility to reuse the
querylocal dataset that contains descriptor objects retrieved during
previous query executions. In particular, Definitions 4 and 5 provide
the formal foundation to understand the query-local dataset as a
data cache. In the following we make use of this formalization to
conceptually analyze the potential benefit of caching.
3.2</p>
    </sec>
    <sec id="sec-7">
      <title>The Potential Benefit of Caching</title>
      <p>A data cache for descriptor objects serves two purposes in the
context of link traversal based query execution: It may improve
query performance and it may improve the completeness of results.
In the remainder of this section we discuss these two purposes.</p>
      <p>The first purpose of the data cache is the typical purpose of a
cache: It may improve query performance by reducing the time
required to execute queries. Query execution time depends on
multiple factors such the number and size of descriptor objects that have
to be retrieved, the time to actually retrieve these descriptor objects,
and the time to find matching triples in the query-local dataset. To
analyze the potential of caching we measure query performance by
the number of descriptor objects that have to be retrieved from the
Web because this number is the factor primarily affected by the use
of a data cache.</p>
      <p>Without caching the number of descriptor objects that have to be
retrieved during the execution of a BGP bcur is always5 R ([bcur]) .
5[bcur] denotes a sequence of BGPs that contains only bcur
c (bcur; [b1; ::: ; bn]) =</p>
      <p>R [bcur] \ R [b1; ::: ; bn]
+ R [b1; ::: ; bn; bcur] n</p>
      <p>R [bcur] [ R [b1; ::: ; bn]
Using a data cache that was already used for the execution of a
sequence [b1; ::: ; bn] of BGPs, may reduce this number if some of
the D 2 R ([bcur]) have already been retrieved during the
execution of any of b1 to bn. The exact reduction factor is R ([bcur]) \
R ([b1; ::: ; bn]) . The basic prerequisite to such a reduction is a
temporal locality of descriptor objects: Descriptor objects must be
reachable by the execution of multiple BGPs that are executed in
a sequence. In this case a data cache eliminates the need to
reretrieve such descriptor objects and, thus, reduces query execution
times. Hence, caching can indeed improve query performance.
However, using a data cache may increase the number of
reachable descriptor objects from R ([bcur]) to R ([b1; ::: ; bn; bcur]) .
Hence, R ([b1; ::: ; bn; bcur]) n R ([bcur]) additional objects may
become reachable. While some of them might also already be in
the cache, others would have to be retrieved. The latter are
exactly all D 2 R ([b1; ::: ; bn; bcur]) n R ([bcur]) [ R ([b1; ::: ; bn]) .
Hence, the overall number of descriptor objects which have to be
retrieved during the execution of bcur using a data cache populated
by the execution of b1; ::: ; bn can be calculated with the cost
function in Figure 2.</p>
      <p>EXAMPLE 8. For the second execution of Q2 in Example 2 we
use our cost function to calculate c(bQ2; [bQ1]) where bQ2 is the BGP
of Q2 and bQ1 is the BGP of Q1 that was executed before.
c (bQ2; [bQ1]) = R [bQ2]</p>
      <p>R [bQ2] \ R [bQ1]
+</p>
      <p>R [bQ1; bQ2] n</p>
      <p>R [bQ2] [ R [bQ1]
= fGbg</p>
      <p>fGbg \ fGa; Gb; Gpg
+ fGa; Gb; Gp; Gtg n fGbg [ fGa; Gb; Gpg
Hence, during the execution of Q2 we have to retrieve one
additional descriptor object (Gt) if we reuse the local dataset with
which we executed Q1.</p>
      <p>We note that the application of a data cache improves query
performance if c (bcur; [b1; ::: ; bn]) &lt; R ([bcur]) , that is, if the second
factor in our cost function is greater than the third factor.
Otherwise, i.e. if c (bcur; [b1; ::: ; bn]) &gt; R ([bcur]) , caching has a
negative impact on query performance. While the latter case is
disadvantageous w.r.t. query performance, it presents the price we have
to pay to enable the second purpose of our data cache as we discuss
in the following.</p>
      <p>Example 2 illustrates that the application of a data cache may
improve result completeness. To measure such an improvement we
introduce a completeness criterion:</p>
      <p>Definition 6. Let b be a BGP; let all be the set of all
Dalldependent solutions for b, where Dall is the (potentially infinite) set
of all descriptor objects that can be retrieved from the Web.
Furthermore, let D be an arbitrary set of descriptor objects and the
set of all D-dependent solutions for b. is complete w.r.t. all data
on the Web iff = all. The degree of completeness of is:
completeness( ) =
(1
j j
j allj
; if j allj = 0
; else</p>
      <p>Hence, we understand a set of solutions for a BGP as
complete if it contains all solutions that could be constructed if we had
all descriptor objects that can be retrieved from the Web.
However, in practice it is impossible to decide whether a set of
solutions is complete because we cannot discover all descriptor objects.
Hence, it is also impossible to calculate the degree of completeness.
Nonetheless, we can compare different sets of solutions for a BGP
w.r.t. their degree of completeness by comparing their cardinality.
For a BGP bcur executed with a query-local dataset that was already
used for the execution of the BGP sequence B, we calculate this
cardinality by the following benefit function:
b (bcur; B) =
j</p>
      <p>is a R(B)-dependent solution for bcur
Since any set of solutions for a BGP bcur includes the solutions that
can be determined without caching (cf. Proposition 4) it always
holds b (bcur; B) b (bcur; [ ]), independent of B. The remaining
difference
s (bcur; B) = b (bcur; B)
b (bcur; [ ])
corresponds to these solutions that can only be determined using at
least one of the descriptor objects in R ([b1; ::: ; bn; bcur])nR ([bcur]).
Since these descriptor objects only become reachable with a data
cache and, hence, we discover the corresponding s (bcur; B)
solutions by serendipity, we call s (bcur; B) serendipity factor. For
query executions for which the serendipity factor is greater than 0
the application of a data cache improves result completeness.</p>
      <p>Table 1 summarizes the possible effect of caching on link
traversal based query execution. In this table we use the serendipity
factor as an indicator for the impact on result completeness and a
performance factor
p (bcur; B) = c (bcur; [ ])
c (bcur; B)
as an indicator for the impact query performance.
For the second execution, for which we reuse the query-local dataset
with which we executed Q1 before, we calculate:</p>
    </sec>
    <sec id="sec-8">
      <title>EXPERIMENTAL EVALUATION OF THE</title>
    </sec>
    <sec id="sec-9">
      <title>BENEFIT OF CACHING</title>
      <p>To verify the theoretical benefit of caching as discussed in the
previous section, we conducted comprehensive experiments that
are based on a real world application scenario. In this section we
present these experiments: We describe the setup and the
experiments and discuss the measured values.
4.1</p>
    </sec>
    <sec id="sec-10">
      <title>Setup</title>
      <p>To evaluate the impact of caching in a realistic scenario it is
necessary to use the query workload of a real world application. For
our experiments we selected the Foaf Letter application (cf.
Example 1) which offers three main functionalities: (F1) Presentation
of a user’s acquaintances and their main properties; (F2) Proposal
of additional types of properties that the user may want to
specify; and (F3) Proposal of other acquaintances not listed in a user’s
FOAF file.</p>
      <p>To provide these functionalities Foaf Letter issues the following
five types of SPARQL queries:
Contact Queries of this type ask for various properties of a user’s
acquaintances.</p>
      <p>UnsetProps Queries of this type ask for properties that are
specified for acquaintances of a user, but not for the user; restricted
to properties from the FOAF vocabulary.</p>
      <p>Incoming Queries of this type ask for “incoming contacts” of a
user; i.e. persons who know the user that are not listed to be
known by the user.
2ndDegree1 Queries of this type ask for persons who are known
by at least two distinct acquaintances of a user, but who are
not listed to be known by the user.
2ndDegree2 Queries of this type ask for persons who are
“incoming contacts” for at least two distinct acquaintances of a
user, but who are not listed to be known by the user.</p>
      <p>We selected Foaf Letter for the experiments because it is an
example of an application that largely benefits from our link traversal
based query execution approach. Answers to queries of the listed
types require data from multiple providers, published at different
locations on the Web. For instance, Contact queries can only be
answered using data describing acquaintanceships and data
representing the properties of acquaintances. While the former is
typically provided in a user’s FOAF file, the latter can usually be
obtained from the FOAF files of the acquaintances. Notice, since it
is unknown who will use Foaf Letter and what their acquaintances
are, the application of traditional query approaches is unsuitable
for Foaf Letter because they assume a fixed set of potentially
relevant data sources is given in advance. Our link traversal based
approach, in contrast, provides the necessary flexibility to discover
relevant data on the fly.</p>
      <p>For our experiments we defined actual SPARQL queries of the
five types for 23 persons who publish their FOAF data according
to the Linked Data principles. Using the resulting 115 queries we
specified a query sequence that simulates a scenario in which all
23 persons use the Foaf Letter application, one after another; each
use includes functionalities F1, F2, and F3, in this order. Hence,
the query sequence orders all 115 queries by the corresponding
person for whom they are defined. Each of the five queries for each
person are ordered by their type: Contact, UnsetProps,
2ndDegree1, 2ndDegree2 and Incoming. While we used this, fixed
query sequence for the majority of experiments we also simulated
a less predictable workload for which we randomly re-ordered the
sequence.</p>
      <p>During the experiments we executed the queries using the
Semantic Web Client Library6 (SWClLib). SWClLib is a prototype
of a query engine that implements link traversal based query
execution. Retrieved descriptor objects are stored separately, each
in a main memory index that contains six hashtables to support
the typical types of triple patterns7 during triple pattern matching.
To simulate a data cache in our experiments we simply reuse the
query-local dataset represented by a set of such indexes instead of
re-populating this set for each query.</p>
      <p>For each query execution we measured the number of results
returned, the overall query execution time, the number of descriptor
objects retrieved during the query execution, the number of
descriptor objects that the data cache contained after query execution, and
the hit rate. We determined the hit rate using the formula
hit rate =</p>
      <p>number of hits
number of look-up requests
where number of look-up requests is the number of all URI
lookups requested during query execution and number of hits is the
number of those URI look-up requests that could be answered from
the cache (i.e. where the corresponding descriptor object lookup(u)
has already been retrieved during a previous query execution).
Notice, due to the implementation of our query approach in SWClLib
some URI look-ups are requested more than once during a query
execution so that even with an empty cache we measured hit rates
greater than 0.</p>
      <p>The experiments were conducted on an Intel Core 2 Duo T7200
processor with 2 GHz, 4 MB L2 cache, and 2 GB main memory.
This machine is connected through the university LAN and runs a
recent 32 bit version of Gentoo Linux with Sun Java 1.6.0. The
actual SPARQL queries as well as the software used to run our
experiments, the measurements, and all the diagrams created from
these measurements can be found on the Website8 for this paper.
4.2</p>
    </sec>
    <sec id="sec-11">
      <title>Experiments</title>
      <p>For our analysis we conducted four experiments: lower bound,
upper bound, given order, and random orders.</p>
      <p>During the lower bound experiment we executed each query of
our query sequence with a new, initially empty query-local dataset.
Hence, we did not reuse the dataset for multiple queries. With this
experiment we obtained measurements for applying our link
traversal based approach without caching. These measurements represent
a baseline for our analysis. The experiment includes three runs of
the query sequence; for our analysis we use the average of the
values measured for each query.</p>
      <p>For the upper bound experiment we executed each query of
the query sequence three times, reusing the query-local dataset for
these three executions, and we took the measurements for the third
execution, only. For each query, however, we used a new, initially
empty, query-local dataset. Hence, this experiment measured the
impact of caching for single queries.
6http://www4.wiwiss.fu-berlin.de/bizer/ng4j/semwebclient/
7subj. given (s), pred. given (p), obj. given (o), s+p, s+o, p+o
8http://squin.org/experiments/CachingLDOW2011/</p>
      <p>The given order experiment executes the whole query sequence
with the same, query-local dataset. Hence, this experiment allows
to evaluate the potential benefit of caching for the execution of
multiple queries, in particular, for later queries in the sequence. We ran
the query sequence three times, each sequence with a new, initially
empty, query-local dataset, and we use the average of the
measurements for each query in the analysis.</p>
      <p>The specified query sequence represents a sequential and rather
predictable use of the Foaf Letter application by multiple users.
Other applications may show other access patterns that result in
less well ordered query sequences. We simulate such a behavior
with the random orders experiment. This experiment corresponds
to 30 runs of the given order experiment with different query
sequences. For each of these 30 runs we use a random permutation of
the original sequence. The measurements from these runs for each
query are averaged for the analysis.</p>
      <p>Before we executed the four experiments we performed a
warmup, that is, a single run of the lower bound experiment. This
warmup is necessary for two reasons: 1.) to avoid measuring the effect of
Web caches and 2.) to enable Linked Data servers that contributed
descriptor objects discovered during query execution to adapt their
caches.
4.3</p>
    </sec>
    <sec id="sec-12">
      <title>Discussion of the Measurements</title>
      <p>In the remainder of this section we discuss the measurements
obtained from our experiments. The charts in Figure 4 put the values
measured for all queries during the different experiments into
relation. The order of the queries in these charts, from top to bottom,
corresponds to the order in which we executed these queries.
Tables 2 to 4 provide a summary of the measurements. These tables
show the average number of query results (cf. Table 2), the hit rate
(cf. Table 3), and the average query execution times (cf. Table 4)
for all four experiments, aggregated over all queries and over the
different types of queries.
4.3.1</p>
      <p>lower bound vs. upper bound</p>
      <p>Figure 4(a) depicts the number of query results reported for each
query (summarized in Table 2). For a few queries we measured a
slightly different number of query results in the lower bound and
the upper bound experiment as the small difference in the
average number of results in Table 2 indicates. The reasons for these
small deviations are measurement errors caused by network
timeouts: Some URI look-ups timed out during some of the executions
of the corresponding queries. Due to these timeouts the query
sys9The few outliers correspond to the measurement errors discussed
before.
tem sometimes missed a few reachable descriptor objects and, thus,
was not able to construct some of the query results. The average
hit rates in Table 3 confirm this explanation: In the ideal case we
would expect a hit rate of 1 for the upper bound experiment but
the average hit rate is slightly below 1 which indicates that even
after the second re-execution of some of the queries the query-local
dataset did not contain all reachable descriptor objects. However,
if we leave out these measurement errors we see that the number of
query results is equal in the lower bound and the upper bound
experiments. We attribute this finding to the fact that even if we reuse
the query-local dataset for multiple executions of the same query
we do not reach more descriptor objects and, thus, cannot construct
more solutions. This effect can also be seen in Figure 4(c) which
illustrates that the number of descriptor objects contained in the
query-local dataset after each query execution is also equal for the
two experiments.</p>
      <p>While caching cannot enable more complete result sets in this
case, it may benefit query performance as can be seen in Table 4
and Figure 4(b): For several queries the execution times were
significantly higher in the lower bound experiment than in the upper
bound experiment; for 47 of the 115 queries the difference was
larger than 30 seconds. This effect can be explained by
comparing the measured hit rates (cf. Figure 4(d), summarized in Table 3).
While the hit rates during the upper bound experiment were
almost9 always 1, the hit rates in the lower bound experiment were
significantly smaller. This observation supports our expectation
that the higher number of URI look-ups needed in the lower bound
experiment (which corresponds to a smaller performance factor)
increased the amount of delays caused by these look-ups and, thus,
had a negative impact on query execution times. Hence, for
multiple executions of a single query a data cache always improves
query performance but it has no impact on result completeness. On
the level of BGPs we explain this finding formally: Since it holds
R ([bcur]) = R ([bcur; bcur])
p (bcur; [bcur]) = c (bcur; [ ])</p>
      <p>s (bcur; [bcur]) = 0
for each BGP bcur, we derive c (bcur; [bcur]) = 0 and thus:
Furthermore, we also derive b (bcur; [bcur]) = b (bcur; [ ]) and thus:</p>
      <sec id="sec-12-1">
        <title>4.3.2 given order</title>
        <p>While caching cannot increase the number of results for single
queries it can indeed improve result completeness for a sequence
of queries as the measurements from the given order experiment
indicate (cf. Figure 4(a) and Table 2). For 53 of the 115 queries the
query engine reported more results in the given order experiment
than in the upper bound or lower bound experiment. These 53
queries are of the type Incoming or 2ndDegree2. For queries of
these types query answering requires descriptor objects that are not
reachable from the execution over an initially empty query-local
dataset. For instance, each Incoming query requires data (in
particular foaf:knows statements) about persons who are not mentioned
in the data about the current user. While the data about the current
user can be discovered and retrieved, starting from the URI of the
user, the required data about these other persons can not be
discovered because there is no link to them in the data about the user. If,
however, data about these other persons has already been retrieved
during the execution of a previous query (maybe because these
persons used the Foaf Letter application themself or they were also
relevant for Contact or UnsetProps queries performed for other
users) the query engine can use this data to return results for the
current Incoming query. For the other query types the cache does
not increase result completeness (ignoring the aforemention
measurement errors) because all descriptor objects necessary to answer
these queries are also reachable with an initially empty query-local
dataset.</p>
        <p>While understanding the query-local dataset as a cache that can
be reused for multiple queries may benefit result completeness it
also has an impact on query performance. Figure 4(b) (summarized
in Table 4) illustrates that the query execution time for all queries
was higher in the given order experiment than in the upper bound
experiment. This is not surprising because all relevant data was
already in cache in the upper bound experiment while for the given
order experiment several URI look-ups were necessary as the hit
rates indicate (cf. Figure 4(d) and Table 3).</p>
        <p>Comparing the given order experiment to the lower bound
experiment instead, we see that caching benefitted query performance
in several cases: 26 of the 115 queries had a significantly higher
(more than 30 seconds) execution time without the cache.
However, more interesting are 12 queries that had a significantly (more
than 30 seconds) higher execution time in the given order
experiment than in the lower bound experiment. Our investigation
revealed that for some of these queries the time to find matching
triples in the query-local dataset dominates the time required for
URI look-ups. Hence, a query-local dataset that contains many
descriptor objects which are irrelevant for these queries (as was the
case in the given order experiment) causes a significant overhead.
Primarily, this effect can be attributed to the representation of the
query-local dataset in the current implementation of our query
engine. Storing each descriptor object in a separate index requires
multiple index look-ups during triple pattern matching, one
lookup for every descriptor object in the query-local dataset. To avoid
this issue we are working on a new data structure that allows to
keep data from multiple descriptor objects in a single index.
However, the other reason for a higher execution time of some of these
12 queries are cases in which the performance factor is negative.</p>
      </sec>
      <sec id="sec-12-2">
        <title>4.3.3 random orders</title>
        <p>The measurements from the random orders experiment indicate
that caching can also benefit the less predictable workload of the
randomly permuted query sequences. On the average, the random
orders experiment showed a similar behavior as the given order
experiment. However, for some of the queries the random orders
experiment reveals result completeness improvements that could
not be measured in the given order experiment. Figure 3 illustrates
this behavior exemplarily for all queries of type Incoming. In the
randomly ordered query sequences, these queries are more often
executed later than in the our original query sequence and, thus,
can take more advantage of the data cache than in the given order
experiment.</p>
        <p>
          To the best of our knowledge no work exists that addresses the
topic of caching in the context of link traversal based query
execution. What comes closest is the work presented by Harth et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
The authors introduce an alternative approach that also uses URI
look-ups to query the Web of Data. Instead of traversing links they
use a data summary to identify descriptor objects that might be
relevant for a query and, thus, should be retrieved for the execution.
While this approach often performs better than our link traversal
approach [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], it requires that all descriptor objects have been
discovered, retrieved and summarized before queries can be executed.
This requirement inhibits serendipitous discovery and utilization of
initially unknown data sources as is possible with our approach.
(a) number of query results
(b) query exec. time in seconds
(c) descriptor objects in cache
(d) hit rate
        </p>
        <p>However, a data summary can be understood as an alternative to
a data cache in our context: Each descriptor object that is
discovered (and used) during the link traversal based execution of a query
can additionally be summarized. The data summary that would
emerge from such a strategy could be used to obtain additional seed
data for the execution of subsequent queries. Moreover, a hybrid
approach that combines a data cache with a data summary would
be possible: When a descriptor object has to be replaced from the
cache because additional space is required, the object could be
summarized and not just be deleted. We aim to investigate these ideas
in the future.</p>
        <p>
          Web caching in general is an extensively studied topic because
it minimizes access latency, reduces network traffic, and causes a
reduced workload on Web servers [
          <xref ref-type="bibr" rid="ref1 ref19 ref8">1, 8, 19</xref>
          ]. Information systems
that consume Web resources also benefit from caching because it
allows for an immediate availability of resources [
          <xref ref-type="bibr" rid="ref18 ref5">5, 18</xref>
          ].
Furthermore, there is a lot of work that studies client side data caching to
improve the performance of applications in general (e.g. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]) and
of query processing in particular (e.g. [
          <xref ref-type="bibr" rid="ref6 ref9">6, 9</xref>
          ]). However, the only
advantage of caching in these studies is an increased efficiency to
complete certain tasks. In our scenario, in contrast, data caching
may also improve the quality of the results of such tasks, that is,
result completeness of query executions in our case.
        </p>
        <p>
          While this paper focuses on data caching, it is also possible to
cache query results (or parts thereof). This idea has been
studied primarily in the context of client-server database systems and
distributed database environments (e.g. [
          <xref ref-type="bibr" rid="ref10 ref15 ref4 ref7">4, 7, 10, 15</xref>
          ]). Similar to
the aforementioned data caching approaches for query processing,
the purpose of caching query results in such systems is to improve
query efficiency. We expect a similar effect from adding a query
result cache to a link traversal based query system.
        </p>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>CONCLUSION</title>
      <p>In this paper we analyze the potential of caching for link
traversal based query execution, a novel query approach which exploits
the Web of Linked Data to its full potential. Our analysis shows an
advantage that is untypical for traditional applications of caching:
In our scenario a data cache may provide for additional query
results. Furthermore, our analysis verifies the possible benefit of a
data cache for improving query performance as one would expect
from the application of caching. However, we also identify cases
in which caching has a negative impact on the query performance.
We understand this drawback as the trade-off for which we gain the
possibility to discover more query results.</p>
      <p>We note, however, that a beneficial impact of caching on result
completeness and query performance as measured in our
experiments, is only possible in certain cases. In general, it requires
semantically similar queries that access data about the same topic,
describing the same entities. More precisely, these queries have
to benefit from data that is reachable from queries executed
previously. The possibility to increase the number of query results is
only given if the reachable data from the executed query sequence
contains required data that is not reachable for the current query
with an initially empty query-local dataset.</p>
      <p>While this paper presents a first, systematic evaluation that
verifies the benefit of adding a data cache to a link traversal based
query system, we ignore the issues of replacement and invalidation
for our analysis. However, the presented formalization provides the
theoretical foundation to develop and to analyze replacement and
invalidation strategies. In the future we will work on concepts to
actually integrate a data cache in our query system.</p>
    </sec>
    <sec id="sec-14">
      <title>REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Barish</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Obraczka</surname>
          </string-name>
          .
          <source>World Wide Web caching: Trends and Techniques. IEEE Communications Magazine</source>
          ,
          <volume>38</volume>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked Data</article-title>
          . Online at http://www. w3.org/DesignIssues/LinkedData.html,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked Data - The Story So Far</article-title>
          .
          <source>Journal on Semantic Web and Information Systems</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>C.-M. Chen</surname>
            and
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Roussopoulos</surname>
          </string-name>
          .
          <article-title>The Implementation and Performance Evaluation of the ADMS Query Optimizer: Integrating Query Result Caching and Matching</article-title>
          .
          <source>In Proceedings of the 4th International Conference on Extending Database Technology (EDBT)</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>Synchronizing a Database to Improve Freshness</article-title>
          .
          <source>In Proceedings of the ACM SIGMOD International Conference on Management of Data</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cluet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kapitskaia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Using LDAP Directory Caches</article-title>
          .
          <source>In Proceedings of the 18th Symposium on Principles of Database Systems (PODS)</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. T.</given-names>
            <surname>Jónsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Semantic Data Caching and Replacement</article-title>
          .
          <source>In Proceedings of the 22th International Conference on Very Large Data Bases (VLDB)</source>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Davison</surname>
          </string-name>
          .
          <article-title>A Web Caching Primer</article-title>
          .
          <source>IEEE Internet Computing</source>
          ,
          <volume>5</volume>
          (
          <issue>4</issue>
          ),
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          .
          <article-title>Client Data Caching: A Foundation for High Performance Object Database Systems</article-title>
          . Kluwer Academic Publishers, Norwell, MA, USA,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Godfrey</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Gryz</surname>
          </string-name>
          .
          <article-title>Answering Queries by Semantic Caches</article-title>
          .
          <source>In Proceedings of the 10th Int. Conference on Database and Expert Systems Applications (DEXA)</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Harth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karnstedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          , K.-U. Sattler, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          .
          <article-title>Data Summaries for On-Demand Queries over Linked Data</article-title>
          .
          <source>In Proceedings of the 19th International Conference on World Wide Web (WWW)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Freytag</surname>
          </string-name>
          .
          <article-title>Executing SPARQL Queries over the Web of Linked Data</article-title>
          .
          <source>In Proceedings of the 8th International Semantic Web Conference (ISWC)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Langegger</surname>
          </string-name>
          .
          <article-title>A Database Perspective on Consuming Linked Data on the Web</article-title>
          .
          <source>Datenbank-Spektrum</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ),
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Iyengar</surname>
          </string-name>
          .
          <article-title>Design and Performance of a General-Purpose Software Cache</article-title>
          .
          <source>In Proceedings of the International Performance Computing and Communications Conference (IPCCC)</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Keller</surname>
          </string-name>
          and J.
          <string-name>
            <surname>Basu</surname>
          </string-name>
          .
          <article-title>A Predicate-based Caching Scheme for Client-Server Database Architectures</article-title>
          .
          <source>VLDB Journal</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ),
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ladwig</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. T.</given-names>
            <surname>Tran</surname>
          </string-name>
          .
          <article-title>Linked Data Query Processing Strategies</article-title>
          .
          <source>In Proceedings of the 9th International Semantic Web Conference (ISWC)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Prud</surname>
          </string-name>
          <article-title>'hommeaux and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Seaborne. SPARQL Query</surname>
          </string-name>
          <article-title>Language for RDF</article-title>
          .
          <source>W3C Recommendation</source>
          , Online at http://www.w3.org/TR/rdf-sparql-query/,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Triantafillou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ntarmos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Yannakopoulos</surname>
          </string-name>
          .
          <article-title>A Cache Engine for E-Content Integration</article-title>
          .
          <source>IEEE Internet Computing</source>
          ,
          <volume>8</volume>
          (
          <issue>2</issue>
          ),
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>A Survey of Web Caching Schemes for the Internet</article-title>
          . ACM Computer Communication Review,
          <volume>29</volume>
          (
          <issue>5</issue>
          ),
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>