=Paper=
{{Paper
|id=None
|storemode=property
|title=Caching and Prefetching Strategies for SPARQL Queries
|pdfUrl=https://ceur-ws.org/Vol-981/USEWOD2013paper1.pdf
|volume=Vol-981
}}
==Caching and Prefetching Strategies for SPARQL Queries==
Caching and Prefetching Strategies
for SPARQL Queries
Johannes Lorey Felix Naumann
Hasso-Plattner-Institut Hasso-Plattner-Institut
Potsdam, Germany Potsdam, Germany
johannes.lorey@hpi.uni-potsdam.de felix.naumann@hpi.uni-potsdam.de
ABSTRACT In terms of relational operations, AND represents an inner
Linked Data repositories offer a wealth of structured facts, join of two graph patterns, UNION denotes their union, and
useful for a wide array of application scenarios. However, re- OPTIONAL indicates a left outer join between P1 and P2 . In
trieving this data using Sparql queries yields a number of addition, AND takes precedence over UNION, and OPTIONAL is
challenges, such as limited endpoint capabilities and avail- always left-associative [10]. While UNION and OPTIONAL are
ability, or high latency for connecting to it. To cope with reserved keywords in actual Sparql queries to indicate the
these challenges, we argue that it is advantageous to cache corresponding connection between two graph patterns, the
data that is relevant for future information needs. However, AND keyword is omitted.
instead of only retaining results of previously issued queries, We call P the parent graph pattern of graph patterns
we aim at retrieving data that is potentially interesting for P1 , . . . , Pn , when it has the form P = P1 ⊕ . . . ⊕ Pn , where ⊕
subsequent requests in advance. To this end, we present dif- is any one of the introduced keywords. Whereas ⊕ is sym-
ferent methods to modify the structure of a query so that the metric for AND and UNION, it is not for OPTIONAL [10]. In any
altered query can be used to retrieve additional related in- Sparql query Q, the graph pattern P with no non-trivial
formation. We evaluate these approaches by applying them parent graph pattern is referred to as the query pattern PQ .
to requests found in real-world Sparql query logs. Typically, curly braces are used to delimit graph patterns:
{P }. Here, if such a delimited graph pattern P represents
a conjunction of multiple triple patterns Ti , i.e., its form is
1. INTRODUCTION P = T1 AND T2 . . . AND Tn , we call P a basic graph pattern
Linked Data sources offer a wealth of information about (BGP). While there is the notion of empty graph patterns
a multitude of topics, including geo-location facts1 , govern- in Sparql, we only consider non-empty graph patterns.
ment data2 , or cross-domain information3 . The SPARQL In our work, we focus on SELECT queries. In general,
Protocol and RDF Query Language (Sparql) has become Sparql allows to limit the projection to certain variables
the de facto query language to retrieve this information from listed after the SELECT statement. Moreover, graph patterns
publicly available endpoints. may contain so-called filter conditions indicated by the key-
However, whereas in principle Sparql facilitates various word FILTER. In a filter condition, specific restrictions, such
information needs by defining complex query constructs, as regular expressions or inequality relations, can be placed
typical public Sparql endpoint characteristics such as high on resources to modify the scope of the selection of a query.
latency, limited hardware resources, or unavailability restrict
on-demand utilization of the data. In this work, we propose
a novel approach for discovering and aggregating potentially
1.2 Contribution and Paper Organization
relevant data for subsequent user requests based on previous As with traditional Web search, Sparql is suitable for
query patterns by rewriting these preceding queries. To this different information needs, such as data exploration or nav-
end, we introduce some fundamental concepts of Sparql igational querying. However, in contrast to simple keyword-
in Sec. 1.1 and outline the contribution and organization of based queries, the well-defined structure and constructs of
this paper in Sec. 1.2. Sparql queries allow for more fine-grained expressions rep-
resenting the user intent. We build on this notion and
1.1 SPARQL Preliminaries present methods to modify queries in order to retrieve infor-
One central concept of a Sparql query is that of a triple mation relevant for subsequent related requests. Further, we
pattern T = (s, p, o) ∈ (V ∪ U ) × (V ∪ U ) × (V ∪ U ∪ L) discuss features of queries and query sequences influencing
with V being a set of variables, U being a set of URIs, and the suitability and effects of these modification methods.
L being a set of literals. A Sparql query contains a number By materializing results potentially relevant for subse-
of graph patterns P1 , P2 , . . . , which are defined recursively: quent queries locally, overall query execution time can be
(i) A valid triple pattern T is a graph pattern. (ii) If P1 and tremendously decreased for a client. Moreover, if the Sparql
P2 are graph patterns, then P1 AND P2 , P1 UNION P2 , and P1 endpoint becomes unavailable for a period of time, thus dis-
OPTIONAL P2 are graph patterns [10]. allowing access to the remote information, a replicated sub-
set of the data may be utilized instead. On the other hand,
1
http://linkedgeodata.org a Sparql endpoint may also utilize knowledge about what
2
http://data-gov.tw.rpi.edu/wiki data may be relevant for future queries, e.g., by storing it in
3
http://dbpedia.org main memory. The main trade-offs for this caching approach
are increased storage requirements and potential data stal- In comparison, our rewriting strategies are not targeted at
eness which may be negligible given the concrete scenario. increasing recall when executing a single query, but instead
The remainder of this paper is structured as follows: In aim to retrieve additional data related to future queries.
Sec. 2, we discuss related work for this paper and point out Moreover, we do not assume any knowledge of the data
differences in our approach. In Sec. 3, we introduce a means source itself or of metadata describing it. Thus, while most
to group Sparql queries by matching similarly structured previous approaches require at least some precomputation,
requests. We then present our different query rewriting ap- our approach can be used ad-hoc solely by analyzing and
proaches in Sec. 4 and present an evaluation using real-world modifying queries issued during run-time.
query logs in Sec. 5. Finally, we conclude this paper in Sec. 6
and highlight some future work in this context. 3. QUERY MATCHING
For different aspects of our work, we need to identify and
2. RELATED WORK cluster similarly-structured queries. To this end, we intro-
Related work for this paper draws mainly from two fields: duce our bottom-up query pattern matching approach based
(i) semantic caching and prefetching, e.g., techniques to re- on matching similar triple patterns they contain.
tain previously fetched data or retrieve data relevant for
subsequent queries, and (ii) query relaxation, e.g., parsing 3.1 Triple Pattern Distance
and modifying a user query to discover relevant resources. To match triple patterns, we determine their distance by
accumulating the distance scores between their parts, i.e.,
2.1 Semantic Caching and Prefetching between the two subjects, predicates, and objects, respec-
Semantic caching builds on the idea of maintaining a lo- tively. Here, if two triple pattern parts are both variables,
cal copy of retrieved data that can be used for subsequent their distance is 0. In case they are both URIs or both liter-
queries. As with traditional caching, one of the major mo- als, their distance is the normalized Levenshtein distance of
tivations for semantic caching is to reduce the transmission the respective strings. Otherwise, e.g., when one subject is
overhead when retrieving data over a network link. Con- a variable and the other a URI, the distance is 1. More for-
ventional approaches, such as tuple or page caching, usually mally, if x1 , x2 are either the subjects, predicates, or objects
retain fetched data based on either temporal or frequency as- of two triple patterns T1 , T2 , respectively, we define their
pects, e.g., by prioritizing least-recently or most-frequently symmetric distance score ∆(x1 , x2 ) as
used items. Such techniques also exist for Sparql query
result caching [12, 15]. Compared to these works, semantic
0, if x1 ∈ V ∧ x2 ∈ V
caching employs more fine-grained information to character-
levenshtein(x1 , x2 ) if (x1 ∈ U ∧ x2 ∈ U )
ize data, e.g., in order to establish variable-sized semantic ∆(x1 , x2 ) := ,
regions containing related tuples [4] or detect data items max(|x1 | , |x2 |) ∨ (x1 ∈ L ∧ x2 ∈ L)
with similar geolocation information [13].
1, else.
Closely related to semantic caching and our work is pre-
fetching. Instead of simply retaining tuples retrieved previ- We determine the overall distance ∆(T1 , T2 ) = ∆(T2 , T1 )
ously, prefetching allows to gather data that is potentially of two triple patterns T1 , T2 by aggregating the individ-
useful for subsequent queries based on semantic informa- ual triple pattern parts distance scores ∆(s1 , s2 ), ∆(p1 , p2 ),
tion derived from past queries or the overall system state. ∆(o1 , o2 ) as follows: In case ∆(s1 , s2 ) = ∆(p1 , p2 ) =
In computer architecture design, prefetching is usually em- ∆(o1 , o2 ) = 0, we define ∆(T1 , T2 ) := 0. Otherwise, there
ployed to request instructions that are anticipated to be ex- exists a minimum triple pattern part distance score min∆ :=
ecuted in the future and place them in the CPU cache. For min(∆(s1 , s2 ), ∆(p1 , p2 ), ∆(o1 , o2 ))) with min∆ > 0. In this
information retrieval, query prefetching typically assumes case, the triple pattern distance score is defined as
a probabilistic model, e.g., considering temporal or spatial ∆(T1 , T2 ) := d∆(s1 , s2 )e+d∆(p1 , p2 )e+d∆(o1 , o2 )e−(1−min∆ )
features [6, 9]. However, there have been no attempts to
prefetch RDF data based on the structure of sequential re- This way, a distance ∆(T1 , T2 ) ≤ 1 always indicates a
lated Sparql queries within and across query sessions. dissimilarity in at most one triple pattern part, whereas for
two non-equal triple pattern parts 1 < ∆(T1 , T2 ) ≤ 2, and
2.2 Query Relaxation a distance score ∆(T1 , T2 ) > 2 hints at differences between
Query relaxation aims at discovering interesting related the two subjects, predicates, and objects.
information based on a user request. For keyword queries, Consider the two basic graph patterns BGP1 and BGP2 in
this process is sometimes referred to as query expansion and Listing 1 and Listing 2, respectively, where the line numbers
has been a major research topic in the field of information re- serve as identifiers for the included triple patterns. Here,
trieval (cf. [3]). In contrast to query refinement which aims the most similar triple pattern for T1 in BGP2 can be deter-
at increasing precision by restricting the scope of a query, mined by computing min(∆(T1 , T4 ), ∆(T1 , T5 ), ∆(T1 , T6 )),
the goal of query relaxation or query expansion is to im- which results in ∆(T1 , T5 ) = (d0e + d0e + d 12
16
4
e − 16 ) = 0.75.
prove the recall in retrieval effectiveness. To this end, often For T2 , the minimum value is ∆(T2 , T6 ) = (d1e + d0e + d0e −
5
precomputed metadata such as language models is utilized. 0) = 1, and for T3 it is ∆(T3 , T4 ) = (d0e+d 14 e+d 59 e− 14
9
)≈
There exist a number of projects for implementing query 1.36. Thus, the most similar triple patterns for T1 , T2 , T3
relaxation for retrieving Linked Data. The authors of [8] in BGP2 are T5 , T6 , and T4 , respectively.
suggest logical relaxations based on ontological metadata.
In contrast, the approach in [7] relies on precomputed sim- 3.2 Basic Graph Pattern Matching
ilarity tables for attribute values whereas in [5] the authors Using the triple pattern distance scores, we can now de-
utilize a language model derived from the knowledge base. termine matchings between basic graph patterns. In our
1 ?city1 rdfs:label "Paris"@fr . 3.3 Query Pattern Matching
2 ?person ?relationWith ?city1 . Real-world Sparql query patterns can be more complex
3 :Auguste_Comte foaf:givenName "Auguste" .
than simple basic graph patterns, i.e., they may contain mul-
Listing 1: Basic Graph Pattern Example BGP1 . tiple recursively layered BGPs connected using the UNION or
OPTIONAL keyword as introduced in Sec. 1.1. Along the lines
of the previous subsection, where we defined BGP matching
using the contained triple patterns, our matching approach
4 :Auguste_Comte foaf:surname "Comte" . for general query patterns is based on recursively matching
5 ?city2 rdfs:label "Montpellier"@en .
6 :Auguste_Comte ?association ?city2 .
the (basic) graph patterns they consist of.
Due to the recursive structure of Sparql queries Q1 , Q2 ,
Listing 2: Basic Graph Pattern Example BGP2 . we can only try to match two basic graph patterns
BGPi , BGPj to one another if these BGPs reside at the
same recursion level of the query patterns PQ1 , PQ2 , respec-
tively. Additionally, the two basic graph patterns and any
work, finding this matching is equivalent to deriving a per- of their parent graph patterns also need to be connected by
fect (complete) bipartite cover with minimum cost between the same keyword to other (basic) graph patterns at their
the triple patterns of the two individual basic graph patterns respective recursion level. If these two conditions are met,
where the cost is determined by the triple pattern distance we say that BGPi , BGPj can be aligned to each other. More
∆(Ti , Tj ). Obviously, a perfect matching of triple patterns generally, if all (basic) graph patterns contained in a parent
is only possible for a complete bipartite graph, i.e., for two graph pattern P1 can be aligned to at least one (basic) graph
BGPs with the same number of triple patterns. If this is not pattern of another parent graph pattern P2 and vice versa,
the case, in order to generate a biclique of triple patterns we P1 , P2 can also be aligned to each other.
pad the basic graph pattern containing less elements using Consider the three Sparql parent graph patterns
dummy triple patterns Tε so that for any triple pattern T
the score ∆(T, Tε ) = ∆(Tε , T ) = ∞.
P1 := BGP1 OPTIONAL (BGP2 UNION BGP3 )
As optimal solutions for generating maximal matchings
P2 := BGP4 OPTIONAL (BGP5 UNION BGP6 ),
can be determined in polynomial time and the input size
P3 := BGP7 OPTIONAL BGP8 ,
(i.e., the number of triple patterns in the two BGPs)
is reasonably small, we choose the well-known Hungarian where BGP1 , . . . , BGP8 are basic graph patterns. Here,
Method [11] to create an assignment with minimum cost. BGP1 can be aligned to BGP4 or BGP7 . Additionally
Furthermore, we assign a maximum cost threshold to all BGP2 or BGP3 can be aligned to either BGP5 and BGP6 ,
derived matchings of triple patterns. Here, we only con- respectively. However, BGP8 cannot be aligned to any other
sider triple pattern matchings with cost ∆(Ti , Tj ) ≤ 1, i.e., basic graph pattern in P1 or P2 , because its recursion level
all matched triple patterns are either identical or differ in is different from those of BGP2 , BGP3 , BGP5 , BGP6 and
at most one of non-variable subject, predicate, or object, re- OPTIONAL is not symmetric, i.e., BGP7 OPTIONAL BGP8 6=
spectively. The cost for triple pattern matchings with higher BGP8 OPTIONAL BGP7 . Thus, while P1 and P2 can be
cost, i.e., matchings (Ti , Tj ) with ∆(Ti , Tj ) > 1 is set to ∞. aligned to each other, P3 cannot be aligned to either of them.
The score ∆(BGP1 , BGP2 ) of derived complete matchings We use a bottom-up approach to try to match two query
MT ⊂ {(Ti , Tj )|(Ti , Tj ) ∈ BGP1 × BGP2 } is defined as: patterns PQ1 , PQ2 to one another. Similarly to the matching
P method introduced in the previous subsection, we try to de-
(Ti ,Tj )∈MT
∆(Ti , Tj ) rive a complete matching with minimal (finite) cost between
∆(BGP1 , BGP2 ) := all basic graph patterns of PQ1 and PQ2 that can be aligned
|MT |
to each other. Here, the cost between two basic graph pat-
If the result of the Hungarian method for MT contains terns is determined by their distance score ∆(BGP1 , BGP2 ).
any individual triple pattern matching with infinite cost, If for any BGP in PQ1 or PQ2 no alignment is possible
∆(BGP1 , BGP2 ) is also infinite. In this case, no complete or its matching has infinite cost, PQ1 or PQ2 cannot be
matching with only finite triple pattern distance scores can matched. If we derive a complete matching for all basic
be determined between BGP1 and BGP2 . Assuming such graph patterns, we check whether the parent graph patterns
a complete matching containing only valid triple pattern of all pairs of matched BGPs can be aligned to each other.
matchings exists, this BGP matching would have been the If this is the case, we continue checking the alignment of the
result of the algorithm as its cost would be < ∞. parent graph patterns until we reach the query pattern.
Conversely, if the algorithm determines an optimal match- In case no alignment is possible, the two query patterns
ing with infinite cost, any other matching with cost < ∞ PQ1 and PQ2 cannot be matched. Conversely, two query
cannot be complete as the algorithm does not terminate be- patterns PQ1 and PQ2 can be matched if they can be aligned
fore discovering a maximal matching. Given our setting of a to each other and a complete matching with finite cost can
complete bipartite graph (or biclique), any maximal match- be established between all basic graph patterns BGPi and
ing is always bound to be a complete matching. BGPj in PQ1 and PQ2 , respectively. Canonically, any graph
Applying this approach on the basic graph patterns illus- pattern can always be matched to itself.
trated in Listing 1 and Listing 2 to determine a complete To group structurally similar queries, we introduce the
matching with minimum cost yields the same triple pat- notion of query clusters. All queries in a query cluster can
tern matchings listed earlier. Thus, the optimal matching be matched to all other queries within the cluster, i.e., there
{(T1 , T5 ), (T2 , T6 ), (T3 , T4 )} has cost 0.75+1+∞
3
, i.e., BGP1 exists a pairwise complete matching with finite cost between
and BGP2 cannot be matched to another. all BGPs of any two queries in the same cluster. Note that
query clusters may be overlapping, i.e., a query can be ele- birthPlace influence
ment of multiple query clusters. :Montpellier :Jean-Baptiste_Say
:Montpellier :Émile_Durkheim
4. QUERY AUGMENTATION Table 1: All results for Query 1 in DBpedia 3.8
The main motivation of our work lies in retrieving infor-
mation for Sparql queries that is also relevant for subse-
user might be interested in more information for a specific
quent related requests. Beyond basic caching, we argue for
resource. However, this might also be the case if the initial
prefetching results: Here, we attempt to modify a query to
result set is empty, e.g., because of misspelled or ambiguous
retrieve additional data potentially relevant for future infor-
vocabulary terms (such as foaf:img and foaf:Image).
mation needs. In this section, we motivate and illustrate
Potentially, there may also exist certain divergences be-
different approaches for modifying queries accordingly.
tween ontological information assumed by the user and the
4.1 Augmentation Concepts vocabulary used in actual data. For example, we discovered
that although a number of properties are used frequently for
We call the process of modifying the query contents query
instances of certain types in DBpedia, they are not defined
augmentation to emphasize that the results retrieved by is-
in the ontology (e.g., dbo:anthem for dbo:Country). On the
suing the original query are included in the result set for the
other hand, there are also a number of defined properties
modified query. In other words, the matches for the unmodi-
that are rarely used in instance data for the corresponding
fied query form a subgraph of the matches for the augmented
classes (e.g., dbo:depth for dbo:Place) [1].
query. In this work, we are typically interested in retrieving
This augmentation is applied by adding a triple pattern
additional information related to a central concept, namely
to the query, where the subject is the central concept and
the subject (i.e., either a variable or URI) occurring most
predicate as well as object are unique variables. Moreover,
often in the query pattern. Moreover, we require a central
these two variables are added to the projection, thus evalu-
concept to be either part of the projection if it is a variable
ating all facts in which the central concept is subject. We
or to influence the selection if it is a URI. We assume that a
highlight the corresponding changes in the resulting Query 2
URI influences the selection if at least one triple pattern, in
by underlining modified or added sections.
which the URI is the subject, contains a projection variable.
Our intuition of query augmentation builds on concepts SELECT ?p ?o ?birthPlace ?influence WHERE {
from information retrieval. For example, in traditional :Auguste_Comte dbo:birthPlace ?birthPlace .
keyword-based search engines, a user might be unaware of ?birthPlace dbo:country :France .
the most suitable string pattern to enter to retrieve all rele- :Auguste_Comte dbo:influencedBy ?influence .
vant results at once. However, in several iterations the user ?influence dbo:deathPlace :Paris .
:Auguste_Comte ?p ?o .
may choose to refine the initial query based on retrieved re-
}
sults. In Linked Data terms, a user might query for more
detailed information about a certain resource or for similar Query 2: Exploratory augmentation of Query 1
information of related resources after analyzing preliminary
results, thus incrementally modifying the initial query. An excerpt of the extended result set for Query 2 is listed
In the remainder of this section, we introduce the dif- in Tab. 2. Please note that we exclude the bindings listed in
ferent augmentation strategies we have implemented while Tab. 1 which are also contained in the result set of Query 2.
emphasizing the particular requirements for their applica-
tion. We illustrate the effect of applying these strategies p o
on Query 1. Put simply, for the central concept French rdf:type dbo:Person
philosopher :Auguste_Comte this query retrieves the birth dbo:birthDate 1798-01-19
place(s) located in France alongside all influences on him dbo:notableIdea :Positivism
that died in Paris. Table 1 lists all bindings retrieved by dbo:influenced :Karl_Marx
issuing Query 1 against the public DBpedia Sparql end- ...
point4 currently containing DBpedia 3.8 data. Table 2: Some results for Query 2 in DBpedia 3.8
PREFIX :
PREFIX dbo:
SELECT ?birthPlace ?influence WHERE { 4.3 Template Augmentation
:Auguste_Comte dbo:birthPlace ?birthPlace . For the template augmentation approach, we first need to
?birthPlace dbo:country :France . identify the cluster the current query belongs to as discussed
:Auguste_Comte dbo:influencedBy ?influence . in Sec. 3.3 by matching the current query pattern to those
?influence dbo:deathPlace :Paris . of preceding requests. If we determine a matching with fi-
}
nite cost, we require this matching to be non-trivial, i.e., to
Query 1: Example of a Sparql query include at least one match between two non-identical basic
graph patterns. This is true if the overall distance score of
the query pattern matching is greater than 0, i.e., if any
4.2 Exploratory Augmentation two matched basic graph patterns contain a pair of triple
patterns (Ti , Tj ) for which either the subjects, predicates,
In exploratory augmentation, we query for additional facts or objects differ.
that are available for the central concept. The idea of ex- A matching between two query patterns P1 and P2 meet-
ploratory augmentation is that based on previous results a ing these requirements allows us to construct a query tem-
4
http://dbpedia.org/sparql plate: Initially, the template query string is identical to the
current query string. We then introduce unique variables type augmentation lies in determining a suitable class for
to replace all differing triple pattern parts included in the which instance data is retrieved, especially without assum-
matching, i.e., either the non-matching subject, predicate, or ing any a priori knowledge of the underlying ontology.
object of triple patterns (Ti , Tj ) for which 0 < ∆(Ti , Tj ) ≤ 1. A number of techniques can be employed to gather this
Here, if the same resource in a BGP is replaced more than data, e.g., using multiple preliminary queries to construct a
once, we use the same unique variable to ensure consistency. simple type hierarchy or utilizing aggregate functions such
Instead of issuing many similarly structured queries with as COUNT to generate heuristics about the distribution of
only little variance, e.g., by using a crawler, a query template different types. In our approach, we introduce a FILTER
instead retrieves all relevant information using only a single NOT EXISTS to exclude all those (generic) types that have
query. A possible query template for Query 1 is illustrated (more specific) subclasses. By doing so, we assume that the
in Query 3. Here, one specific resource has been replaced by endpoint supports Sparql 1.1 expressions and all resources
a variable, which has also been added to the projection of are instances of at least one leaf node in the type hierarchy.
the query in the SELECT statement. As indicated in Tab. 3, If this is not the case, we exclude the filter condition.
the result set contains the previous bindings as well as in- Query 3 illustrates the result of applying type augmen-
formation about other persons with similar properties, e.g., tation on the reference query, i.e., by introducing the new
about poet Paul Fort or scientist Pierre de Fermat. triple patterns regarding rdf:type information, exchanging
the central concept in all other triple patterns, and applying
SELECT ?s ?birthPlace ?influence WHERE { the filter condition. Whereas the results for template aug-
?s dbo:birthPlace ?birthPlace . mentation also include bindings for ?s that are instances of
?birthPlace dbo:country :France . rather generic classes such as dbo:Person, the results for
?s dbo:influencedBy ?influence . Query 3 listed in Tab. 4 only include resources that are in-
?influence dbo:deathPlace :Paris . stances of specific subclasses, e.g., dbo:Philosopher.
}
SELECT ?s ?birthPlace ?influence WHERE {
Query 3: Template augmentation of Query 1 :Auguste_Comte rdf:type ?type .
?s rdf:type ?type .
s birthPlace influence ?s dbo:birthPlace ?birthPlace .
:Jean-Baptiste_ ?birthPlace dbo:country :France .
:Auguste_Comte :Montpellier ?s dbo:influencedBy ?influence .
Say
?influence dbo:deathPlace :Paris .
:Auguste_Comte :Montpellier :Émile_Durkheim FILTER NOT EXISTS {?t1 rdfs:subClassOf ?type}
:Paul_Fort :Reims :Paul_Verlaine }
:Pierre_de_Fermat :Beaumont-
:François_Viète Query 4: Type augmentation of Query 1
de-Lomagne
... s birthPlace influence
Table 3: Some results for Query 3 in DBpedia 3.8 :Jean-Baptiste_
:Auguste_Comte :Montpellier
Say
:Auguste_Comte :Montpellier :Émile_Durkheim
4.4 Type Augmentation :Jean-Paul_ :Maurice_
:Paris
If class membership information in the knowledge base is Sartre Merleau-Ponty
available for the central concept, exploiting this ontologi- :René_ :Descartes,_ :Marine_
cal data can help in discovering information for related re- Descartes Indre-et-Loire Mersenne
sources. In type augmentation we identify the rdf:type of ...
the central concept and retrieve data for the instances be- Table 4: Some results for Query 4 in DBpedia 3.8
longing to the same classes by replacing the central concept
with a unique variable throughout the query.
The goal of type augmentation is similar to that of tem- 4.5 Holistic Augmentation
plate augmentation, i.e., querying information about differ-
The intuition of holistic augmentation is that the scope of
ent related resources. Whereas in template augmentation
Sparql queries can be broadened by removing certain triple
this relatedness is solely determined by the context of the re-
patterns they contain. However, to ensure that the result set
placed triple pattern part, in type augmentation it is derived
of this modified query still contains all results of the original
by exploiting both structural and ontological information.
one, the removed parts must not contain variables essential
Assuming the central concept is identical to the triple pat-
to the projection or selection of a query. In other words, the
tern part replaced for matching BGPs, all bindings retrieved
variables in the SELECT statement still need to be present in
using type augmentation are also retrieved using template
the modified query so that they may be bound to an RDF
augmentation. However, for type augmentation the connec-
term in a graph matching.
tion between different resources may be stronger than for
Typically, if we select a triple pattern to remove from the
template augmentation given their type information.
basic graph pattern BGPi that contains it, we also remove
According to the RDF Schema5 , a resource may be in-
the same triple pattern from any other basic graph pattern
stance of multiple classes, where these classes may either be
BGPj that can be matched to BGPi as described in Sec. 3.3.
unrelated or reside at different levels of the same type hier-
We call such a removal valid, if applying it on a valid Sparql
archy. Therefore, RDF resources may be instances of very
query results in a valid Sparql query, i.e., if all projection
generic types, such as owl:Thing. Hence, one challenge for
variables are referenced in at least one remaining triple pat-
5
http://www.w3.org/TR/rdf-schema tern of the query pattern. Note that there exist queries for
which no valid triple pattern removal is possible, e.g., queries 5.1 Methodology
containing only one triple pattern. We use the timestamp information to provide a meaning-
To identify the most suitable triple pattern to remove from ful segmentation for successive queries by introducing query
a query, we utilize the variable counting heuristic introduced sessions. A query session is a chronologically ordered se-
in [14]. Essentially, this heuristic is based on the assump- quence of at least two queries issued sequentially by the
tion that unbound subjects are more selective than unbound same user (represented by a unique IP address) over a pe-
objects which in turn are more selective than unbound pred- riod of time. We call query sessions homogeneous, if they
icates. Hence, generally a triple pattern with an unbound contain queries belonging to the same cluster, i.e., any query
predicate, but bound subject and object matches less RDF in a session can be matched to any other query in the same
statements in a knowledge base than a triple pattern with session. Otherwise, we call this query session heterogeneous.
either only an unbound subject or object, i.e., is more selec- We exploit statistical features and metadata extracted
tive. Also, a triple pattern with two or three unbound parts from the query logs format for delimiting a query session.
is less selective than a triple pattern with only one or two Specifically, we restrict the duration of a session by compar-
unbound parts, respectively. Thus, the least selective triple ing the timestamps of issued queries with that of the initial
pattern is the one containing only variables. query of a session. Once this difference exceeds a certain
In any query pattern there can be more than one triple threshold, we assume a new query session has started. De-
pattern with maximum selectivity. In this case, we select an pending on the analyzed dataset, we put an upper limit on
arbitrary one for removal. If this removal is not valid, we the number of queries a single session may contain, thereby
check whether a valid removal can be achieved for a different splitting a single query sequence into separate sessions if the
triple pattern with same or lesser selectivity. We continue number of successive queries is greater than this limit.
until we have either exhaustively checked all triple patterns Matchings for triple patterns can easily be transformed
or discovered a validly removable triple pattern. In the latter into triples by applying the corresponding variable bind-
case, we modify the query by deleting the triple pattern from ings. By materializing these triples for the original, unmod-
the parent basic graph pattern and any other basic graph ified query and for all queries generated by applying the
pattern this BGP can be matched to. augmentation strategies introduced in Sec. 4, we maintain
Removing a highly-selective query triple pattern not es- individual result set caches. We then evaluate how many
sential for the projection mostly assists in situations where triples generated for bindings of subsequent queries are al-
(i) the query is too restrictive, (ii) the query contains invalid ready contained in the individual caches. If we determine
statements, or (iii) the data or ontology in the knowledge that a newly generated triple is contained in a cache, we
base is inconsistent, e.g., as described in [1]. One of the two consider this a cache hit. Note that the set of cached triples
most selective triple pattern in Query 1 is crossed out in for each augmentation strategy includes at least the cached
Query 5, thereby indicating its removal in this augmented triples for the original query.
query. Table 5 lists some results for Query 5. We use OpenLink Virtuoso Open-Source Edition version
6.1.3 as Sparql endpoint containing the English DBpe-
SELECT ?birthPlace ?influence WHERE { dia 3.6 dataset released on January 17, 2011, and the Linked-
:Auguste_Comte dbo:birthPlace ?birthPlace . GeoData dump released on April 26, 2011 in separate named
?birthPlace dbo:country :France . graphs. For each augmentation strategy, we restrict the
:Auguste_Comte dbo:influencedBy ?influence . number of retrieved results for performance reasons to a
?influence dbo:deathPlace :Paris .
} maximum of 100,000 using the LIMIT keyword for Sparql.
Query 5: Holistic Augmentation of Query 1 5.2 DBpedia 3.6 Query Logs
The requests included in the DBpedia 3.6 query logs ex-
birthPlace influence hibit timestamps in hourly resolution, e.g., [24/Jan/2011
:Montpellier :Adam_Smith 01:00:00 +0100]. In our analysis, only successive queries
:Montpellier :David_Hume from the same user with identical timestamps may belong to
:Montpellier :Jean-Baptiste_Say the same session. However, as this heuristic introduces some
:Montpellier :Émile_Durkheim vagueness regarding the contents of a session, e.g., by ignor-
... ing session timeouts, we also limit the maximum number of
Table 5: Some results for Query 5 in DBpedia 3.8. queries belonging to any single session to 25. Notice that by
applying this conservative restriction, we potentially lower
our number of cache hits which will most likely increase for
longer sessions. In summary, once a user query with a dif-
5. EVALUATION ferent timestamp is detected or we discover more than 25
To evaluate the applicability of our prefetching strate- successive queries, we assume a new session has started.
gies for Sparql queries, we analyzed parts of the DBpe- Consequently, we base our evaluation on 288 query ses-
dia 3.6 and LinkedGeoData (LGD) query logs provided for sions for which we were able to retrieve results for at least
the USEWOD 2013 data challenge [2]. The log files con- one contained query. Of these query sessions, 176 (61%)
tain a number of requests received by the respective public were homogeneous. On average, the query sessions con-
Sparql endpoints and were collected for different dates in tained around 21 queries. In around 34% of all query ses-
2011. While the specific metadata provided in the Apache sions, we could not identify any cache hits. We assume that
Common Log Format differs slightly for the individual ser- this is because the total result set size for these sessions is
vices, all requests contain the sender’s anonymized IP ad- relatively small: The sessions with no cache hits only result
dress and a timestamp in addition to the actual query. in about 100 generated triples compared to around 3,300
triples for sessions with cache hits. There are two possible We also evaluated how caching the result of every (aug-
reasons for this: (i) Our local Sparql endpoint did not con- mented) query influenced the number of cache hits for sub-
tain the entire DBpedia 3.6 corpus and (ii) some queries did sequent (unmodified) queries in a session. While the number
not yield any results even when executed against the public of query sessions with no cache hits dropped to around 24%,
DBpedia endpoint (e.g., because of syntax errors). for those query sessions with cache hits we observed compa-
For all sessions with at least one cache hit, we illustrate rable results to the ones illustrated. Hence, we assume that
the total number of cache hits in relation to the total num- by analyzing the first query, a suitable caching strategy can
ber of generated (unique) triples for all unmodified queries be determined for all subsequent queries of the same session.
in a session in Fig. 1. Each marker represents the best aug- For example, for homogeneous query sessions applying tem-
mentation strategy resulting in the most cache hits for this plate or type augmentation on the first query most likely
session. If for none of the augmentation strategies the num- results in the same augmented query as applying it on any
ber of hits was greater than the cache hits of the original of the subsequent session requests.
query, the marker “no augmentation” is used. In general, due to the large number of homogeneous query
sessions in the DBpedia query logs, type and template aug-
105
mentation appear to be the most successful among the aug-
Total number of cache hits for best augmentation strategy
Best augmentation strategy mentation strategies. Given the large number of resources
No Augmentation
in DBpedia and our restriction on the maximum number of
Exploratory Augmentation
results, we are impairing the success of these strategies to
104 Template Augmentation
Type Augmentation
some extent, as many potentially relevant facts are simply
Holistic Augmentation
not retrieved. Without this restriction, these prefetching
strategies should yield even better results.
103
On the other hand, the amount of query sessions benefit-
ing most from holistic or exploratory augmentation is lim-
ited. This might stem from the apparently small number of
102
queries issued by human users, towards which these strate-
gies are targeted. Moreover, in more than half the query
sessions (55%) holistic augmentation could not be applied
101
as no valid triple pattern removal was possible. Naturally,
in these cases the number of cache hits for holistic augmen-
0
tation equals the one of not applying any augmentation.
10
100 101 102 103 104 105
Total number of triples generated for all unmodified queries in a session 5.3 LinkedGeoData Query Logs
As the timestamps provided for the queries in the LGD
Figure 1: Best augmentation strategy when caching results logs are precise to the second, we delimit query sessions in
of the first query in a session in the DBpedia 3.6 log files these logs more accurately by introducing a session timeout
and maximum session duration. If for a query from a specific
Overall, our findings indicate that caching the results of user we cannot discover another query from the same user
the first, unaugmented query of a session yields the most within 10 minutes time, we assume the identified query is
amount of cache hits in about 38% of all analyzed query the last one in a session. Overall, we delimit a query session
sessions. The results for type and template augmentation by restricting its duration to a maximum of 60 minutes, its
have the most cache hits in 28% and 23% of the sessions, session timeout to 10 minutes, and its maximum number of
respectively, whereas applying exploratory and holistic aug- queries to 50 (whichever comes first).
mentation on the first session query only results in the most As with the DBpedia query logs, we analyzed only those
cache hits for 7% and 4% of all sessions. query sessions in which we determined at least one query
When considering homogeneous query sessions only, type for which we were able to generate a result as described
augmentation yields the most cache hits in 39% of all ses- above, i.e., we based our evaluation on 440 query sessions.
sions, followed by no augmentation (30%), template aug- This time, for only 9% of these query sessions no cache hits
mentation (28%), exploratory and holistic augmentation could be discovered at all. The analysis of which augmenta-
(1% each). Notice that we discovered a number of homo- tion strategy resulted in the most cache hits for the remain-
geneous query sessions which contain the exact same query ing 424 query sessions is illustrated in Fig. 2.
multiple times. For example, this is the case if only one For the LGD logs, caching the results of unmodified
triple is generated for all queries in a session and identified queries resulted in the most hits (48% of all query sessions),
as cache hit repeatedly as represented by markers located on followed by exploratory augmentation (26%), template aug-
the y-axis. Obviously, if the exact same query is issued over mentation (15%), type and holistic augmentation (5% each).
again within the course of a session, no additional cache hits For heterogeneous query sessions (55% of all query sessions),
can be generated when applying an augmentation strategy. exploratory augmentation is the best augmentation strategy
The mean number of cache hits is highest for exploratory (30%), followed by template augmentation (19%), holistic
augmentation (4,541 cache hits) and lowest for type aug- augmentation (7%), and type augmentation (6%).
mentation (33 cache hits) when considering only those ses- We also discovered that the number of mean cache hits
sions where these two augmentation strategies yield the most was much higher than the ones of the DBpedia log files: For
cache hits. On average, in all sessions with markers above those sessions that benefited from template augmentation
the diagonal (indicated by the dashed line in Fig. 1), each the average number of cache hits was highest with 72,917
generated triple represents a cache hit at least once. and lowest for type augmentation with 48,320. On the other
Total number of cache hits for best augmentation strategy 106 taining similar data of related resources, e.g., as generated
Best augmentation strategy
No Augmentation by applying template and type augmentation.
5 Exploratory Augmentation On the other hand, human users might retrieve more “di-
10
Template Augmentation verse” information, e.g., specific facts about a resource as
Type Augmentation generated by applying exploratory augmentation. Overall,
104 Holistic Augmentation whereas not all augmentation strategies can be applied on
certain queries (e.g., holistic augmentation for queries con-
taining only one triple pattern), combining different strate-
103
gies during the course of a query session might still prove
advantageous for such diversified scenarios.
102 In future work, we aim at capturing the intent in a query
session more accurately, e.g., by using more fine-grained ap-
proaches than that of a central concept. Based on this, more
101 customized augmentation strategies are conceivable, e.g., by
analyzing individual resources contained in a query. Finally,
100
we want to be able to discover on the most suitable augmen-
100 101 102 103 104 105 106 tation strategy based on the contents of a query session and
Total number of triples generated for all unmodified queries in a session thus adjust other query sessions accordingly.
Figure 2: Best augmentation strategy when caching results
of the first query in a session in the LinkedGeoData log files
7. REFERENCES
[1] Z. Abedjan, J. Lorey, and F. Naumann. Reconciling ontologies
and the web of data. In Proceedings of the International
Conference on Information and Knowledge Management
(CIKM), pages 1532–1536, Maui, HI, USA, October 2012.
hand, the average session length was comparatively small [2] B. Berendt, L. Hollink, M. Luczak-Rösch, K. H. Möller, and
with only 12 queries per session. This could be because the D. Vallet. USEWOD2013 – 3rd international workshop on usage
analysis and the web of data. In Proceedings of the Extended
LGD log queries were actually issued by human users (as Semantic Web Conference (ESWC), Montpellier, France, 2013.
opposed to crawlers or other software agents). Intuitively, [3] C. Carpineto and G. Romano. A survey of automatic query
this would also explain why the majority of query sessions expansion in information retrieval. ACM Comput. Surv.,
44(1):1:1–1:50, Jan. 2012.
are heterogeneous: Whereas software agents use somewhat
[4] S. Dar, M. J. Franklin, B. T. Jónsson, D. Srivastava, and
hard-coded HTTP requests to retrieve Linked Data, human M. Tan. Semantic data caching and replacement. In
users are more flexible when issuing queries, e.g., by using Proceedings of the International Conference on Very Large
the LGD Sparql web interface6 . Databases (VLDB), pages 330–341, Bombay, India, 1996.
[5] S. Elbassuoni, M. Ramanath, and G. Weikum. Query
Again, caching the result of every query in a session had relaxation for entity-relationship search. In Proceedings of the
only little impact on the choice of augmentation strategy, Extended Semantic Web Conference (ESWC), pages 62–76,
which is to be expected considering the small mean number Heraklion, Greece, 2011.
of queries in a session. However, the percentage of query [6] T. Fagni, R. Perego, F. Silvestri, and S. Orlando. Boosting the
performance of web search engines: Caching and prefetching
sessions not benefiting from any caching decreased slightly query results by exploiting historical usage data. ACM
to 6%. In general, the cached results for the queries can Transactions on Information Systems, 24(1):51–78, 2006.
almost always be used for subsequent queries in a session [7] A. Hogan, M. Mellotte, G. Powell, and D. Stampouli. Towards
fuzzy query-relaxation for RDF. In Proceedings of the
for the LGD logs. Again, our local Sparql endpoint might Extended Semantic Web Conference (ESWC), pages 687–702,
have not contained all data available in the public Sparql Heraklion, Greece, 2012.
endpoint. However, the impact on our results is negligible as [8] C. A. Hurtado, A. Poulovassilis, and P. T. Wood. Journal on
we were able to generate query results for the vast majority data semantics X. chapter Query relaxation in RDF, pages
31–61. Springer-Verlag, Berlin, Heidelberg, 2008.
of sessions (75%) and cache hits in almost all of these (91%). [9] S. Jonassen, B. B. Cambazoglu, and F. Silvestri. Prefetching
query results and its impact on search engines. In Proceedings
of the ACM International Conference on Information
6. CONCLUSION AND OUTLOOK Retrieval (SIGIR), pages 631–640, Portland, OR, USA, 2012.
[10] C. G. Jorge Pérez, Marcelo Arenas. Semantics and complexity
In this paper, we have presented a number of approaches of SPARQL. ACM Transactions on Database Systems
to modify Sparql queries with the goal to retrieve addi- (TODS), 34(3):16:1–16:45, 2009.
tional results that are potentially relevant for subsequent [11] H. W. Kuhn. The hungarian method for the assignment
problem. Naval Research Logist. Quarterly, 2(1-2):83–97, 1955.
requests of the same query session. We evaluated how of- [12] M. Martin, J. Unbehauen, and S. Auer. Improving the
ten and how many of these additional results are actually performance of semantic web applications with SPARQL query
relevant by identifying query sessions that can exploit this caching. In Proceedings of the Extended Semantic Web
Conference (ESWC), pages 304–318, Heraklion, Greece, 2010.
prefetched data in subsequent requests compared to only
[13] Q. Ren and M. H. Dunham. Using semantic caching to manage
caching the results of the unmodified query. location dependent data in mobile computing. In Proceedings
While the majority of all analyzed query sessions ben- of the International Conference on Mobile Computing and
efited from caching the results of the unmodified or aug- Networking, pages 210–221, Boston, MA, United States, 2000.
[14] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and
mented first query (around 66% for DBpedia and 91% for D. Reynolds. SPARQL basic graph pattern optimization using
LinkedGeoData), the most suitable augmentation strategy selectivity estimation. In Proceedings of the International
(if any) differs from session to session. For example, large- World Wide Web Conference (WWW), pages 595–604, New
York, NY, USA, Apr. 2008.
scale homogeneous query sessions can exploit a cache con-
[15] M. Yang and G. Wu. Caching intermediate result of SPARQL
queries. In Proceedings of the International World Wide Web
6 Conference (WWW), pages 159–160, Hyderabad, India, 2011.
http://linkedgeodata.org/sparql