=Paper= {{Paper |id=None |storemode=property |title=Caching and Prefetching Strategies for SPARQL Queries |pdfUrl=https://ceur-ws.org/Vol-981/USEWOD2013paper1.pdf |volume=Vol-981 }} ==Caching and Prefetching Strategies for SPARQL Queries== https://ceur-ws.org/Vol-981/USEWOD2013paper1.pdf
                           Caching and Prefetching Strategies
                                 for SPARQL Queries

                          Johannes Lorey                                              Felix Naumann
                        Hasso-Plattner-Institut                                    Hasso-Plattner-Institut
                         Potsdam, Germany                                           Potsdam, Germany
           johannes.lorey@hpi.uni-potsdam.de                           felix.naumann@hpi.uni-potsdam.de


ABSTRACT                                                                In terms of relational operations, AND represents an inner
Linked Data repositories offer a wealth of structured facts,         join of two graph patterns, UNION denotes their union, and
useful for a wide array of application scenarios. However, re-       OPTIONAL indicates a left outer join between P1 and P2 . In
trieving this data using Sparql queries yields a number of           addition, AND takes precedence over UNION, and OPTIONAL is
challenges, such as limited endpoint capabilities and avail-         always left-associative [10]. While UNION and OPTIONAL are
ability, or high latency for connecting to it. To cope with          reserved keywords in actual Sparql queries to indicate the
these challenges, we argue that it is advantageous to cache          corresponding connection between two graph patterns, the
data that is relevant for future information needs. However,         AND keyword is omitted.
instead of only retaining results of previously issued queries,         We call P the parent graph pattern of graph patterns
we aim at retrieving data that is potentially interesting for        P1 , . . . , Pn , when it has the form P = P1 ⊕ . . . ⊕ Pn , where ⊕
subsequent requests in advance. To this end, we present dif-         is any one of the introduced keywords. Whereas ⊕ is sym-
ferent methods to modify the structure of a query so that the        metric for AND and UNION, it is not for OPTIONAL [10]. In any
altered query can be used to retrieve additional related in-         Sparql query Q, the graph pattern P with no non-trivial
formation. We evaluate these approaches by applying them             parent graph pattern is referred to as the query pattern PQ .
to requests found in real-world Sparql query logs.                      Typically, curly braces are used to delimit graph patterns:
                                                                     {P }. Here, if such a delimited graph pattern P represents
                                                                     a conjunction of multiple triple patterns Ti , i.e., its form is
1.    INTRODUCTION                                                   P = T1 AND T2 . . . AND Tn , we call P a basic graph pattern
   Linked Data sources offer a wealth of information about           (BGP). While there is the notion of empty graph patterns
a multitude of topics, including geo-location facts1 , govern-       in Sparql, we only consider non-empty graph patterns.
ment data2 , or cross-domain information3 . The SPARQL                  In our work, we focus on SELECT queries. In general,
Protocol and RDF Query Language (Sparql) has become                  Sparql allows to limit the projection to certain variables
the de facto query language to retrieve this information from        listed after the SELECT statement. Moreover, graph patterns
publicly available endpoints.                                        may contain so-called filter conditions indicated by the key-
   However, whereas in principle Sparql facilitates various          word FILTER. In a filter condition, specific restrictions, such
information needs by defining complex query constructs,              as regular expressions or inequality relations, can be placed
typical public Sparql endpoint characteristics such as high          on resources to modify the scope of the selection of a query.
latency, limited hardware resources, or unavailability restrict
on-demand utilization of the data. In this work, we propose
a novel approach for discovering and aggregating potentially
                                                                     1.2    Contribution and Paper Organization
relevant data for subsequent user requests based on previous            As with traditional Web search, Sparql is suitable for
query patterns by rewriting these preceding queries. To this         different information needs, such as data exploration or nav-
end, we introduce some fundamental concepts of Sparql                igational querying. However, in contrast to simple keyword-
in Sec. 1.1 and outline the contribution and organization of         based queries, the well-defined structure and constructs of
this paper in Sec. 1.2.                                              Sparql queries allow for more fine-grained expressions rep-
                                                                     resenting the user intent. We build on this notion and
1.1    SPARQL Preliminaries                                          present methods to modify queries in order to retrieve infor-
   One central concept of a Sparql query is that of a triple         mation relevant for subsequent related requests. Further, we
pattern T = (s, p, o) ∈ (V ∪ U ) × (V ∪ U ) × (V ∪ U ∪ L)            discuss features of queries and query sequences influencing
with V being a set of variables, U being a set of URIs, and          the suitability and effects of these modification methods.
L being a set of literals. A Sparql query contains a number             By materializing results potentially relevant for subse-
of graph patterns P1 , P2 , . . . , which are defined recursively:   quent queries locally, overall query execution time can be
(i) A valid triple pattern T is a graph pattern. (ii) If P1 and      tremendously decreased for a client. Moreover, if the Sparql
P2 are graph patterns, then P1 AND P2 , P1 UNION P2 , and P1         endpoint becomes unavailable for a period of time, thus dis-
OPTIONAL P2 are graph patterns [10].                                 allowing access to the remote information, a replicated sub-
                                                                     set of the data may be utilized instead. On the other hand,
1
  http://linkedgeodata.org                                           a Sparql endpoint may also utilize knowledge about what
2
  http://data-gov.tw.rpi.edu/wiki                                    data may be relevant for future queries, e.g., by storing it in
3
  http://dbpedia.org                                                 main memory. The main trade-offs for this caching approach
are increased storage requirements and potential data stal-         In comparison, our rewriting strategies are not targeted at
eness which may be negligible given the concrete scenario.        increasing recall when executing a single query, but instead
  The remainder of this paper is structured as follows: In        aim to retrieve additional data related to future queries.
Sec. 2, we discuss related work for this paper and point out      Moreover, we do not assume any knowledge of the data
differences in our approach. In Sec. 3, we introduce a means      source itself or of metadata describing it. Thus, while most
to group Sparql queries by matching similarly structured          previous approaches require at least some precomputation,
requests. We then present our different query rewriting ap-       our approach can be used ad-hoc solely by analyzing and
proaches in Sec. 4 and present an evaluation using real-world     modifying queries issued during run-time.
query logs in Sec. 5. Finally, we conclude this paper in Sec. 6
and highlight some future work in this context.                   3.    QUERY MATCHING
                                                                    For different aspects of our work, we need to identify and
2.    RELATED WORK                                                cluster similarly-structured queries. To this end, we intro-
   Related work for this paper draws mainly from two fields:      duce our bottom-up query pattern matching approach based
(i) semantic caching and prefetching, e.g., techniques to re-     on matching similar triple patterns they contain.
tain previously fetched data or retrieve data relevant for
subsequent queries, and (ii) query relaxation, e.g., parsing      3.1    Triple Pattern Distance
and modifying a user query to discover relevant resources.           To match triple patterns, we determine their distance by
                                                                  accumulating the distance scores between their parts, i.e.,
2.1    Semantic Caching and Prefetching                           between the two subjects, predicates, and objects, respec-
   Semantic caching builds on the idea of maintaining a lo-       tively. Here, if two triple pattern parts are both variables,
cal copy of retrieved data that can be used for subsequent        their distance is 0. In case they are both URIs or both liter-
queries. As with traditional caching, one of the major mo-        als, their distance is the normalized Levenshtein distance of
tivations for semantic caching is to reduce the transmission      the respective strings. Otherwise, e.g., when one subject is
overhead when retrieving data over a network link. Con-           a variable and the other a URI, the distance is 1. More for-
ventional approaches, such as tuple or page caching, usually      mally, if x1 , x2 are either the subjects, predicates, or objects
retain fetched data based on either temporal or frequency as-     of two triple patterns T1 , T2 , respectively, we define their
pects, e.g., by prioritizing least-recently or most-frequently    symmetric distance score ∆(x1 , x2 ) as
used items. Such techniques also exist for Sparql query
result caching [12, 15]. Compared to these works, semantic                         
                                                                                     0,                          if x1 ∈ V ∧ x2 ∈ V
caching employs more fine-grained information to character-                        
                                                                                   
                                                                                       levenshtein(x1 , x2 )     if (x1 ∈ U ∧ x2 ∈ U )
                                                                                   
ize data, e.g., in order to establish variable-sized semantic     ∆(x1 , x2 ) :=                             ,
regions containing related tuples [4] or detect data items                              max(|x1 | , |x2 |)      ∨ (x1 ∈ L ∧ x2 ∈ L)
                                                                                   
with similar geolocation information [13].
                                                                                   
                                                                                       1,                        else.
   Closely related to semantic caching and our work is pre-
fetching. Instead of simply retaining tuples retrieved previ-       We determine the overall distance ∆(T1 , T2 ) = ∆(T2 , T1 )
ously, prefetching allows to gather data that is potentially      of two triple patterns T1 , T2 by aggregating the individ-
useful for subsequent queries based on semantic informa-          ual triple pattern parts distance scores ∆(s1 , s2 ), ∆(p1 , p2 ),
tion derived from past queries or the overall system state.       ∆(o1 , o2 ) as follows: In case ∆(s1 , s2 ) = ∆(p1 , p2 ) =
In computer architecture design, prefetching is usually em-       ∆(o1 , o2 ) = 0, we define ∆(T1 , T2 ) := 0. Otherwise, there
ployed to request instructions that are anticipated to be ex-     exists a minimum triple pattern part distance score min∆ :=
ecuted in the future and place them in the CPU cache. For         min(∆(s1 , s2 ), ∆(p1 , p2 ), ∆(o1 , o2 ))) with min∆ > 0. In this
information retrieval, query prefetching typically assumes        case, the triple pattern distance score is defined as
a probabilistic model, e.g., considering temporal or spatial      ∆(T1 , T2 ) := d∆(s1 , s2 )e+d∆(p1 , p2 )e+d∆(o1 , o2 )e−(1−min∆ )
features [6, 9]. However, there have been no attempts to
prefetch RDF data based on the structure of sequential re-          This way, a distance ∆(T1 , T2 ) ≤ 1 always indicates a
lated Sparql queries within and across query sessions.            dissimilarity in at most one triple pattern part, whereas for
                                                                  two non-equal triple pattern parts 1 < ∆(T1 , T2 ) ≤ 2, and
2.2    Query Relaxation                                           a distance score ∆(T1 , T2 ) > 2 hints at differences between
   Query relaxation aims at discovering interesting related       the two subjects, predicates, and objects.
information based on a user request. For keyword queries,           Consider the two basic graph patterns BGP1 and BGP2 in
this process is sometimes referred to as query expansion and      Listing 1 and Listing 2, respectively, where the line numbers
has been a major research topic in the field of information re-   serve as identifiers for the included triple patterns. Here,
trieval (cf. [3]). In contrast to query refinement which aims     the most similar triple pattern for T1 in BGP2 can be deter-
at increasing precision by restricting the scope of a query,      mined by computing min(∆(T1 , T4 ), ∆(T1 , T5 ), ∆(T1 , T6 )),
the goal of query relaxation or query expansion is to im-         which results in ∆(T1 , T5 ) = (d0e + d0e + d 12
                                                                                                                16
                                                                                                                        4
                                                                                                                   e − 16 ) = 0.75.
prove the recall in retrieval effectiveness. To this end, often   For T2 , the minimum value is ∆(T2 , T6 ) = (d1e + d0e + d0e −
                                                                                                                 5
precomputed metadata such as language models is utilized.         0) = 1, and for T3 it is ∆(T3 , T4 ) = (d0e+d 14 e+d 59 e− 14
                                                                                                                              9
                                                                                                                                )≈
   There exist a number of projects for implementing query        1.36. Thus, the most similar triple patterns for T1 , T2 , T3
relaxation for retrieving Linked Data. The authors of [8]         in BGP2 are T5 , T6 , and T4 , respectively.
suggest logical relaxations based on ontological metadata.
In contrast, the approach in [7] relies on precomputed sim-       3.2    Basic Graph Pattern Matching
ilarity tables for attribute values whereas in [5] the authors      Using the triple pattern distance scores, we can now de-
utilize a language model derived from the knowledge base.         termine matchings between basic graph patterns. In our
    1 ?city1 rdfs:label "Paris"@fr .                                  3.3    Query Pattern Matching
    2 ?person ?relationWith ?city1 .                                     Real-world Sparql query patterns can be more complex
    3 :Auguste_Comte foaf:givenName "Auguste" .
                                                                      than simple basic graph patterns, i.e., they may contain mul-
        Listing 1: Basic Graph Pattern Example BGP1 .                 tiple recursively layered BGPs connected using the UNION or
                                                                      OPTIONAL keyword as introduced in Sec. 1.1. Along the lines
                                                                      of the previous subsection, where we defined BGP matching
                                                                      using the contained triple patterns, our matching approach
    4 :Auguste_Comte foaf:surname "Comte" .                           for general query patterns is based on recursively matching
    5 ?city2 rdfs:label "Montpellier"@en .
    6 :Auguste_Comte ?association ?city2 .
                                                                      the (basic) graph patterns they consist of.
                                                                         Due to the recursive structure of Sparql queries Q1 , Q2 ,
        Listing 2: Basic Graph Pattern Example BGP2 .                 we can only try to match two basic graph patterns
                                                                      BGPi , BGPj to one another if these BGPs reside at the
                                                                      same recursion level of the query patterns PQ1 , PQ2 , respec-
                                                                      tively. Additionally, the two basic graph patterns and any
work, finding this matching is equivalent to deriving a per-          of their parent graph patterns also need to be connected by
fect (complete) bipartite cover with minimum cost between             the same keyword to other (basic) graph patterns at their
the triple patterns of the two individual basic graph patterns        respective recursion level. If these two conditions are met,
where the cost is determined by the triple pattern distance           we say that BGPi , BGPj can be aligned to each other. More
∆(Ti , Tj ). Obviously, a perfect matching of triple patterns         generally, if all (basic) graph patterns contained in a parent
is only possible for a complete bipartite graph, i.e., for two        graph pattern P1 can be aligned to at least one (basic) graph
BGPs with the same number of triple patterns. If this is not          pattern of another parent graph pattern P2 and vice versa,
the case, in order to generate a biclique of triple patterns we       P1 , P2 can also be aligned to each other.
pad the basic graph pattern containing less elements using               Consider the three Sparql parent graph patterns
dummy triple patterns Tε so that for any triple pattern T
the score ∆(T, Tε ) = ∆(Tε , T ) = ∞.
                                                                             P1 := BGP1 OPTIONAL (BGP2 UNION BGP3 )
   As optimal solutions for generating maximal matchings
                                                                             P2 := BGP4 OPTIONAL (BGP5 UNION BGP6 ),
can be determined in polynomial time and the input size
                                                                             P3 := BGP7 OPTIONAL BGP8 ,
(i.e., the number of triple patterns in the two BGPs)
is reasonably small, we choose the well-known Hungarian               where BGP1 , . . . , BGP8 are basic graph patterns. Here,
Method [11] to create an assignment with minimum cost.                BGP1 can be aligned to BGP4 or BGP7 . Additionally
Furthermore, we assign a maximum cost threshold to all                BGP2 or BGP3 can be aligned to either BGP5 and BGP6 ,
derived matchings of triple patterns. Here, we only con-              respectively. However, BGP8 cannot be aligned to any other
sider triple pattern matchings with cost ∆(Ti , Tj ) ≤ 1, i.e.,       basic graph pattern in P1 or P2 , because its recursion level
all matched triple patterns are either identical or differ in         is different from those of BGP2 , BGP3 , BGP5 , BGP6 and
at most one of non-variable subject, predicate, or object, re-        OPTIONAL is not symmetric, i.e., BGP7 OPTIONAL BGP8 6=
spectively. The cost for triple pattern matchings with higher         BGP8 OPTIONAL BGP7 . Thus, while P1 and P2 can be
cost, i.e., matchings (Ti , Tj ) with ∆(Ti , Tj ) > 1 is set to ∞.    aligned to each other, P3 cannot be aligned to either of them.
   The score ∆(BGP1 , BGP2 ) of derived complete matchings               We use a bottom-up approach to try to match two query
MT ⊂ {(Ti , Tj )|(Ti , Tj ) ∈ BGP1 × BGP2 } is defined as:            patterns PQ1 , PQ2 to one another. Similarly to the matching
                                P                                     method introduced in the previous subsection, we try to de-
                                   (Ti ,Tj )∈MT
                                                  ∆(Ti , Tj )         rive a complete matching with minimal (finite) cost between
          ∆(BGP1 , BGP2 ) :=                                          all basic graph patterns of PQ1 and PQ2 that can be aligned
                                          |MT |
                                                                      to each other. Here, the cost between two basic graph pat-
  If the result of the Hungarian method for MT contains               terns is determined by their distance score ∆(BGP1 , BGP2 ).
any individual triple pattern matching with infinite cost,               If for any BGP in PQ1 or PQ2 no alignment is possible
∆(BGP1 , BGP2 ) is also infinite. In this case, no complete           or its matching has infinite cost, PQ1 or PQ2 cannot be
matching with only finite triple pattern distance scores can          matched. If we derive a complete matching for all basic
be determined between BGP1 and BGP2 . Assuming such                   graph patterns, we check whether the parent graph patterns
a complete matching containing only valid triple pattern              of all pairs of matched BGPs can be aligned to each other.
matchings exists, this BGP matching would have been the               If this is the case, we continue checking the alignment of the
result of the algorithm as its cost would be < ∞.                     parent graph patterns until we reach the query pattern.
  Conversely, if the algorithm determines an optimal match-              In case no alignment is possible, the two query patterns
ing with infinite cost, any other matching with cost < ∞              PQ1 and PQ2 cannot be matched. Conversely, two query
cannot be complete as the algorithm does not terminate be-            patterns PQ1 and PQ2 can be matched if they can be aligned
fore discovering a maximal matching. Given our setting of a           to each other and a complete matching with finite cost can
complete bipartite graph (or biclique), any maximal match-            be established between all basic graph patterns BGPi and
ing is always bound to be a complete matching.                        BGPj in PQ1 and PQ2 , respectively. Canonically, any graph
  Applying this approach on the basic graph patterns illus-           pattern can always be matched to itself.
trated in Listing 1 and Listing 2 to determine a complete                To group structurally similar queries, we introduce the
matching with minimum cost yields the same triple pat-                notion of query clusters. All queries in a query cluster can
tern matchings listed earlier. Thus, the optimal matching             be matched to all other queries within the cluster, i.e., there
{(T1 , T5 ), (T2 , T6 ), (T3 , T4 )} has cost 0.75+1+∞
                                                  3
                                                       , i.e., BGP1   exists a pairwise complete matching with finite cost between
and BGP2 cannot be matched to another.                                all BGPs of any two queries in the same cluster. Note that
query clusters may be overlapping, i.e., a query can be ele-                 birthPlace         influence
ment of multiple query clusters.                                             :Montpellier :Jean-Baptiste_Say
                                                                             :Montpellier :Émile_Durkheim
4.     QUERY AUGMENTATION                                                Table 1: All results for Query 1 in DBpedia 3.8
  The main motivation of our work lies in retrieving infor-
mation for Sparql queries that is also relevant for subse-
                                                                  user might be interested in more information for a specific
quent related requests. Beyond basic caching, we argue for
                                                                  resource. However, this might also be the case if the initial
prefetching results: Here, we attempt to modify a query to
                                                                  result set is empty, e.g., because of misspelled or ambiguous
retrieve additional data potentially relevant for future infor-
                                                                  vocabulary terms (such as foaf:img and foaf:Image).
mation needs. In this section, we motivate and illustrate
                                                                     Potentially, there may also exist certain divergences be-
different approaches for modifying queries accordingly.
                                                                  tween ontological information assumed by the user and the
4.1     Augmentation Concepts                                     vocabulary used in actual data. For example, we discovered
                                                                  that although a number of properties are used frequently for
   We call the process of modifying the query contents query
                                                                  instances of certain types in DBpedia, they are not defined
augmentation to emphasize that the results retrieved by is-
                                                                  in the ontology (e.g., dbo:anthem for dbo:Country). On the
suing the original query are included in the result set for the
                                                                  other hand, there are also a number of defined properties
modified query. In other words, the matches for the unmodi-
                                                                  that are rarely used in instance data for the corresponding
fied query form a subgraph of the matches for the augmented
                                                                  classes (e.g., dbo:depth for dbo:Place) [1].
query. In this work, we are typically interested in retrieving
                                                                     This augmentation is applied by adding a triple pattern
additional information related to a central concept, namely
                                                                  to the query, where the subject is the central concept and
the subject (i.e., either a variable or URI) occurring most
                                                                  predicate as well as object are unique variables. Moreover,
often in the query pattern. Moreover, we require a central
                                                                  these two variables are added to the projection, thus evalu-
concept to be either part of the projection if it is a variable
                                                                  ating all facts in which the central concept is subject. We
or to influence the selection if it is a URI. We assume that a
                                                                  highlight the corresponding changes in the resulting Query 2
URI influences the selection if at least one triple pattern, in
                                                                  by underlining modified or added sections.
which the URI is the subject, contains a projection variable.
   Our intuition of query augmentation builds on concepts         SELECT ?p ?o ?birthPlace ?influence WHERE {
from information retrieval. For example, in traditional               :Auguste_Comte dbo:birthPlace ?birthPlace .
keyword-based search engines, a user might be unaware of              ?birthPlace dbo:country :France .
the most suitable string pattern to enter to retrieve all rele-       :Auguste_Comte dbo:influencedBy ?influence .
vant results at once. However, in several iterations the user         ?influence dbo:deathPlace :Paris .
                                                                      :Auguste_Comte ?p ?o .
may choose to refine the initial query based on retrieved re-
                                                                  }
sults. In Linked Data terms, a user might query for more
detailed information about a certain resource or for similar           Query 2: Exploratory augmentation of Query 1
information of related resources after analyzing preliminary
results, thus incrementally modifying the initial query.            An excerpt of the extended result set for Query 2 is listed
   In the remainder of this section, we introduce the dif-        in Tab. 2. Please note that we exclude the bindings listed in
ferent augmentation strategies we have implemented while          Tab. 1 which are also contained in the result set of Query 2.
emphasizing the particular requirements for their applica-
tion. We illustrate the effect of applying these strategies                     p                   o
on Query 1. Put simply, for the central concept French                          rdf:type            dbo:Person
philosopher :Auguste_Comte this query retrieves the birth                       dbo:birthDate       1798-01-19
place(s) located in France alongside all influences on him                      dbo:notableIdea :Positivism
that died in Paris. Table 1 lists all bindings retrieved by                     dbo:influenced      :Karl_Marx
issuing Query 1 against the public DBpedia Sparql end-                                          ...
point4 currently containing DBpedia 3.8 data.                           Table 2: Some results for Query 2 in DBpedia 3.8
PREFIX    : 
PREFIX dbo: 
SELECT ?birthPlace ?influence WHERE {                             4.3    Template Augmentation
    :Auguste_Comte dbo:birthPlace ?birthPlace .                     For the template augmentation approach, we first need to
    ?birthPlace dbo:country :France .                             identify the cluster the current query belongs to as discussed
    :Auguste_Comte dbo:influencedBy ?influence .                  in Sec. 3.3 by matching the current query pattern to those
    ?influence dbo:deathPlace :Paris .                            of preceding requests. If we determine a matching with fi-
}
                                                                  nite cost, we require this matching to be non-trivial, i.e., to
          Query 1: Example of a Sparql query                      include at least one match between two non-identical basic
                                                                  graph patterns. This is true if the overall distance score of
                                                                  the query pattern matching is greater than 0, i.e., if any
4.2     Exploratory Augmentation                                  two matched basic graph patterns contain a pair of triple
                                                                  patterns (Ti , Tj ) for which either the subjects, predicates,
  In exploratory augmentation, we query for additional facts      or objects differ.
that are available for the central concept. The idea of ex-         A matching between two query patterns P1 and P2 meet-
ploratory augmentation is that based on previous results a        ing these requirements allows us to construct a query tem-
4
    http://dbpedia.org/sparql                                     plate: Initially, the template query string is identical to the
current query string. We then introduce unique variables              type augmentation lies in determining a suitable class for
to replace all differing triple pattern parts included in the         which instance data is retrieved, especially without assum-
matching, i.e., either the non-matching subject, predicate, or        ing any a priori knowledge of the underlying ontology.
object of triple patterns (Ti , Tj ) for which 0 < ∆(Ti , Tj ) ≤ 1.      A number of techniques can be employed to gather this
Here, if the same resource in a BGP is replaced more than             data, e.g., using multiple preliminary queries to construct a
once, we use the same unique variable to ensure consistency.          simple type hierarchy or utilizing aggregate functions such
  Instead of issuing many similarly structured queries with           as COUNT to generate heuristics about the distribution of
only little variance, e.g., by using a crawler, a query template      different types. In our approach, we introduce a FILTER
instead retrieves all relevant information using only a single        NOT EXISTS to exclude all those (generic) types that have
query. A possible query template for Query 1 is illustrated           (more specific) subclasses. By doing so, we assume that the
in Query 3. Here, one specific resource has been replaced by          endpoint supports Sparql 1.1 expressions and all resources
a variable, which has also been added to the projection of            are instances of at least one leaf node in the type hierarchy.
the query in the SELECT statement. As indicated in Tab. 3,            If this is not the case, we exclude the filter condition.
the result set contains the previous bindings as well as in-             Query 3 illustrates the result of applying type augmen-
formation about other persons with similar properties, e.g.,          tation on the reference query, i.e., by introducing the new
about poet Paul Fort or scientist Pierre de Fermat.                   triple patterns regarding rdf:type information, exchanging
                                                                      the central concept in all other triple patterns, and applying
SELECT ?s ?birthPlace ?influence WHERE {                              the filter condition. Whereas the results for template aug-
    ?s dbo:birthPlace ?birthPlace .                                   mentation also include bindings for ?s that are instances of
    ?birthPlace dbo:country :France .                                 rather generic classes such as dbo:Person, the results for
    ?s dbo:influencedBy ?influence .                                  Query 3 listed in Tab. 4 only include resources that are in-
    ?influence dbo:deathPlace :Paris .                                stances of specific subclasses, e.g., dbo:Philosopher.
}
                                                                      SELECT ?s ?birthPlace ?influence WHERE {
       Query 3: Template augmentation of Query 1                          :Auguste_Comte rdf:type ?type .
                                                                          ?s rdf:type ?type .
    s                     birthPlace         influence                    ?s dbo:birthPlace ?birthPlace .
                                             :Jean-Baptiste_              ?birthPlace dbo:country :France .
    :Auguste_Comte        :Montpellier                                    ?s dbo:influencedBy ?influence .
                                             Say
                                                                          ?influence dbo:deathPlace :Paris .
    :Auguste_Comte        :Montpellier       :Émile_Durkheim              FILTER NOT EXISTS {?t1 rdfs:subClassOf ?type}
    :Paul_Fort            :Reims             :Paul_Verlaine           }
    :Pierre_de_Fermat     :Beaumont-
                                           :François_Viète                      Query 4: Type augmentation of Query 1
                          de-Lomagne
                                ...                                     s                   birthPlace           influence
        Table 3: Some results for Query 3 in DBpedia 3.8                                                         :Jean-Baptiste_
                                                                        :Auguste_Comte      :Montpellier
                                                                                                                 Say
                                                                        :Auguste_Comte      :Montpellier         :Émile_Durkheim
4.4      Type Augmentation                                              :Jean-Paul_                              :Maurice_
                                                                                            :Paris
   If class membership information in the knowledge base is             Sartre                                   Merleau-Ponty
available for the central concept, exploiting this ontologi-            :René_             :Descartes,_          :Marine_
cal data can help in discovering information for related re-            Descartes          Indre-et-Loire        Mersenne
sources. In type augmentation we identify the rdf:type of                                           ...
the central concept and retrieve data for the instances be-                 Table 4: Some results for Query 4 in DBpedia 3.8
longing to the same classes by replacing the central concept
with a unique variable throughout the query.
   The goal of type augmentation is similar to that of tem-           4.5    Holistic Augmentation
plate augmentation, i.e., querying information about differ-
                                                                         The intuition of holistic augmentation is that the scope of
ent related resources. Whereas in template augmentation
                                                                      Sparql queries can be broadened by removing certain triple
this relatedness is solely determined by the context of the re-
                                                                      patterns they contain. However, to ensure that the result set
placed triple pattern part, in type augmentation it is derived
                                                                      of this modified query still contains all results of the original
by exploiting both structural and ontological information.
                                                                      one, the removed parts must not contain variables essential
Assuming the central concept is identical to the triple pat-
                                                                      to the projection or selection of a query. In other words, the
tern part replaced for matching BGPs, all bindings retrieved
                                                                      variables in the SELECT statement still need to be present in
using type augmentation are also retrieved using template
                                                                      the modified query so that they may be bound to an RDF
augmentation. However, for type augmentation the connec-
                                                                      term in a graph matching.
tion between different resources may be stronger than for
                                                                         Typically, if we select a triple pattern to remove from the
template augmentation given their type information.
                                                                      basic graph pattern BGPi that contains it, we also remove
   According to the RDF Schema5 , a resource may be in-
                                                                      the same triple pattern from any other basic graph pattern
stance of multiple classes, where these classes may either be
                                                                      BGPj that can be matched to BGPi as described in Sec. 3.3.
unrelated or reside at different levels of the same type hier-
                                                                      We call such a removal valid, if applying it on a valid Sparql
archy. Therefore, RDF resources may be instances of very
                                                                      query results in a valid Sparql query, i.e., if all projection
generic types, such as owl:Thing. Hence, one challenge for
                                                                      variables are referenced in at least one remaining triple pat-
5
    http://www.w3.org/TR/rdf-schema                                   tern of the query pattern. Note that there exist queries for
which no valid triple pattern removal is possible, e.g., queries    5.1    Methodology
containing only one triple pattern.                                    We use the timestamp information to provide a meaning-
   To identify the most suitable triple pattern to remove from      ful segmentation for successive queries by introducing query
a query, we utilize the variable counting heuristic introduced      sessions. A query session is a chronologically ordered se-
in [14]. Essentially, this heuristic is based on the assump-        quence of at least two queries issued sequentially by the
tion that unbound subjects are more selective than unbound          same user (represented by a unique IP address) over a pe-
objects which in turn are more selective than unbound pred-         riod of time. We call query sessions homogeneous, if they
icates. Hence, generally a triple pattern with an unbound           contain queries belonging to the same cluster, i.e., any query
predicate, but bound subject and object matches less RDF            in a session can be matched to any other query in the same
statements in a knowledge base than a triple pattern with           session. Otherwise, we call this query session heterogeneous.
either only an unbound subject or object, i.e., is more selec-         We exploit statistical features and metadata extracted
tive. Also, a triple pattern with two or three unbound parts        from the query logs format for delimiting a query session.
is less selective than a triple pattern with only one or two        Specifically, we restrict the duration of a session by compar-
unbound parts, respectively. Thus, the least selective triple       ing the timestamps of issued queries with that of the initial
pattern is the one containing only variables.                       query of a session. Once this difference exceeds a certain
   In any query pattern there can be more than one triple           threshold, we assume a new query session has started. De-
pattern with maximum selectivity. In this case, we select an        pending on the analyzed dataset, we put an upper limit on
arbitrary one for removal. If this removal is not valid, we         the number of queries a single session may contain, thereby
check whether a valid removal can be achieved for a different       splitting a single query sequence into separate sessions if the
triple pattern with same or lesser selectivity. We continue         number of successive queries is greater than this limit.
until we have either exhaustively checked all triple patterns          Matchings for triple patterns can easily be transformed
or discovered a validly removable triple pattern. In the latter     into triples by applying the corresponding variable bind-
case, we modify the query by deleting the triple pattern from       ings. By materializing these triples for the original, unmod-
the parent basic graph pattern and any other basic graph            ified query and for all queries generated by applying the
pattern this BGP can be matched to.                                 augmentation strategies introduced in Sec. 4, we maintain
   Removing a highly-selective query triple pattern not es-         individual result set caches. We then evaluate how many
sential for the projection mostly assists in situations where       triples generated for bindings of subsequent queries are al-
(i) the query is too restrictive, (ii) the query contains invalid   ready contained in the individual caches. If we determine
statements, or (iii) the data or ontology in the knowledge          that a newly generated triple is contained in a cache, we
base is inconsistent, e.g., as described in [1]. One of the two     consider this a cache hit. Note that the set of cached triples
most selective triple pattern in Query 1 is crossed out in          for each augmentation strategy includes at least the cached
Query 5, thereby indicating its removal in this augmented           triples for the original query.
query. Table 5 lists some results for Query 5.                         We use OpenLink Virtuoso Open-Source Edition version
                                                                    6.1.3 as Sparql endpoint containing the English DBpe-
SELECT ?birthPlace ?influence WHERE {                               dia 3.6 dataset released on January 17, 2011, and the Linked-
    :Auguste_Comte dbo:birthPlace ?birthPlace .                     GeoData dump released on April 26, 2011 in separate named
    ?birthPlace dbo:country :France .                               graphs. For each augmentation strategy, we restrict the
    :Auguste_Comte dbo:influencedBy ?influence .                    number of retrieved results for performance reasons to a
    ?influence dbo:deathPlace :Paris .
}                                                                   maximum of 100,000 using the LIMIT keyword for Sparql.
        Query 5: Holistic Augmentation of Query 1                   5.2    DBpedia 3.6 Query Logs
                                                                       The requests included in the DBpedia 3.6 query logs ex-
           birthPlace       influence                               hibit timestamps in hourly resolution, e.g., [24/Jan/2011
           :Montpellier     :Adam_Smith                             01:00:00 +0100]. In our analysis, only successive queries
           :Montpellier     :David_Hume                             from the same user with identical timestamps may belong to
           :Montpellier     :Jean-Baptiste_Say                      the same session. However, as this heuristic introduces some
           :Montpellier     :Émile_Durkheim                         vagueness regarding the contents of a session, e.g., by ignor-
                             ...                                    ing session timeouts, we also limit the maximum number of
     Table 5: Some results for Query 5 in DBpedia 3.8.              queries belonging to any single session to 25. Notice that by
                                                                    applying this conservative restriction, we potentially lower
                                                                    our number of cache hits which will most likely increase for
                                                                    longer sessions. In summary, once a user query with a dif-
5.   EVALUATION                                                     ferent timestamp is detected or we discover more than 25
  To evaluate the applicability of our prefetching strate-          successive queries, we assume a new session has started.
gies for Sparql queries, we analyzed parts of the DBpe-                Consequently, we base our evaluation on 288 query ses-
dia 3.6 and LinkedGeoData (LGD) query logs provided for             sions for which we were able to retrieve results for at least
the USEWOD 2013 data challenge [2]. The log files con-              one contained query. Of these query sessions, 176 (61%)
tain a number of requests received by the respective public         were homogeneous. On average, the query sessions con-
Sparql endpoints and were collected for different dates in          tained around 21 queries. In around 34% of all query ses-
2011. While the specific metadata provided in the Apache            sions, we could not identify any cache hits. We assume that
Common Log Format differs slightly for the individual ser-          this is because the total result set size for these sessions is
vices, all requests contain the sender’s anonymized IP ad-          relatively small: The sessions with no cache hits only result
dress and a timestamp in addition to the actual query.              in about 100 generated triples compared to around 3,300
triples for sessions with cache hits. There are two possible                                                                                          We also evaluated how caching the result of every (aug-
reasons for this: (i) Our local Sparql endpoint did not con-                                                                                       mented) query influenced the number of cache hits for sub-
tain the entire DBpedia 3.6 corpus and (ii) some queries did                                                                                       sequent (unmodified) queries in a session. While the number
not yield any results even when executed against the public                                                                                        of query sessions with no cache hits dropped to around 24%,
DBpedia endpoint (e.g., because of syntax errors).                                                                                                 for those query sessions with cache hits we observed compa-
   For all sessions with at least one cache hit, we illustrate                                                                                     rable results to the ones illustrated. Hence, we assume that
the total number of cache hits in relation to the total num-                                                                                       by analyzing the first query, a suitable caching strategy can
ber of generated (unique) triples for all unmodified queries                                                                                       be determined for all subsequent queries of the same session.
in a session in Fig. 1. Each marker represents the best aug-                                                                                       For example, for homogeneous query sessions applying tem-
mentation strategy resulting in the most cache hits for this                                                                                       plate or type augmentation on the first query most likely
session. If for none of the augmentation strategies the num-                                                                                       results in the same augmented query as applying it on any
ber of hits was greater than the cache hits of the original                                                                                        of the subsequent session requests.
query, the marker “no augmentation” is used.                                                                                                          In general, due to the large number of homogeneous query
                                                                                                                                                   sessions in the DBpedia query logs, type and template aug-
                                                            105
                                                                                                                                                   mentation appear to be the most successful among the aug-
Total number of cache hits for best augmentation strategy




                                                                       Best augmentation strategy                                                  mentation strategies. Given the large number of resources
                                                                             No Augmentation
                                                                                                                                                   in DBpedia and our restriction on the maximum number of
                                                                     Exploratory Augmentation
                                                                                                                                                   results, we are impairing the success of these strategies to
                                                            104       Template Augmentation
                                                                           Type Augmentation
                                                                                                                                                   some extent, as many potentially relevant facts are simply
                                                                         Holistic Augmentation
                                                                                                                                                   not retrieved. Without this restriction, these prefetching
                                                                                                                                                   strategies should yield even better results.
                                                            103
                                                                                                                                                      On the other hand, the amount of query sessions benefit-
                                                                                                                                                   ing most from holistic or exploratory augmentation is lim-
                                                                                                                                                   ited. This might stem from the apparently small number of
                                                            102
                                                                                                                                                   queries issued by human users, towards which these strate-
                                                                                                                                                   gies are targeted. Moreover, in more than half the query
                                                                                                                                                   sessions (55%) holistic augmentation could not be applied
                                                            101
                                                                                                                                                   as no valid triple pattern removal was possible. Naturally,
                                                                                                                                                   in these cases the number of cache hits for holistic augmen-
                                                                 0
                                                                                                                                                   tation equals the one of not applying any augmentation.
                                                            10
                                                                 100          101              102             103           104             105
                                                                    Total number of triples generated for all unmodified queries in a session      5.3   LinkedGeoData Query Logs
                                                                                                                                                      As the timestamps provided for the queries in the LGD
Figure 1: Best augmentation strategy when caching results                                                                                          logs are precise to the second, we delimit query sessions in
of the first query in a session in the DBpedia 3.6 log files                                                                                       these logs more accurately by introducing a session timeout
                                                                                                                                                   and maximum session duration. If for a query from a specific
   Overall, our findings indicate that caching the results of                                                                                      user we cannot discover another query from the same user
the first, unaugmented query of a session yields the most                                                                                          within 10 minutes time, we assume the identified query is
amount of cache hits in about 38% of all analyzed query                                                                                            the last one in a session. Overall, we delimit a query session
sessions. The results for type and template augmentation                                                                                           by restricting its duration to a maximum of 60 minutes, its
have the most cache hits in 28% and 23% of the sessions,                                                                                           session timeout to 10 minutes, and its maximum number of
respectively, whereas applying exploratory and holistic aug-                                                                                       queries to 50 (whichever comes first).
mentation on the first session query only results in the most                                                                                         As with the DBpedia query logs, we analyzed only those
cache hits for 7% and 4% of all sessions.                                                                                                          query sessions in which we determined at least one query
   When considering homogeneous query sessions only, type                                                                                          for which we were able to generate a result as described
augmentation yields the most cache hits in 39% of all ses-                                                                                         above, i.e., we based our evaluation on 440 query sessions.
sions, followed by no augmentation (30%), template aug-                                                                                            This time, for only 9% of these query sessions no cache hits
mentation (28%), exploratory and holistic augmentation                                                                                             could be discovered at all. The analysis of which augmenta-
(1% each). Notice that we discovered a number of homo-                                                                                             tion strategy resulted in the most cache hits for the remain-
geneous query sessions which contain the exact same query                                                                                          ing 424 query sessions is illustrated in Fig. 2.
multiple times. For example, this is the case if only one                                                                                             For the LGD logs, caching the results of unmodified
triple is generated for all queries in a session and identified                                                                                    queries resulted in the most hits (48% of all query sessions),
as cache hit repeatedly as represented by markers located on                                                                                       followed by exploratory augmentation (26%), template aug-
the y-axis. Obviously, if the exact same query is issued over                                                                                      mentation (15%), type and holistic augmentation (5% each).
again within the course of a session, no additional cache hits                                                                                     For heterogeneous query sessions (55% of all query sessions),
can be generated when applying an augmentation strategy.                                                                                           exploratory augmentation is the best augmentation strategy
   The mean number of cache hits is highest for exploratory                                                                                        (30%), followed by template augmentation (19%), holistic
augmentation (4,541 cache hits) and lowest for type aug-                                                                                           augmentation (7%), and type augmentation (6%).
mentation (33 cache hits) when considering only those ses-                                                                                            We also discovered that the number of mean cache hits
sions where these two augmentation strategies yield the most                                                                                       was much higher than the ones of the DBpedia log files: For
cache hits. On average, in all sessions with markers above                                                                                         those sessions that benefited from template augmentation
the diagonal (indicated by the dashed line in Fig. 1), each                                                                                        the average number of cache hits was highest with 72,917
generated triple represents a cache hit at least once.                                                                                             and lowest for type augmentation with 48,320. On the other
Total number of cache hits for best augmentation strategy   106                                                                                  taining similar data of related resources, e.g., as generated
                                                                        Best augmentation strategy
                                                                              No Augmentation                                                    by applying template and type augmentation.
                                                                 5    Exploratory Augmentation                                                     On the other hand, human users might retrieve more “di-
                                                            10
                                                                       Template Augmentation                                                     verse” information, e.g., specific facts about a resource as
                                                                            Type Augmentation                                                    generated by applying exploratory augmentation. Overall,
                                                            104          Holistic Augmentation                                                   whereas not all augmentation strategies can be applied on
                                                                                                                                                 certain queries (e.g., holistic augmentation for queries con-
                                                                                                                                                 taining only one triple pattern), combining different strate-
                                                            103
                                                                                                                                                 gies during the course of a query session might still prove
                                                                                                                                                 advantageous for such diversified scenarios.
                                                            102                                                                                    In future work, we aim at capturing the intent in a query
                                                                                                                                                 session more accurately, e.g., by using more fine-grained ap-
                                                                                                                                                 proaches than that of a central concept. Based on this, more
                                                            101                                                                                  customized augmentation strategies are conceivable, e.g., by
                                                                                                                                                 analyzing individual resources contained in a query. Finally,
                                                            100
                                                                                                                                                 we want to be able to discover on the most suitable augmen-
                                                               100         101          102         103          104          105          106   tation strategy based on the contents of a query session and
                                                                  Total number of triples generated for all unmodified queries in a session      thus adjust other query sessions accordingly.

Figure 2: Best augmentation strategy when caching results
of the first query in a session in the LinkedGeoData log files
                                                                                                                                                 7.    REFERENCES
                                                                                                                                                  [1] Z. Abedjan, J. Lorey, and F. Naumann. Reconciling ontologies
                                                                                                                                                      and the web of data. In Proceedings of the International
                                                                                                                                                      Conference on Information and Knowledge Management
                                                                                                                                                      (CIKM), pages 1532–1536, Maui, HI, USA, October 2012.
hand, the average session length was comparatively small                                                                                          [2] B. Berendt, L. Hollink, M. Luczak-Rösch, K. H. Möller, and
with only 12 queries per session. This could be because the                                                                                           D. Vallet. USEWOD2013 – 3rd international workshop on usage
                                                                                                                                                      analysis and the web of data. In Proceedings of the Extended
LGD log queries were actually issued by human users (as                                                                                               Semantic Web Conference (ESWC), Montpellier, France, 2013.
opposed to crawlers or other software agents). Intuitively,                                                                                       [3] C. Carpineto and G. Romano. A survey of automatic query
this would also explain why the majority of query sessions                                                                                            expansion in information retrieval. ACM Comput. Surv.,
                                                                                                                                                      44(1):1:1–1:50, Jan. 2012.
are heterogeneous: Whereas software agents use somewhat
                                                                                                                                                  [4] S. Dar, M. J. Franklin, B. T. Jónsson, D. Srivastava, and
hard-coded HTTP requests to retrieve Linked Data, human                                                                                               M. Tan. Semantic data caching and replacement. In
users are more flexible when issuing queries, e.g., by using                                                                                          Proceedings of the International Conference on Very Large
the LGD Sparql web interface6 .                                                                                                                       Databases (VLDB), pages 330–341, Bombay, India, 1996.
                                                                                                                                                  [5] S. Elbassuoni, M. Ramanath, and G. Weikum. Query
   Again, caching the result of every query in a session had                                                                                          relaxation for entity-relationship search. In Proceedings of the
only little impact on the choice of augmentation strategy,                                                                                            Extended Semantic Web Conference (ESWC), pages 62–76,
which is to be expected considering the small mean number                                                                                             Heraklion, Greece, 2011.
of queries in a session. However, the percentage of query                                                                                         [6] T. Fagni, R. Perego, F. Silvestri, and S. Orlando. Boosting the
                                                                                                                                                      performance of web search engines: Caching and prefetching
sessions not benefiting from any caching decreased slightly                                                                                           query results by exploiting historical usage data. ACM
to 6%. In general, the cached results for the queries can                                                                                             Transactions on Information Systems, 24(1):51–78, 2006.
almost always be used for subsequent queries in a session                                                                                         [7] A. Hogan, M. Mellotte, G. Powell, and D. Stampouli. Towards
                                                                                                                                                      fuzzy query-relaxation for RDF. In Proceedings of the
for the LGD logs. Again, our local Sparql endpoint might                                                                                              Extended Semantic Web Conference (ESWC), pages 687–702,
have not contained all data available in the public Sparql                                                                                            Heraklion, Greece, 2012.
endpoint. However, the impact on our results is negligible as                                                                                     [8] C. A. Hurtado, A. Poulovassilis, and P. T. Wood. Journal on
we were able to generate query results for the vast majority                                                                                          data semantics X. chapter Query relaxation in RDF, pages
                                                                                                                                                      31–61. Springer-Verlag, Berlin, Heidelberg, 2008.
of sessions (75%) and cache hits in almost all of these (91%).                                                                                    [9] S. Jonassen, B. B. Cambazoglu, and F. Silvestri. Prefetching
                                                                                                                                                      query results and its impact on search engines. In Proceedings
                                                                                                                                                      of the ACM International Conference on Information
6.                                                                   CONCLUSION AND OUTLOOK                                                           Retrieval (SIGIR), pages 631–640, Portland, OR, USA, 2012.
                                                                                                                                                 [10] C. G. Jorge Pérez, Marcelo Arenas. Semantics and complexity
   In this paper, we have presented a number of approaches                                                                                            of SPARQL. ACM Transactions on Database Systems
to modify Sparql queries with the goal to retrieve addi-                                                                                              (TODS), 34(3):16:1–16:45, 2009.
tional results that are potentially relevant for subsequent                                                                                      [11] H. W. Kuhn. The hungarian method for the assignment
                                                                                                                                                      problem. Naval Research Logist. Quarterly, 2(1-2):83–97, 1955.
requests of the same query session. We evaluated how of-                                                                                         [12] M. Martin, J. Unbehauen, and S. Auer. Improving the
ten and how many of these additional results are actually                                                                                             performance of semantic web applications with SPARQL query
relevant by identifying query sessions that can exploit this                                                                                          caching. In Proceedings of the Extended Semantic Web
                                                                                                                                                      Conference (ESWC), pages 304–318, Heraklion, Greece, 2010.
prefetched data in subsequent requests compared to only
                                                                                                                                                 [13] Q. Ren and M. H. Dunham. Using semantic caching to manage
caching the results of the unmodified query.                                                                                                          location dependent data in mobile computing. In Proceedings
   While the majority of all analyzed query sessions ben-                                                                                             of the International Conference on Mobile Computing and
efited from caching the results of the unmodified or aug-                                                                                             Networking, pages 210–221, Boston, MA, United States, 2000.
                                                                                                                                                 [14] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and
mented first query (around 66% for DBpedia and 91% for                                                                                                D. Reynolds. SPARQL basic graph pattern optimization using
LinkedGeoData), the most suitable augmentation strategy                                                                                               selectivity estimation. In Proceedings of the International
(if any) differs from session to session. For example, large-                                                                                         World Wide Web Conference (WWW), pages 595–604, New
                                                                                                                                                      York, NY, USA, Apr. 2008.
scale homogeneous query sessions can exploit a cache con-
                                                                                                                                                 [15] M. Yang and G. Wu. Caching intermediate result of SPARQL
                                                                                                                                                      queries. In Proceedings of the International World Wide Web
6                                                                                                                                                     Conference (WWW), pages 159–160, Hyderabad, India, 2011.
                 http://linkedgeodata.org/sparql