Incremental SPARQL Evaluation for Query Answering on Linked Data

Florian Schmedding
Department of Computer Science, Albert Ludwig University of Freiburg, Germany
schmeddi@informatik.uni-freiburg.de

Abstract. SPARQL is the standard query language for RDF data. However, its application to Linked Data is challenging because the assumption that all necessary data is present at the beginning of the evaluation does not apply. Some relevant data sources may only be discovered by processing available data. Existing approaches provide implementations that compute results for basic graph patterns incrementally while retrieving the data. We contribute to this area by a formal analysis of the SPARQL algebra that provides incremental adaptations of its operations. This enables us to evaluate the costs of the incremental evaluation for the design of optimizers that choose the presumably best computation depending on the number of insertions and deletions. In addition, we propose a construction of the SPARQL dataset from Linked Data resources that enables the usage of the Graph-operator in query answering for Linked Data.

1 Introduction

On the Semantic Web, data publishing according to the Linked Data [4] principles has gained significant importance. The numerous projects mentioned in the Linking Open Data cloud diagram¹ and recent ones in librarianship [8,17] show its broad adoption by private, public, and governmental initiatives. Apart from its appealing simplicity in data publication, its inherent distribution can advance the demanded decentralization of the Web [2] by allocating data at many sites rather than in centralized stores. However, Linked Data raises new challenges for query answering. In our research, we investigate these problems and contribute novel approaches for SPARQL [19] evaluation over Linked Data.

By interweaving name and address (cf. [5]), resources become dereferenceable in Linked Data and return an RDF description [14] on request.
To put it simply, each resource can be perceived as a data source. This has three implications:

1. The number of data sources is proportional to the number of resources.
2. Creating and deleting resources changes the number of data sources.
3. Data sources cannot be classified without retrieving their content because resource names are not related to the content.

¹ See http://richard.cyganiak.de/2007/10/lod/

Accordingly, query answering on Linked Data is quite different from traditional distributed query answering. On the one hand, it is in general not possible to specify all relevant data sources for a query in advance. Even if all relevant sources have been identified for some query, another query may require different sources, and any new resource in the data may be an additional relevant source. On the other hand, subqueries cannot be delegated to the remote sources because Linked Data does not require the presence of query processors. Thus concepts like the Service-operator from the federation extension² [6] of SPARQL 1.1 cannot be used to distribute parts of the query, for instance.

We illustrate our scenario in a small example with the query ‘Select ?x, ?n Where {alice knows ?x. ?x name ?n}’ to find out the names of Alice’s friends. Applying the “follow your nose”-approach from Hartig et al. [11] to discover relevant data sources, we start by dereferencing alice. We may receive the triples {(alice, knows, bob), (alice, knows, charlie), (bob, name, “Bobby”)}, and compute the result {{?x ↦ bob, ?n ↦ “Bobby”}}. Next, we request data about charlie, receiving {(charlie, name, “Charlie”)}, and extend the result with {?x ↦ charlie, ?n ↦ “Charlie”}. Searching for more results, we also request bob and get {(bob, name, “Bob”)}. Having traversed all links³, the final result is {{?x ↦ bob, ?n ↦ “Bobby”}, {?x ↦ charlie, ?n ↦ “Charlie”}, {?x ↦ bob, ?n ↦ “Bob”}}.
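The traversal in this example can be replayed in a few lines of Python. This is only a minimal sketch under stated assumptions: the dictionary `WEB` stands in for dereferencing (it is not part of the original work), the helper names `evaluate` and `follow_your_nose` are hypothetical, the query is hard-coded, and the results are recomputed directly after each addition rather than incrementally. The order in which bob and charlie are dereferenced is arbitrary and may differ from the narrative above.

```python
# Toy "Web": dereferencing an IRI returns the triples of its description.
# The triples are those of the example; the dict itself is an assumption.
WEB = {
    "alice": [("alice", "knows", "bob"),
              ("alice", "knows", "charlie"),
              ("bob", "name", "Bobby")],
    "charlie": [("charlie", "name", "Charlie")],
    "bob": [("bob", "name", "Bob")],
}

def evaluate(graph):
    """Directly answer {alice knows ?x. ?x name ?n} over the triples seen so far."""
    friends = {o for (s, p, o) in graph if s == "alice" and p == "knows"}
    return {(s, o) for (s, p, o) in graph if p == "name" and s in friends}

def follow_your_nose(seed):
    """Dereference resources breadth-first; record new solutions per addition."""
    graph, seen, queue = set(), set(), [seed]
    results, trace = set(), []
    while queue:
        iri = queue.pop(0)
        if iri in seen or iri not in WEB:  # literals and unknown IRIs are skipped
            continue
        seen.add(iri)
        graph.update(WEB[iri])
        new = evaluate(graph) - results    # solutions gained by this addition
        results |= new
        trace.append((iri, new))
        for (s, _p, o) in WEB[iri]:        # predicates are not followed
            queue.extend([s, o])
    return results, trace

results, trace = follow_your_nose("alice")
for iri, new in trace:
    print(iri, sorted(new))
```

Each entry of `trace` pairs a dereferenced resource with the solutions (as `(?x, ?n)` pairs) that became available through it.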
In the present work, we investigate the suggested incremental computation of the results formally.

² Working Draft at http://www.w3.org/TR/2010/WD-sparql11-federated-query-20100601/
³ We do not consider the predicates here.

1.1 Related Work

Recent work has presented different strategies and implementations to evaluate queries over Linked Data. In the terminology of Ladwig et al. [15], Hartig et al. [11] follow a bottom-up approach, i. e., a query is evaluated without any prior information. Dereferencing the query constants and then following some links in the data, the result is computed incrementally, similar to our example above. However, their implementation with so-called non-blocking iterators may miss some results depending on the evaluation order. Hartig alleviates this problem in [10] with heuristic adjustments.

In contrast, Harth et al. [9] employ a top-down (cf. [15]) strategy. Acquired knowledge about sources is organized in a special index before queries are processed. The index is used to select the most promising sources for a given query. For our example, their index would at best recommend alice, bob, and charlie, letting the result be generated in a single run. In line with our arguments, Harth et al. also mention that query completeness can increase when data sources that are encountered during the evaluation but are not indexed are considered as well (e. g., retrieve charlie if only alice and bob are indexed).

Ladwig et al. elaborate this idea and propose a mixed, or exploration-based, approach. They propose strategies for query-specific source selection and use a symmetric hash join for the incremental computation that, unlike the non-blocking iterators, generates all results. In [16] they refine their join method with the intention to consider a local storage beside the remote sources. However, the implementations are not based on an analysis of the SPARQL semantics.
They consider only basic graph patterns, a subset of SPARQL, and do not deal with decrements potentially induced by optional graph patterns.

1.2 Contribution

We are interested in the incremental SPARQL evaluation over increasing datasets generated by bottom-up Linked Data query answering. In accordance with [10] and [15], our assumption is that the solutions should be generated after each addition (we speak of immediate processing) because an exhaustive link traversal prior to returning the first results seems infeasible. The contrary, deferred processing, would delay the result computation until all data has been loaded. However, we need to investigate whether an incremental computation, i. e., modifying current results when the data changes, is superior in this case to a direct computation, i. e., deriving all results from scratch. At first glance, this seems true for monotonically increasing results and questionable for optional graph patterns, which can introduce negation by failure. Therefore we study each SPARQL operator in detail to provide adaptations that enable incremental computation. By estimating the processing costs we obtain evidence for criteria that render one computation method preferable over the other, and make a step towards optimized immediate query processing for Linked Data.

Outline. Next, we introduce RDF, SPARQL, Linked Data, and some algebraic equivalences that are necessary for our approach. In Sec. 3 we elaborate on our approach and show the incremental adaptations of the SPARQL operators. We compare our approach to the direct computation in Sec. 4. In Sec. 5, we conclude our work and sketch a possible further development, the application of the HTTP cache mechanism in query answering for Linked Data.

2 Preliminaries

We introduce the RDF data model and the SPARQL query language, with special attention to the construction of the dataset by link traversal and to the relation between the Graph-operator and Linked Data resources.
2.1 RDF

The RDF data format [12] essentially describes graphs with directed labeled edges. We consider RDF terms T comprising three pairwise disjoint sets I (IRIs), B (blank nodes), and L (literals). An RDF triple (s, p, o) ∈ (I ∪ B) × I × T connects node s (subject) through the directed labeled edge p (predicate) with node o (object). A finite set of triples is called an RDF graph. The blank nodes in an RDF graph G are denoted by blank(G). We distinguish blank nodes with the prefix ‘_:’ and literals with double quotes (e. g., _:bn01, “Rain”) from IRIs.

A named graph 𝒢 [7] is an entity ⟨u, G⟩ with a name u ∈ I and an RDF graph G. It holds that name(𝒢) = u and gr(𝒢) = G. Distinct named graphs do not share any blank nodes. An RDF dataset D (cf. [1]) is a set containing a possibly empty default graph dg(D) = G₀ and zero or more named graphs, ngs(D) = ∅ or ngs(D) = {𝒢₁, …, 𝒢ₙ} with 𝒢ᵢ = ⟨uᵢ, Gᵢ⟩, where for i ≠ j (i) name(𝒢ᵢ) ≠ name(𝒢ⱼ), and (ii) blank(Gᵢ) ∩ blank(Gⱼ) = ∅. We write names(D) to denote {name(𝒢ᵢ) | 𝒢ᵢ ∈ ngs(D)}, and gr(u)_D = gr(𝒢ᵢ) if name(𝒢ᵢ) = u and 𝒢ᵢ ∈ ngs(D); otherwise it is the empty RDF graph. We use the operator ‘⊔’ to merge two RDF graphs and define the operator ‘+’ as follows. For a dataset D and a named graph 𝒢, D + 𝒢 = D ∪ 𝒢 with G₀ = dg(D) ⊔ gr(𝒢) if name(𝒢) ∉ names(D), else D + 𝒢 = D.

2.2 SPARQL

SPARQL is a W3C-recommended [19] query language for RDF data. We follow the compositional semantics in [18,20] and include the Graph-operator as in [1]. As in those works, our syntax differs from the W3C syntax used in the previous example.

Syntax. Let V be a set of variables, V ∩ T = ∅. We indicate variables with a leading question mark (e. g., ?x). A triple pattern (s, p, o) ∈ (I ∪ V) × (I ∪ V) × (I ∪ L ∪ V) is a SPARQL expression.⁴ The variables of a triple pattern t are denoted by vars(t). If P₁ and P₂ are SPARQL expressions, so are P₁ Filter R, P₁ Union P₂, P₁ And P₂, P₁ Opt P₂, u Graph P₁ (u ∈ I), and ?x Graph P₁.
R denotes a filter condition: ?x = ?y, ?x = c (c ∈ L ∪ I), and bnd(?x) are filter conditions; if R₁ and R₂ are filter conditions then ¬R₁, R₁ ∧ R₂, and R₁ ∨ R₂ are, too. Finally, a query has the form Select_{S,F₁,F₂}(P) where S is a finite subset of V, P is a SPARQL expression, and the dataset specifications F₁ and F₂ are possibly empty finite subsets of I (cf. From and From Named in Sec. 8 of [19]).

Semantics. A mapping µ is a partial function µ : V → T; the domain of µ is denoted by dom(µ). There is an empty mapping µ₀ with dom(µ₀) = ∅. Two mappings µ₁, µ₂ are compatible, written µ₁ ∼ µ₂, if µ₁(?x) = µ₂(?x) holds for all ?x ∈ dom(µ₁) ∩ dom(µ₂). A mapping µ₁ is subsumed by a mapping µ₂, denoted µ₁ ⊑ µ₂, if µ₁ ∼ µ₂ and dom(µ₁) ⊆ dom(µ₂). Mappings can be applied to triple patterns, written as µ(t), and replace all ?x ∈ dom(µ) ∩ vars(t) in t by µ(?x). A mapping µ satisfies the filter condition R, denoted µ ⊨ R, iff one of the following six conditions holds:

(i) R is bnd(?x) and ?x ∈ dom(µ), or
(ii) R is ?x = ?y and bnd(?x) ∧ bnd(?y) ∧ µ(?x) = µ(?y), or
(iii) R is ?x = c and bnd(?x) ∧ µ(?x) = c, or
(iv) R is ¬R₁ and ¬(µ ⊨ R₁), or
(v) R is R₁ ∧ R₂, µ ⊨ R₁, and µ ⊨ R₂, or
(vi) R is R₁ ∨ R₂, and µ ⊨ R₁ or µ ⊨ R₂.⁵

⁴ Blank nodes are not considered because they act as anonymous variables and can be replaced w. l. o. g. by unselected variables in a query.
⁵ The distinction between false and error in case of µ ⊭ R can be safely ignored in our context.

The solution of a SPARQL expression or query is a set of mappings. Let R be a filter condition and S a finite set of variables.
For mapping sets Ω₁, Ω₂, the SPARQL set algebra defines the operations join (⋈), union (∪), minus (\), left join (⟕), selection (σ), and projection (π):

\[
\begin{aligned}
\Omega_1 \bowtie \Omega_2 &:= \{\mu_1 \cup \mu_2 \mid \mu_1 \in \Omega_1,\ \mu_2 \in \Omega_2 : \mu_1 \sim \mu_2\}\\
\Omega_1 \cup \Omega_2 &:= \{\mu \mid \mu \in \Omega_1 \lor \mu \in \Omega_2\}\\
\Omega_1 \setminus \Omega_2 &:= \{\mu_1 \in \Omega_1 \mid \forall \mu_2 \in \Omega_2 : \mu_1 \not\sim \mu_2\}\\
\Omega_1 ⟕ \Omega_2 &:= (\Omega_1 \bowtie \Omega_2) \cup (\Omega_1 \setminus \Omega_2)\\
\sigma_R(\Omega_1) &:= \{\mu \in \Omega_1 \mid \mu \models R\}\\
\pi_S(\Omega_1) &:= \{\mu \mid dom(\mu) \subseteq S \land \exists \mu' : dom(\mu') \cap S = \emptyset \land \mu \cup \mu' \in \Omega_1\}
\end{aligned}
\]

The evaluation semantics is defined with the help of a function [[.]] that translates a query Q into set algebra, written as [[Q]]. For expressions P, the same function is used with two additional arguments that indicate the dataset D and an active graph G for the evaluation, written [[P]]^D_G. Let t be a triple pattern, P₁, P₂ SPARQL expressions, u ∈ I, and R, S as before.

\[
\begin{aligned}
[\![t]\!]^{D}_{G} &:= \{\mu \mid dom(\mu) = vars(t) \land \mu(t) \in G\}\\
[\![P_1 \text{ Filter } R]\!]^{D}_{G} &:= \sigma_R([\![P_1]\!]^{D}_{G})\\
[\![P_1 \text{ Union } P_2]\!]^{D}_{G} &:= [\![P_1]\!]^{D}_{G} \cup [\![P_2]\!]^{D}_{G}\\
[\![P_1 \text{ And } P_2]\!]^{D}_{G} &:= [\![P_1]\!]^{D}_{G} \bowtie [\![P_2]\!]^{D}_{G}\\
[\![P_1 \text{ Opt } P_2]\!]^{D}_{G} &:= [\![P_1]\!]^{D}_{G} ⟕ [\![P_2]\!]^{D}_{G}\\
[\![u \text{ Graph } P_1]\!]^{D}_{G} &:= [\![P_1]\!]^{D}_{gr(u)_D}\\
[\![?x \text{ Graph } P_1]\!]^{D}_{G} &:= \bigcup_{u_i \in names(D)} \left([\![u_i \text{ Graph } P_1]\!]^{D}_{G} \bowtie \{\{?x \mapsto u_i\}\}\right)\\
[\![\text{Select}_{S,F_1,F_2}(P_1)]\!] &:=
\begin{cases}
\pi_S([\![P_1]\!]^{D^*}_{G_0}) & \text{if } F_1 = F_2 = \emptyset\\
\pi_S([\![P_1]\!]^{D'}_{G_0}) & \text{else}
\end{cases}
\end{aligned}
\]

According to [19], queries without dataset specification are evaluated over a default dataset D* available to the query processor. Otherwise the dataset is constructed as

\[
D' := \bigsqcup_{u_i \in F_1} deref(u_i) \;\cup \bigcup_{v_i \in F_2} \{\langle v_i, deref(v_i)\rangle\}.
\]

The function deref maps an IRI to its corresponding graph.

2.3 Linked Data

The term Linked Data [3] refers to several conventions that integrate data publication into the Web’s HTTP stack as well as to the published data itself. We focus on a key aspect, the identification of non-information resources (e. g., a person, cf. [13]) with URLs although their essence is not a transmittable message. A request for such a resource u ∈ I is thus redirected to an information resource f(u) which serves a description (e.
g., RDF graph or HTML page) for u. So the references to other resources in descriptions are traversable and carry over the “follow your nose”-principle from the Web of Documents to the Web of Data.

Graphs. Among others, the SPARQL Graph-operator is useful for restricting mappings to authoritative information (the information provided by the URI owner of a resource, cf. Sec. 2.2.2.1 in [13] and Sec. 5.1 in [3]). Unfortunately, the rather intuitive adaptation of our introductory example, Select_{{?x,?n},∅,∅}((alice, knows, ?x) And (?x Graph (?x, name, ?n))), is inconsistent because non-information resources (friends of Alice here) are not RDF graphs. Therefore,

– we define the dataset

\[
D' := \bigsqcup_{u_i \in F_1} deref(u_i) \;\cup \bigcup_{v_i \in F_2} \{\langle f(v_i), deref(v_i)\rangle\}
\]

to avoid equating non-information resources with named graphs. Consequently, however, graph names in D′ are unpredictable before evaluating a query, so
– we add (u, f, f(u)) to G₀ for each dereferenced resource u to make the relation between u and its description f(u) explicitly available.

Of course, triples t with t.p = f obtained from external sources are not inserted into D′ to prevent tampering. Hence the previous query can be expressed as Select_{{?x,?n},∅,∅}(((alice, knows, ?x) And (?x, f, ?y)) And (?y Graph (?x, name, ?n))) and does not return Alice’s nickname for Bob, Bobby.

Query Answering. We follow the bottom-up approach from [11] to illustrate our approach. First, for all dereferenceable constants c ∈ I in a query we add f(c) to F₁ and F₂ and compute the mapping sets for this dataset. Second, we consider each dereferenceable c occurring in a mapping as a relevant source and insert f(c) into F₁ and F₂. The results over the extended dataset are computed incrementally based on the present mapping sets. We repeat the second step until F₁ and F₂ remain unchanged.

2.4 Algebraic Equivalences

Our approach is based on transformations of SPARQL algebra expressions.
We define difference (−) and intersection (∩) for mapping sets as usual (cf. union above) and introduce two new equivalence rules in addition to those from the synoptical table in [20] shown in Tab. 1. With ‘P₁ ≡ P₂’ we denote the equivalence between the algebra expressions P₁ and P₂. Note that minus is indeed distinct from difference: Consider Ω₁ = {{?x ↦ a, ?y ↦ b}, {?x ↦ a}} and Ω₂ = {{?x ↦ a, ?y ↦ b}}, then Ω₁ \ Ω₂ = ∅ as against Ω₁ − Ω₂ = {{?x ↦ a}}.

Lemma 1 (FDPush). Let Ω₁, Ω₂ be mapping sets and R a filter condition, then σ_R(Ω₁ − Ω₂) ≡ σ_R(Ω₁) − σ_R(Ω₂).

Proof. We fix a mapping µ and show that it is contained in the left-hand side iff it is contained in the right-hand side. “⇒”: Suppose µ ∈ σ_R(Ω₁ − Ω₂). Then µ ⊨ R and, because selection does not add mappings, µ ∈ Ω₁ − Ω₂, i. e., µ ∈ Ω₁ and µ ∉ Ω₂. It follows immediately that µ ∈ σ_R(Ω₁) but µ ∉ σ_R(Ω₂). “⇐”: Suppose µ ∈ σ_R(Ω₁) − σ_R(Ω₂). Then it holds that µ ∈ Ω₁ and µ ⊨ R. We distinguish two cases. Case (1): We assume µ ∉ Ω₂ and are done. Case (2): We assume µ ∈ Ω₂. Then µ ∈ σ_R(Ω₂) because µ ⊨ R and so µ ∉ σ_R(Ω₁) − σ_R(Ω₂). This contradicts the first presumption. □

Table 1. Algebraic Equivalences, where A, B, C denote mapping sets.

  (A ∪ B) ∪ C ≡ A ∪ (B ∪ C)          (UAss)
  (A ⋈ B) ⋈ C ≡ A ⋈ (B ⋈ C)          (JAss)
  A ∪ B ≡ B ∪ A                      (UComm)
  A ⋈ B ≡ B ⋈ A                      (JComm)
  (A ∪ B) ⋈ C ≡ (A ⋈ C) ∪ (B ⋈ C)    (JUDistR)
  A ⋈ (B ∪ C) ≡ (A ⋈ B) ∪ (A ⋈ C)    (JUDistL)
  (A ∪ B) \ C ≡ (A \ C) ∪ (B \ C)    (MUDistR)
  (A \ B) \ C ≡ A \ (B ∪ C)          (MMUCorr)

Lemma 2 (MDReord). Let Ω₁, Ω₂, Ω₃ be mapping sets, then (Ω₁ − Ω₂) \ Ω₃ ≡ (Ω₁ \ Ω₃) − Ω₂.

Proof. We proceed as above. “⇒”: Suppose µ ∈ (Ω₁ − Ω₂) \ Ω₃. Then µ ≁ µ′ holds for all µ′ ∈ Ω₃, and µ ∉ Ω₂ but µ ∈ Ω₁. It follows that µ ∈ Ω₁ \ Ω₃, and thus also µ ∈ (Ω₁ \ Ω₃) − Ω₂. “⇐”: Suppose µ ∈ (Ω₁ \ Ω₃) − Ω₂. Then µ ∈ Ω₁, µ ∉ Ω₂, and µ ≁ µ′ holds for all µ′ ∈ Ω₃. So µ ∈ Ω₁ − Ω₂ and finally µ ∈ (Ω₁ − Ω₂) \ Ω₃. □
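The distinction between minus and difference, and the two lemmas, can be checked mechanically on small inputs. The following is a minimal sketch, not part of the original work: mappings are modeled as frozensets of (variable, term) pairs, all function names (`mapping`, `compatible`, `minus`, `select`, `lemmas_hold`) are hypothetical, and the exhaustive check over subsets of a four-element universe is an illustration, not a proof.

```python
from itertools import combinations

def mapping(**bindings):
    """Build a mapping, e.g. mapping(x="a") for {?x -> a}."""
    return frozenset(bindings.items())

def compatible(m1, m2):
    """m1 ~ m2: both agree on every shared variable."""
    d1, d2 = dict(m1), dict(m2)
    return all(d1[v] == d2[v] for v in d1.keys() & d2.keys())

def minus(o1, o2):
    """SPARQL minus: keep mappings of o1 compatible with no mapping of o2."""
    return {a for a in o1 if not any(compatible(a, b) for b in o2)}

def select(pred, omega):
    """Selection with the filter condition given as a Python predicate."""
    return {mu for mu in omega if pred(mu)}

def subsets(items):
    return [set(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def lemmas_hold(universe, pred):
    """Check FDPush and MDReord for all mapping sets drawn from `universe`."""
    subs = subsets(universe)
    fdpush = all(select(pred, a - b) == select(pred, a) - select(pred, b)
                 for a in subs for b in subs)
    mdreord = all(minus(a - b, c) == minus(a, c) - b
                  for a in subs for b in subs for c in subs)
    return fdpush, mdreord

# The example from the text: minus yields the empty set, difference does not.
omega1 = {mapping(x="a", y="b"), mapping(x="a")}
omega2 = {mapping(x="a", y="b")}
print(minus(omega1, omega2), omega1 - omega2)

universe = [mapping(x="a"), mapping(x="a", y="b"), mapping(y="b"), mapping(x="c")]
print(lemmas_hold(universe, lambda mu: "y" in dict(mu)))  # filter: bnd(?y)
```

Python's built-in set difference `-` plays the role of the difference operator here, which keeps the contrast with `minus` visible in the code.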
3 Incremental SPARQL Evaluation

We want to evaluate SPARQL over an increasing dataset while we are interested in the result of a query after each addition to the data, as outlined in the introduction. This can certainly be achieved by a complete evaluation over the whole data. However, we think that an approach that takes previously computed results into account might perform better. Therefore we provide an incremental adaptation for each algebra operation to compute the result based on the changes of the operands between the dataset D and the increased dataset D + ∆D. We also consider the mechanism to select the active graph (Graph) and the evaluation of triple patterns.

Definition 1 (Insertions and Deletions). For a SPARQL expression P, an RDF dataset D, and a named graph ∆D, let A = [[P]]^D_G and A′ = [[P]]^{D+∆D}_G. We define insertions ∆⁺_A := A′ − A and deletions ∆⁻_A := A − A′.

It follows that (i) ∆⁺_A ∩ ∆⁻_A = ∅, (ii) ∆⁺_A ∩ A = ∅, (iii) ∆⁻_A ∩ A′ = ∅, (iv) ∆⁻_A ⊆ A, and (v) A′ = (A − ∆⁻_A) ∪ ∆⁺_A. This must hold in the following transformations.

3.1 Algebra operations

For the transformations of union, join, and minus assume that A = [[P₁]]^D_G, B = [[P₂]]^D_G, A′ = [[P₁]]^{D+∆D}_G, and B′ = [[P₂]]^{D+∆D}_G have already been computed.

Projection. Given C_π = [[Select_{S,F₁,F₂}(P)]], A = [[P]]^D_{G₀}, A′ = [[P]]^{D+∆D}_{G₀}, and ∆D = ⟨f(u), G⟩, we are interested in C′_π = [[Select_{S,F₁∪{u},F₂∪{u}}(P)]]. ∆⁻_π, ∆⁺_π can be used when projections are pushed down in query optimizations.

\[
\begin{aligned}
[\![\text{Select}_{S,F_1\cup\{u\},F_2\cup\{u\}}(P)]\!]
&= \pi_S([\![P]\!]^{D+\Delta_D}_{G_0}) = \pi_S(A')\\
&= \{\mu \mid dom(\mu) \subseteq S \land \exists \mu' : dom(\mu') \cap S = \emptyset \land \mu \cup \mu' \in A'\}\\
&= \{\mu \mid dom(\mu) \subseteq S \land \exists \mu' : dom(\mu') \cap S = \emptyset \land \mu \cup \mu' \in (A - \Delta^-_A)\}\\
&\quad \cup \{\mu \mid dom(\mu) \subseteq S \land \exists \mu' : dom(\mu') \cap S = \emptyset \land \mu \cup \mu' \in \Delta^+_A\}\\
&= (\pi_S(A) - \{\mu \in \pi_S(\Delta^-_A) \mid \neg\exists \mu' \in A' : \mu \sqsubseteq \mu'\}) \cup \pi_S(\Delta^+_A)\\[4pt]
\Delta^-_\pi &= \{\mu \in \pi_S(\Delta^-_A) \mid \neg\exists \mu' \in A' : \mu \sqsubseteq \mu'\}\\
\Delta^+_\pi &= \{\mu \in \pi_S(\Delta^+_A) \mid \mu \notin \pi_S(A)\}
\end{aligned}
\]

Selection.
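Given C_σ = [[P Filter R]]^D_G, we are interested in C′_σ = [[P Filter R]]^{D+∆D}_G. Assume A = [[P]]^D_G and A′ = [[P]]^{D+∆D}_G have already been computed.

\[
\begin{aligned}
[\![P \text{ Filter } R]\!]^{D+\Delta_D}_{G}
&= \sigma_R((A - \Delta^-_A) \cup \Delta^+_A)\\
&= \sigma_R(A - \Delta^-_A) \cup \sigma_R(\Delta^+_A) && \text{(FUPush)}\\
&= (\sigma_R(A) - \sigma_R(\Delta^-_A)) \cup \sigma_R(\Delta^+_A) && \text{(FDPush)}\\[4pt]
\Delta^-_\sigma &= \sigma_R(\Delta^-_A)\\
\Delta^+_\sigma &= \sigma_R(\Delta^+_A)
\end{aligned}
\]

Union. Given C_∪ = [[P₁ Union P₂]]^D_G, find C′_∪ = [[P₁ Union P₂]]^{D+∆D}_G.

\[
\begin{aligned}
[\![P_1 \text{ Union } P_2]\!]^{D+\Delta_D}_{G}
&= A' \cup B'\\
&= ((A - \Delta^-_A) \cup \Delta^+_A) \cup ((B - \Delta^-_B) \cup \Delta^+_B)\\
&= ((A - \Delta^-_A) \cup (B - \Delta^-_B)) \cup (\Delta^+_A \cup \Delta^+_B) && \text{(UAss, UComm)}\\
&= \{\mu \in A \cup B \mid (\mu \in A \land \mu \notin \Delta^-_A) \lor (\mu \in B \land \mu \notin \Delta^-_B)\} \cup (\Delta^+_A \cup \Delta^+_B)\\[4pt]
\Delta^-_\cup &= \{\mu \in A \cup B \mid (\mu \notin A \cup \Delta^+_A \lor \mu \in \Delta^-_A) \land (\mu \notin B \cup \Delta^+_B \lor \mu \in \Delta^-_B)\}\\
\Delta^+_\cup &= \{\mu \in \Delta^+_A \cup \Delta^+_B \mid \mu \notin A \cup B\}
\end{aligned}
\]

Join. Given C_⋈ = [[P₁ And P₂]]^D_G, find C′_⋈ = [[P₁ And P₂]]^{D+∆D}_G.

\[
\begin{aligned}
[\![P_1 \text{ And } P_2]\!]^{D+\Delta_D}_{G}
&= A' \bowtie B'\\
&= ((A - \Delta^-_A) \cup \Delta^+_A) \bowtie ((B - \Delta^-_B) \cup \Delta^+_B)\\
&= ((A - \Delta^-_A) \bowtie (B - \Delta^-_B)) \cup ((A - \Delta^-_A) \bowtie \Delta^+_B) \cup (\Delta^+_A \bowtie B') && \text{(JUDistR, JUDistL)}\\
&= \{\mu \in A \bowtie B \mid \exists \mu_1 \in A, \mu_2 \in B : \mu = \mu_1 \cup \mu_2 \land \mu_1 \notin \Delta^-_A \land \mu_2 \notin \Delta^-_B\}\\
&\quad \cup ((A - \Delta^-_A) \bowtie \Delta^+_B) \cup (\Delta^+_A \bowtie B')\\[4pt]
\Delta^-_{\bowtie} &= \{\mu \in A \bowtie B \mid \forall \mu_1 \in A \cup \Delta^+_A, \forall \mu_2 \in B \cup \Delta^+_B : (\mu_1 \cup \mu_2 = \mu) \rightarrow (\mu_1 \in \Delta^-_A \lor \mu_2 \in \Delta^-_B)\}\\
\Delta^+_{\bowtie} &= \{\mu \in ((A - \Delta^-_A) \bowtie \Delta^+_B) \cup (\Delta^+_A \bowtie B') \mid \mu \notin A \bowtie B\}
\end{aligned}
\]

Minus. Given C_∖ = A \ B, we are interested in C′_∖ = A′ \ B′.

\[
\begin{aligned}
C'_{\setminus}
&= ((A - \Delta^-_A) \cup \Delta^+_A) \setminus B'\\
&= ((A - \Delta^-_A) \setminus ((B - \Delta^-_B) \cup \Delta^+_B)) \cup (\Delta^+_A \setminus B') && \text{(MUDistR)}\\
&= ((A - \Delta^-_A) \setminus ((B \cup \Delta^+_B) - \Delta^-_B)) \cup (\Delta^+_A \setminus B') && (\Delta^-_B \cap \Delta^+_B = \emptyset)\\
&= ((A - \Delta^-_A) \setminus (B \cup \Delta^+_B))\\
&\quad \cup \{\mu \in A - \Delta^-_A \mid \forall \mu' \in (B \cup \Delta^+_B) : \mu \sim \mu' \rightarrow \mu' \in \Delta^-_B\} \cup (\Delta^+_A \setminus B')\\
&= (((A \setminus B) - \Delta^-_A) \setminus \Delta^+_B) && \text{(MMUCorr, MDReord)}\\
&\quad \cup \{\mu \in A - \Delta^-_A \mid \forall \mu' \in (B \cup \Delta^+_B) : \mu \sim \mu' \rightarrow \mu' \in \Delta^-_B\} \cup (\Delta^+_A \setminus B')\\[4pt]
\Delta^-_{\setminus} &= \{\mu \in A \setminus B \mid \mu \in \Delta^-_A \lor \exists \mu' \in \Delta^+_B : \mu \sim \mu'\}\\
\Delta^+_{\setminus} &= \{\mu \in A - \Delta^-_A \mid \forall \mu' \in (B \cup \Delta^+_B) : \mu \sim \mu' \rightarrow \mu' \in \Delta^-_B\} \cup (\Delta^+_A \setminus B')
\end{aligned}
\]

Left Join.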
C′_⟕ = [[P₁ Opt P₂]]^{D+∆D}_G = A′ ⟕ B′ can be expressed by join and minus, thus C′_⟕ = C′_⋈ ∪ C′_∖, ∆⁻_⟕ = ∆⁻_⋈ ∪ ∆⁻_∖, and ∆⁺_⟕ = ∆⁺_⋈ ∪ ∆⁺_∖.

3.2 Active Graph Selection and Triple Patterns

The selection of the active graph propagates to the subexpressions and finally takes effect in the evaluation of triple patterns. The active graph can also be changed inside the scope of a Graph-operator, yet it is not possible to reactivate the default graph. For example, [[u₁ Graph (P₁ And (u₂ Graph P₂))]]^D_G is equivalent to [[P₁]]^D_{gr(u₁)_D} ⋈ [[P₂]]^D_{gr(u₂)_D}, uᵢ ∈ I.

Default Graph. Let t be a triple pattern and ∆D = ⟨u, G⟩. Given C_DG = [[t]]^D_{dg(D)}, we are interested in C′_DG = [[t]]^{D+∆D}_{dg(D+∆D)}.

\[
\begin{aligned}
[\![t]\!]^{D+\Delta_D}_{dg(D+\Delta_D)}
&= [\![t]\!]^{D+\langle u,G\rangle}_{dg(D+\langle u,G\rangle)} = [\![t]\!]^{D+\langle u,G\rangle}_{dg(D) \sqcup G}\\
&= \{\mu \mid dom(\mu) = vars(t) \land \mu(t) \in dg(D) \sqcup G\}\\
&= \{\mu \mid dom(\mu) = vars(t) \land \mu(t) \in dg(D)\} \cup \{\mu \mid dom(\mu) = vars(t) \land \mu(t) \in G\}\\
&= [\![t]\!]^{D}_{dg(D)} \cup [\![t]\!]^{\langle u,G\rangle}_{G}\\[4pt]
\Delta^+_{DG} &= \{\mu \in [\![t]\!]^{\langle u,G\rangle}_{G} \mid \mu \notin [\![t]\!]^{D}_{dg(D)}\}
\end{aligned}
\]

Fixed Graph. The expression u Graph P with fixed graph name u is evaluated as [[P]]^D_{gr(u)_D}. Let t be a triple pattern and C_FG = [[t]]^D_{gr(u)_D}. We are interested in C′_FG = [[t]]^{D+∆D}_{gr(u)_{D+∆D}} with ∆D = ⟨u′, G⟩. The expression is rewritten as above.

\[
\begin{aligned}
[\![t]\!]^{D+\Delta_D}_{gr(u)_{D+\Delta_D}} &=
\begin{cases}
[\![t]\!]^{D}_{gr(u)_D} & \text{if } u \in names(D)\\
[\![t]\!]^{\{\langle u',G\rangle\}}_{gr(u)_{\Delta_D}} & \text{if } u = u'
\end{cases}\\[4pt]
\Delta^+_{FG} &=
\begin{cases}
[\![t]\!]^{\{\langle u',G\rangle\}}_{gr(u)_{\Delta_D}} & \text{if } u = u'\\
\emptyset & \text{else}
\end{cases}
\end{aligned}
\]

Variable Graph. With a variable graph name, ?x Graph P is evaluated as ⋃_{uᵢ ∈ names(D)} ([[uᵢ Graph P]]^D_G ⋈ {{?x ↦ uᵢ}}). Unlike before, it cannot be completely pushed down to the subexpressions. Let ∆D = ⟨u, G⟩ as above. Given C_VG = ([[P]]^D_{gr(u₁)_D} ⋈ {{?x ↦ u₁}}) ∪ … ∪ ([[P]]^D_{gr(uₙ)_D} ⋈ {{?x ↦ uₙ}}) and Aᵢ = [[P]]^D_{gr(uᵢ)_D}, A′ᵢ = [[P]]^{D+∆D}_{gr(uᵢ)_D} for uᵢ ∈ names(D), we are interested in C′_VG = ⋃_{uᵢ ∈ names(D+∆D)} ([[uᵢ Graph P]]^{D+∆D}_G ⋈ {{?x ↦ uᵢ}}).
\[
\begin{aligned}
&\bigcup_{u_i \in names(D+\Delta_D)} \left([\![P]\!]^{D+\Delta_D}_{gr(u_i)_{D+\Delta_D}} \bowtie \{\{?x \mapsto u_i\}\}\right)\\
&\quad = \bigcup_{u_i \in names(D)} \underbrace{\left(A'_i \bowtie \{\{?x \mapsto u_i\}\}\right)}_{=B'_i} \cup \left([\![P]\!]^{D+\Delta_D}_{G} \bowtie \{\{?x \mapsto u\}\}\right)\\[4pt]
\Delta^-_{VG} &= \bigcup_i \Delta^-_{B_i}\\
\Delta^+_{VG} &= \bigcup_i \Delta^+_{B_i} \cup \left([\![P]\!]^{D+\Delta_D}_{G} \bowtie \{\{?x \mapsto u\}\}\right)
\end{aligned}
\]

∆⁻_{Bᵢ} and ∆⁺_{Bᵢ} are composed as in Sec. 3.1 and thus disjoint. It holds that ∆⁻_VG ∩ ∆⁺_VG = ∅ because µ₁(?x) ≠ µ₂(?x) for µ₁ ∈ ∆⁺_{Bᵢ}, µ₂ ∈ ∆⁻_{Bⱼ} where i ≠ j.

4 Comparing Incremental and Direct Computation

We evaluate our approach by comparing the incremental computation to the direct computation. We exemplify these considerations for projection and leave out the other operations due to space limitations. Union behaves similarly, join and minus are slightly more complicated, and selection is easier.

Definition 2 (Projection of mappings). Let µ be a mapping and S a finite set of variables. The mapping µ[S] is the projection of µ onto S where (i) dom(µ[S]) := dom(µ) ∩ S and (ii) µ[S](?x) := µ(?x).

Costs of Projection. Let sets be given with the operations insert, delete, and fsub, the latter checking for a subsuming mapping. We do not assume a specific order for the sets. The costs of evaluating an operation op are denoted by ‖op‖. The direct computation of the projection simply performs ‘for µ ∈ A′ do µ′ ← µ[S]; C′_π insert µ′ end’, so its costs can be estimated at |A′| · (‖µ[S]‖ + ‖C′_π insert µ′‖). A feasible procedure for the incremental case is shown in Alg. 1. Its approximate costs are given in the following sum, where α is the number of successful subsumption checks and β the number of changes to C_π.

\[
\begin{aligned}
&|\Delta^-_A| \cdot \left(\|\mu[S]\| + \|A' \text{ fsub } \mu'\|\right) + (|\Delta^-_A| - \alpha)\left(\|C'_\pi \text{ delete } \mu'\| + \|\Delta^-_\pi \text{ insert } \mu'\|\right)\\
&\quad + |\Delta^+_A| \cdot \left(\|\mu[S]\| + \|C'_\pi \text{ insert } \mu'\|\right) + \beta\, \|\Delta^+_\pi \text{ insert } \mu'\|
\end{aligned}
\]

Overestimating the insertions into ∆⁺_π, we can conclude that the incremental adaptation has lower costs if ∆⁻_A = ∅ and |A′| > 2 · |∆⁺_A|. Otherwise it depends on the size of ∆⁻_A and opens the way for optimizations.
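To make the comparison concrete, the two ways of computing the projection can be sketched in Python. This is an illustrative sketch under stated assumptions, not the author's implementation: mappings are frozensets of (variable, term) pairs, `fsub` is a linear scan over an unordered set, and all function names and the sample data are hypothetical. The agreement with the direct result is exercised here only on mappings that all bind the same variables.

```python
def mapping(**bindings):
    """Build a mapping, e.g. mapping(x="bob") for {?x -> bob}."""
    return frozenset(bindings.items())

def project(mu, S):
    """mu[S]: restrict mu to the variables in S (Definition 2)."""
    return frozenset((v, t) for (v, t) in mu if v in S)

def fsub(omega, mu):
    """True if some mapping in omega subsumes mu (linear scan, no index)."""
    return any(mu <= other for other in omega)

def direct_projection(a_new, S):
    """Direct computation: project every mapping of A'."""
    return {project(mu, S) for mu in a_new}

def incremental_projection(c_pi, a_new, delta_minus, delta_plus, S):
    """Incremental computation: touch only deleted and inserted mappings."""
    c_new = set(c_pi)                      # copy that is updated to C'_pi
    d_minus, d_plus = set(), set()
    for mu in delta_minus:
        p = project(mu, S)
        if not fsub(a_new, p):             # no remaining justification in A'
            c_new.discard(p)
            d_minus.add(p)
    for mu in delta_plus:
        p = project(mu, S)
        if p not in c_new:                 # the insert changes C_pi
            c_new.add(p)
            d_plus.add(p)
    return c_new, d_minus, d_plus

S = {"x"}
A = {mapping(x="bob", n="Bobby"), mapping(x="charlie", n="Charlie")}
delta_minus = {mapping(x="charlie", n="Charlie")}
delta_plus = {mapping(x="bob", n="Bob"), mapping(x="dave", n="Dave")}
A_new = (A - delta_minus) | delta_plus

C_pi = direct_projection(A, S)
C_new, d_minus, d_plus = incremental_projection(C_pi, A_new, delta_minus, delta_plus, S)
print(C_new == direct_projection(A_new, S), d_minus, d_plus)
```

In this run the incremental variant projects three mappings (one deletion, two insertions) and performs one subsumption scan, while the direct variant projects all three mappings of `A_new`; the balance shifts in favor of the incremental variant as `A_new` grows relative to the changes.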
Algorithm 1: Compute C′_π = π_S(A′) incrementally
  Data: Mapping sets C_π = π_S(A), A′, ∆⁻_A, ∆⁺_A, variable set S
  Result: C′_π = π_S(A′)
  begin
    ∆⁻_π ← ∅; ∆⁺_π ← ∅
    for µ in ∆⁻_A do
      µ′ ← µ[S]
      if not A′ fsub µ′ then
        C_π delete µ′; ∆⁻_π insert µ′
    for µ in ∆⁺_A do
      µ′ ← µ[S]
      if C_π insert µ′ changes C_π then
        ∆⁺_π insert µ′

Towards an optimizer. A useful cost estimation depends on the data and especially on the chosen implementation. If the direct evaluation is presumably cheaper, an optimizer may choose to compute C′_π only from A′. However, to support the incremental computation for operations that use C′_π as input, the costs to compute ∆⁻_π := C_π − C′_π and ∆⁺_π := C′_π − C_π must be considered, too.

A different improvement can be achieved by using mapping multi-sets (cf. [20]) that combine a mapping µ with a multiplicity m(µ). In the computation of ∆⁻_π, it must be checked for each mapping µ ∈ ∆⁻_A whether there are still justifications for µ[S] in A′. By contrast, the multiplicities in a mapping multi-set indicate the number of justifications. By defining −A as A with each multiplicity multiplied by −1, we can compute C′_π = {µ ∈ π_S(A) ∪ (−π_S(∆⁻_A) ∪ π_S(∆⁺_A)) | m(µ) > 0} (assuming that the operations cover multiplicities). ∆⁻_π and ∆⁺_π can be assigned during the computation of the leftmost union for deletions (µ with m(µ) < 0) and insertions (µ with m(µ) > 0).

5 Conclusion and Future Work

We have presented a novel analysis of incremental SPARQL evaluation. Our results show means to design an optimizer that is able to choose the presumably best computation in the described Linked Data query answering scenario where the immediate processing of new data is desired. We have also proposed an integration of Linked Data resources and the Graph-operator based on a specific construction of the SPARQL dataset. We think that our findings provide a sound formal basis for further research in this area.

Future Work.
Next, we want to transfer the presented analysis to the SPARQL bag semantics, implement our approach with the described possibilities for optimization, and generalize ∆D so that it can contain more than one named graph. Our approach may also be useful in combination with local caches. As Linked Data is built on top of HTTP, the cache-control mechanism⁶ could be used to detect and update outdated data on the fly in order to integrate the latest information during the query processing.

⁶ RFC #2616 Draft Standard at http://tools.ietf.org/html/rfc2616

References

1. Angles, R., Gutierrez, C.: The Expressive Power of SPARQL. In: Proceedings of the 7th Int’l Semantic Web Conference (ISWC). Karlsruhe, Germany (2008)
2. Berners-Lee, T.: Long Live the Web: A Call for Continued Open Standards and Neutrality. Scientific American Magazine 12 (2010)
3. Bizer, C., Cyganiak, R., Heath, T.: How to Publish Linked Data on the Web. Tech. Rep., Freie Universität Berlin, The Open University (2007), http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/, Accessed Aug 15, 2011
4. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data – The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS) 5(3) (2009)
5. Booth, D.: Four Uses of a URL: Name, Concept, Web Location and Document Instance (Jan 2003), http://www.w3.org/2002/11/dbooth-names/dbooth-names_clean.htm, Accessed Aug 15, 2011
6. Buil-Aranda, C., Arenas, M., Corcho, O.: Semantics and Optimization of the SPARQL 1.1 Federation Extension. In: Proceedings of the 8th Extended Semantic Web Conference (ESWC). Heraklion, Greece (2011)
7. Carroll, J.J., Bizer, C., Hayes, P., Stickler, P.: Named Graphs. Web Semantics: Science, Services and Agents on the World Wide Web 3(4) (2005)
8. Hannemann, J., Kett, J.: Linked Data for Libraries. In: World Library and Information Congress: 76th IFLA Gen. Conf. and Assy. Gothenburg, Sweden (2010)
9. Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K.U., Umbrich, J.: Data Summaries for On-Demand Queries over Linked Data. In: Proceedings of the 19th Int’l Conference on World Wide Web (WWW). Raleigh, NC, USA (2010)
10. Hartig, O.: Zero-Knowledge Query Planning for an Iterator Implementation of Link Traversal Based Query Execution. In: Proceedings of the 8th Extended Semantic Web Conference (ESWC). Heraklion, Greece (2011)
11. Hartig, O., Bizer, C., Freytag, J.C.: Executing SPARQL Queries over the Web of Linked Data. In: Proc. of the 8th Int’l Sem. Web Conf. Chantilly, VA, USA (2009)
12. Hayes, P., McBride, B.: RDF Semantics. W3C Recommendation (Feb 2004), http://www.w3.org/TR/2004/REC-rdf-mt-20040210/, Accessed Aug 15, 2011
13. Jacobs, I., Walsh, N.: Architecture of the World Wide Web. W3C Rec. (Dec 2004), http://www.w3.org/TR/2004/REC-webarch-20041215/, Accessed Aug 15, 2011
14. Klyne, G., Carroll, J.J., McBride, B.: Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation (Feb 2004), http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/, Accessed Aug 15, 2011
15. Ladwig, G., Tran, T.: Linked Data Query Processing Strategies. In: Proceedings of the 9th Int’l Semantic Web Conference (ISWC). Shanghai, China (2010)
16. Ladwig, G., Tran, T.: SIHJoin: Querying Remote and Local Linked Data. In: Proc. of the 8th Extended Semantic Web Conference (ESWC). Heraklion, Greece (2011)
17. Nandzik, J., Heß, A., Hannemann, J., Flores-Herr, N., Bossert, K.: Contentus – Towards Semantic Multi-Media Libraries. In: World Library and Information Congress: 76th IFLA General Conf. and Assembly. Gothenburg, Sweden (2010)
18. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and Complexity of SPARQL. ACM Transactions on Database Systems (TODS) 34(3) (2009)
19. Prud’hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF. W3C Recommendation (Jan 2008), http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/, Accessed Aug 15, 2011
20. Schmidt, M., Meier, M., Lausen, G.: Foundations of SPARQL Query Optimization. In: Proceedings of the 13th International Conference on Database Theory (ICDT). Lausanne, Switzerland (2010)