Towards Equivalences for Federated SPARQL Queries Carlos Buil-Aranda1? and Axel Polleres2?? 1 Department of Computer Science, Pontificia Universidad Católica, Chile cbuil@ing.puc.cl 2 Vienna University of Economy and Business (WU) {axel.polleres}@wu.ac.at Abstract The most common way for exposing RDF data on the Web is by means of SPARQL endpoints. These endpoints are Web services that implement the SPARQL protocol and then allow end users and applications to query just the RDF data they want. However the servers hosting the SPARQL endpoints re- strict the access to the data by limiting the amount of results returned by user queries or the amount of queries per time and client that can be issued. For ad- dressing this problem we analysed different alternatives that shall allow to obtain complete query result sets from the SPARQL endpoints by rewriting the origi- nal query using SPARQL1.1’s federated query extension. We show that some of the commonly used SPARQL query patterns for this task provide unsound re- sults while other patterns are more suitable for this task. We provide equivalent query patterns that help users in obtaining complete result sets circumventing the limitations imposed by servers. 1 Introduction The Linked Open Data initiative promotes the publication and linkage of RDF data on the Web. Under this initiative many organisations (either public or private) exposed their data using the RDF data model and also provided links to other RDF datasets. Currently there are about 450 RDF datasets [2] available on the Web exposing billions of RDF statements. The most common way for accessing these RDF data is by means of SPARQL endpoints. These endpoints are Web services that implement the SPARQL protocol and then allow end users and applications to query just the RDF data they want. A user can send an SPARQL query to this Web service by accessing the endpoint’s Web interface and then obtain back the results to the SPARQL query. However the servers hosting the SPARQL endpoints restrict the access to the data by limiting the server resources used in each received query. This physical resources limitation most com- monly results in a limitation of the size of result set returned to the end users (normally a 10.000 result limit for each query) but it can also generate errors in the same query execution (normally time outs). These imposed limitations are necessary for avoiding the server to use too many resources to answer complex SPARQL queries. However, this result set size limitation prevents users from obtaining complete an- swers to their SPARQL queries. The result set size limitation is particularly relevant ? Supported by the Millennium Nucleus Center for Semantic Web Research under Grant NC120004 and CONICYT/FONDECYT project 3130617. ?? Supported by the Vienna Science and Technology Fund (WWTF) project ICT12-015. when a user wants to federate SPARQL queries to a number of SPARQL endpoints. Imagine a query to DBpedia asking for authors born in Madrid for next querying the Spanish National Library endpoint for more data about these authors. The first query to DBpedia would return more than the DBpedia’s result set size limit resulting in an incomplete result set, failing next to join with the other endpoint query results. Using different combinations of SPARQL operators it is possible to overcome the server’s result size limitation and obtain complete result sets. The most common oper- ator that can be used for that purpose is the VALUES operator from the new SPARQL 1.1 Recommendation [3]. This operator performs a join using the results from a query to the default local graph with the remote query results. In this way the local results are submitted along the remote query and joined remotely. However this operator is still not widely deployed [2] and other options have to be considered. This paper presents a study of several combinations of SPARQL operators and query patterns that avoid the server’s result set limitation, allowing users to obtain sound and complete answers to their queries. We show that using the UNION operator for constraining remote queries and next doing the union of their results may return unsound results. We also present a subset of UNION SPARQL queries that return sound and complete results. As sum- mary, using the SPARQL query pattern combinations we present in this paper it is possible to avoid the remote server result size limitation obtaining sound and complete results for our SPARQL queries. Still, an evaluation should be conducted for knowing what characteristics of each server may affect these equivalent SPARQL query patterns. 2 Syntax and Semantics We first describe the basics of the SPARQL syntax and semantics of SPARQL we use within this paper. For the complete syntax and semantics of SPARQL we refer the reader to the Appendix available at http://www.polleres.net/publications/buil-poll- 2014AMW.pdf. Syntax. The official syntax of SPARQL1.1 [3] considers operators OPTIONAL, UNION, FILTER, SELECT and concatenation via a point symbol (.), to construct graph pattern expressions. The syntax of the language also considers { } to group patterns, as well as keywords (new in SPARQL 1.1) SERVICE to delegate parts of a query to remote endpoints, and VALUES to define sets of bindings of variable bindings. We use the ap- proach proposed in [5] for defining the SPARQL 1.0 syntax operators and the syntax proposed in [1] for the SPARQL 1.1 VALUES and SERVICE operators. More precisely we use the following syntax for the VALUES operator: – VALUES W A is a graph pattern where W = [?X1 , . . . , ?Xn ] is a sequence of pairwise distinct variables, and 2 3 a1,1 , . . . , a1,n 6 a2,1 , . . . , a2,n 7 6 7 A=6 .. 7 4 . 5 am,1 , . . . , am,n is a matrix of values where ai,j 2 (I [ L [ {UNBOUND}). For the exposition of this paper, we leave out more complex SPARQL patterns such as GRAPH patterns as well as SPARQL 1.1 [3] patterns including, aggregates, and property paths. We highlight that solution modifiers are important for paginating results from the SPARQL endpoints and thus obtaining complete results. We mark the analysis of these modifiers as future work. We use the notion FILTER expressions as defined in [5]. We also use unary predicates like bound, isBlank, and the binary equality predicate ’=’, which herein we consider as synonym to sameTerm(·, ·) from [3]. Let P be a graph pattern; in what follows, we use var (P ) to denote the set of variables occurring in P . In particular, if t is a triple pattern, then var (t) denotes the set of variables occurring in the components of t. Similarly, for a built-in condition R, we use var (R) to denote the set of variables occurring in R. Semantics. We use the SPARQL 1.0 semantics and the terminology defined in [5] and for the SPARQL 1.1 operators we use the definitions in [1]. We use letters B, I, L for denoting the sets of blank nodes, IRIs and RDF literals. As usual, we use dom(µ) for denoting the variables bound within – i.e. the domain of – a SPARQL solution mapping µ. We remind the reader of the most relevant definitions in Fig. 2 ; the evaluation of a FILTER expression R wrt. solution mapping µ, also defined in [5], is also relevant here: - R is bound(?X) and ?X 2 dom(µ); - R is isBlank(?X), ?X 2 dom(µ) and µ(?X) 2 B; - R is ?X = c, ?X 2 dom(µ) and sameT erm(µ(?X), c); - R is ?X =?Y , ?X 2 dom(µ), ?Y 2 dom(µ) and sameT erm(µ(?X), µ(?Y )); - R is (R1 _ R2 ), and µ |= R1 or µ |= R2 ; - R is (R1 ^ R2 ), µ |= R1 and µ |= R2 . Here, B denotes the set of blank nodes, I the set of IRIs’ and L the set of literals. (1) If P is (P1 FILTER R), then JP KG = {µ 2 JP1 KG | µ |= R}. (2) If P = VALUES W A then JP KG = {µj |1  j  m, dom(µj ) = {?Xi 2 W | ai,j 6= UNBOUND}, µj (?Xi ) = ai,j } Figure 1. Definition of JP KG for a graph pattern P using FILTER and VALUES operators. 3 SPARQL endpoint query completeness alternatives In this section we outline potential alternatives for obtaining complete results from queries to a remote SPARQL endpoints using SERVICE patterns. These alternatives are the use of VALUES , as well as combinations of UNION and FILTERs. 3.1 Naive approaches for replacing operators in federated queries The VALUES operator can be used for restricting a remote SPARQL query that is sent using the SERVICE operator as illustrated in the following example: Let P = P1 AND (SERVICE c P2 ) be a SPARQL pattern where P1 = (?X, a, ?Y ) and P2 = (?Y, d, ?Z). If we pre-evaluate the solution bindings for P1 , written JP1 KG , the SERVICE operator could then be equivalently evaluated by sending pattern VALUESP1 P2 = P2 VALUES W A where W = [var (P1 ) \ var (P2 )] and A = JP1 KG to endpoint c, i.e. if Gc is the default graph of service c VALUESP1 JP KG = JP 1KG ./ JP2 KGc However, the problem with this approach is that the VALUES operator is not yet widely deployed in existing endpoints [2] and other operators have to be used in order to simulate the desired behaviour. One alternative is the use of the UNION operator. In the previous SPARQL pattern P a commonly used alternative is to execute first pattern P1 and use its results to create a large UNION query that will be submitted to the SERVICE address. In this way the results from one query are “injected” into the remote SPARQL endpoint in each branch of the UNION . I.e., let P as above, the idea would be to first evaluate P1 and then for JP1 KG = {µ1 , . . . , µn } replace P2 by UNIONP1 P2 = {(µ1 (P2 )) UNION . . . UNION(µn (P2 ))} Here, when writing µ(P ) we mean the pattern obtained from P by replacing all v 2 dom(µ) \ vars(P ) with µ(v). Unfortunately, this method fails in relatively simple queries. Assume that P1 = (?X, c, d), c = I and P2 = ((?Y, ?Z, ?T) UNION (?X, ?Y, b)) FILTER (?X = ?Y). With the local default graph G1 = {(a, c, d)} and the remote service’ default graph G2 = {(a, a, b), (e, c, d)}, we obtain: JP1 KG1 = {[?X ! a]} whereas JP2 KG2 = UNIONP1 {[?X ! a, ?Y ! a]}. However, if we proceed as suggested above, then P2 = ((?Y, ?Z, ?T ) UNION (a, ?Y, b)) FILTER (a =?Y ) which yields an additional solu- tion [?Y ! a, ?Z ! a, ?T ! b] that was not admissible in the original P2 but is also compatible with {[?X ! a]}. If we thus use this method in a federation scenario, we would obtain inconsistent results between the two queries if we use the result set from the first execution into the remote query execution. Another problem are blank nodes. Assume for instance P1 = P2 =(?X, c, d) with G1 = {( : b, c, d)} and G2 = {(a, c, d)}. Here, as replacement would yield UNIONP1 P2 = ( : b, c, d) and since SPARQL engines treat blank nodes in patterns as variables, again a non-admissible solution would arise. As a second alternative to UNIONs, one may consider is using FILTERs to inject the results of P1 into P2 , namely, FILTERP1 _ ^ P2 = {P2 FILTER ( v = µ(v))} µ2JP1 KG v2dom(µ)\vars(P2 ) This version does not run into the same issues as the before-mentioned solution using UNIONs. However, there are still problems here, as shown in the following example: assume now P1 = (?X, b, c), c = I and P2 = ((?Y,d,e) UNION (?X,d,e )). With the local default graph G1 = {(a, b, c)} and the remote service’ default graph G2 = {(a, d, e)}, we obtain: JP1 KG1 = {[?X ! a]} and JP2 KG2 = {[?X ! a], [?Y ! a]}; here, the second solution for JP2 KG2 , i.e. µ2 = [?Y ! a] is compatible with the single solution for JP1 KG1 , i.e., µ1 = [?Y ! a] yielding overall µ = [?X ! a, ?Y ! a]. However, FILTERP1 if we apply the method including FILTERs, then P2 ={ {(?Y,d,e) UNION (?X,d,e )} FILTER( ?X = a)} would not yield µ as a solution. So, while the first alternatives may yield in inconsistent results the version using FILTERs seems to miss some results. In the next section we proceed to show refined versions of both alternatives, that solve these issues. 3.2 Two Equivalence theorems for SPARQL Federated Queries As we have seen, some queries may return unexpected result mappings when substi- tuting a variable for a specific value. This affects us when we use naively the UNION operator for dividing the remote query in several subqueries. Thus, we have to find a restricted class of SPARQL remote queries for which we obtain correct results. It turns out that one class of queries which avoid the above-mentioned problems is the class of Strongly Bound queries [1]: this class imposes the restriction to a SPARQL pattern that a variable ?X 2 P will always be bound to a value, independent of the underlying data (we refer the reader to the Appendix for or a more detailed explanation of the Strongly Bound notion). Indeed, the source of the problem in the queries before were the query results containing the empty result mapping. That empty result mapping combined with other operators generate result sets different to the original one (besides of being un- expected) since the empty result mapping is not null rejecting. Let be P SPARQL a pattern; we use the definition of Strongly Bound variables in a pattern, denoted SB(P ) as per [1] to assure that the variables used in both the SERVICE part of the query and in the part of the query that will restrict the SERVICE pattern do not contain that empty mapping: Lemma 1. Given a SPARQL pattern P with v 2 SB(P ), let µe = [v ! e] for an e 2 I [ L, then Jµe (P )KG ./ µe = {µ 2 JP KG |v 2 dom(µ) ^ µ(v) = e}. Lemma 1 essentially states that replacing strongly bound variables with IRIs or literals in a pattern will not yield additional results for P . In order to address the issue UNIONP1 of blank nodes we adapt the definition of P2 from above as follows: UNION0P P2 = {(µB 1 B 1 (P2 )) UNION . . . UNION(µn (P2 ))} W where µB (P2 ) = {µ(P2 ) FILTER(¬( v2dom(µ)\vars(P2 ) isBlank(µ(v))))}, i.e. the problematic solutions containing blank nodes are filtered out. Indeed, we confirm that UNION replacement with this modification works for re- mote patterns with only strongly bound variables; plus, it turns out that strongly bound remote queries also are evaluated correctly using the FILTER approach: Theorem 1. Let P = P1 AND (SERVICE c P2 ) such that (vars(P2 ) \ vars(P1 )) ✓ SB(P2 ), i.e. all variables that participate in a join are strongly bound in the pattern appearing on the service side, and let Gc be the default graph of service c and let UNION FILTER UNION0P P2 P1 and P2 P1 defined as above , then (i) JP KG = JP 1KG ./ JP2 1 KGc , FILTERP1 and (ii) JP KG = JP 1KG ./ JP2 KGc We note that, in case the local graph G does not contain any blank nodes The- UNIONP1 orem 1(i) would also hold using the original replacement P2 . Moreover, it turns out that we can generalize the result in Theorem 1(ii) to also work in the general case with potentially unbound variables in the service pattern. To this end, we redefine FILTERP1 JP2 KGc as follows: FILTER0P _ ^ P2 1 = {P2 FILTER ( (v = µ(v) _ ¬bound(v)))} µ2JP1 KG v2dom(µ)\vars(P2 ) The trick to only filter for variable bound within P2 fixes the problem from the example above, as stated in the following theorem. Theorem 2. Let P = P1 AND (SERVICE c P2 ) and Gc be the default graph of service c then FILTER0P JP KG = JP 1KG ./ JP2 1 KGc For proofs we refer to the Appendix. 4 Conclusions In this paper we have aimed at illustrating the problem that querying remote SPARQL endpoints with SERVICE patterns is a non-trivial task due to the limitations the server hosting these endpoints impose. The most important restriction is the result set size limitation that prevents users from obtaining complete results to their queries. We have also shown some initial results in terms of defining equivalent SPARQL query pat- terns involving the SERVICE operator that may help remedy this problem in practice. We believe that such equivalences are highly relevant for practical use cases involving queries on Linked data on existing SPARQL endpoints and deserve further investiga- tion: Our next goal is to practically evaluate these and more equivalences for potentially restricted SERVICE queries on real endpoints to arrive at strategies for obtaining com- plete results on federated SPARQL queries over real endpoints. References 1. C. Buil-Aranda, M. Arenas, O. Corcho, A. Polleres. Federating Queries in SPARQL 1.1: Syntax, Semantics and Evaluation. J. Web Semantics, 18(1), 2012. 2. C. Buil-Aranda, A. Hogan, J. Umbrich, P.-Y. Vandenbussche. SPARQL Web-Querying In- frastructure: Ready for Action? In ISWC2013, LNCS 8219:277–293, 2013. 3. S. Harris, A. Seaborne. SPARQL 1.1 Query Language, W3C Rec., January 2012. 4. T. S. Jayram, P. G. Kolaitis, E. Vee. The containment problem for real conjunctive queries with inequalities. In PODS 2006, pages 80–89, 2006. 5. J. Pérez, M. Arenas, C. Gutierrez. Semantics and complexity of SPARQL. TODS, 34(3), 2009.