EXISTStential Aspects of SPARQL Peter F. Patel-Schneider and David Martin Nuance Communications peter.patel-schneider@nuance.com, david.martin@nuance.com The SPARQL 1.1 Query Language [1] permits patterns inside FILTER expres- sions using the EXISTS construct, specified by using substitution. Substitution destroys some of the aspects of SPARQL that make it suitable as a data access language. As well, substitution causes problems in the SPARQL algebra and produces counterintuitive results. Fixing the problems with EXISTS is best done with a completely different definition that does not use substitution at all. EXISTS is used inside FILTER constructs to determine whether a pattern matches or does not match. EXISTS can be used to find people who do not have any names, as in this example from the SPARQL specification [1, §8.1.1]: SELECT ?person WHERE { ?person rdf:type foaf:Person . [1] FILTER NOT EXISTS { ?person foaf:name ?name } } Any SPARQL pattern can be used in an EXISTS, including subqueries, as in a query to find people who know someone who knows more than 100 others: SELECT ?person WHERE { ?person foaf:knows ?friend . FILTER EXISTS { SELECT ?friend WHERE { ?friend foaf:knows ?y . } GROUP BY ?friend HAVING ( COUNT(*) > 100 ) } } Definition The definition of EXISTS in SPARQL starts with a translation to the SPARQL algebra [1, §18.2.2.2], replacing EXISTS{P} with exists(translate( P)). The function exists is described as “exists(pattern) is a function that returns true [iff] the pattern evaluates to a non-empty solution sequence.” [1, §18.5] The for- mal definition of exists uses substitution into its argument Substitute Let µ be a solution mapping. substitute(pattern, µ) = the pattern formed by replacing every occurrence of a variable v in pattern by µ(v) for each v in dom(µ). [1, §18.6] Evaluation of Exists Let µ be the current solution mapping for a filter and P a graph pattern: The value exists(P), given D(G) is true if and only if eval(D(G), substitute( P,µ)) is a non-empty sequence. [1, §18.6] Counterintuitive Results Blank nodes in the graph being queried often produce counterintuitive results with EXISTS. In Query [1] the substitution is performed for every node in the graph that has an rdf:type link to foaf:Person. If one of these nodes is a blank node, say :Bill, the evaluation ends up matching :Bill foaf:name ?name against the graph. As :Bill is a blank node, it acts like a variable and can itself be mapped. As a consequence, no answers will be returned from Query [1] for the graph ex:John rdf:type foaf:Person ; foaf:name ”John” . [G] :Bill rdf:type foaf:Person . 1 This counterintuitive result is particularly pernicious as it can occur any time a FILTER variable is used to query the graph inside an EXISTS. MINUS is also problematic inside EXISTS. One might expect the result of SELECT ?x WHERE { BIND ( :b AS ?x ) FILTER EXISTS { ?x :p :b . MINUS { ?x :p :b } } } to be empty because any solution mapping on the left side of the MINUS is elim- inated by the same mapping from right side. However, substitution replaces ?x with :b so the two sides don’t share a variable and then, because of its definition in SPARQL, the MINUS does not remove any solution mappings. Substitution also interferes with SPARQL constructs that act like variable bindings. One might expect that SELECT ?x WHERE { ?x :p :b . [2] FILTER EXISTS { ?x :p :b . { SELECT ( :d AS ?x ) WHERE { } } } } has no solutions except for ?x mapped to :d because otherwise solutions from ?x :p :b, i.e., with ?x mapped to something other than :d cannot be joined to solutions from SELECT ( :d AS ?x ) WHERE { }. However, the substitution replaces ?x throughout with :d so the “variable” that comes out of the subquery is not ?x but instead is whatever ?x was mapped to. Implementations of SPARQL don’t produce all these counterintuitive results, instead diverging from the specification, but some do produce some of them. Semantic Anomalies The substitution for EXISTS does not distinguish be- tween the different uses of variables in SPARQL constructs. It substitutes in triples, as it needs to, and in expressions, which it also needs to, but it also substitutes in other places. In Query [2] this ends up with solution mappings that map non-variables to values, counter to the semantic definitions underlying SPARQL. This semantic anomaly happens in SELECT clauses, as above, but also in BIND constructs like SELECT ?x WHERE { BIND ( :b AS ?x ) BIND ( :b AS ?y ) FILTER EXISTS { BIND ( :j AS ?x ) BIND ( :k AS ?y ) } } and in VALUES constructs [4, errata-query-10]. In each case mappings are created that do not conform with the definition of solution mappings in SPARQL. Other semantic anomalies can also arise, as in SELECT ?x WHERE { :b :p ?x . FILTER EXISTS { FILTER BOUND(?x) } } where the BOUND function is applied to a non-variable, counter to its definition in SPARQL. Some of these anomalous situations have further problems, as in SELECT ?x WHERE { BIND ( :b AS ?x ) BIND ( :b AS ?y ) FILTER EXISTS { { SELECT ( :d AS ?x ) WHERE { } } { SELECT ( :e AS ?y ) WHERE { } } } } Here both ?x and ?y end up being substituted as :b so the join between the results from the subqueries ends up being empty, which is counterintuitive as well as being semantically anomalous. We have not found a SPARQL implementation that reports these semantic anomalies. Either they silently allow illegal internal constructs or they silently do something different from the specification. 2 Bottom-up Evaluation SPARQL is supposedly designed so that queries can be evaluated “bottom-up [and] subqueries are evaluated logically first, and the results are projected up to the outer query.” [1, §12] However, EXISTS is at odds with this design because EXISTS substitution creates altered subqueries that did not exist before the substitution. Because of this, for some queries the results of an evaluation based on substitution will differ dramatically from those produced by bottom-up evaluation, as exhibited by: SELECT ?x WHERE { ?x :b ?y . FILTER EXISTS { SELECT ?z WHERE { ?z :f ?y . } } } The problem is particularly acute with uses of variables that would not con- tribute to solution mappings, such as the nested use of ?y in the following. In this query, the results prescribed by substitution not only differ from bottom-up results, but they are also strongly counterintuitive, as in the query below where no results will be returned because ?y is replaced, even though this variable does not contribute mappings to the results returned from the inner query. SELECT ?x WHERE { ?x :p ?y . FILTER EXISTS { SELECT ?x WHERE { ?x :p ?y . } GROUP BY ?x HAVING ( COUNT(*) > 1 ) } } There is a suggested erratum that these variables should not be subject to substitution [4, errata-query-8]. Different implementations of SPARQL produce different answers for these queries, as noticed by Hernndez et al [2]. They argue that naı̈ve substitution has many known problems and propose a notion of dif- ferent kinds of variables that governs when substitution is to be applied. Their solution solves problems related to substitution in subqueries but does not solve the counterintuitive results related to MINUS and blank nodes. Fixing the Definition To summarize, EXISTS produces counterintuitive re- sults, generates semantic anomalies, hinders bottom-up evaluation, is not imple- mented as specified, and has divergent implementations. The problem lies firmly with substitute, which naı̈vely replaces variables with values even in places where this substitution produces undefined algebra constructs or changes their mean- ing. Making a more nuanced version of substitute that does not substitute in these problematic spots is not possible without extensive changes to the SPARQL al- gebra, because such approaches would leave variables incorrectly unconstrained as, for example, in Query [1]. Given that a fixed version of substitute is not possible, is another kind of definition of EXISTS a solution? Yes, a definition of EXISTS based on solution mappings is possible. The core of this definition basically moves the FILTER solution mapping to the beginning of the argument to EXISTS. For the solution mapping {(?person, :Bill)} in [1] above, the EXISTS, in effect, evaluates { VALUES ?person { :Bill } ?person foaf:name ?name . } (except that :Bill is a blank node in graph [G], which is not possible in VALUES). The technical details of this fix to EXISTS are as follows: 1. Add a new construct, Initial, to the SPARQL syntax and algebra. Initial will be used to set up the initial multiset of solution mappings inside an EXISTS. 3 It will work much like VALUES except that it will transfer solution mappings through the EXISTS instead of setting up a constant solution mapping. 2. When collecting FILTER elements replace EXISTS{pattern} in the filter ex- pression with exists(Initial(t),translate({Initial(t) pattern’})) where t is a fresh token, and similarly for NOT EXISTS{pattern}. If pattern is a SubSelect then pattern’ is {pattern} otherwise pattern’ is just pattern. 3. Translate Initial(t) as itself. 4. Change the definition of the exists function to: Let µ be the current solution mapping for a filter, t a token, and P a graph pattern: The value exists(Initial(t),P) given D(G) is true iff eval(D(G),P’) is a non-empty multiset of solution bindings, where P’ is P with Initial(t) replaced by {µ}. This definition eliminates all the semantic anomalies resulting from EXISTS. It also fixes the counterintuitive results above. It corresponds better to how EXISTS behaves in SPARQL implementations. As well, under this definition all subqueries in SPARQL queries can then be evaluated completely bottom up and independently of anything outside of the subquery. In particular, subqueries inside EXISTS can be evaluated before the solution mappings going into the FILTER are known. Variants of this definition are possible, for example, by adding Initial(t) to the beginning of GroupGraphPatterns and not just at the top, that modify the behaviour of EXISTS without reintroducing any semantic anomalies. Pre-binding [3] is an operation provided by many SPARQL implementations that creates a prepared query and then evaluates this query with a given initial solution mapping. SPARQL implementations provide inadequate descriptions of pre-binding so it is very hard to determine how it works in detail. Pre-binding can be given a formal definition using the technique proposed here for EXISTS. This version of pre-binding pushes a solution mapping into the query by adding Initial(0) to the WHERE clause as in Point 2 evaluates the resulting query in the same way that EXISTS is evaluated. The definition of EXISTS proposed here eliminates the counterintuitive results of SPARQL, avoids producing any semantic anomalies, and makes SPARQL more suitable as a data access language. It is in all ways better than the current substitution definition. References 1. Steve Harris and Andy Seaborne. SPARQL 1.1 query language. W3C Rec., https://www.w3.org/TR/2013/REC-sparql11-query-20130321/, 21 March 2013. 2. Daniel Hernndez, Claudio Gutierrez, and Renzo Angles. Correlation and substitu- tion in SPARQL. https://scirate.com/arxiv/1606.01441, 7 June 2016. 3. Jena Class QueryExecutionFactory. https://jena.apache.org/documentation/... /QueryExecutionFactory.html, retrieved 21 June 2016. 4. Errata in SPARQL 1.1. https://www.w3.org/2013/sparql-errata, 30 January 2016. 4