Certain Answers for SPARQL?

    Claudio Gutierrez1 , Daniel Hernández1 , Aidan Hogan1 , and Axel Polleres2
                        1
                         Center for Semantic Web Research,
                Department of Computer Science, University of Chile
           2
             Vienna University of Economics and Business, Vienna, Austria


        Abstract. The standard semantics of SPARQL and the standard se-
        mantics of RDF differ fundamentally, sometimes leading to unintuitive
        answers. In this paper, we thus motivate an alternative semantics for
        SPARQL based on certain answers, taking into account the existential
        nature of blank nodes, the open-world assumption of RDF, and perhaps
        even the lack of a unique name assumption. We propose that SPARQL
        is a natural use-case for applying existing techniques that approximate
        certain answers in relational database settings.


1     Introduction

The foundational “simple semantics” of RDF is based on three premises: (i)
an open world assumption (OWA) where anything not present in the data is
assumed to be neither true nor false, (ii) a lack of unique name assumption (no
UNA) where multiple names can refer to the same thing, and (iii) the presence
of existential variables in the form of blank nodes. On the other hand, SPARQL
– the standard query language for RDF – appears not to respect any of these
premises. This may lead, in some cases, to unintuitive solutions.
    Taking an example relating to (iii), the Wikidata knowledge-base uses exis-
tential blank nodes in RDF. For example, in the case of Nicole Brown Simpson, it
is stated that she has a killer, but that the killer is unknown, encoded by the fol-
lowing triple with blank node _:b: (w:NicoleBrownSimpson, w:killedBy, _:b).
Now consider the following SPARQL query over Wikidata (including this triple):
SELECT * WHERE { ?person rdf:type w:Person .
                                                                (1)
            MINUS { w:NicoleBrownSimpson w:killedBy ?person } }
The query will return the IRIs of all persons in Wikidata: the SPARQL engine
will do a comparison to check if _:b = w:X – where w:X denotes each person –
and will return false in every case since blank nodes in data act like constants.
Semantically speaking however, the RDF graph does not provide any evidence
that these people in the results are not the killer.
    Leaving aside blank nodes, let’s turn to (i) and assume w:NicoleBrownSimpson
has no w:killedBy edge. SPARQL will again return all people in the data based
on the lack of existence of the edge in the data (effectively taking an implicit
closed world assumption (CWA)). Conversely, the OWA semantics of RDF asserts
that the relation may still hold even if the relevant triple is not given.
     Likewise, with respect to (ii), the comparison w:x = w:y will always return
false in SPARQL, whereas in the RDF graph, these may refer to the same thing:
SPARQL sees unequal terms (effectively adopting a UNA) while in RDF, these
terms may be referring to the same thing (no UNA).
     There is another option, however: SPARQL could return answers only where
it is certain that the person did not kill Nicole Brown Simpson. In other words,
nobody would be returned for this query since anybody could have killed her.
In the comparison _:b = w:X, we could say unknown instead of false to
acknowledge that _:b could refer to w:X; this is similar to how SQL handles the
case of nulls, using three-valued logic. Likewise, checking the (non-)existence of
some relation not supported by the data could return unknown to align with
RDF’s OWA; and so too could (in)equalities over IRIs, aligning with no-UNA.
     In this paper, we do not develop new results, but rather position SPARQL
and RDF (with blank nodes) as a natural application of recent results in the
database literature relating to certain answers. Our proposed aim is to have a
keyword, such as SELECT CERTAIN [5], which users could opt to use to ensure
that all answers returned by the query are true for all models of the RDF graph,
such as to ensure that no results are returned in the case(s) previously outlined.
     While some authors have looked at certain answers for SPARQL in the context
of OWA [2] and OWL entailments [1], to the best of our knowledge, no work has
yet looked at RDF graphs with blank nodes, nor considered certain answers more
generally in the context of the simple semantics of RDF.


2   Simple Entailment and SPARQL
Given two RDF graphs G and H, we say that G (simple) entails H, denoted
G |= H, if every model of G is a model of H; intuitively, H says nothing new over
G [3]. If G |= H and H |= G, then we say that G and H are (simple) equivalent,
denoted G ≡ H. In the simple semantics of RDF, blank nodes act as existential
variables, while IRIs and literals acts as ground terms. With respect to checking
entailment, let µ denote a mapping from blank nodes to other RDF terms and
let µ(G) denote the image of an RDF graph G under µ. Then G |= H if and only
if there exists a µ such that µ(H) ⊆ G (denoted H → G) [3].
Example 1. Let us consider two RDF graphs, G on the left, H on the right,
where b∗ are blank nodes, and all other terms are IRIs.
                 w x b.                               w x b1 .
                 w y b.                               w y b2 .
                 w y z.                               b3 y b2 .
In this case, G |= H with µ(b1 ) = b, µ(b2 ) = b, µ(b3 ) = w, µ(H) ⊆ G.        t
                                                                               u
    Given a SPARQL query Q and two graphs G and H such that G ≡ H, it is
often the case that (abusing notation) Q(G) 6↔ Q(H). More generally, it does
not hold that G |= H implies Q(H) → Q(G) in SPARQL, even when considering
just basic graph patterns (aka. conjunctive queries) with basic inequalities.
Example 2. Take G and H (G |= H) from Example 1 and the following query Q:

SELECT * WHERE { ?w :x ?b1 ; y ?b2 . FILTER(?b1 != ?b2) }

Under standard SPARQL query evaluation, Q(H) returns two results – (w,b1 ,b2 )
and (w,b2 ,b1 ) – whereas Q(G) is empty. Clearly Q(H) 6→ Q(G).             t
                                                                           u

    SPARQL query evaluation is counter to the underlying RDF semantics: one
can get more answers in SPARQL from an RDF graph with less information
(in terms of entailment). Even if a query uses a feature like MINUS, the simple
semantics of RDF adopts OWA, such that if G |= H, any information given by G
that is not given by H is considered unknown, rather than false. We are thus
interested in exploring an alternative semantics for SPARQL where we could
guarantee, for example, that if G |= H, then Q(H) → Q(G): that in some sense
SPARQL answers are “monotonic” with respect to simple entailment in RDF.


3    Certain Answers for SPARQL?
An intuitive idea is to define the semantics of SPARQL with respect to all possible
models of the RDF graph, where an answer is included in the result if and only
if it is “valid” for all models. This idea corresponds closely with the notion of
certain answers for incomplete databases.
     More specifically, let C denote a set of constants and N a set of nulls where
both sets are disjoint. Let us likewise assume, for simplicity, a single relation
name R with arity k. An instance I of R is a k-ary relation RI ⊆ Ck with only
constants. A naive instance I of R is a k-ary relation RI ⊆ (CN)k that may
contain constants and nulls. Let ν : N → C be a mapping from nulls to constants,
and abusing notation, let ν(RI ) denote the image of RI under ν. A representative
of a naive instance I is an instance I 0 where ν(RI ) ⊆ RI 0 . Each representative
fills in the incomplete information of I in a particular way. We denote the set of
all representatives of I as rep(I).

Example 3. The following are instances of a relation R where numbers indicate
constants and terms of the form ⊥i indicate a labelled null.

                                                               R
                  R                       R
                                                             1 3 3
               1 ⊥2 ⊥1                  1 2 4
                                                             2 3 3
               2 ⊥2 3                   2 2 3
                                                             4 5 6

The leftmost instance I1 is naive since it contains nulls, whereas I2 and I3 are
also instances but are not naive. Instances I2 and I3 are representatives of I1 . t
                                                                                  u

   Now, given a (possibly naive) database instance I and a query Q, the certain
answers can be defined as:
                                    \
                    certain(Q, I) = {Q(I 0 ) | I 0 ∈ rep(I)}
In other words, a certain answer is an answer for all representatives.
    Importantly, representatives in certain answers act somewhat like models
in simple entailment, leading to an interesting correspondence: an RDF graph
G |= H if and only if rep(G) ⊆ rep(H) (assuming an intuitive encoding of G and
H as a ternary relation) [4]. From this follows:
Remark 1. Given two RDF graphs G and H, then G |= H if and only if
certain(Q, G) ⊆ certain(Q, H).                                  t
                                                                u
In other words, if G simple-entails H, then the certain answers for any query
Q over H will be “contained in” those for G. This semantic property holds no
matter what type of query Q or base query-evaluation Q(G) we consider.
Example 4. Consider again query Q from Example 2 run over the same two
graphs. We previously saw that Q(H) 6→ Q(G). On the other hand, certain(Q, H)
is empty because there exists H 0 ∈ rep(H) such that b1 and b2 map to the same
constant and where Q(H 0 ) is thus empty, and thus certain(Q, H) is empty. It
now holds that certain(Q, H) ⊆ certain(Q, G).                                t
                                                                             u
   To evaluate certain answers for SPARQL, we could apply recent results by
Libkin [5], who proposes a sound evaluation of certain answers for SQL by
changing how (in)equality expressions and atomic formulae are evaluated in the
presence of nulls. We very briefly give an example to illustrate.
Example 5. Under Libkin’s proposal, for query (1) in the introduction, the
w:killedBy pattern would be evaluated as unknown (with or without the
triple mentioning _:b), which would lead to no results being returned as certain
answers. In the case of Example 2, the inequality comparison would likewise
return unknown, leading to certain(Q, H) returning no results.
    Implementing and experimentally evaluating Libkin’s proposals [5] for SPARQL
would thus be a natural next step towards a SELECT CERTAIN feature. There are
also novel questions to tackle in the SPARQL setting, including the effect of bag
semantics, non-well-designed queries, etc., on certain answer evaluation.

Acknowledgements: This work was supported by FONDECYT Grant no. 11140900 and
by the Millennium Nucleus Center for Semantic Web Research under Grant NC120004.


References
1. S. Ahmetaj, W. Fischl, R. Pichler, M. Simkus, and S. Skritek. Towards reconciling
   SPARQL and certain answers. In World Wide Web (WWW), pages 23–33, 2015.
2. M. Arenas and J. Pérez. Querying semantic web data with SPARQL. In Principles
   of Database Systems (PODS), pages 305–316, 2011.
3. C. Gutierrez, C. A. Hurtado, A. O. Mendelzon, and J. Pérez. Foundations of Semantic
   Web databases. J. Comput. Syst. Sci., 77(3):520–541, 2011.
4. A. Hogan, M. Arenas, A. Mallea, and A. Polleres. Everything you always wanted to
   know about blank nodes. J. Web Sem., 27:42–69, 2014.
5. L. Libkin. SQL’s three-valued logic and certain answers. In International Conference
   on Database Theory (ICDT), pages 94–109, 2015.