<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bridging the Semantic Gap between RDF and SPARQL using Completeness Statements</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computer Science, Free University of Bozen-Bolzano</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>RDF data is often treated as incomplete, following the OpenWorld Assumption. On the other hand, SPARQL, the standard query language over RDF, usually follows the Closed-World Assumption, assuming RDF data to be complete. This gives rise to a semantic gap between RDF and SPARQL. In this paper, we address how to close the semantic gap between RDF and SPARQL in terms of certain answers and possible answers using completeness statements.</p>
      </abstract>
      <kwd-group>
        <kwd>SPARQL</kwd>
        <kwd>RDF</kwd>
        <kwd>data completeness</kwd>
        <kwd>OWA</kwd>
        <kwd>CWA</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Due to its open and incremental nature, data on the Semantic Web is
generally incomplete. This can be formalized using the Open-World Assumption
(OWA)[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. SPARQL, on the other hand, interprets RDF data under
closedworld semantics. While for positive queries, this does not create problems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
SPARQL queries with negation make only sense under closed-world semantics.
This ambiguous interpretation of RDF data poses the question how to determine
which semantics is appropriate in a given situation.
      </p>
      <p>As a particular example, suppose we want to retrieve all Oscar winners that
have tattoos. For obvious reasons, no sources on the Semantic Web contain
complete information about this topic and hence the Open-World Assumption
applies. On the other hand, suppose we want to retrieve all Oscar winners. Here,
an RDF version of IMDb1 would contain complete data and hence one may
intuitively apply closed-world reasoning.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], completeness statements were introduced, which are metadata that
allow one to specify that the CWA applies to parts of a data source. We argue
that these statements can be used to understand the meaning of query results
in terms of certain and possible answers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which is especially interesting for
queries with negation: Suppose an RDF data source has the completeness
statement \complete for all Oscar winners". The result of a SPARQL query for Oscar
winners contains not only all certain but also all possible answers. The result
of a SPARQL query for Oscar winners with tattoos contains only all certain
1 http://www.imdb.com/oscars/
answers. Moreover, the result of a SPARQL query for people with tattoos that
did not win an Oscar contains all certain answers, but not all possible answers.
The result of a SPARQL query for Oscar winners not having tattoos contains
all possible answers but none of the answers is certain.
      </p>
      <p>In this paper, we discuss how to assess the relationship between certain
answers, possible answers and the results retrieved by SPARQL queries in the
presence of completeness statements.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Formalization</title>
      <p>
        SPARQL Queries. A basic graph pattern (BGP) is a set of triple patterns [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
In this work, we do not consider blank nodes. We de ne a graph pattern
inductively as follows: (1) a BGP P is a graph pattern; (2) for a BGP P , a NOT-EXISTS
pattern :P is a graph pattern; (3) for graph patterns P1 and P2, P1 AND P2 is a
graph pattern. Any graph pattern P can be equivalently written as a conjunction
of a BGP (called the positive part and denoted as P +) and several NOT-EXISTS
patterns f :P1; : : : ; :Pn g (referred to as the negative part and denoted as P ).
A query with negation has the form Q = (W; P ) where P is a graph pattern and
W var (P +) is the set of distinguished variables. The evaluation of a graph
pattern P over a graph G is de ned in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]:
      </p>
      <p>JP + AND :P1 AND : : : AND :PnKG = f j 2 JP +KG ^ 8i : J (Pi)KG = ; g
The result of evaluating (W; P ) over a graph G is the restriction of JP KG to W .
This fragment of SPARQL queries is safe, that is, for a query with negation Q
and a graph G, the evaluation JQKG returns nite answers. We assume all queries
to be consistent, that is, there is a graph G where JQKG 6= ;.</p>
      <p>Completeness Statements. Formally, a completeness statement C is of the
form Compl(P1jP2) where P1, called the pattern, and P2, called the condition,
are BGPs. Intuitively, such a statement expresses that the source contains all
instantiations of P1 that satisfy condition P2 (e.g., all Golden Globe winners
that won an Oscar).</p>
      <p>
        In line with the Open-World Assumption of RDF, for a graph G, we call any
graph G0 such that G0 G an interpretation of G [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We associate to a
completeness statement C the CONSTRUCT query QC = (CONSTRUCT f P1 g f P1 AND P2 g). A
pair (G; G0) of a graph and one of its interpretations satis es a completeness
statement C, written (G; G0) j= C if JQC KG0 G holds. It satis es a set C of
completeness statements, written (G; G0) j= C if it satis es every element in C. A
set of completeness statements C entails a statement C, written C j= C, if for all
pairs (G; G0) of a graph G and an interpretation G0 of G such that (G; G0) j= C,
it is the case that (G; G0) j= C.
      </p>
      <p>Completeness statements restrict the set of interpretations:
De nition 1 (Valid Interpretation with Completeness Statements). Let
G be a graph and C be a set of completeness statements. An interpretation G0 G
is valid w.r.t. C i (G; G0) j= C. We write the set of all valid interpretations of
G w.r.t. C as int(G; C).</p>
      <p>Bridging the Semantic Gap between RDF and SPARQL
De nition 2 (Certain and Possible Answers with Completeness
Statements). Let Q be a query, G be a graph and C be a set of completeness
statements. Then the certain and possible answers of Q over G w.r.t. C are CAQ(G) =
TG02int(G;C)JQKG0 and PACQ(G) = SG02int(G;C)JQKG0 , respectively. C
If C is empty, then PAQ(G) and CAQ(G) correspond to the classical certain and</p>
      <p>
        C C
possible answers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. With completeness statements, some query results may
correspond to the certain and possible answers while others may not.
Example 1 (Possible Answers). Consider again the query asking for people that
have tattoos. Since this is private information we have no knowledge how
complete our data source is, and hence the set of possible answers is (nearly) in nite:
Anyone not explicitly stated to have tattoos might still have tattoos.
      </p>
      <p>On the other hand, consider the query for people who won an Oscar. It is
relatively easy for a data source to be complete on this topic, by comparing:
e.g., to the Internet Movie Database (IMDb). If a data source is complete for all
Oscar winners, then there are no further possible answers.</p>
      <p>
        The reasoning for queries with negation is more complex. By [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], under the OWA,
for a monotonic query Q and a graph G the certain answers are JQKG, while
for queries with negation, the certain answers are empty. With completeness
statements, the certain answers of a query with negation can be non-empty.
Example 2 (Certain Answers). Consider rst a query for people that won an
Oscar but no Golden Globe. If a data source is complete both for Oscar winners
and Golden Globe winners, then the query result contains all possible and all
certain answers.
      </p>
      <p>On the other hand, a query for people with tattoos that did not win an Oscar
would only return certain answers, but not all possible answers, because there
are probably many more people with tattoos that did not win an Oscar.</p>
      <p>The result of a query for people that won an Oscar and do not have tattoos
contains all possible answers but no certain answers, because we do not know
for certain which Oscar winners have tattoos and which do not.</p>
      <p>We next de ne for each query some completeness statements that allow one to
capture the crucial information for getting certain or possible answers. Knowing
about the crucial statements helps in data acquisition identify which data is
needed in order to achieve desired answer semantics.</p>
      <p>De nition 3 (Crucial Statements). For a query Q = (W; P ), the positive
crucial statement of Q, denoted as CQ+, is the statement Compl(P +jtrue). The
set of negative crucial statements of Q, denoted as CQ , is the set f Compl(P1jP +);
: : : ; Compl(PnjP +) g, where P1; : : : ; Pn are from the negative part P of Q.</p>
      <p>The next theorems show that the crucial statements can be used to infer
relationships between certain answers, possible answers and SPARQL query
results.
Theorem 1 (Bounded Possible Answers). Let C be a set of completeness
statements and Q be a positive query. Then</p>
      <p>C j= CQ+
implies
for all graphs G, PAQ(G) = CAQ(G) (= JQKG):</p>
      <p>C C
While the equality JQKG = CACQ(G) always holds, the new insight is that JQKG =
PAQ(G). This means the query results cannot miss any information w.r.t. reality.</p>
      <p>C
Theorem 2 (Queries with Negation). Let C be a set of completeness
statements and Q be a query. Then
1. C j= CQ implies for all graphs G, CACQ(G) = QKG;
2. C j= CQ+ ^ CQ implies for all graphs G, PACQ(GJ) = CACQ(G) = JQKG.</p>
      <p>
        The rst item means that if C j= CQ , then every answer returned by the query
is a certain answer. The second item means that if additionally C j= CQ+, then
there also cannot be any other possible answers than those returned by JQKG.
The completeness statement entailment problems can be solved using standard
query containment techniques [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
We have shown that in the presence of completeness statements, the semantics
of SPARQL may correspond to the certain answer or possible answer semantics.
Our work is based on the observation that parts of data on the Semantic Web are
actually complete. In future research, we would like to consider explicit negative
RDF knowledge and completeness statements over it as an alternative for getting
certain and possible answers. In this work, we assume all information contained
in a graph is correct. An interesting case is when this assumption does not hold
in general. We would like to investigate correctness statements as the dual of
completeness statements. A further study is also needed for an e ective way
of maintenance of completeness statements to cope with information changes.
One possible way is to add timestamps to completeness statements. An extended
version with proofs of this paper is available at http://arxiv.org/abs/1408.6395.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Raymond</given-names>
            <surname>Reiter</surname>
          </string-name>
          .
          <source>On Closed World Data Bases</source>
          .
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Marcelo</given-names>
            <surname>Arenas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jorge</given-names>
            <surname>Perez</surname>
          </string-name>
          .
          <article-title>Querying Semantic Web Data with SPARQL</article-title>
          .
          <string-name>
            <surname>In</surname>
            <given-names>PODS</given-names>
          </string-name>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Fariz</given-names>
            <surname>Darari</surname>
          </string-name>
          , Werner Nutt, Giuseppe Pirro, and
          <string-name>
            <given-names>Simon</given-names>
            <surname>Razniewski</surname>
          </string-name>
          .
          <article-title>Completeness Statements about RDF Data Sources and Their Use for Query Answering</article-title>
          .
          <source>In ISWC 2013</source>
          , pages
          <fpage>66</fpage>
          {
          <fpage>83</fpage>
          . Springer Berlin Heidelberg,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Serge</surname>
            <given-names>Abiteboul</given-names>
          </string-name>
          , Paris C. Kanellakis, and Gosta Grahne.
          <source>On the Representation and Querying of Sets of Possible Worlds. Theor. Comput. Sci.</source>
          ,
          <volume>78</volume>
          (
          <issue>1</issue>
          ):
          <volume>158</volume>
          {
          <fpage>187</fpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Steve</given-names>
            <surname>Harris</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andy</given-names>
            <surname>Seaborne</surname>
          </string-name>
          .
          <source>SPARQL 1</source>
          .
          <article-title>1 Query Language</article-title>
          .
          <source>Technical report, W3C</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Simon</given-names>
            <surname>Razniewski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Werner</given-names>
            <surname>Nutt</surname>
          </string-name>
          .
          <source>Completeness of Queries over Incomplete Databases. PVLDB</source>
          ,
          <volume>4</volume>
          (
          <issue>11</issue>
          ):
          <volume>749</volume>
          {
          <fpage>760</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>