<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the tractability of certain answers for SQL nulls in relational algebra with inequalities</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Informatics, University of Edinburgh</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Missing values in theoretical models of incomplete database are often represented with marked nulls, while in SQL databases missing values are all denoted by the same syntactic NULL object. Even practical algorithm to approximate certain answers (answers which are true regardless of how incomplete information is interpreted) are often developed in the model with marked nulls. However computing certain answers for marked nulls is co-NP complete even for the most simple queries when inequalities are allowed. In this short paper we study the tractability of certain answers for SQL nulls in a fragment of relational algebra where selection with inequalities is permitted. We de ne the fragment and present an algorithm to compute certain answers. We also show that if we add even small features to the fragment, computing certain answers becomes intractable. This study emphasises the necessity of a speci c certain answers approximation scheme for SQL nulls and o ers ideas to design it.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Introduction
The standard way of answering queries on incomplete databases is to compute
certain answers: those that do not depend on the interpretation of unknown data.
However, evaluating certain answers for core relational algebra is co-NP
complete in data complexity [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. As a consequence the community has developed
sound tractable approximation schemes (which return a subset of the certain
answers). Most of those schemes have been produced with the marked nulls
model of incompleteness [
        <xref ref-type="bibr" rid="ref3 ref7">3,7</xref>
        ], and it is well known that even the simplest query
with inequalities is intractable for marked nulls [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However nulls used in SQL
databases are di erent. In this paper we study a fragment of relational algebra
with inequalities for which computing certain answers for SQL nulls is tractable.
This demonstrates the complexity gap between those two models of
incompleteness and therefore emphasizes the need of a speci c approximation scheme for
SQL nulls.
      </p>
      <p>
        We consider incomplete databases with nulls interpreted as missing
information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Below we recall de nitions that are standard in the literature. Databases
are populated by two types of elements: constants coming from countably
innite set denoted by Const, and the syntactic object NULL. The occurrences of
NULL in an SQL database are typically interpreted as non-repeating elements of
a set Null. That is, an SQL database can be seen as a Codd database where
each occurrence of NULL is replaced by a fresh distinct marked null [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Therefore
we denote by N ull(D) = f?1 : : : ?ng the set of distinct marked nulls in the
database D.
      </p>
      <p>A valuation v on a database D is a map v : N ull(D) ! Const that assigns
constant values to nulls occurring in the database. By v(D) we denote the result
of replacing each ?i with v(?i) in D. A relational query Q of arity k takes a
complete database D and returns a bag of k-tuples over Const(D). If such a
query Q is asked on an incomplete database D, to answer it we compute for
each t 2 (Const [ N ull(D))k the bag of certain answers denoted (Q; D) which
verify :
#(t; (Q; D)) =</p>
      <p>
        min
v a valuation
#(v(t); Q(v(D))):
Where
#(t; R) =
n if t 2n R
0 if t 2= R
and we use t 2n R to say that t has a multiplicity n in the bag R [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. If the
multiplicity of a tuple in (Q; D) is equal to 0, this tuple does not belong to the
certain answers.
      </p>
      <p>In order to add boolean queries to relational algebra we add the operator
; such that for any complete database D and any query Q, ;(Q)(D) = ; if
Q(D) = ; and ;(Q)(D) = f()g otherwise.
2</p>
      <p>
        Relational algebra fragment with e
cient evaluation
Our goal is to nd a fragment for which we can build an e cient algorithm
to compute all certain answers. As a motivation we take inspiration from the
hierarchical queries from probabilistic databases [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to de ne the restricted
hierarchical relational algebra de ned below. We start with two subclasses of unions
of CQs with inequalities.
      </p>
      <p>Q1 :=Q1 \ Q1 j Q1 [ Q1 j
Q2 :=Q2 [ Q2 j
(Q2) j
(Q1) j
(Q2) j R
(Q1) j R
Note that the only di erence between the two classes is that Q1 allows
intersection while Q2 does not. We denote by Q1 resp. Q2 the set of query induced by
the grammar Q1 resp. Q2. Based on them we de ne the class RAH :
Q0 := Q0 [ Q0 j Q1 n Q2 j
(Q0) j
(Q0) j ;(Q0) j R</p>
      <p>
        A relational algebra query is called non-repeating if every relation symbol
occurs at most once [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We denote RAH;NR = fQ j Q 2 RAH ^ Q is non-repeatingg
the fragment of restricted hierarchical relational algebra where queries are
nonrepeating.
      </p>
      <p>Our main result is :
Theorem 1. For every query Q 2 RAH;NR and every database D computing
(Q; D) is tractable.</p>
      <p>We now outline the proof of Theorem 1. As the restricted hierarchical
fragment of relational algebra imposes that every binary operation in selection
conditions is between attributes of the same relation or constants, then for each
query in the fragment we can build an equivalent query where selection occurs
only on relation symbols.</p>
      <p>Lemma 1 For every Q 2 RAH there exists Q0 2 RAH such that jQ0j = O(jQj)
and for every complete database D, (Q; D) = (Q0; D) and every selection
operator of Q0 occurs on a relation symbol.</p>
      <p>Now we show that for Q1 2 Q1 computing (Q1; D) is tractable. One starts
by applying the lemma 1 to push every selection operator. Its queries are now
given by : Q1 := Q1 \ Q1 j Q1 [ Q1 j (R) j (Q1) j R. Then computation is
done inductively by the following rules :
#(t; (Q \ Q0; D)) =min(#(t; (Q; D)); #(t; (Q0; D)))
#(t; (Q [ Q0; D)) =#(t; (Q; D)) + #(t; (Q0; D))
#(t; (
#(t; (R; D)) =#(t; R)</p>
      <p>X
(Q); D) =
u; (u)=t</p>
      <p>#(u; (Q; D))
8 #(t; R) if 8v a valuation; v(t) = t ^ (t)
#(t; ( (R); D)) = &lt; 1 if 9v a valuation; v(t) 6= t ^ is a valid formulae
: 0 otherwise
Therefore for every query Q 2 Q1, the evaluation of (Q; D) is tractable. The
computation rules above are sound and complete only because the relations
hence the nulls can not repeat.</p>
      <p>Then we show how to evaluate (Q0; D). Informally in order to evaluate
Q0 one has to compute the possible answers of Q2 which match an element of
Q1. But rst notice that as intersection is not allowed in Q2 we can push the
projection operators :
Lemma 2 For every Q 2 Q2 there exists Q0 2 Q2 such that jQ0j = O(jQj) and
for every complete database D, Q(D) = Q0(D) and every projection operator of
Q0 occurs on a relation symbol, or on a selection operator over a relation symbol.</p>
      <p>Then from lemmas 1 and 2 we just have to consider queries of the form:
Q0 = Q1 n S R ( R (R)) with Q1 2 Q1 ^ Q2 2 Q2</p>
      <p>R2Q2
For every R 2 Q2; 8t 2 R we build a set:
+Q1;R (t) = fu 2
(Q1; D) j 9v a valuation; R(v(t)) ^</p>
      <p>R (v(t)) = v(u)g
As the relations hence the nulls can not repeat, (Q1; D) can be computed
independently of Q2 and is at most of the size of D, then for each R, the set
+Q1;R can be built in polynomial time.</p>
      <p>Moreover for every u 2 (Q1; D) we build a bag:
8t 2 (Const [ N ull) ; #(t; *Q2 (u)) =</p>
      <p>X
R2fR2Q2ju2+Q1;R(t)g
#(t; R)
Here, *Q2 (u) is the bag of elements in S R which unify with u 2 (Q1; D).
R2Q2
Then a tuple u belongs to certain answers of Q0 if and only if the multiplicity
of u in (Q1; D) is higher than the number of elements in *Q2 (u).
Proposition 1 Let Q0 = Q1 n SR2Q2
Then #(u; (Q0; D)) = max(0; j *Q2 (u)j
of (Q0; D) is tractable.</p>
      <p>R (</p>
      <p>R (R)) with Q1 2 Q1 ^ Q2 2 Q2.
#(u; (Q1; D))), and the evaluation</p>
      <p>However there exists a query Q0 such that 8u; #(u; (Q0; D)) = 0 and
( ;(Q0); D)) 6= ;. In order to compute ( ;(Q0); D), we want to check if
there exists a matching between (Q1; D)) and the elements that unify with it.
Proposition 2 Let Q0 = Q1 n SR2Q2 ( R (R)) with Q1 2 Q1 ^ Q2 2 Q2.
Then ( ;(Q0); D) = ; if and only if there exists an injective function
m : ( ;(Q1); D) ! S R such that 8t 2 ( ;(Q1); D); m(t) 2*Q2 (t).</p>
      <p>
        R2Q2
m is a 2DM matching, and the evaluation of
( ;(Q0); D) is tractable [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
3
      </p>
      <p>Extending the fragment
In this section we discuss the di culties of extending the fragment RH;NR to
obtain tractable evaluation of certain answers.</p>
      <p>Proposition 3 For each of the following extensions of RH;NR :
{ allowing cross-product.
{ allowing repetition of relation symbols.
{ allowing intersection on the right-hand side of the Q1 n Q2 operator.
{ allowing di erence on the right-hand side of the Q1 n Q2 operator.
the data complexity of evaluating certain answers is co-NP hard.
4</p>
      <p>Conclusions
In this paper we have exhibited a fragment for which computing certain answers
for SQL nulls is tractable. We have also shown that adding features to the
fragment quickly lead to intractability. The next question that arises is toward
maximality, in order to nd a dichotomic property for certain answers with SQL
nulls one would have to consider equivalently expressive classes of query. As soon
as we fully understand what leads to intractability we will be able to design a
more accurate approximation scheme for SQL nulls.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Serge</surname>
            <given-names>Abiteboul</given-names>
          </string-name>
          , Paris Kanellakis, and Gosta Grahne.
          <article-title>On the representation and querying of sets of possible worlds</article-title>
          .
          <source>Theoretical Computer Science</source>
          ,
          <volume>78</volume>
          (
          <issue>1</issue>
          ):
          <volume>159</volume>
          {
          <fpage>187</fpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Marco</given-names>
            <surname>Console</surname>
          </string-name>
          , Paolo Guagliardo, and
          <string-name>
            <given-names>Leonid</given-names>
            <surname>Libkin</surname>
          </string-name>
          .
          <article-title>On querying incomplete information in databases under bag semantics</article-title>
          .
          <source>IJCAI.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Guagliardo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Leonid</given-names>
            <surname>Libkin</surname>
          </string-name>
          .
          <article-title>Correctness of SQL Queries on Databases with Nulls</article-title>
          .
          <source>ACM SIGMOD Record</source>
          ,
          <volume>46</volume>
          (
          <issue>3</issue>
          ):5{
          <fpage>16</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Guagliardo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Leonid</given-names>
            <surname>Libkin</surname>
          </string-name>
          .
          <article-title>On the Codd semantics of SQL nulls</article-title>
          .
          <source>Alberto Mendelzon Workshop</source>
          , 36t,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Tomasz</given-names>
            <surname>Imielinski</surname>
          </string-name>
          and
          <article-title>Witold Lipski Jr</article-title>
          .
          <article-title>Incomplete information in relational databases</article-title>
          .
          <source>Journal of the ACM (JACM)</source>
          ,
          <volume>31</volume>
          (
          <issue>4</issue>
          ):
          <volume>761</volume>
          {
          <fpage>791</fpage>
          ,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Eugene L Lawler</surname>
          </string-name>
          .
          <article-title>Combinatorial optimization: networks and matroids</article-title>
          .
          <source>Courier Corporation</source>
          ,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Leonid</given-names>
            <surname>Libkin</surname>
          </string-name>
          .
          <article-title>SQL's three-valued logic and certain answers</article-title>
          .
          <source>ACM Transactions on Database Systems (TODS)</source>
          ,
          <volume>41</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Dan</given-names>
            <surname>Suciu</surname>
          </string-name>
          , Dan Olteanu, Christopher Re, and
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Koch</surname>
          </string-name>
          .
          <article-title>Probabilistic databases</article-title>
          .
          <source>Synthesis Lectures on Data Management</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ):1{
          <fpage>180</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Moshe</surname>
            <given-names>Y</given-names>
          </string-name>
          <string-name>
            <surname>Vardi.</surname>
          </string-name>
          <article-title>The complexity of relational query languages</article-title>
          .
          <source>In Proceedings of the fourteenth annual ACM symposium on Theory of computing</source>
          , pages
          <volume>137</volume>
          {
          <fpage>146</fpage>
          . ACM,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>