
On the tractability of certain answers for SQL nulls in relational algebra with inequalities

School of Informatics, University of Edinburgh

Missing values in theoretical models of incomplete databases are often represented with marked nulls, while in SQL databases all missing values are denoted by the same syntactic NULL object. Even practical algorithms that approximate certain answers (answers that are true regardless of how incomplete information is interpreted) are often developed in the model with marked nulls. However, computing certain answers for marked nulls is co-NP-complete even for the simplest queries when inequalities are allowed. In this short paper we study the tractability of certain answers for SQL nulls in a fragment of relational algebra where selection with inequalities is permitted. We define the fragment and present an algorithm to compute certain answers. We also show that adding even small features to the fragment makes computing certain answers intractable. This study emphasises the need for a specific certain-answers approximation scheme for SQL nulls and offers ideas for designing it.

1 Introduction

The standard way of answering queries on incomplete databases is to compute certain answers: those that do not depend on the interpretation of unknown data. However, evaluating certain answers for core relational algebra is co-NP-complete in data complexity [9]. As a consequence, the community has developed sound tractable approximation schemes, which return a subset of the certain answers. Most of those schemes have been produced for the marked nulls model of incompleteness [3,7], and it is well known that even the simplest query with inequalities is intractable for marked nulls [1]. However, the nulls used in SQL databases are different. In this paper we study a fragment of relational algebra with inequalities for which computing certain answers for SQL nulls is tractable. This demonstrates the complexity gap between the two models of incompleteness and therefore emphasises the need for a specific approximation scheme for SQL nulls.

We consider incomplete databases with nulls interpreted as missing information [5]. Below we recall definitions that are standard in the literature. Databases are populated by two types of elements: constants, coming from a countably infinite set denoted by $\mathrm{Const}$, and the syntactic object NULL. The occurrences of NULL in an SQL database are typically interpreted as non-repeating elements of a set $\mathrm{Null}$. That is, an SQL database can be seen as a Codd database where each occurrence of NULL is replaced by a fresh distinct marked null [4]. We therefore denote by $\mathrm{Null}(D) = \{\bot_1, \ldots, \bot_n\}$ the set of distinct marked nulls in the database $D$.
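The Codd interpretation described above can be sketched in a few lines of Python. This is only an illustration: modelling NULL as `None`, the class `MarkedNull`, and the helpers `to_codd` and `nulls_of` are assumptions made for this example and do not come from the paper.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class MarkedNull:
    """A marked null; `index` is a hypothetical label used for illustration."""
    index: int


def to_codd(database):
    """Replace every SQL NULL (modelled here as None) by a fresh marked null."""
    fresh = count(1)
    return {
        name: [tuple(MarkedNull(next(fresh)) if x is None else x for x in row)
               for row in rows]
        for name, rows in database.items()
    }


def nulls_of(database):
    """Collect the set of marked nulls occurring in the database."""
    return {x for rows in database.values() for row in rows for x in row
            if isinstance(x, MarkedNull)}


# A database is a dict from relation names to lists of tuples (bags).
db = {"R": [(1, None), (None, None)], "S": [(2,)]}
codd_db = to_codd(db)
```

Because every occurrence of `None` receives its own index, no marked null appears twice in `codd_db`, which is the non-repetition property the rest of the paper relies on.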

A valuation $v$ on a database $D$ is a map $v : \mathrm{Null}(D) \to \mathrm{Const}$ that assigns constant values to the nulls occurring in the database. By $v(D)$ we denote the result of replacing each $\bot_i$ with $v(\bot_i)$ in $D$. A relational query $Q$ of arity $k$ takes a complete database $D$ and returns a bag of $k$-tuples over $\mathrm{Const}(D)$. If such a query $Q$ is asked on an incomplete database $D$, to answer it we compute, for each $t \in (\mathrm{Const} \cup \mathrm{Null}(D))^k$, its multiplicity in the bag of certain answers, denoted $\Box(Q, D)$, which satisfies

$$\#(t, \Box(Q, D)) = \min_{v \text{ a valuation}} \#(v(t), Q(v(D))),$$

where

$$\#(t, R) = \begin{cases} n & \text{if } t \in_n R \\ 0 & \text{if } t \notin R \end{cases}$$

and we write $t \in_n R$ to say that $t$ has multiplicity $n$ in the bag $R$ [2]. If the multiplicity of a tuple in $\Box(Q, D)$ is equal to 0, this tuple does not belong to the certain answers.
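To make the definition concrete, the following Python sketch computes the certain multiplicity of a tuple by brute force, taking the minimum over all valuations. It rests on assumptions that are not in the paper: valuations range over a finite `domain` supplied by the caller (whereas $\mathrm{Const}$ is infinite), nulls are modelled as strings such as `"_1"`, and a query is any Python function from a complete database to a list of tuples. The enumeration is exponential in the number of nulls, so it illustrates the semantics only and is not the tractable procedure studied here.

```python
from collections import Counter
from itertools import product


def apply_valuation(v, x):
    """Nulls are keys of v; constants are left unchanged."""
    return v.get(x, x)


def certain_multiplicity(query, database, nulls, t, domain):
    """Minimum, over valuations v: nulls -> domain, of #(v(t), Q(v(D)))."""
    best = None
    for values in product(domain, repeat=len(nulls)):
        v = dict(zip(nulls, values))
        vdb = {name: [tuple(apply_valuation(v, x) for x in row) for row in rows]
               for name, rows in database.items()}
        vt = tuple(apply_valuation(v, x) for x in t)
        m = Counter(query(vdb))[vt]
        best = m if best is None else min(best, m)
    return best


# Hypothetical example: keep the rows of R whose second attribute differs from 1.
def q(db):
    return [row for row in db["R"] if row[1] != 1]


example_db = {"R": [(1, "_1"), (1, 2)]}
m_const = certain_multiplicity(q, example_db, ["_1"], (1, 2), [1, 2, 3])
m_null = certain_multiplicity(q, example_db, ["_1"], (1, "_1"), [1, 2, 3])
```

In this example the tuple `(1, 2)` keeps multiplicity at least 1 under every valuation, while `(1, "_1")` drops to 0 when the null is valued as 1 and the inequality fails.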

In order to add Boolean queries to relational algebra we add the operator $\pi_\emptyset$ such that, for any complete database $D$ and any query $Q$, $\pi_\emptyset(Q)(D) = \emptyset$ if $Q(D) = \emptyset$ and $\pi_\emptyset(Q)(D) = \{()\}$ otherwise.

2 Relational algebra fragment with efficient evaluation

Our goal is to find a fragment for which we can build an efficient algorithm to compute all certain answers. We take inspiration from the hierarchical queries of probabilistic databases [8] to define the restricted hierarchical relational algebra below. We start with two subclasses of unions of CQs with inequalities.

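$$Q_1 := Q_1 \cap Q_1 \mid Q_1 \cup Q_1 \mid \sigma_\theta(Q_1) \mid \pi_\alpha(Q_1) \mid R$$

$$Q_2 := Q_2 \cup Q_2 \mid \sigma_\theta(Q_2) \mid \pi_\alpha(Q_2) \mid R$$

Note that the only difference between the two classes is that $Q_1$ allows intersection while $Q_2$ does not. We denote by $\mathcal{Q}_1$ (resp. $\mathcal{Q}_2$) the set of queries induced by the grammar $Q_1$ (resp. $Q_2$). Based on them we define the class $\mathrm{RA}_H$:

$$Q_0 := Q_0 \cup Q_0 \mid Q_1 \setminus Q_2 \mid \sigma_\theta(Q_0) \mid \pi_\alpha(Q_0) \mid \pi_\emptyset(Q_0) \mid R$$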

A relational algebra query is called non-repeating if every relation symbol occurs at most once [8]. We denote by $\mathrm{RA}_{H,NR} = \{Q \mid Q \in \mathrm{RA}_H \wedge Q \text{ is non-repeating}\}$ the fragment of restricted hierarchical relational algebra where queries are non-repeating.
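The non-repeating condition is purely syntactic and can be checked by one traversal of the query tree. The sketch below assumes a hypothetical query representation (the classes `Rel` and `Op` are invented for this illustration) and counts how often each relation symbol occurs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Rel:
    """A relation symbol (leaf of the query tree)."""
    name: str


@dataclass(frozen=True)
class Op:
    """Any operator node; `kind` is a free-form label such as "union" or "diff"."""
    kind: str
    args: tuple


def relation_symbols(q):
    """Count the occurrences of each relation symbol in the query tree."""
    if isinstance(q, Rel):
        return Counter([q.name])
    total = Counter()
    for sub in q.args:
        total += relation_symbols(sub)
    return total


def is_non_repeating(q):
    """True iff every relation symbol occurs at most once."""
    return all(n <= 1 for n in relation_symbols(q).values())


q_ok = Op("diff", (Op("inter", (Rel("R"), Rel("S"))), Rel("T")))
q_bad = Op("union", (Rel("R"), Rel("R")))
```

Here `q_ok` uses each of `R`, `S`, `T` once, whereas `q_bad` mentions `R` twice and is therefore repeating.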

Our main result is the following.

Theorem 1. For every query $Q \in \mathrm{RA}_{H,NR}$ and every database $D$, computing $\Box(Q, D)$ is tractable.

We now outline the proof of Theorem 1. Since the restricted hierarchical fragment of relational algebra requires every binary operation in a selection condition to be between attributes of the same relation or constants, for each query in the fragment we can build an equivalent query in which selection is applied only to relation symbols.

Lemma 1. For every $Q \in \mathrm{RA}_H$ there exists $Q' \in \mathrm{RA}_H$ such that $|Q'| = O(|Q|)$, for every complete database $D$, $\Box(Q, D) = \Box(Q', D)$, and every selection operator of $Q'$ is applied to a relation symbol.

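Now we show that for $Q_1 \in \mathcal{Q}_1$ computing $\Box(Q_1, D)$ is tractable. One starts by applying Lemma 1 to push every selection operator down to a relation symbol. The queries are then given by

$$Q_1 := Q_1 \cap Q_1 \mid Q_1 \cup Q_1 \mid \sigma_\theta(R) \mid \pi_\alpha(Q_1) \mid R.$$

The computation is then done inductively by the following rules:

$$\begin{aligned}
\#(t, \Box(Q \cap Q', D)) &= \min(\#(t, \Box(Q, D)), \#(t, \Box(Q', D))) \\
\#(t, \Box(Q \cup Q', D)) &= \#(t, \Box(Q, D)) + \#(t, \Box(Q', D)) \\
\#(t, \Box(\pi_\alpha(Q), D)) &= \sum_{u,\ \pi_\alpha(u) = t} \#(u, \Box(Q, D)) \\
\#(t, \Box(R, D)) &= \#(t, R) \\
\#(t, \Box(\sigma_\theta(R), D)) &= \begin{cases} \#(t, R) & \text{if } \forall v \text{ a valuation},\ v(t) = t \wedge \theta(t) \\ 1 & \text{if } \exists v \text{ a valuation},\ v(t) \neq t \wedge \theta \text{ is a valid formula} \\ 0 & \text{otherwise} \end{cases}
\end{aligned}$$

Therefore, for every query $Q \in \mathcal{Q}_1$, the evaluation of $\Box(Q, D)$ is tractable. The computation rules above are sound and complete only because the relations, and hence the nulls, cannot repeat.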
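The inductive rules can be read as a recursive evaluator over bags. The sketch below is one possible rendering under stated assumptions: the query classes (`Rel`, `Select`, `Project`, `Inter`, `Union`) and the `Null` class are hypothetical, bags are `collections.Counter` objects, and whether a selection condition is a valid formula is not decided by the code but supplied by the caller through the flag `theta_is_valid`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Null:
    index: int


@dataclass(frozen=True)
class Rel:
    name: str


@dataclass(frozen=True)
class Select:
    theta: Callable        # predicate evaluated on null-free tuples only
    theta_is_valid: bool   # assumption: the caller states whether theta is valid
    rel: Rel               # selections are already pushed onto relation symbols


@dataclass(frozen=True)
class Project:
    positions: Tuple[int, ...]
    sub: object


@dataclass(frozen=True)
class Inter:
    left: object
    right: object


@dataclass(frozen=True)
class Union:
    left: object
    right: object


def has_null(t):
    return any(isinstance(x, Null) for x in t)


def certain(q, db):
    """Bag of certain answers of q on db, following the inductive rules."""
    if isinstance(q, Rel):
        return Counter(db[q.name])
    if isinstance(q, Select):
        out = Counter()
        for t, n in Counter(db[q.rel.name]).items():
            if not has_null(t):
                if q.theta(t):
                    out[t] = n          # null-free tuple satisfying theta
            elif q.theta_is_valid:
                out[t] = 1              # tuple with nulls, theta valid
        return out
    if isinstance(q, Project):
        out = Counter()
        for u, n in certain(q.sub, db).items():
            out[tuple(u[i] for i in q.positions)] += n
        return out
    if isinstance(q, Inter):
        return certain(q.left, db) & certain(q.right, db)   # pointwise minimum
    if isinstance(q, Union):
        return certain(q.left, db) + certain(q.right, db)   # additive bag union
    raise TypeError(f"unsupported query node: {q!r}")


db = {"R": [(1, 2), (1, 2), (Null(1), 3)], "S": [(1, 2), (4, 5)]}
```

Each rule touches every tuple of the intermediate bags a bounded number of times, which is where the polynomial bound comes from. As the text notes, the rules are only justified for non-repeating queries, so the sketch should not be applied when a relation symbol occurs twice.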

Then we show how to evaluate $\Box(Q_0, D)$. Informally, in order to evaluate $Q_0$ one has to compute the possible answers of $Q_2$ which match an element of $Q_1$. But first notice that, as intersection is not allowed in $Q_2$, we can push the projection operators down.

Lemma 2. For every $Q \in \mathcal{Q}_2$ there exists $Q' \in \mathcal{Q}_2$ such that $|Q'| = O(|Q|)$, for every complete database $D$, $Q(D) = Q'(D)$, and every projection operator of $Q'$ is applied to a relation symbol or to a selection operator over a relation symbol.

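Then, from Lemmas 1 and 2, we only have to consider queries of the form

$$Q_0 = Q_1 \setminus \bigcup_{R \in Q_2} \pi_{\alpha_R}(\sigma_{\theta_R}(R)) \quad \text{with } Q_1 \in \mathcal{Q}_1 \wedge Q_2 \in \mathcal{Q}_2.$$

For every $R \in Q_2$ and every $t \in R$ we build a set

$$\Downarrow_{Q_1,R}(t) = \{ u \in \Box(Q_1, D) \mid \exists v \text{ a valuation},\ \theta_R(v(t)) \wedge \pi_{\alpha_R}(v(t)) = v(u) \}.$$

As the relations, and hence the nulls, cannot repeat, $\Box(Q_1, D)$ can be computed independently of $Q_2$ and is at most of the size of $D$; hence, for each $R$, the set $\Downarrow_{Q_1,R}$ can be built in polynomial time.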

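Moreover, for every $u \in \Box(Q_1, D)$ we build a bag $\Uparrow_{Q_2}(u)$ such that, for every tuple $t$ over $\mathrm{Const} \cup \mathrm{Null}$,

$$\#(t, \Uparrow_{Q_2}(u)) = \sum_{R \in \{R \in Q_2 \mid u \in \Downarrow_{Q_1,R}(t)\}} \#(t, R).$$

Here, $\Uparrow_{Q_2}(u)$ is the bag of elements in $\bigcup_{R \in Q_2} R$ which unify with $u \in \Box(Q_1, D)$. Then a tuple $u$ belongs to the certain answers of $Q_0$ if and only if the multiplicity of $u$ in $\Box(Q_1, D)$ is greater than the number of elements in $\Uparrow_{Q_2}(u)$.

Proposition 1. Let $Q_0 = Q_1 \setminus \bigcup_{R \in Q_2} \pi_{\alpha_R}(\sigma_{\theta_R}(R))$ with $Q_1 \in \mathcal{Q}_1 \wedge Q_2 \in \mathcal{Q}_2$. Then $\#(u, \Box(Q_0, D)) = \max(0, \#(u, \Box(Q_1, D)) - |\Uparrow_{Q_2}(u)|)$, and the evaluation of $\Box(Q_0, D)$ is tractable.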
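The counting step of Proposition 1 can be sketched as follows, reading the proposition as "multiplicity of $u$ in the certain answers of $Q_1$, minus the number of right-hand tuples that unify with $u$, floored at 0". The sketch makes simplifying assumptions that are not in the paper: every selection condition $\theta_R$ is taken to be trivially true, each part of $Q_2$ is given as a pair of a relation name and projection positions, and unification is checked position by position, which is adequate only because nulls do not repeat. All names (`Null`, `unifies`, `up_size`, `certain_difference`) are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    index: int


def unifies(a, b):
    """Two tuples unify if, at each position, one side is a null or the constants agree."""
    return len(a) == len(b) and all(
        isinstance(x, Null) or isinstance(y, Null) or x == y
        for x, y in zip(a, b))


def up_size(u, q2_parts, db):
    """Number of tuples (with multiplicity) of the Q2 relations whose projection unifies with u."""
    total = 0
    for name, positions in q2_parts:
        for t, n in Counter(db[name]).items():
            if unifies(tuple(t[i] for i in positions), u):
                total += n
    return total


def certain_difference(cert_q1, q2_parts, db):
    """max(0, #(u, cert_q1) - up_size(u)) for every u in cert_q1."""
    out = Counter()
    for u, n in cert_q1.items():
        m = max(0, n - up_size(u, q2_parts, db))
        if m > 0:
            out[u] = m
    return out


db = {"S": [(1, 7), (Null(1), 8)], "T": [(2,)]}
cert_q1 = Counter({(1,): 3, (2,): 1, (5,): 1})   # assumed already computed for Q1
q2_parts = [("S", (0,)), ("T", (0,))]            # projections of S and T on their first attribute
result = certain_difference(cert_q1, q2_parts, db)
```

In this example `(1,)` survives with multiplicity 1 because at most two right-hand tuples can cancel its three copies, while `(5,)` is lost because the null in `S` may be valued as 5.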

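However, there exists a query $Q_0$ such that $\forall u,\ \#(u, \Box(Q_0, D)) = 0$ and $\Box(\pi_\emptyset(Q_0), D) \neq \emptyset$. In order to compute $\Box(\pi_\emptyset(Q_0), D)$, we want to check whether there exists a matching between $\Box(Q_1, D)$ and the elements that unify with it.

Proposition 2. Let $Q_0 = Q_1 \setminus \bigcup_{R \in Q_2} \pi_{\alpha_R}(\sigma_{\theta_R}(R))$ with $Q_1 \in \mathcal{Q}_1 \wedge Q_2 \in \mathcal{Q}_2$. Then $\Box(\pi_\emptyset(Q_0), D) = \emptyset$ if and only if there exists an injective function $m : \Box(\pi_\emptyset(Q_1), D) \to \bigcup_{R \in Q_2} R$ such that $\forall t \in \Box(\pi_\emptyset(Q_1), D),\ m(t) \in \Uparrow_{Q_2}(t)$. Such an $m$ is a two-dimensional (2DM) matching, and the evaluation of $\Box(\pi_\emptyset(Q_0), D)$ is tractable [6].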
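Deciding whether such an injective function exists is a bipartite matching problem. The sketch below uses the standard augmenting-path method (often called Kuhn's algorithm), which is one well-known polynomial-time way to do this and is not necessarily the procedure from [6]. The function name and the input format, a dictionary from each left element to the right elements it may be matched with, are assumptions for illustration. Right elements must be distinct identifiers, so occurrences of a tuple in a bag should be labelled individually.

```python
def has_injective_matching(candidates):
    """True iff there is an injective m with m(t) in candidates[t] for every key t."""
    match_of_right = {}   # right element -> left element currently matched to it

    def try_assign(t, seen):
        # Look for an augmenting path starting at the left element t.
        for r in candidates[t]:
            if r in seen:
                continue
            seen.add(r)
            # r is free, or its current partner can be moved elsewhere.
            if r not in match_of_right or try_assign(match_of_right[r], seen):
                match_of_right[r] = t
                return True
        return False

    # If some left element cannot be matched, no left-saturating matching exists.
    return all(try_assign(t, set()) for t in candidates)


can_cover = has_injective_matching({"u1": ["a", "b"], "u2": ["a"]})
cannot_cover = has_injective_matching({"u1": ["a"], "u2": ["a"]})
```

In the first call `u1` is first matched to `a`, then moved to `b` so that `u2` can take `a`. In the second call both elements compete for `a`, so no injective function exists.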

3 Extending the fragment

In this section we discuss the difficulties of extending the fragment $\mathrm{RA}_{H,NR}$ while keeping the evaluation of certain answers tractable.

Proposition 3. For each of the following extensions of $\mathrm{RA}_{H,NR}$:

- allowing cross-product,
- allowing repetition of relation symbols,
- allowing intersection on the right-hand side of the $Q_1 \setminus Q_2$ operator,
- allowing difference on the right-hand side of the $Q_1 \setminus Q_2$ operator,

the data complexity of evaluating certain answers is co-NP-hard.

4 Conclusions

In this paper we have exhibited a fragment for which computing certain answers for SQL nulls is tractable. We have also shown that adding features to the fragment quickly leads to intractability. The next question concerns maximality: in order to find a dichotomy for certain answers with SQL nulls, one would have to consider equivalently expressive classes of queries. Once we fully understand what leads to intractability, we will be able to design a more accurate approximation scheme for SQL nulls.

1. Serge Abiteboul, Paris Kanellakis, and Gosta Grahne. On the representation and querying of sets of possible worlds. Theoretical Computer Science, 78(1):159-187, 1991.

2. Marco Console, Paolo Guagliardo, and Leonid Libkin. On querying incomplete information in databases under bag semantics. IJCAI.

3. Paolo Guagliardo and Leonid Libkin. Correctness of SQL Queries on Databases with Nulls. ACM SIGMOD Record, 46(3):5-16, 2017.

4. Paolo Guagliardo and Leonid Libkin. On the Codd semantics of SQL nulls. Alberto Mendelzon Workshop, 36t, 2017.

5. Tomasz Imielinski and Witold Lipski Jr. Incomplete information in relational databases. Journal of the ACM (JACM), 31(4):761-791, 1984.

6. Eugene L Lawler. Combinatorial optimization: networks and matroids. Courier Corporation, 1976.

7. Leonid Libkin. SQL's three-valued logic and certain answers. ACM Transactions on Database Systems (TODS), 41(1):1, 2016.

8. Dan Suciu, Dan Olteanu, Christopher Re, and Christoph Koch. Probabilistic databases. Synthesis Lectures on Data Management, 3(2):1-180, 2011.

9. Moshe Vardi. The complexity of relational query languages. In Proceedings of the fourteenth annual ACM symposium on Theory of computing, pages 137-146. ACM, 1982.