
Certain Answers of Extensions of Conjunctive Queries by Datalog and First-Order Rewriting

Amélie Gheerbrant

Leonid Libkin (0, 2)

Alexandra Rogova (1, 3)

Cristina Sirangelo (0, 3)

0 DI ENS, ENS, PSL University, CNRS, Inria Paris, France
1 Data Intelligence Institute of Paris (diiP), Inria
2 School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK
3 Université Paris Cité, CNRS, IRIF, F-75013, Paris, France


To answer database queries over incomplete data, the gold standard is finding certain answers: those that are true regardless of how incomplete data is interpreted. Such answers can be found efficiently for conjunctive queries and their unions, even in the presence of constraints such as keys or functional dependencies. With negation added, however, finding certain answers becomes intractable. In this paper we exhibit a well-behaved class of queries that extends unions of conjunctive queries with a limited form of negation and that permits efficient computation of certain answers, even in the presence of constraints, by means of rewriting into Datalog with negation. The class consists of queries that are the closure of conjunctive queries under the Boolean operations of union, intersection and difference. We show that for these queries, certain answers can be expressed in Datalog with negation, even in the presence of functional dependencies, thus making them tractable in data complexity. We show that in general Datalog cannot be replaced by first-order logic, but without constraints such a rewriting can be done in first-order logic.

Keywords: Incomplete information, Certain answers, Datalog rewritings, First-order rewritings, Functional dependencies, Chase

1. Introduction

We study the classical problem of answering queries over databases with incomplete information where incompleteness is represented by means of nulls, as in the most common practice in relational databases. Such databases are required to satisfy integrity constraints, most commonly keys. This addition of constraints makes query answering more complex, even for fairly simple queries.

We consider the setting of databases with marked nulls, as is often required in applications such as data integration, data exchange, ontology-based data access, and others. Marked nulls are markers for missing information, where the same marker may appear in multiple places. They generalize nulls in SQL and relational database management systems, where repetition is not allowed; our results will thus apply to SQL nulls as well. We use the standard model of query answering, namely finding certain answers, which are guaranteed to be true regardless of the interpretation of nulls. Over the years, we have learned that query answering is generally easy for conjunctive queries (CQs) and closely related classes, while becoming computationally infeasible for more general queries. For example, in the absence of constraints such as keys, and under the prevalent closed-world semantics used in the case of database incompleteness [1, 2], we know that:
• Certain answers to conjunctive queries and their unions can be found by naïve evaluation, i.e., the standard evaluation of queries in which nulls are treated as new distinct constants.
• This can be extended to queries with a limited form of guarded negation [3]; in fact, the limits of naïve evaluation are dictated by the notion of query preservation under homomorphisms.

Under constraints, even such simple ones as keys, the picture is less complete. We know the following:
• Certain answers to a conjunctive query $Q$ (or a union of CQs) on a database $D$ under key constraints $\Sigma$ can be found by naïve evaluation of $Q$ on the result of the chase of $D$ with $\Sigma$. Mathematically, $\mathrm{cert}_\Sigma(Q, D) = Q(\mathrm{chase}_\Sigma(D))$, where on the left-hand side we have certain answers under constraints, and on the right-hand side the naïve evaluation of $Q$ over the result of the chase. Here $\mathrm{chase}_\Sigma$ refers to the classical textbook chase procedure with keys, or more generally functional dependencies. In fact the above result applies when $\Sigma$ is a set of functional dependencies, not just keys.

Unfortunately the above result does not work when we move outside the class of select-project-join-union queries, or unions of CQs. In fact even without constraints, certain answers to a query of the form $Q_1 - Q_2$, where both $Q_1$ and $Q_2$ are CQs, are not necessarily produced by naïve evaluation. To see why, take a database containing one fact $R(1, \bot)$, where $\bot$ is a null, with $Q_1$ returning $R$ while $Q_2$ is given by the formula $R(x, y) \wedge x = y$. Here naïve evaluation of $Q_1 - Q_2$ returns $\{(1, \bot)\}$ while the set of certain answers is empty: under the valuation that sends $\bot$ to 1, the fact becomes $R(1, 1)$, which is returned by $Q_2$.
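The gap between naïve evaluation and certain answers in this example can be checked by brute force. The following is a minimal sketch, not part of the paper's method: nulls are encoded (hypothetically) as strings starting with an underscore, and it is assumed that trying the constants of the database plus one fresh constant per null is enough, which holds for generic queries such as this one.

```python
from itertools import product

def is_null(value):
    # Hypothetical encoding: nulls are strings starting with "_".
    return isinstance(value, str) and value.startswith("_")

def q_diff(rel):
    """Q1 - Q2, where Q1 returns R and Q2(x, y) = R(x, y) AND x = y."""
    q1 = set(rel)
    q2 = {(x, y) for (x, y) in rel if x == y}
    return q1 - q2

def apply(valuation, tup):
    # Replace each null by its value; constants are left untouched.
    return tuple(valuation.get(x, x) for x in tup)

def valuations(rel):
    # Assumption: constants of the database plus one fresh constant per null suffice.
    nulls = sorted({x for t in rel for x in t if is_null(x)})
    consts = sorted({x for t in rel for x in t if not is_null(x)}, key=repr)
    fresh = [f"fresh{i}" for i in range(len(nulls))]
    for choice in product(consts + fresh, repeat=len(nulls)):
        yield dict(zip(nulls, choice))

def certain_answers(query, rel, arity=2):
    # Keep a candidate tuple only if its image is an answer in every possible world.
    adom = {x for t in rel for x in t}
    answers = set(product(adom, repeat=arity))
    for v in valuations(rel):
        world = {apply(v, t) for t in rel}
        result = query(world)
        answers = {a for a in answers if apply(v, a) in result}
    return answers

D = {(1, "_bot")}
naive = q_diff(D)                     # nulls compared syntactically
certain = certain_answers(q_diff, D)  # intersection over all valuations
```

Here `naive` contains the tuple `(1, "_bot")` while `certain` is empty, because the valuation mapping the null to 1 removes the only candidate.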

This motivates our question: can we extend the class of CQs and their unions and still obtain tractable evaluation of certain answers under constraints such as keys and functional dependencies? The answer is positive; in fact the query of the form $Q_1 - Q_2$ above will be an example of a query in this class. To start with, the class must be such that finding certain answers for its queries without constraints is already tractable. We know one such class: it consists of arbitrary Boolean combinations of CQs, not just their unions. We shall denote it by BCCQ. It was proved in [4] that certain answers for it can be found in polynomial time, though the procedure was tableau-based and not particularly suitable for implementation in a database system. To be implementable, certain answers should ideally be expressible in a database query language: in an ideal world, in FO (and thus basic SQL), or at least in Datalog (and thus recursive SQL).

This is precisely what we do in this paper. We establish three main results:
1. For an arbitrary BCCQ $Q$ and a set of functional dependencies $\Sigma$ one can construct a Datalog (with negation) query $Q'$ whose naïve evaluation computes $\mathrm{cert}_\Sigma(Q, D)$, thereby ensuring its polynomial-time data complexity.
2. There are however simple BCCQs $Q$, in fact even CQs, and keys $\Sigma$ such that $\mathrm{cert}_\Sigma(Q, D)$ cannot be expressed in FO.
3. Without constraints present, certain answers to BCCQs are not only polynomial-time computable, as had been shown previously, but can also be expressed in FO and thus efficiently implemented in SQL databases.

After giving preliminaries in the next Section, the following three sections address these items, respectively.

2. Preliminaries

Incomplete databases and constraints

We represent missing information in relational databases in the standard way using nulls [5, 1, 6]. Incomplete databases are populated by constants and nulls, coming respectively from two countably infinite sets $\mathrm{Const}$ and $\mathrm{Null}$. We denote nulls by $\bot$, sometimes with sub- or superscripts. We also allow them to repeat, thus adopting the model of marked nulls, as customary in the context of applications such as OBDA or data integration and exchange. A relational schema, or vocabulary $\sigma$, is a set of relation names with associated arities. A database $D$ over $\sigma$ associates to each relation name of arity $k$ in $\sigma$ a $k$-ary relation which is a finite subset of $(\mathrm{Const} \cup \mathrm{Null})^k$. Sets of constants and nulls occurring in $D$ are denoted by $\mathrm{Const}(D)$ and $\mathrm{Null}(D)$. A database $D$ is complete if it contains no nulls, i.e. $\mathrm{Null}(D) = \emptyset$. The active domain of $D$ is the set of all values appearing in $D$, i.e. $\mathrm{adom}(D) = \mathrm{Const}(D) \cup \mathrm{Null}(D)$.

A valuation $v : \mathrm{Null}(D) \to \mathrm{Const}$ on a database $D$ is a map that assigns constant values to nulls occurring in $D$. By $v(D)$ and $v(\bar a)$ we denote the result of replacing each null $\bot$ by $v(\bot)$ in a database $D$ or in a tuple $\bar a$. The semantics $[\![D]\!]$ of an incomplete database $D$ is the set $\{v(D) \mid v \text{ is a valuation on } D\}$ of all complete databases it can represent. Here, as is common in research on incomplete data, we use the closed world assumption [1, 7] (i.e., everything we do not know to be true is automatically assumed to be false and no new tuple can be added).

A functional dependency over a relation name $R$ is a first-order sentence of the form $\forall \bar z\, \bar u\, \bar u'\, y\, y'\ \big(R(\bar z, \bar u, y) \wedge R(\bar z, \bar u', y') \rightarrow y = y'\big)$.

Throughout this paper we will assume that a set of functional dependencies $\Sigma$ is associated with the database schema $\sigma$.

A valuation $v$ is consistent with $\Sigma$ (or just consistent, when $\Sigma$ is clear from the context) if $v(D) \models \Sigma$. We denote by $V(D)$ the set of all consistent valuations defined on $D$.
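As a small illustration of consistency, the sketch below applies two valuations to a database with two nulls and checks a key on the first attribute. The string encoding of nulls and the helper names `satisfies_fd` and `apply_valuation` are assumptions made for this example only.

```python
def satisfies_fd(rel, lhs, rhs):
    """True if any two tuples that agree on the positions in lhs also agree on position rhs."""
    seen = {}
    for t in rel:
        key = tuple(t[i] for i in lhs)
        if seen.setdefault(key, t[rhs]) != t[rhs]:
            return False
    return True

def apply_valuation(v, rel):
    # Replace nulls (keys of v) by constants; other values are unchanged.
    return {tuple(v.get(x, x) for x in t) for t in rel}

# R(1, _n1) and R(1, _n2), with the first attribute of R as a key.
D = {(1, "_n1"), (1, "_n2")}
v_good = {"_n1": "a", "_n2": "a"}   # collapses both nulls: consistent
v_bad = {"_n1": "a", "_n2": "b"}    # violates the key: not consistent

good_world = apply_valuation(v_good, D)
bad_world = apply_valuation(v_bad, D)
```

Only `v_good` belongs to the set of consistent valuations: its image is the single fact `(1, "a")`, whereas the image of `v_bad` contains two facts that agree on the key and differ on the second attribute.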

Query answering

A $k$-ary query $Q$ of active domain $C \subseteq \mathrm{Const}$ is a map that associates with a database $D$ a subset of $(\mathrm{adom}(D) \cup C)^k$. To answer a $k$-ary query $Q$ over an incomplete database $D$ we follow [8] and adopt a slight generalisation of the usual intersection-based notion of certain answers, defined as $\bigcap_v Q(v(D))$. The set of certain answers to $Q$ over $D$ is $\mathrm{cert}_\Sigma(Q, D) = \{\bar a \in \mathrm{adom}(D)^k \mid v(\bar a) \in Q(v(D)) \text{ for all consistent } v\}$. For queries that explicitly use constants, we shall expand this to allow $\bar a$ to range over $\mathrm{adom}(D)$ and those constants. The only difference with the usual notion is that we allow answers to contain nulls, to avoid pathological situations when answers known with certainty are not returned (e.g., in a query returning a relation $R$ one would expect $R$ to be returned, while the intersection-based certain answer will only return null-free tuples).

We study the certain answers problem from the data complexity perspective, fixing the query:

Problem: CertainAnswer$_\Sigma(Q)$
Input: A database $D$ and a tuple $\bar a$
Question: Is $\bar a \in \mathrm{cert}_\Sigma(Q, D)$?

For arbitrary FO queries and sets of FDs, under the closed world semantics, the data complexity of finding certain answers is coNP-complete (to show $\bar a \notin \mathrm{cert}_\Sigma(Q, D)$ it is enough to guess a valuation $v$ with $v(D) \models \Sigma$ and $v(D) \not\models Q(v(\bar a))$); the problem is coNP-hard even when $\Sigma$ is empty [9].

Query languages

Here we shall study certain answers to first-order (FO) queries by means of their rewriting in Datalog. FO queries of vocabulary $\sigma$ use atomic relational and equality formulae and are closed under Boolean connectives $\wedge, \vee, \neg$ and quantifiers $\exists, \forall$. We write $\varphi(\bar x)$ for an FO formula with free variables $\bar x$. With slight abuse of notation, $\bar x$ will denote both a tuple of variables and the set of variables occurring in it. The set of constants used by $\varphi$ is denoted by $\mathrm{adom}(\varphi)$. We interpret FO formulas under active domain semantics, i.e. quantified variables range over $\mathrm{adom}(D) \cup \mathrm{adom}(\varphi)$. Thus, an FO formula $\varphi(\bar x)$ represents a query (of active domain $\mathrm{adom}(\varphi)$) mapping each database $D$ into the set of tuples $\{\bar a \text{ over } \mathrm{adom}(D) \cup \mathrm{adom}(\varphi) \mid D \models \varphi(\bar a)\}$.

A Datalog rule [5] is an expression of the form $R_1(u_1) \leftarrow R_2(u_2), \dots, R_n(u_n)$ where $n \ge 1$, $R_1, \dots, R_n$ are relation names and $u_1, \dots, u_n$ are free tuples of appropriate arities. Each variable occurring in $u_1$ must occur in at least one of $u_2, \dots, u_n$. A Datalog program is a finite set of Datalog rules. The head of the rule is the expression $R_1(u_1)$, and $R_2(u_2), \dots, R_n(u_n)$ forms the body. The semantics is the standard fixed-point semantics.

As the language of our rewritings, we shall be using a fragment of stratified Datalog with negation in bodies that can be seen in two different ways.

1. A program is evaluated in two steps. First, we can have a Datalog program defining new idb predicates $P_1, \dots, P_\ell$. Then we ask an FO query over the schema extended with these predicates $P_1, \dots, P_\ell$.
2. We evaluate a stratified Datalog with negation program in which the first stratum has no negation (but may have recursion) and the second stratum has no recursion (but may have negation).

From the rewritings we produce it will be clear that they fall in these classes. The key point about them is that they can be implemented in recursive SQL, and that they both have PTIME data complexity, making them feasible.
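The two-step reading can be illustrated on a toy program that is unrelated to the rewritings themselves: a recursive, negation-free stratum computes a transitive closure, and a second, non-recursive stratum uses negation over the result. The function names are hypothetical.

```python
def transitive_closure(edges):
    """First stratum: recursion without negation, evaluated as a least fixed point."""
    tc = set(edges)
    while True:
        new = {(a, d) for (a, b) in tc for (c, d) in edges if b == c} - tc
        if not new:
            return tc
        tc |= new

def unreachable_pairs(edges):
    """Second stratum: negation without recursion, over the predicate computed above."""
    nodes = {x for e in edges for x in e}
    tc = transitive_closure(edges)
    return {(a, b) for a in nodes for b in nodes if a != b and (a, b) not in tc}

E = {(1, 2), (2, 3)}
closure = transitive_closure(E)
unreachable = unreachable_pairs(E)
```

Because negation is applied only after the fixed point has been reached, the evaluation order is unambiguous, which is what stratification guarantees in this fragment.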

Naïve evaluation and certain answers

For a query $Q$ written in FO or Datalog, we write $Q(D)$ to mean that such a query is evaluated naïvely. That is, if $D$ contains nulls, nulls of $D$ are treated as new constants in the domain of $D$, distinct from each other, and distinct from all the other constants in $D$ and $Q$. For example the query $Q(x, y) = \exists z\, (R(x, z) \wedge R(z, y))$, on the database $D = \{R(1, \bot_1), R(\bot_1, \bot_2), R(\bot_3, 2)\}$, selects only the tuple $(1, \bot_2)$.
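The example above can be replayed directly. In this sketch nulls are encoded (as an assumption of the example) as distinct strings, so ordinary equality treats them as pairwise distinct constants, which is exactly what naïve evaluation prescribes.

```python
def naive_join(rel):
    """Q(x, y) = exists z (R(x, z) and R(z, y)), with nulls compared syntactically."""
    return {(x, y) for (x, z1) in rel for (z2, y) in rel if z1 == z2}

# D = {R(1, _n1), R(_n1, _n2), R(_n3, 2)}
D = {(1, "_n1"), ("_n1", "_n2"), ("_n3", 2)}
answers = naive_join(D)
```

The only join that succeeds is through the shared null `_n1`, so `answers` is the single tuple `(1, "_n2")`.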

There are known connections between naïve evaluation and certain answers. If $\Sigma$ is empty and $Q$ is a union of conjunctive queries, then $\mathrm{cert}_\Sigma(Q, D) = Q(D)$, see [1]. If $\Sigma$ contains a set of FDs, then $\mathrm{cert}_\Sigma(Q, D) = Q(\mathrm{chase}_\Sigma(D))$; cf. [10]. Here $\mathrm{chase}_\Sigma$ refers to the standard chase procedure with a set of FDs [5].
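The role of the chase can be seen on a tiny instance. The sketch below is a simplified chase for a single key (the first attribute of a binary relation determines the second), not the general textbook procedure; the dictionary encoding of the database and the names `chase_key` and `q` are assumptions of the example.

```python
def is_null(value):
    # Hypothetical encoding: nulls are strings starting with "_".
    return isinstance(value, str) and value.startswith("_")

def chase_key(db, rel):
    """Chase db with the FD 'the first attribute of the binary relation rel is a key'."""
    db = {name: set(tuples) for name, tuples in db.items()}
    while True:
        merge = None
        first_seen = {}
        for x, y in sorted(db[rel], key=repr):
            y0 = first_seen.setdefault(x, y)
            if y0 == y:
                continue
            if is_null(y):
                merge = (y, y0)          # replace the null y by y0
            elif is_null(y0):
                merge = (y0, y)          # replace the null y0 by y
            else:
                raise ValueError("chase fails: two distinct constants forced equal")
            break
        if merge is None:
            return db
        old, new = merge
        # Apply the substitution to every relation of the database.
        db = {name: {tuple(new if v == old else v for v in t) for t in tuples}
              for name, tuples in db.items()}

def q(db):
    """Boolean CQ: exists x (S(x) and T(x)), evaluated naively."""
    return bool(db["S"] & db["T"])

D = {"R": {(1, "_n1"), (1, "_n2")}, "S": {("_n1",)}, "T": {("_n2",)}}
chased = chase_key(D, "R")
before, after = q(D), q(chased)
```

Naïve evaluation on `D` itself answers false, while on the chased database the two nulls have been identified and the answer is true, matching the fact that every consistent valuation must give both nulls the same value.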

3. Datalog Rewriting

Recall that conjunctive queries (CQs) are given by the ∃, ∧-fragment of FO, and their unions (UCQs) by the ∃, ∧, ∨-fragment of FO; these are also captured by the positive fragment of relational algebra (select-project-union-join queries).

To extend tractability results for certain answers to CQs and UCQs, we extend them with a mild form of negation (since adding unrestricted negation leads to coNP-hardness of certain answers). This mild form comes in the shape of Boolean combinations of conjunctive queries (BCCQs), i.e., the closure of conjunctive queries under the operations $Q \cap Q'$, $Q \cup Q'$, and $Q - Q'$.

If there are no constraints in Σ , finding certain answers to BCCQs is known to be tractable [ 4 ], though by tableau-based techniques that are hard to implement in a database system. We now extend this in two ways. First, we show that tractability is preserved even in the presence of functional dependencies (and thus keys). Second, we show that certain answers can be obtained by rewriting into a fragment of Datalog as described in Section 2. In particular, it means that certain answers can be found by a query expressible in recursive SQL.

Building on constraint-free rewriting techniques from [ 11 ], we start by putting each conjunctive query in a normal form which eliminates repetition of variables, by introducing new equality atoms.

Definition 3.1 (NRV normal form). A conjunctive query $Q$ is in non-repeating variable normal form (NRV normal form) whenever it is of the form $Q(\bar x) = \exists \bar w\, (q(\bar w) \wedge e(\bar x, \bar w))$ where variables in $\bar x\, \bar w$ are pairwise distinct, and:
• $q(\bar w)$ is a conjunction of relational atoms without constants, where each free variable in $\bar w$ has at most one occurrence in $q$,
• $e(\bar x, \bar w)$ is a conjunction of equality atoms, possibly using constants, where each variable of $\bar x$ is involved in at least one equality.

We say that $q(\bar w)$ is the relational subquery of $Q$, and $e(\bar x, \bar w)$ is the equality subquery of $Q$.

A BCCQ is in NRV normal form if it is a Boolean combination of CQs in NRV normal form.

Clearly every CQ $Q$ is equivalent to a query in NRV normal form; moreover, $Q$ can easily be rewritten in NRV normal form (in linear time in the size of the query). Thus, in what follows, we assume w.l.o.g. that CQs are given in NRV normal form. Intuitively the NRV normal form allows us to separate the two ingredients of a CQ: the existence of facts in some relations of the database on the one side, and a set of equality conditions on data values occurring in these facts on the other side. The existence of facts does not depend on the valuation of nulls, and thus can be directly tested on the incomplete database. In contrast, equality atoms in an NRV normal form impose conditions that valuations need to satisfy in order for the query to hold.
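A possible linear-time translation into NRV normal form is sketched below. It is one way to realise the definition, under assumptions made for the example: variables are strings, constants are non-strings, atoms are pairs of a relation name and an argument tuple, and fresh variables are named `w0`, `w1`, and so on (none of the input variables may use these names).

```python
def to_nrv(free_vars, atoms):
    """Split a CQ into a relational subquery over fresh variables and an equality subquery."""
    fresh = 0
    rel_atoms, equalities = [], []
    representative = {}            # first fresh variable used for each existential variable
    for name, args in atoms:
        new_args = []
        for term in args:
            w = f"w{fresh}"
            fresh += 1
            new_args.append(w)
            if not isinstance(term, str) or term in free_vars:
                equalities.append((w, term))                   # constant or free variable
            elif term in representative:
                equalities.append((w, representative[term]))   # repeated existential variable
            else:
                representative[term] = w                       # first occurrence: just rename
        rel_atoms.append((name, tuple(new_args)))
    return rel_atoms, equalities

# Q(x) = exists y (R(x, y) and R(y, 1))
q_rel, q_eq = to_nrv(["x"], [("R", ("x", "y")), ("R", ("y", 1))])
```

For this query the relational subquery becomes `R(w0, w1), R(w2, w3)` and the equality subquery becomes `w0 = x, w2 = w1, w3 = 1`: every position gets its own variable, and all sharing of values is pushed into equalities.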

Given a query $Q$, a database $D$, and a tuple $\bar a$ over $\mathrm{adom}(D) \cup \mathrm{adom}(Q)$ we let the support of $\bar a$ be the set of all valuations that witness it:

$\mathrm{Supp}(Q, D, \bar a) = \{v \in V(D) \mid v(\bar a) \in Q(v(D))\}$

In order to look for rewritings of BCCQs, a key observation is that $\bar a$ is a certain answer to $Q$ if and only if $\mathrm{Supp}(\neg Q, D, \bar a) = \emptyset$. When $Q$ is a BCCQ, so is $\neg Q$; thus we look for ways of expressing (non-)emptiness of the support for BCCQs.

We start by concentrating on the support of equality subqueries. This will be encoded in Datalog and then integrated, as a key ingredient, in the rewriting of the whole query. We let $\omega(\bar x)$ be an arbitrary set of equality atoms among variables $\bar x$ and possibly constants. Intuitively we will be interested in the case that $\omega(\bar x)$ is the equality subquery $e(\bar x, \bar w)$ of a CQ in NRV normal form (thus notice that in the Datalog program below $\bar x$ encompasses the variables $\bar x\, \bar w$ of an equality subquery).

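Membership in the set $\mathrm{adom}(D) \cup \mathrm{adom}(\omega)$ can be expressed by a UCQ formula that we call $\mathrm{Adom}(x)$. We encode equivalence of database elements in $\mathrm{adom}(D) \cup \mathrm{adom}(\omega)$ w.r.t. a set of equalities $\omega(\bar x)$ using the following Datalog program (see footnote 1):

$\mathrm{equiv}_\omega(\bar x, y, y) \leftarrow \bigwedge_i \mathrm{Adom}(x_i),\ \mathrm{Adom}(y)$
$\mathrm{equiv}_\omega(\bar x, y, y') \leftarrow y = t,\ y' = u,\ \bigwedge_i \mathrm{Adom}(x_i)$   for each $(t = u) \in \omega$
$\mathrm{equiv}_\omega(\bar x, y, y') \leftarrow \mathrm{equiv}_\omega(\bar x, y', y)$
$\mathrm{equiv}_\omega(\bar x, y, y') \leftarrow \mathrm{equiv}_\omega(\bar x, y, z),\ \mathrm{equiv}_\omega(\bar x, z, y')$
$\mathrm{equiv}_\omega(\bar x, y, y') \leftarrow R(\bar z, \bar u, y),\ R(\bar z', \bar u', y'),\ \bigwedge_i \mathrm{equiv}_\omega(\bar x, z_i, z'_i)$   for each FD $\big(R(\bar z, \bar u, y), R(\bar z, \bar u', y') \rightarrow y = y'\big) \in \Sigma$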

Intuitively, if $\bar a$ is a tuple of database elements assigned to $\bar x$, equivalent elements of $D$ are the ones which should be collapsed into a single value in order for a valuation of $D$ to satisfy all the equalities $\omega(\bar a)$ and the FDs. For fixed $D$ and $\bar a$, the relation $\{(b, b') \mid D \models \mathrm{equiv}_\omega(\bar a, b, b')\}$ is an equivalence relation over $\mathrm{adom}(D) \cup \mathrm{adom}(\omega)$ where each element of $\mathrm{adom}(D)$ neither in $\bar a$ nor in $\mathrm{adom}(\omega)$ forms a singleton equivalence class.

The formula $\mathrm{equiv}_\omega$ is a key ingredient in our rewriting; as formalized in the following lemma, it selects precisely the pairs of elements that a consistent valuation needs to collapse to satisfy a set of equalities. In Lemmas 3.2, 3.5 and Propositions 3.3, 3.4 we use some of the machinery developed in [11], and thus the proofs of those statements, which are adaptations of proofs in [11], are omitted.

Footnote 1: Queries we write hereafter can be domain dependent, so it is important to recall that we always use active domain semantics.

Lemma 3.2. Let $\omega(\bar x)$ be a conjunction of equality atoms, $D$ a database, and $\gamma(\bar x) = \bar a$ an assignment over $\mathrm{adom}(D) \cup \mathrm{adom}(\omega)$. Assume $v$ is a consistent valuation of nulls; then $v(D) \models \omega(v(\bar a))$ if and only if $v(b) = v(b')$ for all $b, b'$ such that $D \models \mathrm{equiv}_\omega(\bar a, b, b')$.

Formulas we write in the remainder are over the signature $\sigma \cup \{\mathrm{Null}\}$, where $\sigma$ is the database schema. In any incomplete database $D$ over $\sigma \cup \{\mathrm{Null}\}$, the unary predicate $\mathrm{Null}$ is always interpreted by the set of nulls occurring in $D$ (in accordance with the semantics of the SQL construct IS NULL). That is, we allow rewritings to test whether a database element is null or not.

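For $\omega(\bar x)$ a conjunction of equality atoms, using $\mathrm{equiv}_\omega$ we define a new formula $\mathrm{comp}_\omega(\bar x)$ stating the existence of a consistent valuation that collapses all equivalent elements of a tuple:

$\mathrm{comp}_\omega(\bar x) := \forall y\, y'\ \big(\mathrm{equiv}_\omega(\bar x, y, y') \wedge \neg \mathrm{Null}(y) \wedge \neg \mathrm{Null}(y') \rightarrow y = y'\big)$

Proposition 3.3. Let $\omega(\bar x)$ be a conjunction of equality atoms, $D$ a database, and $\gamma(\bar x) = \bar a$ an assignment over $\mathrm{adom}(D) \cup \mathrm{adom}(\omega)$; then $D \models \mathrm{comp}_\omega(\bar a)$ if and only if there exists a consistent valuation $v$ of nulls such that $v(D) \models \omega(v(\bar a))$.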
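The equivalence program and the compatibility test can be sketched outside Datalog with a union-find structure. This is an illustration for a single relation and a single FD, not the rewriting itself; the parameters `lhs`, `rhs` and `forced_pairs` (the pairs of elements equated by the instantiated equalities) are names introduced for the example.

```python
def is_null(value):
    # Hypothetical encoding: nulls are strings starting with "_".
    return isinstance(value, str) and value.startswith("_")

def equiv_classes(rel, lhs, rhs, forced_pairs):
    """Least equivalence containing forced_pairs and closed under the FD lhs -> rhs on rel."""
    elems = {x for t in rel for x in t} | {x for pair in forced_pairs for x in pair}
    parent = {x: x for x in elems}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
        return True

    for a, b in forced_pairs:          # rule for the equalities
        union(a, b)
    changed = True
    while changed:                     # FD rule, iterated up to the fixed point
        changed = False
        for t in rel:
            for u in rel:
                if all(find(t[i]) == find(u[i]) for i in lhs) and union(t[rhs], u[rhs]):
                    changed = True
    classes = {}
    for x in elems:
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())

def compatible(classes):
    """Analogue of the compatibility formula: no class contains two distinct constants."""
    return all(sum(1 for x in c if not is_null(x)) <= 1 for c in classes)

R = {(1, "_n1"), (1, 2)}                               # first attribute is a key
plain = equiv_classes(R, [0], 1, [])                   # the key alone forces _n1 ~ 2
clash = equiv_classes(R, [0], 1, [("_n1", 3)])         # additionally require _n1 = 3
```

With no extra equalities the null is merged with the constant 2 and a consistent valuation exists; requiring the same null to equal 3 puts two distinct constants into one class, so no consistent valuation can satisfy the equalities.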

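We are now ready to define a formula capturing the inclusion of supports between two conjunctions of equality atoms, which will be a crucial ingredient in our rewriting. Let $\omega(\bar x)$ and $\omega'(\bar y)$ be conjunctions of equality atoms with $\mathrm{adom}(\omega) = \mathrm{adom}(\omega')$. We define:

$\mathrm{imp}_{\omega, \omega'}(\bar x, \bar y) := \forall z\, z'\ \big(\mathrm{equiv}_{\omega'}(\bar y, z, z') \rightarrow \mathrm{equiv}_\omega(\bar x, z, z')\big)$

Using Proposition 3.3 and Lemma 3.2 we obtain:

Proposition 3.4. Let $\omega(\bar x)$, $\omega'(\bar y)$ be conjunctions of equality atoms with $\mathrm{adom}(\omega) = \mathrm{adom}(\omega')$, $D$ a database and $\gamma(\bar x) = \bar a$, $\gamma'(\bar y) = \bar a'$ assignments over $\mathrm{adom}(D) \cup \mathrm{adom}(\omega)$. Then $D \models \mathrm{imp}_{\omega, \omega'}(\bar a, \bar a') \vee \neg \mathrm{comp}_\omega(\bar a)$ if and only if for all consistent valuations $v$, one has that $v(D) \models \omega(v(\bar a))$ implies $v(D) \models \omega'(v(\bar a'))$.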

So far, we have dealt with equality subqueries and we have characterized the emptiness and inclusion of their supports (cf. Proposition 3.3 and Proposition 3.4, respectively). We can now use this machinery to characterize the support of a BCCQ. We start by expressing membership in the support of an individual CQ:

Lemma 3.5. Let $D$ be a database, $v$ a consistent valuation of $D$ and $Q(\bar x)$ a conjunctive query in NRV normal form, with relational subquery $q(\bar w)$ and equality subquery $e(\bar x, \bar w)$. Then $v \in \mathrm{Supp}(Q, D, \bar a)$ if and only if there exists $\bar b$ such that $D \models q(\bar b) \wedge \mathrm{comp}_e(\bar a\, \bar b)$ and $v(D) \models e(v(\bar a\, \bar b))$.

In the remainder we consider BCCQs $Q(\bar x) := Q_1(\bar x) \vee \dots \vee Q_n(\bar x)$ in NRV disjunctive normal form (DNF) where for all $1 \le i \le n$:

$Q_i := Q_i^0(\bar x) \wedge \neg Q_i^1(\bar x) \wedge \dots \wedge \neg Q_i^m(\bar x)$

and for all $0 \le j \le m$:

$Q_i^j := \exists \bar w_i^j\ q_i^j(\bar w_i^j) \wedge e_i^j$ with $e_i^j := e_i^j(\bar x\, \bar w_i^j)$.

For convenience, we assume w.l.o.g. every conjunction of literals to be of the same length $m$. We can also assume without loss of generality that for each $i$ we have $\mathrm{adom}(e_i^j) = \mathrm{adom}(e_i^0)$ for all $j$. In fact we can always pad any $e_i^j$ with dummy equalities $c = c$ to extend its active domain.

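Given a disjunct $Q_i$ in a BCCQ $Q$ in DNF, we now define $\mathrm{poss}_{Q_i}$, encoding the set of possible answers to $Q_i$, and $\mathrm{cons}_{Q_i}$, checking the compatibility of an answer with the negative literals in $Q_i$:

$\mathrm{cons}_{Q_i}(\bar x\, \bar w) := \bigwedge_{1 \le j \le m} \forall \bar w'\ \big(q_i^j(\bar w') \wedge \mathrm{comp}_{e_i^j}(\bar x\, \bar w') \rightarrow \neg \mathrm{imp}_{e_i^0, e_i^j}(\bar x\, \bar w, \bar x\, \bar w')\big)$

$\mathrm{poss}_{Q_i}(\bar x\, \bar w) := q_i^0(\bar w) \wedge \mathrm{comp}_{e_i^0}(\bar x\, \bar w) \wedge \mathrm{cons}_{Q_i}(\bar x\, \bar w)$

Using these new formulae, we show that the non-emptiness of $\mathrm{Supp}(Q(\bar x), D, \bar a)$ can be expressed as the existence of a possible answer.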

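Proposition 3.6. Let $D$ be a database and $Q(\bar x)$ a DNF BCCQ in NRV normal form; then $\mathrm{Supp}(Q(\bar x), D, \bar a) \neq \emptyset$ if and only if $D \models \bigvee_{1 \le i \le n} \exists \bar w\ \mathrm{poss}_{Q_i}(\bar a\, \bar w)$.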

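Proof. ($\Leftarrow$) Let $D \models \bigvee_{1 \le i \le n} \exists \bar w\ \mathrm{poss}_{Q_i}(\bar a\, \bar w)$; then there exist $1 \le i \le n$ and an assignment $\gamma$ with $\gamma(\bar w) = \bar b$, $D \models q_i^0(\bar b) \wedge \mathrm{comp}_{e_i^0}(\bar a\, \bar b)$, and for all $1 \le j \le m$ and $\bar b'$ such that $D \models q_i^j(\bar b') \wedge \mathrm{comp}_{e_i^j}(\bar a\, \bar b')$, one has $D \models \neg \mathrm{imp}_{e_i^0, e_i^j}(\bar a\, \bar b, \bar a\, \bar b')$. Since $D \models \mathrm{comp}_{e_i^0}(\bar a\, \bar b)$, it is easy to see that for each $b \in \mathrm{adom}(D) \cup \mathrm{adom}(e_i^0)$ there exists at most one constant $c$ such that $D \models \mathrm{equiv}_{e_i^0}(\bar a\, \bar b, b, c)$. In fact if for constants $c_1$ and $c_2$ we have $D \models \mathrm{equiv}_{e_i^0}(\bar a\, \bar b, b, c_1)$ and $D \models \mathrm{equiv}_{e_i^0}(\bar a\, \bar b, b, c_2)$, then by transitivity $D \models \mathrm{equiv}_{e_i^0}(\bar a\, \bar b, c_1, c_2)$, implying $c_1 = c_2$.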

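Using this observation we now build a consistent valuation $v^*$ having the following "tightness" property: for all $b, b' \in \mathrm{adom}(D) \cup \mathrm{adom}(e_i^0)$, we have $v^*(b) = v^*(b')$ if and only if $D \models \mathrm{equiv}_{e_i^0}(\bar a\, \bar b, b, b')$. To build $v^*$ we associate to each equivalence class $E$ of the relation $\{(b, b') \mid D \models \mathrm{equiv}_{e_i^0}(\bar a\, \bar b, b, b')\}$ a new fresh constant $c_E$ outside $\mathrm{adom}(D) \cup \mathrm{adom}(e_i^0)$. Then $v^*$ can be defined as follows. For $b \in \mathrm{adom}(D)$, if $D \models \mathrm{equiv}_{e_i^0}(\bar a\, \bar b, b, c)$ for some constant $c$, then $v^*(b) = c$; otherwise $v^*(b) = c_E$ where $E$ is the equivalence class of $b$. Consistency of $v^*$ derives from the tightness property, and the fact that $\mathrm{equiv}_{e_i^0}$ satisfies the last rule of the Datalog program that defines it. Moreover by Lemma 3.2, $v^*(D) \models e_i^0(v^*(\bar a\, \bar b))$ and we can prove the following claim:

Claim 3.7. For all conjunctions of equalities $\omega'(\bar y)$ with $\mathrm{adom}(\omega') = \mathrm{adom}(e_i^0)$ and all $\bar c$ over $\mathrm{adom}(D) \cup \mathrm{adom}(e_i^0)$, one has $v^*(D) \models \omega'(v^*(\bar c))$ if and only if for all consistent valuations $v$, $v(D) \models e_i^0(v(\bar a\, \bar b))$ implies $v(D) \models \omega'(v(\bar c))$.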

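Now fix some arbitrary $j \ge 1$ and $\bar b'$ with $D \models q_i^j(\bar b') \wedge \mathrm{comp}_{e_i^j}(\bar a\, \bar b')$. By Proposition 3.4, it follows from $D \models \neg \mathrm{imp}_{e_i^0, e_i^j}(\bar a\, \bar b, \bar a\, \bar b') \wedge \mathrm{comp}_{e_i^0}(\bar a\, \bar b)$ that there exists a consistent valuation $v'$ with $v'(D) \models e_i^0(v'(\bar a\, \bar b))$ but $v'(D) \not\models e_i^j(v'(\bar a\, \bar b'))$. By the above claim, $v^*(D) \not\models e_i^j(v^*(\bar a\, \bar b'))$. In summary we have:
(i) $D \models q_i^0(\bar b) \wedge \mathrm{comp}_{e_i^0}(\bar a\, \bar b)$ and $v^*(D) \models e_i^0(v^*(\bar a\, \bar b))$, and so by Lemma 3.5 we have $v^* \in \mathrm{Supp}(Q_i^0(\bar x), D, \bar a)$, i.e., $v^*(D) \models Q_i^0(v^*(\bar a))$.
(ii) For all $1 \le j \le m$ and assignments $\gamma'$ with $\gamma'(\bar w) = \bar b'$, if $D \models q_i^j(\bar b') \wedge \mathrm{comp}_{e_i^j}(\bar a\, \bar b')$ then $v^*(D) \not\models e_i^j(v^*(\bar a\, \bar b'))$, and so by Lemma 3.5 we have $v^* \notin \mathrm{Supp}(Q_i^j(\bar x), D, \bar a)$, i.e., for all $1 \le j \le m$, $v^*(D) \models \neg Q_i^j(v^*(\bar a))$.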

This means we have $v^* \in \mathrm{Supp}(Q_i^0(\bar x) \wedge \neg Q_i^1(\bar x) \wedge \dots \wedge \neg Q_i^m(\bar x), D, \bar a)$ for this $i$, and so $v^* \in \mathrm{Supp}(Q(\bar x), D, \bar a)$.

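($\Rightarrow$) Let $v \in \mathrm{Supp}(Q(\bar x), D, \bar a)$, so $v$ is consistent and there is some $1 \le i \le n$ with: (i) $v \in \mathrm{Supp}(Q_i^0, D, \bar a)$, (ii) for all $1 \le j \le m$, $v \notin \mathrm{Supp}(Q_i^j, D, \bar a)$. Using Lemma 3.5, (i) implies that there exists $\bar b$ such that $D \models q_i^0(\bar b) \wedge \mathrm{comp}_{e_i^0}(\bar a\, \bar b)$ and $v(D) \models e_i^0(v(\bar a\, \bar b))$. Again by Lemma 3.5, (ii) implies that for all $1 \le j \le m$ and $\bar b'$, if $D \models q_i^j(\bar b') \wedge \mathrm{comp}_{e_i^j}(\bar a\, \bar b')$ then $v(D) \not\models e_i^j(v(\bar a\, \bar b'))$. This entails by Proposition 3.4 that $D \models \mathrm{comp}_{e_i^0}(\bar a\, \bar b) \wedge \neg \mathrm{imp}_{e_i^0, e_i^j}(\bar a\, \bar b, \bar a\, \bar b')$. This shows $D \models \bigvee_{1 \le i \le n} \exists \bar w\ \mathrm{poss}_{Q_i}(\bar a\, \bar w)$.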

Now that we have defined the formula expressing, for a BCCQ $Q$, non-emptiness of $\mathrm{Supp}(Q(\bar x), D, \bar a)$ (Proposition 3.6), we can easily define a rewriting for the problem CertainAnswer$_\Sigma(Q)$. To do so, we rely on the fact that $\bar a \in \mathrm{cert}_\Sigma(Q, D)$ if and only if $\mathrm{Supp}(\neg Q, D, \bar a) = \emptyset$.

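Theorem 3.8 (Datalog rewriting). Let $D$ be a database whose schema contains a set of functional dependencies $\Sigma$, and let $Q(\bar x)$ be a BCCQ in NRV normal form. Let $Q' = Q'_1(\bar x) \vee \dots \vee Q'_n(\bar x)$ be $\neg Q$ in DNF. Then $\bar a \in \mathrm{cert}_\Sigma(Q, D)$ if and only if $D \models \psi_Q(\bar a)$, where $\psi_Q(\bar x) = \bigwedge_{1 \le i \le n} \forall \bar w\ \neg \mathrm{poss}_{Q'_i}(\bar x\, \bar w)$.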

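Proof. One has that $\bar a \in \mathrm{cert}_\Sigma(Q, D)$ if and only if $\mathrm{Supp}(Q', D, \bar a) = \emptyset$. Since $Q'$ is still a BCCQ, Proposition 3.6 tells us that $\mathrm{Supp}(Q', D, \bar a) = \emptyset$ if and only if $D \models \bigwedge_{1 \le i \le n} \forall \bar w\ \neg \mathrm{poss}_{Q'_i}(\bar a\, \bar w)$.

Corollary 3.9. For each fixed BCCQ query $Q$ and a set of FDs $\Sigma$, the complexity of CertainAnswer$_\Sigma(Q)$ is in PTIME.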

4. Non-rewritability in FO

The basic starting point for our investigation was the fact that $\mathrm{cert}_\Sigma(Q, D) = Q(\mathrm{chase}_\Sigma(D))$ for a CQ $Q$ and a set $\Sigma$ of FDs, for every database $D$. This remained true for unions of CQs, but failed for BCCQs, forcing us to produce a Datalog rewriting to obtain certain answers. But can a first-order rewriting be obtained instead? This would make it possible to produce certain answers using the core of SQL as opposed to its recursive features, which do not always perform as well in practice.

In this section we show that the answer, in general, is negative even for CQs (and thus for BCCQs). In the next section however we show that such rewritings can be obtained in FO for BCCQs whenever Σ is empty.

The main result of this section is the following.

Theorem 4.1. There exist a Boolean CQ $Q$ and a single FD $\Sigma$ over a relational schema of binary and unary relations such that $\mathrm{cert}_\Sigma(Q, D)$ is not expressible as an FO query.

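Proof. Consider a schema with one binary relation $R$ and two unary relations $A$ and $B$. The only FD in $\Sigma$ is $\forall x\, \forall y\, \forall z\ \big(R(x, y) \wedge R(x, z) \rightarrow y = z\big)$; in other words, the first attribute of $R$ is a key. The query $Q$ is the Boolean CQ $\exists x\, (A(x) \wedge B(x))$.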

To prove inexpressibility of $\mathrm{cert}_\Sigma(Q, \cdot)$ in FO, for each $n > 0$ we create two databases $D_n$ and $D'_n$. In both of them, $R$ is interpreted as a disjoint union $T_1 \cup T_2$ where $T_1$ and $T_2$ are balanced binary trees of depth $n$ in which all nodes are distinct nulls. In both, $A$ and $B$ are singleton sets. In $D_n$, the set $A$ contains a leaf of $T_1$ and $B$ contains a leaf of $T_2$. In $D'_n$, both $A$ and $B$ contain leaves of $T_1$ such that their only common ancestor in the tree is the root (in other words, they are leaves of subtrees rooted at different children of the root of $T_1$).

Because of the constraint $\Sigma$, for every valuation $v$ such that the resulting database satisfies it, we have that both $v(T_1)$ and $v(T_2)$ are chains. Indeed, consider any node $\bot$ with children $\bot_1, \bot_2$ in $R$. If $v(\bot_1) \neq v(\bot_2)$ then the resulting tuples $(v(\bot), v(\bot_1))$ and $(v(\bot), v(\bot_2))$ violate the constraint. Thus $v(\bot_1) = v(\bot_2)$, and applying this construction inductively we see that $v(T_i)$ is a chain. Hence, it has a single leaf, and thus $\mathrm{cert}_\Sigma(Q, D'_n)$ is true, since $A$ and $B$ must be interpreted as that leaf. On the other hand, $\mathrm{cert}_\Sigma(Q, D_n)$ is false, since there is a valuation that sends $T_1$ and $T_2$ into two disjoint chains, and thus $A$ and $B$ are interpreted as two distinct elements.
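The collapse of a tree of nulls into a chain can be observed by chasing the key constraint. The sketch below builds a balanced binary tree whose nodes are (hypothetically) named by root-to-node paths and repeatedly merges two children of the same parent; since all nodes are nulls the chase never fails.

```python
def balanced_tree(depth):
    """Edges (parent, child) of a balanced binary tree of nulls named '_', '_0', '_01', ..."""
    edges, level = set(), ["_"]
    for _ in range(depth):
        next_level = []
        for parent in level:
            for bit in "01":
                edges.add((parent, parent + bit))
                next_level.append(parent + bit)
        level = next_level
    return edges

def chase_key(edges):
    """Chase with the key 'first attribute determines the second' on a relation of nulls."""
    edges = set(edges)
    while True:
        merge = None
        child = {}
        for parent, c in sorted(edges):
            c0 = child.setdefault(parent, c)
            if c0 != c:                # two different children of one parent
                merge = (c, c0)
                break
        if merge is None:
            return edges
        old, new = merge
        # Identify the two children everywhere in the relation.
        edges = {(new if a == old else a, new if b == old else b) for a, b in edges}

chain = chase_key(balanced_tree(3))
```

For depth 3 the fourteen edges of the tree collapse to a chain of three edges with a single leaf, which is why two leaves of the same tree are necessarily identified by every consistent valuation.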

Assume now that $\mathrm{cert}_\Sigma(Q, \cdot)$ is rewritable as an FO sentence $\varphi$. Then, for every $n > 0$, we have $D'_n \models \varphi$ and $D_n \models \neg\varphi$. We next show that such a sentence cannot exist, thereby proving non-FO-rewritability.

Recall that in a database $D$ (with one binary relation $R$, like the ones considered here) the radius-$r$ neighborhood of an element $a$ is the restriction of $D$ to the set of all elements reachable from $a$ by a path of length at most $r$, where the path does not take into account the orientation of edges of $R$ (for example, if we have $R(a, b)$ and $R(c, a)$ then both $b$ and $c$ are in the radius-1 neighborhood of $a$). When two neighborhoods, of elements $a$ and $b$, are isomorphic, it means that there is an isomorphism between them that sends $a$ to $b$. In other words, centers of neighborhoods are viewed as distinguished elements when it comes to defining neighborhoods. It is known that each first-order sentence $\varphi$ is Hanf-local [12]: that is, there exists a number $r > 0$ such that for any two databases $D_1$ and $D_2$, if there is a bijection $f$ between $D_1$ and $D_2$ such that for every element $a$ the radius-$r$ neighborhoods of $a$ in $D_1$ and $f(a)$ in $D_2$ are isomorphic, then $D_1$ and $D_2$ agree on $\varphi$, i.e. either both satisfy it or both do not.

Now let $r$ be such a number for the sentence $\varphi$ we assumed exists. Consider $D_n$ and $D'_n$, and let $T_1^A, T_1^*$ be the subtrees of the root of $T_1$ in $D_n$ such that the first contains $A$ while the second contains neither $A$ nor $B$, and let $T_2^B, T_2^*$ be defined similarly for subtrees of the root of $T_2$ with respect to $B$. In $D'_n$ we define $T_1'^A, T_1'^B$ as the subtrees of the root of the tree containing $A$ and $B$, such that the first contains the $A$ leaf and the second contains the $B$ leaf, while $T_2'^{*}, T_2'^{**}$ are the subtrees of the root of the tree having neither $A$ nor $B$ elements. Then it is easy to see that the following pairs of trees are isomorphic: $T_1^A$ and $T_1'^A$, $T_2^B$ and $T_1'^B$, $T_1^*$ and $T_2'^{*}$, $T_2^*$ and $T_2'^{**}$.

We now define the bijection $f$ as the union of those isomorphisms plus a mapping of the roots of trees in $D_n$ into the roots of trees in $D'_n$. It is an immediate observation that if $n > r + 1$ (i.e., leaves are not in the radius-$r$ neighborhood of children of roots) then $f$ satisfies the condition that the neighborhoods of $a$ and $f(a)$ of radius $r$ are isomorphic. This would tell us that $D_n$ and $D'_n$ agree on $\varphi$, but we know they do not. This contradiction completes the proof.

As a corollary to the proof, we obtain the following result showing that non-recursive SQL is incapable of computing $\mathrm{cert}_\Sigma(Q, D)$ in the setting of Theorem 4.1.

Corollary 4.2. There exist a Boolean CQ $Q$ and a single FD $\Sigma$ over a relational schema of binary and unary relations such that $\mathrm{cert}_\Sigma(Q, D)$ is not expressible in the basic SELECT-FROM-WHERE-GROUP BY-HAVING fragment of SQL with arbitrary aggregate functions.

This is due to the fact that queries in this fragment of SQL with grouping and aggregation can be translated into a logic with aggregate functions [ 13 ] which itself is known to be Hanf-local [ 14 ].

5. An FO rewriting

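We now focus on the special case where $\Sigma$ is empty. First notice that the only Datalog component in our rewriting was the $\mathrm{equiv}_\omega$ formula. Let $\sim$ be the reflexive symmetric transitive closure of $\{(t, u) \mid t = u \in \omega\}$. As shown in [11], as $\Sigma$ is empty, we can rewrite the formula $\mathrm{equiv}_\omega$ in FO as follows, where $k$ is the number of equivalence classes of $\sim$:

$$\mathrm{equiv}^{\mathrm{FO}}_\omega(\bar x, y, y') := y = y' \ \vee \bigvee_{\substack{t_1, u_1, \dots, t_k, u_k \,\in\, \bar x \cup \mathrm{adom}(\omega) \\ t_i \sim u_i \text{ for all } 1 \le i \le k}} \Big( y = t_1 \wedge y' = u_k \wedge \bigwedge_{1 \le i < k} u_i = t_{i+1} \Big)$$

Intuitively this holds because each disjunct of $\mathrm{equiv}^{\mathrm{FO}}_\omega(\bar x, y, y')$ corresponds to a possible derivation of $(y, y')$ in the reflexive symmetric transitive closure of $\{(\gamma(t), \gamma(u)) \mid t = u \in \omega\}$, and one can prove that there is a bound, depending only on $\omega$, on the number of steps of this derivation.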

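As a consequence, we can rewrite in FO the formula $\mathrm{poss}_{Q_i}$ of Section 3 encoding the set of possible answers to $Q_i$. It is enough to replace each occurrence of the Datalog program $\mathrm{equiv}_\omega(\bar x, y, y')$ in it by $\mathrm{equiv}^{\mathrm{FO}}_\omega(\bar x, y, y')$. We denote by $\mathrm{poss}^{\mathrm{FO}}_{Q_i}$ the rewriting so obtained. With this, we obtain an extension to BCCQs of the rewriting techniques proposed in [11] for UCQs.

Theorem 5.1 (FO rewriting). Let $D$ be a database, $\Sigma = \emptyset$, and let $Q(\bar x)$ be a BCCQ in NRV normal form. Let $Q' = Q'_1(\bar x) \vee \dots \vee Q'_n(\bar x)$ be $\neg Q$ in DNF. Then $\bar a \in \mathrm{cert}_\Sigma(Q, D)$ if and only if $D \models \psi^{\mathrm{FO}}_Q(\bar a)$, where $\psi^{\mathrm{FO}}_Q(\bar x) = \bigwedge_{1 \le i \le n} \forall \bar w\ \neg \mathrm{poss}^{\mathrm{FO}}_{Q'_i}(\bar x\, \bar w)$.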

Note that tractability of BCCQ was already proved in [ 4 ] using tableau based methods. We now refine complexity as follows.

Corollary 5.2. For each fixed BCCQ query $Q$, the complexity of CertainAnswer$_\Sigma(Q)$ is in DLOGSPACE whenever $\Sigma = \emptyset$.

6. Future work

Our rewriting techniques are closer to a practical implementation than the previous tableau-based method from [4]. This is due to their expressibility in recursive SQL (or even non-recursive SQL in the case of Theorem 5.1). However, while theoretically feasible, an actual implementation will need additional techniques to achieve acceptable performance. To see why, notice that the first rule in the definition of $\mathrm{equiv}_\omega$ creates a cross product over the full active domain, i.e., the set of all elements that appear in the database. This of course will be prohibitively large. While this may appear to be a significant obstacle, a similar situation with computing or approximating certain answers is not new in the literature. For instance, the first approximation scheme for certain answers to SQL queries, which appeared in [15], did exactly the same, and generated very large Cartesian products even for simple queries with negation. Nonetheless, an alternative was found quickly [16] that completely avoided the need for such expensive queries, and it was shown to work well on several TPC-H queries. Thus, looking for a practical and implementable rewriting is one of the possible directions for future work.

As another open problem, we note that the query for which we have shown certain answers to be non-rewritable in FO has DLOGSPACE data complexity. Indeed the problem is essentially reachability over trees, which can be easily encoded using deterministic transitive closure [17]. To express DLOGSPACE problems, a language weaker than Datalog with negation is enough. Thus, it is natural to ask whether a low-complexity Datalog fragment would be sufficient to express rewritings of BCCQs, or whether a separating example that is PTIME-complete can be found.

Acknowledgments

We thank the anonymous referees for their useful feedback. This work was supported by ANR grants ANR-18-CE40-0031 (QUID) and ANR-21-CE48-0015 (VeriGraph) as well as EPSRC grant N023056 (MAGIC). We also acknowledge a PGSM Master grant from the FSMP.

[1] Imieliński, W. Lipski, Incomplete information in relational databases, Journal of the ACM 31 (1984) 761-791.

[2] Console, Guagliardo, Libkin, E. Toussaint, Coping with incomplete data: Recent advances, in: ACM PODS, ACM, 2020, pp. 33-47.

[3] Gheerbrant, Libkin, Sirangelo, Naïve evaluation of queries over incomplete databases, ACM Trans. Database Syst. 39 (2014) 31:1-31:42.

[4] Gheerbrant, L. Libkin, Certain answers over incomplete XML documents: Extending tractability boundary, ACM ToCS 57 (2015) 892-926.

[5] Abiteboul, Hull, Vianu, Foundations of Databases, Addison-Wesley, Boston, MA, USA, 1995.

[6] R. van der Meyden, Logical approaches to incomplete information: a survey, in: Logics for databases and information systems, Kluwer Academic Publishers, Norwell, MA, USA, 1998, pp. 307-356.

[7] Reiter, On closed world data bases, in: Logic and Data Bases, 1977, pp. 55-76.

[8] Lipski, On relational algebra with marked nulls, in: PODS, ACM, Waterloo, Ontario, Canada, 1984, pp. 201-203.

[9] Abiteboul, P. C. Kanellakis, G. Grahne, On the representation and querying of sets of possible worlds, TCS 78 (1991) 158-187.

[10] Greco, Molinaro, Spezzano, Incomplete Data and Data Dependencies in Relational Databases, Synthesis Lectures on Data Management, Morgan & Claypool Publishers, 2012.

[11] Gheerbrant, Sirangelo, Best answers over incomplete data: Complexity and first-order rewritings, in: Proceedings of IJCAI-19, 2019, pp. 1704-1710.

[12] Fagin, L. J. Stockmeyer, M. Y. Vardi, On monadic NP vs. monadic co-NP, Inf. Comput. 120 (1995) 78-92.

[13] Libkin, Expressive power of SQL, Theor. Comput. Sci. 296 (2003) 379-404.

[14] Hella, Libkin, Nurmonen, Wong, Logics with aggregate operators, J. ACM 48 (2001) 880-907.

[15] Libkin, SQL's three-valued logic and certain answers, ACM Trans. Database Syst. 41 (2016) 1:1-1:28. URL: https://doi.org/10.1145/2877206. doi:10.1145/2877206.

[16] Guagliardo, Libkin, Making SQL queries correct on incomplete databases: A feasibility study, in: T. Milo, W. Tan (Eds.), Proceedings of ACM PODS, 2016, ACM, 2016, pp. 211-223.

[17] Immerman, Languages that capture complexity classes, SIAM J. Comput. 16 (1987) 760-778. URL: https://doi.org/10.1137/0216051. doi:10.1137/0216051.