<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Certain Answers of Extensions of Conjunctive Queries by Datalog and First-Order Rewriting</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amélie Gheerbrant</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonid Libkin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandra Rogova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Sirangelo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DI ENS, ENS, PSL University</institution>
          ,
          <addr-line>CNRS, Inria Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Data Intelligence Institute of Paris (diiP)</institution>
          ,
          <addr-line>Inria</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Informatics, University of Edinburgh</institution>
          ,
          <addr-line>10 Crichton Street, Edinburgh EH8 9AB</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Université Paris Cité</institution>
          ,
          <addr-line>CNRS, IRIF, F-75013, Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>14</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>To answer database queries over incomplete data the gold standard is finding certain answers: those that are true regardless of how incomplete data is interpreted. Such answers can be found eficiently for conjunctive queries and their unions, even in the presence of constraints such as keys or functional dependencies. With negation added, the complexity of finding certain answers becomes intractable however. In this paper we exhibit a well-behaved class of queries that extends unions of conjunctive queries with a limited form of negation and that permits eficient computation of certain answers even in the presence of constraints by means of rewriting into Datalog with negation. The class consists of queries that are the closure of conjunctive queries under Boolean operations of union, intersection and diference. We show that for these queries, certain answers can be expressed in Datalog with negation, even in the presence of functional dependencies, thus making them tractable in data complexity. We show that in general Datalog cannot be replaced by first-order logic, but without constraints such a rewriting can be done in first-order.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Incomplete information</kwd>
        <kwd>Certain answers</kwd>
        <kwd>Datalog rewritings</kwd>
        <kwd>First-order rewritings</kwd>
        <kwd>Functional dependencies</kwd>
        <kwd>Chase</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>We study the classical problem of answering queries over databases with incomplete information
where incompleteness is represented by means of nulls, as in the most common practice in
relational databases. Such databases are required to satisfy integrity constraints, most commonly
keys. This addition of constraints makes query answering more complex, even for fairly simple
queries.</p>
      <p>
        We consider the setting of databases with marked nulls, as is often required in applications
such as data integration, data exchange, ontology-based data access, and others. This is markers
for missing information, but the same marker may appear in multiple places. These generalize
nulls in SQL and relational database management systems where repetition is not allowed; our
results will thus apply to SQL nulls as well. We use the standard model of query answering,
namely finding certain answers which are guaranteed to be true regardless of the interpretation
of nulls. Over the years, we have learned that query answering is generally easy for conjunctive
queries (CQ) and closely related classes, while becoming computationally infeasible for more
general queries. For example, in the absence of constraints such as keys, and under the prevalent
closed-world semantics used in the case of database incompleteness [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], we know that:
• Certain answers to conjunctive queries and their unions can be found by naïve evaluation,
i.e., the standard evaluation of queries in which nulls are treated as new distinct constants.
• This could be extended with a limited form of guarded negation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; in fact the limits of
such naive evaluation are dictated by the notion of query preservation under
homomorphisms.
      </p>
      <p>Under constraints, even such simple ones as keys, the picture is less complete. We know the
following:
• Certain answers to a conjunctive query  (or a union of CQs) on a database  under key
constraints Σ can be found by naïve evaluation of  on the result of the chase of  with
Σ . Mathematically, certΣ(, ) = (chaseΣ()), where on the left-hand side we have
certain answers under constraints, and on the right hand side the naïve evaluation of 
over the result of the chase. Here chaseΣ refers to the classical textbook chase procedure
with keys, or more generally functional dependencies. In fact the above result applies
when Σ is a set of functional dependencies, not just keys.</p>
      <p>Unfortunately the above result does not work when we move outside the class of
selectproject-join-union queries, or unions of CQs. In fact even without constraints, certain answers
to a query of the form 1 − 2, where both 1 and 2 are CQs, are not necessarily produced
by naïve evaluation. To see why, take a database containing one fact (1, ⊥) where ⊥ is a null
and 1 returning  while 2 is given by a formula (, ) ∧  = . Here naïve evaluation of
1 − 2 returns  while certain answers is empty.</p>
      <p>
        This motivates our question whether we can extend the class of CQs and their unions to
obtain tractable evaluation of certain answers under constraints such as keys and functional
dependencies. The answer is positive; in fact the query of the form 1 − 2 above will be
an example of a query in this class. To start with, the class must be such that finding certain
answers for its queries without constraints is already tractable. We know one such class: it
consists of arbitrary Boolean combinations of CQs, not just their union. We shall denote it
by BCCQ. It was proved in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that certain answers for it can be found in polynomial time,
though the procedure was a tableau-based and not particularly suitable for implementation
in a database system. To be implementable, certain answers should ideally be expressible in a
database query language: in an ideal world, in FO (and thus basic SQL), or at least in Datalog
(and thus recursive SQL).
      </p>
      <p>This is precisely what we do in this paper. We establish three main results:
1. For an arbitrary BCCQ  and a set of functional dependencies Σ one can construct a
Datalog (with negation) query ′ whose naive evaluation computes certΣ(, ), thereby
ensuring its polynomial-time data complexity.
2. There are however simple BCCQs, in fact even CQs,  and keys Σ such that certΣ(, )
cannot be expressed in FO.
3. Without constraints present, certain answers to BCCQs are not only polynomial-time
computable as had been shown previously, but also can be expressed in FO and thus
eficiently implemented in SQL databases.</p>
      <p>After giving preliminaries in the next Section, the following three sections address these
items, respectively.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <sec id="sec-2-1">
        <title>Incomplete databases and constraints</title>
        <p>
          We represent missing information in relational databases in the standard way using nulls
[
          <xref ref-type="bibr" rid="ref1 ref5 ref6">5, 1, 6</xref>
          ]. Incomplete databases are populated by constants and nulls, coming respectively from
two countably infinite sets Const and Null. We denote nulls by ⊥, sometimes with sub- or
superscript. We also allow them to repeat, thus adopting the model of marked nulls, as customary
in the context of applications such as OBDA or data integration and exchange. A relational
schema, or vocabulary  , is a set of relation names with associated arities. A database  over
 associates to each relation name of arity  in  , a k-ary relation which is a finite subset
of (Const ∪ Null). Sets of constants and nulls occurring in  are denoted by Const() and
Null(). A database is complete if it contains no nulls, i.e. Null() = ∅. The active domain of
 is the set of all values appearing in , i.e. adom() = Const() ∪ Null().
        </p>
        <p>
          A valuation  : Null() → Const on a database  is a map that assigns constant values
to nulls occurring in . By () and (¯) we denote the result of replacing each null ⊥ by
(⊥) in a database  or in a tuple ¯. The semantics [[]] of an incomplete database  is the set
{() |  is a valuation on } of all complete databases it can represent. Here as is common
in research on incomplete data, we use closed world assumption [
          <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
          ] (i.e., everything we don’t
know to be true is automatically assumed to be false and no new tuple can be added).
        </p>
        <p>A functional dependency over a relation name  is a first order sentence of the form
∀¯, ¯ ((¯, ¯, ) ∧ (¯, ¯′, ′) →  = ′).</p>
        <p>Throughout this paper we will assume that a set of functional dependencies Σ is associated
with the database schema  .</p>
        <p>A valuation  is consistent with Σ (or just consistent, when Σ is clear from the context) if
() |= Σ . We denote by V() the set of all consistent valuations defined on .</p>
      </sec>
      <sec id="sec-2-2">
        <title>Query answering</title>
        <p>
          An -ary query Q of active domain  ⊆ Const is a map that associates with a database 
a subset of (adom() ∪ ). To answer an -ary query  over an incomplete database 
we follow [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and adopt a slight generalisation of the usual intersection based certain answers
notion, defined as ∩(()). The set of certain answers to  over  is
certΣ(, ) = {¯ ∈ adom() | (¯) ∈ (()) for all consistent } .
For queries that explicitly use constants, we shall expand this to allow ¯ range over adom()
and those constants. The only diference with the usual notion is that we allow answers to
contain nulls, to avoid pathological situations when answers known with certainty are not
returned (e.g., in a query returning a relation  one would expect  to be returned while the
intersection-based certain answer will only return null-free tuples).
        </p>
        <p>We study the certain answers problem from the data complexity perspective, fixing the query:
Problem: CertainAnswerΣ()
Input: A database  and a tuple ¯</p>
        <p>Question: Is ¯ ∈ certΣ(, )?</p>
        <p>
          For arbitrary FO queries and set of FDs, under the closed world semantics, the data complexity
of finding certain answers is coNP-complete (to show ¯ ̸∈ certΣ(, ) it is enough to guess a
valuation  with () |= Σ and () ̸|= ((¯)) ); the problem is coNP-hard even when Σ is
empty [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Query languages</title>
        <p>Here we shall study certain answers to first-order (FO) queries by means of their rewriting in
Datalog. FO queries of vocabulary  use atomic relational and equality formulae and are closed
under Boolean connectives ∧, ∨, ¬ and quantifiers ∃, ∀. We write  (¯) for an FO-formula 
with free variables ¯. With slight abuse of notation, ¯ will denote both a tuple of variables
and the set of variables occurring in it. The set of constants used by  is denoted by adom( ).
We interpret FO-formulas under active domain semantics, i.e. quantified variable"s range over
adom() ∪ adom( ). Thus, an FO formula  (¯) represents a query (of active domain adom( ))
mapping each database  into the set of tuples {¯ over adom() ∪ adom( ) |  |=  (¯)}.</p>
        <p>
          A Datalog rule [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is an expression of the form 1(1) ← 2(2), . . . , () where  ≥ 1,
1, . . . ,  are relation names and 1, . . . ,  are free tuples of appropriate arities. Each
variable occurring in 1 must occur in at least one of 2, . . . , . A Datalog program is a finite
set of Datalog rules. The head of the rule is the expression 1(1); and 2(2), . . . , ()
forms the body. The semantics is the standard fixed-point semantics.
        </p>
        <p>As the language of our rewritings, we shall be using a fragment of stratified Datalog with
negation in bodies that can be seen in two diferent ways.</p>
        <p>1. A program is evaluated in two steps. First, we can have a Datalog program  defining
new idb predicates 1, . . . , ℓ. Then we ask an FO query over the schema extended with
these predicates 1, . . . , ℓ.
2. We evaluate a stratified Datalog with negation program in which the first stratum has no
negation (but may have recursion) and the second stratum has no recursion (but may
have negation).</p>
        <p>From the rewritings we produce it will be clear that they fall in these classes. The key point
about them is that they can be implemented in recursive SQL, and that they both have PTIME
data complexity, making them feasible.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Naïve evaluation and certain answers</title>
        <p>For a query  written in FO or Datalog, we write () to mean that such a query is evaluated
naïvely. That is, if  contains nulls, nulls of  are treated as new constants in the domain of ,
distinct from each other, and distinct from all the other constants in  and  . For example the
query  (, ) = ∃ ((, )∧(, )), on the database  = {(1, ⊥1), (⊥1, ⊥2), (⊥3, 2)}
selects only the tuple (1, ⊥2).</p>
        <p>
          There are known connections between naïve evaluation and certain answers. If Σ is empty
and  is a union of conjunctive queries, then certΣ(, ) = (), see [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. If Σ contains a set
of FDs, then certΣ(, ) = (︀ chaseΣ())︀ ; cf. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Here chaseΣ refers to the standard chase
procedure with a set of FDs [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Datalog Rewriting</title>
      <p>Recall that conjunctive queries (CQs) are given by the ∃, ∧-fragment of FO, and their unions
(UCQs) by the ∃, ∧, ∨-fragment of FO; these are also captured by the positive fragment of
relational algebra (select-project-union-join queries).</p>
      <p>To extend tractability results for certain answers to CQs and UCQs, we extend them with a
mild form of negation (since adding negation leads to coNP-hardness of certain answers). This
mild form comes in the shape of Boolean combination of conjunctive queries (BCCQs), i.e., the
closure of conjunctive queries under operations  ∩ ′,  ∪ ′, and  − ′.</p>
      <p>
        If there are no constraints in Σ , finding certain answers to BCCQs is known to be tractable [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
though by tableau-based techniques that are hard to implement in a database system. We now
extend this in two ways. First, we show that tractability is preserved even in the presence of
functional dependencies (and thus keys). Second, we show that certain answers can be obtained
by rewriting into a fragment of Datalog as described in Section 2. In particular, it means that
certain answers can be found by a query expressible in recursive SQL.
      </p>
      <p>
        Building on constraint-free rewriting techniques from [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we start by putting each
conjunctive query in a normal form which eliminates repetition of variables, by introducing new
equality atoms.
      </p>
      <p>Definition 3.1 (NRV normal form). A conjunctive query  is in non-repeating variable normal
form (NRV normal form) whenever it is of the form (¯) = ∃ ¯( ( ¯) ∧ (¯, ¯)) where variables
in ¯ ¯ are pairwise distinct, and:
• ( ¯) is a conjunction of relational atoms without constants, where each free variable in ¯
has at most one occurrence in ,
• (¯, ¯) is a conjunction of equality atoms, possibly using constants, where each variable of
¯ is involved in at least one equality.</p>
      <p>We say that ( ¯) is the relational subquery of , and (¯, ¯) is the equality subquery of .</p>
      <p>A BCCQ is in NRV normal form if it is a Boolean combination of CQs in NRV normal form.</p>
      <p>Clearly every CQ  is equivalent to a query in NRV normal form; moreover  can be easily
rewritten in NRV normal form (in linear time in the size of the query). Thus, in what follows,
we assume w.l.o.g. that CQs are given in NRV normal form. Intuitively the NRV normal form
allows us to separate the two ingredients of a CQ : the existence of facts in some relations of the
database on the one side, and a set of equality conditions on data values occurring in these facts,
on the other side. The existence of facts does not depend on the valuation of nulls, and thus can
be directly tested on the incomplete database. Instead, equality atoms in an NRV normal form
imply conditions that valuations need to satisfy in order for the query to hold.</p>
      <p>Given a query , a database , and a tuple ¯ over adom() ∪ adom() we let the support
of ¯ be the set of all valuations that witness it :</p>
      <p>Supp(, , ¯) =</p>
      <p>{ ∈ V() | (¯) ∈ (())}</p>
      <p>In order to look for rewritings of BCCQs, a key observation is that ¯ is a certain answer to
 if Supp(¬, , ¯) = ∅. When  is a BCCQ, so is ¬, thus we look for ways of expressing
(non-)emptiness of the support for BCCQs.</p>
      <p>We start by concentrating on the support of equality subqueries. This will be encoded in
Datalog and then integrated, as a key ingredient, in the rewriting of the whole query. We let
 (¯) be an arbitrary set of equality atoms among variables ¯ and possibly constants. Intuitively
we will be interested in the case that  (¯) is the equality subquery (¯, ¯) of a CQ in NRV
normal form (thus notice that in the Datalog program below ¯ encompasses variables ¯ ¯ of an
equality subquery).</p>
      <p>Membership in the set adom() ∪ adom( ) can be expressed by a UCQ formula that we call
(). We encode equivalence of database elements in adom() ∪ adom( ) w.r.t. a set of
equalities  (¯) using the following Datalog program 1:
 (¯, , ) ← ∧  (), ()
 (¯, , ′) ←  = , ′ = , ∧ () for each ( = ) ∈ 
 (¯, , ′) ←
 (¯, , ′) ←
 (¯, , ′) ←
 (¯, , ),  (¯, , ′)
 (¯, ′, )
(¯, ¯, ), (¯′, ¯′, ′), ∧  (¯, , ′)
for each FD ((¯, ¯, ), (¯, ¯′, ′) →  = ′) ∈ Σ</p>
      <p>Intuitively, if ¯ is a tuple of database elements assigned to ¯, equivalent elements of  are the
ones which should be collapsed into a single value in order for a valuation of  to satisfy all
the equalities  (¯) and the FDs. For fixed  and ¯, the relation {(, ′) |  |=  (¯, , ′)} is
an equivalence relation over adom() ∪ adom( ) where each element of adom() neither in ¯
nor in adom( ) forms a singleton equivalence class.</p>
      <p>
        The formula  is a key ingredient in our rewriting; as formalized in the following lemma,
it selects precisely the pairs of elements that a consistent valuation needs to collapse to satisfy
a set of equalities. In Lemmas 3.2, 3.5 and Propositions 3.3, 3.4 we use some of the machinery
developed in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and thus the proofs of those statements, which are adaptations of proofs in
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] are omitted.
1Queries we write hereafter can be domain dependent. So it is important to recall that we always use active domain
semantics.
Lemma 3.2. Let  (¯) be a conjunction of equality atoms,  a database, and  (¯) = ¯ an
assignment over adom() ∪ adom( ). Assume  is a consistent valuation of nulls, then () |=
 ((¯)) if and only if () = (′) for all , ′ such that  |=  (¯, , ′).
      </p>
      <p>Formulas we write in the remainder are over signature  ∪  , where  is the database
schema. In any incomplete database  over  ∪  ,   is always interpreted by the set of
nulls occurring in  (in accordance with the semantics of the SQL construct IS NULL). I.e. we
allow rewritings to test whether a database element is null or not.</p>
      <p>For  (¯) a conjunction of equality atoms, using  we define a new formula  (¯)
stating the existence of a consistent valuation that collapses all equivalent elements of a tuple:
 (¯) :=</p>
      <p>∀′( (¯, , ′) ∧ ¬ () ∧ ¬ (′) →  = ′)
Proposition 3.3. Let  (¯) be a conjunction of equality atoms,  a database, and  (¯) = ¯ an
assignment over adom() ∪ adom( ), then  |=  (¯) if and only if there exists a consistent
valuation  of nulls such that () |=  ((¯)).</p>
      <p>We are now ready to define a formula capturing the inclusion of supports between two
conjunctions of equality atoms, which will be a crucial ingredient in our rewriting. Let  (¯)
and  ′(¯) be conjunctions of equality atoms with adom( ) = adom( ′). We define :
, ′ (¯, ¯) :=</p>
      <p>∀′ ( ′ (¯, , ′) →  (¯, , ′))
Using Proposition 3.3 and Lemma 3.2 we obtain :
Proposition 3.4. Let  (¯) ,  ′(¯) be conjunctions of equality atoms with adom( ) = adom( ′),
 a database and  (¯) = ¯,  ′(¯) = ¯′ assignments over adom() ∪ adom( ). Then  |=
, ′ (¯, ¯′) ∨ ¬ (¯) if for all consistent valuations , one has () |=  ((¯)) implies
() |=  ′((¯′)).</p>
      <p>So far, we have dealt with equality subqueries and we have characterized the emptiness and
inclusion of their supports (cf. Proposition 3.3 and Proposition 3.4, respectively). We can now
use this machinery to characterize the support of a BCCQ. We start by expressing membership
in the support of an individual CQ :
Lemma 3.5. Let  be a database,  a consistent valuation of  and (¯) a conjunctive query
in NRV-normal form, with relational subquery ( ¯) and equality subquery  (¯, ¯) . Then  ∈
Supp(, , )¯ if and only there exists ¯ such that  |= ()¯ ∧  (¯)¯ and () |=  ((¯)¯) .</p>
      <p>In the remainder we consider BCCQs (¯) :=
normal form (DNF) where for all 1 ≤  ≤  :
1(¯) ∨ . . . ∨ (¯) in NRV disjunctive
 := 0 (¯) ∧ ¬1 (¯) ∧ . . . ∧ ¬ (¯)
and for all 1 ≤  ≤  :
 := ∃ ¯  ( ¯ ) ∧   with   :=  (¯ ¯ )
For convenience, we assume w.l.o.g every conjunction of literals to be of the same length .
We can also assume without loss of generality that for each  we have adom(  ) = adom( 0 )
for all . In fact we can always pad any   with dummy equalities  =  to extend its active
domain.</p>
      <p>Given a disjunct  in a BCCQ in DNF, we now define poss , encoding the set of possible
answers to , and cons , checking the compatibility of an answer with the negative literals in
.</p>
      <p>(¯ ¯) := 0 ( ¯) ∧  0 (¯ ¯) ∧  (¯ ¯)
Using these new formulae, we show that the non-emptiness of Supp((¯) , , )¯ can be
expressed as the existence of a possible answer.</p>
      <p>Proposition 3.6. Let  be a database and (¯) a DNF BCCQ in NRV normal form, then
Supp((¯) , , )¯ ̸= ∅ if and only if  |= ⋁︀1≤ ≤  ∃ ¯  (¯ ¯) .</p>
      <p>Proof. ⇐ Let  |= ⋁︀1≤ ≤  ∃ ¯  (¯ ¯) , then there exists 1 ≤  ≤  and an assignment
 with  ( ¯) = ¯,  |= 0 ()¯ ∧  0 (¯)¯ and for all 1 ≤  ≤ , ¯′ such that  |=
 ( ¯′) ∧   (¯ ¯′), one has  |= ¬ 0 ,  (¯¯, ¯¯′). Since  |=  0 (¯)¯ , it’s
easy to see that for each  ∈ () ∪ adom( 0 ) there exists at most one constant  such
that  |=  0 (¯¯, , ). In fact if for constants 1 and 2,  |=  0 (¯¯, , 1) and
 |=  0 (¯¯, , 2), by transitivity  |=  0 (¯¯, 1, 2), implying 1 = 2.</p>
      <p>Using this observation we now build a consistent valuation * having the following
“tightness" property : for all , ′ ∈ adom() ∪ adom( 0 ), we have * () = * (′) if
 |=  0 (¯¯, , ′). To build * we associate to each equivalence class  of the
relation {(, ′) |  |=  0 (¯¯, , ′)}, a new fresh constant  outside adom() ∪ adom( 0 ).
Then * can be defined as follows. For  ∈ adom(), if  |=  (¯¯, , ), for some constant
, then * () = ; otherwise * () =  where  is the equivalence class of . Consistency of
* derives from the tightness property, and the fact that  0 satisfies the last rule of the
Datalog program that defines it. Moreover by Lemma 3.2, * () |=  0 (* (¯)¯) and we can
prove the following claim :
Claim 3.7. For all conjunction of equalities  ′(¯) with adom( ′) = adom( 0 ) and all ¯ over
adom() ∪ adom( 0 ), one has * () |=  ′(* (¯)) if for all consistent valuations , () |=
 0 ((¯)¯) implies () |=  ′((¯)).</p>
      <p>Now fix some arbitrary  ≥
tion 3.4, it follows from  |= ¬ 0 ,  (¯¯, ¯ ¯′) ∧  0 (¯)¯ that there exists a
consistent valuation ′ with ′() |=  0 (′(¯)¯) but ′() ̸|=   (′(¯ ¯′)). By the above claim
* () ̸|=   (* (¯ ¯′)). In summary we have :
1 and ¯′ with  |=  ( ¯′) ∧   (¯ ¯′). By
Proposi(i)  |= 0 ()¯ ∧  0 (¯)¯ and * () |=  0 (* (¯)¯) and so by Lemma 3.5, we have
* ∈ Supp(0 (¯) , , )¯ , i.e., * () |= 0 (* ()¯) .
(ii) For all 1 ≤  ≤  and assignment  ′ with  ′( ¯) = ¯′, if  |=  ( ¯′) ∧   (¯ ¯′)
then * () ̸|=   (* (¯ ¯′)) and so by Lemma 3.5, we have * ̸∈ Supp( (¯) , , )¯ , i.e.,
for all 1 ≤  ≤ , * () |= ¬ (* ()¯) .</p>
      <p>This means we have * ∈ Supp(0 (¯) ∧ ¬1 (¯) ∧ . . . ∧ ¬ (¯) , , )¯ for all 1 ≤  ≤ 
and so * ∈ Supp((¯) , , )¯ .</p>
      <p>⇒ Let  ∈ Supp((¯) , , )¯ , so  is consistent and there is some 1 ≤  ≤  with : (i)
 ∈ Supp(0 , , )¯ , (ii) for all 1 ≤  ≤ ,  ̸∈ Supp( , , )¯ . Using Lemma 3.5 (i) implies
that there exists ¯ such that  |= 0 ()¯ ∧  0 (¯)¯ and () |=  0 ((¯)¯) . Again by
Lemma 3.5, (ii) implies that for all 1 ≤  ≤  and ¯′, if  |=  ( ¯′)∧  (¯ ¯′) then () ̸|=
  ((¯ ¯′)). This entails by Proposition 3.4 that  |=  0 (¯)¯ ∧ ¬ 0 ,  (¯¯, ¯ ¯′).
This shows  |= ⋁︀1≤ ≤  ∃ ¯  (¯ ¯) .</p>
      <p>Now that we have defined the formula expressing for a BCCQ  non-emptiness of
Supp((¯) , , )¯ (Proposition 3.6), we can easily define a rewriting for the problem
CertainAnswerΣ(). To do so, we rely on the fact that ¯ ∈ certΣ(, ) if Supp(¬, , )¯ =
∅.</p>
      <p>Theorem 3.8 (Datalog rewriting). Let D be a database whose schema contains a set of functional
dependencies Σ , and let (¯) be a BCCQ in NRV-normal form. Let ′ = ′1(¯) ∨. . .∨′(¯) be ¬
in DNF normal form. Then ¯ ∈ certΣ(, ) if and only if  |=  ()¯ where  (¯) = ⋀︀1≤ ≤  ∀ ¯
¬′ (¯ ¯) .</p>
      <p>Proof. One has that ¯ ∈ certΣ(, ) if (′, , )¯ =
tion 3.6 tells us that (′, , )¯ = ∅ if  |= ⋀︀1≤ ≤  ∀ ¯ ¬′ (¯ ¯) .
∅. Being ′ still a BCCQ,
Proposi and a set of FDs Σ , the complexity of
Corollary 3.9. For each fixed BCCQ query
CertainAnswerΣ() is in PTIME.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Non-rewritability in FO</title>
      <p>The basic starting points for our investigation was the fact that certΣ(, ) = (chaseΣ())
for a CQ  and a set Σ of FDs, for every database . This remained true for unions of CQs,
but failed for BCCQs, forcing us to produce a Datalog rewriting to obtain certain answers. But
can a first-order rewriting be obtained instead? This would make it possible to produce certain
answers using the core of SQL as opposed to its recursive features which do not always perform
as well in practice.</p>
      <p>In this section we show that the answer, in general, is negative even for CQs (and thus for
BCCQs). In the next section however we show that such rewritings can be obtained in FO for
BCCQs whenever Σ is empty.</p>
      <p>The main result of this section is the following.
Theorem 4.1. There exists a Boolean CQ  and single FD Σ over a relational schema of binary
and unary relations such that certΣ(, ) is not expressible as an FO query.</p>
      <p>Proof. Consider a schema with one binary relation  and two unary relations  and . The
only FD in Σ is ∀∀∀ (︀ (, ) ∧ (, ) →  = )︀ ; in other words, the first attribute of 
is a key. The query  is a Boolean CQ ∃ (() ∧ ()).</p>
      <p>To prove inexpressibility of certΣ(, · ) in FO, for each  &gt; 0 we create two databases 
and ′. In both of them,  is interpreted as a disjoint union 1 ∪ 2 where 1 and 2 are
balanced binary trees of depth  in which all nodes are distinct nulls. In both  and  are
singleton sets. In , the set  contains a leaf of 1 and  contains a leaf of 2. In ′, both
 and  contain leaves of 1 such that their only common ancestor in the tree is the root (in
other words, they are leaves of subtrees rooted at diferent children of the root of 1).</p>
      <p>Because of the constraint Σ , for every valuation  such that the resulting database satisefis
it we have that both (1) and (2) are chains. Indeed, consider any node ⊥ with children
⊥1, ⊥2 in . If (⊥1) ̸= (⊥2) then the resulting tuples ((⊥), (⊥1)) and ((⊥), (⊥2))
violate the constraint. Thus (⊥1) = (⊥2) and applying this construction inductively we see
that () is a chain. Hence, it has a single leaf, and thus certΣ(, ′) is true, since  and
 must be interpreted as that leaf. On the other hand, certΣ(, ) is false, since there is a
valuation  that sends 1 and 2 into two disjoint chains, and thus  and  are interpreted as
two distinct elements.</p>
      <p>Assume now that certΣ(, · ) is rewritable as an FO sentence . Then, for every  &gt; 0, we
have ′ |=  and  |= ¬. We next show that such a sentence cannot exist, thereby proving
non-FO-rewritability.</p>
      <p>
        Recall that in a database (with one binary relation, like considered here) a radius 
neighborhood of an element  is its restriction to the set of all elements reachable from  by a path of
length at most , where the path does not take into account the orientation of edges of  (for
example, if we have (, ) and (, ) then both  and  are in the radius 1 neighborhood of
). When two neighborhoods, of elements  and , are isomorphic, it means that there is an
isomorphism between them that sends  to . In other words, centers of neighborhoods are
viewed as distinguished elements when it comes to defining neighborhoods. It is known that
each first order sentence  is Hanf-local [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]: that is, there exists a number  &gt; 0 such that
for any two databases 1 and 2, if there is a bijection  between 1 and 2 such that the
radius  neighborhoods of  in 1 and  () in 2 are isomorphic then 1 and 2 agree on  ,
i.e. either both satisfy it or both do not.
      </p>
      <p>Now let  be such a number for the sentence  we assumed exists. Consider  and ′
and let 1, 1* be the subtrees of the root of 1 in  such that the first contains  while the
second contains neither  not , and let 2, 2* be defined similarly for subtrees of the root
of 2 with respect to . In ′ we define 1′, 1′ as subtrees of the root of the tree containing
,  such that the first contains the  leaf and the second contains the  leaf, while 2′* , 2′*
be the subtrees of the root of the tree having neither  nor  elements. Then it is easy to see
that the following pairs of trees are isomorphic: 1 and 1′, 2 and 1′, 1* and 2′* , 2* and
2′* .</p>
      <p>We now define the bijection  as the union of those isomorphisms plus mapping roots of
trees  in  into roots of  in ′. It is an immediate observation that if  &gt;  + 1 (i.e., leaves
are not in the radius  neighborhood of children of roots) then  satisfies the condition that
neighborhoods of  and  () of radius  are isomorphic. This would tell us that  and ′
agree on  but we know they do not. This contradiction completes the proof.</p>
      <p>As a corollary to the proof, we obtain the following result showing that non-recursive SQL is
incapable of computing certΣ(, ) in the setting of Theorem 4.1.</p>
      <p>Corollary 4.2. There exists a Boolean CQ  and single FD Σ over a relational schema
of binary and unary relations such that certΣ(, ) is not expressible in the basic
SELECT-FROM-WHERE-GROUP BY-HAVING fragment of SQL with arbitrary aggregate
functions.</p>
      <p>
        This is due to the fact that queries in this fragment of SQL with grouping and aggregation can
be translated into a logic with aggregate functions [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] which itself is known to be Hanf-local
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. An FO rewriting</title>
      <p>
        We now focus on the special case where Σ is empty. First notice that the only Datalog component
in our rewriting was the  formula. Let ∼  be the reflexive symmetric transitive closure
of {(, ) |  =  ∈  }. As shown in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], as Σ is empty, we can rewrite as follows the 
formula in FO, where  is the number of equivalence classes of ∼  :
 (¯, , ′) :=  = ′ ∨
⋁︁
      </p>
      <p>( = 1 ∧ ′ =  ∧
1,1...,∈ ¯ ∪ adom( ) |
∼   for all 1≤ ≤</p>
      <p>⋀︁  = +1)
1≤ &lt;
Intuitively this holds because each disjunct of  (¯, , ′) corresponds to a possible
derivation of (, ′) in the reflexive symmetric transitive closure of {( (),  ()) |  =  ∈  },
and one can prove that there is a bound only depending on  on the number of steps of this
derivation.</p>
      <p>
        As a consequence, we can rewrite in FO the formula poss of Section 3 encoding the set of
possible answers to . It is enough to replace each occurrence of the Datalog  (¯, , ′)
program in it by  (¯, , ′). We denote by poss the rewriting so obtained. With this,
we obtain an extension to BCCQ of the rewriting techniques proposed in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for UCQ.
Theorem 5.1 (FO rewriting). Let D be a database, Σ = ∅ and let (¯) be a BCCQ in NRV-normal
form. Let ′ = ′1(¯) ∨ . . . ∨ ′(¯) be ¬ in DNF normal form. Then ¯ ∈ certΣ(, ) if and
only if  |=  ()¯ where  (¯) = ⋀︀1≤ ≤  ∀ ¯ ¬′(¯ ¯) .
      </p>
      <p>
        Note that tractability of BCCQ was already proved in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] using tableau based methods. We
now refine complexity as follows.
      </p>
      <p>Corollary 5.2. For each fixed BCCQ query , the complexity of CertainAnswerΣ() is in
DLOGSPACE whenever Σ = ∅.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future work</title>
      <p>
        Our rewriting techniques are closer to a practical implementation than the previous tableau
based method from [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This is due to their expressibility in recursive SQL (or even non-recursive
in the case of Theorem 5.1). However, while theoretically feasible, an actual implementation
will need additional techniques to achieve acceptable performance. To see why, notice that
the first rule in the definition of  creates a cross product over the full active domain,
i.e., the set of all elements that appeared in the database. This of course will be prohibitively
large. While this may appear to be a significant obstacle, a similar situation with computing or
approximating certain answers is not new in the literature. For instance, the first approximation
scheme for certain answers to SQL queries that appeared in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] has done exactly the same, and
generated very large Cartesian products even for simple queries with negation. Nonetheless, an
alternative was found quickly [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] that completely avoided the need for such expensive queries,
and it was shown to work well on several TPC-H queries. Thus, looking for a practical and
implementable rewriting is one of the possible directions for future work.
      </p>
      <p>
        As another open problem, we note that the query for which we have shown certain answers
to be non-rewritable in FO has DLOGSPACE data complexity. Indeed the problem is essentially
reachability over trees, which can be easily encoded using deterministic transitive closure [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
To express DLOGSPACE problems, we need a language weaker than Datalog with negation.
Thus, it is natural to ask whether a low complexity Datalog fragment would be suficient to
express rewritings of BCCQ, or a separating example that is PTIME-complete can be found.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We thank the anonymous referees for their useful feedback. This work was supported by ANR
grants ANR-18-CE40-0031 (QUID) and ANR-21-CE48-0015 (VeriGraph) as well as EPSRC grant
N023056 (MAGIC). We also acknowledge a PGSM Master grant from the FSMP.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Imieliński</surname>
          </string-name>
          , W. Lipski,
          <article-title>Incomplete information in relational databases</article-title>
          ,
          <source>Journal of the ACM</source>
          <volume>31</volume>
          (
          <year>1984</year>
          )
          <fpage>761</fpage>
          -
          <lpage>791</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Console</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Guagliardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          , E. Toussaint,
          <article-title>Coping with incomplete data: Recent advances</article-title>
          ,
          <source>in: ACM PODS, ACM</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gheerbrant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sirangelo</surname>
          </string-name>
          ,
          <article-title>Naïve evaluation of queries over incomplete databases</article-title>
          ,
          <source>ACM Trans. Database Syst</source>
          .
          <volume>39</volume>
          (
          <year>2014</year>
          )
          <volume>31</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          :
          <fpage>42</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gheerbrant</surname>
          </string-name>
          , L. Libkin,
          <article-title>Certain answers over incomplete XML documents: Extending tractability boundary</article-title>
          ,
          <source>ACM ToCS 57</source>
          (
          <year>2015</year>
          )
          <fpage>892</fpage>
          -
          <lpage>926</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vianu</surname>
          </string-name>
          , Foundations of Databases, Addison-Wesley, Boston, MA, USA,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>R. van der Meyden</surname>
          </string-name>
          ,
          <article-title>Logical approaches to incomplete information: a survey</article-title>
          ,
          <source>in: Logics for databases and information systems</source>
          , Kluwer Academic Publishers, Norwell, MA, USA,
          <year>1998</year>
          , pp.
          <fpage>307</fpage>
          -
          <lpage>356</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <article-title>On closed world data bases</article-title>
          ,
          <source>in: Logic and Data Bases</source>
          ,
          <year>1977</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lipski</surname>
          </string-name>
          ,
          <article-title>On relational algebra with marked nulls</article-title>
          , in: PODS,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , Waterloo, Ontario, Canada,
          <year>1984</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>203</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Kanellakis</surname>
          </string-name>
          , G. Grahne,
          <article-title>On the representation and querying of sets of possible worlds</article-title>
          ,
          <source>TCS</source>
          <volume>78</volume>
          (
          <year>1991</year>
          )
          <fpage>158</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Molinaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Spezzano</surname>
          </string-name>
          ,
          <article-title>Incomplete Data and Data Dependencies in Relational Databases</article-title>
          ,
          <source>Synthesis Lectures on Data Management</source>
          , Morgan &amp; Claypool Publishers,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gheerbrant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sirangelo</surname>
          </string-name>
          ,
          <article-title>Best answers over incomplete data : Complexity and firstorder rewritings</article-title>
          ,
          <source>in: Proceedings of IJCAI-19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1704</fpage>
          -
          <lpage>1710</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Stockmeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Vardi</surname>
          </string-name>
          ,
          <article-title>On monadic NP vs. monadic co-NP, Inf</article-title>
          . Comput.
          <volume>120</volume>
          (
          <year>1995</year>
          )
          <fpage>78</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          ,
          <article-title>Expressive power of SQL, Theor</article-title>
          .
          <source>Comput. Sci</source>
          .
          <volume>296</volume>
          (
          <year>2003</year>
          )
          <fpage>379</fpage>
          -
          <lpage>404</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nurmonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <article-title>Logics with aggregate operators</article-title>
          ,
          <source>J. ACM</source>
          <volume>48</volume>
          (
          <year>2001</year>
          )
          <fpage>880</fpage>
          -
          <lpage>907</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          ,
          <article-title>SQL's three-valued logic and certain answers</article-title>
          ,
          <source>ACM Trans. Database Syst</source>
          .
          <volume>41</volume>
          (
          <year>2016</year>
          ) 1:
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          :
          <fpage>28</fpage>
          . URL: https://doi.org/10.1145/2877206. doi:
          <volume>10</volume>
          .1145/2877206.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Guagliardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          ,
          <article-title>Making SQL queries correct on incomplete databases: A feasibility study</article-title>
          , in: T. Milo, W. Tan (Eds.),
          <source>Proceedings of ACM PODS</source>
          ,
          <year>2016</year>
          , ACM,
          <year>2016</year>
          , pp.
          <fpage>211</fpage>
          -
          <lpage>223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Immerman</surname>
          </string-name>
          ,
          <article-title>Languages that capture complexity classes</article-title>
          ,
          <source>SIAM J. Comput</source>
          .
          <volume>16</volume>
          (
          <year>1987</year>
          )
          <fpage>760</fpage>
          -
          <lpage>778</lpage>
          . URL: https://doi.org/10.1137/0216051. doi:
          <volume>10</volume>
          .1137/0216051.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>