Counting Database Repairs under Primary Keys
                 Revisited?
          (DISCUSSION PAPER)

                Marco Calautti, Marco Console, and Andreas Pieris

                     School of Informatics, University of Edinburgh
                     {mcalautt,mconsole,apieris}@inf.ed.ac.uk


1     Introduction

Consistent query answering (CQA) [2] aims to deliver meaningful answers when
queries are evaluated over inconsistent databases. Such answers, called consis-
tent answers, must be true in all repairs, which are consistent databases whose
difference from the inconsistent one is somehow minimal. Consistent answers
only allow us to conclude that a tuple is either entailed by all repairs, or is not
entailed by some repair. But, as discussed in [4], the former is too strict, while
the latter is not very useful in practice. Instead, we would like to know how often
a tuple is consistent, i.e., its relative frequency defined as the ratio between the
number of repairs entailing the tuple, and the total number of repairs.
    The data complexity, i.e., when the query and the set of constraints are
fixed, of the problem of computing the relative frequency of a tuple has been
already studied in [6, 7] for conjunctive queries (CQs) and primary keys (i.e.,
when each relation has at most one key). We know that computing the total
number of repairs is feasible in polynomial time. On the other hand, comput-
ing the number of repairs that entail the given tuple is #P-complete. Recall
that #P is a hard counting complexity class, which, roughly speaking, collects
the function problems that ask for the number of solutions to an NP problem.
The above #P-hardness relies on polynomial-time Turing, a.k.a. Cook, reduc-
tions. However, as observed in the literature [5, 8], there are problems that are
“hard-to-count-easy-to-decide” that cannot be complete (under reasonable as-
sumptions) for #P under standard many-one logspace (or even polynomial-time)
reductions. Note that the decision version of a counting problem asks whether
the count is greater than zero. In our case, it asks whether a repair entailing a
given tuple exists. As stated in [8], Cook reductions blur structural differences
between counting problems and complexity classes. The reason is that #P is
not closed under Cook reductions (under reasonable assumptions). Therefore,
for such “hard-to-count-easy-to-decide” problems, a crucial question is whether


              ©
?
    This is a short version of the paper [3].
    Copyright      2019 for the individual papers by the papers authors. Copying permit-
    ted for private and academic purposes. This volume is published and copyrighted by
    its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.
we can determine their exact complexity w.r.t. subclasses of #P to which they
belong. Ideally, we would like to show that such a problem is complete, under
logspace reductions, for a class C ⊆ #P that is closed under such reductions.
     Our goal is to perform such a refined analysis for the data complexity of the
problem of counting the number of repairs that entail a given tuple in the case of
primary keys. We show that for arbitrary first-order queries this problem remains
#P-complete. For existential positive queries (and CQs as well), we show that
our problem cannot be complete (under reasonable assumptions) for #P, or even
known classes inside #P. This brings us to our main question whether we can
isolate a complexity class C inside #P that is closed under logspace reductions
for which our problem (in the case of positive existential queries) is complete
under such reductions. Let us stress that there is no guarantee that such a class
C exists. This is because we focus on the data complexity of our problem. This
means, by convention, that we deal with a family of problems and not a single
problem; as usual, each query and set of primary keys gives rise to a new problem.
Furthermore, establishing a completeness result for a certain complexity class C
boils down to showing that (i) every problem in this family belongs to C, and
(ii) there exists one problem that is C-complete. But, there is no guarantee that
there exists a problem in this family such that all the other problems of the
family are logspace reducible to it. This essentially tells us that the complexity
class C inside #P that we are looking might not exist.
     Although we do not give a definite answer to our main question, we make a
significant step towards such an answer. We isolate a hierarchy of new complexity
classes inside #P, called the Λ-hierarchy, which is of independent interest, and
show that if the class C that we are looking for exists, then it is one of the
levels of the Λ-hierarchy. Moreover, whether C exists boils down to the question
whether the Λ-hierarchy collapses. We conjecture that this is not the case.


2    Preliminaries

Relational Databases. A (relational) schema S is a finite set of relation sym-
bols (or predicates) with associated arity. A database D over a schema S is a
set of facts of the form R(t̄), where R ∈ S, and t̄ = t1 , . . . , tn is a tuple of terms
that are drawn from a countably infinite set C of constants. We write dom(D)
for the active domain of D, that is, the set of constants occurring in D.
Primary Key Constraints. A key constraint (or simply key) κ over a schema
S is an expression of the form key(R) = {1, . . . , m}, where R ∈ S has arity n,
and m ≤ n. Such an expression is called an R-key. A database D satisfies κ if, for
every two facts R(t̄), R(s̄) ∈ D, t̄[i] = s̄[i] for each i ∈ {1, . . . , m} implies t̄ = s̄.
We say that D is consistent w.r.t. a set Σ of keys, written D |= Σ, if D satisfies
each key in Σ; otherwise, it is inconsistent. A set of primary keys Σ is a set of keys
such that, for each predicate R, Σ has at most one R-key. For α = R(c1 , . . . , cn ),
the key value of α w.r.t. Σ, denoted keyΣ (α), is defined as hR, hc1 , . . . , cm ii, if
key(R) = {1, . . . , m} ∈ Σ, and hR, hc1 , . . . , cn ii otherwise. Let blockΣ (α, D) =
{β ∈ D | keyΣ (β) = keyΣ (α)}, and blockΣ (D) = {blockΣ (α, D) | α ∈ D}.
A repair of D w.r.t. Σ is a database obtained by choosing one fact from each
B ∈ blockΣ (D). We denote the set of repairs of D w.r.t Σ as rep(D, Σ).
Queries. A first-order query Q(x̄) over a schema S is an expression {x̄ | ϕ},
where ϕ is a first-order formula with free variables x̄ that mentions only atoms
over S. We write FO, ∃FO+ and CQ, for the class of first-order, existential pos-
itive, and conjunctive queries, respectively. The answer to Q(x̄) over a database
D, denoted Q(D), is the set of tuples {c̄ ∈ dom(D)|x̄| | D |= ϕ(c̄)}.
Complexity Toolbox. We recall well-known counting complexity classes, i.e.,
classes of counting functions of the form {0, 1}∗ → N. FP contains the functions
that are computable in polynomial-time. #P contains the functions that compute
the number of accepting computations of an NP machine. SpanL contains the
functions that compute the number of distinct valid outputs of an NL transducer.
The output of a computation is called valid if the machine halts in an accepting
state. We know that SpanL ⊆ #P, and the inclusion is strict (under reasonable
assumptions) [1]. Consider two counting functions f, g : {0, 1}∗ → N. We say
that f is Cook reducible to g, denoted f ≤pT g, if there exists a polynomial-time
deterministic machine M , with access to an oracle for g, such that, for every
x ∈ {0, 1}∗ , f (x) = M (x). Moreover, we say that f is many-one logspace reducible
                                                               ∗            ∗
to g, denoted f ≤log m g, if there exists a function h : {0, 1} → {0, 1} that is
                                                                 ∗
computable in logarithmic-space such that, for every x ∈ {0, 1} , f (x) = g(h(x)).
As usual, a complexity class C is closed under ≤pT (resp., ≤logm ) if, for every two
functions f, g, f ≤pT g (resp., f ≤log
                                   m   g) and g ∈ C  implies f ∈ C.


3   Problem Statement
The main concern of this work is the problem of counting the number of repairs
that entail a given tuple defined as follows; Q denotes a class of queries:

           PROBLEM : #CQA(Q)
           INPUT :   A database D, a set Σ of primary keys,
                    a query Q(x̄) ∈ Q, and a tuple t̄ ∈ dom(D)|x̄| .
           OUTPUT : |{D 0 ∈ rep(D, Σ) | t̄ ∈ Q(D 0 )}|.

    As usual, we are interested in the data complexity of #CQA(Q), where the
set of constraints and the query are fixed, and only the database and the tuple
are part of the input. To make this notion more precise, we need to define the
following problem, where Q(x̄) ∈ Q is a query and Σ is a set of primary keys:

             PROBLEM : #CQA(Q(x̄), Σ)
             INPUT :   A database D, and a tuple t̄ ∈ dom(D)|x̄| .
             OUTPUT : |{D 0 ∈ rep(D, Σ) | t̄ ∈ Q(D 0 )}|.

    We say that #CQA(Q) is R-complete for C in data complexity, where R is a
class of reductions and C a complexity class, if (i) for each query Q ∈ Q and set
Σ of primary keys, #CQA(Q, Σ) is in C, and (ii) there exists Q0 ∈ Q and set Σ 0
of primary keys such that #CQA(Q0 , Σ 0 ) is R-complete for C.
    The decision version of #CQA(Q), denoted #CQA>0 (Q), simply asks whether
|{D0 ∈ rep(D, Σ) | t̄ ∈ Q(D0 )}| > 0. The data complexity of #CQA>0 (Q) is
defined in the same way as for #CQA(Q). Henceforth, for clarity, we focus on
Boolean queries, but all the results hold even if we consider non-Boolean queries.


4     Complexity of #CQA

The complexity of #CQA has been already studied in [6, 7], and we know that
#CQA(CQ) is ≤pT -complete for #P.1 The above result relies on Cook reductions.
However, as discussed in Section 1, there are problems that are “hard-to-count-
easy-to-decide”, which cannot be complete (under reasonable assumptions) for
#P under many-one logspace (or polynomial-time) reductions. Hence, the first
step is to understand for which classes of queries Q, #CQA(Q) is an “easy-to-
decide” problem. After a preliminary analysis, we have concluded that in the
case of arbitrary first-order queries our problem is not only “hard-to-count”, but
also “hard-to-decide”. In fact, we can show that #CQA>0 (FO) is ≤log  m -complete
for NP, and #CQA(FO) is ≤log   m  -complete  for #P. This indicates that #P is the
right complexity for our problem when we focus on first-order queries.
    The situation for positive existential queries is more interesting. We can
show that in this case we are dealing with a typical “hard-to-count-easy-to-
decide” problem. In particular, we can show that #CQA>0 (∃FO+ ) is tractable,
which in turn implies that #CQA(∃FO+ ) is not ≤log    m -complete for #P (unless
P = NP). This tells us that for pinpointing the complexity of #CQA(∃FO+ )
we should look for a subclass of #P, which is closed under logspace reductions.
The obvious candidate is SpanL. Indeed, we can show that #CQA(∃FO+ ) is
in SpanL. However, #CQA(∃FO+ ) is not ≤log     m -complete for SpanL (unless L =
NL). Not much is known about subclasses of SpanL for which our problem can
be complete under logspace reductions. This brings us to our main question:
Research Question: Is there a class C ⊆ SpanL, which is closed under logspace
reductions, such that #CQA(∃FO+ ) is ≤log
                                       m -complete for C?

    Let us stress that since we focus on the data complexity of #CQA(∃FO+ ),
as discussed in Section 1, the above question is not trivial in the sense that the
existence of the class C is not guaranteed. Although we cannot give a definite
answer to this question, we make a significant step towards such an answer. We
isolate a hierarchy of new complexity classes inside #P, called the Λ-hierarchy,
and show the following; we write Λ[k] for the k-th level of the hierarchy:

Theorem 1. The following are equivalent:
1
    All the results for #CQA and #CQA>0 in the rest of the paper concern their data
    complexity. Thus, for brevity, we will sometimes avoid to explicitly mention the term
    data complexity.
 1. There exists a class C ⊆ SpanL, closed under logspace reductions, such that
    #CQA(∃FO+ ) is ≤log m -complete for C.
 2. There exists k ≥ 0 such that #CQA(∃FO+ ) is ≤logm -complete for Λ[k].
 3. The Λ-hierarchy collapses.
    The above theorem essentially tells that either #CQA(∃FO+ ) is ≤log
                                                                      m -complete
for Λ[k], for some k ≥ 0, or there is no complexity class C ⊆ SpanL, closed under
many-one logspace reductions, such that #CQA(∃FO+ ) is ≤log     m -complete for C.
Thus, the right complexity class that characterizes the data complexity of our
problem, if it exists, is one of the levels of the Λ-hierarchy.


5     The Λ-hierarchy
We now proceed to introduce our hierarchy of complexity classes. The main
technical challenge is to understand how to limit the computational power of
NL transducers. This will lead to our new complexity classes inside SpanL that
allow us to establish Theorem 1. To this end, we give a semi-formal introduction
to this hierarchy, by presenting a couple of structural properties enjoyed by the
functions therein. For further details, we refer the reader to [3].

5.1   The Guess-Check-Expand Paradigm
There is a generic paradigm, which we call guess-check-expand, that an NL trans-
ducer can follow in order to place problems inside SpanL, assuming that the
following structural properties are fulfilled:
Small Certificates. A solution (in our case, a repair that entails the query)
  is witnessed via a small certificate – by small we mean of logarithmic size –
  that is verifiable in logspace.
     For the next property, we first need to give some auxiliary terminology. Given
a sequence of sets S1 , . . . , Sn , an `-selector of it, for ` ≥ 0, is a sequence of pairs
σ = (i1 , e1 ), . . . , (i` , e` ), where 1 ≤ i1 < . . . < i` ≤ n, and ej ∈ Sij for each
j ∈ {1, . . . , `}. Intuitively, a pair (i, e) in σ tells us to “keep the element e from
the set Si ”. We can then define the cartesian product of S1 , . . . , Sn w.r.t. σ,
denoted [S1 , . . . , Sn ]σ , as S1σ × · · · × Snσ , where Siσ = {e} if σ contains a pair
(i, e), otherwise, Siσ = Si . In simple words, [S1 , . . . , Sn ]σ is the cartesian product
of S1 , . . . , Sn with the difference that ` sets are first replaced by a singleton set as
dictated by the `-selector σ. We can now discuss the second structural property.
For an input instance x, we write sol(x) for the set of solutions, and cert(x) for
the set of certificates that witness a solution of sol(x).
Solutions via Certificate Expansion. For every input x, there is a sequence
   of logspace computable sets S1 , . . . , Sn , called solution domains, such that

                                            [
                           |sol(x)| =             [S1 , . . . , Sn ]σc ,
                                         c∈cert(x)
    where σc is an `-selector for S1 , . . . , Sn determined S
                                                             by the small certificate
    c. In fact, there exists a bijection enc from sol(x) to c∈cert(x) [S1 , . . . , Sn ]σc ,
    and enc(s) should be understood as an encoding of the solution s.
    A function that enjoys the above structural properties can be shown to be
in SpanL via a simple guess-check-expand algorithm:
 1. (Guess) Guess a candidate small certificate c.
 2. (Check) If c is not a valid certificate, then reject.
 3. (Expand) Expand c into an encoding of a solution witnessed by c using
    S1 , . . . , Sn as follows: for i = 1 to n, if the `-selector σc determined by c
    mentions (i, e), then output e; otherwise, guess e ∈ Si and output e.
    It is clear that, on an input x, the above transducer uses logspace in the size
of x since the small certificates are logspace verifiable, and the solution domains
are logspace computable. Moreover, the number of distinct valid outputs of the
                                                             σc
                              S
transducer coincides with       c∈cert(x) [S1 , . . . , Sn ]    , and thus, with |sol(x)|. A
transducer of the form above is called k-guess-check-expand, for some integer
k ≥ 0, if the `-selectors are such that ` ≤ k.
#CQA(Q, Σ) via a Guess-Check-Expand Algorithm. It is interesting to
observe that the problem #CQA(Q, Σ), for some UCQ Q and set Σ of primary
keys, enjoys the two structural properties defined above; recall that an existential
positive query can be rewritten as a UCQ. More precisely, on input D:
 – A small certificate for a repair of rep(D, Σ) that entails Q is a pair (Q0 , h),
   where Q0 is a disjunct of Q, and h : var(Q0 ) → dom(D), such that h(Q0 ) ⊆ D
   and h(Q0 ) |= Σ.
 – The sequence of solution domains corresponds to the sequence B1 , . . . , Bn of
   the sets of blockΣ (D), according to some arbitrary, but fixed ordering. The
   `-selector σ(Q0 ,h) for B1 , . . . , Bn determined by a certificate (Q0 , h) mentions
   the pair (i, R(t̄)), i.e., it keeps the fact R(t̄) from Bi , iff h(Q0 ) ∩ Bi = {R(t̄)}
   and Σ has an R-key. The latter implies that ` is bounded by the number of
   atoms with a predicate that has a key over all disjuncts of Q, which does not
   depend on the input database D. Clearly, the number of repairs of rep(D, Σ)
   that entail Q is the number

                                   [
                                                [B1 , . . . , Bn ]σ(Q0 ,h) .
                              (Q0 ,h)∈cert(D)

   Therefore, the fact that #CQA(Q, Σ) is in SpanL can be shown via a k-guess-
check-expand algorithm, as the one presented below, where k is the number of
atoms in Q with a predicate with a key in Σ:
 1. (Guess) Guess a candidate small certificate (Q0 , h).
 2. (Check) If h(Q0 ) * D or h(Q0 ) 6|= Σ, then reject.
 3. (Expand) Expand (Q0 , h) into an encoding of a solution witnessed by (Q0 , h)
    as follows: for i = 1 to n, if h(Q0 ) ∩ Bi = {R(t̄)} and Σ has an R-key, then
    output R(t̄); otherwise, guess a fact α ∈ Bi and output α.
5.2     Semi-formal Definition and Properties

The guess-check-expand paradigm suggests a natural way to define a hierarchy
of classes inside SpanL. Intuitively, the k-th level of such hierarchy is defined
by considering k-guess-check-expand transducers. This intuition is at the core
of the Λ-hierarchy. In particular, for every integer k ≥ 0, Λ[k] is the class of
functions f : {0, 1}∗ → N such that, for every x ∈ {0, 1}∗ , f (x) is the number of
valid distinct outputs of a k-guess-check-expand transducer
                                                       S         Mf , with input x.
The Λ-hierarchy is defined as the infinite union Λ = k≥0 Λ[k].
    We proceed to discuss some key properties concerning the Λ-hierarchy. The
first one states that each level of the hierarchy is closed under logspace reduc-
tions, which is crucial for our complexity analysis.

Proposition 1. For each k ≥ 0, Λ[k] is closed under logspace reductions.

      The next result confirms that Λ is a hierarchy of classes inside SpanL:

Proposition 2. Λ[0] ⊆ · · · ⊆ Λ ⊆ SpanL, and Λ = SpanL implies L = NL.

   The next result shows that, apart from Λ[0] and Λ[1], the rest of the hierarchy
consists of “hard” complexity classes:

Proposition 3. Λ[1] ⊆ FP, and Λ[2] ⊆ FP implies P = NP.

   A direct corollary of Proposition 3 is that Λ[1] ( Λ[2] (unless P = NP). We
do not know whether the inclusions beyond Λ[2] are strict. This essentially asks
whether there exists an integer k ≥ 2 such that the Λ-hierarchy collapses to its
k-th level, i.e., whether Λ = Λ[k]. This is a non-trivial open question.


6      Establishing our Main Result

Having the Λ-hierarchy in place, we are now ready to explain how Theorem 1 is
shown. To this end, we concentrate on a parameterized version of #CQA(∃FO+ ),
where the so-called keywidth of the query w.r.t. the set of primary keys is
bounded by an integer k ≥ 0. Formally, the keywidth function (kw) maps pairs of
the form (Q, Σ), where Q ∈ ∃FO+ and Σ is a set of primary keys, to the naturals
such that kw (Q, Σ) = |{R(t̄) | R(t̄) occurs in Q, and Σ has an R-key}|. Then:

           PROBLEM : #CQAkw       +
                           k (∃FO )
           INPUT :   A database D, a set Σ of primary keys,
                    and a query Q ∈ ∃FO+ such that kw (Q, Σ) = k.
           OUTPUT : |{D 0 ∈ rep(D, Σ) | D 0 |= Q}|.


      The data complexity of the above problem is defined as expected. Then:

Theorem 2. For each k ≥ 0, #CQAkw    +      log
                               k (∃FO ) is ≤m -complete for Λ[k].
   By exploiting Theorem 2, we can now establish our main result (Theorem 1):
    (1) ⇒ (3). By hypothesis, there is a class C ⊆ SpanL such that #CQA(∃FO+ )
                                                    +
is ≤log
     m -complete for C. Thus, for every Q ∈ ∃FO        and set Σ of primary keys,
#CQA(Q, Σ) ∈ C. Moreover, there exists Q̂ and Σ̂ such that #CQA(Q̂, Σ̂) is
≤log                                                            log
  m -complete for C, which means that, for every f ∈ C, f ≤m #CQA(Q̂, Σ̂). By
Theorem 2, #CQA(Q̂, Σ̂) ∈ Λ[kw (Q̂, Σ̂)]. Since, by Proposition 1, Λ[kw (Q̂, Σ̂)] is
closed under logspace reductions, we conclude that f ∈ C implies f ∈ Λ[kw (Q̂, Σ̂)],
or, in other words, C ⊆ Λ[kw (Q̂, Σ̂)]. Hence, for every Q ∈ ∃FO+ and set Σ of
primary keys, #CQA(Q, Σ) ∈ Λ[kw (Q̂, Σ̂)]. Since, for every k ≥ 0,Sthere exists
Q0 and Σ 0 such that #CQA(Q0 , Σ 0 ) is ≤log
                                           m -complete for Λ[k], then     k≥0 Λ[k] =
Λ[kw (Q̂, Σ̂)]. Hence, the Λ-hierarchy collapses to its kw (Q̂, Σ̂)-th level.
    (3) ⇒ (2). By hypothesis, the Λ-hierarchy collapses to its k-the level, for some
k ≥ 0, i.e., Λ = Λ[k]. We proceed to show that (i) for every Q ∈ ∃FO+ and set
Σ of primary keys, #CQA(Q, Σ) ∈ Λ[k], and (ii) there exists Q̂ and Σ̂ such
that #CQA(Q̂, Σ̂) is ≤logm -complete for Λ[k]. For showing (i), fix some query Q ∈
∃FO+ and set Σ of primary keys. By Theorem 2, #CQA(Q, Σ) ∈ Λ[kw (Q, Σ)].
Since Λ[kw (Q, Σ)] ⊆ Λ = Λ[k], we immediately get that #CQA(Q, Σ) ∈ Λ[k].
Statement (ii) holds trivially since, by Theorem 2, there exists a ≤log  m -complete
function for every level of the Λ-hierarchy.
    (2) ⇒ (1). Since, for every k ≥ 0, Λ[k] is closed under logspace reductions
(Proposition 1), and Λ[k] ⊆ SpanL (Proposition 2), the claim follows.
The Next Step. Providing a firm answer to our main question boils down to
understanding whether the Λ-hierarchy collapses. Our conjecture is that it does
not collapse. Proving or disproving this conjecture is one of our priorities, which
will complete the picture concerning the data complexity of #CQA(∃FO+ ).
Acknowledgements. This work was supported by the EPSRC grants S003800
EQUID, M025268 VADA, and N023056 MAGIC.

References
1. Àlvarez, C., Jenner, B.: A very hard log-space counting class. Theor. Comput. Sci.
   107(1), 3–30 (1993)
2. Arenas, M., Bertossi, L.E., Chomicki, J.: Consistent query answers in inconsistent
   databases. In: PODS. pp. 68–79 (1999)
3. Calautti, M., Console, M., Pieris, A.: Counting database reparis under primary keys
   revisited. In: PODS (2019), to appear
4. Calautti, M., Libkin, L., Pieris, A.: An operational approach to consistent query
   answering. In: PODS. pp. 239–251 (2018)
5. Durand, A., Hermann, M., Kolaitis, P.G.: Subtractive reductions and complete prob-
   lems for counting complexity classes. Theor. Comput. Sci. 340(3), 496–513 (2005)
6. Maslowski, D., Wijsen, J.: A dichotomy in the complexity of counting database
   repairs. J. Comput. Syst. Sci. 79(6), 958–983 (2013)
7. Maslowski, D., Wijsen, J.: Counting database repairs that satisfy conjunctive queries
   with self-joins. In: ICDT. pp. 155–164 (2014)
8. Pagourtzis, A., Zachos, S.: The complexity of counting functions with easy decision
   version. In: MFCS. pp. 741–752 (2006)