Counting Database Repairs under Primary Keys Revisited? (DISCUSSION PAPER) Marco Calautti, Marco Console, and Andreas Pieris School of Informatics, University of Edinburgh {mcalautt,mconsole,apieris}@inf.ed.ac.uk 1 Introduction Consistent query answering (CQA) [2] aims to deliver meaningful answers when queries are evaluated over inconsistent databases. Such answers, called consis- tent answers, must be true in all repairs, which are consistent databases whose difference from the inconsistent one is somehow minimal. Consistent answers only allow us to conclude that a tuple is either entailed by all repairs, or is not entailed by some repair. But, as discussed in [4], the former is too strict, while the latter is not very useful in practice. Instead, we would like to know how often a tuple is consistent, i.e., its relative frequency defined as the ratio between the number of repairs entailing the tuple, and the total number of repairs. The data complexity, i.e., when the query and the set of constraints are fixed, of the problem of computing the relative frequency of a tuple has been already studied in [6, 7] for conjunctive queries (CQs) and primary keys (i.e., when each relation has at most one key). We know that computing the total number of repairs is feasible in polynomial time. On the other hand, comput- ing the number of repairs that entail the given tuple is #P-complete. Recall that #P is a hard counting complexity class, which, roughly speaking, collects the function problems that ask for the number of solutions to an NP problem. The above #P-hardness relies on polynomial-time Turing, a.k.a. Cook, reduc- tions. However, as observed in the literature [5, 8], there are problems that are “hard-to-count-easy-to-decide” that cannot be complete (under reasonable as- sumptions) for #P under standard many-one logspace (or even polynomial-time) reductions. Note that the decision version of a counting problem asks whether the count is greater than zero. In our case, it asks whether a repair entailing a given tuple exists. As stated in [8], Cook reductions blur structural differences between counting problems and complexity classes. The reason is that #P is not closed under Cook reductions (under reasonable assumptions). Therefore, for such “hard-to-count-easy-to-decide” problems, a crucial question is whether © ? This is a short version of the paper [3]. Copyright 2019 for the individual papers by the papers authors. Copying permit- ted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy. we can determine their exact complexity w.r.t. subclasses of #P to which they belong. Ideally, we would like to show that such a problem is complete, under logspace reductions, for a class C ⊆ #P that is closed under such reductions. Our goal is to perform such a refined analysis for the data complexity of the problem of counting the number of repairs that entail a given tuple in the case of primary keys. We show that for arbitrary first-order queries this problem remains #P-complete. For existential positive queries (and CQs as well), we show that our problem cannot be complete (under reasonable assumptions) for #P, or even known classes inside #P. This brings us to our main question whether we can isolate a complexity class C inside #P that is closed under logspace reductions for which our problem (in the case of positive existential queries) is complete under such reductions. Let us stress that there is no guarantee that such a class C exists. This is because we focus on the data complexity of our problem. This means, by convention, that we deal with a family of problems and not a single problem; as usual, each query and set of primary keys gives rise to a new problem. Furthermore, establishing a completeness result for a certain complexity class C boils down to showing that (i) every problem in this family belongs to C, and (ii) there exists one problem that is C-complete. But, there is no guarantee that there exists a problem in this family such that all the other problems of the family are logspace reducible to it. This essentially tells us that the complexity class C inside #P that we are looking might not exist. Although we do not give a definite answer to our main question, we make a significant step towards such an answer. We isolate a hierarchy of new complexity classes inside #P, called the Λ-hierarchy, which is of independent interest, and show that if the class C that we are looking for exists, then it is one of the levels of the Λ-hierarchy. Moreover, whether C exists boils down to the question whether the Λ-hierarchy collapses. We conjecture that this is not the case. 2 Preliminaries Relational Databases. A (relational) schema S is a finite set of relation sym- bols (or predicates) with associated arity. A database D over a schema S is a set of facts of the form R(t̄), where R ∈ S, and t̄ = t1 , . . . , tn is a tuple of terms that are drawn from a countably infinite set C of constants. We write dom(D) for the active domain of D, that is, the set of constants occurring in D. Primary Key Constraints. A key constraint (or simply key) κ over a schema S is an expression of the form key(R) = {1, . . . , m}, where R ∈ S has arity n, and m ≤ n. Such an expression is called an R-key. A database D satisfies κ if, for every two facts R(t̄), R(s̄) ∈ D, t̄[i] = s̄[i] for each i ∈ {1, . . . , m} implies t̄ = s̄. We say that D is consistent w.r.t. a set Σ of keys, written D |= Σ, if D satisfies each key in Σ; otherwise, it is inconsistent. A set of primary keys Σ is a set of keys such that, for each predicate R, Σ has at most one R-key. For α = R(c1 , . . . , cn ), the key value of α w.r.t. Σ, denoted keyΣ (α), is defined as hR, hc1 , . . . , cm ii, if key(R) = {1, . . . , m} ∈ Σ, and hR, hc1 , . . . , cn ii otherwise. Let blockΣ (α, D) = {β ∈ D | keyΣ (β) = keyΣ (α)}, and blockΣ (D) = {blockΣ (α, D) | α ∈ D}. A repair of D w.r.t. Σ is a database obtained by choosing one fact from each B ∈ blockΣ (D). We denote the set of repairs of D w.r.t Σ as rep(D, Σ). Queries. A first-order query Q(x̄) over a schema S is an expression {x̄ | ϕ}, where ϕ is a first-order formula with free variables x̄ that mentions only atoms over S. We write FO, ∃FO+ and CQ, for the class of first-order, existential pos- itive, and conjunctive queries, respectively. The answer to Q(x̄) over a database D, denoted Q(D), is the set of tuples {c̄ ∈ dom(D)|x̄| | D |= ϕ(c̄)}. Complexity Toolbox. We recall well-known counting complexity classes, i.e., classes of counting functions of the form {0, 1}∗ → N. FP contains the functions that are computable in polynomial-time. #P contains the functions that compute the number of accepting computations of an NP machine. SpanL contains the functions that compute the number of distinct valid outputs of an NL transducer. The output of a computation is called valid if the machine halts in an accepting state. We know that SpanL ⊆ #P, and the inclusion is strict (under reasonable assumptions) [1]. Consider two counting functions f, g : {0, 1}∗ → N. We say that f is Cook reducible to g, denoted f ≤pT g, if there exists a polynomial-time deterministic machine M , with access to an oracle for g, such that, for every x ∈ {0, 1}∗ , f (x) = M (x). Moreover, we say that f is many-one logspace reducible ∗ ∗ to g, denoted f ≤log m g, if there exists a function h : {0, 1} → {0, 1} that is ∗ computable in logarithmic-space such that, for every x ∈ {0, 1} , f (x) = g(h(x)). As usual, a complexity class C is closed under ≤pT (resp., ≤logm ) if, for every two functions f, g, f ≤pT g (resp., f ≤log m g) and g ∈ C implies f ∈ C. 3 Problem Statement The main concern of this work is the problem of counting the number of repairs that entail a given tuple defined as follows; Q denotes a class of queries: PROBLEM : #CQA(Q) INPUT : A database D, a set Σ of primary keys, a query Q(x̄) ∈ Q, and a tuple t̄ ∈ dom(D)|x̄| . OUTPUT : |{D 0 ∈ rep(D, Σ) | t̄ ∈ Q(D 0 )}|. As usual, we are interested in the data complexity of #CQA(Q), where the set of constraints and the query are fixed, and only the database and the tuple are part of the input. To make this notion more precise, we need to define the following problem, where Q(x̄) ∈ Q is a query and Σ is a set of primary keys: PROBLEM : #CQA(Q(x̄), Σ) INPUT : A database D, and a tuple t̄ ∈ dom(D)|x̄| . OUTPUT : |{D 0 ∈ rep(D, Σ) | t̄ ∈ Q(D 0 )}|. We say that #CQA(Q) is R-complete for C in data complexity, where R is a class of reductions and C a complexity class, if (i) for each query Q ∈ Q and set Σ of primary keys, #CQA(Q, Σ) is in C, and (ii) there exists Q0 ∈ Q and set Σ 0 of primary keys such that #CQA(Q0 , Σ 0 ) is R-complete for C. The decision version of #CQA(Q), denoted #CQA>0 (Q), simply asks whether |{D0 ∈ rep(D, Σ) | t̄ ∈ Q(D0 )}| > 0. The data complexity of #CQA>0 (Q) is defined in the same way as for #CQA(Q). Henceforth, for clarity, we focus on Boolean queries, but all the results hold even if we consider non-Boolean queries. 4 Complexity of #CQA The complexity of #CQA has been already studied in [6, 7], and we know that #CQA(CQ) is ≤pT -complete for #P.1 The above result relies on Cook reductions. However, as discussed in Section 1, there are problems that are “hard-to-count- easy-to-decide”, which cannot be complete (under reasonable assumptions) for #P under many-one logspace (or polynomial-time) reductions. Hence, the first step is to understand for which classes of queries Q, #CQA(Q) is an “easy-to- decide” problem. After a preliminary analysis, we have concluded that in the case of arbitrary first-order queries our problem is not only “hard-to-count”, but also “hard-to-decide”. In fact, we can show that #CQA>0 (FO) is ≤log m -complete for NP, and #CQA(FO) is ≤log m -complete for #P. This indicates that #P is the right complexity for our problem when we focus on first-order queries. The situation for positive existential queries is more interesting. We can show that in this case we are dealing with a typical “hard-to-count-easy-to- decide” problem. In particular, we can show that #CQA>0 (∃FO+ ) is tractable, which in turn implies that #CQA(∃FO+ ) is not ≤log m -complete for #P (unless P = NP). This tells us that for pinpointing the complexity of #CQA(∃FO+ ) we should look for a subclass of #P, which is closed under logspace reductions. The obvious candidate is SpanL. Indeed, we can show that #CQA(∃FO+ ) is in SpanL. However, #CQA(∃FO+ ) is not ≤log m -complete for SpanL (unless L = NL). Not much is known about subclasses of SpanL for which our problem can be complete under logspace reductions. This brings us to our main question: Research Question: Is there a class C ⊆ SpanL, which is closed under logspace reductions, such that #CQA(∃FO+ ) is ≤log m -complete for C? Let us stress that since we focus on the data complexity of #CQA(∃FO+ ), as discussed in Section 1, the above question is not trivial in the sense that the existence of the class C is not guaranteed. Although we cannot give a definite answer to this question, we make a significant step towards such an answer. We isolate a hierarchy of new complexity classes inside #P, called the Λ-hierarchy, and show the following; we write Λ[k] for the k-th level of the hierarchy: Theorem 1. The following are equivalent: 1 All the results for #CQA and #CQA>0 in the rest of the paper concern their data complexity. Thus, for brevity, we will sometimes avoid to explicitly mention the term data complexity. 1. There exists a class C ⊆ SpanL, closed under logspace reductions, such that #CQA(∃FO+ ) is ≤log m -complete for C. 2. There exists k ≥ 0 such that #CQA(∃FO+ ) is ≤logm -complete for Λ[k]. 3. The Λ-hierarchy collapses. The above theorem essentially tells that either #CQA(∃FO+ ) is ≤log m -complete for Λ[k], for some k ≥ 0, or there is no complexity class C ⊆ SpanL, closed under many-one logspace reductions, such that #CQA(∃FO+ ) is ≤log m -complete for C. Thus, the right complexity class that characterizes the data complexity of our problem, if it exists, is one of the levels of the Λ-hierarchy. 5 The Λ-hierarchy We now proceed to introduce our hierarchy of complexity classes. The main technical challenge is to understand how to limit the computational power of NL transducers. This will lead to our new complexity classes inside SpanL that allow us to establish Theorem 1. To this end, we give a semi-formal introduction to this hierarchy, by presenting a couple of structural properties enjoyed by the functions therein. For further details, we refer the reader to [3]. 5.1 The Guess-Check-Expand Paradigm There is a generic paradigm, which we call guess-check-expand, that an NL trans- ducer can follow in order to place problems inside SpanL, assuming that the following structural properties are fulfilled: Small Certificates. A solution (in our case, a repair that entails the query) is witnessed via a small certificate – by small we mean of logarithmic size – that is verifiable in logspace. For the next property, we first need to give some auxiliary terminology. Given a sequence of sets S1 , . . . , Sn , an `-selector of it, for ` ≥ 0, is a sequence of pairs σ = (i1 , e1 ), . . . , (i` , e` ), where 1 ≤ i1 < . . . < i` ≤ n, and ej ∈ Sij for each j ∈ {1, . . . , `}. Intuitively, a pair (i, e) in σ tells us to “keep the element e from the set Si ”. We can then define the cartesian product of S1 , . . . , Sn w.r.t. σ, denoted [S1 , . . . , Sn ]σ , as S1σ × · · · × Snσ , where Siσ = {e} if σ contains a pair (i, e), otherwise, Siσ = Si . In simple words, [S1 , . . . , Sn ]σ is the cartesian product of S1 , . . . , Sn with the difference that ` sets are first replaced by a singleton set as dictated by the `-selector σ. We can now discuss the second structural property. For an input instance x, we write sol(x) for the set of solutions, and cert(x) for the set of certificates that witness a solution of sol(x). Solutions via Certificate Expansion. For every input x, there is a sequence of logspace computable sets S1 , . . . , Sn , called solution domains, such that [ |sol(x)| = [S1 , . . . , Sn ]σc , c∈cert(x) where σc is an `-selector for S1 , . . . , Sn determined S by the small certificate c. In fact, there exists a bijection enc from sol(x) to c∈cert(x) [S1 , . . . , Sn ]σc , and enc(s) should be understood as an encoding of the solution s. A function that enjoys the above structural properties can be shown to be in SpanL via a simple guess-check-expand algorithm: 1. (Guess) Guess a candidate small certificate c. 2. (Check) If c is not a valid certificate, then reject. 3. (Expand) Expand c into an encoding of a solution witnessed by c using S1 , . . . , Sn as follows: for i = 1 to n, if the `-selector σc determined by c mentions (i, e), then output e; otherwise, guess e ∈ Si and output e. It is clear that, on an input x, the above transducer uses logspace in the size of x since the small certificates are logspace verifiable, and the solution domains are logspace computable. Moreover, the number of distinct valid outputs of the σc S transducer coincides with c∈cert(x) [S1 , . . . , Sn ] , and thus, with |sol(x)|. A transducer of the form above is called k-guess-check-expand, for some integer k ≥ 0, if the `-selectors are such that ` ≤ k. #CQA(Q, Σ) via a Guess-Check-Expand Algorithm. It is interesting to observe that the problem #CQA(Q, Σ), for some UCQ Q and set Σ of primary keys, enjoys the two structural properties defined above; recall that an existential positive query can be rewritten as a UCQ. More precisely, on input D: – A small certificate for a repair of rep(D, Σ) that entails Q is a pair (Q0 , h), where Q0 is a disjunct of Q, and h : var(Q0 ) → dom(D), such that h(Q0 ) ⊆ D and h(Q0 ) |= Σ. – The sequence of solution domains corresponds to the sequence B1 , . . . , Bn of the sets of blockΣ (D), according to some arbitrary, but fixed ordering. The `-selector σ(Q0 ,h) for B1 , . . . , Bn determined by a certificate (Q0 , h) mentions the pair (i, R(t̄)), i.e., it keeps the fact R(t̄) from Bi , iff h(Q0 ) ∩ Bi = {R(t̄)} and Σ has an R-key. The latter implies that ` is bounded by the number of atoms with a predicate that has a key over all disjuncts of Q, which does not depend on the input database D. Clearly, the number of repairs of rep(D, Σ) that entail Q is the number [ [B1 , . . . , Bn ]σ(Q0 ,h) . (Q0 ,h)∈cert(D) Therefore, the fact that #CQA(Q, Σ) is in SpanL can be shown via a k-guess- check-expand algorithm, as the one presented below, where k is the number of atoms in Q with a predicate with a key in Σ: 1. (Guess) Guess a candidate small certificate (Q0 , h). 2. (Check) If h(Q0 ) * D or h(Q0 ) 6|= Σ, then reject. 3. (Expand) Expand (Q0 , h) into an encoding of a solution witnessed by (Q0 , h) as follows: for i = 1 to n, if h(Q0 ) ∩ Bi = {R(t̄)} and Σ has an R-key, then output R(t̄); otherwise, guess a fact α ∈ Bi and output α. 5.2 Semi-formal Definition and Properties The guess-check-expand paradigm suggests a natural way to define a hierarchy of classes inside SpanL. Intuitively, the k-th level of such hierarchy is defined by considering k-guess-check-expand transducers. This intuition is at the core of the Λ-hierarchy. In particular, for every integer k ≥ 0, Λ[k] is the class of functions f : {0, 1}∗ → N such that, for every x ∈ {0, 1}∗ , f (x) is the number of valid distinct outputs of a k-guess-check-expand transducer S Mf , with input x. The Λ-hierarchy is defined as the infinite union Λ = k≥0 Λ[k]. We proceed to discuss some key properties concerning the Λ-hierarchy. The first one states that each level of the hierarchy is closed under logspace reduc- tions, which is crucial for our complexity analysis. Proposition 1. For each k ≥ 0, Λ[k] is closed under logspace reductions. The next result confirms that Λ is a hierarchy of classes inside SpanL: Proposition 2. Λ[0] ⊆ · · · ⊆ Λ ⊆ SpanL, and Λ = SpanL implies L = NL. The next result shows that, apart from Λ[0] and Λ[1], the rest of the hierarchy consists of “hard” complexity classes: Proposition 3. Λ[1] ⊆ FP, and Λ[2] ⊆ FP implies P = NP. A direct corollary of Proposition 3 is that Λ[1] ( Λ[2] (unless P = NP). We do not know whether the inclusions beyond Λ[2] are strict. This essentially asks whether there exists an integer k ≥ 2 such that the Λ-hierarchy collapses to its k-th level, i.e., whether Λ = Λ[k]. This is a non-trivial open question. 6 Establishing our Main Result Having the Λ-hierarchy in place, we are now ready to explain how Theorem 1 is shown. To this end, we concentrate on a parameterized version of #CQA(∃FO+ ), where the so-called keywidth of the query w.r.t. the set of primary keys is bounded by an integer k ≥ 0. Formally, the keywidth function (kw) maps pairs of the form (Q, Σ), where Q ∈ ∃FO+ and Σ is a set of primary keys, to the naturals such that kw (Q, Σ) = |{R(t̄) | R(t̄) occurs in Q, and Σ has an R-key}|. Then: PROBLEM : #CQAkw + k (∃FO ) INPUT : A database D, a set Σ of primary keys, and a query Q ∈ ∃FO+ such that kw (Q, Σ) = k. OUTPUT : |{D 0 ∈ rep(D, Σ) | D 0 |= Q}|. The data complexity of the above problem is defined as expected. Then: Theorem 2. For each k ≥ 0, #CQAkw + log k (∃FO ) is ≤m -complete for Λ[k]. By exploiting Theorem 2, we can now establish our main result (Theorem 1): (1) ⇒ (3). By hypothesis, there is a class C ⊆ SpanL such that #CQA(∃FO+ ) + is ≤log m -complete for C. Thus, for every Q ∈ ∃FO and set Σ of primary keys, #CQA(Q, Σ) ∈ C. Moreover, there exists Q̂ and Σ̂ such that #CQA(Q̂, Σ̂) is ≤log log m -complete for C, which means that, for every f ∈ C, f ≤m #CQA(Q̂, Σ̂). By Theorem 2, #CQA(Q̂, Σ̂) ∈ Λ[kw (Q̂, Σ̂)]. Since, by Proposition 1, Λ[kw (Q̂, Σ̂)] is closed under logspace reductions, we conclude that f ∈ C implies f ∈ Λ[kw (Q̂, Σ̂)], or, in other words, C ⊆ Λ[kw (Q̂, Σ̂)]. Hence, for every Q ∈ ∃FO+ and set Σ of primary keys, #CQA(Q, Σ) ∈ Λ[kw (Q̂, Σ̂)]. Since, for every k ≥ 0,Sthere exists Q0 and Σ 0 such that #CQA(Q0 , Σ 0 ) is ≤log m -complete for Λ[k], then k≥0 Λ[k] = Λ[kw (Q̂, Σ̂)]. Hence, the Λ-hierarchy collapses to its kw (Q̂, Σ̂)-th level. (3) ⇒ (2). By hypothesis, the Λ-hierarchy collapses to its k-the level, for some k ≥ 0, i.e., Λ = Λ[k]. We proceed to show that (i) for every Q ∈ ∃FO+ and set Σ of primary keys, #CQA(Q, Σ) ∈ Λ[k], and (ii) there exists Q̂ and Σ̂ such that #CQA(Q̂, Σ̂) is ≤logm -complete for Λ[k]. For showing (i), fix some query Q ∈ ∃FO+ and set Σ of primary keys. By Theorem 2, #CQA(Q, Σ) ∈ Λ[kw (Q, Σ)]. Since Λ[kw (Q, Σ)] ⊆ Λ = Λ[k], we immediately get that #CQA(Q, Σ) ∈ Λ[k]. Statement (ii) holds trivially since, by Theorem 2, there exists a ≤log m -complete function for every level of the Λ-hierarchy. (2) ⇒ (1). Since, for every k ≥ 0, Λ[k] is closed under logspace reductions (Proposition 1), and Λ[k] ⊆ SpanL (Proposition 2), the claim follows. The Next Step. Providing a firm answer to our main question boils down to understanding whether the Λ-hierarchy collapses. Our conjecture is that it does not collapse. Proving or disproving this conjecture is one of our priorities, which will complete the picture concerning the data complexity of #CQA(∃FO+ ). Acknowledgements. This work was supported by the EPSRC grants S003800 EQUID, M025268 VADA, and N023056 MAGIC. References 1. Àlvarez, C., Jenner, B.: A very hard log-space counting class. Theor. Comput. Sci. 107(1), 3–30 (1993) 2. Arenas, M., Bertossi, L.E., Chomicki, J.: Consistent query answers in inconsistent databases. In: PODS. pp. 68–79 (1999) 3. Calautti, M., Console, M., Pieris, A.: Counting database reparis under primary keys revisited. In: PODS (2019), to appear 4. Calautti, M., Libkin, L., Pieris, A.: An operational approach to consistent query answering. In: PODS. pp. 239–251 (2018) 5. Durand, A., Hermann, M., Kolaitis, P.G.: Subtractive reductions and complete prob- lems for counting complexity classes. Theor. Comput. Sci. 340(3), 496–513 (2005) 6. Maslowski, D., Wijsen, J.: A dichotomy in the complexity of counting database repairs. J. Comput. Syst. Sci. 79(6), 958–983 (2013) 7. Maslowski, D., Wijsen, J.: Counting database repairs that satisfy conjunctive queries with self-joins. In: ICDT. pp. 155–164 (2014) 8. Pagourtzis, A., Zachos, S.: The complexity of counting functions with easy decision version. In: MFCS. pp. 741–752 (2006)