Combined Complexity of Repair Checking and
           Consistent Query Answering

          Sebastian Arming, Reinhard Pichler, and Emanuel Sallinger

                         Vienna University of Technology
                 {arming,pichler,sallinger}@dbai.tuwien.ac.at


1    Introduction
Database management systems (DBMS) allow the definition of several forms of
integrity constraints (ICs) to specify restrictions on the data to be stored. The
DBMS provides checks that the stored data indeed satisfies the ICs. However, in
modern applications where data is integrated from several sources, violations of
the ICs may arise even if the data in each single source satisfies the ICs. Hence,
the handling of inconsistent data (i.e., data violating the given ICs) has evolved
as an active field of research, see e.g., [3, 5] for surveys. The foundations of this
research were laid by Arenas et al. in [2], where the concepts of a repair and of
consistent answers play key roles.
    Given a set of ICs C and a (presumably inconsistent) database instance D,
a repair of D w.r.t. C is a database instance I which satisfies C and which
differs minimally from D. Difference and minimality can be defined in several
ways. We follow the approach of [2] where repairs are obtained from the original
database by the insertion and deletion of tuples and minimality means that the
symmetric set difference ∆(D, I) is minimal w.r.t. subset inclusion. More formally,
let ∆(D, I) = (D \ I) ∪ (I \ D). Then I is a repair of D w.r.t. C if I |= C and
there does not exist an instance I 0 with I 0 |= C and ∆(D, I 0 ) ( ∆(D, I).
    The idea of consistent query answers is that even from an inconsistent database
instance D, we can derive consistent information, namely those answers to a
query that one would obtain over every repair I of D. More precisely, the set of
consistent
T          answers to a query Q for a given database D and ICs C is defined as
  {Q(I) | I is a repair of D w.r.t. C}.
The following decision problems are thus crucial to deal with inconsistent data:
     Repair Checking (RC)
     Instance: Databases D and I and a set of constraints C
     Question: Is I a repair of D w.r.t. C?
     Consistent Query Answering (CQA)
     Instance: A database D, a set of constraints C, and a Boolean query Q
     Question: Is Q true in all repairs of D w.r.t. C?
Apart from few exceptions (above all [1, 4]), most research on these decision
problems has focused on the data complexity, i.e., the ICs C and the query Q are
considered as fixed and only the database D is allowed to vary. The goal of our
work is a thorough analysis of the combined complexity of these two problems.
Hence, in addition to the database D, also the set of ICs C and (in case of the
CQA-problem) the query Q are part of the input. As an important special case,
we consider the complexity of these problems when the arity of the relation
symbols is bounded by a constant.
    As far as the queries Q are concerned, we concentrate on the fundamental
class of Boolean conjunctive queries. It can be easily verified that all of our
results carry over to arbitrary conjunctive queries (thus asking if a given tuple
is contained in the answer to Q over every repair I) and unions of conjunctive
queries. More powerful query languages are left for future work.
    We consider a number of different languages from which the ICs C are taken.
The languages considered here will range from First-Order logic as the most
powerful one to inclusion dependencies and key dependencies which are among
the least expressive ones. The contribution of this work is a fairly complete picture
of the combined complexity of the RC and CQA problems.


2    Overview of Results

In this section, we first introduce the considered IC languages and after that give
an overview of the complexity results for the RC and CQA problems. Figure 1
shows the hierarchy of the considered IC languages.


                           FO
                                                    IC L    RC         CQA      CQA b.a.
                                                      FO PSPACE MH     undec    undec
                ∨-tgd           UC                 ∨-tgd Π3 P H        undec    undec
                                                      tgd Π3 P N       undec    undec M
          tgd       full ∨-tgd       denial      lav-tgd    DP NH         NP O    NP
                                                       ID in P H          NP      NP M
lav-tgd         full tgd                              UC Π2 P H coNEXP-PNE H     Π3 P H
                                      egd
                                              full ∨-tgd Π2 P N coNEXP-PNE N     Π3 P N
                                                 full tgd   DP NH EXP-coNEXP NH Π2 P NH
    ID                                            denial    DP H         Π2 P O  Π2 P
                                      FD              egd   DP N         Π2 P    Π2 P
                                                      FD in P H          Π2 P    Π2 P
                                                      key in P           Π2 P    Π2 P M
                                      key

Fig. 1. Left: Hierarchy of IC languages. Right: Combined complexity of RC, CQA and
CQA with bounded arity. Black triangles indicate upper (H) and lower (N) bounds
shown in this paper. White triangles indicate previously known bounds.


Besides domain independent first order (FO) sentences, all studied languages
arise from restrictions on formulas of the form
                                                    n
                                                    _
                                ∀x(ϕ(x) ∧ β(x) →          ∃yi ψi (x, yi ))
                                                    i=1


                                                2
where ϕ, ψi are conjunctions of database atoms and β a quantifier-free formula
using only built-in (i.e. comparison) predicates. To ensure safety, we require that
every variable in x must occur in some atom in ϕ. For simplicity, we assume that
constraints do not contain constants (note however that our results still hold in
the presence of constants).
    We call such a constraint a full or a universal constraint (UC) if it contains
no existential quantifiers. A disjunctive tuple generating dependency (∨-tgd) has
empty β, while an ordinary tuple generating dependency (tgd) additionally has
n = 1. A local-as-view (lav) tgd is a tgd where ϕ is a single atom, and an inclusion
dependency (ID) is a lav tgd where also ψ1 is a single atom. A denial constraint
is of the form ∀x¬(ϕ(x) ∧ β(x)) and can be thought of as a universal constraint
with empty right hand side. An equality generating dependency (egd) is a denial
constraint where β is a single inequality. A functional dependency (FD) is an egd
where ϕ consists of two atoms over the same relation symbol. We recall that key
constraints are a special case of FDs.
The table in Figure 1 shows the combined complexity of RC, CQA and CQA with
bounded arity for each of the considered IC languages. Most results in the table
are completeness results. In particular, if a single complexity class is listed in a
cell, this denotes that the problem is complete for that class. Exceptions are that
we do not further classify problems in P and undecidable problems. Furthermore,
for the three cases where the exact complexity is not known, we list lower and
upper bounds (e.g., EXP-coNEXP denotes an EXP lower bound and a coNEXP
upper bound).
    For showing the results claimed in Figure 1, it is not necessary to separately
show membership and hardness for each single cell. The hierarchy shows the rich
inclusion structure between the languages (e.g., every ID is a lav tgd). Similarly,
CQA with bounded arity is a special case of CQA. Recall that lower bounds
propagate from more restricted problems to more general problems and upper
bounds propagate from more general to more restricted problems. Thus it suffices
to show only certain membership and hardness results.
    We use triangles in the table in Figure 1 to indicate the upper (O/H) and
lower (M/N) bounds that have to be shown in order to obtain all results given in
the table. Black triangles indicate upper bounds (H) and lower bounds (N) shown
in this paper. White triangles indicate upper bounds (O) and lower bounds (M)
given in previous work.
Theorem 1. For the decision problems RC, CQA and CQA with bounded arity,
the combined complexity is as depicted in the table in Figure 1.
In Section 3, we will give the intuition of the results obtained in this paper (black
triangles in Figure 1). In the remainder of this section, we summarize results
given in or implicit in previous work (white triangles).
    First, we consider the result known for RC. The PSPACE-hardness for FO fol-
lows immediately from the PSPACE-hardness of first-order model checking. Turn-
ing to CQA, [4] showed NP-completeness for IDs and mentioned Π2 P-completeness
for key constraints. We note that the NP-membership already holds for lav tgds,
and that the Π2 P-membership already holds for denial constraints. The final
previously known result from Figure 1 is the undecidability of CQA for arbitrary


                                         3
tgds, shown in [8]. A close inspection of the hardness proofs for CQA shows that
they already hold in case of bounded arity.
   As a concluding remark, note that in Figure 1 we do not separately list RC
with bounded arity. A quick inspection of our RC hardness proofs shows that
the combined complexity of RC does not change when arity is bounded.


3     Intuition of Proofs
In this section, we discuss the results obtained in this work. For space reasons,
we only give the intuition of the results. As membership results usually provide
more insight into the sources of complexity, this section will mostly focus on
membership results. More proof details can be found in the appendix.

3.1    Repair Checking
Repair checking has two fundamental sources of complexity, namely, checking
that the supposed repair is consistent, and checking that it is indeed minimal.
This immediately gives the following naive algorithm:
 1. check consistency
 2. try to guess a counter-example to minimality
    (a consistent database instance with smaller symmetric difference)
Since every database instance with a smaller symmetric difference has size at
most polynomial in the size of the input, the guess is polynomial. Thus, if we
know that the complexity of model checking of a constraint language is in M ,
then our naive algorithm yields an upper bound of M ∩ coNPM . For M = NP
(which is the case for lav tgds), the entire second step fits into coNP. Indeed, one
can simultaneously guess a counter-example (a database instance) and a witness
for its consistency. In this case, RC is in DP. For other classes M , the coNPM
factor dominates.
    In many cases, one cannot do better than that. In particular, we use the
upper bound given by the naive algorithm to show PSPACE-membership for FO,
Π3 P-membership for ∨-tgds, DP-membership for lav-tgds and Π2 P-membership
for UCs (four of the H in the column RC of Figure 1). We will discuss the
matching lower bounds later.
    In some cases, one actually can do better than the naive algorithm. For
full tgds and denial constraints, [7, Lemma 1] provides an NP test for checking
minimality of a candidate repair. Since model checking for these classes is in
coNP, this gives DP algorithms for RC.
    For FDs, we exploit the fact that during the minimality check, it is sufficient to
check the immediate supersets. This is the case since FDs can only be repaired by
deletions and since they are monotone in the sense that supersets of inconsistent
instances are always inconsistent. Since consistency can be checked in polynomial
time, we thus have a polynomial time algorithm for RC.
    For IDs, P-membership exploits the existence of a unique subset repair, i.e.,
a repair that can be obtained only by deletions. Such a subset repair can be


                                          4
computed in polynomial time in case of IDs. Using a construction similar to the
one given in [8, Lemma 4.8], we can exploit subset repairs to devise a polynomial-
time algorithm for the RC problem of IDs.
    We now proceed to discussing our hardness results. We show Π3 P-hardness for
tgds as well as Π2 P hardness for full ∨-tgds by reduction from the corresponding
QSAT problems. The DP-hardness results for lav tgds, full tgds, and egds are
shown by reduction from 3-colorability and its complement.
    An inspection of our proofs yields an interesting relationship between the roles
of consistency and minimality checking, our two orthogonal sources of complexity.
For tgds and full ∨ tgds, minimality checking dominates the complexity of RC. In
particular, hardness holds even if the given instance is promised to be consistent.
In contrast, for our DP results the role of consistency and minimality checking is
distributed between the NP and the coNP parts of DP (i.e. if one check is NP,
the other check is coNP). As a consequence, if the given instance is promised to
be consistent, the complexity of RC for lav tgds drops to coNP while for gav tgds
and egds it drops to NP. Finally, note that all results also hold for bounded arity.

3.2   Consistent Query Answering
The combined complexity of CQA with UCs was studied in [1], who showed
co2NEXP-membership and coNEXP hardness. We improve the upper bound to
PNE and show that already full ∨-tgds exhibit coNEXP-hardness.
    We first illustrate the key ideas of the PNE -membership proof for UCs. Let
(D, Q, C) be an instance of CQA. The crucial observation is that it never makes
sense to introduce fresh domain elements when repairing w.r.t. UCs. More
precisely, let G be the set of all ground atoms over the active domain of D. Then
every repair of D is a subset of G. We now give the following NEXPNP algorithm
for the co-problem of CQA.
 1. Guess I ⊆ G
 2. Check that I 6|= Q and I |= C
 3. Call an oracle to check that there is no J such that J |= C and that J has
    smaller symmetric difference to D than I
For verifying the complexity of our algorithm, observe that G has at most ex-
ponential size. Furthermore, note that checking whether a first-order    formula
ϕ is satisfied by a model M can be done in time O |ϕ|2 × |M ||ϕ| . This model-
                                                                   
checking algorithm can also be used inside the NP oracle by padding its input.
Thus our algorithm is indeed in NEXPNP . Since our algorithm uses only a single
oracle call per computation path, it follows from the work of Hemaspaandra [6]
that it belongs to the complexity class PNE .
    We now turn to the coNEXP-membership for full tgds. Note that the only
purpose of the NP oracle in the above NEXPNP algorithm is to check the minimality
of I. Recall that for full tgds, minimality can be checked in NP. By guessing the
witness of this minimality check together with I, we can avoid the oracle call.
This yields a coNEXP algorithm for CQA with full tgds.
    The coNEXP hardness for full ∨-tgds and the EXP-hardness for full tgds can
be shown by reductions from the evaluation problems of disjunctive Datalog and
Datalog.


                                         5
    Turning to the final column of the table in Figure 1, note that if the arity
of the relation symbols is bounded by a constant, the size G and therefore
the size of any repair is polynomial in the size of the input. This observation,
together with the complexity of the corresponding RC problems, immediately
gives Π3 P membership for UCs and Π2 P membership for full tgds. Again, we
provide matching lower bounds by reductions from the corresponding QSAT
problems.


4    Conclusion
In this work, we have provided a fairly complete picture of the combined complexity
of the RC- and CQA-problems. Only for the problem settings with EXP-hardness
or above, a gap between the lower and upper bounds has remained. Future
research should aim at closing this gap.
    We have considered a wide range of constraint languages. However, in the
literature, further classes of constraint languages can be found, such as binary
constraints [5] or weakly acyclic tgds [8]. The exploration of the combined
complexity of the RC- and CQA-problems for ICs from these classes has been
left for future work. Yet more importantly, settings with combinations of various
kinds of constraints (such as, e.g., inclusion dependencies with key dependencies)
should be further explored, thus extending work that was already started in [4].
    Finally, further problem variants deserve future investigation, such as adopting
different notions of repairs, restricting the allowed repair actions, or considering
different notions of minimality.
Acknowledgements. This work was supported by the Austrian Science Fund
(FWF):P25207-N23 and by the Vienna Science and Technology Fund (WWTF)
project ICT12-15.


References
1. Marcelo Arenas and Leopoldo Bertossi. On the decidability of consistent query
   answering. In AMW, 2010.
2. Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. Consistent query answers
   in inconsistent databases. In PODS, pages 68–79. ACM Press, 1999.
3. Leopoldo Bertossi. Consistent query answering in databases. ACM SIGMOD Record,
   35(2):68, 2006.
4. Andrea Calı̀, Domenico Lembo, and Riccardo Rosati. On the decidability and
   complexity of query answering over inconsistent and incomplete databases. In PODS,
   pages 260–271. ACM Press, 2003.
5. Jan Chomicki. Consistent Query Answering: Five Easy Pieces. In ICDT, pages 1–17,
   2007.
6. Lane A. Hemachandra. The strong exponential hierarchy collapses. J. Comput. Syst.
   Sci., 39(3):299–322, 1989.
7. Slawomir Staworko and Jan Chomicki. Consistent query answers in the presence of
   universal constraints. Information Systems, 35(1):1–22, 2010.
8. Balder ten Cate, Gaëlle Fontaine, and Phokion G. Kolaitis. On the data complexity
   of consistent query answering. In ICDT, pages 22–33. ACM Press, 2012.


                                         6