Recognizing Determinism in Prioritized Repairing of
                 Inconsistent Databases

                 Benny Kimelfeld, Ester Livshits, and Liat Peterfreund

                           Technion, Haifa 32000, Israel
                {bennyk,esterliv,liatpf}@cs.technion.ac.il


       Abstract. A repair of an inconsistent database is traditionally defined as a con-
       sistent database that differs from the inconsistent one in a “minimal way.” As
       there are often reasons to prefer one repair over another, researchers have in-
       troduced and investigated the framework of preferred repairs, where a priority
       relation between facts is lifted towards a priority relation between consistent
       databases, and repairs are restricted to ones that are optimal in the lifted sense. In
       this paper we describe our recent results on the complexity of deciding whether
       the priority relation suffices to clean the database unambiguously, or in other
       words, whether there is exactly one optimal repair. In particular, we show that
       different conventional semantics of priority lifting entail highly different com-
       plexities.


1    Introduction

Database inconsistency arises for various reasons and in different applications. In com-
mon applications of Big Data, information is obtained from imprecise sources (e.g., so-
cial encyclopedias or social networks) via imprecise procedures (e.g., natural-language
processing). It may also arise when integrating conflicting data from different sources
(each of which may be consistent). Arenas, Bertossi and Chomicki [3] introduced a
principled approach to managing inconsistency, via the notions of repairs and consis-
tent query answering. Informally, a repair of an inconsistent database I is a consistent
database J that differs from I in a “minimal” way.
    There are situations where it is natural to prefer one repair over another [5, 11, 16,
17]. For example, this is the case if one source is regarded more reliable than another
(e.g., enterprise data vs. Internet harvesting, and precise vs. imprecise sensing equip-
ment) or if available timestamp information implies that a more recent fact is more up
to date than an earlier one.
    Motivated by the above considerations, Staworko, Chomicki and Marcinkowski [16,
17] introduced the framework of preferred repairs. The main characteristic of this
framework is that it uses a priority relation between conflicting facts of an inconsistent
database, in order to define a notion of preferred repairs. Fagin et al. [9] have built on
that concept (in conjunction with the framework of document spanners [10]) to devise
a language for declaring inconsistency cleaning in text information-extraction systems.
They have shown there that preferred repairs capture ad-hoc cleaning operations and
strategies of some prominent systems for text analytics [2, 6].
     While the classic complexity problems studied in the theory of repairs include re-
pair checking [1] and consistent query answering [3, 15], the presence of preferences
gives rise to the determinism problem, which Staworko et al. [17] refer to as categoric-
ity: decide whether the provided priority relation suffices to clean the database unam-
biguously. In this paper we describe our recent results on analyzing the complexity of
categoricity in the framework of preferred repairs.


2      Formal Setup

We adopt the framework of preferred repairs devised by Staworko et al. [17], which has
also been used by Fagin et al. [8].
Prioritizing Inconsistent Databases. An inconsistent database over a schema S is a
database D that violates the constraints of S. We consider here two types of constraints.
One is the well known (and restricted) Functional Dependencies (FDs). An FD has the
form R : X → Y , and it states that every two facts of the relation R either agree
on the attribute set Y or disagree on the attribute set X. The more general type of
constraints considered here is the one given by means of a conflict hypergraph [7]. A
conflict hypergraph for a database instance I is a hypergraph H that has the facts of I
as its node set. A subinstance J of I is consistent with respect to (w.r.t.) H if J is an
independent set of H (that is, J contains no hyperedge of H).
     Recall that conflict hypergraphs can represent inconsistencies for various types of
integrity constraints, including FDs, the more general conditional FDs [4], and the more
general denial constraints [14]. In fact, every constraint that is anti-monotonic (i.e.,
where subsets of consistent sets are always consistent) can be represented as a conflict
hypergraph. In the case of denial constraints, the translation from the logical constraints
to the conflict hypergraph can be done in polynomial time under data complexity (i.e.,
when the schema signature and constraints are assumed to be fixed).
     A priority relation  over a database instance I is an acyclic binary relation over the
facts of I. The semantics of f  g is that “f is preferred to g” (e.g., because f comes
from a more trusted source, or because f is more recent). A prioritizing inconsistent
database is a triple (I, H, ) where I is a database instance, H is a conflict hypergraph
for I, and  is a priority relation over I with the property that  compares only between
facts that are neighbors in H.1 We say that  is total if for every two facts f and g in I,
if f and g are neighbors then either f  g or g  f . A priority c over I is a completion
of  (w.r.t. H) if  is a subset of c and c is total.
Preferred Repairs. Let D be an inconsistent prioritizing database. Staworko et al. [17]
define three different notions of preferred repairs: Pareto optimal, globally optimal,
and completion optimal. Informally speaking, the first two are based on the question
of whether a repair can be improved by replacing a set of facts with a preferable set of
facts. They differ in the way they define when one set of facts is considered preferable
to another. The third is based on the notion of completion.
 1
     This requirement has been made with the introduction of the framework [17]. Obviously, the
     lower bounds we present hold even without this requirement. Moreover, our main upper bound,
     Theorem 3, holds as well without this requirement.
    For the formal definitions, let (I, H, ) be an inconsistent prioritizing database, and
J and J 0 two distinct consistent subinstances of I. We say that J is a Pareto improve-
ment of J 0 if there exists a fact f ∈ J \ J 0 such that f  f 0 for all facts f 0 ∈ J 0 \ J.
We say that J is a global improvement of J 0 if for every fact f 0 ∈ J 0 \ J there exists a
fact f ∈ J \ J 0 such that f  f 0 . In words, J is a Pareto improvement of J 0 if in order
to obtain J from J 0 we insert and delete facts, and one of the inserted facts is preferred
to all deleted facts. And J is a global improvement of J 0 if in order to obtain J from
J 0 we insert and delete facts, and every deleted fact is compensated by some preferable
inserted fact.
    We then get the following variants of preferred repairs for an inconsistent prioritiz-
ing database D = (I, H, ). A consistent subinstance J of I is a: (a) Pareto-optimal
repair of D if there is no Pareto improvement of J, (b) globally-optimal repair of D if
there is no global improvement of J, and (c) completion-optimal repair of D if there
exists a completion c of  such that J is a globally-optimal repair of (I, H, c ).
    We remark that in the definition of a completion-optimal repair, we could replace
“globally-optimal” with “Pareto-optimal” and obtain an equivalent definition [17]. It
can be shown quite easily that under each of the three notions, an optimal repair is also
a repair in the sense of Arenas et al. [3].
Categoricity. For each of the three semantics, there is always at least one optimal
repair [17]. The problem of categoricity is that of deciding whether there is precisely
one optimal repair, and therefore the priority relation contains enough information to
clean the inconsistent database unambiguously. Formally, Pareto categoricity, global
categoricity and completion categoricity are the problems of deciding, given an incon-
sistent prioritizing database D, whether D has a single Pareto-optimal repair, a single
globally-optimal repair, and a single completion-optimal repair, respectively. The prob-
lem of repair uniqueness (in different repair semantics) is also referred to as determin-
ism [13] and fix certainty [12].


3   Complexity Results
We now describe our complexity results for categoricity under the three semantics of
optimal repairs. We begin with the Pareto case. The following theorem gives a di-
chotomy in data complexity, covering all possible schemas with functional dependen-
cies. As is often the case in dichotomy results, the main challenge in the proof is that of
covering all the hard cases.

Theorem 1. Let S be a relational schema with a set ∆ of FDs. Pareto categoricity
over S can be solved in polynomial time if the restriction of ∆ to every relation of S is
equivalent to a single FD. In every other case, the problem is coNP-complete.

   Theorem 1 shows that in the case of FDs, the class of schemas that have a tractable
Pareto categoricity is very limited. Next, we discuss global categoricity, and again focus
on FDs. Here we do not have a dichotomy, and establishing one is left as an open
problem for future research.

Theorem 2. Let S be a relational schema with a set ∆ of FDs. The following hold.
 1. If the restriction of ∆ to every relation of S is equivalent to a single FD, then global
    categoricity over S can be solved in polynomial time.
 2. Suppose that ∆ consists of two nontrivial FDs X → Y and W → Z on the same
    relation. Suppose also that each of W and Z contains an attribute that is in none
    of the other three attribute sets. Then global categoricity over S is Π2p -complete.
    Theorem 2 states that the tractable case of Pareto categoricity remains tractable in
the global case. On other schemas, the complexity can rise to an even higher complexity
class. Next, we consider the completion case, where the picture is far more positive.
Theorem 3. Completion categoricity is decidable in polynomial time, even in the gen-
eral case where conflicts are given by a conflict hypergraph.
     Our polynomial time algorithm for solving completion categoricity is extremely
simple, but its proof of correctness is quite intricate.
     Lastly, we observe that in our proof of Π2p -hardness for global categoricity, our
reduction constructs a nontransitive priority relation, and we ask whether transitivity
makes a difference. Let (I, H, ) be an inconsistent prioritizing database. We say that
 is transitive if for every two facts f and g in I, if f and g are neighbors in H and
 contains a path from f to g, then f  g. Transitivity is a natural assumption when
 is interpreted as a partial order such as “is of better quality than” or “is more current
than.” We can show that the three semantics remain different in the presence of tran-
sitivity. Yet, quite remarkably, the completion and global semantics coincide as far as
categoricity is concerned.
Theorem 4. Consider an inconsistent prioritizing database, where conflicts are given
by a conflict hypergraph and the priority relation is transitive. Then, there is a single
completion-optimal repair if and only if there is a single globally optimal repair.
   Combining Theorems 3 and 4, we get the following corollary that shows the major
impact of transitivity on global categoricity, when contrasted with Theorem 2.
Corollary 1. For transitive priority relations, global categoricity is decidable in poly-
nomial time, even in the general case where conflicts are given by a conflict hypergraph.


4   Conclusions
We introduced our recent results on the complexity of the categoricity problem in the
framework of preferred repairs [17], where the goal is to determine whether the pro-
vided priority relation suffices to repair the database unambiguously. We did not address
any qualitative discrimination among the three notions of repairs. Rather, we continue
the line of work [9, 10] that explores the impact of the choice on the entailed compu-
tational complexity. It has been established that, as far as repair checking is concerned,
the Pareto and the completion semantics behave way better than the global one [9]. In
this work we have shown that from the viewpoint of categoricity, the Pareto semantics
departs from the completion one by being likewise intractable (while the global seman-
tics hits an even higher complexity class); hence, the completion semantics outstands
so far as the most efficient option to adopt.
References
 1. F. N. Afrati and P. G. Kolaitis. Repair checking in inconsistent databases: algorithms and
    complexity. In ICDT, pages 31–41. ACM, 2009.
 2. D. E. Appelt and B. Onyshkevych. The common pattern specification language. In TIPSTER
    Text Program: Phase III, pages 23–30. Association for Computational Linguistics, 1998.
 3. M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsistent
    databases. In PODS, pages 68–79. ACM, 1999.
 4. P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional
    dependencies for data cleaning. In ICDE, pages 746–755. IEEE, 2007.
 5. Y. Cao, W. Fan, and W. Yu. Determining the relative accuracy of attributes. In SIGMOD,
    pages 565–576. ACM, 2013.
 6. L. Chiticariu, R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, and S. Vaithyanathan. Sys-
    temT: An algebraic approach to declarative information extraction. In ACL, pages 128–137,
    2010.
 7. J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple dele-
    tions. Inf. Comput., 197(1-2):90–121, 2005.
 8. R. Fagin, B. Kimelfeld, and P. G. Kolaitis. Dichotomies in the complexity of preferred
    repairs. In PODS, pages 3–15. ACM, 2015.
 9. R. Fagin, B. Kimelfeld, F. Reiss, and S. Vansummeren. Cleaning inconsistencies in informa-
    tion extraction via prioritized repairs. In PODS, pages 164–175. ACM, 2014.
10. R. Fagin, B. Kimelfeld, F. Reiss, and S. Vansummeren. Document spanners: A formal ap-
    proach to information extraction. J. ACM, 62(2):12, 2015.
11. W. Fan, F. Geerts, and J. Wijsen. Determining the currency of data. ACM Trans. Database
    Syst., 37(4):25, 2012.
12. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master
    data. VLDB J., 21(2):213–238, 2012.
13. W. Fan, S. Ma, N. Tang, and W. Yu. Interaction between record matching and data repairing.
    J. Data and Information Quality, 4(4):16:1–16:38, 2014.
14. T. Gaasterland, P. Godfrey, and J. Minker. An overview of cooperative answering. J. Intell.
    Inf. Syst., 1(2):123–157, 1992.
15. P. Koutris and J. Wijsen. The data complexity of consistent query answering for self-join-free
    conjunctive queries under primary key constraints. In PODS, pages 17–29. ACM, 2015.
16. S. Staworko, J. Chomicki, and J. Marcinkowski. Preference-driven querying of inconsistent
    relational databases. In EDBT Workshops, volume 4254 of LNCS, pages 318–335. Springer,
    2006.
17. S. Staworko, J. Chomicki, and J. Marcinkowski. Prioritized repairing and consistent query
    answering in relational databases. Ann. Math. Artif. Intell., 64(2-3):209–246, 2012.