Rewriting-based Check of Chase Termination

    Marco Calautti, Sergio Greco, Cristian Molinaro, and Irina Trubitsyna
       {calautti,greco,cmolinaro,trubitsyna}@dimes.unical.it

            DIMES, Università della Calabria, 87036 Rende (CS), Italy

Keywords: Chase, Data Dependencies, Data Exchange, Data Integration.


1    Introduction

The Chase is a fixpoint algorithm enforcing satisfaction of data dependencies
(also called constraints) in databases. It has been proposed more than thirty
years ago [2,18] and has seen a revival of interest in recent years in both database
theory and practical applications. Indeed, the availability of data coming from
different sources easily results in inconsistent or incomplete data (i.e., data not
satisfying data dependencies) and, therefore, techniques for fixing inconsistencies
are crucial [1,3,5,7,8,13,17].
    The chase algorithm is used, directly or indirectly, on an everyday basis by
people who design databases, and it is used in commercial systems to reason
about the consistency and correctness of a data design. New applications of
the chase in meta-data management, ontological reasoning, data exchange, data
cleaning, and query optimization have been proposed as well [6,9].
    The chase algorithm solves possible violations of constraints by inserting new
tuples, possibly containing null values [4]. The following example shows a case
where a given database does not satisfy a set of tuple generating dependencies
(TGDs) and the application of the chase algorithm produces a new consistent
database by adding tuples with nulls.

Example 1. Consider the following set of constraints Σ1 describing departments
and their employees:
            ∀x ∀y Department(x) ∧ M anaged(x, y) → Employee(y)
            ∀x Employee(x) → ∃ y W orksF or(x, y)
            ∀x ∀y W orksF or(x, y) → ∃ z M anaged(y, z)
Consider the database D = {Department(d), M anaged(d, m)}. Since the first
constraint is not satisfied, the tuple Employee(m) is inserted. This update oper-
ation fires the second constraint to insert the tuple W orksF or(m, η1 ), which in
turn fires the third constraint so that the tuple M anaged(η1 , η2 ) is added to the
database (η1 and η2 are new labeled nulls). At this point, the chase terminates
since the database is consistent, that is, all dependencies are satisfied.        2

    Unfortunately, the chase algorithm may not terminate. For instance, in Exam-
ple 1 if we delete from the first constraint the atom Department(x), the chase
never terminates and adds an infinite number of tuples to the database. It has
been formally proved in [12] that the problem of deciding whether the chase
procedure terminates is semi-decidable. The first and basic effort concerning the
formalization of a (decidable) sufficient condition guaranteeing chase termina-
tion is weak acyclicity [11]. Informally, it checks whether the constraints do not
allow for nulls to cyclically propagate inside predicates’ positions. Considering
the example above, we have that a value in the second position of predicate
M anaged is copied to Employee (denoted by M2 → E1 ). This forces the in-
troduction of a new null value in the second position of W orksF or, (denoted
         ∗
as E1 →     W F2 ); this value is then copied in the first position of M anaged
(W F2 → M1 ) and it also forces the introduction of a new null value in the se-
                                     ∗
cond position of M anaged (W F2 →      M2 ). Since it is possible to reach position
                                                     ∗
M2 from itself through a connection of the form →      , an infinite number of nulls
could be introduced during the chase procedure.
    Different extensions of weak acyclicity have been proposed. Safety [20] and
super-weak acyclicity [19] identify the positions in which null values can be prop-
agated. Stratification-based approaches [10,20,16] analyse whether dependencies
may fire each other and thus propagate null values from one to another. See
[14] for a comprehensive survey on this topic. Nevertheless, despite the previ-
ously mentioned results, there are still important classes of terminating data
dependencies which are not identified by any of the previously mentioned cri-
teria: Example 1 showed one such a case. To overcome such limitations, rewrit-
ing techniques have been proposed [15,16]. In the following section we give an
overview on them and show how the constraints of Example 1 can be rewritten
by using predicate adornments in order to allow simple termination conditions
(even the simplest one, weak-acyclicity) to understand that the chase procedure
terminates. Issues regarding the extension of these techniques to managing also
equality generating dependencies (EGDs) are discussed in Section 3.


2   Constraint Rewriting
We start by introducing the basic idea of the Adn technique [15], which can
be used in conjunction with current termination criteria, enabling us to detect
more sets of constraints as terminating. The technique consists of rewriting a set
of TGDs Σ into a new set Σ α which is “better” than the original one for the
purpose of checking termination. Rather than applying a termination criterion to
Σ, the new set Σ α is used and if Σ α satisfies the criterion then chase termination
for Σ is guaranteed. This allows us to recognize larger classes of constraints for
which chase termination is guaranteed: if Σ satisfies chase termination criterion
C, then the rewritten set Σ α satisfies C as well, but the vice versa is not true,
that is, there are significant classes of constraints for which Σ α satisfies C and
Σ does not.
Example 2. Consider again the set of TGDs Σ1 . The Adn technique first rewrites
TGDs by associating strings of b symbols to body atoms and to head positions
containing universally quantified variables. Then, f symbols are associated for
existentially quantified variables. This new set of TGDs is denoted by Base(Σ1 ):
          ∀x ∀y Departmentb (x) ∧ M anagedbb (x, y) → Employeeb (y)
          ∀x Employeeb (x) → ∃ y W orksF orbf (x, y)
          ∀x ∀y W orksF orbb (x, y) → ∃ z M anagedbf (y, z)

Subsequently, because of the presence of atoms W orksF orbf (x, y) and M anagedbf (x, y)
in Base(Σ1 ), the rewriting continues by producing the following set of TGDs
Derived(Σ1 ):
          ∀x ∀y W orksF orbf (x, y) → ∃ z M anagedf f (y, z)
          ∀x ∀y Departmentb (x) ∧ M anagedbf (x, y) → Employeef (y)
          ∀x Employeef (x) → ∃ y W orksF orf f (x, y)
          ∀x ∀y W orksF orf f (x, y) → ∃ z M anagedf f (y, z)

    At this point, the generation of Derived(Σ1 ) terminates, since the atom
Departmentb (x) cannot be joined with M anagedf f (x, y) to produce a new ador-
ned TGD. The rewritten set of TGDs Adn(Σ1 ) is weakly-acyclic, whereas the
original set Σ1 is not recognized by any chase termination criteria.         2

Rewriting Algorithm Improvement. The rewriting algorithm Adn has been fur-
ther improved into the Adn+ algorithm [16] by using different adornments for
each existentially quantified variable and by considering how TGDs may fire each
other in the generation of adorned atoms. During the rewriting process, this al-
gorithm also performs a basic cyclicity check, allowing to eventually determine
the termination of the chase, without necessarily relying on other criteria. The
new criterion is called Acyclicity. To the best of our knowledge, the class of
TGDs recognized by this criterion is the most general class known so far.


3   Adding EGDs

In the previous sections, we have considered the case where all constraints are
TGDs. In this section, we show how the chase termination problem radically
changes when we allow also EGDs.
    Given a set of TGDs Σ for which the chase does not terminate, we can show
that the addition of EGDs to Σ may allow to have a terminating chase sequence.
On the other hand, if the chase always terminates for Σ, adding EGDs to Σ may
make the chase of Σ non-terminating.

Example 3. Consider the following two sets of constraints Σ3 (left) and Σ30
(right):
      r1 : ∀x A(x) → ∃y N (y)          r10 : ∀x N (x) → ∃y ∃z S(x, y, z)
      r2 : ∀x N (x) → ∃y E(x, y)       r20 : ∀x ∀y ∀z S(x, y, y) → N (y)
      r3 : ∀x ∀y E(x, y) → N (y)       r30 : ∀x ∀y ∀z S(x, y, z) → T (x, y, z)
      r4 : ∀x ∀y E(x, y) → x = y       r40 : ∀x ∀y ∀z T (x, y, z) → y = z
and the database D = {N (a)}. The chase applied to the database D and the sub-
set of TGDs {r1 , r2 } of Σ3 is not terminating as it introduces an infinite number
of tuples E(η1 , η2), E(η3 , η1 ), ... The introduction of the EGD r3 allows to have
a terminating sequence, which produces the universal solution {N (a), E(a, a)}.
    The subset of TGDs {r10 , r20 , r30 } of Σ30 is terminating for all database in-
stances as recognized by several criteria (e.g., super-weak acyclicity). However,
the chase fixpoint applied to Σ30 and the database D is non-terminating as it intro-
duces an infinite number of tuples S(a, η1 , η1 ), T (a, η1 , η1 ), N (η1 ), S(η1 , η2 , η2 ),
T (η1 , η2 , η2 ), N (η2 ), ....                                                           2

    As shown in the previous example, when for a set of dependencies it is not
the case that every chase sequence is terminating, the existence of at least one
terminating chase sequence, for every database, might still be guaranteed. Thus,
one could extend rewriting techniques such as Adn+ to sets of TGDs and EGDs,
in order to find whether there exists, for every database, at least one terminating
chase sequence. In order to cope with the aforementioned issues, algorithm Adn+
can be extended in such a way that some adornments generated by rewriting
TGDs are changed in order to satisfy the head equalities of EGDs. Specifically,
the algorithm first tries to adorn as many EGDs as possible, and then consider
the rewriting of a single TGD. The basic idea is illustrated in the following
example.

Example 4. Consider the set of dependencies Σ3 of Example 3. As initially EGD
r4 cannot be adorned, TGD r1 is rewritten into:

                                 ∀x Ab (x) → ∃y N f1 (y)

and r2 is rewritten into:

                               ∀x N b (x) → ∃y E bf2 (x, y)

   Now, EGD r4 can be used to “merge” distinct symbols. This is accomplished
by constructing the following adorned version of r4 using the atom E bf2 (x, y):

                                ∀x ∀y E bf2 (x, y) → x = y

   This indicates that every occurrence of the symbol f2 in the obtained adorned
dependencies has to be replaced with b, thereby obtaining:

                              ∀x Ab (x) → ∃y N f1 (y)
                              ∀x ∀y N b (x) → ∃y E bb (x, y)
                              ∀x ∀y E bb (x, y) → x = y
Then, TGD r3 is adorned, obtaining:

                                ∀x ∀y E bb (x, y) → N b (y)

Then, TGD r2 is adorned using atom N f1 (x), obtaining:

                            ∀x ∀y N f1 (x) → ∃y E f1 f3 (x, y)
   Again, EGD r4 is used, getting:

                             ∀x ∀y E f1 f3 (x, y) → x = y

Consequently, f3 is replaced with f1 and we get:

                          ∀x Ab (x) → ∃y N f1 (y)
                          ∀x ∀y N b (x) → ∃y E bb (x, y)
                          ∀x ∀y E bb (x, y) → N b (y)
                          ∀x ∀y N f1 (x) → ∃y E f1 f1 (x, y)
                          ∀x ∀y E bb (x, y) → x = y
                          ∀x ∀y E f1 f1 (x, y) → x = y
Finally, atom E f1 f1 (x, y) is used to adorn r3 , obtaining:

                            ∀x ∀y E f1 f1 (x, y) → N f1 (y)

At this point, the rewriting stops, since no new adorned dependency can be con-
structed. Intuitively, the algorithm identifies the existence of a terminating chase
sequence because symbols f1 , f2 , f3 , which represent nulls constructed w.r.t. the
symbols occurring in the body of the TGD, are not “cyclic” in the following
sense. Symbol f1 “depends on” symbol b in body(r1 ), f2 depends on symbol b
in body(r2 ), and f3 depends on symbol f1 . Since no pair of symbols fi , fj in the
final set of dependency is such that fi depends on fj and vice versa, the set Σ3
has a terminating chase sequence.                                                 2


References
 1. F. N. Afrati and P. G. Kolaitis. Repair checking in inconsistent databases: algo-
    rithms and complexity. In ICDT, pages 31–41, 2009.
 2. A. V. Aho, C. Beeri, and J. D. Ullman. The theory of joins in relational databases.
    ACM Trans. Database Syst., 4(3):297–314, 1979.
 3. M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsis-
    tent databases. In PODS, pages 68–79, 1999.
 4. C. Beeri and M. Y. Vardi. Formal systems for tuple and equality generating
    dependencies. SIAM J. Comput., 13(1):76–98, 1984.
 5. L. E. Bertossi. Consistent query answering in databases. SIGMOD Record,
    35(2):68–76, 2006.
 6. A. Cali, G. Gottlob, and A. Pieris. Advanced processing for ontological queries.
    PVLDB, 3(1):554–565, 2010.
 7. L. Caroprese, S. Greco, and E. Zumpano. Active integrity constraints for database
    consistency maintenance. IEEE Trans. Knowl. Data Eng., 21(7):1042–1058, 2009.
 8. J. Chomicki. Consistent query answering: Five easy pieces. In ICDT, pages 1–17,
    2007.
 9. G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. On reconciling data
    exchange, data integration, and peer data management. In PODS, pages 133–142,
    2007.
10. A. Deutsch, A. Nash, and J. B. Remmel. The chase revisited. In PODS, pages
    149–158, 2008.
11. R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: semantics and
    query answering. Th. Comp. Sc., 336(1):89–124, 2005.
12. T. Gogacz and J. Marcinkowski. All-instances termination of chase is undecidable.
    In ICALP, pages 293–304, 2014.
13. G. Greco, S. Greco, and E. Zumpano. A logical framework for querying and re-
    pairing inconsistent databases. TKDE, 15(6):1389–1408, 2003.
14. S. Greco, C. Molinaro, and F. Spezzano. Incomplete Data and Data Dependencies
    in Relational Databases. Synthesis Lectures on Data Management. Morgan &
    Claypool Publishers, 2012.
15. S. Greco and F. Spezzano. Chase termination: A constraints rewriting approach.
    PVLDB, 3(1):93–104, 2010.
16. S. Greco, F. Spezzano, and I. Trubitsyna. Stratification criteria and rewriting
    techniques for checking chase termination. PVLDB, 4(11):1158–1168, 2011.
17. M. Lenzerini. Data integration: A theoretical perspective. In PODS, pages 233–
    246, 2002.
18. D. Maier, A. O. Mendelzon, and Y. Sagiv. Testing implications of data dependen-
    cies. ACM Trans. Database Syst., 4(4):455–469, 1979.
19. B. Marnette. Generalized schema-mappings: from termination to tractability. In
    PODS, pages 13–22, 2009.
20. M. Meier, M. Schmidt, and G. Lausen. On chase termination beyond stratification.
    CoRR, abs/0906.4228, 2009.