Stop the Chase: Short Contribution Michael Meier⋆ , Michael Schmidt⋆ and Georg Lausen University of Freiburg, Institute for Computer Science Georges-Köhler-Allee, 79110 Freiburg, Germany {meierm,mschmidt,lausen}@informatik.uni-freiburg.de Abstract. The chase procedure, an algorithm proposed 25+ years ago to fix con- straint violations in database instances, has been successfully applied in a variety of contexts, such as query optimization and data exchange. Its practicability, how- ever, is limited by the fact that – for an arbitrary set of constraints – it might not terminate; even worse, chase termination is an undecidable problem in general. In response, the database community has proposed sufficient restrictions on top of the constraints that guarantee chase termination on any database instance. In this paper, we propose a sufficient termination condition, called inductive restriction, which strictly generalizes previous conditions, but can be checked as efficiently. 1 Introduction The chase procedure is a fundamental algorithm that has been successfully applied in a variety of database applications [7, 10, 5, 9, 11, 15, 2, 1, 13]. Originally proposed to tackle the implication problem for data dependencies [7, 5] and to optimize Conjunctive Queries (CQs) under data dependencies [3, 10], it has become a central tool in Semantic Query Optimization (SQO) [14, 1, 16]. For instance, the chase can be used to enumer- ate minimal CQs under a set of dependencies [1], thus supporting the search for more efficient query evaluation plans. Beyond SQO, it has been applied in many other con- texts, such as data exchange [15], peer data exchange [2], data integration [11], query answering using views [9], and probabilistic databases [13]. The core idea of the chase algorithm is simple: given a set of dependencies (also called constraints) over a database schema and an instance as input, it fixes constraint viola- tions in the instance. One problem with the chase, however, is that – given an arbitrary set of constraints – it might never terminate; even worse, this problem is undecidable in general, also for a fixed instance [4]. Addressing this issue, sufficient conditions for the constraints that guarantee termination on any database instance have been proposed [15, 4, 16]. Such conditions are the central topic in this paper. We introduce the class of in- ductively restricted constraints, for which the chase terminates in polynomial time data complexity. Like existent sufficient termination conditions, inductive restriction asserts that there are no positions in the schema where fresh labeled nulls might be cyclically created during chase application. It relies on a sophisticated study of (a) positions in the database schema where null values might appear, (b) subsets of the constraints that cyclically pass null values, and (c) connections between such cycles. The combination ⋆ The work of this author was funded by DFG grant GRK 806/3. of these aspects makes inductive restriction more general than previous sufficient ter- mination conditions, thus making a larger class of constraints amenable to the chase. Structure. We start with some preliminaries in the following section. Section 3 in- troduces inductive restriction, our sufficient data-independent termination condition. Finally, Section 4 concludes the paper. Remark. An extended version of this paper including full proofs can be found in [12]. 2 Preliminaries General mathematical notation. For n ∈ N, we denote by [n] the set {1, ..., n}. For a set M , we denote by 2M its powerset. Databases. We fix three pairwise disjoint infinite sets: the set of constants ∆, the set of labeled nulls ∆null , and the set of variables V . A database schema R is a finite set of relational symbols {R1 , ..., Rn }. In the rest of the paper, we assume the database schema and the set of constants and labeled nulls to be fixed. A database instance I is a finite set of R-atoms that contains only elements from ∆ ∪ ∆null in its positions. We denote an element of an instance as fact. The domain of I, dom(I), is the set of elements from ∆ ∪ ∆null that appear in I. We use the term position to denote a position in a predicate, e.g. a three-ary predicate R has three positions R1 , R2 , R3 . We say that a variable, labeled null, or constant c appears e.g. in a position R1 if there exists a fact R(c, ...). Constraints. Let x, y be tuples of variables. We consider two types of database con- straints: tuple generating dependencies (TGDs) and equality generating dependencies (EGDs). A TGD has the form α := ∀x(φ(x) → ∃yψ(x, y)) such that both φ and ψ are conjunctions of atomic and equality-free R-atoms, possibly with parameters from ∆ and all variables from x that occur in ψ must also occur in φ. We denote by pos(α) the set of positions in φ. An EGD has the form α := ∀x(φ(x) → xi = xj ), where xi , xj occur in φ and φ is a non-empty conjunction of equality-free R-atoms, possibly with parameters from ∆. We denote by pos(α) the set of positions in φ. As a nota- tional convenience, we will often omit the ∀-quantifier and respective listSof universally quantified variables. For a set of TGDs and EGDs Σ we set pos(Σ) := ξ∈Σ pos(ξ). Chase. We assume that the reader is familiar with the chase procedure and give only a short introduction here, referring the interested reader to [15] for a more detailed α,a discussion. A chase step I → J takes a relational database instance I such that I 2 α(a) and adds tuples (in case of TGDs) or collapses some elements (in case of EGDs) such that the resulting relational database J is a model of α(a). If J was obtained from I in that kind, we sometimes also write Ia ⊕ Cα instead of J. A chase sequence is an α0 ,a0 α1 ,a1 exhaustive application of applicable constraints I0 −→ I1 −→ . . ., where we impose no strict order on what constraint to apply in case several constraints are applicable. If this sequence is finite, say Ir being its final element, the chase terminates and its result I0Σ is defined as Ir . The length of this chase sequence is r. Note that different orders of application orders may lead to a different chase result. However, as proven in [15], two different chase orders always lead to homomorphically equivalent results, if these exist. Therefore, we write I Σ for the result of the chase on an instance I under constraints Σ. It has been shown in [7, 5, 10] that I Σ |= Σ. If a chase step cannot be performed (e.g., because application of an EGD would have to equate two constants) or in case of an infinite chase sequence, the result of the chase is undefined. 3 Data-independent Chase Termination In the past, sufficient conditions for constraint sets have been developed that guarantee chase termination for any instance. One such condition is weak acyclicity [15], which asserts that there are no cyclically connected positions in the constraint set that may in- troduce fresh labeled null values, by a global study of relations between the constraints. In [4], weak acyclicity was generalized to stratification, which enforces weak acyclic- ity only locally, for subsets of constraints that might cyclically cause to fire each other. We further generalized stratification to safe restriction in [16]. We start by reviewing its central ideas and formal definition, which form the basis for our novel condition inductive restriction. Safe Restriction. The idea of safe restriction is to keep track of positions where fresh null values might be created in or copied to. As a basic tool, we borrow the definition of affected positions from [6]. We emphasize that, in [6], this definition has been used in a different context: there, the constraints are interpreted as axioms that are used to derive new facts from the database and the problem is query answering on the implied database, using the chase as a central tool. Definition 1. [6] Let Σ be a set of TGDs. The set of affected positions aff(Σ) is defined inductively as follows. Let π be a position in the head of an α ∈ Σ. • If an existentially quantified variable appears in π, then π ∈ aff(Σ). • If the same universally quantified variable X appears both in position π, and only in affected positions in the body of α, then π ∈ aff(Σ).  Akin to the dependency graph in weak acyclicity [15], we define a safety condition that asserts the absence of cycles through constraints that may introduce fresh null val- ues. As an improvement, we exhibit the observation that only values created due to or copied from affected positions may cause non-termination. We introduce the notion of propagation graph, which refines the dependency graph from [15] by taking affected positions into consideration. Definition 2. Let Σ be a set of TGDs. We define a directed graph called propagation graph prop(Σ) := (aff(Σ), E) as follows. There are two kinds of edges in E. Add them as follows: for every TGD ∀x(φ(x) → ∃yψ(x, y)) ∈ Σ and for every x in x that occurs in ψ and every occurrence of x in φ in position π1 • if x occurs only in affected positions in φ then, for every occurrence of x in ψ in position π2 , add an edge π1 → π2 (if it does not already exist). • if x occurs only in affected positions in φ then, for every existentially quantified ∗ variable y and for every occurrence of y in a position π2 , add a special edge π1 → π2 (if it does not already exist).  Definition 3. A set Σ of constraints is called safe iff prop(Σ) has no cycles going through a special edge.  Safety is a sufficient termination condition which strictly generalizes weak acyclicity and is different from stratification [16]. The idea behind safe restriction now is to assert safety locally, for subsets of the constraints that may cyclically cause each other to fire in such a way that null values are passed in these cycles. Definition 4. Let Σ abe given and P ⊆ pos(Σ). For all α, β ∈ Σ, we define α ≺P β iff there are tuples a, b and a database instance I s.t. (i) I 2 α(a), (ii) I |= β(b), (iii) α,a I → J, (iv) J 2 β(b), (v) I contains null values only in positions from P and (vi) there is a null value n ∈ b ∩ ∆null in the head of β(b).  Informally, α ≺P β holds if α might cause β to fire s.t., when null values occur only in positions from P, β copies some null values. We next introduce a notion for affected positions relative to a constraint and a set of positions. Definition 5. For any set of positions P and a TGD α let aff-cl(α, P ) be the set of positions π from the head of α such that • for every universally quantified variable x in π: x occurs in the body of α only in positions from P or • π contains an existentially quantified variable.  On top of previous definitions we introduce the central tool of restriction systems. Definition 6. A restriction system is a pair (G′ (Σ), f ), where G′ (Σ) := (Σ, E) is a directed graph and f : Σ → 2pos(Σ) is a function such that • forall TGDs α and forall (α, β) ∈ E: aff-cl(α, f (α)) ∩ pos({β}) ⊆ f (β), • forall EGDs α and forall (α, β) ∈ E: f (α) ∩ pos({β}) ⊆ f (β), and • forall α, β ∈ Σ: α ≺f (α) β =⇒ (α, β) ∈ E. A restriction system is minimal if it is obtained from ((Σ, ∅),{(α, ∅) | α ∈ Σ}) by a repeated application of the constraints from bullets one to three (until all constraints hold) s.t., in case of the first and second bullet, the image of f (β) is extended only by those positions that are required to satisfy the condition.  Example 1. Let predicate E(x,y) store graph edges and predicate S(x) store some nodes. The constraints Σ = {α1 , α2 } with α1 := S(x), E(x,y) → E(y,x) and α2 := S(x), E(x,y) → ∃z E(y,z), E(z,x) assert that all nodes in S have a cycle of length 1 and 2. It holds that aff(Σ) = {E1 ,E2 } and it is easy to verify that Σ is neither safe nor strati- fied (see Def. 2 in [4]). The minimal restriction system for Σ is G’(Σ):=(Σ,{(α2 ,α1 )}) with f(α1 ) := {E1 ,E2 } and f(α2 ) := ∅; in particular, α1 6≺f (α1 ) α1 , α1 6≺f (α1 ) α2 , α2 ≺f (α2 ) α1 , and α2 6≺f (α2 ) α2 hold.  As shown in [16], the minimal restriction system is unique and can be computed by an NP-algorithm. We are ready to define the notion of safe restriction: Definition 7. Σ is called safely restricted if and only if every strongly connected com- ponent of its minimal restriction system is safe.  Example 2. Constraint set Σ from Example 1 is safely restricted: its minimal restriction system contains no strongly connected components.  part(Σ: Set of TDGs and EGDs) { 1: compute the strongly connected components (as sets of constraints) C1 , . . . , Cn of the minimal restriction system of Σ; 2: D ← ∅ 3: if (n == 1) then 4: if (C1 6= Σ) then return part(C1 ); endif 7: return {Σ}; 8: endif 6: for i=1 to n do D ← D ∪ part(Ci ); endfor 11: return D; } Fig. 1. Algorithm to compute subsets of Σ. As shown in [16], safe restriction (a) guarantees chase termination in polynomial time data complexity, (b) is strictly more general than stratification, and (c) it can be checked by a CO NP-algorithm if a set of constraints is safely restricted. Inductive Restriction. We now introduce the novel class of inductively restricted con- straints, which generalizes safe restriction but, like the latter, gives polynomial-time termination guarantees. We start with a motivating example. Example 3. We extend the constraints from Example 1 to Σ ′ := Σ∪{α3 }, where α3 := ∃x, yS(x), E(x, y). Then G’(Σ ′ ):=(Σ ′ ,{(α1 , α2 ),(α2 ,α1 ),(α3 ,α1 ),(α3 ,α2 )}) with f(α1 ) = f(α2 ) := {E1 ,E2 ,S1 } and f(α3 ) := ∅ is the minimal restriction system. It contains the strongly connected component {α1 ,α2 }, which is not safe. Consequently, Σ ′ is not safely restricted.  Intuitively, safe restriction does not apply in the example above because α3 “infects” position S1 in the restriction system. Though, null values cannot be repeatedly created in S1 : α3 fires at most once, so it does not affect chase termination. Our novel termination condition recognizes such situations by recursively computing the minimal restriction systems of the strongly connected components. We formalize this computation in Al- gorithm 1, called part(Σ). Based on this algorithm, we define an improved sufficient termination condition. Definition 8. Let Σ be a set of constraints. We call Σ inductively restricted iff for all Σ ′ ∈ part(Σ) it holds that Σ ′ is safe.  As stated in the following lemma, inductive restriction strictly generalizes safe restric- tion, but does not increase the complexity of the recognition problem. Lemma 1. Let Σ be a set of constraints. • If Σ is safely restricted, then it is inductively restricted. • There is some Σ that is inductively restricted, but not safely restricted. • The recognition problem for inductive restriction is in CO NP.  Example 4. Consider Σ ′ from Example 3. It is easy to verify that part(Σ ′ ) = ∅ and we conclude that Σ ′ is inductively restricted. As argued in Example 3, Σ ′ is not safely restricted, which proves the second claim in Lemma 1.  The next theorem gives the main result of this section, showing that inductive restriction guarantees chase termination in polynomial time data complexity. To the best of our knowledge inductive restriction is the most general sufficient termination condition for the chase that has been proposed so far. Theorem 1. Let Σ be a fixed set of inductively restricted constraints. Then, there exists a polynomial Q ∈ N[X] such that for any database instance I, the length of every chase sequence is bounded by Q(||I||), where ||I|| is the number of distinct values in I.  4 Conclusions We considered the termination of the chase algorithm. As our main contribution, we generalized all sufficient data-independent termination conditions that were known so far. Our results on chase termination directly carry over to applications that rely on the chase and also to the so-called core-chase presented in [4]. There are some interesting open questions left. First, it is unknown if the recognition problem for inductive re- striction, which was shown to be in CO NP, is also coNP-hard. Second, it is left open if the positive results on core computation in data exchange settings from [8] extend to inductive restriction. References 1. A. Deutsch et al. Query Reformulation with Constraints. SIGMOD Record, 35(1):65–73, 2006. 2. A. Fuxman et al. Peer data exchange. ACM Trans. Database Syst., 31(4):1454–1498, 2006. 3. A. V. Aho, Y. Sagiv, and J. D. Ullman. Efficient Optimization of a Class of Relational Expressions. ACM Trans. Database Syst., 4(4):435–454, 1979. 4. Alin Deutsch et al. The Chase Revisited. In PODS, pages 149–158, 2008. 5. C. Beeri and M. Y. Vardi. A Proof Procedure for Data Dependencies. J. ACM, 31(4):718– 741, 1984. 6. A. Calı̀, G. Gottlob, and M. Kifer. Taming the Infinite Chase: Query Answering under Ex- pressive Relational Constraints. In Descr. Logics, volume 353, 2008. 7. D. Maier et al. Testing Implications of Data Dependencies. In SIGMOD, pages 152–152, 1979. 8. G. Gottlob and A. Nash. Efficient Core Computation in Data Exchange. J. ACM, 55(2), 2008. 9. A. Y. Halevy. Answering Queries Using Views: A Survey. VLDB J., pages 270–294, 2001. 10. D. S. Johnson and A. Klug. Testing Containment of Conjunctive Queries under Functional and Inclusion Dependencies. In PODS, pages 164–169, 1982. 11. M. Lenzerini. Data Integration: A Theoretical Perspective. In PODS, pages 233–246, 2002. 12. M. Meier, M. Schmidt, and G. Lausen. Stop the Chase, Technical Report. CoRR, abs/0901.3984, 2009. 13. D. Olteanu, J. Huang, and C. Koch. SPROUT: Lazy vs. Eager Query Plans for Tuple- Independent Probabilistic Databases. In ICDE, 2009. To appear. 14. L. Popa and V. Tannen. An Equational Chase for Path-Conjunctive Queries, Constraints, and Views. In ICDT, pages 39–57, 1999. 15. R. Fagin et al. Data Exchange: Semantics and Query Answering. Theor. Comput. Sci., 336(1):89–124, 2005. 16. M. Schmidt, M. Meier, and G. Lausen. Foundations of SPARQL Query Optimization, Tech- nical Report. CoRR, abs/0812.3788, 2008.