Query Rewriting under Non-Guarded Rules Andrea Calı̀2,1,3 , Georg Gottlob1,2 and Andreas Pieris1 1 Computing Laboratory, University of Oxford, UK 2 Oxford-Man Institute of Quantitative Finance, University of Oxford, UK 3 Department of Information Systems and Computing, Brunel University, UK firstname.lastname @comlab.ox.ac.uk Abstract. We address the problem of answering conjunctive queries over knowledge bases, specified by sets of first-order sentences called tuple-generating dependencies (TGDs). This problem is highly relevant to query optimization, information integration, and ontological reason- ing. We present a rewriting algorithm, inspired by resolution in Logic Programming, which is capable of dealing with an expressive class of TGDs, called sticky TGDs. Given a set of sticky TGDs and a conjunc- tive query, the algorithm produces a first-order query that can be then evaluated over the data, providing the correct answers. In this way, we establish that conjunctive query answering under sticky TGDs is in the highly tractable class ac0 in the data complexity. 1 Introduction Answering queries over knowledge bases is a fundamental problem in knowledge representation. It has been employed in schema integration, information integra- tion, and service discovery (see, e.g., [11]). In the database field, the problem of answering queries on incomplete data under constraints (a.k.a. dependencies) is tightly related to the one of query containment [12]; in fact, the two problems are mutually reducible (see, e.g., [3]). A milestone paper in the field is [17]; it addresses conjunctive query containment under functional and inclusion depen- dencies, a class of constraints that is highly relevant in practice. The results of [17] were later extended in [7]. Other works consider different classes of constraints. For instance, [2, 9] con- sider constraints tailored to express Entity-Relationship schemas, and [23] deals with expressive constraints based on Answer Set Programming. [6] introduces and studies first-order constraints derived from a “light” version of F-logic [18], called F-logic Lite. Another relevant formalisms for knowledge bases, especially in the Semantic Web, is the DL-lite family; in [10, 22] tractable query answering techniques under DL-lite knowledge bases are presented. A more general approach is adopted in the recent papers [3, 4]; in such works, instead of focusing on a specific logic formalism, a general family of languages is considered, called Datalog± , that captures and extends a large class of knowledge representation formalisms in the literature, in particular the DL-lite family. The constraints of [3, 4] are inspired by the guarded frag- ment of first-order logic [16], and among them are the languages of linear Datalog± , guarded Datalog± , and weakly-guarded Datalog± . It is important to notice that Datalog± -rules are relevant tuple-generating dependencies (TGDs) and equality-generating dependencies (EGDs), which are generalizations of in- clusion dependencies and functional dependencies, respectively; we shall not deal with EGDs in this paper. TGDs are first-order constraints that express an implication from a conjunction of atoms to another. For instance, the TGD ∀E∀G director (E), leads(E, G) → ∃D manages(E, D) states that each director that leads some group necessarily manages at least one department. A large part of the aforementioned works on query answering and con- tainment are based on the well-known notion of chase, widely used for de- pendency implication and query containment under dependencies [19, 17]. The chase is a procedure that enforces the satisfaction of TGDs by the addition of suitable tuples. For example, consider an instance D consisting of the atoms {director (d), leads(d, g)}; the above TGD is not satisfied by D, therefore the chase adds a suitable atom manages(d, z), where z is a so-called labeled null, i.e., a placeholder for an unknown value. Several works have recently studied query evaluation problems for settings where the chase terminates and thus generates a finite solution. This is the typical setting, for example, of data exchange [15], where the materialization (and therefore the finiteness) of the chase is a require- ment. To this aim, syntactic restrictions on TGDs, e.g., weak acyclicity [13], have been proposed. Little was known about the cases where the chase is (in general) infinite, with the notable exception of the classical work of Johnson and Klug [17]. In this work, along the lines of [3, 4], we also adopt a general approach. We consider an expressive class of TGDs, called sticky TGDs (STGDs), first intro- duced in [5] as an addition to the Datalog± family. The chase under STGDs is not guaranteed to terminate. Stickiness is formally defined in Section 3 by an ef- ficiently testable condition involving variable-marking. An equivalent definition, which gives a better understanding of such class of constraints, is as follows. For every instance D, assume that in the construction of the chase of D under a set Σ of TGDs, we apply a TGD σ ∈ Σ that has a variable X appearing more than once in its body; assume also that X maps (via homomorphism) on the symbol z, and that by virtue of this application the atom a is introduced. Then, z appears in a and in all atoms resulting from some chase derivation involving a, “sticking” to them (hence the name “sticky TGDs”) [5]. Our main contribution in this paper is an algorithm, similar to resolution and partial evaluation in Logic Programming, for query answering under STGDs. Our approach is based on query rewriting, that is: given a query, such query is rewritten into another query that encodes the information about the constraints and that, evaluated on the data, returns the correct answers. We consider the class of conjunctive queries (CQs), also called select-project-join queries in the database literature. The algorithm, inspired by the ones in [2, 8], takes as input a set of STGDs and a conjunctive query, and it computes a (finite) rewriting expressed in union of conjunctive queries (UCQs). Since our rewriting algorithm is sound and complete (established by making use of the notion of chase), terminates under STGDs, and the rewritten query is a UCQs, answering conjunctive queries in this case is in the highly tractable class ac0 w.r.t. the size of the data1 , as already stated in [5], where no details about a rewriting technique are given. We recall that ac0 is the complexity class of recognizing words in languages defined by constant-depth Boolean circuits with an (unlimited fan-in) and and or gates (see, e.g., [20]). Notice that a UCQs can be immediately translated into an SQL query, therefore the rewriting algorithm is especially suitable for real-world applications based on relational DBMSs. Notice that the size of the rewritten query is worst-case exponential w.r.t. the size of the given query and the size of the given set of constraints. While query answering remains highly tractable in data complexity, this leaves space for further optimizations. Generating “smaller” rewritings is one of the directions currently pursued to improve efficiency of query processing in this context [21]. In [5] it is shown, similarly to what is done in [4], that STGDs enriched with negative constraints of the form ∀X ϕ(X) → ⊥, where ϕ(X) is a conjunction of atoms and ⊥ is the constant false, and with EGDs (key dependencies, in this case) that do not interact with the STGDs, are capable of expressing languages in the DL-lite family. Therefore, all tractability results on conjunctive query answering in DL-lite are special cases of our results. 2 Preliminaries In this section we recall some basics on databases, queries, TGDs and the TGD chase procedure. General. We define the following pairwise disjoint (infinite) sets of symbols: (i) a set Γ of constants (constitute the “normal” domain of a database), (ii) a set Γf of labeled nulls (used as placeholders for unknown values, and thus can be also seen as variables), and (iii) a set Γv of variables (used in queries and dependencies). Different constants represent different values (unique name assumption), while different nulls may represent the same value. A lexicographic order is defined on Γ ∪ Γf , such that every value in Γf follows all those in Γ . A relational schema R (or simply schema) is a set of relational symbols (or predicates), each with its associated arity. We write r/n to denote that the pred- icate r has arity n. A position r[i] (in a schema R) is identified by a predicate r ∈ R and its i-th argument (or attribute). A term t is a constant, null, or variable. An atomic formula (or simply atom) has the form r(t1 , . . . , tn ), where r/n is a relation, and t1 , . . . , tn are terms. For an atom a, we denote as dom(a), var (a) and pred (a) the set of its terms, the set of its variables and its predi- cate, respectively. These notations naturally extends to sets and conjunctions of 1 The complexity w.r.t. the data size is known as data complexity, and it is relevant since in practice the size of the data is much larger than the size of the schema. atoms. An atom is called ground if all of its terms are constants of Γ . Conjunc- tions of atoms are often identified with the sets of their atoms. A substitution from one set of symbols S1 to another set of symbols S2 is a function h : S1 → S2 defined as follows: (i) ∅ is a substitution (empty substitution), (ii) if h is a substitution, then h ∪ {X → Y } is a substitution, where X ∈ S1 and Y ∈ S2 , and h does not already contain some X → Z with Y 6= Z. If X → Y ∈ h we write h(X) = Y . A homomorphism from a set of atoms A1 to a set of atoms A2 , both over the same schema R, is a substitution h : dom(A1 ) → dom(A2 ) such that: (i) if t ∈ Γ , then h(t) = t, and (ii) if r(t1 , . . . , tn ) is in A1 , then h(r(t1 , . . . , tn )) = r(h(t1 ), . . . , h(tn )) is in A2 . The notion of homomorphism naturally extends to conjunctions of atoms. Databases and Queries. A database (instance) D for a schema R is a (possibly infinite) set of atoms of the form r(t) (a.k.a. facts), where r/n ∈ R and t ∈ (Γ ∪ Γf )n . We denote as r(D) the set {t | r(t) ∈ D}. A conjunctive query (CQ) q of arity n over a schema R, written as q/n, has the form q(X) = ∃Yϕ(X, Y), where ϕ(X, Y) is a conjunction of atoms over R, X and Y are sequences of variables or constants in Γ , and the length of X is n. ϕ(X, Y) is called the body of q, denoted as body(q). A Boolean CQ (BCQ) is a CQ of zero arity. The answer to a CQ q/n of the form q(X) = ∃Yϕ(X, Y) over a database D, denoted as q(D), is the set of all n-tuples t ∈ Γ n for which there exists a homomorphism h : X ∪ Y → Γ ∪ Γf such that h(ϕ(X, Y)) ⊆ D and h(X) = t. A BCQ has only the empty tuple hi as possible answer, in which case it is said that has positive answer. Formally, a BCQ has positive answer over D, denoted as D |= q, iff hi ∈ q(D), or, equivalently, q(D) 6= ∅. Dependencies. Given a schema R, a tuple-generating dependency (TGD) σ over R is a first-order formula ∀X∀Y ϕ(X, Y) → ∃Z ψ(X, Z), where ϕ(X, Y) and ψ(X, Z) are conjunctions of atoms over R, called the body and the head of σ, denoted as body(σ) and head (σ), respectively. Henceforth, to avoid notational clutter, we will omit the universal quantifiers in TGDs. Such σ is satisfied by a database D for R iff, whenever there exists a homomorphism h such that h(ϕ(X, Y)) ⊆ D, there exists an extension h′ of h (i.e., h′ ⊇ h) such that h′ (ψ(X, Z)) ⊆ D. We now define the notion of query answering under TGDs. Given a database D for R, and a set Σ of TGDs over R, the models of D w.r.t. Σ, denoted as mods(D, Σ), is the set of all databases B such that B |= D ∪ Σ. The an- swer to a CQ q w.r.t. D and Σ, denoted as ans(q, D, Σ), is the set {t | t ∈ q(B) for each B ∈ mods(D, Σ)}. The answer to a BCQ q w.r.t. D and Σ is positive, denoted as D ∪ Σ |= q, iff ans(q, D, Σ) 6= ∅. Note that query answering under general TGDs is undecidable [1], even when the schema and the set of TGDs are fixed [3]. We recall that the two problems of CQ and BCQ evaluation under TGDs are logspace-equivalent [3]. Moreover, it is easy to see that the query output tuple problem (as a decision version of CQ evaluation) and BCQ evaluation are ac0 - reducible to each other. Henceforth, we thus focus only on the BCQ evaluation problem. All complexity results carry over to the other problems. The TGD Chase. The chase procedure (or simply chase) is a fundamental algorithmic tool introduced for checking implication of dependencies [19], and later for checking query containment [17]. Informally, the chase is a process of repairing a database w.r.t. a set of dependencies so that the resulted database satisfies the dependencies. We shall use the term chase interchangeably for both the procedure and its result. The chase works on an instance through the so- called TGD chase rule. The TGD chase rule comes in two different equivalent fashions: oblivious and restricted [3], where the restricted one repairs TGDs only when they are not satisfied. In the sequel, we focus on the oblivious one for better technical clarity. The TGD chase rule is the building block of the chase. TGD Chase Rule. Consider a database D for a schema R, and a TGD σ = ϕ(X, Y) → ∃Z ψ(X, Z) over R. If σ is applicable to D, i.e., there exists a homomorphism h such that h(ϕ(X, Y)) ⊆ D, then: (i) define h′ ⊇ h such that h′ (Zi ) = zi for each Zi ∈ Z, where zi ∈ Γf is a “fresh” labeled null not introduced before, and following lexicographically all those introduced so far, and (ii) add to D the set of atoms in h′ (ψ(X, Z)) if not already in D. Given a database D and set of TGDs Σ, the chase algorithm for D and Σ consists of an exhaustive application of the TGD chase rule in a breadth-first fashion, which leads as result to a (possibly infinite) chase for D and Σ, denoted as chase(D, Σ). The formal definition of the chase algorithm is omitted due to space reasons and can be found, for instance, in [3]. We denote as chase [k] (D, Σ) the initial segment of the chase for D and Σ obtained by applying k > 0 times the TGD chase rule. Example 1 ([5]). Let R = {r, s}. Consider the set Σ of TGDs over R constituted by the TGDs σ1 = r(X, Y ), s(Y ) → ∃Z r(Z, X) and σ2 = r(X, Y ) → s(X). Let D be the database for R consisting of the two atoms r(a, b) and s(b). During the construction of chase(D, Σ) we first apply σ1 , and we add the atom r(z1 , a), where z1 is a “fresh” null. Moreover, σ2 is applicable and we add the atom s(a). Now, σ1 is applicable and the atom r(z2 , z1 ) is obtained, where z2 is a “fresh” null. Also, σ2 is applicable and the atom s(z1 ) is generated. It is clear that there is no finite chase. Satisfying both σ1 , σ2 would require to construct the infinite instance D ∪ {r(z1 , a), s(a), r(z2 , z1 ), s(z1 ), r(z3 , z2 ), s(z2 ), . . .}. It is well-known that the (possibly infinite) chase for D and Σ is a universal model of D w.r.t. Σ, i.e., for each database B ∈ mods(D, Σ), there exists a homomorphism from chase(D, Σ) to B [15, 14]. Using this fact it can be shown that for a BCQ q, D ∪ Σ |= q iff chase(D, Σ) |= q. 3 Sticky TGDs In this section we recall the class of sticky TGDs introduced in [5]. As we shall see, query answering under sticky TGDs is highly tractable in data complexity. Definition 1 ([5]). Consider a set Σ of TGDs over a schema R. We mark the variables that occur in the body of the TGDs of Σ according to the following marking procedure. First, for each TGD σ ∈ Σ and for each variable V in body(σ), if there exists an atom a in head (σ) such that V does not appear in a, then we mark each occurrence of V in body(σ). Now, we apply exhaustively (i.e., until a fixpoint is reached) the following step: for each TGD σ ∈ Σ, if a marked variable in body(σ) appears at position π, then for every TGD σ ′ ∈ Σ (including the case σ ′ = σ), we mark each occurrence of the variables in body(σ ′ ) that appear in head (σ ′ ) at the same position π. We say that Σ is a set of sticky TGDs (STGDs) if there is no TGD σ ∈ Σ such that a marked variable occurs in body(σ) more than once. Example 2 ([5]). Consider the following relational schema R: dept(Dept Id, Mgr Id), emp(Emp Id, Dept Id, Area, Project Id), runs(Dept Id, Project Id), in area(Project Id, Area), external (Ext Id, Area, Project Id). The fact that each department has as a manager an employee can be represented by the TGD dept(V, W ) → ∃X∃Y ∃Z emp(W, X, Y, Z). The fact that each employee works on some project that falls into his/her area of specialization ran by his/her department can be represented by the TGD emp(V, W, X, Y ) → ∃Z dept(W, Z), runs(W, Y ), in area(Y, X). Finally, the fact that for each project ran by some department there exists an external cooperator, specialized on the area of the project, that works on it can be represented by the TGD runs(W, X), in area(X, Y ) → ∃Z external (Z, Y, X). Let Σ be the set constituted by the above three TGDs. According to the variable- marking procedure in Definition 1, we mark the variables as follows (we mark variables with a cap, e.g., X̂): dept(V̂ , Ŵ ) → ∃X∃Y ∃Zemp(Ŵ , X̂, Ŷ , Ẑ) emp(V̂ , Ŵ , X̂, Ŷ ) → ∃Z dept(Ŵ , Ẑ), runs(Ŵ , Y ), in area(Y, X) runs(Ŵ , X), in area(X, Y ) → ∃Z external (Z, Y, X) Clearly, for each TGD σ ∈ Σ, there is no marked variable that occurs in body(σ) more than once. Therefore, Σ is a set of STGDs. It is important to observe that the set Σ is neither weakly-acyclic [15], nor guarded [3], nor weakly-guarded [3]. In fact, the class of STGDs is incomparable to the above three known classes of TGDs. We recall that query answering under (general) TGDs is equivalent to query answering under TGDs with single-atom heads [3]. This is established by provid- ing a logspace transformation from (general) TGDs to TGDs with single-atom heads. Since this transformation preserves the syntactic condition of STGDs, henceforth we assume w.l.o.g. that every TGD has just one atom in its head. 4 Query Answering by Rewriting We say that a class C of TGDs is first-order rewritable, henceforth abbreviated as FO-rewritable, iff for every set Σ of TGDs in C, and for every BCQ q, there exists a first-order query qΣ such that, for every database D, D ∪ Σ |= q iff D |= qΣ [4]. Since answering first-order queries is in the class ac0 in data complexity [24], it immediately follows that for FO-rewritable TGDs, query answering is in ac0 in data complexity. In this section we establish that STGDs are FO-rewritable by providing a query rewriting algorithm, inspired by resolution in Logic Programming, that allow us to reformulate the given BCQ q into a union of BCQs Qr , that encodes the information about the given STGDs, and then evaluate Qr over the given database to obtain the correct answer. Formally, a union of BCQs over a schema R is a set Q of BCQs over R, where each q ∈ Q uses the same predicate symbol in the head. Q is said to have positive answer over a database D, denoted as D |= Q, iff there exists q ∈ Q such that D |= q. Note that a union of BCQs is trivially a first-order query. Before we proceed further we give some preliminary definitions. Given a BCQ q we say that a certain variable is bound in q if it occurs more than once in body(q), otherwise is called unbound. A bound term in q is either a bound variable or a constant of Γ . Given two atoms a and b we say that unify if there exists a substitution γ, called unifier for a and b, such that γ(a) = γ(b). A most general unifier (MGU) is a unifier for a and b, denoted as γa,b , such that for each other unifier γ for a and b there exists a substitution γ ′ such that γ = γ ′ ◦ γa,b . Note that if two atoms unify, then there exists a MGU. Furthermore, the MGU for two atoms is unique (modulo variable renaming). We now define the important notion of applicability of a TGD to an atom that occurs in the body of a BCQ. Definition 2. Let R be a relational schema. Consider a TGD σ over R, a BCQ q over R, and an atom a ∈ body(q). We say that σ is applicable to a if the following conditions are satisfied: (i) a and head (σ) unify (recall that we assume single-atom head TGDs); we denote as γa,σ the MGU for a and head (σ), (ii) if the term at position π in a is either a constant of Γ , or a bound variable in q that occurs in some atom of body(q) other than a, then the variable at position π in head (σ) occurs also in body(σ), and (iii) if a bound variable in q occurs only in a at positions π1 , . . . , πm , for m > 2, then either the variable at position πi in head (σ), for each i ∈ {1, . . . , m}, occurs also in body(σ), or at positions π1 , . . . , πm in head (σ) we have the same existentially quantified variable. The Algorithm rewrite. We are now ready to present the algorithm rewrite, shown in Figure 1. The rewriting of a BCQ is computed by exhaustively applying two steps: minimization and rewriting, corresponding to steps (a) and (b) of the algorithm. In what follows we give an informal description of these two steps. Minimization Step. If there exists a BCQ q ∈ Qr such that body(q) contains two atoms a and b that unify, then the algorithm computes the BCQ q ′ by Algorithm rewrite Input: Schema R, set Σ of STGDs over R, BCQ q over R Output: Rewritten query Qr over R Qr := {q}; Qcan r := ∅; i := 0; Vars := var (q); repeat Q′ := Qr ; Q′′ := Qcan r ; for each q ∈ Q′ do (a) for each a, b ∈ body(q) do if a and b unify then q ′ := γa,b (q) Qr := Qr ∪ {q ′ } ′ Qcan r := Qr ∪ τ (q ); can (b) for each a ∈ body(q) do for each σ ∈ Σ do if σ is applicable to a then i := i + 1 ` ´ ρ := rename ` a,σi , Vars γ q ′ := ρ γa,σi q[a/body(σ i )] ` ´´ Qr := Qr ∪ {q ′ } ′ Qcan r := Qr ∪ τ (q ); can until Q′′ = Qcanr ; return Qr ; Fig. 1. The Algorithm rewrite. applying the particular MGU γa,b for a and b on q, where γa,b is obtained as follows. Let a = r(V1 , . . . , Vn ) and b = r(W1 , . . . , Wn ). Construct the atoms a′ and b′ from a and b as follows: for an arbitrary k ∈ {1, . . . , n}, (i) if both Vk , Wk ∈ / Vars, then replace Wk with Vk , and (ii) if Vk ∈ Vars and Wk ∈ / Vars (resp., Vk ∈ / Vars and Wk ∈ Vars), then replace Wk with Vk (resp., Vk with Wk ). Recall that Vars is the set of variables of the given query. We define γa,b = γa′ ,b′ . γa,b guarantees that any new variables, introduced during the rewriting process, that appear in q ′ are unbound; its existence follows from the fact that the TGDs are sticky. The canonical form of q ′ is computed by applying the transformation τ , and then added to Qcan r which represents the canonical form of Qr . In particular, τ replaces the unbound variables with the special term “⋆”. Rewriting Step. During the i-th application of the rewriting step, if there exists a TGD σ and a BCQ q ∈ Qr containing an atom a such that σ is applicable to a, then the algorithm computes the BCQ q ′ = ρ γa,σi q[a/body(σ i )] , that  is, the BCQ obtained from q by replacing a with body(σ i ), where σ i is obtained by replacing each variable V that occurs in σ with V i , then applying the MGU γa,σi for a and head (σ i ), and finally applying the renaming substitution ρ, where ρ is constructed from γa,σi according to the procedure rename (see Figure 2). By considering σ i (instead of σ) actually we rename, using the integer i, the variables of the TGD σ which is applicable to a. This is needed to ensure that the set of variables that appear in the rewritten query, and the set of variables that occur in TGDs are disjoint, and also to avoid undesirable clutters between new variables introduced during different applications of the rewriting step. The application of the renaming substitution ρ is needed to guarantee that any new variables in q ′ are unbound. Finally, the canonical form of q ′ is added to Qcan r . Procedure rename Input: Substitution γ, set Vars of variables Output: Renaming substitution ρ µ := ∅; for each assertion V → U ∈ γ do if V ∈ Vars and U ∈ / Vars then if there is no assertion V ′ → U in µ then µ := µ ∪ {V → U }; ρ := µ−1 ; return ρ; Fig. 2. The Procedure rename. Example 3. Let R = {p, r, s}. Consider the set Σ of STGDs over R consti- tuted by the TGDs σ1 = r(X, Y ) → ∃Z s(X, Z, Z) and σ2 = s(X, Y, Z) → p(X, Z, Z). Consider also the BCQ q0 = ∃A∃B∃C p(A, B, C), s(A, B, B) over R. Clearly, σ2 is applicable to a = p(A, B, C) ∈ body(q0 ). During the first ap- plication of the rewriting step we get the BCQ q1 = ρ(γa,σ21 (q0 [a/body(σ21 )])), where γa,σ21 = {X 1 → A, B → Z 1 , C → Z 1 }, and ρ = {Z 1 → B}. Hence, q1 = ∃A∃B∃Y 1 s(A, Y 1 , B), s(A, B, B); note that Y 1 is a new variable. The canonical form of q1 is obtained by replacing the unbound variable Y 1 with the special term “⋆”. Observe now that the atoms s(A, Y 1 , B) and s(A, B, B) in the body of q1 unify. Hence, the minimization step is applied and we get the BCQ q2 = ∃A∃B s(A, B, B)}; note that τ (q2 ) = ∃B s(⋆, B, B). Finally, observe that σ1 is applicable to the atom a = s(A, B, B) ∈ body(q2 ). During the second application of the rewriting step we get the BCQ q3 = ρ′ (γa,σ12 (q2 [a/body(σ12 )])), where γa,σ12 = {X 2 → A, Z 2 → B}, and ρ′ = ∅. Hence, q3 = ∃A∃Y 2 r(A, Y 2 )}; clearly, τ (q3 ) = r(⋆, ⋆). Soundness and Completeness. In what follows we show that the al- gorithm rewrite produces a so-called perfect rewriting, i.e., a rewritten query that produces the correct answers under STGDs when evaluated over the given database. To this aim we need the notion of rewrite graph. Definition 3. Consider a set Σ of STGDs expressed over a schema R, and a BCQ q over R. The rewrite graph of rewrite(R, Σ, q), denoted as RG(R, Σ, q), is a triple hV, E, λi, where V is the node set, E is the edge set, and λ is a labeling function V → rewrite(R, Σ, q). For the BCQ q add a node v in V , and let λ(v) = q. The rewrite graph is constructed inductively as follows. Suppose that the minimization step is applied because the atoms a, b ∈ body(λ(v)) unify, where v ∈ V . Add in V a node u and let λ(u) = q ′ , where q ′ is the constructed BCQ, and if there exists a parent node of v, denoted as par (v), then add in E an arc par (v)y u (if u is not already in the graph). Now, suppose that the rewriting step is applied because the STGD σ is applicable to a, where σ ∈ Σ and a ∈ body(λ(v)), for some node v ∈ V . Add in V a node u and let λ(u) = q ′ , where q ′ is the constructed BCQ, and also add in E an arc v y u (if u is not already in the graph). The depth of a node v of RG(R, Σ, q), written as depth(v), is the length of the (unique) path from a node with zero in-degree, i.e., a node with no incoming edges, to v; a node with zero in-degree has depth zero. Observe that the rewrite graph RG(R, Σ, q), as defined above, is a forest comprised by at most |body(q)| disjoint rooted trees, i.e., trees in which one node has been designated as the root node. Each rooted tree has a root node v (which may be the only node in the tree) such that λ(v) is a BCQ obtained by applying the minimization step on some query λ(u), where u is a node with zero depth. Intuitively, the existence of a vertex v with depth d > 0 implies that the BCQ λ(v) was obtained during the rewriting process starting from q and applying either the rewriting step or the minimization step at least d times. We continue to establish two useful technical results. Lemma 1. Let R be a relational schema. Consider a database D for R, a set Σ of STGDs over R, and a BCQ q over R. Let RG(R, Σ, q) = hV, E, λi. If chase [i] (D, Σ) |= λ(v), for i > 0 and v ∈ V , then there exists j > i such that chase [j] (D, Σ) |= q. Proof (sketch). By induction on the depth d > 0 of node v. Lemma 2. Let R be a relational schema. Consider a database D for R, a set Σ of STGDs over R, and a BCQ q over R. Let RG(R, Σ, q) = hV, E, λi. If chase [i] (D, Σ) |= λ(v), for i > 0 and v ∈ V , then there exists v ′ ∈ V with depth(v ′ ) > depth(v) such that D |= λ(v ′ ). Proof (sketch). By induction on the number of applications of the chase rule. We are now ready, by using the above two technical lemmas, to establish that the algorithm rewrite produces a perfect rewriting, i.e., is sound and complete. Theorem 1. Let R be a relational schema. Consider a database D for R, a set Σ of STGDs over R, and a BCQ q over R. Then, D ∪ Σ |= q iff D |= rewrite(R, Σ, q). Proof. For notational convenience, let Qr = rewrite(R, Σ, q). The claim is equiv- alent to D |= Qr iff chase(D, Σ) |= q. By hypothesis, there exists a BCQ p ∈ Qr such that D |= p, or, equivalently, chase [0] (D, Σ) |= p. Let RG(R, Σ, q) = hV, E, λi be the rewrite graph of Qr , and suppose that p = λ(v), for some v ∈ V . By Lemma 1 we get that there exists i > 0 such that chase [i] (D, Σ) |= q, which implies that chase(D, Σ) |= q, as needed. Conversely, there exists a homomor- phism h such that h(body(q)) ⊆ chase(D, Σ). Since h(body(q)) is finite it follows that there exists i > 0 such that h(body(q)) ⊆ chase [i] (D, Σ), or, equivalently, chase [i] (D, Σ) |= q. Suppose that q = λ(v), where v ∈ V . By definition of the rewrite graph, depth(v) = 0. By Lemma 2 we get that there exists u ∈ V with depth(u) > 0 such that D |= λ(u). Since λ(u) ∈ Qr we conclude that D |= Qr . Termination. We finally establish termination of the algorithm rewrite un- der STGDs. The following lemma shows that new variables, introduced during the rewriting process, occur at most once in the BCQs constructed during the rewriting process. Lemma 3. Let R be a relational schema. Consider a set Σ of STGDs over R, a BCQ q over R, and a BCQ q ′ ∈ rewrite(R, Σ, q). Each variable in var (q ′ )\var (q) is unbound in q ′ . Proof (sketch). For notational convenience, let Qr = rewrite(R, Σ, q). We denote by Qir the initial segment of Qr obtained starting from q and applying i > 0 times either the minimization step or the rewriting step of the algorithm rewrite. The proof is by induction on i. Note that the use of the particular MGU (instead of applying an arbitrary one) during the minimization step of the algorithm rewrite, and also the application of the renaming substitution (see Figure 2) during the rewriting step are crucial. We are now ready, by exploiting Lemma 3, to establish termination of the algorithm rewrite under STGDs. Theorem 2. Let R be a relational schema. Consider a set Σ of STGDs over R, and a BCQ q over R. The algorithm rewrite with input R, Σ and q terminates. Proof. It suffices to show that the maximum number of BCQs that can appear in the canonical form of the rewritten query rewrite(R, Σ, q), denoted as Qc , is finite. By Lemma 3, if a new variable, introduced during the rewriting process, occurs in a BCQ p ∈ rewrite(R, Σ, q), then is unbound in p. Thus, by definition of τ , none of these new variables can appear in Qc since they are replaced with the special term “⋆”. Hence, the set of terms used to construct Qc corresponds to the set of variables and constants occurring in the BCQ q plus the special term “⋆”; thus, is finite. Moreover, only predicates of R can appear in Qc which is also finite. The claim follows since the number of atoms that can be constructed using a finite set of terms and a finite set of predicates is finite. By combining Theorems 1 and 2 we immediately get the main complexity results on STGDs (first stated in [5], where many more results are shown). Corollary 1 ([5]). The class of STGDs is FO-rewritable. Since answering first-order queries is in the class ac0 in data complexity [24], we immediately get the following result. Corollary 2 ([5]). BCQ answering under STGDs is in ac0 in data complexity. Acknowledgments. The research leading to these results has received funding from the European Research Council under the European Community’s Sev- enth Framework Programme (FP7/2007-2013)/ERC grant agreement no. 246858 – DIADEM. The authors also acknowledge support by the EPSRC project “Schema Mappings and Automated Services for Data Integration and Exchange” (EP/E010865/1). Georg Gottlob’s work was also supported by a Royal Society Wolfson Research Merit Award. References 1. C. Beeri and M. Y. Vardi. The implication problem for data dependencies. In Proc. of ICALP, pages 73–85, 1981. 2. A. Calı̀, D. Calvanese, G. D. Giacomo, and M. Lenzerini. Accessing data integration systems through conceptual schemas. In Proc. of ER, pages 270–284, 2001. 3. A. Calı̀, G. Gottlob, and M. Kifer. Taming the infinite chase: Query answering under expressive relational constraints. In Proc. of KR, pages 70–80, 2008. 4. A. Calı̀, G. Gottlob, and T. Lukasiewicz. A general datalog-based framework for tractable query answering over ontologies. In Proc. of PODS, pages 77–86, 2009. 5. A. Calı̀, G. Gottlob, and A. Pieris. Advanced processing for ontological query answering. Submitted for publication, 2010. 6. A. Calı̀ and M. Kifer. Containment of conjunctive object meta-queries. In Proc. of VLDB, pages 942–952, 2006. 7. A. Calı̀, D. Lembo, and R. Rosati. Decidability and complexity of query answering over incosistent and incomplete databases. In Proc. of PODS, pages 260–271, 2003. 8. A. Calı̀, D. Lembo, and R. Rosati. Query rewriting and answering under constraints in data integration systems. In Proc. of IJCAI, pages 16–21, 2003. 9. A. Calı̀ and D. Martinenghi. Querying incomplete data over extended er schemata. TPLP, 2010. to appear. 10. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning and efficient query answering in description logics: The DL-lite family. J. Autom. Reasoning, 39(3):385–429, 2007. 11. D. Calvanese, G. D. Giacomo, and M. Lenzerini. Conjunctive query containment and answering under description logic constraints. ACM Trans. Comput. Log., 9(3), 2008. 12. A. K. Chandra and P. M. Merlin. Optimal implementation of conjunctive queries in relational data bases. In Proc. of STOCS, pages 77–90, 1977. 13. A. Deutsch. XML query reformulation over mixed and redundant storage. PhD thesis, Dept. of Computer and Information Sciences, Univ. of Pennsylvania, 2002. 14. A. Deutsch, A. Nash, and J. B. Remmel. The chase revisisted. In Proc. of PODS, pages 149–158, 2008. 15. R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: Semantics and query answering. Theor. Comput. Sci., 336(1):89–124, 2005. 16. M. E. Goncalves and E. Grädel. Decidability issues for action guarded logics. In Description Logics, pages 123–132, 2000. 17. D. S. Johnson and A. C. Klug. Testing containment of conjunctive queries under functional and inclusion dependencies. J. Comput. Syst. Sci., 28(1):167–189, 1984. 18. M. Kifer, G. Lausen, and J. Wu. Logical foundations of object-oriented and frame- based languages. J. ACM, 42(4):741–843, 1995. 19. D. Maier, A. O. Mendelzon, and Y. Sagiv. Testing implications of data dependen- cies. ACM Trans. Database Syst., 4(4):455–469, 1979. 20. C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994. 21. H. Pérez-Urbina, I. Horrocks, and B. Motik. Practical aspects of query rewriting for owl 2. In Proc. of OWLED, 2009. 22. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking data to ontologies. J. Data Semantics, 10:133–173, 2008. 23. M. Simkus and T. Eiter. DNC: Decidable non-monotonic disjunctive logic pro- grams with function symbols. In Proc. of LPAR, pages 514–530, 2007. 24. M. Y. Vardi. On the complexity of bounded-variable queries. In Proc. of PODS, pages 266–276, 1995.