Tractable Query Answering over Ontologies with Datalog± Andrea Calı̀2,1 , Georg Gottlob1,2 , and Thomas Lukasiewicz1,‡ 1 Computing Laboratory, University of Oxford, UK firstname.lastname@comlab.ox.ac.uk 2 Oxford-Man Institute of Quantitative Finance, University of Oxford, UK Abstract. We present a family of expressive extensions of Datalog, called Data- log± , as a new paradigm for query answering over ontologies. The Datalog± family admits existentially quantified variables in rule heads, and has suitable restrictions to ensure highly efficient ontology querying. In particular, we show that query answering under so-called guarded Datalog± is PTIME-complete in data complexity, and that query answering under so-called linear Datalog± is in AC 0 in data complexity. We also show how negative constraints and a general class of key constraints can be added to Datalog± while keeping ontology query- ing tractable. We then show that linear Datalog± , enriched with a special class of key constraints, generalizes the well-known DL-Lite family of tractable de- scription logics. Furthermore, the Datalog± family is of interest in its own right and can, moreover, be used in various contexts such as data integration and data exchange. This work is a short version of [8]. 1 Introduction Ontologies play a key role in the Semantic Web [4], data modeling, and information integration [19]. Recent trends in ontological reasoning have shifted from decidability issues to tractability ones, as e.g. reflected by the work on the DL-Lite family of tractable description logics (DLs) [12, 22]. An important result of these works is that the main variants of DL-Lite are not only decidable, but that answering (unions of) conjunctive queries for these logics is in LOGSPACE, and actually in AC0 , in data complexity (i.e., the complexity where both the query and the constraints are fixed), and query answering in DL-Lite is FO-rewritable [12]. As observed in [21], the lack of value creation makes plain Datalog not very well suited for ontological reasoning with inclusion axioms ei- ther. It is thus natural to extend Datalog in order to nicely accommodate ontological axioms and constraints such as those expressible in the DL-Lite family. This paper proposes and studies variants of Datalog that are suited for efficient ontological reasoning, and, in particular, for tractable ontology-based query answering. We introduce the Datalog± family of Datalog variants, which extend plain Datalog by the possibility of existential quantification in rule heads, and by a number of other features, and, at the same time, restrict the rule syntax in order to achieve tractability. In the Datalog± family, rules are a restricted form of tuple-generating dependencies (TGDs) and equality-generating dependencies (EGDs). More specifically, in guarded ‡ Alternative address: Institut für Informationssysteme, Technische Universität Wien, Austria. 2 A. Calı̀, G. Gottlob, and T. Lukasiewicz Datalog± , rules are guarded TGDs (GTGDs), i.e., TGDs with a single atom in the body that contains all universally quantified variables; in linear Datalog± , rules are linear TGDs (LTGDs), i.e., TGDs that have a single atom in the body. We characterize the data complexity of query answering for both the above sublanguages of Datalog± . We show that query answering is PTIME-complete and in AC0 (which is the complexity of evaluating fixed first-order formulas over a database or finite structure), when the query and the set of GTGDs and LTGDs are fixed, respectively. We then enrich Datalog± by adding negative constraints, which are Horn clauses with a (not necessarily guarded) conjunction of atoms in their body and the truth constant false, denoted ⊥, in the head. We show that the introduction of such constraints does not increase the complexity of query answering in Datalog± . As a further extension, we add non-conflicting keys, which are special EGDs that do not interact with TGDs, and thus also do not increase the complexity of query answering in Datalog± . We deal only with keys, since this suffices to capture the most common tractable ontology languages in the literature. The class of non-conflicting keys is a generalization of the one in [10]. Further results. We have several other results that, for space reasons, we do not include here. We refer the reader to [8] for more details. A central result is that the Datalog± family is able to express the most common tractable ontology languages, in particular, the DL-Lite family [12] and F-Logic Lite [9]. In particular, it is interesting to see that linear Datalog± enriched with non-conflicting keys is expressive enough to capture the expressive power of the description logics DL-LiteF , DL-LiteR , and DL-LiteA , which are highly suitable for ontological modeling and reasoning. Finally, we are able to deal with an extension of Datalog± with stratified negation. We provide a canonical model and a perfect model semantics, and we show that they coincide. We thus provide a natural stratified negation for query answering over ontologies, which has been an open problem to date, since it is in general based on several strata of infinite models. Related work. Note that the results of the present paper are related to but very dif- ferent from the results in [7], where complexity issues of guarded TGDs as well as of another class, called weakly guarded TGDs, were first studied. There, it was shown that the combined complexity of query answering and fact inference with guarded TGDs is 2-EXPTIME complete, and it was noted that the data complexity is polynomial, which is now much refined by the present paper. Extensions of Datalog that allow the in- troduction of values not appearing in the active domain of the input database have been proposed in the literature [1, 5, 11, 10]; the introduction of such values is usually called value invention. Other works integrate Datalog with description logic knowledge bases [16, 6, 23], which is a different perspective. 2 Preliminaries We briefly recall some basics on databases, queries, TGDs, and the chase. Databases and Queries. We assume (i) an infinite universe of data constants ∆ (which constitute the “normal” domain of a database), (ii) an infinite set of (labeled) nulls ∆N (used as “fresh” Skolem terms, which are placeholders for unknown values, and can thus be seen as variables), and (iii) an infinite set of variables X (used in dependencies and queries). Different constants represent different values (unique name assumption), while different nulls may represent the same value. We assume a lexicographic order Tractable Query Answering over Ontologies with Datalog± 3 on ∆ ∪ ∆N , with every symbol in ∆N following all symbols in ∆. We denote by X sequences of variables X1 , . . . , Xk with k > 0. We assume a relational schema R, which is a finite set of relation names (or predicates). A term t is a constant, null, or variable. An atomic formula (or atom) a has the form P (t1 , ..., tn ), where P is n-ary predicate, and t1 , ..., tn are terms. We denote by dom(a) the set of all its arguments. These notations naturally extend to sets and conjunctions of atoms. Conjunctions of atoms are often identified with the sets of their atoms. A database (instance) D for R is a (possibly infinite) set of atoms with predicates from R and arguments from ∆ ∪ ∆N . Such D is ground iff it contains only atoms with arguments from ∆. A conjunctive query (CQ) over R has the form Q(X) = ∃YΦ(X, Y), where Φ(X, Y) is a conjunc- tion of atoms with the variables X and Y. A Boolean CQ (BCQ) over R is a CQ of the form Q(). Answers to CQs and BCQs are defined via homomorphisms, which are mappings µ : ∆ ∪ ∆N ∪ X → ∆ ∪ ∆N ∪ X such that (i) c ∈ ∆ implies µ(c) = c, (ii) c ∈ ∆N implies µ(c) ∈ ∆ ∪ ∆N , and (iii) µ is naturally extended to atoms, sets of atoms, and conjunctions of atoms. The set of all answers to a CQ Q(X) = ∃YΦ(X, Y) over a database D, denoted Q(D), is the set of all tuples t over ∆ for which there exists a homomorphism µ : X ∪ Y → ∆ ∪ ∆N such that µ(Φ(X, Y)) ⊆ D and µ(X) = t. The answer to a BCQ Q() over D is Yes, denoted D |= Q, iff Q(D) 6= ∅. TGDs. Given a relational schema R, a tuple-generating dependency (TGD) σ is a first- order formula of the form ∀X∀Y Φ(X, Y) → ∃Z Ψ (X, Z), where Φ(X, Y) and Ψ (X, Z) are conjunctions of atoms over R, called the body and the head of σ, respectively. We usually omit the universal quantifiers in TGDs. Such σ is satisfied in a database D for R iff, whenever there is a homomorphism h that maps the atoms of Φ(X, Y) to atoms of D, there exists an extension h0 of h that maps the atoms of Ψ (X, Z) to atoms of D. All sets of TGDs are finite here. The notion of query answering under TGDs is defined as follows. For a set of TGDs Σ on R, and a database D for R, the set of models of D given Σ, denoted mods(Σ, D), is the set of all databases B such that B |= D ∪ Σ. The set of answers to a CQ Q on D given Σ, denoted ans(Q, Σ, D), is the set of all tuples a such that a ∈ Q(B) for all B ∈ mods(Σ, D). The answer to a BCQ Q over D given Σ is Yes, denoted D ∪ Σ |= Q, iff ans(Q, Σ, D) 6= ∅. Note that query answering under general TGDs is undecidable [3], even when the schema and TGDs are fixed [7]. Query answering under a certain class C of TGDs is said to be FO-rewritable iff, for every given CQ Q and for every given set Σ of TGDs in C, there exists a first order query QF O such that, for every instance D, it holds QF O (D) = ans(Q, Σ, D). We recall that the two problems of CQ and BCQ evaluation under TGDs are LOGSPACE-equivalent [13, 18, 17, 15]. Moreover, it is easy to see that the query output tuple (QOT) problem (as a decision version of CQ evaluation) and BCQ evaluation are AC0 -reducible to each other. Henceforth, we thus focus only on the BCQ evaluation problem. All complexity results carry over to the other problems. We also recall that query answering under TGDs is equivalent to query answering under TGDs with only singleton atoms in their heads [7]. In the sequel, w.l.o.g., every TGD has a singleton atom in its head. The TGD Chase. The chase was introduced for checking the implication of depen- dencies [20], and later also for checking query containment [18]. It repairs a database relative to a set of dependencies, so that the result of the chase satisfies the dependen- cies. By “chase”, we refer both to the chase procedure and to its output. The TGD chase works on a database through so-called TGD chase rules (for an extended chase with also equality-generating dependencies (EGDs), see [8]). The TGD chase rule comes in 4 A. Calı̀, G. Gottlob, and T. Lukasiewicz two flavors: oblivious and restricted, where the restricted one repairs TGDs only when not satisfied. We focus on the oblivious one, since it makes proofs technically simpler. The (oblivious) TGD chase rule defined below is the building block of the chase. TGD C HASE RULE . Consider a database D for a relational schema R, and a TGD σ on R of the form Φ(X, Y) → ∃Z Ψ (X, Z). Then, σ is applicable to D if there exists a homomorphism h that maps the atoms of Φ(X, Y) to atoms of D. Let σ be applicable, and h1 be a homomorphism that extends h as follows: for each Xi ∈ X, h1 (Xi ) = h(Xi ); for each Zj ∈ Z, h1 (Zj ) = zj , where zj is a “fresh” null, i.e., zj ∈ ∆N , zj does not occur in D, and zj lexicographically follows all other nulls already introduced. The application of σ adds to D the atom h1 (Ψ (X, Z)) if not already in D. The important notion of the (derivation) level of an atom in a TGD chase is defined as follows. Let D be the initial database from which the chase is constructed. Then: (1) The atoms in D have level 0. (2) Let a TGD Φ(X, Y) → ∃Z Ψ (X, Z) be applied at some point in the construction of the chase, and let h and h1 be as in the TGD chase rule. If the atom with highest level among those in h1 (Φ(X, Y)) has level k, then the added atom h1 (Ψ (X, Z)) has level k + 1. The chase algorithm for Σ and D consists of an ex- haustive application of the TGD chase rule in a breadth-first (level-saturating) fashion, which leads as result to a (possibly infinite) chase for Σ and D, denoted chase(Σ, D). The chase of level up to k > 0 for Σ and D, denoted chase k (Σ, D), is the set of all atoms in chase(Σ, D) of level at most k. The (possibly infinite) chase relative to TGDs is a universal model, i.e., for every B ∈ mods(Σ, D), there exists a homomorphism that maps chase(Σ, D) onto B [15, 7], and thus Q(chase(Σ, D)) = ans(Q, Σ, D). 3 Guarded Datalog± We now introduce guarded Datalog± as a class of special TGDs that show computa- tional data tractability, while being at the same time expressive enough to model on- tologies. BCQs relative to such TGDs can be evaluated on a finite part of the chase of constant size in the data complexity. A TGD σ is guarded iff it contains an atom in its body that contains all universally quantified variables of σ. The leftmost such atom is the guard atom (or guard) of σ. The non-guard atoms in the body of σ are the side atoms of σ. Example 3.1. The TGD r(X, Y ), s(Y, X, Z) → ∃W s(Z, X, W ) is guarded (via the guard s(Y, X, Z)), while the TGD r(X, Y ), r(Y, Z) → r(X, Z) is not guarded. Sets of guarded TGDs (with single-atom heads) are theories in the guarded fragment of first-order logic [2]. In the sequel, let R be a relational schema, let D be a database for R, and let Σ be a set of guarded TGDs on R. We first give some preliminary definitions as follows. The chase graph for Σ and D is the directed graph consisting of chase(Σ, D) as the set of nodes and having an arc from a to b iff b is obtained from a and possibly other atoms by a one-step application of a TGD σ ∈ Σ. Here, we mark a as guard iff a is the guard of σ. The guarded chase forest for Σ and D is the restriction of the chase graph for Σ and D to all atoms marked as guards and their children. The subtree of an atom a in this forest, denoted subtree(a), is the restriction of the forest to all successors of a. The type of an atom a, denoted type(a), is the set of all atoms b in chase(Σ, D) that have only constants and nulls from a as arguments. Informally, the type of a is the set of all atoms that determine the subtree of a in the forest. Tractable Query Answering over Ontologies with Datalog± 5 Example 3.2. Consider the TGDs σ1 : r1 (X, Y ), r2 (Y ) → ∃Zr1 (Z, X) and σ2 : r1 (X, Y ) → r2 (X), applied on an instance D = {r1 (a, b), r2 (b)}. The first part of the (infinite) guarded chase forest for {σ1 , σ2 } and D is shown in Fig. 3, where every arc is labeled with the applied TGD. Given a finite set S ⊆ ∆ ∪ ∆N , two sets of atoms A1 and r1 (a, b) r2 (b) A2 are S-isomorphic (or isomorphic if S = ∅) iff there exists σ1 σ2 a bijection β : A1 ∪ dom(A1 ) → A2 ∪ dom(A2 ) such that (i) −1 −1 r (z , a) r2 (a) β and β are homomorphisms, and (ii) β(c) = c = β (c) 1 1 for all c ∈ S. Two atoms a1 and a2 are S-isomorphic (or iso- σ1 σ2 morphic if S = ∅) iff {a1 } and {a2 } are S-isomorphic. The notion of S-isomorphism (or isomorphism if S = ∅) is nat- r1 (z 2 , z 1 ) r2 (z1 ) urally extended to more complex structures, such as pairs of σ 1 σ2 two subtrees (V1 , E1 ) and (V2 , E2 ) of the guarded chase for- est, and two pairs (b1 , S1 ) and (b2 , S2 ), where b1 and b2 are ... r2 (z2 ) atoms, and S1 and S2 are sets of atoms. The following lemma Fig. 1. Guarded chase shows that if two atoms have S-isomorphic types, then their forest in Example 3.2. subtrees are also S-isomorphic. Lemma 3.1. Let S be a set of constants and nulls, and a1 and a2 be atoms from chase(Σ, D) with S-isomorphic types. Then, the subtree of a1 is S-isomorphic to the subtree of a2 . The next lemma provides an upper bound for the number of all non-T -isomorphic pairs consisting of an atom and a type. w·|R| Lemma 3.2. Let w be the maximal arity of a predicate in R, δ = (2w)w · 2(2w) , and a ∈ chase(Σ, D). Let P be a set of pairs (b, S), each consisting of an atom b and a type S of atoms c with arguments from a and new nulls. If |P | > δ, then P contains at least two dom(a)-isomorphic pairs. Using the above two lemmas, we are now ready to prove that the atoms in the type of an atom a in the guarded chase forest cannot be generated at a depth of the guarded chase forest that exceeds the depth of the atom a by a value that depends only on the TGDs. Here, the guarded depth of an atom a in the guarded chase forest for Σ and D, denoted depth(a), is the length of the path from D to a in the forest. Note that this is generally different from the derivation level. Lemma 3.3. Let a be a guard in the chase graph for Σ and D, and b ∈ type(a). Then, depth(b) 6 depth(a) + k, where k depends only on Σ. w·|R| Proof. Let k = (2w)w · 2(2w) , where w is the maximal arity of a predicate in R. Suppose depth(b) > depth(a) + k for one atom b in the type of a. That is, the path P in the guarded chase forest leading to b has a length greater than k from the depth of a. Suppose first b contains no nulls. By Lemma 3.2, there are two isomorphic atoms h and h0 (in this order) on P with isomorphic types (see Fig. 3). By Lemma 3.1, the subtree of h is isomorphic to the subtree of h0 . But then b is also in the subtree of h, on a path Q that is at least one edge shorter than P , which contradicts the assumption 6 A. Calı̀, G. Gottlob, and T. Lukasiewicz that P is the path leading to b. Suppose next the set of nulls N in b is nonempty, and consider the common predecessor c of a and b of largest depth. By Lemma 3.2, there are two N -isomorphic atoms h and h0 (in this order) on P with N -isomorphic types (see Fig. 3). Since b is in the type of a, it cannot contain new nulls compared to a. level 0 Since c is the common predecessor h c of a and b of largest depth, b also can- not contain new nulls compared to c, and h0 h a thus compared to h and h0 . So, b is in the h0 a types of both h and h0 . By Lemma 3.1, b the subtree of h is N -isomorphic to the b subtree of h0 . But then b is also in the subtree of h, on a path Q that is at least (a) (b) one edge shorter than P , which contra- dicts the assumption that P is the path Fig. 2. Construction in Lemma 3.3’s proof. leading to b. In summary, all atoms in the type of a can be obtained on paths of length at most k. 2 We next show that BCQs can be evaluated using only a finite, initial portion of the guarded chase forest, whose size is determined by the TGDs only. Here, the guarded chase of level up to k > 0 for Σ and D, denoted g-chase k (Σ, D), is the set of all atoms in the forest of depth at most k. Lemma 3.4. Let Q be a BCQ over R. If there exists a homomorphism µ that maps Q into chase(Σ, D), then there exists a homomorphism λ that maps Q into g-chase k (Σ, D), where k depends only on Q and R. level 0 Proof. Let k = n · δ, where n = |Q|, δ = w·|R| a (2w)w · 2(2w) , and w is the maximal h arity of a predicate in R. Suppose there exists a homomorphism that maps Q into h0 chase(Σ, D). Let µ be a homomorphism P of this kind such that depth(µ) = q∈Q b depth(µ(q)) is minimal. We now show that µ(Q) is contained in g-chase k (Σ, D). Towards a contradiction, suppose the contrary. Consider the tree consisting of Fig. 3. Construction in Lemma 3.4’s proof. all atoms in µ(Q) and their ancestors in the guarded chase forest for Σ and D. As µ(Q) is not contained in g-chase k (Σ, D), this tree must contain a path P of length greater than δ of which all inner nodes (i.e., without start and end node) do not belong to µ(Q) and have no branches (i.e., have exactly one outgoing edge). Let a be the start node of P . By Lemma 3.2, there are two dom(a)-isomorphic atoms h and h0 on P with dom(a)-isomorphic types. By Lemma 3.1, subtree(h) is dom(a)-isomorphic to subtree(h0 ). Thus, we can remove h and the path to h0 , obtaining a path that is at least one edge shorter. Let ι be the homomorphism mapping subtree(h0 ) to subtree(h), and let µ0 = µ ◦ ι. Then, µ0 is a homomorphism that maps Q into chase(Σ, D) such that depth(µ0 ) < depth(µ), which contradicts µ being a homomorphism of this kind such that depth(µ) is minimal. This shows that µ(Q) is contained in g-chase k (Σ, D). 2 Tractable Query Answering over Ontologies with Datalog± 7 The following definition captures a general property, where also the whole deriva- tion of the query atoms are contained in the finite, initial portion of the guarded chase forest, whose size is determined by the TGDs only. Definition 3.1. We say that Σ has the bounded guard-depth property (BGDP) iff, for every database D for R and for every BCQ Q, whenever there exists a homomorphism µ that maps Q into chase(Σ, D), then there exists a homomorphism λ of this kind such that all ancestors of λ(Q) in the chase graph for Σ and D are contained in g- chase γg (Σ, D), where γg depends only on Q and R. It is not difficult to prove, based on Lemmas 3.3 and 3.4, that guarded TGDs enjoy the BGDP. Hence, BCQ answering under guarded TGDs is in PTIME in data complexity, since by the existence of the homomorphism λ in Lemma 3.4 and in Definition 3.1, we obtain Q(g-chaseγg (Σ, D)) = Q(chase(Σ, D)) = ans(Q, Σ, D). The following theorem summarizes these and other results from [8]. Theorem 3.1. Let D be a database for a relational schema R, Σ a finite set of guarded TGDs on R, and Q a BCQ Q over R. Then, deciding D ∪ Σ |= Q is PTIME-complete in data complexity. It can be done in linear time in data complexity for atomic Q. 4 Linear Datalog± We now introduce linear Datalog± as a variant of guarded Datalog± , where we prove that query answering is even FO-rewritable (and thus in AC0 ) in the data complexity. Nonetheless, linear Datalog± is still expressive enough for representing ontologies [8] in many cases. A TGD is linear iff it contains only a singleton body atom. Notice that linear Datalog± generalizes the well-known class of inclusion dependencies. We first define the bounded derivation-depth property, which is strictly stronger than the bounded guard-depth property. Definition 4.1. A set of TGDs Σ has the bounded derivation-depth property (BDDP) iff, for every database D for R and for every BCQ Q over R, whenever D ∪ Σ |= Q, then chase γd (Σ, D) |= Q, where γd depends only on Q and R. Clearly, in the case of linear TGDs, for every a ∈ chase(Σ, D), the subtree of a is determined only by a, instead of type(a) (cf. Lemma 3.1). Therefore, for a single atom, its depth coincides with the number of applications of the TGD chase rule that are necessary to generate it. By this observation, as an immediate consequence of the fact that guarded TGDs enjoy the BGDP, we obtain that linear TGDs have the bounded derivation-depth property. The next result shows that queries relative to TGDs with the bounded derivation-depth property are FO-rewritable. Theorem 4.1. Let Σ be a set of TGDs over a relational schema R, D be a database for R, and Q be a BCQ over R. If Σ enjoys the BDDP, then Q is FO-rewritable. As an immediate consequence of the fact that linear TGDs enjoy the BDDP and of Theorem 4.1, we have that BCQs are FO-rewritable in the linear case. Corollary 4.1. Let R be a relational schema, Σ be a set of linear TGDs over R, D be a database for R, and Q be a BCQ over R. Then, Q is FO-rewritable. 8 A. Calı̀, G. Gottlob, and T. Lukasiewicz 5 Extensions In this section, we extend Datalog± with negative constraints and EGDs, which are both important when representing ontologies. Negative Constraints. A negative constraint (or constraint) is a first-order formula of the form ∀XΦ(X) → ⊥, where Φ(X) is a (not necessarily guarded) conjunction of atoms. It is often also written as ∀XΦ0 (X) → ¬p(X), where Φ0 (X) is obtained from Φ(X) by removing the atom p(X). We usually omit the universal quantifiers. Example 5.1. If the unary predicates c and c0 represent two classes, we may use the constraint c(X), c0 (X) → ⊥ to assert that the two classes have no common instances. Similarly, if additionally the binary predicate r represents a relationship, we may use c(X), r(X, Y ) → ⊥ to enforce that no member of the class c participates to r. Query answering on a database D under TGDs ΣT and constraints ΣC can be done effortlessly by additionally checking that every constraint σ = Φ(X) → ⊥ ∈ ΣC is sat- isfied in D given ΣT , each of which can be done by checking that the BCQ Qσ = Φ(X) evaluates to false in D given ΣT . We write D ∪ ΣT |= ΣC iff every σ ∈ ΣC is false in D given ΣT . We thus obtain immediately the following result. Here, a BCQ Q is true in D given ΣT and ΣC , denoted D ∪ ΣT ∪ ΣC |= Q, iff (i) D ∪ ΣT |= Q or (ii) D ∪ ΣT 6|= ΣC (as usual in DLs). Theorem 5.1. Let R be a relational schema, ΣT and ΣC be sets of TGDs and con- straints on R, respectively, D be a database for R, and Q be a BCQ on R. Then, D ∪ ΣT ∪ ΣC |= Q iff (i) D ∪ ΣT |= Q or (ii) D ∪ ΣT |= Qσ for some σ ∈ ΣC . The next theorem shows that constraints do not increase the data and combined complexity of answering BCQs in the guarded and linear case. It follows from The- orem 5.1, by which the additional effort of deciding D ∪ ΣT |= ΣC can be done by answering |ΣC | BCQs Qσ without constraints, where each query Qσ has a size of kσk 6 kΣC k, and in the data complexity, |ΣC | and kσk are constants. Here, |ΣC | de- notes the cardinality of ΣC , while kΣC k denotes the input size of ΣC . This additional effort does not increase the complexity of answering BCQs without constraints. Theorem 5.2. Let R be a relational schema, ΣT and ΣC be sets of TGDs and con- straints on R, respectively, D be a database for R, and Q be a BCQ. Then, deciding D ∪ ΣT ∪ ΣC |= Q in the guarded (resp., linear) case has the same data and the same combined complexity as deciding D ∪ ΣT |= Q in the guarded (resp., linear) case. EGDs and Non-Conflicting Keys. An equality-generating dependency (or EGD) σ is a first-order formula of the form ∀X Φ(X) → Xi = Xj , where Φ(X), called the body of σ, is a conjunction of atoms, and Xi and Xj are variables from X. We call Xi = Xj the head of σ. Such σ is satisfied in a database D for R iff, whenever there exists a homomorphism h such that h(Φ(X, Y)) ⊆ D, it holds that h(Xi ) = h(Xj ). Example 5.2. The EGD r(X, Y ), r(X, Y 0 ) → Y = Y 0 expresses that the second argu- ment of the binary predicate r functionally depends on the first argument of r. Tractable Query Answering over Ontologies with Datalog± 9 The chase is naturally extended to databases under EGDs in addition to TGDs: The chase of a database D in the presence of two sets ΣT and ΣE of TGDs and EGDs, respectively, denoted chase(ΣT ∪ ΣE , D), is computed by iteratively applying (1) a single TGD once, according to the order specified above, and (2) the EGDs, as long as applicable (i.e., until a fixpoint is reached), where EGDs are applied as follows: EGD C HASE RULE . Consider a database D for a relational schema R, and an EGD σ on R of the form Φ(X) → Xi = Xj . Such an EGD σ is applicable to D iff there exists a homomorphism η : Φ(X) → D such that η(Xi ) and η(Xj ) are different and not both constants. If η(Xi ) and η(Xj ) are different constants, then there is a hard violation of σ, and the chase fails. Otherwise, the result of the application of σ to D is the database h(D) obtained from D by replacing every occurrence of a non-constant element e ∈ {η(Xi ), η(Xj )} in D by the other element e0 (if e and e0 are both nulls, then e precedes e0 in the lexicographic order). While adding negative constraints is computationally effortless, adding EGDs is more problematic. The interaction of TGDs and EGDs leads to undecidability of query answering even in simple cases, such that of functional and inclusion dependencies [14], or keys and inclusion dependencies (see, e.g., [10], where the proof of undecidability is done in the style of Vardi as in [18]). It can even be seen that a fixed set of EGDs and guarded TGDs can simulate a universal Turing machine, and thus query answering and even propositional ground atom inference is undecidable with such fixed sets of dependencies. For this reason, we consider a restricted class of EGDs, namely, non- conflicting key dependencies (or NC keys), which show a controlled interaction with TGDs (and negative constraints), such that they do not increase the complexity of an- swering BCQs. Nonetheless, this class is sufficient for ontology modeling. We now first concentrate on the semantic notion of separability for EGDs, which formulates a controlled interaction between EGDs and TGDs (and negative constraints), such that the EGDs do not increase the complexity of answering BCQs. We then provide a sufficient syntactic condition for the separability of EGDs, where we transfer a result by [10] about non-key-conflicting inclusion dependencies to the more general setting of Datalog± . In the context of description logics, general EGDs cannot be formulated, but only keys. Therefore, we mainly focus on keys here. Definition 5.1. Let R be a relational schema, and ΣT and ΣE be sets of TGDs and EGDs on R, respectively. Then, ΣE is separable from ΣT iff for every database D for R, the following conditions (i) and (ii) are both satisfied: (i) If there is a hard violation of an EGD in chase(ΣT ∪ ΣE , D), then there is also a hard violation of some EGD of ΣE , when this EGD is directly applied to D. (ii) If there is no chase failure, then for every BCQ Q, chase(ΣT ∪ ΣE , D) |= Q iff chase(ΣT , D) |= Q. The following result shows that adding separable EGDs to TGDs and constraints does not increase the data and combined complexity of answering BCQs in the guarded and linear case. It follows immediately from the fact that the separability implies that chase failure can be directly evaluated on D. For fixed (resp., variable) ΣE , this can be done by evaluating a first-order formula on D (resp., in polynomial time in the size of D and ΣE ), which clearly does not increase the data (resp., combined) complexity of answering BCQs in the three cases. 10 A. Calı̀, G. Gottlob, and T. Lukasiewicz Theorem 5.3. Let R be a relational schema, ΣT and ΣC be sets of TGDs and con- straints on R, respectively, and D be a database for R. Let ΣE be a set of EGDs that is separable from ΣT , and Q be a BCQ. Then, deciding D ∪ ΣT ∪ ΣC ∪ ΣE |= Q in the guarded (resp., linear) case has the same data complexity and also the same combined complexity as deciding D ∪ ΣT |= Q in the guarded (resp., linear) case. We next provide a sufficient syntactic condition for the separability of EGDs. We focus on key dependencies (KDs, or simply keys): a key κ on a predicate r specifies a set K of key attribute positions of r; it is satisfied in a database D if no two atoms in D of the form r(t) have the same values in all positions in K. Clearly, KDs are special types of EGDs. The EGD chase rule in case of a KD applies to pairs of tuples that violate the KD. For example, a KD κ asserting that the key attribute of a ternary predicate r is the first one is applicable to the instance D = {r(a, b, z3 ), r(a, z2 , z1 )} (where the zi ’s are nulls), and its application yields the new instance {r(a, b, z1 )}. The following def- inition generalizes the notion of “non-key-conflicting” dependency relative to a set of keys, introduced in [10], to the context of arbitrary TGDs. Definition 5.2. Let κ be a key, and σ be a TGD of the form Φ(X, Y) → ∃Z r(X, Z). Then, κ is non-conflicting (NC) with σ iff either (i) the key does not apply to the pred- icate r in the head of σ or (ii) the positions of κ in r are not a proper subset of the X-positions in r in the head of σ, and every existentially quantified variable in σ ap- pears only once in the head of σ. We say κ is non-conflicting (NC) with a set of TGDs ΣT iff κ is NC with every σ ∈ ΣT . We say a set of keys ΣK is non-conflicting (NC) with ΣT iff every κ ∈ ΣK is NC with ΣT . Example 5.3. Consider the four keys κ1 , κ2 , κ3 , and κ4 defined by the key attribute sets K1 = {r[1], r[2]}, K2 = {r[1], r[3]}, K3 = {r[3]}, and K4 = {r[1]}, respectively, and the TGD σ = p(X, Y ) → ∃Z r(X, Y, Z). Then, the head predicate of σ is r, and the set of positions in r with universally quantified variables is H = {r[1], r[2]}. Observe that all keys but κ4 are NC with σ, since only K4 ⊂ H. Roughly, every atom added in a chase by applying σ would have a fresh null in some position in K1 , K2 , and K3 , thus never firing κ1 , κ2 , and κ3 , respectively. The following theorem shows that the NC property between keys and TGDs implies their separability. This generalizes a useful result of [10] on inclusion dependencies to the much larger class of all TGDs. The main idea behind the proof is roughly as follows. The NC condition between a key κ and a TGD σ assures that either (a) the application of σ in the chase generates an atom with a fresh null in a position of κ, and so the fact does not violate κ (see also Example 5.3), or (b) the X-positions in the predicate r in the head of σ coincide with the key positions of κ in r, and thus any newly generated atom must have fresh distinct nulls in all but the key position, and may eventually be eliminated without violation. It then follows that the full chase does not fail. Since the new nulls are all distinct, it also contains a homomorphic image of the TGD chase. Therefore, the full chase is in fact homomorphically equivalent to the TGD chase. Theorem 5.4. Let R be a relational schema, ΣT and ΣK be sets of TGDs and keys, respectively, such that ΣK is NC with ΣT . Then, ΣK is separable from ΣT . Tractable Query Answering over Ontologies with Datalog± 11 We conclude this section by stating that in the NC case, keys do not increase the data and the combined complexity of answering BCQs under guarded (resp., linear) TGDs and constraints. This result is immediate by Theorems 5.4 and 5.3. Corollary 5.1. Let R be a relational schema, ΣT and ΣC be sets of TGDs and con- straints on R, respectively, and D be a database for R. Let ΣE be a set of EGDs that is NC with ΣT , and Q be a BCQ. Then, deciding D ∪ ΣT ∪ ΣC ∪ ΣE |= Q in the guarded (resp., linear) case has the same data complexity and also the same combined complexity as deciding D ∪ ΣT |= Q in the guarded (resp., linear) case. 6 Ontology Querying In [8], we have provided a translation of the DLs DL-LiteF , DL-LiteR , and DL-LiteA of the DL-Lite family [12] into linear Datalog± with negative constraints and non- conflicting keys, called Datalog±0 , and shown that the former are strictly less expres- sive than the latter. The other DLs of the DL-Lite family (including DL-LiteF ,u and DL-LiteR,u ) can be similarly translated to Datalog± 0 . We now illustrate the translation. Example 6.1. Consider the following sets of atomic concepts, abstract roles, and indi- viduals, which represent (i) the classes of scientists, articles, conference papers, and journal papers, (ii) the binary relations “has author”, “has first author”, and “is author of”, and (iii) two individuals, respectively: A = {Scientist, Article, ConPaper, JouPaper}, RA = {hasAuthor, hasFirstAuthor, isAuthorOf}, I = {i1 , i2 }. The following are concept inclusion axioms, which informally express that (i) con- ference and journal papers are articles, (ii) conference papers are not journal papers, (iii) every scientist has a publication, (iv) isAuthorOf relates scientists and articles: ConPaper v Article, JouPaper v Article, ConPaper v ¬JouPaper, Scientist v ∃isAuthorOf, ∃isAuthorOf v Scientist, ∃isAuthorOf − v Article. The following role inclusion and functionality axioms express that (v) isAuthorOf is the inverse of hasAuthor, and (vi) hasFirstAuthor is a functional binary relationship: isAuthorOf − v hasAuthor, hasAuthor− v isAuthorOf, (funct hasFirstAuthor). Finally, the concept and role membership axioms Scientist(i1 ), isAuthorOf(i1 , i2 ), and Article(i2 ) express that the individual i1 is a scientist who authors the article i2 . Then, the concept inclusion axioms are translated to the following TGDs and con- straints (where we identify atomic concepts and roles with their predicates): ConPaper(X) → Article(X), JouPaper(X) → Article(X), ConPaper(X) → ¬JouPaper(X), Scientist(X) → ∃Z isAuthorOf(X, Z), isAuthorOf(X, Y ) → Scientist(X), isAuthorOf(Y, X) → Article(X). The role inclusion and functionality axioms are translated to these TGDs and EGDs: isAuthorOf(Y, X) → hasAuthor(X, Y ), hasAuthor(Y, X) → isAuthorOf(X, Y ), hasFirstAuthor(X, Y ), hasFirstAuthor(X, Y 0 ) → Y = Y 0 . Finally, the three concept and role membership axioms are translated to the three data- base atoms Scientist(i1 ), isAuthorOf(i1 , i2 ), and Article(i2 ), respectively (where we also identify individuals with their constants). 12 A. Calı̀, G. Gottlob, and T. Lukasiewicz Acknowledgments. The work of A. Calı̀ and G. Gottlob was supported by the EPSRC grant Number EP/E010865/1 “Schema Mappings and Automated Services for Data In- tegration”. G. Gottlob, whose work was partially carried out at the Oxford-Man Institute of Quantitative Finance, gratefully acknowledges support from the Royal Society as the holder of a Royal Society-Wolfson Research Merit Award. T. Lukasiewicz’s work was supported by the German Research Foundation under the Heisenberg Programme. References 1. S. Abiteboul and V. Vianu. Datalog extensions for database queries and updates. J. Comput. Syst. Sci., 43(1):62–124, 1991. 2. H. Andréka, I. Németi, and J. van Benthem. Modal languages and bounded fragments of predicate logic. J. Philos. Logic, 27(3):217–274, 1998. 3. C. Beeri and M. Y. Vardi. The implication problem for data dependencies. In Proc. ICALP- 1981, LNCS 115, pp. 73–85. Springer, 1981. 4. T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Sci. Am., 284:34–43, 2001. 5. L. Cabibbo. The expressive power of stratified logic programs with value invention. Inf. Comput., 147(1):22–56, 1998. 6. M. Cadoli, L. Palopoli, and M. Lenzerini. Datalog and description logics: Expressive power. In Proc. DBPL-1997, LNCS 1369, pp. 281–298. Springer, 1998. 7. A. Calı̀, G. Gottlob, and M. Kifer. Taming the infinite chase: Query answering under ex- pressive relational constraints. In Proc. KR-2008, pp. 70–80. AAAI Press, 2008. Revised version: http://benner.dbai.tuwien.ac.at/staff/gottlob/CGK.pdf. 8. A. Calı̀, G. Gottlob, and T. Lukasiewicz. A general Datalog-based framework for tractable query answering over ontologies. In Proc. PODS-2009. ACM Press, 2009. 9. A. Calı̀ and M. Kifer. Containment of conjunctive object meta-queries. In Proc. VLDB-2006. 10. A. Calı̀, D. Lembo, R. Rosati. On the decidability and complexity of query answering over in- consistent and incomplete databases. In Proc. PODS-2003, pp. 260–271. ACM Press, 2003. 11. F. Calimeri, S. Cozza, and G. Ianni. External sources of knowledge and value invention in logic programming. Ann. Math. Artif. Intell., 50(3/4):333–361, 2007. 12. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. Autom. Reason- ing, 39(3):385–429, 2007. 13. A. Chandra and P. Merlin. Optimal implementation of conjunctive queries in relational data bases. In Proc. STOC-1977, pp. 77–90. ACM Press, 1977. 14. A. K. Chandra and M. Y. Vardi. The implication problem for functional and inclusion de- pendencies is undecidable. SIAM J. Comput., 14(3):671–677, 1985. 15. A. Deutsch, A. Nash, J. B. Remmel. The chase revisited. In Proc. PODS-2008, pp. 149–158. 16. F. M. Donini, M. Lenzerini, D. Nardi, and A. Schaerf. AL-log: Integrating Datalog and description logics. J. Intell. Inf. Syst., 10(3):227–252, 1998. 17. R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: Semantics and query answering. Theor. Comput. Sci., 336(1):89– 124, 2005. 18. D. Johnson and A. Klug. Testing containment of conjunctive queries under functional and inclusion dependencies. J. Comput. Syst. Sci., 28(1):167–189, 1984. 19. M. Lenzerini. Data integration: A theoretical perspective. In Proc. PODS-2002, pp. 233–246. 20. D. Maier, A. O. Mendelzon, and Y. Sagiv. Testing implications of data dependencies. ACM Trans. Database Syst., 4(4):455–469, 1979. 21. P. F. Patel-Schneider and I. Horrocks. Position paper: A comparison of two modelling paradigms in the Semantic Web. In Proc. WWW-2006, pp. 3–12. ACM Press, 2006. 22. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking data to ontologies. J. Data Semantics, 10:133–173, 2008. 23. R. Rosati. On combining description logic ontologies and nonrecursive Datalog rules. In Proc. RR-2008, LNCS 5341, pp. 13–27. Springer, 2008.